Qualification: M.Tech
sandhyaharikumar@am.amrita.edu

Sandhya Harikumar currently serves as Assistant Professor (Senior Grade) in the Department of Computer Science and Engineering at Amrita School of Engineering, Amritapuri. She received her M.Tech. in Computer Science from IIT Delhi. She has 14 years of academic experience and 3 years of industrial experience.

Awards/Achievements

  • Best SPoC for Infosys Campus Connect

Publications

Publication Type: Conference Paper


C. Baladevi and Sandhya Harikumar, “Semantic Representation of Documents Based on Matrix Decomposition”, in 2018 International Conference on Data Science and Engineering, ICDSE 2018, 2018.[Abstract]


This paper addresses the important problem of semantic representation of documents for information retrieval in a data integration system. Search queries on documents often seek relevant information, yet conventional feature-extraction methods do not capture relevance; they focus on term matching for query processing. The challenge of semantic representation lies in identifying important features, and most techniques for doing so transform the original data into a different space, yielding a sparse matrix that is computationally expensive to process. We therefore propose an alternative approach based on CUR matrix decomposition, which identifies important documents and important terms in order to improve query processing. Experimental results on five data sets demonstrate the efficacy of this approach. © 2018 IEEE.

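To give a flavour of the approach (a minimal sketch, not the authors' implementation), a CUR-style decomposition can keep actual columns (terms) and rows (documents) chosen by norm-based importance scores and solve for a small linking matrix U; the sampling rule and sizes below are illustrative assumptions.

```python
import numpy as np

def cur_decompose(A, k_cols, k_rows):
    """Sketch of a CUR decomposition: keep real columns/rows of the
    term-document matrix A instead of abstract SVD directions."""
    # Importance score of each column/row: its share of the squared norm.
    col_scores = (A ** 2).sum(axis=0) / (A ** 2).sum()
    row_scores = (A ** 2).sum(axis=1) / (A ** 2).sum()
    # Pick the highest-scoring columns (terms) and rows (documents).
    cols = np.argsort(col_scores)[-k_cols:]
    rows = np.argsort(row_scores)[-k_rows:]
    C, R = A[:, cols], A[rows, :]
    # U links C and R so that C @ U @ R approximates A.
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R, cols, rows

# Toy term-document matrix: 6 documents x 5 terms.
A = np.random.rand(6, 5)
C, U, R, terms, docs = cur_decompose(A, k_cols=3, k_rows=4)
print("approximation error:", np.linalg.norm(A - C @ U @ R))
```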

Sandhya Harikumar and Thaha, S. S., “MapReduce model for k-medoid clustering”, in Proceedings of the 2016 International Conference on Data Science and Engineering, ICDSE 2016, 2017.[Abstract]


Distributed and parallel computing are the best alternatives for scalable clustering of huge amounts of data with moderate to high dimensionality, together with improved speedup. In this paper we address the problem of k-medoid clustering using the MapReduce framework for distributed computing on commodity machines and evaluate its efficacy. Two main issues must be tackled: how to distribute the data for efficient clustering, and how to minimize the I/O and network cost among the machines. The main contributions of this paper are: (a) a MapReduce methodology for distributed k-medoid clustering; (b) a reduction in overall execution time and in the overhead of data movement from one site to another, leading to sublinear scaleup and speedup. The approach proves efficient, as the local clustering at each site can be carried out independently. Experimental analysis on millions of records using just 10 cores in parallel shows that clustering a dataset of size 1M × 17 requires only 4 minutes. Such low transmission cost and low bandwidth requirements lead to improved speedup and scaleup on distributed data. © 2016 IEEE.

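A rough illustration of the map/reduce split described above, simulated in plain Python over in-memory partitions (the framework plumbing and distance choices are assumptions, not the paper's code):

```python
import numpy as np

def map_phase(partition, medoids):
    """Map: assign each point in a local partition to its nearest medoid."""
    pairs = []
    for x in partition:
        cluster = int(np.argmin([np.abs(x - m).sum() for m in medoids]))
        pairs.append((cluster, x))
    return pairs

def reduce_phase(pairs, old_medoids):
    """Reduce: per cluster, pick the member minimizing total L1 distance
    to the other members (the new medoid)."""
    medoids = []
    for c, old in enumerate(old_medoids):
        members = np.array([x for cl, x in pairs if cl == c])
        if len(members) == 0:
            medoids.append(old)  # keep the old medoid if a cluster empties
            continue
        costs = np.abs(members[:, None, :] - members[None, :, :]).sum(axis=(1, 2))
        medoids.append(members[int(np.argmin(costs))])
    return medoids

rng = np.random.default_rng(0)
data = rng.random((1000, 17))            # toy stand-in for the 1M x 17 set
partitions = np.array_split(data, 10)    # 10 "machines"
medoids = list(data[rng.choice(len(data), 3, replace=False)])
for _ in range(5):                       # a few MapReduce rounds
    pairs = [p for part in partitions for p in map_phase(part, medoids)]
    medoids = reduce_phase(pairs, medoids)
```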

Sandhya Harikumar and Dilipkumar, D. U., “Apriori algorithm for association rule mining in high dimensional data”, in Proceedings of the 2016 International Conference on Data Science and Engineering, ICDSE 2016, 2017.[Abstract]


Apriori is one of the best-known algorithms for learning association rules. With the explosion of data, storage and retrieval mechanisms across database paradigms have revolutionized the technologies and methodologies used in their architecture. As a result, databases are used not only for mere information retrieval but also to infer analytical aspects of the data. It is therefore essential to find association rules in high dimensional data, because correlations amongst the attributes can give deeper insight into the data and support decision making, recommendations, and reorganization of the data for effective retrieval. The traditional Apriori algorithm is computationally expensive and infeasible on high dimensional datasets. Hence we propose a variant of the Apriori algorithm that uses QR decomposition to reduce the dimensionality, thereby reducing the complexity of the traditional algorithm. © 2016 IEEE.

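A hedged sketch of the pruning idea: rank attributes by the magnitude of the diagonal of R from a QR factorization, then mine only the retained columns. The scoring rule, the number of attributes kept, and the tiny Apriori below are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def select_attributes(X, k):
    """Rank attributes via QR: larger |diag(R)| suggests more
    linearly independent (informative) columns."""
    _, R = np.linalg.qr(X)
    scores = np.abs(np.diag(R))
    return np.argsort(scores)[-k:]

def apriori_1_2(rows, min_support):
    """Tiny Apriori: frequent 1- and 2-itemsets over boolean rows."""
    n, d = rows.shape
    freq1 = [i for i in range(d) if rows[:, i].mean() >= min_support]
    freq2 = [(i, j) for i, j in combinations(freq1, 2)
             if (rows[:, i] & rows[:, j]).mean() >= min_support]
    return freq1, freq2

rng = np.random.default_rng(1)
X = (rng.random((200, 30)) > 0.6).astype(int)   # toy transactions
keep = select_attributes(X.astype(float), k=8)  # reduce 30 -> 8 attributes
print(apriori_1_2(X[:, keep], min_support=0.1))
```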

M. Ajeissh and Sandhya Harikumar, “An adaptive distributed approach of a self organizing map model for document clustering using ring topology”, in 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2016.[Abstract]


Document clustering aims at grouping documents that are internally coherent while differing substantially across groups. Owing to the huge volume of available documents, clustering faces scalability and accuracy issues, and there is a dearth of tools that cluster such voluminous data efficiently. Conventional models focus on either a fully centralized or a fully distributed approach to document clustering. This paper therefore proposes a novel approach that modifies the conventional Self Organizing Map (SOM). The contribution of this work is threefold: a distributed approach to pre-process the documents; an adaptive bottom-up approach to document clustering; and a neighbourhood model suited to a ring topology. Experimentation on real datasets and comparison with the traditional SOM show the efficacy of the proposed approach.

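A minimal sketch of a SOM whose neurons lie on a ring, so neighbourhood distance wraps around modulo the number of neurons; the learning-rate and radius schedules below are assumptions, not the paper's.

```python
import numpy as np

def train_ring_som(data, n_neurons=12, epochs=20, lr0=0.5, radius0=3.0):
    """SOM with a 1-D ring topology: neuron i's neighbours are
    i±1 modulo n_neurons, so the lattice has no edges."""
    rng = np.random.default_rng(0)
    W = rng.random((n_neurons, data.shape[1]))
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)
        radius = max(1.0, radius0 * (1 - t / epochs))
        for x in rng.permutation(data):
            bmu = int(np.argmin(np.linalg.norm(W - x, axis=1)))
            for i in range(n_neurons):
                # Ring distance: the shortest way around the circle.
                d = min(abs(i - bmu), n_neurons - abs(i - bmu))
                h = np.exp(-(d ** 2) / (2 * radius ** 2))
                W[i] += lr * h * (x - W[i])
    return W

docs = np.random.rand(100, 50)   # toy tf-idf-like document vectors
weights = train_ring_som(docs)
```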

J. Isaac and Sandhya Harikumar, “Logistic regression within DBMS”, in Proceedings of the 2016 2nd International Conference on Contemporary Computing and Informatics, IC3I 2016, 2016, pp. 661-666.[Abstract]


The context of this paper is an analytical query model for data categorization within the DBMS. With the DBMS being an asset for most organizations, classification can provide better insight into, and control over, the data. Conventionally, classification algorithms such as logistic regression and KNN are applied after exporting the data out of the DBMS, using non-DBMS tools like R, matrix packages, generic data mining programs, or large-scale systems like Hadoop and Spark. However, this incurs I/O overhead, since the data within the DBMS is updated quite frequently and usually cannot be accommodated in main memory. This paper proposes an alternative strategy, based on SQL and user defined functions (UDFs), that integrates logistic regression for data categorization and prediction query processing within the DBMS. A comparison of SQL with UDFs as well as with statistical packages like R is presented through experimentation on real datasets. The empirical results show the viability and validity of this approach for predicting the class of a given query. © 2016 IEEE.

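The paper targets SQL and UDFs inside a commercial DBMS; as a self-contained analogue (not the authors' code), SQLite allows a Python function to be registered as a scalar UDF so that scoring happens in a single SQL statement. The table, coefficients, and function name are illustrative.

```python
import math
import sqlite3

# Toy table of feature vectors inside the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x1 REAL, x2 REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?)",
                 [(0.5, 1.2), (2.0, -0.3), (1.1, 0.7)])

# Coefficients would come from training; fixed here for illustration.
W = (0.8, -1.5, 0.2)  # (bias, w1, w2)

def predict(x1, x2):
    """UDF: logistic sigmoid of the linear score."""
    z = W[0] + W[1] * x1 + W[2] * x2
    return 1.0 / (1.0 + math.exp(-z))

# Registering the UDF lets prediction run inside the SQL engine,
# avoiding an export of the table to an external tool.
conn.create_function("predict", 2, predict)
for row in conn.execute("SELECT x1, x2, predict(x1, x2) FROM obs"):
    print(row)
```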

Sandhya Harikumar and PV, S., “K-Medoid Clustering for Heterogeneous DataSets”, in 4th International Conference on Eco-friendly Computing and Communication Systems (ICECCS), Procedia Computer Science, 2015.[Abstract]


Recent years have seen various clustering strategies for partitioning datasets comprising heterogeneous attribute domains or types such as categorical, numerical, and binary. Clustering algorithms seek to identify homogeneous groups of objects based on the values of their attributes; they either assume the attributes are of a homogeneous type or convert them into one. However, datasets with heterogeneous data types are common in real-life applications, and such conversion can lead to loss of information. This paper proposes a new similarity measure, in the form of a triplet, for the distance between two data objects with heterogeneous attribute types, and a new k-medoid-style clustering algorithm that leverages this vector-valued measure. The proposed algorithm is compared with traditional clustering algorithms using the Purity Index and the Davies-Bouldin index for cluster validation. Results show that the new clustering algorithm with the new similarity measure outperforms k-means clustering on mixed datasets.

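The abstract does not spell out the triplet measure; one plausible reading, assuming one component each for numerical, categorical, and binary attributes combined into a single distance, is sketched below (the normalization and equal weighting are assumptions):

```python
def triplet_distance(a, b, num_idx, cat_idx, bin_idx, ranges):
    """Distance as a triplet (numeric, categorical, binary):
    range-normalized difference, mismatch count, mismatch count."""
    d_num = sum(abs(a[i] - b[i]) / ranges[i] for i in num_idx)
    d_cat = sum(a[i] != b[i] for i in cat_idx)
    d_bin = sum(a[i] != b[i] for i in bin_idx)
    return d_num + d_cat + d_bin   # components could also be weighted

# Toy records: (age, colour, smoker)
rows = [(25, "red", 1), (40, "blue", 0), (28, "red", 1)]
ranges = {0: 40 - 25}              # observed range of each numeric attribute
print(triplet_distance(rows[0], rows[1], [0], [1], [2], ranges))
```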

Sandhya Harikumar and Raji Ramachandran, “Hybridized fragmentation of very large databases using clustering”, in 2015 IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems, SPICES 2015, 2015.[Abstract]


Due to the ever-growing need to manage huge volumes of data, together with the desire for consistent, scalable, reliable and efficient retrieval of information, an intelligent mechanism for designing the storage structure of distributed databases has become inevitable. The two critical facets of distributed databases are data fragmentation and allocation. Existing fragmentation techniques are based on the frequency and type of queries as well as statistics of the empirical data; very limited work fragments the data based on patterns among the tuples and the attributes responsible for those patterns. This paper presents a unique approach to hybridized fragmentation that applies a subspace clustering algorithm to produce fragments partitioning the data with respect to tuples as well as attributes. Projected clustering determines clusters in subspaces of high dimensional data, which helps find closely correlated attributes for different sets of instances, thereby yielding good hybridized fragments for distributed databases. Experimental results show that fragmenting the database based on clustering reduces database access time compared with fragments chosen at design time using static statistics. © 2015 IEEE.

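A toy sketch of the hybridized idea: group tuples, then keep per group only the attributes that are most coherent within it, so each fragment is a (tuple subset, attribute subset) pair. The k-means-style grouping and variance criterion below stand in for the paper's projected clustering and are assumptions.

```python
import numpy as np

def hybrid_fragments(X, k, keep_ratio=0.5, iters=10):
    """Horizontal split via a k-means-style grouping, then a vertical
    split keeping the attributes most coherent inside each cluster."""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(k)])
    fragments = []
    n_keep = max(1, int(keep_ratio * X.shape[1]))
    for c in range(k):
        rows = np.where(labels == c)[0]
        if len(rows) == 0:
            continue
        attrs = np.argsort(X[rows].var(0))[:n_keep]  # lowest within-cluster spread
        fragments.append((rows, attrs))
    return fragments

X = np.random.rand(300, 12)
for rows, attrs in hybrid_fragments(X, k=3):
    print(len(rows), "tuples over attributes", sorted(attrs))
```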

Sandhya Harikumar, Reethima, R., and Kaimal, M. R., “Semantic integration of heterogeneous relational schemas using multiple L1 linear regression and SVD”, in International Conference on Data Science and Engineering, ICDSE 2014, 2014, pp. 105-111.[Abstract]


Semantic integration of heterogeneous databases is a critical area of interest owing to the scalability of data and the need to share existing data as technology advances. Schema-level heterogeneity of the relations is the major obstacle to such integration. Although various approaches to schema analysis, transformation, and integration have been explored, they can become too general to solve the problem, especially when the data is very high-dimensional and the schema information is unavailable or inadequate. In this paper, a method to integrate heterogeneous relational schemas at the instance level, rather than the schema level, is proposed. A global schema is designed that integrates the most relevant attributes of the different relational schemas of a particular domain. To find the significant attributes, multiple linear regression based on the L1 norm and Singular Value Decomposition (SVD) is applied to the data iteratively; this is a variant of L1-PCA, an efficient, effective and meaningful method of linear subspace estimation. The most prominent instance-level similarity is found by identifying the most significant attributes of each relational data source and then computing the similarity among those attributes using the L1 norm. An integrated schema is thus created that maps the relevant attributes of each local schema to a global schema. © 2014 IEEE.

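For illustration only, the attribute-significance step can be approximated by scoring attributes through their loadings on top singular vectors (plain SVD here, where the paper iterates an L1 variant) and then matching attributes across sources by L1 distance:

```python
import numpy as np

def significant_attributes(X, n_keep, n_vectors=2):
    """Score attributes by their loadings on the top right singular
    vectors (plain SVD here; the paper iterates an L1 variant)."""
    _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
    scores = np.abs(Vt[:n_vectors]).sum(axis=0)
    return np.argsort(scores)[-n_keep:]

def match_attributes(A, B, a_idx, b_idx):
    """Pair significant attributes across sources by smallest L1
    distance between standardized columns."""
    def std(v):
        return (v - v.mean()) / (v.std() + 1e-12)
    pairs = []
    for i in a_idx:
        d = [np.abs(std(A[:, i]) - std(B[:, j])).mean() for j in b_idx]
        pairs.append((int(i), int(b_idx[int(np.argmin(d))])))
    return pairs

A = np.random.rand(100, 6)   # instances from source 1
B = np.random.rand(100, 8)   # instances from source 2 (same #rows for demo)
print(match_attributes(A, B, significant_attributes(A, 3),
                       significant_attributes(B, 3)))
```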

Sandhya Harikumar, Shyju, M., and Kaimal, M. R., “SQL-MapReduce hybrid approach towards distributed projected clustering”, in International Conference on Data Science and Engineering, ICDSE 2014, 2014, pp. 18-23.[Abstract]


Clustering high dimensional data is a major challenge in data mining due to the inherent complexity and sparsity of the data. Projected clustering is a clustering approach that determines clusters in subspaces of such high dimensional data. However, projected clustering within a DBMS is computationally expensive in time and space when the volume of records reaches terabytes, petabytes, and beyond. This expense becomes a hurdle especially when clustering of transactional data is used as a preprocessing step for other tasks such as frequent decision making, efficient indexing, and compression. Hence, parallelizing and distributing expensive data clustering tasks becomes attractive for speeding up computation and exploiting the increased memory available in a computing cluster. To achieve this, we propose a SQL-MapReduce hybrid approach for scalable projected clustering. © 2014 IEEE.


Sandhya Harikumar and Vinay, A., “NSB-TREE for an efficient multidimensional indexing in non-spatial databases”, in IEEE Recent Advances in Intelligent Computational Systems (RAICS), 2013.[Abstract]


Query processing of high dimensional data with huge volumes of records, especially in the non-spatial domain, requires an efficient multidimensional index. Present DBMSs offer single-dimension indexing at multiple levels, or indexing based on compound keys formed by concatenating the key values of the required attributes; the underlying structures, data models and query languages are insufficient for retrieving information from data that is complex in both dimensionality and size. This paper designs an efficient indexing structure for multidimensional data access in the non-spatial domain. The new structure evolves from the R-tree, with certain preprocessing steps applied to non-spatial data. The proposed indexing model, the NSB-Tree (Non-Spatial Block tree), is balanced, performs better than traditional B-trees, and has less complicated algorithms than the UB-tree, with linear space complexity and logarithmic time complexity. The main aim of the NSB-Tree is multidimensional indexing that eliminates the need for multiple secondary indexes and concatenated keys; non-spatial data cannot be indexed with an R-tree in available DBMSs, and our structure replaces an arbitrary number of secondary indexes with a single multicolumn index. The design is implemented and its feasibility verified using the PostgreSQL database.
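The NSB-Tree itself is not reproduced here; a crude flavour of block-based multidimensional indexing over non-spatial tuples, storing a per-attribute min/max bounding box per block and pruning blocks during a multi-attribute range query, is sketched below (the block size and sort order are assumptions):

```python
import numpy as np

def build_blocks(X, block_size=64):
    """Sort tuples and store per-block min/max per attribute,
    i.e. a bounding box, much as R-trees do for spatial data."""
    order = np.lexsort(X.T[::-1])          # sort by first column, then ties
    X = X[order]
    blocks = []
    for s in range(0, len(X), block_size):
        blk = X[s:s + block_size]
        blocks.append((blk.min(0), blk.max(0), blk))
    return blocks

def range_query(blocks, lo, hi):
    """Skip any block whose bounding box misses the query box."""
    out = []
    for bmin, bmax, blk in blocks:
        if np.any(bmax < lo) or np.any(bmin > hi):
            continue                        # pruned without scanning tuples
        mask = np.all((blk >= lo) & (blk <= hi), axis=1)
        out.append(blk[mask])
    return np.vstack(out) if out else np.empty((0, len(lo)))

X = np.random.rand(10_000, 4)               # non-spatial tuples as points
blocks = build_blocks(X)
hits = range_query(blocks, np.array([0.1] * 4), np.array([0.3] * 4))
print(len(hits), "matching tuples")
```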


Sandhya Harikumar, Haripriya, H., and Kaimal, M. R., “Implementation of projected clustering based on SQL queries and UDFs in relational databases”, in 2013 IEEE Recent Advances in Intelligent Computational Systems (RAICS), Trivandrum, 2013.[Abstract]


Projected clustering is a clustering approach that determines clusters in subspaces of high dimensional data. Although it is possible to efficiently cluster a very large data set outside a relational database, the time and effort to export and import it can be significant, and commercial RDBMSs provide no SQL query for any type of subspace clustering, which is better suited to large databases with high dimensionality and large numbers of records. Integrating clustering with a relational DBMS using SQL is an important and challenging problem in today's world of Big Data. Projected clustering can find closely correlated dimensions and the clusters in the corresponding subspaces. We have designed an SQL version of projected clustering that returns the clusters of the records in the database using a single SQL statement, which in turn calls other SQL functions we define. We used the PostgreSQL DBMS to validate our implementation and experimented with synthetic as well as real data.

Publication Type: Journal Article


A. Mukundan and Sandhya Harikumar, “A MapReduce model for distributed self organizing map using Apache Spark”, Journal of Advanced Research in Dynamical and Control Systems, vol. 10, pp. 1229-1238, 2018.[Abstract]


Training neural network models for clustering is computationally expensive in a distributed environment. This paper presents a Self Organizing Map (SOM) model suitable for clustering based on data parallelism. Though the proposed approach is a generic MapReduce prototype for any neural network model, this paper extends the prototype to the SOM model. Our technique differs from existing techniques in the training and agglomeration of results. MapReduce on Apache Spark is more efficient than on Apache Hadoop, owing to Resilient Distributed Datasets (RDDs). Experimentation on real data sets on the Spark platform shows the feasibility of the proposed approach, and comparison with distributed k-means on metrics such as purity and scalability shows the efficacy of the proposed solution. © 2018, Institute of Advanced Scientific Research, Inc. All rights reserved.

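A hedged sketch of the batch-SOM data parallelism described above: each partition emits partial (weighted sum, weight) statistics per neuron, and a reduce step merges them. Python's functools.reduce stands in for Spark's RDD operations; the neighbourhood function and sizes are assumptions.

```python
from functools import reduce
import numpy as np

def partial_stats(partition, W, radius=1.5):
    """Map: per neuron, accumulate neighbourhood-weighted sums of the
    partition's points (numerator) and of the weights (denominator)."""
    num = np.zeros_like(W)
    den = np.zeros(len(W))
    for x in partition:
        bmu = int(np.argmin(np.linalg.norm(W - x, axis=1)))
        d = np.abs(np.arange(len(W)) - bmu)
        h = np.exp(-(d ** 2) / (2 * radius ** 2))
        num += h[:, None] * x
        den += h
    return num, den

def merge(a, b):
    """Reduce: sum the partial statistics from two partitions."""
    return a[0] + b[0], a[1] + b[1]

rng = np.random.default_rng(0)
data = rng.random((2000, 10))
partitions = np.array_split(data, 8)        # stand-ins for RDD partitions
W = rng.random((16, 10))
for _ in range(10):                         # batch-SOM iterations
    num, den = reduce(merge, (partial_stats(p, W) for p in partitions))
    W = num / den[:, None]                  # new weights = weighted mean
```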

Sandhya Harikumar and Akhil, A. S., “Semi supervised approach towards subspace clustering”, Journal of Intelligent & Fuzzy Systems, vol. 34, pp. 1619–1629, 2018.[Abstract]


High-dimensional data analysis is inevitable given emerging technologies in domains such as finance, healthcare, genomics, and signal processing. Though the data sets generated in these domains are high-dimensional, the intrinsic dimensions that provide meaningful information are often much smaller. Conventionally, unsupervised methods known as subspace clustering find clusters in different subspaces of high dimensional data by identifying relevant features, irrespective of any labels associated with the instances. Available label information, if incorporated into the clustering algorithm, can bias it towards solutions more consistent with our knowledge, improving cluster quality. Therefore, an Information Gain based Semi-supervised Subspace Clustering (IGSC) is proposed that identifies a subset of important attributes based on the known label of each data instance. The label information is integrated with the search strategy for subspaces and leveraged in a model-based clustering approach. Experimentation on 13 real-world labeled data sets proves the feasibility of IGSC, and the resulting clusters are validated using an adapted Davies-Bouldin Index (DBI) for semi-supervised clusters.

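For flavour, a minimal information-gain attribute selector: discretize each attribute into bins and keep the attributes that most reduce label entropy before clustering. The binning scheme and the number of attributes kept are illustrative assumptions, not IGSC's exact procedure.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(col, labels, bins=5):
    """Gain = H(labels) - sum over bins of P(bin) * H(labels | bin)."""
    binned = np.digitize(col, np.histogram_bin_edges(col, bins)[1:-1])
    cond = 0.0
    for b in np.unique(binned):
        mask = binned == b
        cond += mask.mean() * entropy(labels[mask])
    return entropy(labels) - cond

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)
X = rng.random((300, 10))
X[:, 3] += y                                 # make attribute 3 informative
gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
subspace = np.argsort(gains)[-3:]            # cluster only in this subspace
print("selected attributes:", subspace)
```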

Sandhya Harikumar, Dilipkumar, D. U., and Kaimal, M. R., “Efficient attribute selection strategies for association rule mining in high dimensional data”, International Journal of Computational Science and Engineering, vol. 15, pp. 201–213, 2017.[Abstract]


This paper presents a new computational approach to discovering interesting relations between variables, called association rules, in large and high dimensional datasets. State-of-the-art techniques are computationally expensive owing to high dimensionality, the generation of huge numbers of candidate sets, and multiple database scans; moreover, most of the enormous number of discovered patterns are obvious, redundant, or uninteresting to the user. The aim of this paper is therefore to improve the Apriori algorithm so that it finds association rules pertaining only to important attributes of high dimensional data. We employ an information-theoretic method together with QR decomposition to represent the data in its proper substructure form without losing its semantics, by identifying significant attributes. Experiments on real datasets and comparison with the existing technique reveal that the proposed strategy is consistently faster than, and statistically comparable with, the Apriori algorithm in terms of the rules generated and time complexity.


Sandhya Harikumar and Reethima, R., “A method to induce indicative functional dependencies for relational data model”, Advances in Intelligent Systems and Computing, vol. 320, pp. 445-456, 2015.[Abstract]


The relational model is one of the most extensively used database models. With contemporary technologies, however, high dimensional data, structured or unstructured, must be analyzed for knowledge interpretation. One significant aspect of such analysis is exploring the relationships among the attributes of high dimensional data. In the relational model, the integrity constraints corresponding to these relationships are captured by functional dependencies, but processing high dimensional data to discover all functional dependencies is computationally expensive. Functional dependencies of the most prominent attributes are of particular use and can reduce the search space of dependencies to be examined. In this paper we propose a regression model to find the most prominent attributes of a given relation; the functional dependencies of these prominent attributes are then discovered, which are indicative and are obtained in a reduced amount of time.

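As a toy illustration of verifying a candidate functional dependency X → Y once prominent attributes are chosen (the regression-based ranking itself is omitted), one can check that every X value maps to a single Y value; the relation and column roles below are invented for the example.

```python
from collections import defaultdict

def holds_fd(rows, x_cols, y_col):
    """X -> Y holds if no X-value is associated with two Y-values."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in x_cols)
        seen[key].add(row[y_col])
        if len(seen[key]) > 1:
            return False
    return True

# Columns: (dept, course, instructor)
rows = [("CS", "DB", "Rao"), ("CS", "ML", "Nair"), ("EE", "DB", "Rao")]
print(holds_fd(rows, x_cols=[0, 1], y_col=2))  # (dept, course) -> instructor
print(holds_fd(rows, x_cols=[0], y_col=2))     # dept -> instructor? False
```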

Publication Type: Conference Proceedings


Sandhya Harikumar and Roy, M. M., “Data integration of heterogeneous data sources using QR decomposition”, Advances in Intelligent Systems and Computing, 2015 IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems, IEEE SPICES, vol. 385. Springer Verlag, Kochi, India, pp. 333-344, 2016.[Abstract]


Integrating data residing at different sites and providing users with a unified view of these data is extensively studied for commercial and scientific purposes. Amongst the various concerns of integration, semantic integration is the most challenging, addressing the resolution of semantic conflicts between heterogeneous data sources. Even when the data sources belong to a similar domain, the lack of commonality between database schemas and instances can make the unified result of integration inaccurate and difficult to validate. Identifying the most significant or independent attributes of each data source and then providing a unified view of them is thus a central challenge in the realm of heterogeneity, and it demands proper analysis of each data source in order to obtain a comprehensive account of its meaning and structure. The contribution of this paper is the realization of semantic integration of heterogeneous sources from a similar domain using QR decomposition, together with a bridging knowledge base. The independent attributes of each data source are found and integrated, based on the similarity or correlation amongst them, to form a global view of all the data sources with the aid of the knowledge base. For an incomplete knowledge base, we also formulate a recommendation strategy for integrating the possible set of attributes. Experimental results show the feasibility of this approach on data sources from the same domain. © Springer International Publishing Switzerland 2016.

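Column-pivoted QR is one concrete way to extract "independent attributes" as described above; a hedged sketch using SciPy's pivoted QR to rank each source's columns and then pair them by correlation follows (the knowledge-base and recommendation steps are omitted; all data and sizes are illustrative).

```python
import numpy as np
from scipy.linalg import qr

def independent_attributes(X, n_keep):
    """Column-pivoted QR ranks columns by how much new, linearly
    independent information each adds; P lists columns in that order."""
    _, _, P = qr(X - X.mean(0), mode='economic', pivoting=True)
    return P[:n_keep]

def integrate(A, B, n_keep=3):
    """Pair each source's independent attributes by |correlation| to
    propose a unified view (the bridging knowledge base is omitted)."""
    a_idx = independent_attributes(A, n_keep)
    b_idx = independent_attributes(B, n_keep)
    mapping = {}
    for i in a_idx:
        corr = [abs(np.corrcoef(A[:, i], B[:, j])[0, 1]) for j in b_idx]
        mapping[int(i)] = int(b_idx[int(np.argmax(corr))])
    return mapping

A = np.random.rand(120, 7)   # source 1 instances
B = np.random.rand(120, 5)   # source 2 instances (same row count for the demo)
print(integrate(A, B))
```
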
Faculty Research Interest: