Carnegie Mellon University

Foundations of Clustering: New Models and Algorithms

In this dissertation we study clustering, one of the most common unsupervised learning problems. The dissertation covers recent developments in both clustering theory and machine learning practice, and in particular explores how to bridge the gap between theory and practice so that each benefits from the other. Many clustering algorithms are developed in an application-oriented way without the guidance of theory, and for some clustering problems it is hard to characterize mathematically what is being optimized. Emerging needs in the ML/AI community, such as fairness and scalability, also call for updates to current problem formulations.

The first few chapters of this dissertation lay the theoretical foundation for multiple clustering problems. We first establish a formal optimization framework. Such a framework gives us a conceptual understanding of the problems and becomes the basis for optimization and algorithm design. We then analyze the performance of existing approaches and develop new algorithms that beat the state of the art. Empirical evaluations verify that the new algorithms perform better in both quality and efficiency, showing that it is beneficial to view these problems through the lens of theory.

We study one classic clustering problem: hierarchical clustering. Unlike better-formulated clustering problems such as k-means, the theoretical study of hierarchical clustering has begun only recently. The first chapter focuses on designing new objective functions for hierarchical clustering on point inputs in Euclidean space and provides theoretical guarantees for a popular heuristic. The second chapter studies how to incorporate fairness into the hierarchical clustering optimization framework. It defines fair hierarchical clustering trees and discusses how to design algorithms that find fair solutions for the hierarchical clustering objectives previously established by the community. Surprisingly, in this setting fairness can be imposed at minimal loss in clustering performance.

The focus then shifts to speeding up well-known clustering algorithms in scenarios where they are inefficient. We consider average-linkage, which builds the hierarchical tree from bottom to top and is one of the most commonly used hierarchical clustering algorithms. It is known to scale poorly to large datasets, as it repeatedly searches for the two clusters with the smallest average pairwise distance in the current clustering, which is time consuming. To speed up this cluster search, we introduce a new technique named “cluster embedding”, which maps clusters to points in Euclidean space. The points then serve as surrogates for the clusters, enabling us to apply approximate nearest neighbor (ANN) techniques. We reduce the previous quadratic bound on running time to only slightly super-linear.

New challenges also arise from data input formats other than the conventional sample-feature matrix. Consider relational databases, one of the most common and highly compact data storage formats. Naively running conventional ML algorithms requires converting the given database into matrix format, which can cause the input size to grow exponentially. Instead, we design algorithms that work directly on relational databases without recovering the sample-feature matrix. We give such algorithms for the classical k-means problem: we show how to adapt the well-known k-means++ algorithm and find a constant-factor approximation to the optimal k-means solution.

This dissertation also shows how we can rethink the design of combinatorial algorithms by augmenting them with learned oracles using a data-driven approach. Traditional algorithm design and analysis is often bottlenecked by worst-case instances, while practitioners often have historical records of past data and solutions. Training ML oracles on this data yields knowledge about the current problem instance, and this knowledge can help the algorithm go beyond the “hurdles” of hard instances. We call such knowledge “predictions”. The remaining chapters propose feasible predictions in the context of different clustering problems and discuss how to design better algorithms that utilize these predictions.

We revisit the scalable hierarchical clustering algorithms explored in earlier chapters and extend them to inputs in more general metric spaces. In Euclidean space we design cluster embeddings and couple them with ANN search to efficiently identify clusters to merge; however, ANN techniques are not known to exist for general metrics. We show how a proxy metric, which approximates the original metric, can support the ANN search with minimal loss in hierarchical clustering performance.

Finally, we consider correlation clustering. Given a set of points along with recommendations about whether each pair of points should be placed in the same cluster or in separate clusters, the goal is to cluster the points so as to minimize disagreements with the recommendations. We study this problem in the online setting, where points arrive one at a time and, upon arrival, the algorithm must make an irrevocable cluster assignment. A simple lower bound rules out any algorithm with a non-trivial competitive ratio. We propose using a small, randomized subset of nodes as a reference set to help make online clustering decisions: upon the arrival of a new node, the algorithm can check whether it is recommended to be in the same or a different cluster as the nodes in the reference set. We prove that the well-known Pivot algorithm performs well in this setting, and that its performance is robust to adversarial perturbations of the reference set.

Degree Type

  • Dissertation
  • Tepper School of Business

Degree Name

  • Doctor of Philosophy (PhD)

Functional Data Clustering Analysis via the Learning of Gaussian Processes with Wasserstein Distance

  • Conference paper
  • First Online: 20 November 2020

  • Tao Li
  • Jinwen Ma

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12533)

Included in the following conference series:

  • International Conference on Neural Information Processing

Functional data clustering analysis has become an urgent and challenging task in the era of big data. In this paper, we propose a new framework for functional data clustering analysis that adopts a structure similar to the k-means algorithm for conventional clustering analysis. Under this framework, we clarify three issues: how to represent functions, how to measure distances between functions, and how to calculate centers of functions. We utilize Gaussian processes to represent the clusters of functions, which are observed as sample curves or trajectories on a finite set of sample points. Moreover, we adopt the Wasserstein distance to measure the similarity between Gaussian distributions. With this choice of distance, the centers of Gaussian processes can be calculated analytically and efficiently. To demonstrate the effectiveness of the proposed method, we compare it with existing competitive clustering methods on synthetic datasets, and the obtained results are encouraging. We finally apply the proposed method to three real-world datasets with satisfactory results.
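As a concrete illustration of the basic building block of this framework, the following R sketch (not the authors' code) computes the 2-Wasserstein distance between two Gaussians, assuming each Gaussian process has been summarized by a mean vector and covariance matrix on a common grid of sample points; the helper names are illustrative:

```r
# Matrix square root via the symmetric eigendecomposition.
mat_sqrt <- function(S) {
  e <- eigen((S + t(S)) / 2)   # symmetrize for numerical safety
  e$vectors %*% diag(sqrt(pmax(e$values, 0))) %*% t(e$vectors)
}

# 2-Wasserstein distance between N(m1, S1) and N(m2, S2), using the standard
# closed form: |m1 - m2|^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2}).
w2_gaussian <- function(m1, S1, m2, S2) {
  R     <- mat_sqrt(S2)
  cross <- mat_sqrt(R %*% S1 %*% R)
  sqrt(max(sum((m1 - m2)^2) + sum(diag(S1 + S2 - 2 * cross)), 0))
}

# Toy example: two processes observed on the same 3-point grid.
m1 <- c(0, 0, 0); S1 <- diag(3)
m2 <- c(1, 1, 1); S2 <- 2 * diag(3)
w2_gaussian(m1, S1, m2, S2)
```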

The details here are not essential; the Wasserstein 2-distance between Gaussian measures is enough for the development of this work. We state the formal definition here for completeness.
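For the reader's convenience, the standard closed-form expression for the squared 2-Wasserstein distance between two Gaussian measures (a well-known result, stated here rather than quoted from the paper) is

\[
W_2^2\bigl(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\bigr)
= \lVert m_1 - m_2 \rVert_2^2
+ \operatorname{tr}\!\Bigl(\Sigma_1 + \Sigma_2 - 2\bigl(\Sigma_2^{1/2}\,\Sigma_1\,\Sigma_2^{1/2}\bigr)^{1/2}\Bigr).
\]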

Acknowledgements

This work was supported by the National Key R & D Program of China (2018YFC0808305).

Author information

Authors and Affiliations

Department of Information Science, School of Mathematical Sciences and LMAM, Peking University, Beijing, China

Tao Li & Jinwen Ma

Corresponding author

Correspondence to Jinwen Ma.

Editor information

Editors and Affiliations

Department of AI, Ping An Life, Shenzhen, China

Haiqin Yang

Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand

Kitsuchart Pasupa

City University of Hong Kong, Kowloon, China

Andrew Chi-Sing Leung

Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, Hong Kong

James T. Kwok

School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand

Jonathan H. Chan

The Chinese University of Hong Kong, New Territories, Hong Kong

Irwin King

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper.

Li, T., Ma, J. (2020). Functional Data Clustering Analysis via the Learning of Gaussian Processes with Wasserstein Distance. In: Yang, H., Pasupa, K., Leung, A.C.S., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Lecture Notes in Computer Science, vol 12533. Springer, Cham. https://doi.org/10.1007/978-3-030-63833-7_33

DOI: https://doi.org/10.1007/978-3-030-63833-7_33

Published: 20 November 2020

Publisher Name: Springer, Cham

Print ISBN: 978-3-030-63832-0

Online ISBN: 978-3-030-63833-7


Open Access

Peer-reviewed

Research Article

Clustering algorithms: A comparative approach

Authors

  • Mayra Z. Rodriguez
  • Cesar H. Comin
  • Dalcimar Casanova
  • Odemir M. Bruno
  • Diego R. Amancio
  • Luciano da F. Costa
  • Francisco A. Rodrigues

Affiliations

  • Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, São Paulo, Brazil
  • Department of Computer Science, Federal University of São Carlos, São Carlos, São Paulo, Brazil
  • Federal University of Technology, Paraná, Paraná, Brazil
  • São Carlos Institute of Physics, University of São Paulo, São Carlos, São Paulo, Brazil

  • Published: January 15, 2019
  • https://doi.org/10.1371/journal.pone.0210236

Many real-world systems can be studied in terms of pattern recognition tasks, so that proper use (and understanding) of machine learning methods in practical applications becomes essential. While many classification methods have been proposed, there is no consensus on which methods are most suitable for a given dataset. As a consequence, it is important to comprehensively compare methods in many possible scenarios. In this context, we performed a systematic comparison of 9 well-known clustering methods available in the R language, assuming normally distributed data. In order to account for the many possible variations of data, we considered artificial datasets with several tunable properties (number of classes, separation between classes, etc.). In addition, we also evaluated the sensitivity of the clustering methods with respect to their parameter configurations. The results revealed that, when considering the default configurations of the adopted methods, the spectral approach tended to present particularly good performance. We also found that the default configurations of the adopted implementations did not always yield good performance. In these cases, a simple approach based on random selection of parameter values proved to be a good alternative for improving performance. All in all, the reported approach provides guidance for the choice of clustering algorithms.

Citation: Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, Costa LdF, et al. (2019) Clustering algorithms: A comparative approach. PLoS ONE 14(1): e0210236. https://doi.org/10.1371/journal.pone.0210236

Editor: Hans A. Kestler, University of Ulm, GERMANY

Received: December 26, 2016; Accepted: December 19, 2018; Published: January 15, 2019

Copyright: © 2019 Rodriguez et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All datasets used for evaluating the algorithms can be obtained from Figshare: https://figshare.com/s/29005b491a418a667b22 .

Funding: This work has been supported by FAPESP - Fundação de Amparo à Pesquisa do Estado de São Paulo (grant nos. 15/18942-8 and 18/09125-4 for CHC, 14/20830-0 and 16/19069-9 for DRA, 14/08026-1 for OMB and 11/50761-2 and 15/22308-2 for LdFC), CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico (grant nos. 307797/2014-7 for OMB and 307333/2013-2 for LdFC), Núcleo de Apoio à Pesquisa (LdFC) and CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Finance Code 001).

Competing interests: The authors have declared that no competing interests exist.

Introduction

In recent years, the automation of data collection and recording has produced a deluge of information about many different kinds of systems [ 1 – 8 ]. As a consequence, many methodologies aimed at organizing and modeling data have been developed [ 9 ]. Such methodologies are motivated by their widespread application in diagnosis [ 10 ], education [ 11 ], forecasting [ 12 ], and many other domains [ 13 ]. The definition, evaluation and application of these methodologies are all part of the machine learning field [ 14 ], which has become a major subarea of computer science and statistics due to its crucial role in the modern world.

Machine learning encompasses different topics such as regression analysis [ 15 ], feature selection methods [ 16 ], and classification [ 14 ]. The latter involves assigning classes to the objects in a dataset. Three main approaches can be considered for classification: supervised, semi-supervised and unsupervised classification. In the supervised case, the classes, or labels, of some objects are known beforehand, defining the training set, and an algorithm is used to obtain the classification criteria. Semi-supervised classification deals with training the algorithm using both labeled and unlabeled data; it is commonly used when manually labeling a dataset becomes costly. Lastly, unsupervised classification, henceforth referred to as clustering , deals with defining classes from the data without prior knowledge of the class labels. The purpose of clustering algorithms is to identify groups of objects, or clusters, that are more similar to each other than to other clusters. Such an approach to data analysis is closely related to the task of creating a model of the data, that is, defining a simplified set of properties that can provide an intuitive explanation of relevant aspects of a dataset. Clustering methods are generally more demanding than supervised approaches, but provide more insight about complex data. This type of method constitutes the main object of the current work.

Because clustering algorithms involve several parameters, often operate in high-dimensional spaces, and have to cope with noisy, incomplete and sampled data, their performance can vary substantially for different applications and types of data. For such reasons, several different approaches to clustering have been proposed in the literature (e.g. [ 17 – 19 ]). In practice, it becomes a difficult endeavor, given a dataset or problem, to choose a suitable clustering approach. Nevertheless, much can be learned by comparing different clustering methods. Several previous efforts to compare clustering algorithms have been reported in the literature [ 20 – 29 ]. Here, we focus on generating a diversified and comprehensive set of artificial, normally distributed data containing not only distinct numbers of classes, features and objects and different separations between classes, but also a varied structure of the involved groups (e.g. possessing predefined correlation distributions between features). The purpose of using artificial data is the possibility of obtaining an unlimited number of samples and of systematically changing any of the aforementioned properties of a dataset. Such features allow the clustering algorithms to be comprehensively and strictly evaluated in a vast number of circumstances, and also grant the possibility of quantifying the sensitivity of the performance with respect to small changes in the data. It should be observed, nevertheless, that the performance results reported in this work are therefore restricted to normally distributed data, and other results could be expected for other types of data following other statistical behavior. Here we associate performance with the similarity between the known labels of the objects and those found by the algorithm. Many measurements have been defined for quantifying such similarity [ 30 ]; we compare the Jaccard index [ 31 ], Adjusted Rand index [ 32 ], Fowlkes-Mallows index [ 33 ] and Normalized Mutual Information [ 34 ]. A modified version of the procedure developed by [ 35 ] was used to create 400 distinct datasets, which were used to quantify the performance of the clustering algorithms. We describe the adopted procedure and the respective parameters used for data generation. Related approaches include [ 36 ].

Each clustering algorithm relies on a set of parameters that needs to be adjusted in order to achieve viable performance, which is an important point to be addressed when comparing clustering algorithms. A long-standing problem in machine learning is the definition of a proper procedure for setting the parameter values [ 37 ]. In principle, one can apply an optimization procedure (e.g., simulated annealing [ 38 ] or genetic algorithms [ 39 ]) to find the parameter configuration providing the best performance of a given algorithm. Nevertheless, there are two major problems with such an approach. First, adjusting parameters to a given dataset may lead to overfitting [ 40 ]. That is, the specific values found to provide good performance may lead to lower performance when new data are considered. Second, parameter optimization can be unfeasible in some cases, given the time complexity of many algorithms combined with their typically large number of parameters. Ultimately, many researchers resort to applying classification or clustering algorithms using the default parameters provided by the software. Therefore, efforts are required to evaluate and compare the performance of clustering algorithms in both the optimized and default situations. In the following, we consider some representative examples of algorithms applied in the literature [ 37 , 41 ].

Clustering algorithms have been implemented in several programming languages and packages. During the development and implementation of such code, it is common to introduce changes or optimizations, leading to new versions of the original methods. The current work focuses on the comparative analysis of several clustering algorithms found in popular packages available in the R programming language [ 42 ]. This choice was motivated by the popularity of the R language in the data mining field and by the well-established clustering packages it contains. This study is intended to assist researchers who have programming skills in the R language but little experience in data clustering.

The algorithms are evaluated in three distinct situations. First, we consider their performance when using the default parameters provided by the packages. Then, we consider the performance variation when single parameters of the algorithms are changed while the rest are kept at their default values. Finally, we consider the simultaneous variation of all parameters by means of a random sampling procedure. We compare the results obtained in the latter two situations with those achieved with the default parameters, so as to investigate the possible improvements in performance that could be achieved by adjusting the parameters.

The algorithms were evaluated on 400 artificial, normally distributed, datasets generated by a robust methodology previously described in [ 36 ]. The number of features, number of classes, number of objects for each class and average distance between classes can be systematically changed among the datasets.

The text is organized as follows. We start by reviewing the main approaches to comparing clustering algorithms. Next, we describe the clustering methods considered in the analysis, together with the R packages implementing them. The data generation method and the performance measurements used to compare the algorithms are then presented, followed by the performance results obtained with default parameters, with single-parameter variation and with random parameter sampling.

Related works

Previous approaches for comparing the performance of clustering algorithms can be divided according to the nature of the datasets used. While some studies use either real-world or artificial data, others employ both types of datasets to compare the performance of several clustering methods.

A comparative analysis using real-world datasets is presented in several works [ 20 , 21 , 24 , 25 , 43 , 44 ]. Some of these works are briefly reviewed in the following. In [ 43 ], the authors propose an evaluation approach based on multiple-criteria decision making in the domain of financial risk analysis, applied to three real-world credit risk and bankruptcy risk datasets. More specifically, clustering algorithms are evaluated in terms of a combination of clustering measurements, which includes a collection of external and internal validity indexes. Their results show that no algorithm can achieve the best performance on all measurements for any dataset and, for this reason, it is mandatory to use more than one performance measure to evaluate clustering algorithms.

In [ 21 ], a comparative analysis of clustering methods was performed in the context of a text-independent speaker verification task, using three document datasets. Two approaches were considered: clustering algorithms focused on minimizing a distance-based objective function and a Gaussian model-based approach. The following algorithms were compared: k-means, random swap, expectation-maximization, hierarchical clustering, self-organizing maps (SOM) and fuzzy c-means. The authors found that the most important factor for the success of the algorithms is the model order, which represents the number of centroids or Gaussian components (for Gaussian model-based approaches) considered. Overall, the recognition accuracy was similar for the clustering algorithms focused on minimizing a distance-based objective function. When the number of clusters was small, SOM and hierarchical methods provided lower accuracy than the other methods. Finally, a comparison of the computational efficiency of the methods revealed that the split hierarchical method is the fastest clustering algorithm on the considered datasets.

In [ 25 ], five clustering methods were studied: k-means, multivariate Gaussian mixture, hierarchical clustering, spectral and nearest-neighbor methods. Four proximity measures were used in the experiments: Pearson and Spearman correlation coefficients, cosine similarity and the Euclidean distance. The algorithms were evaluated on 35 gene expression datasets from either Affymetrix or cDNA chip platforms, using the Adjusted Rand index for performance evaluation. The multivariate Gaussian mixture method provided the best performance in recovering the actual number of clusters of the datasets. The k-means method displayed similar performance. In this same analysis, the hierarchical method led to limited performance, while the spectral method proved particularly sensitive to the proximity measure employed.

In [ 24 ], experiments were performed to compare five different types of clustering algorithms: CLICK, a self-organizing map based method (SOM), k-means, hierarchical and dynamical clustering. Datasets of gene expression time series of the yeast Saccharomyces cerevisiae were used. A k-fold cross-validation procedure was considered to compare the different algorithms. The authors found that k-means, dynamical clustering and SOM tended to yield high accuracy in all experiments. On the other hand, hierarchical clustering presented more limited performance in clustering larger datasets, yielding low accuracy in some experiments.

A comparative analysis using artificial data is presented in [ 45 – 47 ]. In [ 47 ], two subspace clustering methods were compared: MAFIA (Adaptive Grids for Clustering Massive Data Sets) [ 48 ] and FINDIT (A Fast and Intelligent Subspace Clustering Algorithm Using Dimension Voting) [ 49 ]. The artificial data, modeled according to a normal distribution, allowed control of the number of dimensions and instances. The methods were evaluated in terms of both scalability and accuracy. For the former, the running time of both algorithms was compared for different numbers of instances and features. In addition, the authors assessed the ability of the methods to find adequate subspaces for each cluster. They found that MAFIA discovered all relevant clusters, but one significant dimension was left out in most cases. Conversely, the FINDIT method performed better in the task of identifying the most relevant dimensions. Both algorithms were found to scale linearly with the number of instances; however, MAFIA outperformed FINDIT in most of the tests.

Another common approach for comparing clustering algorithms considers using a mixture of real-world and artificial data (e.g. [ 23 , 26 – 28 , 50 ]). In [ 28 ], the performance of k-means, single linkage and simulated annealing (SA) was evaluated, considering different partitions obtained by validation indexes. The authors used two real-world datasets obtained from [ 51 ] and three artificial datasets (having two dimensions and 10 clusters). The authors proposed a new validation index, called the I index, which measures separation based on the maximum distance between clusters and compactness based on the sum of distances between objects and their respective centroids. They found that this index was the most reliable among the considered indices, reaching its maximum value when the number of clusters is properly chosen.

A systematic quantitative evaluation of four graph-based clustering methods was performed in [ 27 ]. The compared methods were: Markov clustering (MCL), restricted neighborhood search clustering (RNSC), super-paramagnetic clustering (SPC), and molecular complex detection (MCODE). Six datasets modeling protein interactions in Saccharomyces cerevisiae and 84 random graphs were used for the comparison. For each algorithm, the robustness of the methods was measured in a twofold fashion: the variation of performance was quantified in terms of changes in (i) the method parameters and (ii) the dataset properties. In the latter, connections were included and removed to reflect uncertainties in the relationships between proteins. The restricted neighborhood search clustering method turned out to be particularly robust to variations in the choice of method parameters, whereas the other algorithms were found to be more robust to dataset alterations. In [ 52 ] the authors report a brief comparison of clustering algorithms using the Fundamental Clustering Problems Suite (FCPS) as dataset. The FCPS contains artificial and real datasets for testing clustering algorithms. Each dataset represents a particular challenge that the clustering algorithm has to handle; for example, in the Hepta and Lsun datasets the clusters can be separated by a linear decision boundary, but have different densities and variances. On the other hand, the ChainLink and Atom datasets cannot be separated by linear decision boundaries. Likewise, the Target dataset contains outliers. Lower performance was obtained by the single linkage clustering algorithm for the Tetra, EngyTime, Twodiamonds and Wingnut datasets. Although the datasets are quite versatile, it is not possible to control and evaluate how some of their characteristics, such as the number of features or instances, affect the clustering accuracy.

Clustering methods

Many different types of clustering methods have been proposed in the literature [ 53 – 56 ]. Despite such diversity, some methods are more frequently used than others [ 57 ]. Also, many of the commonly employed methods are defined in terms of similar assumptions about the data (e.g., k-means and k-medoids) or consider analogous mathematical concepts (e.g., similarity matrices for spectral or graph clustering) and, consequently, should provide similar performance in typical usage scenarios. Therefore, in the following we consider a choice of clustering algorithms from different families of methods. Several taxonomies have been proposed to organize the many different types of clustering algorithms into families [ 29 , 58 ]. While some taxonomies categorize the algorithms based on their objective functions [ 58 ], others focus on the specific structures desired for the obtained clusters (e.g. hierarchical) [ 29 ]. Here we consider the algorithms indicated in Table 1 as examples of the categories indicated in the same table. The algorithms represent some of the main types of methods in the literature. Note that some algorithms are from the same family, but in these cases they possess notable differences in their applications (e.g., treating very large datasets using clara). A short description of the parameters of each considered algorithm is provided in S1 File of the supplementary material.

Table 1. The first column shows the name of the algorithms used throughout the text. The second column indicates the category of the algorithms. The third and fourth columns contain, respectively, the function name and R library of each algorithm.

https://doi.org/10.1371/journal.pone.0210236.t001

Regarding partitional approaches, the k-means [ 68 ] algorithm has been widely used by researchers [ 57 ]. This method requires as input parameters the number of groups ( k ) and a distance metric. Initially, each data point is associated with one of the k clusters according to its distance to the centroids (cluster centers). An example is shown in Fig 1(a) , where black points correspond to centroids and the remaining points have the same color if their closest centroid is the same. Then, new centroids are calculated, and the classification of the data points is repeated for the new centroids, as indicated in Fig 1(b) , where gray points indicate the positions of the centroids in the previous iteration. The process is repeated until no significant change of the centroid positions is observed at each new step, as shown in Fig 1(c) and 1(d) .

Fig 1. Each plot shows the partition obtained after specific iterations of the algorithm. The centroids of the clusters are shown as black markers. Points are colored according to their assigned clusters. Gray markers indicate the position of the centroids in the previous iteration. The dataset contains 2 clusters, but k = 4 seeds were used in the algorithm.

https://doi.org/10.1371/journal.pone.0210236.g001

The a priori setting of the number of clusters is the main limitation of the k-means algorithm, because the final classification can strongly depend on the choice of the number of centroids [ 68 ]. In addition, k-means is not particularly recommended in cases where the clusters do not show a convex distribution or have very different sizes [ 59 , 60 ]. Moreover, the k-means algorithm is sensitive to the initial seed selection [ 41 ]. Given these limitations, many modifications of this algorithm have been proposed [ 61 – 63 ], such as k-medoids [ 64 ] and k-means++ [ 65 ]. Nevertheless, this algorithm, besides having low computational cost, can provide good results in many practical situations such as anomaly detection [ 66 ] and data segmentation [ 67 ]. The R routine used for k-means clustering was the kmeans function from the stats package, which contains the implementations of the algorithms proposed by MacQueen [ 68 ] and by Hartigan and Wong [ 69 ]. The algorithm of Hartigan and Wong is employed by the stats package when the parameters are set to their default values, while the algorithm proposed by MacQueen is used in all other cases.

Another interesting example of a partitional clustering algorithm is clustering for large applications (clara) [ 70 ]. This method takes multiple fixed-size samples of the dataset to minimize sampling bias and, subsequently, selects the best medoids among the chosen samples, where a medoid is defined as the object i for which the average dissimilarity to all other objects in its cluster is minimal. This method tends to be efficient for large amounts of data because it does not explore the whole neighborhood of the data points [ 71 ], although the quality of the results has been found to strongly depend on the number of objects in the sample data [ 62 ]. The clara algorithm employed in our analysis was provided by the clara function contained in the cluster package. This function implements the method developed by Kaufman and Rousseeuw [ 70 ].
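To make the usage of these two routines concrete, the snippet below runs kmeans and clara on a small synthetic dataset. This is only a usage sketch with toy data and arbitrary parameter values, not the configuration used in the paper.

```r
library(cluster)  # provides clara()

set.seed(42)
# Toy data: two Gaussian blobs in 2 dimensions.
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))

# k-means from the stats package; Hartigan-Wong is the default algorithm,
# and MacQueen's variant can be requested explicitly.
km_default  <- kmeans(x, centers = 2)
km_macqueen <- kmeans(x, centers = 2, algorithm = "MacQueen")

# clara: medoid-based clustering computed on repeated fixed-size samples of
# the data, intended for large datasets.
cl <- clara(x, k = 2, samples = 5, sampsize = 40)

# Compare the two partitions.
table(km_default$cluster, cl$clustering)
```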

Ordering Points To Identify the Clustering Structure (OPTICS) [ 72 , 73 ] is a density-based cluster ordering based on the concept of maximal density-reachability [ 72 ]. The algorithm starts with a data point and expands its neighborhood using a procedure similar to that of the dbscan algorithm [ 74 ], with the difference that the neighborhood is first expanded to points with low core-distance. The core-distance of an object p is defined as the m -th smallest distance between p and the objects in its ϵ -neighborhood (i.e., objects having distance less than or equal to ϵ from p ), where m is a parameter of the algorithm indicating the smallest number of points that can form a cluster. The optics algorithm can detect clusters having large density variations and irregular shapes. The R routine used for optics clustering was the optics function from the dbscan package, which implements the original algorithm developed by Ankerst et al. [ 72 ]. A hierarchical clustering structure can be constructed from the output of the optics algorithm using the function extractXi from the dbscan package. We note that the function extractDBSCAN , from the same package, provides a clustering from an optics ordering that is similar to what the dbscan algorithm would generate.
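A minimal usage sketch of the optics routine and the two extraction functions mentioned above (toy data and arbitrary parameter values, shown only to illustrate the calls):

```r
library(dbscan)

set.seed(1)
# Toy data: two dense blobs.
x <- rbind(matrix(rnorm(300, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(300, mean = 2, sd = 0.3), ncol = 2))

# Compute the density-reachability ordering; minPts plays the role of the
# parameter m in the text (smallest number of points that can form a cluster).
res <- optics(x, minPts = 10)

# A dbscan-like flat clustering extracted from the ordering ...
cl_db <- extractDBSCAN(res, eps_cl = 0.5)
# ... and a hierarchical (xi-based) extraction.
cl_xi <- extractXi(res, xi = 0.05)

table(cl_db$cluster)  # cluster 0 denotes noise points
```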

Clustering methods that take into account the linkage between data points, traditionally known as hierarchical methods, can be subdivided into two groups: agglomerative and divisive [ 59 ]. In an agglomerative hierarchical clustering algorithm, each object initially belongs to its own individual cluster; then, over successive iterations, groups are merged until a stop condition is reached. In contrast, a divisive hierarchical clustering method starts with all objects in a single cluster and, over successive iterations, objects are separated into clusters. There are two main packages in the R language that provide routines for performing hierarchical clustering: stats and cluster . Here we consider the agnes routine from the cluster package, which implements the algorithm proposed by Kaufman and Rousseeuw [ 70 ]. Four well-known linkage criteria are available in agnes , namely single linkage, complete linkage, Ward's method, and weighted average linkage [ 75 ].
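The call below illustrates agnes with the average linkage criterion on the iris data, followed by cutting the tree into a flat partition (an illustrative sketch, not the paper's experimental setup):

```r
library(cluster)

data(iris)
x <- iris[, 1:4]

# Agglomerative hierarchical clustering (agnes) with average linkage;
# "single", "complete", "ward" and "weighted" are also available.
ag <- agnes(x, method = "average")

# Convert to an hclust object and cut the tree into 3 clusters.
part <- cutree(as.hclust(ag), k = 3)
table(part, iris$Species)
```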

Model-based methods can be regarded as a general framework for maximum likelihood estimation of the parameters of an underlying distribution for a given dataset. A well-known instance of model-based methods is the expectation-maximization (EM) algorithm. Most commonly, one considers that the data from each class can be modeled by a multivariate normal distribution, and, therefore, the distribution observed for the whole dataset can be seen as a mixture of such normal distributions. A maximum likelihood approach is then applied to find the most probable parameters of the normal distribution of each class. The EM approach to clustering is particularly suitable when the dataset is incomplete [ 76 , 77 ]. On the other hand, the clusters obtained from the method may strongly depend on the initial conditions [ 54 ]. In addition, the algorithm may fail to find very small clusters [ 29 , 78 ]. In the R language, the package mclust [ 79 , 80 ] provides iterative EM (expectation-maximization) methods for maximum likelihood estimation using parameterized Gaussian mixture models. The functions estep and mstep implement the individual steps of an EM iteration. A related algorithm that is also analyzed in the current study is hcmodel, which can be found in the hc function of the mclust package. The hcmodel algorithm, which is also based on Gaussian mixtures, was proposed by Fraley [ 81 ]. The algorithm contains many additional steps compared to traditional EM methods, such as an agglomerative procedure and the adjustment of model parameters through a Bayes factor selection using the BIC approximation [ 82 ].
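A short sketch of how the mclust routines mentioned above are typically called (illustrative only; the model name and number of groups are arbitrary choices here):

```r
library(mclust)

data(iris)
x <- iris[, 1:4]

# Gaussian-mixture clustering via EM; by default the number of components
# and the covariance model are chosen by BIC over G = 1..9.
fit <- Mclust(x)
table(fit$classification, iris$Species)

# Model-based agglomerative hierarchical clustering (the basis of hcmodel)
# from the same package, here with an unconstrained covariance model.
hc_tree <- hc(x, modelName = "VVV")
part <- hclass(hc_tree, G = 3)   # flat partition with 3 groups
```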

In recent years, the efficient handling of high-dimensional data has become of paramount importance and, for this reason, this capability has become desirable when choosing the most appropriate method for obtaining accurate partitions. To tackle high-dimensional data, subspace clustering was proposed [ 49 ]. This method works by considering the similarity between objects with respect to distinct subsets of the attributes [ 88 ]. The motivation for doing so is that different subsets of the attributes might define distinct separations of the data. Therefore, the algorithm can identify clusters that exist in multiple, possibly overlapping, subspaces [ 49 ]. Subspace algorithms can be categorized into four main families [ 89 ], namely: lattice, statistical, approximation and hybrid. The hddc function from the package HDclassif implements the subspace clustering method of Bouveyron [ 90 ] in the R language. The algorithm is based on statistical models, with the assumption that all attributes may be relevant for clustering [ 91 ]. Some parameters of the algorithm, such as the number of clusters or the model to be used, are estimated using an EM procedure.
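A usage sketch for hddc (a toy call on the iris data; the default model and EM initialization are used, which need not match the settings adopted in the paper):

```r
library(HDclassif)

data(iris)
x <- iris[, 1:4]

# Subspace (high-dimensional data) clustering: each cluster is modeled in its
# own low-dimensional Gaussian subspace, with parameters fitted by EM.
set.seed(3)
fit <- hddc(x, K = 3)
table(fit$class, iris$Species)
```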

So far, we have discussed the application of clustering algorithms to static data. Nevertheless, when analyzing data, it is important to take into account whether the data are dynamic or static. Dynamic data, unlike static data, undergo changes over time. Some kinds of data, like the network packets received by a router and credit card transaction streams, are transient in nature and are known as data streams . Another example of dynamic data are time series, since their values change over time [ 92 ]. Dynamic data usually include a large number of features, and the number of objects is potentially unbounded [ 59 ]. This requires novel approaches to quickly process the entire volume of continuously incoming data [ 93 ], to detect new clusters as they are formed and to identify outliers [ 94 ].

Materials and methods

Artificial datasets.

The proper comparison of clustering algorithms requires a robust artificial data generation method to produce a variety of datasets. For this task, we apply a methodology based on previous work by Hirschberger et al. [ 35 ]. The procedure can be used to generate normally distributed samples characterized by F features and separated into C classes. In addition, the method can control both the variance and the correlation distributions among the features for each class. The artificial datasets can also be generated by varying the number of objects per class, Ne, and the expected separation, α , between the classes.

For each class i in the dataset, a covariance matrix Ri of size F × F is created, and this matrix is used for generating Ne objects for the class. This means that pairs of features can have distinct correlations in each generated class. Then, the generated class values are divided by α and translated by si, where si is a random variable described by a uniform distribution on the interval [−1, 1]. Parameter α is associated with the expected distances between classes. Such distances can have different impacts on the clustering depending on the number of objects and features used in the dataset. The features in the generated data have a multivariate normal distribution. In addition, the covariance among the features also has a normal distribution. Notice that such a procedure for the generation of artificial datasets was previously used in [ 36 ].
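The following is a simplified sketch, in the spirit of the procedure described above, of how such classes can be generated in R (random covariance per class, within-class spread divided by α, class centers drawn uniformly from [−1, 1]); it is not the exact generator of [ 35 , 36 ], and the function make_dataset is purely illustrative:

```r
library(MASS)  # mvrnorm()

# C classes, F features, Ne objects per class. Each class gets its own random
# positive-definite covariance matrix; dividing by alpha shrinks the
# within-class spread, so larger alpha means better-separated classes.
make_dataset <- function(C = 2, F = 2, Ne = 100, alpha = 3) {
  do.call(rbind, lapply(seq_len(C), function(i) {
    A     <- matrix(rnorm(F * F), F, F)
    Sigma <- crossprod(A) + 0.1 * diag(F)      # random covariance R_i
    shift <- runif(F, -1, 1)                   # class translation s_i
    pts   <- mvrnorm(Ne, mu = rep(0, F), Sigma = Sigma) / alpha
    data.frame(sweep(pts, 2, shift, "+"), class = i)
  }))
}

d <- make_dataset(C = 2, F = 2, Ne = 100, alpha = 3.3)
head(d)
```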

In Fig 2 , we show some examples of artificially generated data. For visualization purposes, all considered cases contain F = 2 features. The parameters used for each case are described in the caption of the figure. Note that the methodology can generate a variety of dataset configurations, including variations in the feature correlations of each class.

Fig 2. The parameters used for each case are (a) C = 2, Ne = 100 and α = 3.3. (b) C = 2, Ne = 100 and α = 2.3. (c) C = 10, Ne = 50 and α = 4.3. (d) C = 10, Ne = 50 and α = 6.3. Note that each class can present highly distinct properties due to differences in correlation between their features.

https://doi.org/10.1371/journal.pone.0210236.g002

  • Number of classes ( C ): The generated datasets are divided into C = {2, 10, 50} classes.
  • Number of features ( F ): The number of features to characterize the objects is F = {2, 5, 10, 50, 200}.
  • Number of object per class ( N e ): we considered Ne = {5, 50, 100, 500, 5000} objects per class. In our experiments, in a given generated dataset, the number of instances for each class is constant.
  • Mixing parameter ( α ): This parameter has a non-trivial dependence on the number of classes and features. Therefore, for each dataset, the value of this parameter was tuned so that no algorithm would achieve an accuracy of 0% or 100%.

We refer to datasets containing 2, 10, 50 and 200 features as DB2F, DB10F, DB50F and DB200F, respectively. Such datasets comprise all considered numbers of classes, C = {2, 10, 50}, and 50 elements per class (i.e., Ne = 50). In some cases, we also indicate the number of classes considered for the dataset. For example, dataset DB2C10F contains 2 classes, 10 features and 50 elements per class.

For each case, we consider 10 realizations of the dataset. Therefore, 400 datasets were generated in total.

Evaluating the performance of clustering algorithms

The evaluation of the quality of the generated partitions is one of the most important issues in cluster analysis [ 30 ]. Indices used for measuring the quality of a partition can be categorized into two classes: internal and external indices. Internal validation indices are based on information intrinsic to the data and evaluate the goodness of a clustering structure without external information. When the correct partition is not available, it is possible to estimate the quality of a partition by measuring how closely each instance is related to its cluster and how well separated the clusters are. Internal indices are mainly used for choosing an optimal clustering algorithm to be applied on a specific dataset [ 96 ]. External validation indices, on the other hand, measure the similarity between the output of the clustering algorithm and the correct partitioning of the dataset. The Jaccard, Fowlkes-Mallows and Adjusted Rand indices all belong to the pair-counting category, making them closely related. Some differences include the fact that they can exhibit bias with respect to the number of clusters or the distribution of class sizes in a partition; normalization helps prevent this unwanted effect. In [ 97 ] the authors discuss several types of bias that may affect external cluster validity indices. A total of 26 pair-counting based external cluster validity indices were used to identify the bias generated by the number of clusters. It was shown that the Fowlkes-Mallows and Jaccard indices monotonically decrease as the number of clusters increases, favoring partitions with a smaller number of clusters, while the Adjusted Rand index tends to be indifferent to the number of clusters.
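As an illustration of how these external indices are computed from two label vectors, the sketch below derives the Jaccard and Fowlkes-Mallows indices from pair counts and uses mclust's adjustedRandIndex for the ARI (toy labels only):

```r
library(mclust)  # adjustedRandIndex()

# Pair-counting external indices for a reference labeling and a clustering.
external_indices <- function(truth, pred) {
  tab <- table(truth, pred)
  a <- sum(choose(tab, 2))                    # pairs together in both partitions
  b <- sum(choose(rowSums(tab), 2)) - a       # together in the reference only
  d <- sum(choose(colSums(tab), 2)) - a       # together in the clustering only
  c(jaccard         = a / (a + b + d),
    fowlkes_mallows = a / sqrt((a + b) * (a + d)),
    adjusted_rand   = adjustedRandIndex(truth, pred))
}

set.seed(5)
truth <- rep(1:2, each = 50)
pred  <- kmeans(matrix(rnorm(200), ncol = 2), centers = 2)$cluster
external_indices(truth, pred)
```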

Note that when the two sets of labels have a perfect one-to-one correspondence, the quality measures are all equal to unity.

Previous works have shown that there is no single internal cluster validation index that outperforms the other indices [ 100 , 101 ]. In [ 101 ] the authors compare a set of internal cluster validation indices in many distinct scenarios, indicating that the Silhouette index yielded the best results in most cases.

Results and discussion

The accuracy of each considered clustering algorithm was evaluated using three methodologies. In the first methodology, we consider the default parameters of the algorithms provided by the R packages. The reason for measuring performance using the default parameters is to consider the case where a researcher applies an algorithm to a dataset without any parameter adjustment, a common scenario when the researcher is not a machine learning expert. In the second methodology, we quantify the influence of the algorithms' parameters on the accuracy. This is done by varying a single parameter of an algorithm while keeping the others at their default values. The third methodology consists of analyzing the performance obtained by randomly varying all parameters of an algorithm. This procedure allows the quantification of properties such as the maximum accuracy attained and the sensitivity of the algorithm to parameter variation.

Performance when using default parameters

In this experiment, we evaluated the performance of the algorithms on all datasets described in Section Artificial datasets . All unsupervised algorithms were set to their default parameter configurations. For each algorithm, we divide the results according to the number of features contained in the dataset. In other words, for a given number of features, F , we used datasets with C = {2, 10, 50} classes and Ne = {5, 50, 100} objects per class. Thus, the performance results obtained for each F correspond to the performance averaged over distinct numbers of classes and objects per class. We note that the algorithm based on subspaces cannot be applied to datasets containing 2 features, and therefore its accuracy was not quantified for such datasets.

In Fig 3 , we show the obtained values for the four considered performance metrics. The results indicate that all performance metrics provide similar results. Also, the hierarchical method seems to be strongly affected by the number of features in the dataset; in fact, with 50 and 200 features the hierarchical method provided lower accuracy. The k-means, spectral, optics and dbscan methods benefit from an increase in the number of features. Interestingly, hcmodel performs better on the datasets containing 10 features than on those containing 2, 50 and 200 features, which suggests an optimum for this algorithm around 10 features. It is also clear that for 2 features the performance of the algorithms tends to be similar, with the exception of the optics and dbscan methods, while a larger number of features induces marked differences in performance. In particular, for 200 features, the spectral algorithm provides the best results among all methods.

Fig 3. All artificial datasets were used for evaluation. The averages were calculated separately for datasets containing 2, 10 and 50 features. The considered performance indexes are (a) adjusted Rand, (b) Jaccard, (c) normalized mutual information and (d) Fowlkes-Mallows.

https://doi.org/10.1371/journal.pone.0210236.g003

We use the Kruskal-Wallis test [ 102 ], a nonparametric test, to verify whether the differences in performance across clustering methods are statistically significant for each number of features. For 2 features, the Kruskal-Wallis test returns a p-value of p = 6.48 × 10⁻⁷ (chi-squared statistic χ² = 41.50), so the difference in performance is statistically significant when considering all algorithms. For datasets containing 10 features, the test returns p = 1.53 × 10⁻⁸ (χ² = 52.20); for 50 features, p = 1.56 × 10⁻⁶ (χ² = 41.67); and for 200 features, p = 2.49 × 10⁻⁶ (χ² = 40.58). Therefore, the null hypothesis of the Kruskal-Wallis test is rejected in all cases, meaning that the algorithms indeed have significant differences in performance for 2, 10, 50 and 200 features, as indicated in Fig 3 .
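For illustration, the test as used here amounts to comparing per-dataset scores grouped by algorithm, e.g. (with made-up toy scores, not the values reported above):

```r
# Kruskal-Wallis rank-sum test: do accuracy scores differ across algorithms?
set.seed(7)
scores <- data.frame(
  algorithm = rep(c("kmeans", "spectral", "hierarchical"), each = 20),
  ari       = c(rnorm(20, mean = 0.70, sd = 0.05),
                rnorm(20, mean = 0.78, sd = 0.05),
                rnorm(20, mean = 0.55, sd = 0.05))
)
kruskal.test(ari ~ algorithm, data = scores)
```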

In order to verify the influence of the number of objects used for classification, we also calculated the average accuracy for datasets separated according to the number of objects per class, Ne. The result is shown in Fig 4 . We observe that the impact of changing Ne on the accuracy depends on the algorithm. Surprisingly, the hierarchical, k-means and clara methods attain lower accuracy when more data are used. This result indicates that these algorithms tend to be less robust to the larger overlap between clusters caused by an increase in the number of objects. We also observe that a larger Ne enhances the performance of the hcmodel, optics and dbscan algorithms. This result is in agreement with [ 90 ].

Fig 4. All artificial datasets were used for evaluation. The averages were calculated separately for datasets containing 5, 50 and 100 objects per class. The considered performance indexes are (a) adjusted Rand, (b) Jaccard, (c) normalized mutual information and (d) Fowlkes-Mallows.

https://doi.org/10.1371/journal.pone.0210236.g004

In most clustering algorithms, the size of the data has an effect on the clustering quality. In order to quantify this effect, we considered a scenario where the data have a large number of instances. Datasets with F = 5, C = 10 and Ne = {5, 50, 500, 5000} instances per class were created; this collection of datasets will be referred to as DB10C5F. In Fig 5 we can observe that the subspace and spectral methods lead to improved accuracy as the number of instances increases. On the other hand, the size of the dataset does not seem to influence the accuracy of the kmeans, clara, hcmodel and EM algorithms. For the spectral, hierarchical and hcmodel algorithms, the accuracy could not be calculated when 5000 instances per class were used, due to the amount of memory required by these methods. For example, the spectral algorithm requires considerable memory and processing power to compute and store the kernel matrix when it is executed. When the size of the dataset is too small, we see that the subspace algorithm results in low accuracy.

Fig 5. The plots correspond to the ARI, Jaccard and FM indexes averaged for all datasets containing 10 classes and 5 features (DB10C5F). https://doi.org/10.1371/journal.pone.0210236.g005

It is also interesting to verify the performance of the clustering algorithms when setting distinct values for the expected number of classes K in the dataset. Such a value is usually not known beforehand in real datasets. For instance, one might expect the data to contain 10 classes and therefore set K = 10 in the algorithm, while the objects may actually be better accommodated into 12 classes. An accurate algorithm should still provide reasonable results even when a wrong number of classes is assumed. Thus, we varied K for each algorithm and verified the resulting variation in accuracy. Note that the optics and dbscan methods were not considered in this analysis, as they do not have a parameter for setting the number of classes. In order to simplify the analysis, we only considered datasets comprising objects described by 10 features and divided into 10 classes (DB10C10F). The results are shown in Fig 6. The top plots correspond to the average ARI and Jaccard indexes calculated for DB10C10F, while the Silhouette and Dunn indexes are shown at the bottom of the figure. The results indicate that setting K < 10 leads to worse performance than setting K > 10, which suggests that a slight overestimation of the number of classes has a smaller effect on performance than underestimation. Therefore, a good strategy for choosing K seems to be setting it to values slightly larger than the expected number of classes. An interesting behavior is observed for hierarchical clustering: the accuracy improves as the number of expected classes increases. This behavior is due to the default value of the method parameter, which is set to "average", meaning that the unweighted pair group method with arithmetic mean (UPGMA) is used to agglomerate the points. In UPGMA, the distance between two clusters is the average of the dissimilarities between the points in one cluster and the points in the other cluster. The moderate performance of UPGMA in recovering the original groups, even with high subgroup differentiation, is probably a consequence of the fact that UPGMA tends to produce unbalanced clusterings, in which most objects are assigned to a few clusters while many other clusters contain only one or two objects.
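The sketch below reproduces the shape of this experiment for average-linkage (UPGMA) hierarchical clustering: the tree is cut at a range of assumed K values and the ARI against the true labels is recorded. It is a simplified illustration on synthetic data, not the study's pipeline.

    # Sketch: sensitivity of UPGMA hierarchical clustering to the assumed K.
    library(mclust)                               # adjustedRandIndex()

    set.seed(1)
    labels <- rep(1:10, each = 50)                # 10 true classes
    x <- t(sapply(labels, function(k) rnorm(10, mean = 3 * k)))  # 10 features

    tree <- hclust(dist(x), method = "average")   # UPGMA linkage
    K    <- 2:20
    ari_by_k <- sapply(K, function(k) adjustedRandIndex(labels, cutree(tree, k)))

    plot(K, ari_by_k, type = "b",
         xlab = "assumed number of classes K", ylab = "ARI")
    abline(v = 10, col = "red")                   # true number of classes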

Fig 6. The upper plots correspond to the ARI and Jaccard indices averaged for all datasets containing 10 classes and 10 features (DB10C10F). The lower plots correspond to the Silhouette and Dunn indices for the same dataset. The red line indicates the actual number of clusters in the dataset. https://doi.org/10.1371/journal.pone.0210236.g006

The external validation indices show that most of the clustering algorithms correctly identify the 10 main clusters in the dataset. Naturally, this knowledge would not be available in a real-life cluster analysis. For this reason, we also consider internal validation indices, which provide feedback on the partition quality without using the true labels. Two internal validation indices were considered: the Silhouette index (defined in the range [−1, 1]) and the Dunn index (defined in the range [0, ∞)). These indices were applied to the DB10C10F and DB10C2F datasets while varying the expected number of clusters K. The results are presented in Figs 6 and 7. In Fig 6 we can see that the results obtained for the different algorithms are mostly similar. The Silhouette index indicates high quality around K = 10. The Dunn index performs slightly worse, misestimating the correct number of clusters for the hierarchical algorithm. In Fig 7, the Silhouette and Dunn indices show similar behavior.
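The following sketch shows how such an internal-index sweep over K can be carried out: the mean silhouette width comes from the cluster package, and the Dunn index (minimum inter-cluster separation over maximum intra-cluster diameter) is computed directly, since its definition is short. Again, this is an illustrative setup rather than the evaluation code used in the study.

    # Sketch: internal validation (Silhouette, Dunn) while varying the assumed K.
    library(cluster)                                  # silhouette()

    set.seed(1)
    x <- rbind(matrix(rnorm(200, mean = 0),  ncol = 2),
               matrix(rnorm(200, mean = 5),  ncol = 2),
               matrix(rnorm(200, mean = 10), ncol = 2))
    d <- dist(x)

    dunn_index <- function(d, cl) {                   # min separation / max diameter
      dm   <- as.matrix(d)
      diam <- max(sapply(unique(cl), function(k) max(dm[cl == k, cl == k])))
      sep  <- min(sapply(unique(cl), function(k) min(dm[cl == k, cl != k])))
      sep / diam
    }

    for (k in 2:6) {
      cl  <- kmeans(x, centers = k, nstart = 10)$cluster
      sil <- mean(silhouette(cl, d)[, "sil_width"])
      cat(sprintf("K = %d  silhouette = %.3f  dunn = %.3f\n",
                  k, sil, dunn_index(d, cl)))
    }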

Fig 7. The upper plots correspond to the ARI and Jaccard indices averaged for all datasets containing 10 classes and 2 features (DB10C2F). The lower plots correspond to the Silhouette and Dunn indices for the same dataset. The red line indicates the actual number of clusters in the dataset. https://doi.org/10.1371/journal.pone.0210236.g007

The results obtained for the default parameters are summarized in Table 2. The table is divided into four parts, each corresponding to one performance metric. For each metric, the value in row i and column j of the table is the average performance of the method in row i minus the average performance of the method in column j. The last column of the table indicates the average performance of each algorithm. The averages were taken over all generated datasets.
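The pairwise structure of Table 2 can be reproduced with a single outer product of average scores; the sketch below uses made-up averages only to show the layout.

    # Sketch of Table 2's layout: entry (i, j) = average score of method i
    # minus average score of method j (the numbers below are illustrative).
    avg_ari <- c(kmeans = 0.55, clara = 0.55, hierarchical = 0.21,
                 spectral = 0.68, EM = 0.58)
    diff_table <- outer(avg_ari, avg_ari, `-`)
    round(cbind(diff_table, average = avg_ari), 2)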

Table 2. In general, the spectral algorithm provides the highest accuracy rate among all evaluated methods. https://doi.org/10.1371/journal.pone.0210236.t002

The results shown in Table 2 indicate that the spectral algorithm tends to outperform the other algorithms by at least 10%. On the other hand, the hierarchical method attained lower performance in most of the considered cases. Another interesting result is that k-means and clara provided equivalent performance when considering all datasets. In light of these results, the spectral method could be preferred when no optimization of parameter values is performed.

One-dimensional analysis


In addition to the aforementioned quantities, we also measured, for each dataset, the maximum accuracy obtained when varying each single parameter of the algorithm. We then calculated the average of these maximum accuracies, 〈max Acc〉, over all considered datasets. In Table 3, we show the values of 〈S〉, ΔS, max S and 〈max Acc〉 for datasets containing two features. When considering a two-class problem (DB2C2F), a significant improvement in performance (〈S〉 = 10.75% and 〈S〉 = 13.35%) was observed when varying parameters modelName, minPts and kpar of, respectively, the EM, optics and spectral methods. For all other cases, only a minor average gain in performance was observed. For the 10-class problem, we notice that an inadequate value of parameter method of the hierarchical algorithm can lead to a substantial loss of accuracy (16.15% on average). In most cases, however, the average variation in performance was small.
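The single-parameter quantities can be computed as in the sketch below, which varies nstart of k-means on one dataset and records the difference in ARI relative to the default configuration; aggregating these differences over datasets yields 〈S〉, ΔS, max S and 〈max Acc〉. The dataset and parameter grid are illustrative assumptions.

    # Sketch of the one-dimensional analysis for a single dataset and parameter.
    library(mclust)                                   # adjustedRandIndex()

    set.seed(1)
    labels <- rep(1:2, each = 50)
    x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 3), ncol = 2))

    default_ari <- adjustedRandIndex(labels, kmeans(x, 2)$cluster)   # nstart = 1
    nstart_grid <- c(2, 5, 10, 25, 50)
    ari <- sapply(nstart_grid, function(ns)
      adjustedRandIndex(labels, kmeans(x, 2, nstart = ns)$cluster))

    S <- ari - default_ari    # improvement over the default for each grid value
    c(mean_S = mean(S), sd_S = sd(S), max_S = max(S), max_Acc = max(ari))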

Table 3. This analysis is based on the performance (measured through the ARI index) obtained when varying a single parameter of the clustering algorithm while maintaining the others in their default configuration. 〈S〉, ΔS and max S are, respectively, the average, standard deviation and maximum of the difference between the performance obtained when varying a single parameter and the performance obtained with the default parameter values. We also measure 〈max Acc〉, the average of the best ARI values obtained when varying each parameter, where the average is calculated over all considered datasets. https://doi.org/10.1371/journal.pone.0210236.t003

In Table 4, we show the values of 〈S〉, ΔS, max S and 〈max Acc〉 for datasets described by 10 features. For the two-class clustering problem, a moderate improvement can be observed for the k-means, hierarchical and optics algorithms through the variation of, respectively, parameters nstart, method and minPts. A large increase in accuracy was observed when varying parameter modelName of the EM method: changing the modelName used by the algorithm led to an average improvement of 18.8%. A similar behavior was obtained when the number of classes was set to C = 10. For 10 classes, the variation of method in the hierarchical algorithm provided an average improvement of 6.72%. A large improvement was also observed when varying parameter modelName of the EM algorithm, with an average improvement of 13.63%.

Table 4. This analysis is based on the performance obtained when varying a single parameter while maintaining the others in their default configuration. 〈S〉, ΔS and max S are, respectively, the average, standard deviation and maximum of the difference between the performance obtained when varying a single parameter and the performance obtained with the default parameter values. We also measure 〈max Acc〉, the average of the best ARI values obtained when varying each parameter, where the average is calculated over all considered datasets. https://doi.org/10.1371/journal.pone.0210236.t004

In contrast to the parameters discussed so far, the variation of some parameters plays only a minor role in the discriminative power of the clustering algorithms. This is the case, for instance, for parameters kernel and iter of the spectral clustering algorithm and parameter iter.max of k-means. In some cases, the one-dimensional variation of a parameter even reduced performance. For instance, varying min.individuals and models of the subspace algorithm led to an average loss of accuracy on the order of 〈S〉 = 20%, depending on the dataset. A similar behavior is observed for the dbscan method, for which the variation of minPts causes an average loss of accuracy of 20.32%. Parameters metric and rngR of the clara algorithm also led to a marked decrease in performance.

In Table 5, we show the values of 〈S〉, ΔS, max S and 〈max Acc〉 for datasets described by 200 features. For the two-class clustering problem, a significant improvement in performance was observed when varying nstart in the k-means method, method in the hierarchical algorithm, modelName in the hcmodel method and modelName in the EM method. On the other hand, varying metric, min.individuals and use in, respectively, the clara, subspace and hcmodel methods led to an average loss of accuracy larger than 10%. The largest loss of accuracy occurs for parameter minPts of the dbscan method (49.47%). For the 10-class problem, similar results were observed, with the exception of the clara method, for which any parameter change resulted in a large loss of accuracy.

Table 5. This analysis is based on the performance obtained when varying a single parameter while maintaining the others in their default configuration. 〈S〉, ΔS and max S are, respectively, the average, standard deviation and maximum of the difference between the performance obtained when varying a single parameter and the performance obtained with the default parameter values. We also measure 〈max Acc〉, the average of the best ARI values obtained when varying each parameter. https://doi.org/10.1371/journal.pone.0210236.t005

Multi-dimensional analysis

A complete analysis of the performance of a clustering algorithm requires the simultaneous variation of all of its parameters. Nevertheless, such a task is difficult in practice, given the large number of parameter combinations that would need to be taken into account. Therefore, here we consider a random variation of parameters aimed at sampling each algorithm's performance over its complete multi-dimensional parameter space.


The performance of the algorithms for the different sets of parameters was evaluated according to the following procedure. Consider the histogram of ARI values obtained for the random sampling of parameters for the k-means algorithm, shown in Fig 8. The red dashed line indicates the ARI value obtained for the default parameters of the algorithm. The light blue shaded region indicates the parameter configurations for which the performance of the algorithm improved. From this result we calculated four main measures. The first, which we call the p-value, is given by the area of the blue region divided by the total histogram area, multiplied by 100 so as to yield a percentage. The p-value represents the percentage of parameter configurations for which the algorithm performance improved compared to the default configuration. The second, third and fourth measures are the mean, 〈R〉, standard deviation, ΔR, and maximum value, max R, of the relative performance over all cases where the performance improved (i.e., the blue shaded region in Fig 8). The relative performance is the difference in performance between a given realization of parameter values and the default parameters. The mean indicates the expected improvement of the algorithm under a random variation of parameters. The standard deviation represents the stability of such improvement, that is, how certain one can be that the performance will improve under such a random variation. The maximum value indicates the largest improvement obtained when random parameters are considered. We also measured 〈max ARI〉, the average of the maximum accuracies obtained for each dataset when randomly selecting the parameters. In the S2 File of the supplementary material we show the distribution of ARI values obtained for the random sampling of parameters for all clustering algorithms considered in our analysis.
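A compact version of this procedure is sketched below for k-means: random draws over a small parameter grid, the fraction of draws that beat the default (the p-value above), and the statistics of the improved draws. The parameter ranges and data are illustrative assumptions, not the study's settings.

    # Sketch of the multi-dimensional random-parameter analysis for k-means.
    library(mclust)                                     # adjustedRandIndex()

    set.seed(1)
    labels <- rep(1:10, each = 50)
    x <- t(sapply(labels, function(k) rnorm(10, mean = 3 * k)))
    default_ari <- adjustedRandIndex(labels, kmeans(x, 10)$cluster)

    n_draws <- 500
    ari <- replicate(n_draws, {
      ns  <- sample(1:50, 1)                            # nstart
      im  <- sample(10:100, 1)                          # iter.max
      alg <- sample(c("Hartigan-Wong", "Lloyd", "MacQueen"), 1)
      adjustedRandIndex(labels,
        kmeans(x, 10, nstart = ns, iter.max = im, algorithm = alg)$cluster)
    })

    rel      <- ari - default_ari
    improved <- rel[rel > 0]
    c(p_value = 100 * mean(rel > 0),                    # % of draws beating default
      mean_R  = mean(improved), sd_R = sd(improved),
      max_R   = max(improved),  max_ARI = max(ari))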

Fig 8. The algorithm was applied to dataset DB10C10F, and 500 sets of parameters were drawn. https://doi.org/10.1371/journal.pone.0210236.g008

In Table 6 we show the performance (ARI) of the algorithms for dataset DB2C2F when applying the aforementioned random selection of parameters. The optics and EM methods are the only algorithms with a p-value larger than 50%. Also, a high average gain in performance was observed for the EM (22.1%) and hierarchical (30.6%) methods. Moderate improvement was observed for the hcmodel, kmeans, spectral, optics and dbscan algorithms.

Table 6. The p-value represents the probability that the algorithm, set with a random configuration of parameters, outperforms the same algorithm set with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset. https://doi.org/10.1371/journal.pone.0210236.t006

The performance of the algorithms for dataset DB10C2F is presented in Table 7. A high p-value was obtained for the optics (96.6%), EM (76.5%) and k-means (77.7%) methods. Nevertheless, the average improvement in performance was relatively low for most algorithms, with the exception of the optics method, which led to an average improvement of 15.9%.

Table 7. https://doi.org/10.1371/journal.pone.0210236.t007

A more marked variation in performance was observed for dataset DB2C10F, with results shown in Table 8. The EM, k-means, hierarchical and optics clustering algorithms resulted in a p-value larger than 50%. In such cases, when the performance improved, the average gain was, respectively, 30.1%, 18.0%, 25.9% and 15.5%. This means that a random variation of parameters can be a valid approach for improving these algorithms. In fact, with the exception of clara and dbscan, all methods display a significant average improvement in performance for this dataset. The results also show that a maximum accuracy of 100% can be achieved for the EM and subspace algorithms.

Table 8. https://doi.org/10.1371/journal.pone.0210236.t008

In Table 9 we show the performance of the algorithms for dataset DB10C10F. The p-values for the EM, clara, k-means and optics methods indicate that the random selection of parameters usually improves the performance of these algorithms. The hierarchical algorithm can be significantly improved by the considered random selection of parameters. This is a consequence of the default value of parameter method, which, as discussed in the previous section, is not appropriate for this dataset.

Table 9. https://doi.org/10.1371/journal.pone.0210236.t009

The performance of the algorithms for dataset DB2C200F is presented in Table 10. A high p-value was obtained for the EM (65.1%) and k-means (65.6%) algorithms. The average gain in performance in these cases was 39.1% and 35.4%, respectively. On the other hand, only in approximately 16% of the cases did the spectral and subspace methods result in an improved ARI. Interestingly, the random variation of parameters led to large average performance improvements for all algorithms.

Table 10. https://doi.org/10.1371/journal.pone.0210236.t010

In Table 11 we show the performance of the algorithms for dataset DB10C200F. A high p-value was obtained for all methods. On the other hand, the average improvement in accuracy tended to be lower than in the case of the dataset DB2C200F.

Table 11. https://doi.org/10.1371/journal.pone.0210236.t011

Conclusions

Clustering data is a complex task involving the choice between many different methods, parameters and performance metrics, with implications for many real-world problems [63, 103–108]. Consequently, analyzing the advantages and pitfalls of clustering algorithms is also a difficult task that has received much attention. Here, we approached this task focusing on a comprehensive methodology for generating a large diversity of heterogeneous datasets with precisely defined properties, such as the distances between classes and the correlations between features. Using packages in the R language, we compared the performance of nine popular clustering methods applied to 400 artificial datasets. Three situations were considered: default parameters, single-parameter variation and random variation of parameters. It should nevertheless be borne in mind that all results reported in this work refer to specific configurations of normally distributed data and algorithmic implementations, so different performance can be obtained in other situations. Besides serving as practical guidance for the application of clustering methods when the researcher is not an expert in data mining techniques, a number of interesting results regarding the considered clustering methods were obtained.

Regarding the default parameters, the differences in performance between the clustering methods were smallest for low-dimensional datasets, but were statistically significant for all numbers of features. Specifically, the Kruskal-Wallis test on the differences in performance when 2 features were considered resulted in a p-value of p = 6.48 × 10−7 (chi-squared statistic χ² = 41.50). For 10 features, a p-value of p = 1.53 × 10−8 (χ² = 52.20) was obtained. Considering 50 features resulted in a p-value of p = 1.56 × 10−6 (χ² = 41.67). For 200 features, the obtained p-value was p = 2.49 × 10−6 (χ² = 40.58).

The spectral method provided the best performance when using default parameters, with an Adjusted Rand Index (ARI) of 68.16%, as indicated in Table 2. In contrast, the hierarchical method yielded an ARI of 21.34%. It is also interesting that underestimating the number of classes in the dataset led to worse performance than overestimating it. This was observed for all algorithms and is in accordance with previous results [44].

Regarding single-parameter variations, for datasets containing 2 features, the hierarchical, optics and EM methods showed significant performance variation. On the other hand, for datasets containing 10 or more features, most methods could be readily improved through changes to selected parameters.

With respect to the multi-dimensional analysis, for datasets containing ten classes and two features the performance of the algorithms under the multi-dimensional selection of parameters was similar to that with the default parameters. This suggests that the algorithms are not sensitive to parameter variations for this dataset. For datasets containing two classes and ten features, the EM, hcmodel, subspace and hierarchical algorithms showed significant gains in performance. The EM algorithm also resulted in a high p-value (70.8%), which indicates that many parameter values for this algorithm provide better results than the default configuration. For datasets containing ten classes and ten features, the improvement was considerably lower for almost all the algorithms, with the exception of hierarchical clustering. When a large number of features was considered, such as in the case of the datasets containing 200 features, large gains in performance were observed for all methods.

In Tables 12, 13 and 14 we show a summary of the best accuracies obtained during our analysis. The tables contain the best performance, measured as the ARI of the resulting partitions, achieved by each algorithm in the three considered situations (default, one-dimensional and multi-dimensional adjustment of parameters). The results refer to datasets DB2C2F, DB10C2F, DB2C10F, DB10C10F, DB2C200F and DB10C200F. We observe that, for datasets containing 2 features, the algorithms tend to show similar performance, especially when the number of classes is increased. For datasets containing 10 features or more, the spectral algorithm seems to consistently provide the best performance, although the EM, hierarchical, k-means and subspace algorithms can achieve similar performance with some parameter tuning. It should be noted that several clustering algorithms, such as optics and dbscan, are aimed at other data distributions, such as elongated or S-shaped clusters [72, 74]. Therefore, different results could be obtained for non-normally distributed data.

Table 12. https://doi.org/10.1371/journal.pone.0210236.t012

Table 13. https://doi.org/10.1371/journal.pone.0210236.t013

Table 14. https://doi.org/10.1371/journal.pone.0210236.t014

Other algorithms could be compared in future extensions of this work. An important aspect that could also be explored is to consider other statistical distributions for modeling the data. In addition, an analogous approach could be applied to semi-supervised classification.

Supporting information

S1 File. Description of the clustering algorithms' parameters.

We provide a brief description of the parameters of the clustering algorithms considered in the main text.

https://doi.org/10.1371/journal.pone.0210236.s001

S2 File. Clustering performance obtained for the random selection of parameters.

The file contains figures showing the histograms of ARI values obtained for identifying the clusters of, respectively, datasets DB10C10F and DB2C10F using a random selection of parameters. Each plot corresponds to a clustering method considered in the main text.

https://doi.org/10.1371/journal.pone.0210236.s002

References
  • 7. Aggarwal CC, Zhai C. In: Aggarwal CC, Zhai C, editors. A Survey of Text Clustering Algorithms. Boston, MA: Springer US; 2012. p. 77–128.
  • 13. Joachims T. Text categorization with support vector machines: Learning with many relevant features. In: European conference on machine learning. Springer; 1998. p. 137–142.
  • 14. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco: Morgan Kaufmann; 2005.
  • 37. Berkhin P. In: Kogan J, Nicholas C, Teboulle M, editors. A Survey of Clustering Data Mining Techniques. Berlin, Heidelberg: Springer Berlin Heidelberg; 2006. p. 25–71.
  • 42. R Development Core Team. R: A Language and Environment for Statistical Computing; 2006. Available from: http://www.R-project.org .
  • 44. Erman J, Arlitt M, Mahanti A. Traffic classification using clustering algorithms. In: Proceedings of the 2006 SIGCOMM workshop on mining network data. ACM; 2006. p. 281–286.
  • 47. Parsons L, Haque E, Liu H. Evaluating subspace clustering algorithms. In: Workshop on Clustering High Dimensional Data and its Applications, SIAM Int. Conf. on Data Mining. Citeseer; 2004. p. 48–56.
  • 48. Burdick D, Calimlim M, Gehrke J. MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases. In: Proceedings of the 17th International Conference on Data Engineering. Washington, DC, USA: IEEE Computer Society; 2001. p. 443–452.
  • 51. UCI. Breast-cancer-wisconsin. Available from: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/.
  • 52. Ultsch A. Clustering with SOM: U*C. In: Proceedings of the 5th Workshop on Self-Organizing Maps. vol. 2; 2005. p. 75–82.
  • 54. Aggarwal CC, Reddy CK. Data Clustering: Algorithms and Applications. vol. 2. 1st ed. Chapman & Hall/CRC; 2013.
  • 58. Jain AK, Topchy A, Law MH, Buhmann JM. Landscape of clustering algorithms. In: Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on. vol. 1. IEEE; 2004. p. 260–263.
  • 64. Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. Series in Probability & Mathematical Statistics; 2009.
  • 65. Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics; 2007. p. 1027–1035.
  • 66. Sequeira K, Zaki M. ADMIT: anomaly-based data mining for intrusions. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2002. p. 386–395.
  • 67. Williams GJ, Huang Z. Mining the knowledge mine. In: Australian Joint Conference on Artificial Intelligence. Springer; 1997. p. 340–348.
  • 68. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. Berkeley, Calif: University of California Press; 1967. p. 281–297.
  • 70. Kaufman L, Rousseeuw PJ. Finding Groups in Data: an introduction to cluster analysis. John Wiley & Sons; 1990.
  • 71. Han J, Kamber M. Data Mining. Concepts and Techniques. vol. 2. 2nd ed. -: Morgan Kaufmann; 2006.
  • 73. Ankerst M, Breunig MM, Kriegel HP, Sander J. OPTICS: Ordering Points To Identify the Clustering Structure. ACM Press; 1999. p. 49–60.
  • 74. Ester M, Kriegel HP, Sander J, Xu X. A Density-based Algorithm for Discovering Clusters a Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. KDD’96. AAAI Press; 1996. p. 226–231.
  • 86. Ng AY, Jordan MI, Weiss Y. On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems 14. MIT Press; 2001. p. 849–856.
  • 95. Horn RA, Johnson CR. Matrix Analysis. 2nd ed. New York, NY, USA: Cambridge University Press; 2012.
  • 96. Liu Y, Li Z, Xiong H, Gao X, Wu J. Understanding of internal clustering validation measures. In: Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE; 2010. p. 911–916.
  • 98. Cover TM, Thomas JA. Elements of Information Theory. vol. 2. Wiley; 2012.
  • 102. McKight PE, Najab J. Kruskal-Wallis Test. Corsini Encyclopedia of Psychology. 2010;.




Dissertations / Theses on the topic 'Hierarchical Clustering'


Consult the top 50 dissertations / theses for your research on the topic 'Hierarchical Clustering.'


Achtert, Elke. "Hierarchical Subspace Clustering." Diss., lmu, 2007. http://nbn-resolving.de/urn:nbn:de:bvb:19-68071.

Loganathan, Satish Kumar. "Distributed Hierarchical Clustering." University of Cincinnati / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1544001912266574.

Heger, Andreas. "Hierarchical clustering of sequence space." Thesis, University of Cambridge, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.619530.

Rodrigues, Domingos da Costa. "Galaxy formation in hierarchical clustering cosmologies." Thesis, University of Sussex, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.364133.

Pucek, Ludvig, and Viktor Sonebäck. "Hierarchical clustering of market risk models." Thesis, KTH, Matematisk statistik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-208307.

Thomas, Zachary Micah. "Bayesian Hierarchical Space-Time Clustering Methods." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1435324379.

Tittley, Eric Robert. "Hierarchical clustering and galaxy cluster scaling laws." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape9/PQDD_0008/NQ40291.pdf.

Giocoli, Carlo. "Hierarchical Clustering: Structure Formation in the Universe." Doctoral thesis, Università degli studi di Padova, 2008. http://hdl.handle.net/11577/3425516.

Benson, Andrew John. "Galaxy formation and clustering in a hierarchical universe." Thesis, Durham University, 2000. http://etheses.dur.ac.uk/4288/.

Butchart, Kate. "Hierarchical clustering using dynamic self organising neural networks." Thesis, University of Hertfordshire, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.338383.

Blundell, C. "Bayesian methods for hierarchical clustering and community discovery." Thesis, University College London (University of London), 2015. http://discovery.ucl.ac.uk/1466632/.

Šulc, Zdeněk. "Similarity Measures for Nominal Data in Hierarchical Clustering." Doctoral thesis, Vysoká škola ekonomická v Praze, 2013. http://www.nusl.cz/ntk/nusl-261939.

Pensuwon, Wanida. "Stochastic dynamic hierarchical neural networks." Thesis, University of Hertfordshire, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.366030.

Pereira, Sandra M. C. "Analysis of spatial point patterns using hierarchical clustering algorithms." University of Western Australia. School of Mathematics and Statistics, 2003. http://theses.library.uwa.edu.au/adt-WU2004.0056.

Kim, Han Sik. "The clustering of galaxies in hierarchical galaxy formation models." Thesis, Durham University, 2010. http://etheses.dur.ac.uk/588/.

Baker, Serena F. "Assessment of aCGH Clustering Methodologies." BYU ScholarsArchive, 2010. https://scholarsarchive.byu.edu/etd/2644.

Malm, Patrik. "Development of a hierarchical k-selecting clustering algorithm – application to allergy." Thesis, Linköping University, The Department of Physics, Chemistry and Biology, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-10273.

The objective of this Master’s thesis was to develop, implement and evaluate an iterative procedure for hierarchical clustering with good overall performance which also merges features of certain already described algorithms into a single integrated package. An accordingly built tool was then applied to an allergen IgE-reactivity data set. The finally implemented algorithm uses a hierarchical approach which illustrates the emergence of patterns in the data. At each level of the hierarchical tree a partitional clustering method is used to divide the data into k groups, where the number k is decided through application of cluster validation techniques. The cross-reactivity analysis, by means of the new algorithm, largely arrives at anticipated cluster formations in the allergen data, which strengthens results obtained through previous studies on the subject. Notably, though, certain unexpected findings presented in the former analysis were aggregated differently, and more in line with phylogenetic and protein family relationships, by the novel clustering package.

Kumar, Swapnil. "Comparison of blocking and hierarchical ways to find cluster." Kansas State University, 2017. http://hdl.handle.net/2097/35425.

Varas, González David. "Region-based particle filter leveraged with a hierarchical co-clustering." Doctoral thesis, Universitat Politècnica de Catalunya, 2016. http://hdl.handle.net/10803/404443.

Manrique, Oliva Alberto. "Modeling the gravitational clustering in hierarchical scenarios of structure formation." Doctoral thesis, Universitat de Barcelona, 1995. http://hdl.handle.net/10803/667632.

Gong, Yi. "A Flexible Framework for Hierarchical Clustering on Wireless Sensor Networks." Thesis, KTH, School of Information and Communication Technology (ICT), 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-24320.

In Wireless Sensor Networks, a clustering algorithm is a good option to reduce redundant data transmission and organize nodes efficiently for long life span. In this thesis, we mainly introduce, implement and evaluate a novel flexible hierarchical clustering algorithm on middleware layer based on Publish/Subscribe paradigm.

The goal of our algorithm is to provide the five features, scalability, energy efficiency, fault tolerance, load balancing and multi-level clustering. Our algorithm includes three protocols. Flat Clustering protocol constructs a 2-level clustering network structure. Function Delegation protocol provides the fault tolerance and load balancing features. Multi-Level Clustering protocol builds up a more than 2-level clustering network structure.

The thesis also studies several popular simulators and explicates the design and implementation of our algorithm on the ns-2 simulator. Then, we evaluate the algorithm from different angles. Based on our simulation, we optimize key parameters for our algorithm and find that the performance and the energy efficiency of our algorithm are promising, especially in the bridge topologies.

Zhang, Hongyang (Fred). "Hierarchical clustering of observations and features in high-dimensional data." Thesis, University of British Columbia, 2017. http://hdl.handle.net/2429/62648.

Neto, Fernando Soares de Aguiar. "Pre-processing approaches for collaborative filtering based on hierarchical clustering." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-25032019-102215/.

Wiberg, Benjamin. "Automatic Clustering of 3D Objects for Hierarchical Level-of-Detail." Thesis, Linköpings universitet, Medie- och Informationsteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-150534.

Dimaki, Georgia. "Dynamic node clustering in hierarchical optical data center network architectures." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/128973.

Almodóvar, Velázquez Leyda Michelle. "Studying brain networks via topological data analysis and hierarchical clustering." Diss., University of Iowa, 2016. https://ir.uiowa.edu/etd/2177.

Wang, Xinyu. "Toward Scalable Hierarchical Clustering and Co-clustering Methods : application to the Cluster Hypothesis in Information Retrieval." Thesis, Lyon, 2017. http://www.theses.fr/2017LYSE2123/document.

Eldridge, Justin Eldridge. "Clustering Consistently." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1512070374903249.

Bhusal, Prem. "Scalable Clustering for Immune Repertoire Sequence Analysis." Wright State University / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=wright1558631347622374.

Gupta, Pramod. "Robust clustering algorithms." Thesis, Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/39553.

Cunningham, James. "Efficient, Parameter-Free Online Clustering." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1606762403895603.

Annakula, Chandravyas. "Hierarchical and partitioning based hybridized blocking model." Kansas State University, 2017. http://hdl.handle.net/2097/35468.

Avril, François. "Mécanismes pour la cohérence, l'atomicité et les communications au niveau des clusters : application au clustering hiérarchique distribué adaptatif." Thesis, Université Paris-Saclay (ComUE), 2015. http://www.theses.fr/2015SACLV034/document.

Selvan, A. D. Arul Nirai. "Highlighting dissimilarity in medical images using hierarchical clustering based segmentation (HCS)." Thesis, Sheffield Hallam University, 2007. http://shura.shu.ac.uk/20342/.

Cybis, Gabriela Bettella. "Phenotypic Bayesian phylodynamics : hierarchical graph models, antigenic clustering and latent liabilities." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2014. http://hdl.handle.net/10183/132858.

Roshani, Asra. "Unsupervised segmentation of sequences using harmony search and hierarchical clustering techniques." Master's thesis, Université Laval, 2014. http://hdl.handle.net/20.500.11794/25350.

Su, Chun-Chi, and 蘇俊吉. "Efficient Hierarchical Clustering Algorithm." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/50971269857645140390.

Fučík, Vojtěch. "Portfolio Construction Using Hierarchical Clustering." Master's thesis, 2017. http://www.nusl.cz/ntk/nusl-367640.

Hsu, Chun-Chieh, and 徐俊傑. "Improving DPC with Hierarchical Clustering." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/58n594.

Achtert, Elke [Verfasser]. "Hierarchical subspace clustering / von Elke Achtert." 2007. http://d-nb.info/984634878/34.

Yang, Hu. "Biomedical data classification using hierarchical clustering." 2005. http://hdl.handle.net/1993/20253.

Chen, Hsin-fu, and 陳信夫. "Dynamic Hierarchical Clustering Based on Taxonomy." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/49621476514265812600.

Chen, Zhi-Wei, and 陳致偉. "Hierarchical Clustering based on Influence Function." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/73220863089169153962.

Hsiao, Yu-Hsuan, and 蕭佑亘. "Hierarchical clustering for time-course data." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/97774664111225816449.

Dinis, Joel Alexandre Ezequiel. "Hierarchical Classification using hierarchical clustering: an application to Human Activity Recognition." Master's thesis, 2018. https://repositorio-aberto.up.pt/handle/10216/114135.

Dinis, Joel Alexandre Ezequiel. "Hierarchical Classification using hierarchical clustering: an application to Human Activity Recognition." Dissertação, 2018. https://repositorio-aberto.up.pt/handle/10216/114135.

Vijaya, P. A. "Efficient Hierarchical Clustering Techniques For Pattern Classification." Thesis, 2005. https://etd.iisc.ac.in/handle/2005/1442.

Vijaya, P. A. "Efficient Hierarchical Clustering Techniques For Pattern Classification." Thesis, 2005. http://etd.iisc.ernet.in/handle/2005/1442.

Lin, Tsung-Hau, and 林琮晧. "An extension of Fuzzy Agglomerative Hierarchical Clustering." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/41036400665248006823.

Mehrabadi, Houman. "Simultaneous Hierarchical Clustering for Cell Formation Problems." Thesis, 2011. http://spectrum.library.concordia.ca/974000/1/Mehrabadi_MSc_S2012.pdf.


Spatial and Temporal Analysis: A GIS-Based Application Tool for Crime Monitoring and Clustering in Malawi

  • Chitani Jarves Bob Tobias Department of Humanities (Geography) University of Livingstonia, Malawi
  • Brave Mwanza Department of Basic Sciences, University of Livingstonia, Malawi

For the purposes of monitoring, evaluating, and conducting a geographical analysis of crime-related data, the study used geospatial technology to collect crime data based on spatial location, together with the Malawi Police Data Digest of 2019 and 2020. More generally, knowing the geographic patterns of crime in Malawi through GIS technology can help determine how to make and implement important decisions to reduce crime. The Malawi Police Service has established a number of database management systems to help with crime monitoring; however, it has not yet fully integrated Geographic Information Systems across all jurisdictions. Maps showing crime locations and crime hotspot zones are therefore not included in the crime data and statistics reports provided by the Malawi Police Service, and many people have become victims of various forms of crime in areas where those crimes are prevalent. To collect, track, and analyze crime data in Malawi for this study, Geographic Information System (GIS) techniques, particularly network analysis, were used. Network analysis was used to identify crime hotspots by treating crime data as a network of interconnected events and locations: each crime event is a node in the network, and the spatial relationships between crimes are the edges. By analyzing this network, patterns and relationships between crime events were revealed, allowing crime hotspots to be identified. The study found that Lilongwe, the capital city in the central region, registered the highest number of crimes, followed by Blantyre in the southern region and then Mangochi; Mzimba recorded the most crimes in the northern region. In Malawi, the traditional systems of intelligence and criminal record keeping have failed to satisfy the demands of today's crime situation. Manual methods neither give accurate, dependable, and complete data around the clock nor help in trend forecasting and decision support, and they lead to lower productivity and inefficient use of the workforce. The appropriate application of information technology is the solution to this ever-increasing challenge.

COMMENTS

  1. PDF Clustering via Deep Dictionary Learning

    Clustering is a fundamentally important unsupervised learning problem, in which the objective is to partition a data set into a collection of cohesive groups called clusters. Traditionally, a wide range of techniques have been used for clustering, the prototypical example being the popular k-means clustering algorithm [Llo82].

  2. PDF A Theoretical Study of Clusterability and Clustering Quality

    Clustering is a widely used technique, with applications ranging from data mining, bioinformatics and image analysis to marketing, psychology, and city planning. Despite the practical ... as well as a number of measures for center-based clustering. In this thesis we take the first step towards establishing a theory of clustering. Through-

  3. PDF Design and Analysis of Clustering Algorithms for Numerical, Categorical

    categorical and mixed data sets. Most clustering algorithms are limited to either numerical or categorical attributes. Datasets with mixed types of attributes are common in real life and so to design and analyse clustering algorithms for mixed data sets is quite timely. Determining the optimal solution to the clustering problem is NP-hard.

  4. Foundations of Clustering: New Models and Algorithms

    In this dissertation, we study clustering, one of the most common unsupervised learning problems. This dissertation covers recent developments in both clustering theory and machine learning practice. In particular, it explores how to bridge the gap between theory and practice by making them benefit from each other. Many clustering algorithms are developed in an application-oriented way lacking ...

  5. PDF Sindhuja Ranganathan Improvements to k-means clustering Master's Thesis

    Clustering aims at grouping data points that are close or similar to each other and to identify such clusters in an unsupervised manner. Figure 2 illustrates the four clusters, and their centers, into which the input data is divided (Figure 2: Clustering of data). Two well-known methods of clustering are hierarchical clustering and the parti-

  6. PDF Hierarchical Clustering With Global Objectives: Approximation

    Hierarchical Clustering is an important tool for data analysis, used in diverse areas ranging from Phylogenetics in Biology to YouTube video recommendations and everything in between. The term ... As I am finishing my thesis from an apartment in New York, I can't help but think how lucky I have been so far, especially throughout those 5 ...

  7. (PDF) Data Clustering

    Organizing data into groups is one of the most fundamental ways of understanding and learning. Cluster analysis is the study of methods and algorithms for grouping (clustering) objects according ...

  8. PDF Master's Thesis Applying Clustering Techniques for Refining Large Data

    Master's Thesis Applying Clustering Techniques for Refining Large Data Set (Case Study on Malware) 1710443 Yoon Myet Thwe Supervisor Mizuhito Ogawa ... k-means turns out to be the most suitable algorithm to cluster the malware data sets in terms of runtime and accuracy. Although it can separate the data set more accurately than other ...

  9. Functional Data Clustering Analysis via the Learning of Gaussian

    Functional data clustering analysis becomes an urgent and challenging task in the new era of big data. In this paper, we propose a new framework for functional data clustering analysis, which adopts a similar structure as the k-means algorithm for the conventional clustering analysis. Under this framework, we clarify three issues: how to represent functions, how to measure distances between ...

  10. PDF Efficient Clustering of Big Data Streams

    voluminous, various data. This thesis focuses on the data mining task of clustering, i.e. grouping objects into clusters such that similar objects are assigned to the same cluster while dissimilar ones are assigned to different clusters. While traditional clustering algorithms merely considered static data, today's applica-

  11. Clustering algorithms: A comparative approach

    In most clustering algorithms, the size of the data has an effect on the clustering quality. In order to quantify this effect, we considered a scenario where the data has a high number of instances. Datasets with F = 5, C = 10 and Ne = {5, 50, 500, 5000} instances per class were created. This dataset will be referenced as DB10C5F.

  12. PDF Comparing applicability of prevalent Clustering Algorithms for Document

    Its goal is to minimize the distance between data points and their barycentre in each cluster. It is hence a so-called distance-based clustering algorithm. The algorithm has the following four steps: Given a set of n observations with d dimensions, X_1, ..., X_n, with each X_i ∈ R^d, arranged in a d × n matrix. (A minimal k-means sketch following these steps appears after this list.)

  13. PDF Using Cluster Analysis, Cluster Validation, and Consensus Clustering to

    We apply cluster analysis to data collected from 358 children with PDDs, and validate the resulting clusters. Notably, ... great guidance, support, patience and understanding throughout the course of this thesis. I would like to thank her for all her help and advice over the years I have pursued my education and research at Queen's University.

  14. PDF Chapter 4 Clustering Algorithms and Evaluations

    clustering objects within this thesis are verbs, and the clustering task is a semantic classification of the verbs. Further cluster parameters are to be explored within the cluster analysis of the verbs. ... 4.1.2 Data Objects, Clustering Purpose and Object Features This work is concerned with inducing a classification of German verbs, i.e ...

  15. (PDF) Educational Data Mining Clustering Approach: Case Study of

    This study analyzed five different clustering algorithms, including K-Means, DBScan, BIRCH, Gaussian Mixture, and Mean Shift, to identify the best approach for analyzing undergraduate thesis data.

  16. A Generalized Study on Data Mining and Clustering Algorithm

    January 2014. Data mining plays a major role in clustering data across most fields. It offers many algorithms, each with its own set of heuristic calculations, which are used to create models from data ...

  17. Educational Data Mining Clustering Approach: Case Study of

    The results demonstrated that k-means clustering is the most efficient method, generating five distinct clusters with unique characteristics. Furthermore, this study investigated the correlation between educational data, specifically GPA, and the average grades of courses that support a thesis title and the duration of thesis completion.

  18. (PDF) Data Clustering Using K-Mean Algorithm for Network Intrusion

    Next, the reformatted data will be used to develop a classification model. So in this thesis, data mining techniques (clustering) are used for intrusion detection to detect unwanted attempts at accessing, manipulating, and/or disabling computer systems, mainly through a network. The goal of IDS is to detect malicious traffic.

  19. Dissertations / Theses: 'K-means clustering algorithm'

    In this thesis we will use the GPU to implement one of the most common algorithms of data clustering, known as fuzzy k-means clustering (FKM). In FKM Clustering, when updating memberships, we provide two different strategies to ensure that all test samples can be executed optimally, and when calculating new centroids, we use two different ways ...

  20. Data Mining

    Student thesis: Master. File. Activity Recognition Using Deep Learning in Videos under Clinical Setting Author: Srinivasan, ... Algorithms for center-based trajectory clustering Author: van de L'Isle, N. A. F., ... Anomaly detection in image data sets using disentangled representations Author: Rombouts, J. C., ...

  21. Dissertations / Theses: 'Fuzzy clustering'

    In cluster analysis, the fuzzy c-means (FCM) clustering algorithm is the most used method. The main purpose of cluster analysis is to cluster a given data set. In this thesis, we propose a clustering algorithm by extending the Yang and Wu [6] clustering algorithm, called AFCM, such that it can handle interval data with the best representative. (A minimal fuzzy c-means sketch appears after this list.)

  22. Dissertations / Theses: 'Hierarchical Clustering'

    The main objective of this study is to improve the quality of hierarchical data clustering. This thesis also proposes a novel approach to remove outliers. This thesis reports several experiments conducted to compare the clustering quality of the proposed algorithm and the existing agglomerative hierarchical clustering algorithms, including ...

  23. Spatial and Temporal Analysis: A GIS-Based Application Tool for Crime

    For the purposes of monitoring, evaluating, and conducting a geographical analysis of crime-related data, the study used geospatial technology to collect crime data based on spatial location and the Malawi Police Data Digest of 2019 and 2020. In a more generic sense, knowing the geographic patterns of crime in Malawi using GIS technology can help determine how to make and implement important ...

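As a companion to the k-means description excerpted in item 12 above, here is a minimal Lloyd-style k-means sketch in Python. It assumes NumPy and a row-per-observation layout (an n × d array rather than the d × n matrix in that excerpt); the toy data, seed, and function name are illustrative only.

# Minimal Lloyd-style k-means: initialize centers, assign each point to the
# nearest center, recompute barycentres, and repeat until assignments settle.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # X: (n, d) array of n observations with d features.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 1: initialize
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                              # step 2: assign
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])                                                         # step 3: barycentres
        if np.allclose(new_centers, centers):                      # step 4: stop or repeat
            break
        centers = new_centers
    return labels, centers

# Toy usage: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
labels, centers = kmeans(X, k=2)
print(centers.round(2))

This mirrors the four-step structure of the excerpt; practical implementations typically add multiple restarts and a smarter initialization such as k-means++.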
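The fuzzy variants mentioned in items 19 and 21 alternate between two updates: recompute soft memberships for every point, then recompute centroids as membership-weighted means. The sketch below is a minimal NumPy fuzzy c-means with fuzzifier m = 2; the function name and toy usage are assumptions for illustration, not code from those theses.

# Minimal fuzzy c-means: alternate centroid updates (membership-weighted means)
# with membership updates (inverse-distance weights raised to 2/(m-1)).
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                              # random initial memberships
    centers = None
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]             # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))                              # closer center -> larger weight
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, centers

# Toy usage: soft memberships collapse to two hard clusters for separated blobs.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
U, centers = fuzzy_c_means(X, c=2)
print(U.argmax(axis=1))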