How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity

Wilmer Leal; Eugenio J. Llanos; Guillermo Restrepo; Carlos F. Suárez; Manuel Elkin Patarroyo

doi:10.1186/s13321-016-0114-x

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)

Abstract

Background: Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when such parameters are fixed, ties in proximity (i.e. two equidistant clusters from a third one) may produce several different dendrograms, having different possible clustering patterns (different classifications). This situation is usually disregarded and conclusions are based on a single result, leading to questions concerning the permanence of clusters in all the resulting dendrograms; this happens, for example, when using HCA for grouping molecular descriptors to select the least similar ones in QSAR studies. Results: Representing dendrograms in graph theoretical terms allowed us to introduce four measures of cluster frequency in a canonical way, and use them to calculate cluster frequencies over the set of all possible dendrograms, taking all ties in proximity into account. A toy example of well-separated clusters was used, as well as a set of 1666 molecular descriptors calculated for a group of molecules having hepatotoxic activity, to show how our functions may be used for studying the effect of ties in HCA. Such functions were not restricted to the tie case; the possibility of using them to derive cluster stability measurements on arbitrary sets of dendrograms having the same leaves is discussed, e.g. dendrograms from variations of HCA parameters. It was found that ties occurred frequently, some yielding tens of thousands of dendrograms, even for small data sets. Conclusions: Our approach was able to detect trends in clustering patterns by offering a simple way of measuring their frequency, which is often very low. This would imply that inferences and models based on descriptor classifications (e.g. QSAR) are likely to be biased, thereby requiring an assessment of their reliability. Moreover, any classification of molecular descriptors is likely to be far from unique. Our results highlight the need for evaluating the effect of ties on clustering patterns before classification results can be used accurately.
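
The idea of ties producing several dendrograms can be illustrated with a small sketch. This is not the authors' implementation and does not reproduce their four frequency measures: it assumes single linkage, integer distances (so ties are exact), and a hypothetical frequency defined simply as the fraction of distinct dendrograms in which a cluster appears. The names `all_dendrograms` and `cluster_frequencies` are invented for this example.

```python
from collections import Counter
from itertools import combinations


def linkage_distance(a, b, dist):
    """Single-linkage distance (an assumption): closest pair of members."""
    return min(dist[frozenset((x, y))] for x in a for y in b)


def all_dendrograms(points, dist):
    """Enumerate every dendrogram reachable by resolving each tie both ways.

    A dendrogram is represented as the frozenset of clusters it forms.
    Storing results in a set merges dendrograms that contain the same
    clusters but were reached through different merge orders.
    """
    results = set()

    def recurse(clusters, formed):
        if len(clusters) == 1:
            results.add(frozenset(formed))
            return
        pairs = list(combinations(clusters, 2))
        dists = [linkage_distance(a, b, dist) for a, b in pairs]
        best = min(dists)
        # Branch on every pair tied at the minimum distance.
        for (a, b), d in zip(pairs, dists):
            if d == best:
                merged = a | b
                rest = [c for c in clusters if c != a and c != b]
                recurse(rest + [merged], formed + [merged])

    recurse([frozenset([p]) for p in points], [])
    return results


def cluster_frequencies(dendrograms):
    """Fraction of dendrograms in which each cluster occurs."""
    counts = Counter(c for d in dendrograms for c in d)
    return {c: n / len(dendrograms) for c, n in counts.items()}


# Toy data: A, B, C are evenly spaced, so d(A,B) == d(B,C) is a tie.
coords = {"A": 0, "B": 1, "C": 2, "D": 10}
dist = {frozenset((p, q)): abs(coords[p] - coords[q])
        for p, q in combinations(coords, 2)}

dendros = all_dendrograms(list(coords), dist)
freq = cluster_frequencies(dendros)
for cluster, f in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(cluster), f)
```

In this toy case the single tie gives two dendrograms: one forms {A, B} first and the other forms {B, C} first, so each of those clusters occurs in only half of the dendrograms, while {A, B, C} occurs in both. Exhaustive enumeration of this kind grows quickly with the number of ties, so it is only practical for small inputs.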

Original language	English (US)
Article number	4
Pages (from-to)	1-16
Number of pages	16
Journal	Journal of Cheminformatics
Volume	8
Issue number	1
DOI	https://doi.org/10.1186/s13321-016-0114-x
State	Published - Jan 25 2016

ASJC Scopus subject areas

Computer Science Applications
Physical and Theoretical Chemistry
Computer Graphics and Computer-Aided Design
Library and Information Sciences

Access to Document

10.1186/s13321-016-0114-x

Cite this

@article{161b5c8be5c7404188a9819f02f9488a,

title = "How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity",

abstract = "Background: Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when such parameters are fixed, ties in proximity (i.e. two equidistant clusters from a third one) may produce several different dendrograms, having different possible clustering patterns (different classifications). This situation is usually disregarded and conclusions are based on a single result, leading to questions concerning the permanence of clusters in all the resulting dendrograms; this happens, for example, when using HCA for grouping molecular descriptors to select the least similar ones in QSAR studies. Results: Representing dendrograms in graph theoretical terms allowed us to introduce four measures of cluster frequency in a canonical way, and use them to calculate cluster frequencies over the set of all possible dendrograms, taking all ties in proximity into account. A toy example of well-separated clusters was used, as well as a set of 1666 molecular descriptors calculated for a group of molecules having hepatotoxic activity, to show how our functions may be used for studying the effect of ties in HCA. Such functions were not restricted to the tie case; the possibility of using them to derive cluster stability measurements on arbitrary sets of dendrograms having the same leaves is discussed, e.g. dendrograms from variations of HCA parameters. It was found that ties occurred frequently, some yielding tens of thousands of dendrograms, even for small data sets. Conclusions: Our approach was able to detect trends in clustering patterns by offering a simple way of measuring their frequency, which is often very low. This would imply that inferences and models based on descriptor classifications (e.g. QSAR) are likely to be biased, thereby requiring an assessment of their reliability. Moreover, any classification of molecular descriptors is likely to be far from unique. Our results highlight the need for evaluating the effect of ties on clustering patterns before classification results can be used accurately.",

author = "Wilmer Leal and Llanos, {Eugenio J.} and Guillermo Restrepo and Su{\'a}rez, {Carlos F.} and Patarroyo, {Manuel Elkin}",

note = "Publisher Copyright: {\textcopyright} 2016 Leal et al.",

year = "2016",

month = jan,

day = "25",

doi = "10.1186/s13321-016-0114-x",

language = "English (US)",

volume = "8",

pages = "1--16",

journal = "Journal of Cheminformatics",

issn = "1758-2946",

publisher = "Chemistry Central",

number = "1",

}

TY - JOUR

T1 - How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity

AU - Leal, Wilmer

AU - Llanos, Eugenio J.

AU - Restrepo, Guillermo

AU - Suárez, Carlos F.

AU - Patarroyo, Manuel Elkin

PY - 2016/1/25

Y1 - 2016/1/25

N2 - Background: Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when such parameters are fixed, ties in proximity (i.e. two equidistant clusters from a third one) may produce several different dendrograms, having different possible clustering patterns (different classifications). This situation is usually disregarded and conclusions are based on a single result, leading to questions concerning the permanence of clusters in all the resulting dendrograms; this happens, for example, when using HCA for grouping molecular descriptors to select the least similar ones in QSAR studies. Results: Representing dendrograms in graph theoretical terms allowed us to introduce four measures of cluster frequency in a canonical way, and use them to calculate cluster frequencies over the set of all possible dendrograms, taking all ties in proximity into account. A toy example of well-separated clusters was used, as well as a set of 1666 molecular descriptors calculated for a group of molecules having hepatotoxic activity, to show how our functions may be used for studying the effect of ties in HCA. Such functions were not restricted to the tie case; the possibility of using them to derive cluster stability measurements on arbitrary sets of dendrograms having the same leaves is discussed, e.g. dendrograms from variations of HCA parameters. It was found that ties occurred frequently, some yielding tens of thousands of dendrograms, even for small data sets. Conclusions: Our approach was able to detect trends in clustering patterns by offering a simple way of measuring their frequency, which is often very low. This would imply that inferences and models based on descriptor classifications (e.g. QSAR) are likely to be biased, thereby requiring an assessment of their reliability. Moreover, any classification of molecular descriptors is likely to be far from unique. Our results highlight the need for evaluating the effect of ties on clustering patterns before classification results can be used accurately.

AB - Background: Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when such parameters are fixed, ties in proximity (i.e. two equidistant clusters from a third one) may produce several different dendrograms, having different possible clustering patterns (different classifications). This situation is usually disregarded and conclusions are based on a single result, leading to questions concerning the permanence of clusters in all the resulting dendrograms; this happens, for example, when using HCA for grouping molecular descriptors to select the least similar ones in QSAR studies. Results: Representing dendrograms in graph theoretical terms allowed us to introduce four measures of cluster frequency in a canonical way, and use them to calculate cluster frequencies over the set of all possible dendrograms, taking all ties in proximity into account. A toy example of well-separated clusters was used, as well as a set of 1666 molecular descriptors calculated for a group of molecules having hepatotoxic activity, to show how our functions may be used for studying the effect of ties in HCA. Such functions were not restricted to the tie case; the possibility of using them to derive cluster stability measurements on arbitrary sets of dendrograms having the same leaves is discussed, e.g. dendrograms from variations of HCA parameters. It was found that ties occurred frequently, some yielding tens of thousands of dendrograms, even for small data sets. Conclusions: Our approach was able to detect trends in clustering patterns by offering a simple way of measuring their frequency, which is often very low. This would imply that inferences and models based on descriptor classifications (e.g. QSAR) are likely to be biased, thereby requiring an assessment of their reliability. Moreover, any classification of molecular descriptors is likely to be far from unique. Our results highlight the need for evaluating the effect of ties on clustering patterns before classification results can be used accurately.

UR - http://www.scopus.com/inward/record.url?scp=84958102688&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84958102688&partnerID=8YFLogxK

U2 - 10.1186/s13321-016-0114-x

DO - 10.1186/s13321-016-0114-x

M3 - Article

AN - SCOPUS:84958102688

SN - 1758-2946

VL - 8

SP - 1

EP - 16

JO - Journal of Cheminformatics

JF - Journal of Cheminformatics

IS - 1

M1 - 4

ER -
