Proceedings of the Federated Conference on Computer Science DOI: 10.15439/2016F453 _and Information Systems pp. 41-46_ACSIS, Vol. 8. ISSN 2300-5963 A new way for the exploration of a dataset based on a social choice inspired approach Michel HERBIN, Amine AIT YOUNES, Frederic BLANCHARD Universitě de Reims Champagne-Ardenne CReSTIC, France Email: {michel.herbin, amine.ait-younes, frederic.blanchard }@univ-reims.fr Abstract—The exploration of a data set consists in grouping similar data. The classical statistical methods often fail when there is is no minimal assumption on the clusters. Our approach is based on the links between data, but the pairwise comparison between data and the importance of the links depend heavily on context where data lies. We propose to analyze a dataset through methods of the social choice theory where data plays both the role of a candidate and the role of a voter. The candidates are ranked by the voters and each voter gives a score to each candidate according to his ranking. We propose one specific election for each voter based on his preferences. The voters of these elections have weights computed according to their respective behaviors. In this approach, the conventional similarity indices between data are used to define the electoral behavior of each data. Index Terms—exploratory data analysis, social choice theory, representatives, vote, data reduction I. Introduction ONE OF the first steps in the exploration of a data set consists in grouping similar data. For this purpose, lot of clustering methods are proposed in literature to detect clusters within a dataset. The methods often fail when there is neither a minimal assumption on the clusters nor a minimal model of the clusters. For instance the classical k-means method [8] assumes both that data could be grouped around mean values or mean vectors and that the number of clusters is known. Unfortunately the first assumption leads to important constraints on the shape of the clusters in the data space and this condition is seldom corroborated. Other approaches of clustering are based on links between data. The hierarchical agglomerative clustering methods are probably the most known methods for exploring the datasets using such links. The links are usually drawn from pairwise comparisons between data and they are based on distances or pseudo-distances [4]. But the pairwise comparison between data and the importance of the links depend heavily on context where data lies. Indeed the ranges of values of a comparison index could change when data are not in the same clusters. In other words, the links could be well suited to connect two data in one cluster and they are not adapted for the other clusters. Thus this paper proposes a new way to define the links between data through the ranks to overcome this constraint of cluster context. We propose to analyze a dataset through methods of the social choice theory where data plays both the role of a candidate and the role of a voter [5]. The social choice inspired approach brings a metaphorical meaning that help to understand the concepts (as in bioinspired or human-inspired algorithms [10]). The candidates are ranked by the voters and each voter gives a score to each candidate according to his ranking. Then the scores of the voters are aggregated using generally the sum of scores obtained by the candidates. In the classical procedure of election, each voter has the same weight in the aggregation. Thus this procedure is the same for all clusters. In this paper the election procedures differ from one cluster to another. We propose one specific election for each voter based on his preferences (i.e. one election per voter). The voters of these elections have weights computed by comparison of their respective behaviors. The weights differ from one election to another. The links between data are defined using these elections where each voter selects one candidate for representing itself within the dataset. The chainings between the voters and their representatives define data communities. Thus the partitions of the dataset with these communities give us a new way to explore the dataset. The following section describes the procedure of election that we propose in this paper. It leads to a graph that permit us to structure the dataset. Then we study and we assess this method for structuring a dataset. Finally we discuss and we conclude this work. II. Dataset and Voters A. Collective preference Let O be a dataset with n elements: n = {x1,x2,...xn} In the framework of the social choice theory [9] [3], fl is both a set of n voters and a set of n candidates. Thus each data is a voter of fl and it also becomes an alternative that the other voters could prefer as a representative in fl (i.e. an elected candidate of fl). The dataset is provided with a pairwise comparison index between data. We call D this index. In this paper, we use Euclidean distance as pairwise comparison index. But we need only two properties of D. When Xt, Xj, and X^ are three data in fl, we should have: • D{Xi,Xj) < D(Xi,Xk) if Xj is more similar to Xt than Xk is (in other terms : Xj is prefered to X^ by Xt) 978-83-60810-90-3/$25.00©2016, IEEE 41 42 PROCEEDINGS OF THE FEDCSIS. GDAŇSK, 2016 In the following, any pairwise comparison index should respect these two properties. With using such a pairwise comparator, each data Xi is considered as a voter which can rank the other data. The ranks of Xi are defined between 1 and n. The ranking function is called RXi and we have: . RXi(Xi) = l, . RXi(Xj) < RXi(Xk) if D(Xi,Xj) < D(Xi,Xk). The data Xi is a voter that selects the candidates using RXi as preference indicator. The vote of X{ is realized with a score of Borda which is a classical method of social choice theory [6] [1]. In this paper, the score of Borda given by the voter Xi to the candidate X, is defined as: (n-l) where Xj is a candidate and Xi is a voter. The classical election procedure attributes the sum of the scores of the voters for each candidate. Thus the candidate Xj obtains the global score S(Xj) defined by: S(XJ) = YJSxÁXJ) This procedure leads to nominate the best candidate as the one with the highest score. But each voter has the same weight in this overall vote. This overall election does not take into account that two voters could belong to two different clusters. In the following, we will change the paradigm. We consider that each voter has his own election procedure that is adapted to itself. The following describes the specific procedure for each voter. B. Individual preference Each voter will choose its candidate with its own election procedure. Let Xi be a voter that chooses the candidates. Each data Xj is also a voter of the election that Xi proposes. All the voters of fl have weights that are specific of the election procedure of Xt. The weight of Xi itself is equal to one. The more similar to Xi a voter Xj is, the higher the weight of Xj is in this election. The weights are based on the similarity between the voters and the similarities with Xi are used for the election that Xi proposes. Let us describe the similarity of the behaviors of two voters. We consider that two voters Xi and Xj are similar when their respective ranking function RXi and RXj are similar. The correlation of Spearman [11] is classically used to evaluate the correlation between ranks. The higher the correlation is close to 1, the more ranks are correlated. In this paper the correlation gives us an index of the similarity of the behavior of two voters. Spearman correlation between Xi and Xj is defined by: 6 * V™ t. Let wxt(Xj) be the weight given to the voter Xj for the election based on the preferences of Xt. We define this weight by: , n , Car(Xi.Xi) -t, wXi(Xj) = max(0,-v ' ^-) The weight lies between 0 and 1. It is equal to zero when Xi and Xj are not similar. In the election based on the preferences of X{, each candidate Xj obtains a score Scorext(Xj) defined by: n ScoreXt{Xj) = x SXk(Xj) k=l Thus this election is based on a sum of scores weighted by the similarity of the voters with Xt. Other voters similar to Xi participate in the election of the representative of Xt. C. Communities of voters The representative of Xi becomes the one which have the highest score within fl for the election based on the preferences of Xt. So each voter Xi has one representative in fl elected by the specific election of X{ : Repgcore(Xi) . Score(Repscore(Xi)) = mkx(ScoreXi(Xk)) k=l We define a graph in fl where the vertices are the voters and the edges are the links between the voters and their representatives. Each connected components of this graph defines a community of voters. The more we claim a high correlation between voters, the more the size of communities is reduced. In other words, the higher the threshold t is close to 1, the more the communities are small and the number of communities increases within fl. These communities give a data structuration to study a dataset when we have neither assumption nor model for the clusters. If the threshold t is close to 1, the representative of each voter Xi is based only on the preference of Xt. If this threshold decrease, other voters similar to Xi participate in the election of the representative of Xt. Each data Xi has two representatives: the favorite candidate of Xi and the elected candidate of the local election of Xi. For each data Xi we define an individual loss indicator, wich represents the correlation loss between X{ and theses two representatives. The collective loss indicator for the data is the sum of all the individual loss. loss = y^JossInd(Xk) lossInd(Xi) = Car(Xi, Reps(Xi)) - Cor(Xx, RepScor£(Xx)) with MICHEL HERBIN ET AL.: A NEW WAY FOR THE EXPLORATION OF A DATASET BASED ON A SOCIAL CHOICE INSPIRED APPROACH 43 S(RePs(Xi)) = maxk#(SXi(Xk)) Score(Repscore(Xi)) = m&xk^i(ScoreXt(Xk)) ILL Experimental Study This section is devoted to the study of our method for structuring a dataset with a pairwise comparator. First let us present an example of the different steps of the dataset structuration with our method. In a second section, we assess the quality of this structuration using simulated data. Third the quality is assessed when using one real dataset. TABLE I: Number of communities of voters, number of unique representatives and loss, using the simple example of Fig.l when the threshold t of correlation varies between 0 and 1. t nbcom nbrep loss 1 0.00 2 4 0.85 2 0.05 2 4 0.85 3 0.10 2 4 0.85 4 0.15 2 4 0.85 5 0.20 2 4 0.85 6 0.25 2 4 0.85 7 0.30 2 4 0.85 8 0.35 2 4 0.77 9 0.40 2 4 0.77 10 0.45 2 4 0.77 11 0.50 2 4 0.78 12 0.55 2 4 0.78 13 0.60 2 5 0.59 14 0.65 2 5 0.59 15 0.70 2 5 0.52 16 0.75 3 8 0.32 17 0.80 3 9 0.23 18 0.85 3 11 0.17 19 0.90 5 14 0.00 20 0.95 5 14 0.00 21 1.00 20 20 0.00 A. Workflow for structuring a dataset We propose to explore a dataset with 20 simulated data in dimension 2 (see Fig.l-A). The pairwise comparisons are based on Euclidean distance. We conduct the overall election with a classical Borda's procedure. This overall election permits us to propose the best candidate which could be considered as the representative of the whole dataset (see Fig.l-B). Then we proceed to the elections based on the individual preferences for obtaining linking each data with another one. These links allow to define communities of voters. The procedure of the election with the individual preferences is based on a threshold of correlation. Fig.l-C shows the number of communities when the correlation threshold increases. The higher the threshold, the higher the number of communities is. The highest threshold leads to the highest number of unique representatives. The higher the threshold, the lesser the losses are (both individual and collective). When the threshold is equal to 0.5, 0.95 and 0.99 (Fig.l-D, Fig.l-E, Fig.l-F) : • the number of communities is 2, 5 and 6 (resp.) • the number of unique representatives is 4, 14 and 15 (resp.) • the collective loss is 0.78, 0 and 0 (resp.) This number of communities is less than the number of data. That gives a new way for the exploration of a dataset. B. Assessment of the links structuring a dataset TABLE II: Number of communities of voters, number of unique representatives and loss, using the three classes of Fig.2 and criterion of assessment when the threshold of correlation varies between 0 and 1. t nbcom nbrep loss 1 0.00 3 20 6.99 2 0.05 3 19 6.65 3 0.10 3 19 6.37 4 0.15 3 18 6.49 5 0.20 3 16 6.23 6 0.25 3 17 6.09 7 0.30 3 17 6.00 8 0.35 3 16 6.01 9 0.40 3 16 6.03 10 0.45 3 16 5.95 11 0.50 3 16 5.82 12 0.55 3 17 5.70 13 0.60 3 19 5.47 14 0.65 3 21 5.00 15 0.70 3 24 4.58 16 0.75 3 31 4.04 17 0.80 3 35 3.03 18 0.85 4 47 2.32 19 0.90 7 58 1.37 20 0.95 10 74 0.52 21 1.00 150 150 0.00 In this paper, we place ourselves resolutely in the context of the exploratory analysis of data without any a priori assumption on eventual classes, we only use an index of pairwise comparison. But the use of classes gives the most classical way to evaluate a structuration of a dataset. So this paper uses classes to assess only the links that we propose between data. The detection of classes (i.e. the clustering) is out of the scope of this paper. The assessment of our method for structuring a dataset is performed using a dataset with known classes. Each data belongs to one class and it has the label of its class. Using our structuration each data is also linked to a representative in the dataset. A data is well represented when its own label is equal to the label of its representative. In such case the link between a voter and its representative remains inside a class of the dataset. We propose a structuration of the dataset with graphs. The vertices of the graph are labeled and the edges are labeled when their extremities have the same label. We compute the number of the labeled edges. The percentage of such edges could assess the quality of the structuration through a graph. Unfortunately the classes are unknown in the first step of data exploration. Thus we propose to use the loss indicator instead of this percentage of labeled links. The higher is this quality criterion and the lower the number of communities is, then the better the structuration is. Table II gives the values of this criterion when the threshold of correlation lies between 0 and 1. The dataset is simulated in dimension 2 (see Fig.2) and the number of detected 44 PROCEEDINGS OF THE FEDCSIS. GDAŇSK, 2016 A *- —• 3.25 0. E j0 0.75 20- o.'oo 0 =5 0. c j0 0.75 1.00 --• D Fig. 1: Twenty simulated data in dimension 2 (A), the overal election selecting one representative (B) and elections based on individual preferences leading to several communities whose number depends on the correlation threshold (C). 2, 5, and 6 communities respectively obtained with a correlation threshold equal to 0.5, 0.95, 0.99 (D, E, F). communities of voters is displayed when the threshold of correlation between voters increases from 0 to 1. Fig.2 gives also three examples of the communities when the threshold is respectively equal to 0, 0.5 and 0.9. TABLE III: Number of communities of voters, number of unique representatives and loss, using the three classes of Fig.3 and criterion of assessment when the threshold of correlation varies between 0 and 1. t nbcom nbrep loss 1 0.00 1 74 35.49 2 0.05 1 70 32.88 3 0.10 1 69 30.32 4 0.15 1 72 28.65 5 0.20 1 69 26.90 6 0.25 1 70 24.89 7 0.30 1 74 23.56 8 0.35 4 79 21.55 9 0.40 4 75 19.58 10 0.45 3 76 17.25 11 0.50 3 82 15.09 12 0.55 5 92 11.73 13 0.60 7 101 9.50 14 0.65 7 107 7.94 15 0.70 11 118 6.54 16 0.75 13 128 5.27 17 0.80 14 134 3.99 18 0.85 19 149 3.10 19 0.90 25 166 1.96 20 0.95 36 186 0.88 21 1.00 380 380 0.00 In the following we simulated a dataset with three classes that are hardly distinguishable because of their shapes and their overlapping. The dataset (n = 380) is simulated in dimension 2 with three uniform distributions in two rectangular crowns with 200 and 80 data and one rectangle with 100 data (see Fig.3). The number of voter communities is displayed when the threshold of correlation between voters increases from 0 to 1. Fig.3 gives also three examples of the communities when the threshold is respectively equal to 0.5, 0.8 and 0.99, • the number of communities is 3, 14 and 71 (resp.) • the number of unique representatives is 82, 134 and 212 (resp.) • the collective loss is 15.09, 3.99 and 0.11 (resp.) the number of communities are respectively equal to 8, 16 and 106. In such a case the classical clustering methods fail to detect meaning clusters. Indeed classical clustering methods are often based on statistics such as means or medoids. They use these statistics to determine the clusters and they make the assumption that data could be well represented with such statistics. Unfortunately these statistical approaches are unadapted in this case. Table III gives the values of our assessment criterion when the threshold of correlation lies between 0 and 1. C. Assessment with real data We use the databases from Machine Learning Repository of UCI [2] to assess our method with real data. Iris is the classical database that has 150 iris plants with 4 attributes and three clusters. Table IV gives the results we obtain with this dataset. Fig.4 displays the number of voter communities and the percentage of labeled links when the correlation threshold increases from 0 to 0.99, and the loss indicator. IV. Discussion and Conclusion In this paper we describe and we implement a method for exploring a data set. The main originality of this method lies in MICHEL HERBIN ET AL.: A NEW WAY FOR THE EXPLORATION OF A DATASET BASED ON A SOCIAL CHOICE INSPIRED APPROACH 45 the definition of links between data. These links are based on a local election mechanism with individual preferences that connects each data to another data designated by the local election process. In this approach, the conventional similarity indices between data are used to define the electoral behavior of each data. As the preferences of the users in a recommender system, the voters then have weights corresponding to the similarity of electoral behaviors. However this approach by recommender systems is not used in this paper and the robustness of our method when data is incomplete or imperfect could be studied in future work. Another important contribution of this work is to reduce the size of a data set from the exploration of a set of n data to a set of p communities where p is much smaller than n. This approach of dimensionality reduction has the advantage that it makes no assumption about the shape or the exact number of communities. It thus constitutes a preliminary step to a more meaningful clustering and it leads to select a more suitable method for the exploring dataset. This extension of our work could also be involved in further work. 46 PROCEEDINGS OF THE FEDCSIS. GDAŇSK, 2016 t t t D E F Fig. 4: Iris Data (n = 150) with three classes of 50 data in dimension four. Top : data projections in dimension two using sepal width and sepal length and detection of communities when the correlation threshold is equal to 0.5 and 0.95. Bottom : Number of voter communities and the percentage of labeled links when the correlation threshold increases from 0 to 0.99, and the loss indicator TABLE IV: Number of communities of voters, number of unique representatives and loss, using the three classes of the Iris data (see Fig.4) and criterion of assessment when the threshold of correlation varies between 0 and 1. t nbcom nbrep loss 1 0.00 2 11 11.33 2 0.05 2 11 10.57 3 0.10 2 11 10.23 4 0.15 2 11 9.64 5 0.20 2 11 9.25 6 0.25 2 11 8.56 7 0.30 2 12 8.14 8 0.35 2 12 7.58 9 0.40 2 12 7.51 10 0.45 2 12 7.11 11 0.50 2 15 6.51 12 0.55 3 17 6.06 13 0.60 4 18 4.90 14 0.65 4 17 4.27 15 0.70 4 18 3.33 16 0.75 4 18 2.83 17 0.80 6 20 2.67 18 0.85 6 25 1.94 19 0.90 6 33 1.34 20 0.95 7 46 0.46 21 1.00 150 150 0.00 We are currently working on application for sensor network data analysis. Acknowledgment This work is partially supported by the EC SCOOP project (INEA/CEF/TRAN/A2014/1042281). References [I] A. Ait Younes, F. Blanchard and M. Herbin, "New similarity index based on the aggregation of membership functions through OWA operator", Federated Conference on Computer Science and Information Systems, FedCSIS 2015, 163-168, Lodz, Poland, 2015. [2] K. Bache, M. Lichman, "UCI Machine learning repository", http://archive.ics.uci.edu/ml, University of California, Irvine, School of Information and Computer Sciences, 2013. [3] J.P. Barfhelemy and B.Montjardet, "The median procedure in cluster analysis and social choice theory", Mathematical Social Sciences, 1:235- 267, 1981. [4] A. Bellet, A. Habrard, M. Sebban, "A Survey on Metric Learning for Feature Vectors and Structured Data", Technical report, arXiv: 1306.6709, 2014. [5] F. Blanchard, C. de Runz, M. Herbin, H. Akdag, "Representativite et graphe de representants : une approche inspiree de la fheorie du choix social pour la fouille de donnees relationnelles", Atelier Fouille de Donnees Complexes, Conference Extraction et Gestion des Connaissances, EGC, 73-83, Brest, France, 2011. [6] M. de Borda, "Memoire sur les elections au scrutin", Academie Royale des Sciences, Paris, 1784. [7] A.K. Jain, M.N. Murty, P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3), 264-323, 1999. [8] A.K. Jain, "Data clustering: 50 years beyond K-means", Pattern Recognition Letters, 31, 651-666, 2010. [9] J.N. Mordeson, D.S. Malik, T.D. Clark, "Application of Fuzzy Logic to Social Choice Theory", Chapman and Hall/CRC, 2015. [10] M. Parsapoor, U. Bilstrup, "An Emotional Learning-inspired Ensemble Classifier (ELiEC)", Proceedings of the 2013 Federated Conference on Computer Science and Information Systems, 137-141, 2013 [II] C. Spearman, "General intelligence objectively determined and measured", Am J Psychol, 15, 201-293, 1904.