MINING TEXT DATA

Edited by

CHARU C. AGGARWAL
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

CHENGXIANG ZHAI
University of Illinois at Urbana-Champaign, Urbana, IL, USA

Kluwer Academic Publishers, Boston/Dordrecht/London

Contents

1 An Introduction to Text Mining (Charu C. Aggarwal and ChengXiang Zhai) 1
  1. Introduction 1; 2. Algorithms for Text Mining 4; 3. Future Directions 8; References 10

2 Information Extraction from Text (Jing Jiang) 11
  1. Introduction 11; 2. Named Entity Recognition 15; 2.1 Rule-based Approach 16; 2.2 Statistical Learning Approach 17; 3. Relation Extraction 22; 3.1 Feature-based Classification 23; 3.2 Kernel Methods 26; 3.3 Weakly Supervised Learning Methods 29; 4. Unsupervised Information Extraction 30; 4.1 Relation Discovery and Template Induction 31; 4.2 Open Information Extraction 32; 5. Evaluation 33; 6. Conclusions and Summary 34; References 35

3 A Survey of Text Summarization Techniques (Ani Nenkova and Kathleen McKeown) 43
  1. How do Extractive Summarizers Work? 44; 2. Topic Representation Approaches 46; 2.1 Topic Words 46; 2.2 Frequency-driven Approaches 48; 2.3 Latent Semantic Analysis 52; 2.4 Bayesian Topic Models 53; 2.5 Sentence Clustering and Domain-dependent Topics 55; 3. Influence of Context 56; 3.1 Web Summarization 57; 3.2 Summarization of Scientific Articles 58; 3.3 Query-focused Summarization 58; 3.4 Email Summarization 59; 4. Indicator Representations and Machine Learning for Summarization 60; 4.1 Graph Methods for Sentence Importance 60; 4.2 Machine Learning for Summarization 62; 5. Selecting Summary Sentences 64; 5.1 Greedy Approaches: Maximal Marginal Relevance 64; 5.2 Global Summary Selection 65; 6. Conclusion 66; References 66

4 A Survey of Text Clustering Algorithms (Charu C. Aggarwal and ChengXiang Zhai) 77
  1. Introduction 77; 2. Feature Selection and Transformation Methods for Text Clustering 81; 2.1 Feature Selection Methods 81; 2.2 LSI-based Methods 84; 2.3 Non-negative Matrix Factorization 86; 3. Distance-based Clustering Algorithms 89; 3.1 Agglomerative and Hierarchical Clustering Algorithms 90; 3.2 Distance-based Partitioning Algorithms 92; 3.3 A Hybrid Approach: The Scatter-Gather Method 94; 4. Word and Phrase-based Clustering 99; 4.1 Clustering with Frequent Word Patterns 100; 4.2 Leveraging Word Clusters for Document Clusters 102; 4.3 Co-clustering Words and Documents 103; 4.4 Clustering with Frequent Phrases 105; 5. Probabilistic Document Clustering and Topic Models 107; 6. Online Clustering with Text Streams 110; 7. Clustering Text in Networks 115; 8. Semi-Supervised Clustering 118; 9. Conclusions and Summary 120; References 121

5 Dimensionality Reduction and Topic Modeling (Steven P. Crain, Ke Zhou, Shuang-Hong Yang and Hongyuan Zha) 129
  1. Introduction 130; 1.1 The Relationship Between Clustering, Dimension Reduction and Topic Modeling 131; 1.2 Notation and Concepts 132; 2. Latent Semantic Indexing 133; 2.1 The Procedure of Latent Semantic Indexing 134; 2.2 Implementation Issues 135; 2.3 Analysis 137; 3. Topic Models and Dimension Reduction 139; 3.1 Probabilistic Latent Semantic Indexing 140; 3.2 Latent Dirichlet Allocation 142; 4. Interpretation and Evaluation 148; 4.1 Interpretation 148; 4.2 Evaluation 149; 4.3 Parameter Selection 150; 4.4 Dimension Reduction 150; 5. Beyond Latent Dirichlet Allocation 151; 5.1 Scalability 151; 5.2 Dynamic Data 151; 5.3 Networked Data 152; 5.4 Adapting Topic Models to Applications 154; 6. Conclusion 155; References 156

6 A Survey of Text Classification Algorithms (Charu C. Aggarwal and ChengXiang Zhai) 163
  1. Introduction 163; 2. Feature Selection for Text Classification 167; 2.1 Gini Index 168; 2.2 Information Gain 169; 2.3 Mutual Information 169; 2.4 χ2-Statistic 170; 2.5 Feature Transformation Methods: Supervised LSI 171; 2.6 Supervised Clustering for Dimensionality Reduction 172; 2.7 Linear Discriminant Analysis 173; 2.8 Generalized Singular Value Decomposition 175; 2.9 Interaction of Feature Selection with Classification 175; 3. Decision Tree Classifiers 176; 4. Rule-based Classifiers 178; 5. Probabilistic and Naive Bayes Classifiers 181; 5.1 Bernoulli Multivariate Model 183; 5.2 Multinomial Distribution 188; 5.3 Mixture Modeling for Text Classification 190; 6. Linear Classifiers 193; 6.1 SVM Classifiers 194; 6.2 Regression-Based Classifiers 196; 6.3 Neural Network Classifiers 197; 6.4 Some Observations about Linear Classifiers 199; 7. Proximity-based Classifiers 200; 8. Classification of Linked and Web Data 203; 9. Meta-Algorithms for Text Classification 209; 9.1 Classifier Ensemble Learning 209; 9.2 Data Centered Methods: Boosting and Bagging 210; 9.3 Optimizing Specific Measures of Accuracy 211; 10. Conclusions and Summary 213; References 213

7 Transfer Learning for Text Mining (Weike Pan, Erheng Zhong and Qiang Yang) 223
  1. Introduction 224; 2. Transfer Learning in Text Classification 225; 2.1 Cross Domain Text Classification 225; 2.2 Instance-based Transfer 231; 2.3 Cross-Domain Ensemble Learning 232; 2.4 Feature-based Transfer Learning for Document Classification 235; 3. Heterogeneous Transfer Learning 239; 3.1 Heterogeneous Feature Space 241; 3.2 Heterogeneous Label Space 243; 3.3 Summary 244; 4. Discussion 245; 5. Conclusions 246; References 247

8 Probabilistic Models for Text Mining (Yizhou Sun, Hongbo Deng and Jiawei Han) 259
  1. Introduction 260; 2. Mixture Models 261; 2.1 General Mixture Model Framework 262; 2.2 Variations and Applications 263; 2.3 The Learning Algorithms 266; 3. Stochastic Processes in Bayesian Nonparametric Models 269; 3.1 Chinese Restaurant Process 269; 3.2 Dirichlet Process 270; 3.3 Pitman-Yor Process 274; 3.4 Others 275; 4. Graphical Models 275; 4.1 Bayesian Networks 276; 4.2 Hidden Markov Models 278; 4.3 Markov Random Fields 282; 4.4 Conditional Random Fields 285; 4.5 Other Models 286; 5. Probabilistic Models with Constraints 287; 6. Parallel Learning Algorithms 288; 7. Conclusions 289; References 290

9 Mining Text Streams (Charu C. Aggarwal) 297
  1. Introduction 297; 2. Clustering Text Streams 299; 2.1 Topic Detection and Tracking in Text Streams 307; 3. Classification of Text Streams 312; 4. Evolution Analysis in Text Streams 316; 5. Conclusions 317; References 318

10 Translingual Mining from Text Data (Jian-Yun Nie, Jianfeng Gao and Guihong Cao) 323
  1. Introduction 324; 2. Traditional Translingual Text Mining – Machine Translation 325; 2.1 SMT and Generative Translation Models 325; 2.2 Word-Based Models 327; 2.3 Phrase-Based Models 329; 2.4 Syntax-Based Models 333; 3. Automatic Mining of Parallel Texts 336; 3.1 Using Web Structure 337; 3.2 Matching Parallel Pages 339; 4. Using Translation Models in CLIR 341; 5. Collecting and Exploiting Comparable Texts 344; 6. Selecting Parallel Sentences, Phrases and Translation Words 347; 7. Mining Translingual Relations from Monolingual Texts 349; 8. Mining Using Hyperlinks 351; 9. Conclusions and Discussions 353; References 354

11 Text Mining in Multimedia (Zheng-Jun Zha, Meng Wang, Jialie Shen and Tat-Seng Chua) 361
  1. Introduction 362; 2. Surrounding Text Mining 364; 3. Tag Mining 366; 3.1 Tag Ranking 366; 3.2 Tag Refinement 367; 3.3 Tag Information Enrichment 369; 4. Joint Text and Visual Content Mining 370; 4.1 Visual Re-ranking 371; 5. Cross Text and Visual Content Mining 374; 6. Summary and Open Issues 377; References 379

12 Text Analytics in Social Media (Xia Hu and Huan Liu) 385
  1. Introduction 385; 2. Distinct Aspects of Text in Social Media 388; 2.1 A General Framework for Text Analytics 388; 2.2 Time Sensitivity 390; 2.3 Short Length 391; 2.4 Unstructured Phrases 392; 2.5 Abundant Information 393; 3. Applying Text Analytics to Social Media 393; 3.1 Event Detection 393; 3.2 Collaborative Question Answering 395; 3.3 Social Tagging 397; 3.4 Bridging the Semantic Gap 398; 3.5 Exploiting the Power of Abundant Information 399; 3.6 Related Efforts 401; 4. An Illustrative Example 402; 4.1 Seed Phrase Extraction 402; 4.2 Semantic Feature Generation 404; 4.3 Feature Space Construction 406; 5. Conclusion and Future Work 407; References 408

13 A Survey of Opinion Mining and Sentiment Analysis (Bing Liu and Lei Zhang) 415
  1. The Problem of Opinion Mining 416; 1.1 Opinion Definition 416; 1.2 Aspect-Based Opinion Summary 420; 2. Document Sentiment Classification 422; 2.1 Classification Based on Supervised Learning 422; 2.2 Classification Based on Unsupervised Learning 424; 3. Sentence Subjectivity and Sentiment Classification 426; 4. Opinion Lexicon Expansion 429; 4.1 Dictionary-based Approach 429; 4.2 Corpus-based Approach and Sentiment Consistency 430; 5. Aspect-Based Sentiment Analysis 432; 5.1 Aspect Sentiment Classification 433; 5.2 Basic Rules of Opinions 434; 5.3 Aspect Extraction 438; 5.4 Simultaneous Opinion Lexicon Expansion and Aspect Extraction 440; 6. Mining Comparative Opinions 441; 7. Some Other Problems 444; 8. Opinion Spam Detection 447; 8.1 Spam Detection Based on Supervised Learning 448; 8.2 Spam Detection Based on Abnormal Behaviors 449; 8.3 Group Spam Detection 450; 9. Utility of Reviews 451; 10. Conclusions 452; References 453

14 Biomedical Text Mining: A Survey of Recent Progress (Matthew S. Simpson and Dina Demner-Fushman) 465
  1. Introduction 466; 2. Resources for Biomedical Text Mining 467; 2.1 Corpora 467; 2.2 Annotation 469; 2.3 Knowledge Sources 470; 2.4 Supporting Tools 471; 3. Information Extraction 472; 3.1 Named Entity Recognition 473; 3.2 Relation Extraction 478; 3.3 Event Extraction 482; 4. Summarization 484; 5. Question Answering 488; 5.1 Medical Question Answering 489; 5.2 Biological Question Answering 491; 6. Literature-Based Discovery 492; 7. Conclusion 495; References 496

Index 519

Chapter 4
A SURVEY OF TEXT CLUSTERING ALGORITHMS

Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
charu@us.ibm.com

ChengXiang Zhai
University of Illinois at Urbana-Champaign
Urbana, IL
czhai@cs.uiuc.edu

Abstract: Clustering is a widely studied data mining problem in the text domain. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this chapter, we will provide a detailed survey of the problem of text clustering. We will study the key challenges of the clustering problem as it applies to the text domain. We will discuss the key methods used for text clustering, and their relative advantages. We will also discuss a number of recent advances in the area in the context of social networks and linked data.

Keywords: Text Clustering

1. Introduction

The problem of clustering has been studied widely in the database and statistics literature in the context of a wide variety of data mining tasks [50, 54].
The clustering problem is defined to be that of finding groups of similar objects in the data. The similarity between the objects is measured with the use of a similarity function. The problem of clustering can be very useful in the text domain, where the objects to be clustered can be of different granularities, such as documents, paragraphs, sentences or terms. Clustering is especially useful for organizing documents to improve retrieval and support browsing [11, 26].

The study of the clustering problem precedes its applicability to the text domain. Traditional methods for clustering have generally focused on the case of quantitative data [44, 71, 50, 54, 108], in which the attributes of the data are numeric. The problem has also been studied for the case of categorical data [10, 41, 43], in which the attributes may take on nominal values. A broad overview of clustering (as it relates to generic numerical and categorical data) may be found in [50, 54]. Implementations of common text clustering algorithms may be found in several toolkits, such as Lemur [114] and the BOW toolkit [64]. The problem of clustering finds applicability for a number of tasks:

Document Organization and Browsing: The hierarchical organization of documents into coherent categories can be very useful for systematic browsing of the document collection. A classical example of this is the Scatter/Gather method [25], which provides a systematic browsing technique with the use of the clustered organization of the document collection.

Corpus Summarization: Clustering techniques provide a coherent summary of the collection in the form of cluster-digests [83] or word-clusters [17, 18], which can be used in order to provide summary insights into the overall content of the underlying corpus. Variants of such methods, especially sentence clustering, can also be used for document summarization, a topic discussed in detail in Chapter 3. The problem of clustering is also closely related to that of dimensionality reduction and topic modeling. Such dimensionality reduction methods are all different ways of summarizing a corpus of documents, and are covered in Chapter 5.

Document Classification: While clustering is inherently an unsupervised learning method, it can be leveraged in order to improve the quality of the results in its supervised variant. In particular, word-clusters [17, 18] and co-training methods [72] can be used in order to improve the classification accuracy of supervised applications with the use of clustering techniques.

We note that many classes of algorithms, such as the k-means algorithm or hierarchical algorithms, are general-purpose methods which can be extended to any kind of data, including text data. A text document can be represented in the form of binary data, in which we use the presence or absence of a word in the document in order to create a binary vector. In such cases, it is possible to directly use a variety of categorical data clustering algorithms [10, 41, 43] on the binary representation. A more enhanced representation would include refined weighting methods based on the frequencies of the individual words in the document as well as the frequencies of words in the entire collection (e.g., TF-IDF weighting [82]). Quantitative data clustering algorithms [44, 71, 108] can be used in conjunction with these frequencies in order to determine the most relevant groups of objects in the data.
However, such naive techniques do not typically work well for clustering text data. This is because text data has a number of unique properties which necessitate the design of specialized algorithms for the task. The distinguishing characteristics of the text representation are as follows:

The dimensionality of the text representation is very large, but the underlying data is sparse. In other words, the lexicon from which the documents are drawn may be of the order of 10^5 words, but a given document may contain only a few hundred of them. This problem is even more serious when the documents to be clustered are very short (e.g., when clustering sentences or tweets).

While the lexicon of a given corpus of documents may be large, the words are typically correlated with one another. This means that the number of concepts (or principal components) in the data is much smaller than the size of the feature space. This necessitates the careful design of algorithms which can account for word correlations in the clustering process.

The number of words (or non-zero entries) in the different documents may vary widely. Therefore, it is important to normalize the document representations appropriately during the clustering task.

The sparse and high-dimensional representation of the different documents necessitates the design of text-specific algorithms for document representation and processing, a topic heavily studied in the information retrieval literature, where many techniques have been proposed to optimize document representation for improving the accuracy of matching a document with a query [82, 13]. Most of these techniques can also be used to improve document representation for clustering.

In order to enable an effective clustering process, the word frequencies need to be normalized in terms of their relative frequency of presence in the document and over the entire collection. In general, a common representation used for text processing is the vector-space based TF-IDF representation [81]. In the TF-IDF representation, the term frequency for each word is normalized by the inverse document frequency, or IDF. The inverse document frequency normalization reduces the weight of terms which occur more frequently in the collection. This reduces the importance of common terms in the collection, ensuring that the matching of documents is influenced more by the discriminative words which have relatively low frequencies in the collection. In addition, a sub-linear transformation function is often applied to the term frequencies in order to avoid the undesirable dominating effect of any single term that might be very frequent in a document. The work on document normalization is itself a vast area of research, and a variety of other techniques which discuss different normalization methods may be found in [86, 82].
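To make these normalization steps concrete, the following is a minimal sketch (not part of the original chapter) of a sparse TF-IDF representation with logarithmic term-frequency damping and length normalization; the smoothing constants and function names are illustrative assumptions rather than the specific weighting scheme of [81, 82].

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build L2-normalized TF-IDF vectors with sub-linear (logarithmic) TF damping.

    docs: list of token lists, e.g. [["text", "clustering", ...], ...]
    Returns one {term: weight} dictionary per document (a sparse representation).
    """
    n = len(docs)
    # Document frequency of each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1.0 + n) / (1.0 + d)) + 1.0 for term, d in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Sub-linear damping of the raw term frequency, then IDF weighting.
        weights = {t: (1.0 + math.log(f)) * idf[t] for t, f in tf.items()}
        # Length normalization, so that documents of different sizes are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

if __name__ == "__main__":
    docs = [["text", "clustering", "of", "documents"],
            ["clustering", "of", "data", "streams"],
            ["text", "mining", "of", "text", "data"]]
    for vec in tfidf_vectors(docs):
        print(sorted(vec.items()))
```

Any of the alternative damping and smoothing choices mentioned above can be substituted without changing the overall structure of such a representation.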
Text clustering algorithms are divided into a wide variety of different types, such as agglomerative clustering algorithms, partitioning algorithms, and standard parametric modeling based methods such as the EM-algorithm. Furthermore, text representations may also be treated as strings (rather than bags of words). These different representations necessitate the design of different classes of clustering algorithms. Different clustering algorithms have different tradeoffs in terms of effectiveness and efficiency. An experimental comparison of different clustering algorithms may be found in [90, 111].

In this chapter we will discuss a wide variety of algorithms which are commonly used for text clustering. We will also discuss text clustering algorithms for related scenarios, such as dynamic data, network-based text data and semi-supervised scenarios.

This chapter is organized as follows. In section 2, we will present feature selection and transformation methods for text clustering. Section 3 describes a number of common algorithms which are used for distance-based clustering of text documents. Section 4 contains the description of methods for clustering with the use of word patterns and phrases. Section 5 describes methods for probabilistic clustering of text data. Methods for online clustering of text streams are described in section 6. Section 7 contains a description of methods for clustering text which naturally occurs in the context of social or web-based networks. Section 8 discusses methods for semi-supervised clustering. Section 9 presents the conclusions and summary.

2. Feature Selection and Transformation Methods for Text Clustering

The quality of any data mining method, such as classification or clustering, is highly dependent on the noisiness of the features that are used for the clustering process. For example, commonly used words such as "the" may not be very useful in improving the clustering quality. Therefore, it is critical to select the features effectively, so that the noisy words in the corpus are removed before the clustering. In addition to feature selection, a number of feature transformation methods such as Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Analysis (PLSA), and Non-negative Matrix Factorization (NMF) are available to improve the quality of the document representation and make it more amenable to clustering. In these techniques (often called dimension reduction), the correlations among the words in the lexicon are leveraged in order to create features which correspond to the concepts or principal components in the data. In this section, we will discuss both classes of methods. A more in-depth discussion of dimension reduction can be found in Chapter 5.

2.1 Feature Selection Methods

Feature selection is more common and easier to apply in the problem of text categorization [99], in which supervision is available for the feature selection process. However, a number of simple unsupervised methods can also be used for feature selection in text clustering. Some examples of such methods are discussed below.

2.1.1 Document Frequency-based Selection. The simplest possible method for feature selection in document clustering is the use of document frequency to filter out irrelevant features. While the use of inverse document frequencies reduces the importance of such words, this may not alone be sufficient to reduce the noise effects of very frequent words. In other words, words which are too frequent in the corpus can be removed because they are typically common words such as "a", "an", "the", or "of", which are not discriminative from a clustering perspective. Such words are also referred to as stop words. A variety of methods are commonly available in the literature [76] for stop-word removal. Typically, commonly available stop word lists of about 300 to 400 words are used for the retrieval process. In addition, words which occur extremely infrequently can also be removed from the collection.
This is because such words do not add anything to the similarity computations which are used in most clustering methods. In some cases, such words may be misspellings or typographical errors in documents. Noisy text collections which are derived from the web, blogs or social networks are more likely to contain such terms. We note that some lines of research define document frequency based selection purely on the basis of very infrequent terms, because these terms contribute the least to the similarity calculations. However, it should be emphasized that very frequent words should also be removed, especially if they are not discriminative between clusters. Note that the TF-IDF weighting method can also naturally filter out very common words in a "soft" way. Clearly, the standard set of stop words provides a valid set of words to prune. Nevertheless, we would like a way of quantifying the importance of a term directly to the clustering process, which is essential for more aggressive pruning. We will discuss a number of such methods below.

2.1.2 Term Strength. A much more aggressive technique for stop-word removal is proposed in [94]. The core idea of this approach is to extend techniques which are used in supervised learning to the unsupervised case. The term strength is essentially used to measure how informative a word is for identifying two related documents. For example, for two related documents x and y, the term strength s(t) of term t is defined in terms of the following probability:

s(t) = P(t ∈ y | t ∈ x)    (4.1)

Clearly, the main issue is how one might define the documents x and y as related. One possibility is to use manual (or user) feedback to define when a pair of documents is related. This is essentially equivalent to utilizing supervision in the feature selection process, and may be practical in situations in which predefined categories of documents are available. On the other hand, it is not practical to manually create related pairs in large collections in a comprehensive way. It is therefore desirable to use an automated and purely unsupervised way to define the concept of when a pair of documents is related. It has been shown in [94] that it is possible to use automated similarity functions such as the cosine function [81] to define the relatedness of document pairs. A pair of documents is defined to be related if their cosine similarity is above a user-defined threshold. In such cases, the term strength s(t) can be estimated by randomly sampling a number of pairs of such related documents as follows:

s(t) = (Number of pairs in which t occurs in both documents) / (Number of pairs in which t occurs in the first document of the pair)    (4.2)

Here, the first document of the pair may simply be picked randomly. In order to prune features, the term strength may be compared to the expected strength of a term which is randomly distributed in the training documents with the same frequency. If the term strength of t is not at least two standard deviations greater than that of the random word, then it is removed from the collection.

One advantage of this approach is that it requires no initial supervision or training data for the feature selection, which is a key requirement in the unsupervised scenario. Of course, the approach can also be used for feature selection in either supervised clustering [4] or categorization [100], when such training data is indeed available.
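The following is a small illustrative sketch (not from the original chapter) of how the sampling-based estimate in equation (4.2) might be computed on top of normalized term-weight vectors; the similarity threshold, the sample size, and the helper names are assumptions made for illustration.

```python
import random
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two sparse (already L2-normalized) term-weight dictionaries."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def term_strength(docs, vectors, threshold=0.5, num_samples=10000, seed=0):
    """Estimate s(t) of equation (4.2) by sampling ordered pairs of related documents.

    docs:    list of token lists (used to test term presence)
    vectors: matching list of normalized term-weight dictionaries (used for similarity)
    """
    rng = random.Random(seed)
    first_count = defaultdict(int)   # pairs in which t occurs in the first document
    both_count = defaultdict(int)    # pairs in which t occurs in both documents

    n = len(docs)
    for _ in range(num_samples):
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j or cosine(vectors[i], vectors[j]) < threshold:
            continue  # keep only "related" pairs
        second = set(docs[j])
        for t in set(docs[i]):
            first_count[t] += 1
            if t in second:
                both_count[t] += 1

    return {t: both_count[t] / c for t, c in first_count.items() if c > 0}

if __name__ == "__main__":
    docs = [["text", "clustering"], ["text", "clustering", "survey"], ["image", "retrieval"]]
    # Unit-norm "vectors" built by hand for the demo; in practice TF-IDF weights would be used.
    vectors = [{t: 1.0 / len(set(d)) ** 0.5 for t in set(d)} for d in docs]
    print(term_strength(docs, vectors, threshold=0.3, num_samples=200))
```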
One observation about this approach to feature selection is that it is particularly suited to similarity-based clustering, because the discriminative nature of the underlying features is defined on the basis of similarities in the documents themselves.

2.1.3 Entropy-based Ranking. The entropy-based ranking approach was proposed in [27]. In this case, the quality of a term is measured by the entropy reduction when it is removed. Here the entropy E(t) of the term t in a collection of n documents is defined as follows:

E(t) = - \sum_{i=1}^{n} \sum_{j=1}^{n} ( S_{ij} · log(S_{ij}) + (1 - S_{ij}) · log(1 - S_{ij}) )    (4.3)

Here S_{ij} ∈ (0, 1) is the similarity between the ith and jth documents in the collection, after the term t is removed, and is defined as follows:

S_{ij} = 2^{ -dist(i,j) / \overline{dist} }    (4.4)

Here dist(i, j) is the distance between documents i and j after the term t is removed, and \overline{dist} is the average distance between the documents after the term t is removed. We note that the computation of E(t) for each term t requires O(n^2) operations. This is impractical for a very large corpus containing many terms. It has been shown in [27] how this method may be made much more efficient with the use of sampling methods.

2.1.4 Term Contribution. The concept of term contribution [62] is based on the fact that the results of text clustering are highly dependent on document similarity. Therefore, the contribution of a term can be viewed as its contribution to document similarity. For example, in the case of dot-product based similarity, the similarity between two documents is defined as the dot product of their normalized frequencies. Therefore, the contribution of a term to the similarity of two documents is the product of its normalized frequencies in the two documents. This needs to be summed over all pairs of documents in order to determine the term contribution. As in the previous case, this method requires O(n^2) time for each term, and therefore sampling methods may be required to speed up the computation. A major criticism of this method is that it tends to favor highly frequent words without regard to their specific discriminative power within a clustering process.

In most of these methods, the optimization of term selection is based on some pre-assumed similarity function (e.g., cosine). While this strategy makes these methods unsupervised, there is a concern that the term selection might be biased due to the potential bias of the assumed similarity function. That is, if a different similarity function is assumed, we may end up having different results for term selection. Thus the choice of an appropriate similarity function may be important for these methods.

2.2 LSI-based Methods

In feature selection, we attempt to explicitly select features from the original data set. Feature transformation is a different method, in which the new features are defined as a functional representation of the features in the original data set. The most common class of methods is that of dimensionality reduction [53], in which the documents are transformed to a new feature space of smaller dimensionality, in which the features are typically a linear combination of the features in the original data. Methods such as Latent Semantic Indexing (LSI) [28] are based on this common principle. The overall effect is to remove the many dimensions in the data which are noisy for similarity-based applications such as clustering. The removal of such dimensions also helps magnify the semantic effects in the underlying data.
Since LSI is closely related to the problem of Principal Component Analysis (PCA), or equivalently the Singular Value Decomposition (SVD), we will first discuss this method and its relationship to LSI. For a d-dimensional data set, PCA constructs the symmetric d × d covariance matrix C of the data, in which the (i, j)th entry is the covariance between dimensions i and j. This matrix is positive semi-definite, and can be diagonalized as follows:

C = P · D · P^T    (4.5)

Here P is a matrix whose columns contain the orthonormal eigenvectors of C, and D is a diagonal matrix containing the corresponding eigenvalues. We note that the eigenvectors represent a new orthonormal basis system along which the data can be represented. In this context, the eigenvalues correspond to the variance when the data is projected along this basis system. This basis system is also one in which the second-order covariances of the data are removed, and most of the variance in the data is captured by preserving the eigenvectors with the largest eigenvalues. Therefore, in order to reduce the dimensionality of the data, a common approach is to represent the data in this new basis system, which is further truncated by ignoring those eigenvectors for which the corresponding eigenvalues are small. This is because the variances along those dimensions are small, and the relative behavior of the data points is not significantly affected by removing them from consideration. In fact, it can be shown that the Euclidean distances between data points are not significantly affected by this transformation and the corresponding truncation. The method of PCA is commonly used for similarity search in database retrieval applications.

LSI is quite similar to PCA, except that we use an approximation of the covariance matrix C which is quite appropriate for the sparse and high-dimensional nature of text data. Specifically, let A be the n × d term-document matrix in which the (i, j)th entry is the normalized frequency of term j in document i. Then, A^T · A is a d × d matrix which is a close (scaled) approximation of the covariance matrix, in which the means have not been subtracted out. In other words, the value of A^T · A would be the same as a scaled version (by a factor of n) of the covariance matrix, if the data were mean-centered. While text representations are not mean-centered, the sparsity of text ensures that the use of A^T · A is quite a good approximation of the (scaled) covariances. As in the case of numerical data, we use the eigenvectors of A^T · A with the largest eigenvalues in order to represent the text. In typical collections, only about 300 to 400 eigenvectors are required for the representation. One excellent characteristic of LSI [28] is that the truncation of the dimensions removes the noise effects of synonymy and polysemy, and the similarity computations are more closely affected by the semantic concepts in the data. This is particularly useful for a semantic application such as text clustering. However, if finer granularity clustering is needed, such low-dimensional representations of text may not be sufficiently discriminative; in information retrieval, this problem is often solved by mixing the low-dimensional representation with the original high-dimensional word-based representation (see, e.g., [105]).

A similar technique to LSI, but based on probabilistic modeling, is Probabilistic Latent Semantic Analysis (PLSA) [49]. The similarity and equivalence of PLSA and LSI are discussed in [49].
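The following is a minimal sketch of the LSI projection described above, under the assumptions that documents are rows of a sparse matrix A and that a truncated SVD is used to obtain the leading eigenvectors of A^T · A; the random toy matrix and the choice of k are purely illustrative.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

def lsi_project(A, k=300):
    """Project the rows of the n x d term-document matrix A into a k-dimensional
    latent space using a truncated SVD (A ~= U * diag(s) * Vt).

    The rows of Vt span the dominant eigenvectors of A^T A, so A @ Vt.T gives the
    reduced representation of the documents.
    """
    k = min(k, min(A.shape) - 1)          # svds requires k < min(n, d)
    u, s, vt = svds(A, k=k)               # truncated singular triplets
    order = np.argsort(-s)                # re-order so the largest singular value comes first
    return (A @ vt[order].T), vt[order]

if __name__ == "__main__":
    # A small random sparse "term-document" matrix stands in for real normalized data.
    A = sparse_random(100, 500, density=0.02, format="csr", random_state=0)
    docs_k, basis = lsi_project(A, k=20)
    print(docs_k.shape, basis.shape)      # (100, 20) (500, 20)
```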
2.2.1 Concept Decomposition using Clustering. One interesting observation is that while feature transformation is often used as a pre-processing technique for clustering, the clustering itself can be used for a novel dimensionality reduction technique known as concept decomposition [2, 29]. This of course leads to the issue of circularity in the use of this technique for clustering, especially if clustering is required in order to perform the dimensionality reduction. Nevertheless, it is still possible to use this technique effectively for pre-processing with the use of two separate phases of clustering. The technique of concept decomposition uses any standard clustering technique [2, 29] on the original representation of the documents. The frequent terms in the centroids of these clusters are used as basis vectors which are almost orthogonal to one another. The documents can then be represented in a much more concise way in terms of these basis vectors. We note that this condensed conceptual representation allows for enhanced clustering as well as classification. Therefore, a second phase of clustering can be applied on this reduced representation in order to cluster the documents much more effectively. Such a method has also been tested in [87] by using word-clusters in order to represent documents. We will describe this method in more detail later in this chapter.

2.3 Non-negative Matrix Factorization

The non-negative matrix factorization (NMF) technique is a latent-space method, and is particularly suitable for clustering [97]. As in the case of LSI, the NMF scheme represents the documents in a new axis system which is based on an analysis of the term-document matrix. However, the NMF method has a number of critical differences from the LSI scheme from a conceptual point of view. In particular, the NMF scheme is a feature transformation method which is particularly suited to clustering. The main conceptual characteristics of the NMF scheme, which are very different from LSI, are as follows:

In LSI, the new basis system consists of a set of orthonormal vectors. This is not the case for NMF.

In NMF, the vectors in the basis system directly correspond to cluster topics. Therefore, the cluster membership for a document may be determined by examining the largest component of the document along any of the vectors. The coordinate of any document along a vector is always non-negative. The expression of each document as an additive combination of the underlying semantics makes a lot of sense from an intuitive perspective. Therefore, the NMF transformation is particularly suited to clustering, and it also provides an intuitive understanding of the basis system in terms of the clusters.

Let A be the n × d term-document matrix. Let us assume that we wish to create k clusters from the underlying document corpus. Then, the non-negative matrix factorization method attempts to determine the matrices U and V which minimize the following objective function:

J = (1/2) · ||A - U · V^T||    (4.6)

Here || · || represents the sum of the squares of all the elements in the matrix (the squared Frobenius norm), U is an n × k non-negative matrix, and V is a d × k non-negative matrix. We note that the columns of V provide the k basis vectors which correspond to the k different clusters. What is the significance of the above optimization problem?
Note that by minimizing J, we are attempting to factorize A approximately as:

A ≈ U · V^T    (4.7)

For each row a of A (a document vector), we can rewrite the above equation as:

a ≈ u · V^T    (4.8)

Here u is the corresponding row of U. Therefore, the document vector a can be rewritten as an approximate linear (non-negative) combination of the basis vectors which correspond to the k columns of V^T. If the value of k is relatively small compared to the corpus, this can only be done if the column vectors of V^T discover the latent structure in the data. Furthermore, the non-negativity of the matrices U and V ensures that the documents are expressed as a non-negative combination of the key concepts (or clustered regions) in the term-based feature space.

Next, we will discuss how the optimization problem for J above is actually solved. The squared norm of any matrix Q can be expressed as the trace of the matrix Q · Q^T. Therefore, we can express the objective function above as follows:

J = (1/2) · tr( (A - U · V^T) · (A - U · V^T)^T )
  = (1/2) · tr(A · A^T) - tr(A · V · U^T) + (1/2) · tr(U · V^T · V · U^T)

Thus, we have an optimization problem with respect to the matrices U = [u_{ij}] and V = [v_{ij}], the entries u_{ij} and v_{ij} of which are the variables with respect to which we need to optimize this problem. In addition, since the matrices are non-negative, we have the constraints that u_{ij} ≥ 0 and v_{ij} ≥ 0. This is a typical constrained non-linear optimization problem, and can be solved using the Lagrange method. Let α = [α_{ij}] and β = [β_{ij}] be matrices with the same dimensions as U and V respectively. The elements of the matrices α and β are the corresponding Lagrange multipliers for the non-negativity conditions on the different elements of U and V respectively. We note that tr(α · U^T) is simply equal to Σ_{i,j} α_{ij} · u_{ij}, and tr(β · V^T) is simply equal to Σ_{i,j} β_{ij} · v_{ij}. These correspond to the Lagrange expressions for the non-negativity constraints. Then, we can express the Lagrangian optimization problem as follows:

L = J + tr(α · U^T) + tr(β · V^T)    (4.9)

Then, we can express the partial derivatives of L with respect to U and V as follows, and set them to 0:

∂L/∂U = -A · V + U · V^T · V + α = 0
∂L/∂V = -A^T · U + V · U^T · U + β = 0

We can then multiply the (i, j)th entry of the above (two matrices of) conditions with u_{ij} and v_{ij} respectively. Using the Kuhn-Tucker conditions α_{ij} · u_{ij} = 0 and β_{ij} · v_{ij} = 0, we get the following:

(A · V)_{ij} · u_{ij} - (U · V^T · V)_{ij} · u_{ij} = 0
(A^T · U)_{ij} · v_{ij} - (V · U^T · U)_{ij} · v_{ij} = 0

We note that these conditions are independent of α and β. This leads to the following iterative update rules for u_{ij} and v_{ij}:

u_{ij} ← u_{ij} · (A · V)_{ij} / (U · V^T · V)_{ij}
v_{ij} ← v_{ij} · (A^T · U)_{ij} / (V · U^T · U)_{ij}

It has been shown in [58] that the objective function continuously improves under these update rules, and converges to an optimal solution.
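A compact numpy sketch of the multiplicative update rules derived above is shown below; it is illustrative only, and the random initialization, the small epsilon guarding against division by zero, and the fixed iteration count are assumptions not prescribed by the text.

```python
import numpy as np

def nmf(A, k, num_iters=200, eps=1e-9, seed=0):
    """Factor a non-negative n x d matrix A as A ~= U @ V.T using the
    multiplicative update rules sketched in the text.

    Returns U (n x k, document-cluster affinities) and V (d x k, cluster topics).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    U = rng.random((n, k)) + eps
    V = rng.random((d, k)) + eps

    for _ in range(num_iters):
        # u_ij <- u_ij * (A V)_ij / (U V^T V)_ij
        U *= (A @ V) / (U @ (V.T @ V) + eps)
        # v_ij <- v_ij * (A^T U)_ij / (V U^T U)_ij
        V *= (A.T @ U) / (V @ (U.T @ U) + eps)

    return U, V

if __name__ == "__main__":
    A = np.random.default_rng(1).random((20, 50))   # toy non-negative data
    U, V = nmf(A, k=4)
    labels = U.argmax(axis=1)       # cluster of each document: its largest coordinate
    error = np.linalg.norm(A - U @ V.T)
    print(labels, round(float(error), 3))
```

The cluster assignment in the usage example follows the observation above that the largest component of a document along the basis vectors indicates its cluster membership.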
One interesting observation about the matrix factorization technique is that it can also be used to determine word clusters instead of document clusters. Just as the columns of V provide a basis which can be used to discover document clusters, we can use the columns of U to discover a basis which corresponds to word clusters. As we will see later, document clusters and word clusters are closely related, and it is often useful to discover both simultaneously, as in frameworks such as co-clustering [30, 31, 75]. Matrix factorization provides a natural way of achieving this goal. It has also been shown both theoretically and experimentally [33, 93] that the matrix factorization technique is equivalent to another graph-structure based document clustering technique known as spectral clustering. An analogous technique called concept factorization was proposed in [98], which can also be applied to data points with negative values in them.

3. Distance-based Clustering Algorithms

Distance-based clustering algorithms are designed by using a similarity function to measure the closeness between the text objects. The most well known similarity function which is used commonly in the text domain is the cosine similarity function. Let U = (f(u_1) . . . f(u_k)) and V = (f(v_1) . . . f(v_k)) be the damped and normalized frequency term vectors of two different documents U and V. The values u_1 . . . u_k and v_1 . . . v_k represent the (normalized) term frequencies, and the function f(·) represents the damping function. Typical damping functions for f(·) could be either the square root or the logarithm [25]. Then, the cosine similarity between the two documents is defined as follows:

cosine(U, V) = [ \sum_{i=1}^{k} f(u_i) · f(v_i) ] / [ \sqrt{\sum_{i=1}^{k} f(u_i)^2} · \sqrt{\sum_{i=1}^{k} f(v_i)^2} ]    (4.10)

Computation of text similarity is a fundamental problem in information retrieval. Although most of the work in information retrieval has focused on how to assess the similarity of a keyword query and a text document, rather than the similarity between two documents, many weighting heuristics and similarity functions can also be applied to optimize the similarity function for clustering. Effective information retrieval models generally capture three heuristics, i.e., TF weighting, IDF weighting, and document length normalization [36]. One effective way to assign weights to terms when representing a document as a weighted term vector is the BM25 term weighting method [78], where the normalized TF not only addresses length normalization, but also has an upper bound which improves robustness, as it avoids overly rewarding the matching of any particular term. A document can also be represented with a probability distribution over words (i.e., a unigram language model), and the similarity can then be measured based on an information-theoretic measure such as cross entropy or Kullback-Leibler divergence [105]. For clustering, symmetric variants of such a similarity function may be more appropriate.

One challenge in clustering short segments of text (e.g., tweets or sentences) is that exact keyword matching may not work well. One general strategy for solving this problem is to expand the text representation by exploiting related text documents, which is related to smoothing of a document language model in information retrieval [105]. A specific technique, which leverages a search engine to expand the text representation, was proposed in [79]. A comparison of several simple measures for computing the similarity of short text segments can be found in [66]. These similarity functions can be used in conjunction with a wide variety of traditional clustering algorithms [50, 54]. In the next subsections, we will discuss some of these techniques.
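As a minimal sketch of the damped cosine measure of equation (4.10), the snippet below applies square-root damping to raw term frequencies; the damping choice and function names are assumptions for illustration, and any of the alternative weighting schemes mentioned above could be substituted.

```python
import math
from collections import Counter

def damped_cosine(doc_x, doc_y, damping=math.sqrt):
    """Cosine similarity of equation (4.10) between two token lists,
    with a sub-linear damping function f applied to the raw term frequencies."""
    fx = {t: damping(c) for t, c in Counter(doc_x).items()}
    fy = {t: damping(c) for t, c in Counter(doc_y).items()}
    dot = sum(w * fy.get(t, 0.0) for t, w in fx.items())
    norm_x = math.sqrt(sum(w * w for w in fx.values()))
    norm_y = math.sqrt(sum(w * w for w in fy.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

print(damped_cosine("text clustering of text data".split(),
                    "clustering of data streams".split()))
```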
3.1 Agglomerative and Hierarchical Clustering Algorithms

Hierarchical clustering algorithms have been studied extensively in the clustering literature [50, 54] for records of different kinds, including multidimensional numerical data, categorical data and text data. An overview of the traditional agglomerative and hierarchical clustering algorithms in the context of text data is provided in [69, 70, 92, 96]. An experimental comparison of different hierarchical clustering algorithms may be found in [110]. The method of agglomerative hierarchical clustering is particularly useful to support a variety of searching methods, because it naturally creates a tree-like hierarchy which can be leveraged for the search process. In particular, the effectiveness of this method in improving search efficiency over a sequential scan has been shown in [51, 77].

The general concept of agglomerative clustering is to successively merge documents into clusters based on their similarity with one another. Almost all hierarchical clustering algorithms successively merge groups based on the best pairwise similarity between these groups of documents. The main differences between these classes of methods are in terms of how this pairwise similarity is computed between the different groups of documents. For example, the similarity between a pair of groups may be computed as the best-case similarity, average-case similarity, or worst-case similarity between documents which are drawn from these pairs of groups. Conceptually, the process of agglomerating documents into successively higher levels of clusters creates a cluster hierarchy (or dendrogram), for which the leaf nodes correspond to individual documents, and the internal nodes correspond to the merged groups of clusters. When two groups are merged, a new node is created in this tree corresponding to this larger merged group. The two children of this node correspond to the two groups of documents which have been merged to create it. The different methods for merging groups of documents in the different agglomerative methods are as follows:

Single Linkage Clustering: In single linkage clustering, the similarity between two groups of documents is the greatest similarity between any pair of documents from these two groups. In single linkage clustering, we merge the two groups whose closest pair of documents has the highest similarity compared to any other pair of groups. The main advantage of single linkage clustering is that it is extremely efficient to implement in practice. This is because we can first compute all similarity pairs and sort them in order of reducing similarity. These pairs are processed in this pre-defined order, and a merge is performed successively whenever a pair belongs to different groups. It can be easily shown that this approach is equivalent to the single-linkage method. It is also essentially equivalent to a spanning tree algorithm on the complete graph of pairwise distances, which processes the edges of the graph in a certain order. It has been shown in [92] how Prim's minimum spanning tree algorithm can be adapted to single-linkage clustering. Another method in [24] designs the single-linkage method in conjunction with the inverted index method in order to avoid computing zero similarities. The main drawback of this approach is that it can lead to the phenomenon of chaining, in which a chain of similar documents leads to disparate documents being grouped into the same cluster. In other words, if A is similar to B and B is similar to C, it does not always imply that A is similar to C, because of the lack of transitivity in similarity computations. Single linkage clustering encourages the grouping of documents through such transitivity chains.
This can often lead to poor clusters, especially at the higher levels of the agglomeration. Effective methods for implementing single-linkage clustering for the case of document data may be found in [24, 92].

Group-Average Linkage Clustering: In group-average linkage clustering, the similarity between two clusters is the average similarity between the pairs of documents in the two clusters. Clearly, the average linkage clustering process is somewhat slower than single-linkage clustering, because we need to determine the average similarity between a large number of pairs in order to determine group-wise similarity. On the other hand, it is much more robust in terms of clustering quality, because it does not exhibit the chaining behavior of single linkage clustering. It is possible to speed up the average linkage clustering algorithm by approximating the average linkage similarity between two clusters C_1 and C_2 by the similarity between the mean document of C_1 and the mean document of C_2. While this approach does not work equally well for all data domains, it works particularly well for the case of text data. In this case, the running time can be reduced to O(n^2), where n is the total number of nodes. The method can be implemented quite efficiently in the case of document data, because the centroid of a cluster is simply the concatenation of the documents in that cluster.

Complete Linkage Clustering: In this technique, the similarity between two clusters is the worst-case similarity between any pair of documents in the two clusters. Complete-linkage clustering can also avoid chaining, because it avoids the placement of any pair of very disparate points in the same cluster. However, like group-average clustering, it is computationally more expensive than the single-linkage method. The complete linkage clustering method requires O(n^2) space and O(n^3) time. The space requirement can, however, be significantly lower in the case of the text data domain, because a large number of pairwise similarities are zero.

Hierarchical clustering algorithms have also been designed in the context of text data streams. A distributional modeling method for hierarchical clustering of streaming documents has been proposed in [80]. The main idea is to model the frequency of word presence in documents with the use of a multi-Poisson distribution. The parameters of this model are learned in order to assign documents to clusters. The method extends the COBWEB and CLASSIT algorithms [37, 40] to the case of text data. The work in [80] studies the different kinds of distributional assumptions for words in documents. We note that these distributional assumptions are required to adapt these algorithms to the case of text data; the approach essentially changes the distributional assumption so that the method can work effectively for text data.
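As a small illustration of the three linkage criteria described above (not part of the original chapter), the following sketch runs an off-the-shelf agglomerative clustering over dense TF-IDF vectors with cosine distance; the densification of the vectors and the tiny example corpus are simplifying assumptions made to keep the example short.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "text clustering of documents",
    "hierarchical clustering of text data",
    "image segmentation with neural networks",
    "deep neural networks for images",
]

# Dense TF-IDF vectors; for large corpora a sparse pipeline would be preferable.
X = TfidfVectorizer().fit_transform(docs).toarray()

for method in ("single", "average", "complete"):
    Z = linkage(X, method=method, metric="cosine")   # dendrogram as a merge table
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)
```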
3.2 Distance-based Partitioning Algorithms

Partitioning algorithms are widely used in the database literature in order to efficiently create clusters of objects. The two most widely used distance-based partitioning algorithms [50, 54] are as follows:

k-medoid clustering algorithms: In k-medoid clustering algorithms, we use a set of points from the original data as the anchors (or medoids) around which the clusters are built. The key aim of the algorithm is to determine an optimal set of representative documents from the original corpus around which the clusters are built. Each document is assigned to its closest representative from the collection. This creates a running set of clusters from the corpus, which are successively improved by a randomized process. The algorithm works with an iterative approach in which the set of k representatives is successively improved with the use of randomized interchanges. Specifically, we use the average similarity of each document in the corpus to its closest representative as the objective function which needs to be improved during this interchange process. In each iteration, we replace a randomly picked representative in the current set of medoids with a randomly picked document from the collection, if doing so improves the clustering objective function. This approach is applied until convergence is achieved. There are two main disadvantages of the use of k-medoids based clustering algorithms, one of which is specific to the case of text data. One general disadvantage of k-medoids clustering algorithms is that they require a large number of iterations in order to achieve convergence, and are therefore quite slow. This is because each iteration requires the computation of an objective function whose time requirement is proportional to the size of the underlying corpus. The second key disadvantage is that k-medoid algorithms do not work very well for sparse data such as text. This is because a large fraction of document pairs do not have many words in common, and the similarities between such document pairs are small (and noisy) values. Therefore, a single document medoid often does not contain all the concepts required in order to effectively build a cluster around it. This characteristic is specific to the case of the information retrieval domain, because of the sparse nature of the underlying text data.

k-means clustering algorithms: The k-means clustering algorithm also uses a set of k representatives around which the clusters are built. However, these representatives are not necessarily obtained from the original data, and are refined somewhat differently than in a k-medoids approach. The simplest form of the k-means approach is to start off with a set of k seeds from the original corpus, and assign documents to these seeds on the basis of closest similarity. In the next iteration, the centroid of the points assigned to each seed replaces the seed from the last iteration. In other words, the new seed is defined so that it is a better central point for this cluster. This approach is continued until convergence. One of the advantages of the k-means method over the k-medoids method is that it requires an extremely small number of iterations in order to converge. Observations from [25, 83] seem to suggest that for many large data sets, it is sufficient to use five or fewer iterations for an effective clustering. The main disadvantage of the k-means method is that it is still quite sensitive to the initial set of seeds picked during the clustering. Secondly, the centroid for a given cluster of documents may contain a large number of words. This will slow down the similarity calculations in the next iteration. A number of methods are used to reduce these effects, which will be discussed later on in this chapter. (A short illustrative sketch of the basic k-means iteration appears after this list.)
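The following is a minimal sketch of the basic k-means iteration on cosine-normalized document vectors; the use of unit-length rows, random seed selection, and a fixed small number of iterations are assumptions made for illustration, and the seed-selection refinements discussed next are deliberately omitted.

```python
import numpy as np

def kmeans_cosine(X, k, num_iters=5, seed=0):
    """Spherical k-means on a matrix X whose rows are L2-normalized document vectors.

    Because the rows are unit length, the dot product X @ centroids.T is the
    cosine similarity used in the assignment step.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random seeds

    for _ in range(num_iters):
        # Assignment step: each document goes to its most similar centroid.
        labels = np.argmax(X @ centroids.T, axis=1)
        # Update step: recompute each centroid and re-normalize it.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centroids[j] = c / (np.linalg.norm(c) or 1.0)
    return labels, centroids

if __name__ == "__main__":
    X = np.random.default_rng(1).random((50, 20))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    labels, _ = kmeans_cosine(X, k=4)
    print(np.bincount(labels))
```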
The initial choice of seeds affects the quality of k-means clustering, especially in the case of document clustering. Therefore, a number of techniques are used in order to improve the quality of the initial seeds which are picked for the clustering process. For example, another lightweight clustering method, such as an agglomerative clustering technique, can be used in order to decide the initial set of seeds. This is at the core of the method discussed in [25] for effective document clustering. We will discuss this method in detail in the next subsection. A second method for improving the initial set of seeds is to use some form of partial supervision in the process of initial seed creation. This form of partial supervision can also be helpful in creating clusters which are designed for particular application-specific criteria. An example of such an approach is discussed in [4], in which the initial set of seeds is picked as the centroids of the documents crawled from particular categories of the Yahoo! taxonomy. This also has the effect that the final set of clusters is grouped by the coherence of content within the different Yahoo! categories. The approach has been shown to be quite effective for use in a number of applications such as text categorization. Such semi-supervised techniques are particularly useful for information organization in cases where the starting set of categories is somewhat noisy, but contains enough information in order to create clusters which satisfy a pre-defined kind of organization.

3.3 A Hybrid Approach: The Scatter-Gather Method

While hierarchical clustering methods tend to be more robust because of their tendency to compare all pairs of documents, they are generally not very efficient, because of their tendency to require at least O(n^2) time. On the other hand, k-means type algorithms are more efficient than hierarchical algorithms, but may sometimes not be very effective because of their tendency to rely on a small number of seeds.

The method in [25] uses both hierarchical and partitional clustering algorithms to good effect. Specifically, it uses a hierarchical clustering algorithm on a sample of the corpus in order to find a robust initial set of seeds. This robust set of seeds is used in conjunction with a standard k-means clustering algorithm in order to determine good clusters. The size of the sample in the initial phase is carefully tailored so as to provide the best possible effectiveness without this phase becoming a bottleneck in algorithm execution. There are two alternative methods for creating the initial set of seeds, which are referred to as buckshot and fractionation respectively. They are described as follows:

Buckshot: Let k be the number of clusters to be found and n be the number of documents in the corpus. Instead of picking the k seeds randomly from the collection, the buckshot scheme picks an overestimate √(k·n) of the seeds, and then agglomerates these to k seeds. Standard agglomerative hierarchical clustering algorithms (requiring quadratic time) are applied to this initial sample of √(k·n) seeds. Since we use quadratically scalable algorithms in this phase, this approach requires O(k·n) time. We note that this seed set is much more robust than one which simply samples k seeds, because of the summarization of a large document sample into a robust set of k seeds.

Fractionation: The fractionation algorithm initially breaks up the corpus into n/m buckets, each of size m > k. An agglomerative algorithm is applied to each of these buckets to reduce them by a factor of ν.
Thus, at the end of this phase, we have a total of ν·n agglomerated points. The process is repeated by treating each of these agglomerated points as an individual record. This is achieved by merging the different documents within an agglomerated cluster into a single document. The approach terminates when a total of k seeds remain. We note that the agglomerative clustering of each group of m documents in the first iteration of the fractionation algorithm requires O(m^2) time, which sums to O(n·m) over the n/m different groups. Since the number of individuals reduces geometrically by a factor of ν in each iteration, the total running time over all iterations is O(n·m·(1 + ν + ν^2 + ...)). For constant ν < 1, the running time over all iterations is still O(n·m). By picking m = O(k), we can ensure a running time of O(n·k) for the initialization procedure.

The buckshot and fractionation procedures require O(k·n) time, which is also equivalent to the running time of one iteration of the k-means algorithm. Each iteration of the k-means algorithm also requires O(k·n) time, because we need to compute the similarity of the n documents to the k different seeds.

We further note that the fractionation procedure can be applied to a random grouping of the documents into n/m different buckets. Of course, one can also replace the random grouping approach with a more carefully designed procedure for more effective results. One such procedure is to sort the documents by the index of the jth most common word in the document. Here j is chosen to be a small number, such as 3, which corresponds to medium frequency words in the data. The documents are then partitioned into groups based on this sort order by segmenting out contiguous groups of m documents. This approach ensures that the groups created have at least a few common words in them, and are therefore not completely random. This can sometimes provide a better quality of the centers which are determined by the fractionation algorithm.

Once the initial cluster centers have been determined with the use of the buckshot or fractionation algorithms, we can apply a standard k-means partitioning algorithm. Specifically, each document is assigned to the nearest of the k cluster centers. The centroid of each such cluster is determined as the concatenation of the different documents in the cluster. These centroids replace the sets of seeds from the last iteration. This process can be repeated in an iterative approach in order to successively refine the centers for the clusters. Typically, only a small number of iterations are required, because the greatest improvements occur in the first few iterations.
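To make the two-phase idea concrete, here is a small illustrative sketch (not the original implementation of [25]) of buckshot seeding followed by the k-means style refinement described above; the use of group-average agglomeration on the sample, the cosine-normalized vectors, and the helper names are all assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def buckshot_seeds(X, k, seed=0):
    """Pick sqrt(k*n) random documents, agglomerate them into (at most) k groups
    with group-average linkage on cosine distance, and return the group centroids."""
    rng = np.random.default_rng(seed)
    n = len(X)
    sample = X[rng.choice(n, size=min(n, int(np.sqrt(k * n))), replace=False)]
    labels = fcluster(linkage(sample, method="average", metric="cosine"),
                      t=k, criterion="maxclust")
    seeds = np.vstack([sample[labels == j].mean(axis=0) for j in np.unique(labels)])
    return seeds / np.maximum(np.linalg.norm(seeds, axis=1, keepdims=True), 1e-12)

def scatter_gather(X, k, num_iters=5, seed=0):
    """Buckshot initialization followed by a few k-means refinement passes
    on L2-normalized document vectors X."""
    centers = buckshot_seeds(X, k, seed)
    for _ in range(num_iters):
        labels = np.argmax(X @ centers.T, axis=1)        # assign to the nearest center
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)                  # "concatenate" the cluster
                centers[j] = c / np.linalg.norm(c)
    return labels

if __name__ == "__main__":
    X = np.random.default_rng(2).random((200, 30))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    print(np.bincount(scatter_gather(X, k=5)))
```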
In order to measure the coherence of a group, we compute the self-similarity of a cluster. This self-similarity provides us with an understanding of the underlying coherence. This quantity can be computed either in terms of the similarity of the documents in a cluster to its centroid, or in terms of the similarity of the cluster documents to each other. The split criterion can then be applied selectively only to those clusters which have low self-similarity. This helps in creating more coherent clusters.

Join Operation: The join operation attempts to merge similar clusters into a single cluster. In order to perform the merge, we compute the topical words of each cluster by examining the most frequent words of the centroid. Two clusters are considered similar if there is significant overlap between the topical words of the two clusters.

We note that the method is often referred to as the Scatter-Gather clustering method, but this is more because of how the clustering method has been presented in terms of its use for browsing large collections in the original paper [25]. The scatter-gather approach can be used for organized browsing of large document collections, because it creates a natural hierarchy of similar documents. In particular, a user may wish to browse the hierarchy of clusters in an interactive way in order to understand topics of different levels of granularity in the collection. One possibility is to perform a hierarchical clustering a-priori; however, such an approach has the disadvantage that it is unable to merge and re-cluster related branches of the tree hierarchy on-the-fly when a user may need it. A method for constant-interaction-time browsing with the use of the scatter-gather approach has been presented in [26]. This approach presents the keywords associated with the different clusters to a user. The user may pick one or more of these keywords, which also correspond to one or more clusters. The documents in these clusters are merged and re-clustered to a finer granularity on-the-fly. This finer granularity of clustering is presented to the user for further exploration. The set of documents which is picked by the user for exploration is referred to as the focus set. Next, we explain how this focus set is further explored and re-clustered on the fly in constant time.

The key assumption which enables this approach is the cluster refinement hypothesis. This hypothesis states that documents which belong to the same cluster in a significantly finer granularity partitioning will also occur together in a partitioning with coarser granularity. The first step is to create a hierarchy of the documents in the clusters. A variety of agglomerative algorithms, such as the buckshot method, can be used for this purpose. We note that each (internal) node of this tree can be viewed as a meta-document corresponding to the concatenation of all the documents in the leaves of this subtree. The cluster-refinement hypothesis allows us to work with a smaller set of meta-documents rather than the entire set of documents in a particular subtree. The idea is to pick a constant M which represents the maximum number of meta-documents that we are willing to re-cluster with the use of the interactive approach. The tree nodes in the focus set are then expanded (with priority to the branches with largest degree), to a maximum of M nodes. These M nodes are then re-clustered on-the-fly with the scatter-gather approach.
This requires constant time because of the use of a constant number M of meta-documents in the clustering process. Thus, by working with at most M meta-documents, we implicitly assume the cluster-refinement hypothesis for all nodes of the subtree below the corresponding level. Clearly, a larger value of M does not assume the cluster-refinement hypothesis quite as strongly, but also comes at a higher cost. The details of the algorithm are described in [26]. Some extensions of this approach are also presented in [85], in which it has been shown how this approach can be used to cluster arbitrary corpus subsets of the documents in constant time. Another recent online clustering algorithm called LAIR2 [55] provides constant interaction time for Scatter/Gather browsing. The parallelization of this algorithm is significantly faster than a corresponding version of the buckshot algorithm. It has also been suggested that the LAIR2 algorithm leads to better quality clusters in the data.

3.3.1 Projections for Efficient Document Clustering.

One of the challenges of the scatter-gather algorithm is that, even though the algorithm is designed to balance the running times of the agglomerative and partitioning phases quite well, it sometimes suffers a slowdown in large document collections because of the massive number of distinct terms that a given cluster centroid may contain. Recall that a cluster centroid in the scatter-gather algorithm is defined as the concatenation of all the documents in that cluster. When the number of documents in the cluster is large, this leads to a large number of distinct terms in the centroid, which in turn slows down a number of critical computations such as similarity calculations between documents and cluster centroids. An interesting solution to this problem has been proposed in [83]. The idea is to use the concept of projection in order to reduce the dimensionality of the document representation. Such a reduction in dimensionality leads to significant speedups, because the similarity computations become much more efficient. The work in [83] proposes three kinds of projections:

Global Projection: In global projection, the dimensionality of the original data set is reduced in order to remove the least important (weighted) terms from the data. The weight of a term is defined as the aggregate of the (normalized and damped) frequencies of the term in the documents.

Local Projection: In local projection, the dimensionality of the documents in each cluster is reduced with a locally specific approach for that cluster. Thus, the terms in each cluster centroid are truncated separately. Specifically, the least-weight terms in the different cluster centroids are removed. Thus, the terms removed from each centroid may be different, depending upon their local importance.

Latent Semantic Indexing: In this case, the document space is transformed with an LSI technique, and the clustering is applied to the transformed document space. We note that the LSI technique can also be applied either globally to the whole document collection, or locally to each cluster if desired.

It has been shown in [83] that the projection approaches provide competitive results in terms of effectiveness while retaining an extremely high level of efficiency with respect to all the competing approaches. In this sense, the clustering methods are different from similarity search, because they show little degradation in quality when projections are performed.
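To make the projection idea concrete, the following is a minimal sketch of global and local term truncation for centroid-based document clustering. It is an illustration only, not the exact procedure of [83]: the simple frequency-based weighting and the number of retained terms are assumptions.

```python
# A minimal sketch of term truncation ("projection") for centroid-based
# clustering. X is a sparse n x d document-term matrix; a centroid is a
# dense d-dimensional term-weight vector. The number of retained terms is
# an assumed parameter.
import numpy as np
from scipy.sparse import csr_matrix

def global_projection(X, num_terms):
    """Keep only the globally heaviest terms (columns) of X."""
    term_weights = np.asarray(X.sum(axis=0)).ravel()      # aggregate weight per term
    keep = np.sort(np.argsort(term_weights)[-num_terms:])
    return X[:, keep], keep

def local_projection(centroid, num_terms):
    """Keep only the heaviest terms of a single cluster centroid."""
    centroid = np.asarray(centroid, dtype=float).copy()
    if np.count_nonzero(centroid) > num_terms:
        cutoff = np.sort(centroid)[-num_terms]            # weight of the num_terms-th heaviest term
        centroid[centroid < cutoff] = 0.0
    return centroid

# Example: project a tiny random corpus down to its 100 heaviest terms.
X = csr_matrix(np.random.rand(5, 300))
X_reduced, kept_terms = global_projection(X, 100)
```

In practice, the truncation is applied to the centroids (local projection) or to the corpus representation (global projection) before the similarity computations of each scatter-gather phase, which is where the speedup comes from.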
One of the reasons for this robustness is that clustering is a much coarser-grained application than similarity search, and therefore there is no perceptible difference in quality even when we work with a truncated feature space.

4. Word and Phrase-based Clustering

Since text documents are drawn from an inherently high-dimensional domain, it can be useful to view the problem in a dual way, in which important clusters of words may be found and utilized for finding clusters of documents. In a corpus containing d terms and n documents, one may view a term-document matrix as an n × d matrix, in which the (i, j)th entry is the frequency of the jth term in the ith document. This matrix is extremely sparse, since a given document contains an extremely small fraction of the universe of words. We note that the problem of clustering rows in this matrix is that of clustering documents, whereas that of clustering columns is that of clustering words. In reality, the two problems are closely related, as good clusters of words may be leveraged in order to find good clusters of documents and vice-versa. For example, the work in [16] determines frequent itemsets of words in the document collection, and uses them to determine compact clusters of documents. This is somewhat analogous to the use of clusters of words [87] for determining clusters of documents. The most general technique for simultaneous word and document clustering is referred to as co-clustering [30, 31]. This approach simultaneously clusters the rows and columns of the term-document matrix, in order to create such clusters. This can also be considered equivalent to the problem of re-ordering the rows and columns of the term-document matrix so as to create dense rectangular blocks of non-zero entries in this matrix. In some cases, the ordering information among words may be used in order to determine good clusters. The work in [103] determines the frequent phrases in the collection and leverages them in order to determine document clusters.

It is important to understand that word clustering and document clustering are essentially dual problems which are closely related to one another. The former is related to dimensionality reduction, whereas the latter is related to traditional clustering. The boundary between the two problems is quite fluid, because good word clusters provide hints for finding good document clusters and vice-versa. For example, a more general probabilistic framework which determines word clusters and document clusters simultaneously is referred to as topic modeling [49]. Topic modeling is a more general framework than either clustering or dimensionality reduction. We will introduce the method of topic modeling in a later section of this chapter. A more detailed treatment is also provided in the next chapter of this book, which is on dimensionality reduction, and in Chapter 8, where a more general discussion of probabilistic models for text mining is given.

4.1 Clustering with Frequent Word Patterns

Frequent pattern mining [8] is a technique which has been widely used in the data mining literature in order to determine the most relevant patterns in transactional data. The clustering approach in [16] is designed on the basis of such frequent pattern mining algorithms. A frequent itemset in the context of text data is also referred to as a frequent term set, because we are dealing with documents rather than transactions.
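To make this correspondence concrete, the following is a small illustrative sketch of treating documents as "transactions" of distinct terms, so that frequent term sets can be mined with ordinary itemset-counting techniques. The toy corpus, the minimum support, and the restriction to term sets of size at most two are assumptions made purely for illustration.

```python
# A small sketch of treating documents as "transactions" of distinct terms,
# so that frequent term sets can be mined by simple support counting.
# The toy corpus, min_support and max_size values are illustrative assumptions.
from itertools import combinations
from collections import Counter

docs = ["stars and planets", "planets orbit stars",
        "stock market prices", "market prices fall"]
transactions = [frozenset(d.split()) for d in docs]   # one term set per document

def frequent_term_sets(transactions, min_support=2, max_size=2):
    counts = Counter()
    for terms in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(terms), size):
                counts[frozenset(itemset)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

print(frequent_term_sets(transactions))
# e.g. frozenset({'planets', 'stars'}) and frozenset({'market', 'prices'})
# appear with support 2 and already suggest two document groups.
```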
The main idea of the approach is to not cluster the high-dimensional document data set directly, but to consider low-dimensional frequent term sets as cluster candidates. This essentially means that a frequent term set is a description of a cluster which corresponds to all the documents containing that frequent term set. Since a frequent term set can be considered a description of a cluster, a set of carefully chosen frequent term sets can be considered a clustering. The appropriate choice of this set of frequent term sets is defined on the basis of the overlaps between the supporting documents of the different frequent term sets. The notion of clustering defined in [16] does not necessarily use a strict partitioning in order to define the clusters of documents, but allows a certain level of overlap. This is a natural property of many term- and phrase-based clustering algorithms, because one does not directly control the assignment of documents to clusters during the algorithm execution. Allowing some level of overlap between clusters may sometimes be more appropriate, because it recognizes the fact that documents are complex objects and it is impossible to cleanly partition documents into specific clusters, especially when some of the clusters are partially related to one another. The clustering definition of [16] assumes that each document is covered by at least one frequent term set.

Let R be the set of chosen frequent term sets which define the clustering. Let f_i be the number of frequent term sets in R which are contained in the ith document. The value of f_i is at least one in order to ensure complete coverage, but we would otherwise like it to be as low as possible in order to minimize overlap. Therefore, we would like the average value of (f_i − 1) for the documents in a given cluster to be as low as possible. We can compute the average value of (f_i − 1) for the documents in the cluster and try to pick frequent term sets such that this value is as low as possible. However, such an approach would tend to favor frequent term sets containing very few terms. This is because if a term set contains m terms, then all subsets of it would also be covered by the document, as a result of which this standard overlap measure would be increased. For this reason, an entropy-based measure of overlap is used instead. The entropy overlap of a given cluster candidate is essentially the sum of the values of −(1/f_i) · log(1/f_i) over all documents in the cluster. This value is 0 when each document has f_i = 1, and increases monotonically with increasing f_i values.

It then remains to describe how the frequent term sets are selected from the collection. Two algorithms are described in [16], one of which corresponds to a flat clustering, and the other to a hierarchical clustering. We first describe the method for flat clustering. Clearly, the search space of frequent term sets is exponential, and therefore a reasonable solution is to utilize a greedy algorithm to select the frequent term sets. In each iteration of the greedy algorithm, we pick the frequent term set with a cover having the minimum overlap with the other cluster candidates; a sketch of this greedy selection is given below. The documents covered by the selected frequent term set are removed from the database, and the overlap in the next iteration is computed with respect to the remaining documents. The hierarchical version of the algorithm is similar to the broad idea in flat clustering, with the main difference that each level of the clustering is applied to a set of term sets containing a fixed number k of terms.
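The following is a minimal sketch of the flat greedy selection just described, under the assumption that the candidate frequent term sets (and the sets of documents covering them) have already been mined; the exact bookkeeping of the published algorithm is not reproduced.

```python
# A sketch of the flat greedy selection of cluster candidates using an
# entropy-based overlap measure. `candidates` maps each frequent term set
# (a frozenset of terms) to the set of ids of documents that contain it;
# mining these candidates is assumed to have been done already.
import math

def entropy_overlap(cover, f):
    """Sum of -(1/f_d) * log(1/f_d) over the documents d in a candidate's
    cover, where f_d counts how many remaining candidates cover d."""
    return sum(-(1.0 / f[d]) * math.log(1.0 / f[d]) for d in cover if f.get(d, 0) > 0)

def greedy_select(candidates, num_docs):
    candidates = dict(candidates)            # work on a copy
    remaining = set(range(num_docs))
    chosen = []
    while remaining and candidates:
        # f[d]: number of still-unchosen candidates covering document d
        f = {d: 0 for d in remaining}
        for cover in candidates.values():
            for d in cover & remaining:
                f[d] += 1
        # pick the candidate whose (remaining) cover has minimum entropy overlap
        best = min(candidates,
                   key=lambda ts: entropy_overlap(candidates[ts] & remaining, f))
        chosen.append(best)
        remaining -= candidates.pop(best)    # covered documents are removed
    return chosen
```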
In the hierarchical version, we thus work only with frequent patterns of length k for the selection process at a given level. The resulting clusters are then further partitioned by applying the approach to (k + 1)-term sets. For further partitioning a given cluster, we use only those (k + 1)-term sets which contain the frequent k-term set defining that cluster. More details of the approach may be found in [16].

4.2 Leveraging Word Clusters for Document Clusters

A two-phase clustering procedure is discussed in [87], which uses the following steps to perform document clustering:

In the first phase, we determine word clusters from the documents in such a way that most of the mutual information between words and documents is preserved when we represent the documents in terms of word clusters rather than words.

In the second phase, we use the condensed representation of the documents in terms of word clusters in order to perform the final document clustering. Specifically, we replace the word occurrences in documents with word-cluster occurrences in order to perform the document clustering.

One advantage of this two-phase procedure is the significant reduction in the noise in the representation. Let X = x_1 ... x_n be the random variables corresponding to the rows (documents), and let Y = y_1 ... y_d be the random variables corresponding to the columns (words). We would like to partition X into k clusters, and Y into l clusters. Let the clusters be denoted by ˆX = ˆx_1 ... ˆx_k and ˆY = ˆy_1 ... ˆy_l. In other words, we wish to find the maps C_X and C_Y which define the clustering:

C_X : x_1 ... x_n → ˆx_1 ... ˆx_k
C_Y : y_1 ... y_d → ˆy_1 ... ˆy_l

In the first phase of the procedure, we cluster Y to ˆY, so that most of the information in I(X, Y) is preserved in I(X, ˆY). In the second phase, we perform the clustering again from X to ˆX using exactly the same procedure, so that as much information as possible from I(X, ˆY) is preserved in I(ˆX, ˆY). Details of how each phase of the clustering is performed are provided in [87].

How to discover interesting word clusters (which can be leveraged for document clustering) has itself attracted attention in the natural language processing research community, with particular interest in discovering word clusters that can characterize word senses [34] or a semantic concept [21]. In [34], for example, the Markov clustering algorithm was applied to discover corpus-specific word senses in an unsupervised way. Specifically, a word association graph is first constructed in which related words are connected with an edge. For a given word that potentially has multiple senses, we can then isolate the subgraph representing its neighbors. These neighbors are expected to form clusters according to different senses of the target word; thus, by grouping together neighbors that are well connected with each other, we can discover word clusters that characterize different senses of the target word. In [21], an n-gram class language model was proposed to cluster words based on minimizing the loss of mutual information between adjacent words, which can achieve the effect of grouping together words that share similar context in natural language text.

4.3 Co-clustering Words and Documents

In many cases, it is desirable to simultaneously cluster the rows and columns of the contingency table, and to explore the interplay between word clusters and document clusters during the clustering process.
Since the clusters among words and documents are clearly related, it is often desirable to cluster both simultaneously, even when clusters are required along only one of the two dimensions. Such an approach is referred to as co-clustering [30, 31]. Co-clustering is defined as a pair of maps from rows to row-cluster indices and from columns to column-cluster indices. These maps are determined simultaneously by the algorithm in order to optimize the corresponding cluster representations. We further note that the matrix factorization approach [58] discussed earlier in this chapter can naturally be used for co-clustering, because it discovers word clusters and document clusters simultaneously. In that section, we also discussed how matrix factorization can be viewed as a co-clustering technique. While matrix factorization has not been widely used as a technique for co-clustering, we point out this natural connection as a possible direction for future comparison with other co-clustering methods. Some recent work [60] has shown how matrix factorization can be used in order to transform knowledge from word space to document space in the context of document clustering techniques.

The problem of co-clustering is also closely related to the problems of subspace clustering [7] and projected clustering [5] for quantitative data in the database literature. In these problems, the data is clustered by simultaneously associating it with a set of points and subspaces in multi-dimensional space. The concept of co-clustering is a natural application of this broad idea to data domains which can be represented as sparse high-dimensional matrices in which most of the entries are 0. Therefore, traditional methods for subspace clustering can also be extended to the problem of co-clustering. For example, an adaptive iterative subspace clustering method for documents was proposed in [59]. We note that subspace clustering or co-clustering can be considered a form of local feature selection, in which the features selected are specific to each cluster. A natural question arises as to whether the features can be selected as a linear combination of dimensions, as in the case of traditional dimensionality reduction techniques such as PCA [53]. This is also known as local dimensionality reduction [22] or generalized projected clustering [6] in the traditional database literature. In this method, PCA-based techniques are used in order to generate subspace representations which are specific to each cluster, and these are leveraged in order to achieve a better clustering. In particular, such an approach has recently been designed [32], and has been shown to work well with document data. In this section, we will study two well-known methods for document co-clustering which are commonly used in the document clustering literature. One of these methods uses graph-based term-document representations [30] and the other uses information theory [31]. We discuss both of these methods below.

4.3.1 Co-clustering with graph partitioning.

The core idea in this approach [30] is to represent the term-document matrix as a bipartite graph G = (V_1 ∪ V_2, E), where V_1 and V_2 represent the vertex sets in the two bipartite portions of this graph, and E represents the edge set. Each node in V_1 corresponds to one of the n documents, and each node in V_2 corresponds to one of the d terms. An undirected edge exists between node i ∈ V_1 and node j ∈ V_2 if document i contains the term j.
We note that there are no edges in E directly between terms, or directly between documents; therefore, the graph is bipartite. The weight of each edge is the corresponding normalized term frequency. We note that a word partitioning in this bipartite graph induces a document partitioning and vice-versa. Given a partitioning of the documents in this graph, we can associate each word with the document cluster to which it is connected with the greatest edge weight. Note that this criterion also minimizes the weight of the edges across the partitions. Similarly, given a word partitioning, we can associate each document with the word partition to which it is connected with the greatest edge weight. Therefore, a natural solution to this problem is to simultaneously perform a k-way partitioning of this graph which minimizes the total weight of the edges across the partitions. This is of course a classical problem in the graph partitioning literature. In [30], it has been shown how a spectral partitioning algorithm can be used effectively for this purpose. Another method discussed in [75] uses an isometric bipartite graph-partitioning approach for the clustering process.

4.3.2 Information-Theoretic Co-clustering.

In [31], the optimal clustering has been defined to be one which maximizes the mutual information between the clustered random variables. The normalized non-negative contingency table is treated as a joint probability distribution between two discrete random variables which take values over rows and columns. Let X = x_1 ... x_n be the random variables corresponding to the rows, and let Y = y_1 ... y_d be the random variables corresponding to the columns. We would like to partition X into k clusters, and Y into l clusters. Let the clusters be denoted by ˆX = ˆx_1 ... ˆx_k and ˆY = ˆy_1 ... ˆy_l. In other words, we wish to find the maps C_X and C_Y which define the clustering:

C_X : x_1 ... x_n → ˆx_1 ... ˆx_k
C_Y : y_1 ... y_d → ˆy_1 ... ˆy_l

The partition functions C_X and C_Y are allowed to depend on the joint probability distribution p(X, Y). We note that since ˆX and ˆY are higher-level clusterings of X and Y, there is a loss of mutual information in the higher-level representations. In other words, the distribution p(ˆX, ˆY) contains less information than p(X, Y), and the mutual information I(ˆX, ˆY) is lower than the mutual information I(X, Y). Therefore, the optimal co-clustering problem is to determine the mapping which minimizes the loss in mutual information. In other words, we wish to find a co-clustering for which I(X, Y) − I(ˆX, ˆY) is as small as possible. An iterative algorithm for finding a co-clustering which minimizes the mutual information loss is proposed in [29].

4.4 Clustering with Frequent Phrases

One of the key differences of this method from other text clustering methods is that it treats a document as a string as opposed to a bag of words. Specifically, each document is treated as a string of words, rather than characters. The main difference between the string representation and the bag-of-words representation is that the former also retains ordering information for the clustering process. As is the case with many clustering methods, the approach uses an indexing method in order to organize the phrases in the document collection, and then uses this organization to create the clusters [103, 104].
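Before detailing the steps of the phrase-based method, the bipartite formulation of co-clustering from the previous subsection can be illustrated with a simplified sketch. The normalization and post-processing choices below follow the generic spectral-relaxation recipe and are assumptions rather than a faithful reproduction of any specific published algorithm.

```python
# A simplified sketch of spectral co-clustering of the bipartite
# document-term graph: normalize the matrix by document and term degrees,
# take leading singular vectors, and run k-means on the stacked embeddings
# of documents and terms. A is assumed to be a dense nonnegative n x d array.
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def spectral_cocluster(A, k):
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1) + 1e-12                    # document degrees
    d2 = A.sum(axis=0) + 1e-12                    # term degrees
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, _, Vt = svds(An, k=k)                      # k leading singular vector pairs
    Z = np.vstack([U / np.sqrt(d1)[:, None],      # document embeddings
                   Vt.T / np.sqrt(d2)[:, None]])  # term embeddings
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
    return labels[:A.shape[0]], labels[A.shape[0]:]

# Example on a tiny random matrix with 2 co-clusters requested.
doc_labels, term_labels = spectral_cocluster(np.random.rand(6, 8), 2)
```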
Several steps are used in order to create the clusters in this phrase-based method:

(1) The first step is to clean the strings representing the documents. A light stemming algorithm is used by deleting word prefixes and suffixes and reducing plural to singular. Sentence boundaries are marked and non-word tokens are stripped.

(2) The second step is the identification of base clusters. These are defined by the frequent phrases in the collection, which are represented in the form of a suffix tree. A suffix tree [45] is essentially a trie which contains all the suffixes of the entire collection. Each node of the suffix tree represents a group of documents and a phrase which is common to all these documents. Since each node of the suffix tree corresponds to a group of documents, it also corresponds to a base clustering. Each base cluster is given a score which is essentially the product of the number of documents in that cluster and a non-decreasing function of the length of the underlying phrase. Therefore, clusters containing a large number of documents, and which are defined by a relatively long phrase, are more desirable.

(3) An important characteristic of the base clusters created by the suffix tree is that they do not define a strict partitioning and have overlaps with one another. For example, the same document may contain multiple phrases in different parts of the suffix tree, and will therefore be included in the corresponding document groups. The third step of the algorithm merges the clusters based on the similarity of their underlying document sets. Let P and Q be the document sets corresponding to two clusters. The base similarity BS(P, Q) is defined as follows:

BS(P, Q) = ⌊ |P ∩ Q| / max{|P|, |Q|} + 0.5 ⌋    (4.11)

This base similarity is either 0 or 1, depending upon whether the two groups have at least 50% of their documents in common. Then, we construct a graph structure in which the nodes represent the base clusters, and an edge exists between two cluster nodes if the corresponding base similarity between that pair of groups is 1. The connected components in this graph define the final clusters. Specifically, the union of the groups of documents in each connected component is used as the final set of clusters. We note that the final set of clusters has much less overlap, but it still does not define a strict partitioning. This is sometimes the case with clustering algorithms in which modest overlaps are allowed to enable better clustering quality.

5. Probabilistic Document Clustering and Topic Models

A popular method for probabilistic document clustering is that of topic modeling. The idea of topic modeling is to create a probabilistic generative model for the text documents in the corpus. The main approach is to represent a corpus as a function of hidden random variables, the parameters of which are estimated using a particular document collection. The primary assumptions in any topic modeling approach (together with the corresponding random variables) are as follows:

The n documents in the corpus are assumed to have a probability of belonging to each of k topics. Thus, a given document may have a probability of belonging to multiple topics, and this reflects the fact that the same document may contain a multitude of subjects. For a given document D_i and a set of topics T_1 ... T_k, the probability that the document D_i belongs to the topic T_j is given by P(T_j|D_i).
We note that the topics are essentially analogous to clusters, and the value of P(T_j|D_i) provides the probability of membership of the ith document in the jth cluster. In non-probabilistic clustering methods, the membership of documents to clusters is deterministic in nature, and therefore the clustering is typically a clean partitioning of the document collection. However, this often creates challenges when there are overlaps in document subject matter across multiple clusters. The use of soft cluster membership in terms of probabilities is an elegant solution to this dilemma. In this scenario, the determination of the membership of the documents to clusters is a secondary goal to that of finding the latent topical clusters in the underlying text collection. Therefore, this area of research is referred to as topic modeling, and while it is related to the clustering problem, it is often studied as a distinct area of research from clustering. The value of P(T_j|D_i) is estimated using the topic modeling approach, and is one of the primary outputs of the algorithm. The value of k is one of the inputs to the algorithm and is analogous to the number of clusters.

Each topic is associated with a probability vector, which quantifies the probability of the different terms in the lexicon for that topic. Let t_1 ... t_d be the d terms in the lexicon. Then, for a document that belongs completely to topic T_j, the probability that the term t_l occurs in it is given by P(t_l|T_j). The value of P(t_l|T_j) is another important parameter which needs to be estimated by the topic modeling approach.

Note that the number of documents is denoted by n, the number of topics by k, and the lexicon size (number of terms) by d. Most topic modeling methods attempt to learn the above parameters using maximum likelihood methods, so that the probabilistic fit to the given corpus of documents is as large as possible. There are two basic methods used for topic modeling: Probabilistic Latent Semantic Indexing (PLSI) [49] and Latent Dirichlet Allocation (LDA) [20]. In this section, we will focus on the probabilistic latent semantic indexing method. Note that the random variables P(T_j|D_i) and P(t_l|T_j) allow us to model the probability of a term t_l occurring in any document D_i. Specifically, the probability P(t_l|D_i) of the term t_l occurring in document D_i can be expressed in terms of the afore-mentioned parameters as follows:

P(t_l|D_i) = sum_{j=1}^{k} P(t_l|T_j) · P(T_j|D_i)    (4.12)

Thus, for each term t_l and document D_i, we can generate an n × d matrix of probabilities in terms of these parameters, where n is the number of documents and d is the number of terms. For a given corpus, we also have the n × d term-document occurrence matrix X, which tells us which terms actually occur in each document, and how many times each term occurs. In other words, X(i, l) is the number of times that term t_l occurs in document D_i. Therefore, we can use a maximum likelihood estimation approach which maximizes the product of the probabilities of the terms that are observed in each document in the entire collection. The logarithm of this objective can be expressed as a weighted sum of the logarithms of the terms in Equation 4.12, where the weight of the (i, l)th term is its frequency count X(i, l).
This is a constrained optimization problem which maximizes the log-likelihood sum_{i,l} X(i, l) · log(P(t_l|D_i)), subject to the constraints that the probability values over each of the topic-document and term-topic spaces must sum to 1:

sum_{l} P(t_l|T_j) = 1   for each topic T_j    (4.13)
sum_{j} P(T_j|D_i) = 1   for each document D_i    (4.14)

The value of P(t_l|D_i) in the objective function is expanded and expressed in terms of the model parameters with the use of Equation 4.12. We note that a Lagrangian method can be used to solve this constrained problem. This is quite similar to the approach that we discussed for the non-negative matrix factorization problem in this chapter. The Lagrangian solution essentially leads to a set of iterative update equations for the corresponding parameters which need to be estimated. It can be shown that these parameters can be estimated [49] with the iterative update of two matrices [P1]_{k×n} and [P2]_{d×k} containing the topic-document probabilities and term-topic probabilities respectively. We start off by initializing these matrices randomly, and normalize each of them so that the probability values in their columns sum to one. Then, we iteratively perform the following steps on P1 and P2 respectively:

for each entry (j, i) in P1 do
    update P1(j, i) ← P1(j, i) · sum_{r=1}^{d} [ P2(r, j) · X(i, r) / ( sum_{v=1}^{k} P1(v, i) · P2(r, v) ) ]
Normalize each column of P1 to sum to 1;
for each entry (l, j) in P2 do
    update P2(l, j) ← P2(l, j) · sum_{q=1}^{n} [ P1(j, q) · X(q, l) / ( sum_{v=1}^{k} P1(v, q) · P2(l, v) ) ]
Normalize each column of P2 to sum to 1;

The process is iterated to convergence. The output of this approach consists of the two matrices P1 and P2, the entries of which provide the topic-document and term-topic probabilities respectively.

The second well-known method for topic modeling is that of Latent Dirichlet Allocation (LDA). In this method, the term-topic probabilities and topic-document probabilities are modeled with a Dirichlet distribution as a prior. Thus, the LDA method is the Bayesian version of the PLSI technique. It can also be shown that the PLSI method is equivalent to the LDA technique when applied with a uniform Dirichlet prior [42]. The method of LDA was first introduced in [20]. Subsequently, it has generally been used much more extensively than the PLSI method. Its main advantage over the PLSI method is that it is not quite as susceptible to overfitting. This is generally true of Bayesian methods which reduce the number of model parameters to be estimated, and which therefore work much better for smaller data sets. Even for larger data sets, PLSI has the disadvantage that the number of model parameters grows linearly with the size of the collection. It has been argued [20] that the PLSI model is not a fully generative model, because there is no accurate way to model the topical distribution of a document which is not included in the current data set. For example, one can use the current set of topical distributions to perform the modeling of a new document, but this is likely to be much more inaccurate because of the overfitting inherent in PLSI. A Bayesian model, which uses a small number of parameters in the form of a well-chosen prior distribution, such as a Dirichlet, is likely to be much more robust in modeling new documents. Thus, the LDA method can also be used in order to model the topic distribution of a new document more robustly, even if it is not present in the original data set.
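The iterative updates described above for P1 and P2 can be written compactly in matrix form. The following is a minimal sketch; the fixed number of iterations and the random initialization are assumptions, and a practical implementation would monitor the log-likelihood for convergence.

```python
# A minimal sketch of the iterative PLSI-style updates described above.
# P1 is k x n (topic-document), P2 is d x k (term-topic), and X is the
# n x d term-frequency matrix.
import numpy as np

def plsi(X, k, iters=50, seed=0):
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    P1 = rng.random((k, n)); P1 /= P1.sum(axis=0, keepdims=True)
    P2 = rng.random((d, k)); P2 /= P2.sum(axis=0, keepdims=True)
    for _ in range(iters):
        denom = P2 @ P1 + 1e-12      # d x n matrix of P(t_l | D_i), cf. Equation 4.12
        R = X.T / denom              # entry (l, i) is X(i, l) / P(t_l | D_i)
        P1 *= P2.T @ R               # P1(j, i) *= sum_r P2(r, j) * X(i, r) / sum_v P1(v, i) P2(r, v)
        P1 /= P1.sum(axis=0, keepdims=True)
        denom = P2 @ P1 + 1e-12
        R = X.T / denom
        P2 *= R @ P1.T               # P2(l, j) *= sum_q P1(j, q) * X(q, l) / sum_v P1(v, q) P2(l, v)
        P2 /= P2.sum(axis=0, keepdims=True)
    return P1, P2

# Example: X = np.random.randint(0, 3, size=(20, 50)); P1, P2 = plsi(X, k=4)
```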
Despite the theoretical advantages of LDA over PLSI, a recent study has shown that their task performance in clustering, categorization and retrieval tends to be similar [63]. The area of topic models is quite vast, and will be treated in more depth in Chapter 5 and Chapter 8 of this book; the purpose of this section is simply to acquaint the reader with the basics of this area and its natural connection to clustering. We note that the EM concepts which are used for topic modeling are quite general, and can be used for different variations on the text clustering task, such as text classification [72] or incorporating user feedback into clustering [46]. For example, the work in [72] uses an EM approach in order to perform supervised clustering (and classification) of the documents when a mixture of labeled and unlabeled data is available. A more detailed discussion is provided in Chapter 6 on text classification.

6. Online Clustering with Text Streams

The problem of streaming text clustering is particularly challenging because the clusters need to be continuously maintained in real time. One of the earliest methods for streaming text clustering was proposed in [112]. This technique is referred to as the Online Spherical k-Means Algorithm (OSKM), which reflects the broad approach used by the methodology. This technique divides up the incoming stream into small segments, each of which can be processed effectively in main memory. A set of k-means iterations is applied to each such data segment in order to cluster it. The advantage of using a segment-wise approach for clustering is that since each segment can be held in main memory, we can process each data point multiple times as long as it is held there. In addition, the centroids from the previous segment are used in the next iteration for clustering purposes. A decay factor is introduced in order to age out the old documents, so that the new documents are considered more important from a clustering perspective. This approach has been shown to be extremely effective in clustering massive text streams in [112].

A different method for clustering massive text and categorical data streams is discussed in [3]. The method discussed in [3] uses an approach which examines the relationship between outliers, emerging trends, and clusters in the underlying data. Old clusters may become inactive, and eventually get replaced by new clusters. Similarly, when newly arriving data points do not naturally fit in any particular cluster, they need to be initially classified as outliers. However, as time progresses, these new points may create a distinctive pattern of activity which can be recognized as a new cluster. The temporal locality of the data stream is manifested by these new clusters. For example, the first web page belonging to a particular category in a crawl may be recognized as an outlier, but may later form a cluster of documents of its own. On the other hand, the new outliers may not necessarily result in the formation of new clusters. Such outliers are true short-term abnormalities in the data, since they do not result in the emergence of sustainable patterns. The approach discussed in [3] recognizes new clusters by first recognizing them as outliers. This approach works with the use of a summarization methodology, in which the concept of condensed droplets [3] is used in order to create concise representations of the underlying clusters.
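A minimal sketch of the segment-wise online spherical k-means idea described above is shown below. The segment size, the decay value, and the single assignment pass per segment are illustrative assumptions rather than the exact design of the published algorithm.

```python
# A sketch, in the spirit of online spherical k-means: the stream is consumed
# in small segments, unit-normalized documents are assigned to the nearest
# (cosine) centroid, and old centroid mass is decayed as each segment arrives.
import numpy as np

def online_spherical_kmeans(segments, k, dim, decay=0.9, seed=0):
    rng = np.random.default_rng(seed)
    C = rng.random((k, dim))
    C /= np.linalg.norm(C, axis=1, keepdims=True)            # unit-norm centroids
    for segment in segments:                                  # each segment: m x dim array
        S = segment / (np.linalg.norm(segment, axis=1, keepdims=True) + 1e-12)
        assign = np.argmax(S @ C.T, axis=1)                   # nearest centroid by cosine
        for j in range(k):
            members = S[assign == j]
            if len(members):
                C[j] = decay * C[j] + members.sum(axis=0)     # age out old mass, add new
        C /= np.linalg.norm(C, axis=1, keepdims=True) + 1e-12
    return C
```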
As in the case of the OSKM algorithm, the method in [3] ensures that recent data points are given greater importance than older data points. This is achieved by creating a time-sensitive weight for each data point. It is assumed that each data point has a time-dependent weight defined by the function f(t), which is also referred to as the fading function. The fading function f(t) is a monotonically decreasing function of time t. The half life t_0 of a data point is the time at which its weight has decayed to half of its initial value, i.e., f(t_0) = (1/2)·f(0). The aim of defining a half life is to quantify the rate of decay of the importance of each data point in the stream clustering process. The decay rate is defined as the inverse of the half life of the data stream, and is denoted by λ = 1/t_0. The weight function of each point in the data stream is defined as f(t) = 2^(−λ·t). From the perspective of the clustering process, the weight of each data point is f(t). It is easy to see that this decay function creates a half life of 1/λ. It is also evident that by changing the value of λ, it is possible to change the rate at which the importance of the historical information in the data stream decays.

When a cluster is created during the streaming process by a newly arriving data point, it is allowed to remain as a trend-setting outlier for at least one half life. During that period, if at least one more data point arrives, then the cluster becomes an active and mature cluster. On the other hand, if no new points arrive during a half life, then the trend-setting outlier is recognized as a true anomaly in the data stream. At this point, this anomaly is removed from the list of current clusters. We refer to the process of removal as cluster death. Thus, a new cluster containing one data point dies when the (weighted) number of points in the cluster is 0.5. The same criterion is used to define the death of mature clusters. A necessary condition for this criterion to be met is that the inactivity period in the cluster has exceeded the half life 1/λ. The greater the number of points in the cluster, the greater the level by which the inactivity period would need to exceed its half life in order to meet the criterion. This is a natural solution, since it is intuitively desirable to have stronger requirements (a longer inactivity period) for the death of a cluster containing a larger number of points.

The statistics of the data points are captured in summary statistics, which are referred to as condensed droplets. These represent the word distributions within a cluster, and can be used in order to compute the similarity of an incoming data point to the cluster. The overall algorithm proceeds as follows. At the beginning of algorithmic execution, we start with an empty set of clusters. As new data points arrive, unit clusters containing individual data points are created. Once a maximum number k of such clusters has been created, we can begin the process of online cluster maintenance. Thus, we initially start off with a trivial set of k clusters. These clusters are updated over time with the arrival of new data points. When a new data point X arrives, its similarity to each cluster droplet is computed. In the case of text data sets, the cosine similarity between the word-frequency statistics of the droplet and the incoming document X is used. The similarity value S(X, C_j) is computed from the incoming document X to every cluster C_j. The cluster with the maximum value of S(X, C_j) is chosen as the relevant cluster for data insertion. Let us assume that this cluster is C_mindex.
We use a threshold denoted by thresh in order to determine whether the incoming data point is an outlier. If the value of S(X, C_mindex) is larger than the threshold thresh, then the point X is assigned to the cluster C_mindex. Otherwise, we check whether some inactive cluster exists in the current set of cluster droplets. If no such inactive cluster exists, then the data point X is added to C_mindex anyway. On the other hand, when an inactive cluster does exist, a new cluster is created containing the solitary data point X. This newly created cluster replaces the inactive cluster. We note that this new cluster is either a potential true outlier or the beginning of a new trend of data points. Further understanding of this new cluster may only be obtained with the progress of the data stream. In the event that X is inserted into the cluster C_mindex, we update the statistics of the cluster in order to reflect the insertion of the data point, along with the temporal decay statistics. Otherwise, we replace the most inactive cluster by a new cluster containing the solitary data point X. In particular, the replaced cluster is the least recently updated cluster among all inactive clusters. This process is continuously performed over the life of the data stream, as new documents arrive over time. The work in [3] also presents a variety of other applications of the stream clustering technique, such as evolution and correlation analysis.

A different way of utilizing the temporal evolution of text documents in the clustering process is described in [48]. Specifically, the work in [48] uses bursty features as markers of new topic occurrences in the data stream. This is because the semantics of an up-and-coming topic are often reflected in the frequent presence of a few distinctive words in the text stream. At a given period in time, the nature of relevant topics could lead to bursts in specific features of the data stream. Clearly, such features are extremely important from a clustering perspective. Therefore, the method discussed in [48] uses a new representation, which is referred to as the bursty feature representation, for mining text streams. In this representation, a time-varying weight is associated with each feature depending upon its burstiness. This also reflects the varying importance of the feature to the clustering process. Thus, it is important to remember that a particular document representation is dependent upon the particular instant in time at which it is constructed. Another issue which is handled effectively in this approach is an implicit reduction in the dimensionality of the underlying collection. Text is inherently a high-dimensional data domain, and the pre-selection of some of the features on the basis of their burstiness can be a natural way to reduce the dimensionality of the document representation. This can help both the effectiveness and the efficiency of the underlying algorithm.

The first step in the process is to identify the bursty features in the data stream. In order to achieve this goal, the approach uses Kleinberg’s 2-state finite automaton model [57]. Once these features have been identified, the bursty features are associated with weights which depend upon their level of burstiness. Subsequently, a bursty feature representation is defined in order to reflect the underlying weight of each feature. Both the identification and the weight of a bursty feature depend upon its underlying frequency.
A standard k-means approach is applied to the new representation in order to construct the clustering. It was shown in [48] that the approach of using burstiness improves the cluster quality. One criticism of the work in [48] is that it is mostly focused on the issue of improving effectiveness with the use of temporal characteristics of the data stream, and does not address the issue of efficient clustering of the underlying data stream.

In general, it is evident that feature extraction is important for all clustering algorithms. While the work in [48] focuses on using temporal characteristics of the stream for feature extraction, the work in [61] focuses on using phrase extraction for effective feature selection. This work is also related to the concept of topic modeling, which was discussed in detail in the previous section, because the different topics in a collection can be related to the clusters in a collection. The work in [61] uses topic-modeling techniques for clustering. The core idea in the work of [61] is that individual words are not very effective for a clustering algorithm because they miss the context in which the word is used. For example, the word “star” may either refer to a celestial body or to an entertainer. On the other hand, when the phrase “fixed star” is used, it becomes evident that the word “star” refers to a celestial body. The phrases which are extracted from the collection are also referred to as topic signatures. The use of such phrasal clarification for improving the quality of the clustering is referred to as semantic smoothing, because it reduces the noise which is associated with semantic ambiguity. Therefore, a key part of the approach is to extract phrases from the underlying data stream. After phrase extraction, the training process determines a translation probability of the phrase to terms in the vocabulary. For example, the word “planet” may have a high probability of association with the phrase “fixed star”, because both refer to celestial bodies. Therefore, for a given document, a rational probability count may also be assigned to all terms. For each document, it is assumed that all terms in it are generated either by a topic-signature model or by a background collection model.

The approach in [61] works by modeling the soft probability p(w|C_j) for word w and cluster C_j. The probability p(w|C_j) is modeled as a linear combination of two factors: (a) a maximum likelihood model which computes the probabilities of generating specific words for each cluster; (b) an indirect (translated) word-membership probability which first determines the maximum likelihood probability for each topic signature, and then multiplies it with the conditional probability of each word, given the topic signature. We note that we can use p(w|C_j) in order to estimate p(d|C_j) by using the product over the constituent words of the document. For this purpose, we use the frequency f(w, d) of word w in document d:

p(d|C_j) = prod_{w ∈ d} p(w|C_j)^{f(w,d)}    (4.15)

We note that in the static case, it is also possible to add a background model in order to improve the robustness of the estimation process. This is however not possible in a data stream, because the background collection model may require multiple passes to build effectively. The work in [61] maintains these probabilities in an online fashion with the use of a cluster profile that weights the probabilities with a fading function.
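The following is a minimal sketch of such a fading cluster profile: time-decayed word counts are maintained per cluster and used to estimate p(w|C_j) and, via Equation 4.15, p(d|C_j). The decay rate and the Laplace-style smoothing constant are assumptions, and the phrase/topic-signature translation component of [61] is not reproduced here.

```python
# A minimal sketch of a decayed ("fading") cluster profile: time-weighted
# word counts are maintained per cluster and used to estimate p(w|C_j).
import math
from collections import defaultdict

class FadingClusterProfile:
    def __init__(self, decay_rate, vocab_size):
        self.lam = decay_rate              # lambda = 1 / half-life
        self.V = vocab_size
        self.counts = defaultdict(float)   # decayed word counts
        self.total = 0.0
        self.last_update = 0.0

    def _decay_to(self, t):
        factor = 2.0 ** (-self.lam * (t - self.last_update))
        for w in self.counts:
            self.counts[w] *= factor
        self.total *= factor
        self.last_update = t

    def add_document(self, word_freqs, t):
        self._decay_to(t)
        for w, f in word_freqs.items():
            self.counts[w] += f
            self.total += f

    def word_prob(self, w, smoothing=1.0):
        return (self.counts[w] + smoothing) / (self.total + smoothing * self.V)

    def log_doc_prob(self, word_freqs):
        # log p(d|C_j) = sum_w f(w, d) * log p(w|C_j)   (cf. Equation 4.15)
        return sum(f * math.log(self.word_prob(w)) for w, f in word_freqs.items())
```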
We note that the concept of a cluster profile is analogous to the concept of the condensed droplet introduced in [3]. The key algorithm (denoted OCTS) maintains a dynamic set of clusters into which documents are progressively assigned with the use of similarity computations. It has been shown in [61] how the cluster profile can be used in order to efficiently compute p(d|C_j) for each incoming document. This value is then used to determine the similarity of the documents to the different clusters, and to assign the documents to their closest cluster. We note that the methods in [3, 61] share a number of similarities in terms of (a) maintenance of cluster profiles, (b) use of cluster profiles (or condensed droplets) to compute similarity and assign documents to their most similar clusters, and (c) the rules used to decide when a new singleton cluster should be created, or when one of the older clusters should be replaced. The main difference between the two algorithms is the technique which is used to compute cluster similarity. The OCTS algorithm uses the probabilistic computation p(d|C_j) to compute cluster similarity, which takes the phrasal information into account during the computation process. One observation about OCTS is that it may allow very similar clusters to co-exist in the current set. This reduces the space available for distinct cluster profiles. A second algorithm called OCTSM is also proposed in [61], which allows for the merging of very similar clusters. Before each assignment, it checks whether pairs of similar clusters can be merged on the basis of similarity. If this is the case, then the similar clusters and their corresponding cluster profiles are merged. Detailed experimental results on the different clustering algorithms and their effectiveness are presented in [61]. A closely related area to clustering is that of topic modeling, which we discussed in an earlier section. Recently, the topic modeling method has also been extended to the dynamic case, which is helpful for topic modeling of text streams [107].

7. Clustering Text in Networks

Many social networks contain both text content in the nodes, as well as links between the different nodes. Clearly, the links provide useful cues in understanding the related nodes in the network. The impact of different link types on the quality of the clustering has been studied in [109], and it has been shown that many forms of implicit and explicit links improve clustering quality, because they encode human knowledge. Therefore, a natural choice is to combine these two factors in the process of clustering the different nodes. In this section, we will discuss a number of such techniques.

In general, links may be considered a kind of side information, which can be represented in the form of attributes. A general approach for incorporating side attributes into the clustering process has been proposed in [1]. This algorithm uses a combination of a k-means approach on the text attributes and Bayesian probability estimations on the side attributes for the clustering process. The idea is to identify those attributes which are helpful for the clustering process, and to use them in order to enhance the quality of the clustering. However, this approach is really designed for general attributes of any kind, rather than link-based attributes, in which an underlying graph structure is implied by the document-to-document linkages.
In spite of this, it has been shown in [1] that it is possible to significantly enhance the quality of clustering by treating linkage information as side attributes. Many other techniques, which will be discussed in this section, have been proposed specifically for the case of text documents which are linked together in a network structure.

The earliest methods for combining text and link information for the clustering process are proposed in [12]. Two different methods were proposed in this work for the clustering process. The first method uses the link information in the neighbors of a node in order to bias the term weights in a document. Term weights which are common between a document and its neighbors are given more importance in the clustering process. One advantage of such an approach is that we can use any of the existing clustering algorithms, because the link information is implicitly encoded in the modified term weights. The second method proposed in [12] is a graph-based approach which directly uses the links in the clustering process. In this case, the approach attempts to model the probability that a particular document belongs to a given cluster for a particular set of links and content. This is essentially a soft clustering, in which a probability of assignment is determined for each cluster. The cluster with the largest probability of assignment is considered the most relevant cluster. A Markov Random Field (MRF) technique is used in order to perform the clustering. An iterative technique called relaxation labeling is used in order to compute the maximum likelihood parameters of this MRF. More details of this approach may be found in [12].

A recent method to perform clustering with both structural and attribute similarities is proposed in [113]. The techniques of this paper can be applied to both relational and text attributes. This paper integrates structural and attribute-based clustering by adding attribute vertices to the network in addition to the original structural vertices. In the context of text data, this implies that a vertex exists for each word in the lexicon. Therefore, in addition to the original set of vertices V in the graph G = (V, E), we now have the augmented vertex set V ∪ V_1, such that V_1 contains one vertex for each word. We also augment the edge set, in order to add to the original set of structural edges E. We add an edge between a structural vertex i ∈ V and an attribute vertex j ∈ V_1 if word j is contained in the node i. This new set of edges is denoted by E_1. Therefore, we now have an augmented graph G_1 = (V ∪ V_1, E ∪ E_1) which is semi-bipartite. A neighborhood random walk model is used in order to determine the closeness of vertices, and this closeness measure is used in order to perform the clustering. The main challenge in the algorithm is to determine the relative importance of the structural and attribute components in the clustering process. In the context of the random walk model, this translates to determining the appropriate weights of different edges during the random walk process. A learning model has been proposed in [113] in order to learn these weights and leverage them for an effective clustering process.

The problem of clustering network content is often encountered in the context of community detection in social networks. The text content in the social network graph may be attached either to the nodes [101] of the network, or to the edges [74].
The node-based approach is generally more common, and most of the afore-mentioned techniques in this chapter can be modeled in terms of content attached to the nodes. In the method proposed in [101], the following link-based and content-based steps are combined for effective community detection:

A conditional model is proposed for link analysis, in which the conditional probability for the destination of a given link is modeled. A hidden variable is introduced in order to capture the popularity of a node in terms of the likelihood of that node being cited by other nodes.

A discriminative content model is introduced in order to reduce the impact of noisy content attributes. In this model, the attributes are weighted by their ability to discriminate between the different communities.

The two models are combined into a unified framework with the use of a two-stage optimization algorithm for maximum likelihood inference. One interesting characteristic of this broad framework is that it can also be used in the context of other complementary approaches. The details of the algorithm are discussed in [101].

For the case of edge-based community detection, it is assumed that the text content in the network is attached to the edges [74]. This is common in applications which involve extensive communication between the different nodes. For example, in email networks or online chat networks, the text in the network is associated with the communications between the different entities. In such cases, the text is associated with an edge in the underlying network. The presence of content associated with edges allows for a much more nuanced approach to community detection, because a given node may participate in communities of different kinds. The content associated with edges helps in separating out these different associations of the same individual with different communities. The work in [74] uses a matrix-factorization methodology in order to jointly model the content and structure for the community detection process. The matrix factorization method is used to transform the representation into a multi-dimensional representation, which can easily be clustered by a simple algorithm such as the k-means algorithm. It was shown in [74] that the use of such an approach can provide much more effective results than a pure content- or link-based clustering methodology.

A closely related area to clustering is that of topic modeling, in which we attempt to model the probability of a document belonging to a particular cluster. A natural approach to network-based topic modeling is to add a network-based regularization constraint to traditional topic models, as in NetPLSA [65]. The relational topic model (RTM) proposed in [23] models the generation of documents and links sequentially. The first step for generating the documents is the same as in LDA. Subsequently, the model predicts links based on the similarity of the topic mixtures used in two documents. Thus, this method can be used both for topic modeling and for predicting missing links. A more unified model is proposed in the iTopicModel [91] framework, which creates a Markov Random Field model in order to create a generative model which simultaneously captures both text and links. Experimental results have shown this approach to be more general and superior to previously existing methods. A number of other methods for incorporating network information into topic modeling are discussed in the next chapter on dimensionality reduction.
8. Semi-Supervised Clustering

In some applications, prior knowledge may be available about the kinds of clusters that are present in the underlying data. This prior knowledge may take the form of labels attached to the documents, which indicate their underlying topics. For example, if we wish to use the broad distribution of topics in the Yahoo! taxonomy in order to supervise the clustering process of a new web collection, one way to perform supervision would be to add some labeled pages from the Yahoo! taxonomy to the collection. Typically such pages would contain labels of the form @Science@Astronomy or @Arts@Painting, which indicate the subject area of the added pages. Such knowledge can be very useful in creating significantly more coherent clusters, especially when the total number of clusters is large. The process of using such labels to guide the clustering process is referred to as semi-supervised clustering. This form of learning is a bridge between the clustering and classification problems, because it uses the underlying class structure, but it is not completely tied down by the specific structure. As a result, such an approach finds applicability in both the clustering and classification scenarios.

The most natural method for incorporating supervision into the clustering process is to do so in partitional clustering methods such as k-means. This is because the supervision can be easily incorporated by changing the seeds in the clustering process. For example, the work in [4] uses the initial seeds in the k-means clustering process as the centroids of the original classes in the underlying data. A similar approach has also been used in [15], except that a wider variety of ways in which the seeds may be selected is explored. A number of probabilistic frameworks have also been designed for semi-supervised clustering [72, 14]. The work in [72] uses an iterative EM-approach in which the unlabeled documents are assigned labels using a naive Bayes approach on the currently labeled documents. These newly labeled documents are then again used for re-training a Bayes classifier. This process is iterated to convergence. The iterative labeling approach in [72] can be considered a partially supervised approach for clustering the unlabeled documents. The work in [14] uses a Heterogeneous Markov Random Field (HMRF) model for the clustering process.

A graph-based method for incorporating prior knowledge into the clustering process has been proposed in [52]. In this method, the documents are modeled as a graph, in which nodes represent documents and edges represent the similarity between them. New edges may also be added to this graph, which correspond to the prior knowledge. Specifically, an edge is added to the graph when it is known, on the basis of prior knowledge, that the two documents are similar. A normalized cut algorithm [84] is then applied to this graph in order to create the final clustering. This approach implicitly uses the prior knowledge because of the augmented graph representation which is used for the clustering.

Since semi-supervised clustering forms a natural bridge between the clustering and classification problems, it is natural that semi-supervised methods can be used for classification as well [68]. This is also referred to as co-training, because it involves the use of unsupervised document clustering in order to assist the training process.
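As a concrete illustration of the seeding strategy of [4, 15] described above, the following is a minimal sketch in which the labeled documents supply the initial centroids and standard k-means then runs over the full collection. The vectorization step and all names are illustrative rather than taken from those papers.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def seeded_kmeans(labeled_docs, labels, unlabeled_docs):
    """Semi-supervised clustering: seed k-means with class centroids."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(labeled_docs + unlabeled_docs).toarray()
    X_labeled = X[:len(labeled_docs)]

    # One seed per class: the centroid of the labeled documents
    # belonging to that class, as in the seeding approach of [4, 15].
    classes = sorted(set(labels))
    y = np.asarray(labels)
    seeds = np.vstack([X_labeled[y == c].mean(axis=0) for c in classes])

    # Standard k-means, initialized at the supervised seeds rather
    # than at random points.
    km = KMeans(n_clusters=len(classes), init=seeds, n_init=1)
    return km.fit_predict(X)
```

Because only the initialization is changed, any refinement of the seeding strategy (such as the variations explored in [15]) can be substituted without altering the rest of the procedure.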
Since semi-supervised methods use both the clustering structure in the feature space and the class information, they are sometimes more robust in classification scenarios, especially in cases where the amount of available labeled data is small. It has been shown in [72] how a partially supervised co-training approach which mixes supervised and unsupervised data may yield more effective classification results when the amount of training data available is small. The work in [72] uses a partially supervised EM-algorithm which iteratively assigns labels to the unlabeled documents and refines them over time as convergence is achieved. A number of similar methods in this spirit are proposed in [4, 14, 35, 47, 89] with varying levels of supervision in the clustering process. Partially supervised clustering methods are also used for feature transformation in classification, as discussed in [17, 18, 88]. The idea is that the clustering structure provides a compressed feature space, which captures the relevant classification structure very well, and can therefore be helpful for classification.

Partially supervised methods can also be used in conjunction with pre-existing categorical hierarchies (or prototype hierarchies) [4, 56, 67]. A typical example of a prototype hierarchy would be the Yahoo! taxonomy, the Open Directory Project, or the Reuters collection. The idea is that such hierarchies provide a good general idea of the clustering structure, but also have considerable noise and overlaps in them because of their typically manual origins. The partial supervision is able to correct the noise and overlaps, and this results in a relatively clean and coherent clustering structure.

An unusual kind of supervision for document clustering is the use of a universum of documents which are known not to belong to any cluster [106]. This is essentially the background distribution, which cannot be naturally clustered into any particular group. The intuition is that the universum of examples provides an effective way of avoiding mistakes in the clustering process, since it provides a background of examples against which a cluster can be compared.

9. Conclusions and Summary

In this chapter, we presented a survey of clustering algorithms for text data. A good clustering of text requires effective feature selection and a proper choice of the algorithm for the task at hand. Among the different classes of algorithms, the distance-based methods are among the most popular in a wide variety of applications. In recent years, the main trend in research in this area has been in the context of two kinds of text data:

Dynamic Applications: The large amounts of text data being created by dynamic applications such as social networks or online chat applications have created an immense need for streaming text clustering applications. Such streaming applications need to be applicable in the case of text which is not very clean, as is often the case for applications such as social networks.

Heterogeneous Applications: Text applications increasingly arise in heterogeneous settings in which the text is available in the context of links and other heterogeneous multimedia data. For example, in social networks such as Flickr the clustering often needs to be applied in such scenarios. Therefore, it is critical to effectively adapt text-based algorithms to heterogeneous multimedia scenarios.
We note that the field of text clustering is too vast to cover comprehensively in a single chapter. Some methods such as committee-based clustering [73] cannot even be neatly incorporated into any class of methods, since they use a combination of the different clustering methods in order to create a final clustering result. The main purpose of this chapter is to provide a comprehensive overview of the main algorithms which are often used in the area, as a starting point for further study. References [1] C. C. Aggarwal, Y. Zhao, P. S. Yu. On Text Clustering with Side Information, ICDE Conference, 2012. [2] C. C. Aggarwal, P. S. Yu. On Effective Conceptual Indexing and Similarity Search in Text, ICDM Conference, 2001. [3] C. C. Aggarwal, P. S. Yu. A Framework for Clustering Massive Text and Categorical Data Streams, SIAM Conference on Data Mining, 2006. [4] C. C. Aggarwal, S. C. Gates, P. S. Yu. On Using Partial Supervision for Text Categorization, IEEE Transactions on Knowledge and Data Engineering, 16(2), 245–255, 2004. [5] C. C. Aggarwal, C. Procopiuc, J. Wolf, P. S. Yu, J.-S. Park. Fast Algorithms for Projected Clustering, ACM SIGMOD Conference, 1999. 122 MINING TEXT DATA [6] C. C. Aggarwal, P. S. Yu. Finding Generalized Projected Clusters in High Dimensional Spaces, ACM SIGMOD Conference, 2000. [7] R. Agrawal, J. Gehrke, P. Raghavan. D. Gunopulos. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, ACM SIGMOD Conference, 1999. [8] R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases, VLDB Conference, 1994. [9] J. Allan, R. Papka, V. Lavrenko. Online new event detection and tracking. ACM SIGIR Conference, 1998. [10] P. Andritsos, P. Tsaparas, R. Miller, K. Sevcik. LIMBO: Scalable Clustering of Categorical Data. EDBT Conference, 2004. [11] P. Anick, S. Vaithyanathan. Exploiting Clustering and Phrases for Context-Based Information Retrieval. ACM SIGIR Conference, 1997. [12] R. Angelova, S. Siersdorfer. A neighborhood-based approach for clustering of linked document collections. CIKM Conference, 2006. [13] R. A. Baeza-Yates, B. A. Ribeiro-Neto, Modern Information Retrieval - the concepts and technology behind search, Second edition, Pearson Education Ltd., Harlow, England, 2011. [14] S. Basu, M. Bilenko, R. J. Mooney. A probabilistic framework for semi-supervised clustering. ACM KDD Conference, 2004. [15] S. Basu, A. Banerjee, R. J. Mooney. Semi-supervised Clustering by Seeding. ICML Conference, 2002. [16] F. Beil, M. Ester, X. Xu. Frequent term-based text clustering, ACM KDD Conference, 2002. [17] L. Baker, A. McCallum. Distributional Clustering of Words for Text Classification, ACM SIGIR Conference, 1998. [18] R. Bekkerman, R. El-Yaniv, Y. Winter, N. Tishby. On Feature Distributional Clustering for Text Categorization. ACM SIGIR Conference, 2001. [19] D. Blei, J. Lafferty. Dynamic topic models. ICML Conference, 2006. [20] D. Blei, A. Ng, M. Jordan. Latent Dirichlet allocation, Journal of Machine Learning Research, 3: pp. 993–1022, 2003. [21] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J/ C. Lai. Class-based n-gram models of natural language, Computational Linguistics, 18, 4 (December 1992), 467-479. [22] K. Chakrabarti, S. Mehrotra. Local Dimension reduction: A new Approach to Indexing High Dimensional Spaces, VLDB Conference, 2000. A Survey of Text Clustering Algorithms 123 [23] J. Chang, D. Blei. Topic Models for Document Networks. AISTASIS, 2009. [24] W. B. Croft. 
Clustering large files of documents using the single-link method. Journal of the American Society of Information Science, 28: pp. 341–344, 1977. [25] D. Cutting, D. Karger, J. Pedersen, J. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. ACM SIGIR Conference, 1992. [26] D. Cutting, D. Karger, J. Pederson. Constant Interaction-time Scatter/Gather Browsing of Large Document Collections, ACM SIGIR Conference, 1993. [27] M. Dash, H. Liu. Feature Selection for Clustering, PAKDD Conference, pp. 110–121, 1997. [28] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, R. Harshman. Indexing by Latent Semantic Analysis. JASIS, 41(6), pp. 391–407, 1990. [29] I. Dhillon, D. Modha. Concept Decompositions for Large Sparse Data using Clustering, 42(1), pp. 143–175, 2001. [30] I. Dhillon. Co-clustering Documents and Words using bipartite spectral graph partitioning, ACM KDD Conference, 2001. [31] I. Dhillon, S. Mallela, D. Modha. Information-theoretic CoClustering, ACM KDD Conference, 2003. [32] C. Ding, X. He, H. Zha, H. D. Simon. Adaptive Dimension Reduction for Clustering High Dimensional Data, ICDM Conference, 2002. [33] C. Ding, X. He, H. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. SDM Conference, 2005. [34] B. Dorow, D. Widdows. Discovering corpus-specific word senses, Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 2 (EACL ’03), pages 79-82, 2003. [35] R. El-Yaniv, O. Souroujon. Iterative Double Clustering for Unsupervised and Semi-supervised Learning. NIPS Conference, 2002. [36] H. Fang, T. Tao, C. Zhai, A formal study of information retrieval heuristics, Proceedings of ACM SIGIR 2004, 2004. [37] D. Fisher. Knowledge Acquisition via incremental conceptual clustering. Machine Learning, 2: pp. 139–172, 1987. [38] M. Franz, T. Ward, J. McCarley, W.-J. Zhu. Unsupervised and supervised clustering for topic tracking. ACM SIGIR Conference, 2001. 124 MINING TEXT DATA [39] G. P. C. Fung, J. X. Yu, P. Yu, H. Lu. Parameter Free Bursty Events Detection in Text Streams, VLDB Conference, 2005. [40] J. H. Gennari, P. Langley, D. Fisher. Models of incremental concept formation. Journal of Artificial Intelligence, 40 pp. 11–61, 1989. [41] D. Gibson, J. Kleinberg, P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems, VLDB Conference, 1998. [42] M. Girolami, A Kaban. On the Equivalance between PLSI and LDA, SIGIR Conference, pp. 433–434, 2003. [43] S. Guha, R. Rastogi, K. Shim. ROCK: a robust clustering algorithm for categorical attributes, International Conference on Data Engineering, 1999. [44] S. Guha, R. Rastogi, K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases. ACM SIGMOD Conference, 1998. [45] D. Gusfield. Algorithms for strings, trees and sequences, Cambridge University Press, 1997. [46] Y. Huang, T. Mitchell. Text clustering with extended user feedback. ACM SIGIR Conference, 2006. [47] H. Li, K. Yamanishi. Document classification using a finite mixture model. Annual Meeting of the Association for Computational Linguistics, 1997. [48] Q. He, K. Chang, E.-P. Lim, J. Zhang. Bursty feature representation for clustering text streams. SDM Conference, 2007. [49] T. Hofmann. Probabilistic Latent Semantic Indexing. ACM SIGIR Conference, 1999. [50] A. Jain, R. C. Dubes. Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1998. [51] N. Jardine, C. J.van Rijsbergen. 
The use of hierarchical clustering in information retrieval, Information Storage and Retrieval, 7: pp. 217–240, 1971. [52] X. Ji, W. Xu. Document clustering with prior knowledge. ACM SIGIR Conference, 2006. [53] I. T. Jolliffee. Principal Component Analysis. Springer, 2002. [54] L. Kaufman, P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Interscience, 1990. [55] W. Ke, C. Sugimoto, J. Mostafa. Dynamicity vs. effectiveness: studying online clustering for scatter/gather. ACM SIGIR Conference, 2009. A Survey of Text Clustering Algorithms 125 [56] H. Kim, S. Lee. A Semi-supervised document clustering technique for information organization, CIKM Conference, 2000. [57] J. Kleinberg, Bursty and hierarchical structure in streams, ACM KDD Conference, pp. 91–101, 2002. [58] D. D. Lee, H. S. Seung. Learning the parts of objects by nonnegative matrix factorization, Nature, 401: pp. 788–791, 1999. [59] T. Li, S. Ma, M. Ogihara, Document Clustering via Adaptive Subspace Iteration, ACM SIGIR Conference, 2004. [60] T. Li, C. Ding, Y. Zhang, B. Shao. Knowledge transformation from word space to document space. ACM SIGIR Conference, 2008. [61] Y.-B. Liu, J.-R. Cai, J. Yin, A. W.-C. Fu. Clustering Text Data Streams, Journal of Computer Science and Technology, Vol. 23(1), pp. 112–128, 2008. [62] T. Liu, S. Lin, Z. Chen, W.-Y. Ma. An Evaluation on Feature Selection for Text Clustering, ICML Conference, 2003. [63] Y. Lu, Q. Mei, C. Zhai. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA, Information Retrieval, 14(2): 178-203 (2011). [64] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu. edu/~mccallum/bow, 1996. [65] Q. Mei, D. Cai, D. Zhang, C.-X. Zhai. Topic Modeling with Network Regularization. WWW Conference, 2008. [66] D. Metzler, S. T. Dumais, C. Meek, Similarity Measures for Short Segments of Text, Proceedings of ECIR 2007, 2007. [67] Z. Ming, K. Wang, T.-S. Chua. Prototype hierarchy-based clustering for the categorization and navigation of web collections. ACM SIGIR Conference, 2010. [68] T. M. Mitchell. The role of unlabeled data in supervised learning. Proceedings of the Sixth International Colloquium on Cognitive Science, 1999. [69] F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms, The Computer Journal, 26(4), pp. 354–359, 1983. [70] F. Murtagh. Complexities of Hierarchical Clustering Algorithms: State of the Art, Computational Statistics Quarterly, 1(2), pp. 101– 113, 1984. [71] R. Ng, J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB Conference, 1994. 126 MINING TEXT DATA [72] K. Nigam, A. McCallum, S. Thrun, T. Mitchell. Learning to classify text from labeled and unlabeled documents. AAAI Conference, 1998. [73] P. Pantel, D. Lin. Document Clustering with Committees, ACM SIGIR Conference, 2002. [74] G. Qi, C. Aggarwal, T. Huang. Community Detection with Edge Content in Social Media Networks, ICDE Conference, 2012. [75] M. Rege, M. Dong, F. Fotouhi. Co-clustering Documents and Words Using Bipartite Isoperimetric Graph Partitioning. ICDM Conference, pp. 532–541, 2006. [76] C. J. van Rijsbergen. Information Retrieval, Butterworths, 1975. [77] C. J.van Rijsbergen, W. B. Croft. Document Clustering: An Evaluation of some experiments with the Cranfield 1400 collection, Information Processing and Management, 11, pp. 171–182, 1975. [78] S. E. Robertson and S. Walker. 
Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In SIGIR, pages 232–241, 1994. [79] M. Sahami, T. D. Heilman, A web-based kernel function for measuring the similarity of short text snippets, Proceedings of WWW 2006, pages 377-386, 2006. [80] N. Sahoo, J. Callan, R. Krishnan, G. Duncan, R. Padman. Incremental Hierarchical Clustering of Text Documents, ACM CIKM Conference, 2006. [81] G. Salton. An Introduction to Modern Information Retrieval, Mc Graw Hill, 1983. [82] G. Salton, C. Buckley. Term Weighting Approaches in Automatic Text Retrieval, Information Processing and Management, 24(5), pp. 513–523, 1988. [83] H. Schutze, C. Silverstein. Projections for Efficient Document Clustering, ACM SIGIR Conference, 1997. [84] J. Shi, J. Malik. Normalized cuts and image segmentation. IEEE Transaction on Pattern Analysis and Machine Intelligence, 2000. [85] C. Silverstein, J. Pedersen. Almost-constant time clustering of arbitrary corpus subsets. ACM SIGIR Conference, pp. 60–66, 1997. [86] A. Singhal, C. Buckley, M. Mitra. Pivoted Document Length Normalization. ACM SIGIR Conference, pp. 21–29, 1996. [87] N. Slonim, N. Tishby. Document Clustering using word clusters via the information bottleneck method, ACM SIGIR Conference, 2000. A Survey of Text Clustering Algorithms 127 [88] N. Slonim, N. Tishby. The power of word clusters for text classification. European Colloquium on Information Retrieval Research (ECIR), 2001. [89] N. Slonim, N. Friedman, N. Tishby. Unsupervised document classification using sequential information maximization. ACM SIGIR Conference, 2002. [90] M. Steinbach, G. Karypis, V. Kumar. A Comparison of Document Clustering Techniques, KDD Workshop on text mining, 2000. [91] Y. Sun, J. Han, J. Gao, Y. Yu. iTopicModel: Information Network Integrated Topic Modeling, ICDM Conference, 2009. [92] E. M. Voorhees. Implementing Agglomerative Hierarchical Clustering for use in Information Retrieval,Technical Report TR86–765, Cornell University, Ithaca, NY, July 1986. [93] F. Wang, C. Zhang, T. Li. Regularized clustering for documents. ACM SIGIR Conference, 2007. [94] J. Wilbur, K. Sirotkin. The automatic identification of stopwords, J. Inf. Sci., 18: pp. 45–55, 1992. [95] P. Willett. Document Clustering using an inverted file approach. Journal of Information Sciences, 2: pp. 223–231, 1980. [96] P. Willett. Recent Trends in Hierarchical Document Clustering: A Critical Review. Information Processing and Management, 24(5): pp. 577–597, 1988. [97] W. Xu, X. Liu, Y. Gong. Document Clustering based on nonnegative matrix factorization, ACM SIGIR Conference, 2003. [98] W. Xu, Y. Gong. Document clustering by concept factorization. ACM SIGIR Conference, 2004. [99] Y. Yang, J. O. Pederson. A comparative study on feature selection in text categorization, ACM SIGIR Conference, 1995. [100] Y. Yang. Noise Reduction in a Statistical Approach to Text Categorization, ACM SIGIR Conference, 1995. [101] T. Yang, R. Jin, Y. Chi, S. Zhu. Combining link and content for community detection: a discriminative approach. ACM KDD Conference, 2009. [102] L. Yao, D. Mimno, A. McCallum. Efficient methods for topic model inference on streaming document collections, ACM KDD Conference, 2009. [103] O. Zamir, O. Etzioni. Web Document Clustering: A Feasibility Demonstration, ACM SIGIR Conference, 1998. 128 MINING TEXT DATA [104] O. Zamir, O. Etzioni, O. Madani, R. M. Karp. Fast and Intuitive Clustering of Web Documents, ACM KDD Conference, 1997. [105] C. 
Zhai, Statistical Language Models for Information Retrieval (Synthesis Lectures on Human Language Technologies), Morgan & Claypool Publishers, 2008. [106] D. Zhang, J. Wang, L. Si. Document clustering with universum. ACM SIGIR Conference, 2011. [107] J. Zhang, Z. Ghahramani, Y. Yang. A probabilistic model for online document clustering with application to novelty detection. In Saul L., Weiss Y., Bottou L. (eds) Advances in Neural Information Processing Letters, 17, 2005. [108] T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. ACM SIGMOD Conference, 1996. [109] X. Zhang, X. Hu, X. Zhou. A comparative evaluation of different link types on enhancing document clustering. ACM SIGIR Conference, 2008. [110] Y. Zhao, G. Karypis. Evaluation of hierarchical clustering algorithms for document data set, CIKM Conference, 2002. [111] Y. Zhao, G. Karypis. Empirical and Theoretical comparisons of selected criterion functions for document clustering, Machine Learning, 55(3), pp. 311–331, 2004. [112] S. Zhong. Efficient Streaming Text Clustering. Neural Networks, Volume 18, Issue 5–6, 2005. [113] Y. Zhou, H. Cheng, J. X. Yu. Graph Clustering based on Structural/Attribute Similarities, VLDB Conference, 2009. [114] http://www.lemurproject.org/ Chapter 6 A SURVEY OF TEXT CLASSIFICATION ALGORITHMS Charu C. Aggarwal IBM T. J. Watson Research Center Yorktown Heights, NY charu@us.ibm.com ChengXiang Zhai University of Illinois at Urbana-Champaign Urbana, IL czhai@cs.uiuc.edu Abstract The problem of classification has been widely studied in the data mining, machine learning, database, and information retrieval communities with applications in a number of diverse domains, such as target marketing, medical diagnosis, news group filtering, and document organization. In this paper we will provide a survey of a wide variety of text classification algorithms. Keywords: Text Classification 1. Introduction The problem of classification has been widely studied in the database, data mining, and information retrieval communities. The problem of classification is defined as follows. We have a set of training records D = {X1, . . . , XN }, such that each record is labeled with a class value drawn from a set of k different discrete values indexed by {1 . . . k}. The training data is used in order to construct a classification model, which relates the features in the underlying record to one of the class labels. For a given test instance for which the class is unknown, the training model 164 MINING TEXT DATA is used to predict a class label for this instance. In the hard version of the classification problem, a particular label is explicitly assigned to the instance, whereas in the soft version of the classification problem, a probability value is assigned to the test instance. Other variations of the classification problem allow ranking of different class choices for a test instance, or allow the assignment of multiple labels [52] to a test instance. The classification problem assumes categorical values for the labels, though it is also possible to use continuous values as labels. The latter is referred to as the regression modeling problem. The problem of text classification is closely related to that of classification of records with set-valued features [28]; however, this model assumes that only information about the presence or absence of words is used in a document. 
In reality, the frequency of words also plays a helpful role in the classification process, and the typical domain-size of text data (the entire lexicon size) is much greater than a typical set-valued classification problem. A broad survey of a wide variety of classification methods may be found in [42, 62], and a survey which is specific to the text domain may be found in [111]. A relative evaluation of different kinds of text classification methods may be found in [132]. A number of the techniques discussed in this chapter have also been converted into software and are publicly available through multiple toolkits such as the BOW toolkit [93], Mallot [96], WEKA 1, and LingPipe 2. The problem of text classification finds applications in a wide variety of domains in text mining. Some examples of domains in which text classification is commonly used are as follows: News filtering and Organization: Most of the news services today are electronic in nature in which a large volume of news articles are created very single day by the organizations. In such cases, it is difficult to organize the news articles manually. Therefore, automated methods can be very useful for news categorization in a variety of web portals [78]. This application is also referred to as text filtering. Document Organization and Retrieval: The above application is generally useful for many applications beyond news filtering and organization. A variety of supervised methods may be used for document organization in many domains. These include large digital libraries of documents, web collections, scientific literature, 1http://www.cs.waikato.ac.nz/ml/weka/ 2http://alias-i.com/lingpipe/ A Survey of Text Classification Algorithms 165 or even social feeds. Hierarchically organized document collections can be particularly useful for browsing and retrieval [19]. Opinion Mining: Customer reviews or opinions are often short text documents which can be mined to determine useful information from the review. Details on how classification can be used in order to perform opinion mining are discussed in [89] and Chapter 13 in this book. Email Classification and Spam Filtering: It is often desirable to classify email [23, 27, 85] in order to determine either the subject or to determine junk email [113] in an automated way. This is also referred to as spam filtering or email filtering. A wide variety of techniques have been designed for text classification. In this chapter, we will discuss the broad classes of techniques, and their uses for classification tasks. We note that these classes of techniques also generally exist for other data domains such as quantitative or categorical data. Since text may be modeled as quantitative data with frequencies on the word attributes, it is possible to use most of the methods for quantitative data directly on text. However, text is a particular kind of data in which the word attributes are sparse, and high dimensional, with low frequencies on most of the words. Therefore, it is critical to design classification methods which effectively account for these characteristics of text. In this chapter, we will focus on the specific changes which are applicable to the text domain. Some key methods, which are commonly used for text classification are as follows: Decision Trees: Decision trees are designed with the use of a hierarchical division of the underlying data space with the use of different text features. 
The hierarchical division of the data space is designed in order to create class partitions which are more skewed in terms of their class distribution. For a given text instance, we determine the partition that it is most likely to belong to, and use it for the purposes of classification. Pattern (Rule)-based Classifiers: In rule-based classifiers we determine the word patterns which are most likely to be related to the different classes. We construct a set of rules, in which the lefthand side corresponds to a word pattern, and the right-hand side corresponds to a class label. These rules are used for the purposes of classification. SVM Classifiers: SVM Classifiers attempt to partition the data space with the use of linear or non-linear delineations between the 166 MINING TEXT DATA different classes. The key in such classifiers is to determine the optimal boundaries between the different classes and use them for the purposes of classification. Neural Network Classifiers: Neural networks are used in a wide variety of domains for the purposes of classification. In the context of text data, the main difference for neural network classifiers is to adapt these classifiers with the use of word features. We note that neural network classifiers are related to SVM classifiers; indeed, they both are in the category of discriminative classifiers, which are in contrast with the generative classifiers [102]. Bayesian (Generative) Classifiers: In Bayesian classifiers (also called generative classifiers), we attempt to build a probabilistic classifier based on modeling the underlying word features in different classes. The idea is then to classify text based on the posterior probability of the documents belonging to the different classes on the basis of the word presence in the documents. Other Classifiers: Almost all classifiers can be adapted to the case of text data. Some of the other classifiers include nearest neighbor classifiers, and genetic algorithm-based classifiers. We will discuss some of these different classifiers in some detail and their use for the case of text data. The area of text categorization is so vast that it is impossible to cover all the different algorithms in detail in a single chapter. Therefore, our goal is to provide the reader with an overview of the most important techniques, and also the pointers to the different variations of these techniques. Feature selection is an important problem for text classification. In feature selection, we attempt to determine the features which are most relevant to the classification process. This is because some of the words are much more likely to be correlated to the class distribution than others. Therefore, a wide variety of methods have been proposed in the literature in order to determine the most important features for the purpose of classification. These include measures such as the gini-index or the entropy, which determine the level of which the presence of a particular feature skews the class distribution in the underlying data. We will also discuss the different feature selection methods which are commonly used for text classification. The rest of this chapter is organized as follows. In the next section, we will discuss methods for feature selection in text classification. In section A Survey of Text Classification Algorithms 167 3, we will describe decision tree methods for text classification. Rulebased classifiers are described in detail in section 4. We discuss naive Bayes classifiers in section 5. 
In section 6, we will discuss a number of linear classifiers, such as the SVM classifier, direct regression modeling and the neural network classifier. The nearest neighbor classifier is discussed in section 7. A discussion of how the classification methods can be adapted to text and web data containing hyperlinks is provided in section 8. In section 9, we discuss a number of different meta-algorithms for classification, such as boosting, bagging and ensemble learning. Section 10 contains the conclusions and summary.

2. Feature Selection for Text Classification

Before any classification task, one of the most fundamental tasks that needs to be accomplished is that of document representation and feature selection. While feature selection is also desirable in other classification tasks, it is especially important in text classification due to the high dimensionality of text features and the existence of irrelevant (noisy) features. In general, text can be represented in two separate ways. The first is as a bag of words, in which a document is represented as a set of words together with their associated frequencies in the document. Such a representation is essentially independent of the sequence of words in the document. The second method is to represent text directly as strings, in which each document is a sequence of words. Most text classification methods use the bag-of-words representation because of its simplicity for classification purposes. In this section, we will discuss some of the methods which are used for feature selection in text classification.

The most common feature selection which is used in both supervised and unsupervised applications is that of stop-word removal and stemming. In stop-word removal, we determine the common words in the documents which are not specific or discriminatory to the different classes. In stemming, different forms of the same word are consolidated into a single word. For example, singular, plural and different tenses are consolidated into a single word. We note that these methods are not specific to the case of the classification problem, and are often used in a variety of unsupervised applications such as clustering and indexing. In the case of the classification problem, it makes sense to supervise the feature selection process with the use of the class labels. This kind of selection process ensures that those features which are highly skewed towards the presence of a particular class label are picked for the learning process. A wide variety of feature selection methods are discussed in [133, 135]. Many of these feature selection methods have been compared with one another, and the experimental results are presented in [133]. We will discuss each of these feature selection methods in this section.

2.1 Gini Index

One of the most common methods for quantifying the discrimination level of a feature is the use of a measure known as the gini-index. Let p1(w) . . . pk(w) be the fractions of class-label presence of the k different classes for the word w. In other words, pi(w) is the conditional probability that a document belongs to class i, given the fact that it contains the word w. Therefore, we have:

\[ \sum_{i=1}^{k} p_i(w) = 1 \qquad (6.1) \]

Then, the gini-index for the word w, denoted by G(w), is defined as follows:

\[ G(w) = \sum_{i=1}^{k} p_i(w)^2 \qquad (6.2) \]

(The gini-index is also sometimes defined as \( 1 - \sum_{i=1}^{k} p_i(w)^2 \), with lower values indicating greater discriminative power of the feature w.) The value of the gini-index G(w) always lies in the range [1/k, 1]. Higher values of the gini-index G(w) indicate a greater discriminative power of the word w.
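As a concrete illustration of the measure just defined, the following sketch computes G(w) for every word of a labeled corpus from a binary document-term matrix; the scikit-learn vectorization step and the variable names are illustrative additions rather than part of the above description.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def gini_index(docs, labels):
    """Compute G(w) = sum_i p_i(w)^2 for every word w in the corpus."""
    vec = CountVectorizer(binary=True).fit(docs)
    D = vec.transform(docs).toarray()          # n_docs x n_words, 0/1 presence
    y = np.asarray(labels)
    classes = np.unique(y)

    # For each word, count how many documents of each class contain it.
    counts = np.vstack([D[y == c].sum(axis=0) for c in classes]).astype(float)
    totals = counts.sum(axis=0)                 # documents containing each word
    totals[totals == 0] = 1.0                   # avoid division by zero

    # p_i(w): conditional probability of class i given that w is present;
    # the columns of p sum to one, so G(w) lies in [1/k, 1].
    p = counts / totals
    return dict(zip(vec.get_feature_names_out(), (p ** 2).sum(axis=0)))
```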
For example, when all documents which contain word w belong to a particular class, the value of G(w) is 1. On the other hand, when documents containing word w are evenly distributed among the k different classes, the value of G(w) is 1/k. One criticism of this approach is that the global class distribution may be skewed to begin with, and therefore the above measure may sometimes not accurately reflect the discriminative power of the underlying attributes. Therefore, it is possible to construct a normalized gini-index in order to reflect the discriminative power of the attributes more accurately. Let P1 . . . Pk represent the global distributions of the documents in the different classes. Then, we determine the normalized probability value p'_i(w) as follows:

\[ p'_i(w) = \frac{p_i(w)/P_i}{\sum_{j=1}^{k} p_j(w)/P_j} \qquad (6.3) \]

Then, the gini-index is computed in terms of these normalized probability values:

\[ G(w) = \sum_{i=1}^{k} p'_i(w)^2 \qquad (6.4) \]

The use of the global probabilities Pi ensures that the gini-index more accurately reflects class-discrimination in the case of biased class distributions in the whole document collection. For a document corpus containing n documents, d words, and k classes, the complexity of the gini-index computation is O(n · d · k). This is because the computation of the term pi(w) for all the different words and classes requires O(n · d · k) time.

2.2 Information Gain

Another related measure which is commonly used for text feature selection is that of information gain or entropy. Let Pi be the global probability of class i, and pi(w) be the probability of class i, given that the document contains the word w. Let F(w) be the fraction of the documents containing the word w. The information gain measure I(w) for a given word w is defined as follows:

\[ I(w) = -\sum_{i=1}^{k} P_i \cdot \log(P_i) + F(w) \cdot \sum_{i=1}^{k} p_i(w) \cdot \log(p_i(w)) + (1 - F(w)) \cdot \sum_{i=1}^{k} (1 - p_i(w)) \cdot \log(1 - p_i(w)) \]

The greater the value of the information gain I(w), the greater the discriminatory power of the word w. For a document corpus containing n documents and d words, the complexity of the information gain computation is O(n · d · k).

2.3 Mutual Information

The mutual information measure is derived from information theory [31], and provides a formal way to model the mutual information between the features and the classes. The pointwise mutual information Mi(w) between the word w and the class i is defined on the basis of the level of co-occurrence between the class i and word w. We note that the expected co-occurrence of class i and word w on the basis of mutual independence is given by Pi · F(w). The true co-occurrence is of course given by F(w) · pi(w). In practice, the value of F(w) · pi(w) may be much larger or smaller than Pi · F(w), depending upon the level of correlation between the class i and word w. The mutual information is defined in terms of the ratio between these two values. Specifically, we have:

\[ M_i(w) = \log\left( \frac{F(w) \cdot p_i(w)}{F(w) \cdot P_i} \right) = \log\left( \frac{p_i(w)}{P_i} \right) \qquad (6.5) \]

Clearly, the word w is positively correlated to the class i when Mi(w) > 0, and the word w is negatively correlated to class i when Mi(w) < 0. We note that Mi(w) is specific to a particular class i. We need to compute the overall mutual information as a function of the mutual information of the word w with the different classes.
These are defined with the use of the average and maximum values of Mi(w) over the different classes:

\[ M_{avg}(w) = \sum_{i=1}^{k} P_i \cdot M_i(w) \]

\[ M_{max}(w) = \max_i \{ M_i(w) \} \]

Either of these measures may be used in order to determine the relevance of the word w. The second measure is particularly useful when it is more important to determine high levels of positive correlation of the word w with any one of the classes.

2.4 χ2-Statistic

The χ2-statistic is a different way to compute the lack of independence between the word w and a particular class i. Let n be the total number of documents in the collection, pi(w) be the conditional probability of class i for documents which contain w, Pi be the global fraction of documents belonging to class i, and F(w) be the global fraction of documents which contain the word w. The χ2-statistic between word w and class i is defined as follows:

\[ \chi_i^2(w) = \frac{n \cdot F(w)^2 \cdot (p_i(w) - P_i)^2}{F(w) \cdot (1 - F(w)) \cdot P_i \cdot (1 - P_i)} \qquad (6.6) \]

As in the case of the mutual information, we can compute a global χ2-statistic from the class-specific values. We can use either the average or the maximum value in order to create the composite value:

\[ \chi_{avg}^2(w) = \sum_{i=1}^{k} P_i \cdot \chi_i^2(w) \]

\[ \chi_{max}^2(w) = \max_i \chi_i^2(w) \]

We note that the χ2-statistic and mutual information are different ways of measuring the correlation between terms and categories. One major advantage of the χ2-statistic over the mutual information measure is that it is a normalized value, and therefore these values are more comparable across terms in the same category.

2.5 Feature Transformation Methods: Supervised LSI

While feature selection attempts to reduce the dimensionality of the data by picking from the original set of attributes, feature transformation methods create a new (and smaller) set of features as a function of the original set of features. A typical example of such a feature transformation method is Latent Semantic Indexing (LSI) [38], and its probabilistic variant PLSA [57]. The LSI method transforms the text space of a few hundred thousand word features to a new axis-system (of size about a few hundred), whose dimensions are linear combinations of the original word features. In order to achieve this goal, Principal Component Analysis techniques [69] are used to determine the axis-system which retains the greatest level of information about the variations in the underlying attribute values. The main disadvantage of using techniques such as LSI is that they are unsupervised techniques which are blind to the underlying class-distribution. Thus, the features found by LSI are not necessarily the directions along which the class-distribution of the underlying documents can be best separated. A modest level of success has been obtained in improving classification accuracy by using boosting techniques in conjunction with the conceptual features obtained from the unsupervised pLSA method [17]. A more recent study has systematically compared pLSA and LDA (a Bayesian version of pLSA) in terms of their effectiveness in transforming features for text categorization, and has drawn a similar conclusion, finding that pLSA and LDA tend to perform similarly. A number of techniques have also been proposed to perform feature transformation by using the class labels for effective supervision. The most natural method is to adapt LSI in order to make it work more effectively for the supervised case. A number of different methods have been proposed in this direction.
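Before turning to the supervised adaptations discussed next, the following is a minimal sketch of the unsupervised LSI step that they build upon: a truncated SVD of the weighted document-term matrix, which reduces a vocabulary-sized feature space to a few hundred latent dimensions. The scikit-learn calls and the choice of 300 components are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

def lsi_features(train_docs, test_docs, n_components=300):
    """Project documents onto an LSI (truncated SVD) axis system."""
    # Term weighting followed by a rank-k SVD of the document-term matrix.
    lsi = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        TruncatedSVD(n_components=n_components, random_state=0),
    )
    X_train = lsi.fit_transform(train_docs)   # axis system learned on training data
    X_test = lsi.transform(test_docs)         # test data projected onto the same axes
    return X_train, X_test
```

The supervised variants below modify this basic step, for example by performing it separately on each class or by adding artificial class-label terms to the documents before the decomposition.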
One common approach is to perform local LSI on the subsets of data representing the individual classes, and to identify the discriminative eigenvectors from the different reductions with the use of an iterative approach [123]. This method is known as SLSI (Supervised Latent Semantic Indexing). The advantages of the method seem to be modest, because the experiments in [123] show only limited improvements over a standard SVM classifier which did not use any dimensionality reduction. The work in [129] uses a combination of class-specific LSI and global analysis. As in the case of [123], class-specific LSI representations are created. Test documents are compared against each LSI representation in order to create the most discriminative reduced space. One problem with this approach is that the different local LSI representations use different subspaces, and therefore it is difficult to compare the similarities of the different documents across the different subspaces. Furthermore, both the methods in [123, 129] tend to be computationally expensive. A method called sprinkling is proposed in [21], in which artificial terms corresponding to the class labels are added to (or "sprinkled" into) the documents. In other words, we create a term corresponding to the class label, and add it to the document. LSI is then performed on the document collection with these added terms. The sprinkled terms can then be removed from the representation, once the eigenvectors have been determined. The sprinkled terms help in making the LSI more sensitive to the class distribution during the reduction process. It has also been proposed in [21] that it can be generally useful to make the sprinkling process adaptive, in which all classes are not necessarily treated equally, but the relationships between the classes are used in order to regulate the sprinkling process. Furthermore, methods have also been proposed in [21] to make the sprinkling process adaptive to the use of a particular kind of classifier.

2.6 Supervised Clustering for Dimensionality Reduction

One technique which is commonly used for feature transformation is that of text clustering [7, 71, 83, 121]. In these techniques, the clusters are constructed from the underlying text collection, with the use of supervision from the class distribution. The exception is [83], in which supervision is not used. In the simplest case, each class can be treated as a separate cluster, though better results may be obtained by using the classes for supervision of the clustering process. The frequently occurring words in these supervised clusters can be used in order to create the new set of dimensions. The classification can be performed with respect to this new feature representation. One advantage of such an approach is that it retains interpretability with respect to the original words of the document collection. The disadvantage is that the optimum directions of separation may not necessarily be represented in the form of clusters of words. Furthermore, the underlying axes are not necessarily orthonormal to one another. The use of supervised methods [1, 7, 71, 121] has generally led to good results, either in terms of improved classification accuracy, or in terms of significant performance gains at the expense of a small reduction in accuracy.
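A minimal sketch of the simplest variant described above, in which each class is treated as a single word cluster and documents are re-represented by one aggregate count per cluster, is given below. It is intended only to illustrate the idea of cluster-based feature compression, not to reproduce any particular method from [1, 7, 71, 121]; all names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def class_cluster_features(train_docs, train_labels, docs):
    """Map documents to k features: one aggregate count per word cluster,
    where each cluster is the set of words most associated with one class."""
    vec = CountVectorizer(stop_words="english").fit(train_docs)
    X_train = vec.transform(train_docs).toarray()
    classes = sorted(set(train_labels))
    y = np.asarray(train_labels)

    # Assign each word to the class (cluster) in which it occurs most often;
    # a refinement could keep only the most frequent words of each cluster.
    class_counts = np.vstack([X_train[y == c].sum(axis=0) for c in classes])
    word_cluster = class_counts.argmax(axis=0)        # cluster id per word

    # Indicator matrix (words x clusters), then sum counts within clusters.
    indicator = np.equal.outer(word_cluster,
                               np.arange(len(classes))).astype(float)
    X = vec.transform(docs).toarray()
    return X @ indicator                              # documents x clusters
```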
The results with the use of unsupervised clustering [83, 87] are mixed. For example, the work in [83] suggests that the use of unsupervised term-clusters and phrases is generally not helpful for the classification process. The key observation in [83] is that the loss of granularity associated with the use of phrases and term clusters is not necessarily advantageous for the classification process. The work in [8] has shown that the use of the information bottleneck method for feature distributional clustering can create clustered pseudo-word representations which are quite effective for text classification.

2.7 Linear Discriminant Analysis

Another method for feature transformation is the use of linear discriminants, which explicitly try to construct directions in the feature space along which the different classes are best separated. A common method is Fisher's linear discriminant [46]. The main idea in Fisher's discriminant method is to determine the directions in the data along which the points are as well separated as possible. The subspace of lower dimensionality is constructed by iteratively finding such unit vectors αi in the data, where αi is determined in the ith iteration. We would also like to ensure that the different values of αi are orthonormal to one another. In each step, we determine this vector αi by discriminant analysis, and project the data onto the remaining orthonormal subspace. The next vector αi+1 is found in this orthonormal subspace. The quality of the vector αi is measured by an objective function which measures the separation of the different classes. This objective function reduces in each iteration, since the value of αi in a given iteration is the optimum discriminant in that subspace, and the vector found in the next iteration is the optimal one from a smaller search space. The process of finding linear discriminants is continued until the class separation, as measured by the objective function, falls below a given threshold for the vector determined in the current iteration. The power of such a dimensionality reduction approach has been illustrated in [18], in which it has been shown that a simple decision tree classifier can perform much more effectively on this transformed data, as compared to more sophisticated classifiers.

Next, we discuss how the Fisher's discriminant is actually constructed. First, we will set up the objective function J(α), which determines the level of separation of the different classes along a given direction (unit vector) α. This sets up the crisp optimization problem of determining the value of α which maximizes J(α). For simplicity, let us assume the case of binary classes. Let D1 and D2 be the two sets of documents belonging to the two classes. Then, the projection of a document X ∈ D1 ∪ D2 along α is given by X · α. Therefore, the squared class separation S(D1, D2, α) along the direction α is given by:

\[ S(D_1, D_2, \alpha) = \left( \sum_{X \in D_1} \frac{\alpha \cdot X}{|D_1|} - \sum_{X \in D_2} \frac{\alpha \cdot X}{|D_2|} \right)^2 \qquad (6.7) \]

In addition, we need to normalize this absolute class separation with the use of the underlying class variances. Let Var(D1, α) and Var(D2, α) be the individual class variances along the direction α. In other words, we have:

\[ Var(D_1, \alpha) = \sum_{X \in D_1} \frac{(X \cdot \alpha)^2}{|D_1|} - \left( \sum_{X \in D_1} \frac{X \cdot \alpha}{|D_1|} \right)^2 \qquad (6.8) \]

The value of Var(D2, α) can be defined in a similar way. Then, the normalized class-separation J(α) is defined as follows:

\[ J(\alpha) = \frac{S(D_1, D_2, \alpha)}{Var(D_1, \alpha) + Var(D_2, \alpha)} \qquad (6.9) \]

The optimal value of α needs to be determined subject to the constraint that α is a unit vector.
Let μ1 and μ2 be the means of the two data sets D1 and D2, and let C1 and C2 be the corresponding covariance matrices. It can be shown that the optimal (unscaled) direction α = α∗ can be expressed in closed form, and is given by the following:

\[ \alpha^* = \left( \frac{C_1 + C_2}{2} \right)^{-1} (\mu_1 - \mu_2) \qquad (6.10) \]

The main difficulty in computing the above equation is that it requires the inversion of the covariance matrix, which is sparse and computationally difficult to invert in the high-dimensional text domain. Therefore, a gradient descent approach can be used in order to determine the value of α in a more computationally effective way. Details of the approach are presented in [18].

Another related method which attempts to determine projection directions that maximize the topical differences between different classes is the Topical Difference Factor Analysis method proposed in [72]. The problem has been shown to be solvable as a generalized eigenvalue problem. The method was used in conjunction with a k-nearest neighbor classifier, and it was shown that the use of this approach significantly improves the accuracy over a classifier which uses the original set of features.

2.8 Generalized Singular Value Decomposition

While the method discussed above finds one vector αi at a time in order to determine the relevant dimension transformation, it is possible to be much more direct in finding the optimal subspaces simultaneously by using a generalized version of dimensionality reduction [58, 59]. It is important to note that this method has really been proposed in [58, 59] as an unsupervised method which preserves the underlying clustering structure, assuming the data has already been clustered in a pre-processing phase. Thus, the generalized dimensionality reduction method has been proposed as a much more aggressive dimensionality reduction technique, which preserves the underlying clustering structure rather than the individual points. This method can however also be used as a supervised technique in which the different classes are used as input to the dimensionality reduction algorithm, instead of the clusters constructed in the pre-processing phase [131]. This method is known as the Optimal Orthogonal Centroid Feature Selection Algorithm (OCFS), and it directly targets the maximization of inter-class scatter. The algorithm is shown to have promising results for supervised feature selection in [131].

2.9 Interaction of Feature Selection with Classification

Since the classification and feature selection processes are dependent upon one another, it is interesting to test how the feature selection process interacts with the underlying classification algorithms. In this context, two questions are relevant:

Can the feature-specific insights obtained from the intermediate results of some of the classification algorithms be used for creating feature selection methods that can be used more generally by other classification algorithms?

Do the different feature selection methods work better or worse with different kinds of classifiers?

Both these issues were explored in some detail in [99]. In regard to the first question, it was shown in [99] that feature selection derived from linear classifiers provided very effective results. In regard to the second question, it was shown in [99] that the sophistication of the feature selection process itself was more important than the specific pairing between the feature selection process and the classifier.
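A minimal sketch of the weight-based selection strategy referenced above, and elaborated in the next paragraph, is as follows: a linear SVM is trained on the training corpus and the words with the largest absolute coefficients are retained for use by any downstream classifier. The scikit-learn calls and the cut-off parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def select_features_by_svm_weights(train_docs, train_labels, top_k=1000):
    """Rank words by |a_j| from a linear SVM and keep the top_k of them."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(train_docs)
    svm = LinearSVC().fit(X, train_labels)

    # For multi-class problems coef_ has one weight vector per class;
    # score each word by its largest absolute coefficient over classes.
    scores = np.abs(svm.coef_).max(axis=0)
    keep = np.argsort(scores)[::-1][:top_k]
    vocab = vec.get_feature_names_out()
    return vocab[keep]        # selected words, usable by any other classifier
```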
Linear Classifiers are those for which the output of the linear predictor is defined to be p = A · X + b, where X = (x1 . . . xn) is the normalized document word frequency vector, A = (a1 . . . an) is a vector of linear coefficients with the same dimensionality as the feature space, and b is a scalar. Both the basic neural network and the basic SVM classifiers [65] (which will be discussed later in this chapter) belong to this category. The idea here is that if the coefficient ai is close to zero, then the corresponding feature does not have a significant effect on the classification process. On the other hand, since large absolute values of ai may significantly influence the classification process, such features should be selected for classification. In the context of the SVM method, which attempts to determine linear planes of separation between the different classes, the vector A is essentially the normal vector to the corresponding plane of separation between the different classes. This intuitively explains the choice of selecting features with large values of |ai|. It was shown in [99] that this class of feature selection methods was quite robust, and performed well even for classifiers such as the Naive Bayes method, which were unrelated to the linear classifiers from which these features were derived. Further discussions on how SVM and maximum margin techniques can be used for feature selection may be found in [51, 56].

3. Decision Tree Classifiers

A decision tree [106] is essentially a hierarchical decomposition of the (training) data space, in which a predicate or a condition on the attribute value is used in order to divide the data space hierarchically. In the context of text data, such predicates are typically conditions on the presence or absence of one or more words in the document. The division of the data space is performed recursively in the decision tree, until the leaf nodes contain a certain minimum number of records, or satisfy certain conditions on class purity. The majority class label (or cost-weighted majority label) in the leaf node is used for the purposes of classification. For a given test instance, we apply the sequence of predicates at the nodes in order to traverse a path of the tree in top-down fashion and determine the relevant leaf node. In order to further reduce overfitting, some of the nodes may be pruned by holding out a part of the data, which is not used to construct the tree. The portion of the data which is held out is used in order to determine whether or not the constructed leaf node should be pruned. In particular, if the class distribution in the training data (for decision tree construction) is very different from the class distribution in the training data which is used for pruning, then it is assumed that the node overfits the training data. Such a node can be pruned. A detailed discussion of decision tree methods may be found in [15, 42, 62, 106].

In the particular case of text data, the predicates for the decision tree nodes are typically defined in terms of the terms in the underlying text collection. For example, a node may be partitioned into its children nodes depending upon the presence or absence of a particular term in the document. We note that different nodes at the same level of the tree may use different terms for the partitioning process. Many other kinds of predicates are possible.
It may not be necessary to use individual terms for partitioning, but one may measure the similarity of documents to correlated sets of terms. These correlated sets of terms may be used to further partition the document collection, based on the similarity of the document to them. The different kinds of splits are as follows: Single Attribute Splits: In this case, we use the presence or absence of particular words (or even phrases) at a particular node in the tree in order to perform the split. At any given level, we pick the word which provides the maximum discrimination between the different classes. Measures such as the gini-index or information gain can be used in order to determine the level of entropy. For example, the DT-min10 algorithm [81] is based on this approach. Similarity-based multi-attribute split: In this case, we use documents (or meta-documents such as frequent word clusters), and use the similarity of the documents to these words clusters in order to perform the split. For the selected word cluster, the documents are further partitioned into groups by rank ordering the documents by similarity value, and splitting at a particular threshold. We select the word-cluster for which rank-ordering by similarity provides the best separation between the different classes. Discriminant-based multi-attribute split: For the multi-attribute case, a natural choice for performing the split is to use discriminants such as the Fisher discriminant for performing the split. Such discriminants provide the directions in the data along which the classes are best separated. The documents are projected on this discriminant vector for rank ordering, and then split at a particular coordinate. The choice of split point is picked in order to maximize the discrimination between the different classes. 178 MINING TEXT DATA The work in [18] uses a discriminant-based split, though this is done indirectly because of the use of a feature transformation to the discriminant representation, before building the classifier. Some of the earliest implementation of classifiers may be found in [80, 81, 87, 127]. The last of these is really a rule-based classifier, which can be interpreted either as a decision tree or a rule-based classifier. Most of the decision tree implementations in the text literature tend to be small variations on standard packages such as ID3 and C4.5, in order to adapt the model to text classification. Many of these classifiers are typically designed as baselines for comparison with other learning models [65]. A well known implementation of the decision tree classifier is based on the C4.5 taxonomy of algorithms [106] is presented in [87]. More specifically, the work in [87] uses the successor to the C4.5 algorithm, which is also known as the C5 algorithm. This algorithm uses single-attribute splits at each node, where the feature with the highest information gain [31] is used for the purpose of the split. Decision trees have also been used in conjunction with boosting techniques. An adaptive boosting technique [48] is used in order to improve the accuracy of classification. In this technique, we use n different classifiers. The ith classifier is constructed by examining the errors of the (i − 1)th classifier. A voting scheme is applied among these classifiers in order to report the final label. Other boosting techniques for improving decision tree classification accuracy are proposed in [116]. 
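The single-attribute splits described above correspond closely to what an off-the-shelf decision tree learner produces when it is given binary term-presence features and an entropy (information gain) splitting criterion. The sketch below is a generic illustration of that setup under these assumptions, not a reimplementation of any specific system such as DT-min10 or C4.5/C5.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

def text_decision_tree(train_docs, train_labels):
    """Decision tree over the presence/absence of individual terms."""
    model = make_pipeline(
        # binary=True yields 0/1 term presence, so every split is a test
        # on the presence or absence of a single word.
        CountVectorizer(binary=True, stop_words="english"),
        # criterion="entropy" selects the split with the highest
        # information gain; min_samples_leaf limits overfitting in lieu
        # of the hold-out pruning discussed in the text.
        DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5),
    )
    return model.fit(train_docs, train_labels)

# Usage: clf = text_decision_tree(train_docs, train_labels)
#        predictions = clf.predict(test_docs)
```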
The work in [43] presents a decision tree algorithm based on the Bayesian approach developed in [22]. In this classifier, the decision tree is grown by recursive greedy splits, where the splits are chosen using Bayesian posterior probability of model structure. The structural prior penalizes additional model parameters at each node. The output of the process is a class probability rather than a deterministic class label for the test instance. 4. Rule-based Classifiers Decision trees are also generally related to rule-based classifiers. In rule-based classifiers, the data space is modeled with a set of rules, in which the left hand side is a condition on the underlying feature set, and the right hand side is the class label. The rule set is essentially the model which is generated from the training data. For a given test instance, we determine the set of rules for which the test instance satisfies the condition on the left hand side of the rule. We determine the predicted class label as a function of the class labels of the rules which are satisfied by the test instance. We will discuss more on this issue slightly later. A Survey of Text Classification Algorithms 179 In its most general form, the left hand side of the rule is a boolean condition, which is expressed in Disjunctive Normal Form (DNF). However, in most cases, the condition on the left hand side is much simpler and represents a set of terms, all of which must be present in the document for the condition to be satisfied. The absence of terms is rarely used, because such rules are not likely to be very informative for sparse text data, in which most words in the lexicon will typically not be present in it by default (sparseness property). Also, while the set intersection of conditions on term presence is used often, the union of such conditions is rarely used in a single rule. This is because such rules can be split into two separate rules, each of which is more informative on its own. For example, the rule Honda ∪ Toyota ⇒ Cars can be replaced by two separate rules Honda ⇒ Cars and Toyota ⇒ Cars without any loss of information. In fact, since the confidence of each of the two rules can now be measured separately, this can be more useful. On the other hand, the rule Honda ∩ Toyota ⇒ Cars is certainly much more informative than the individual rules. Thus, in practice, for sparse data sets such as text, rules are much more likely to be expressed as a simple conjunction of conditions on term presence. We note that decision trees and decision rules both tend to encode rules on the feature space, except that the decision tree tends to achieve this goal with a hierarchical approach. In fact, the original work on decision tree construction in C4.5 [106] studied the decision tree problem and decision rule problem within a single framework. This is because a particular path in the decision tree can be considered a rule for classification of the text instance. The main difference is that the decision tree framework is a strict hierarchical partitioning of the data space, whereas rule-based classifiers allow for overlaps in the decision space. The general principle is to create a rule set, such that all points in the decision space are covered by at least one rule. In most cases, this is achieved by generating a set of targeted rules which are related to the different classes, and one default catch-all rule, which can cover all the remaining instances. A number of criteria can be used in order to generate the rules from the training data. 
Two of the most common conditions which are used for rule generation are those of support and confidence. These conditions are common to all rule-based pattern classifiers [88] and may be defined as follows: Support: This quantifies the absolute number of instances in the training data set which are relevant to the rule. For example, in a corpus containing 100,000 documents, a rule in which both the left-hand set and right-hand side are satisfied by 50,000 documents 180 MINING TEXT DATA is more important than a rule which is satisfied by 20 documents. Essentially, this quantifies the statistical volume which is associated with the rule. However, it does not encode the strength of the rule. Confidence: This quantifies the conditional probability that the right hand side of the rule is satisfied, if the left-hand side is satisfied. This is a more direct measure of the strength of the underlying rule. We note that the afore-mentioned measures are not the only measures which are possible, but are widely used in the data mining and machine learning literature [88] for both textual and non-textual data, because of their intuitive nature and simplicity of interpretation. One criticism of the above measures is that they do not normalize for the a-priori presence of different terms and features, and are therefore prone to misinterpretation, when the feature distribution or class-distribution in the underlying data set is skewed. The training phase constructs all the rules, which are based on measures such as the above. For a given test instance, we determine all the rules which are relevant to the test instance. Since we allow overlaps, it is possible that more than one rule may be relevant to the test instance. If the class labels on the right hand sides of all these rules are the same, then it is easy to pick this class as the relevant label for the test instance. On the other hand, the problem becomes more challenging when there are conflicts between these different rules. A variety of different methods are used to rank-order the different rules [88], and report the most relevant rule as a function of these different rules. For example, a common approach is to rank-order the rules by their confidence, and pick the top-k rules as the most relevant. The class label on the right-hand side of the most number of these rules is reported as the relevant one. Am interesting rule-based classifier for the case of text data has been proposed in [5]. This technique uses an iterative methodology, which was first proposed in [128] for generating rules. Specifically, the method determines the single best rule related to any particular class in the training data. The best rule is defined in terms of the confidence of the rule, as defined above. This rule along with its corresponding instances are removed from the training data set. This approach is continuously repeated, until it is no longer possible to find strong rules in the training data, and complete predictive value is achieved. The transformation of decision trees to rule-based classifiers is discussed generally in [106], and for the particular case of text data in [68]. For each path in the decision tree a rule can be generated, which repre- A Survey of Text Classification Algorithms 181 sents the conjunction of the predicates along that path. 
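Returning to the support and confidence measures defined above, the short sketch below computes both quantities for a candidate rule of the form {set of terms} ⇒ class over a toy labeled corpus; the corpus and the rule are hypothetical and serve only to make the definitions concrete.

```python
def support_and_confidence(docs, labels, rule_terms, rule_class):
    """docs: list of token sets; labels: one class label per document.
    A document satisfies the left-hand side if it contains all rule_terms."""
    lhs = [terms.issuperset(rule_terms) for terms in docs]
    n_lhs = sum(lhs)                                   # documents satisfying the left-hand side
    n_both = sum(m and (lab == rule_class)             # ...that also carry the rule's class label
                 for m, lab in zip(lhs, labels))
    support = n_both                                   # absolute number of instances relevant to the rule
    confidence = n_both / n_lhs if n_lhs else 0.0      # P(right-hand side | left-hand side)
    return support, confidence

docs = [{"honda", "toyota", "dealer"}, {"honda", "price"},
        {"stock", "price", "market"}, {"toyota", "recall"}]
labels = ["cars", "cars", "finance", "cars"]
print(support_and_confidence(docs, labels, {"honda"}, "cars"))   # -> (2, 1.0)
```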
One advantage of the rule-based classifier over a decision tree is that it is not restricted to a strict hierarchical partitioning of the feature space, and it allows for overlaps and inconsistencies among the different rules. Therefore, if a new set of training examples are encountered, which are related to a new class or new part of the feature space, then it is relatively easy to modify the rule set for these new examples. Furthermore, rule-based classifiers also allow for a tremendous interpretability of the underlying decision space. In cases in which domain-specific expert knowledge is known, it is possible to encode this into the classification process by manual addition of rules. In many practical scenarios, rule-based techniques are more commonly used because of their ease of maintenance and interpretability. One of the most common rule-based techniques is the RIPPER technique discussed in [26–28]. The RIPPER technique essentially determines frequent combinations of words which are related to a particular class. The RIPPER method has been shown to be especially effective in scenarios where the number of training examples is relatively small [25]. Another method called sleeping experts [26, 49] generates rules which take the placement of the words in the documents into account. Most of the classifiers such as RIPPER [26–28] treat documents as set-valued objects, and generate rules based on the co-presence of the words in the documents. The rules in sleeping experts are different from most of the other classifiers in this respect. In this case [49, 26], the left hand side of the rule consists of a sparse phrase, which is a group of words close to one another in the document (though not necessarily completely sequential). Each such rule has a weight, which depends upon its classification specificity in the training data. For a given test example, we determine the sparse phrases which are present in it, and perform the classification by combining the weights of the different rules that are fired. The sleeping experts and RIPPER systems have been compared in [26], and have been shown to have excellent performance on a variety of text collections. 5. Probabilistic and Naive Bayes Classifiers Probabilistic classifiers are designed to use an implicit mixture model for generation of the underlying documents. This mixture model typically assumes that each class is a component of the mixture. Each mixture component is essentially a generative model, which provides the probability of sampling a particular term for that component or class. This is why this kind of classifiers are often also called generative classifier. The naive Bayes classifier is perhaps the simplest and also the most 182 MINING TEXT DATA commonly used generative classifers. It models the distribution of the documents in each class using a probabilistic model with independence assumptions about the distributions of different terms. Two classes of models are commonly used for naive Bayes classification. Both models essentially compute the posterior probability of a class, based on the distribution of the words in the document. These models ignore the actual position of the words in the document, and work with the “bag of words” assumption. 
The major difference between these two models is the assumption in terms of taking (or not taking) word frequencies into account, and the corresponding approach for sampling the probability space: Multivariate Bernoulli Model: In this model, we use the presence or absence of words in a text document as features to represent a document. Thus, the frequencies of the words are not used for the modeling a document, and the word features in the text are assumed to be binary, with the two values indicating presence or absence of a word in text. Since the features to be modeled are binary, the model for documents in each class is a multivariate Bernoulli model. Multinomial Model: In this model, we captuer the frequencies of terms in a document by representing a document with a bag of words. The documents in each class can then be modeled as samples drawn from a multinomial word distribution. As a result, the conditional probability of a document given a class is simply a product of the probability of each observed word in the corresponding class. No matter how we model the documents in each class (be it a multivariate Bernoulli model or a multinomial model), the component class models (i.e., generative models for documents in each class) can be used in conjunction with the Bayes rule to compute the posterior probability of the class for a given document, and the class with the highest posterior probability can then be assigned to the document. There has been considerable confusion in the literature on the differences between the multivariate Bernoulli model and the multinomial model. A good exposition of the differences between these two models may be found in [94]. In the following, we describe these two models in more detail. A Survey of Text Classification Algorithms 183 5.1 Bernoulli Multivariate Model This class of techniques treats a document as a set of distinct words with no frequency information, in which an element (term) may be either present or absent. The seminal work on this approach may be found in [82]. Let us assume that the lexicon from which the terms are drawn are denoted by V = {t1 . . . tn}. Let us assume that the bag-of-words (or text document) in question contains the terms Q = {ti1 . . . tim }, and the class is drawn from {1 . . . k}. Then, our goal is to model the posterior probability that the document (which is assumed to be generated from the term distributions of one of the classes) belongs to class i, given that it contains the terms Q = {ti1 . . . tim }. The best way to understand the Bayes method is by understanding it as a sampling/generative process from the underlying mixture model of classes. The Bayes probability of class i can be modeled by sampling a set of terms T from the term distribution of the classes: If we sampled a term set T of any size from the term distribution of one of the randomly chosen classes, and the final outcome is the set Q, then what is the posterior probability that we had originally picked class i for sampling? The a-priori probability of picking class i is equal to its fractional presence in the collection. We denote the class of the sampled set T by CT and the corresponding posterior probability by P(CT = i|T = Q). This is essentially what we are trying to find. It is important to note that since we do not allow replacement, we are essentially picking a subset of terms from V with no frequencies attached to the picked terms. Therefore, the set Q may not contain duplicate elements. 
Under the naive Bayes assumption of independence between terms, this is essentially equivalent to either selecting or not selecting each term with a probability that depends upon the underlying term distribution. Furthermore, it is also important to note that this model has no restriction on the number of terms picked. As we will see later, these assumptions are the key differences with the multinomial Bayes model. The Bayes approach classifies a given set Q based on the posterior probability that Q is a sample from the data distribution of class i, i.e., P(CT = i|T = Q), and it requires us to compute the following two probabilities in order to achieve this:

1. What is the prior probability that a set T is a sample from the term distribution of class i? This probability is denoted by P(CT = i).

2. If we sampled a set T of any size from the term distribution of class i, then what is the probability that our sample is the set Q? This probability is denoted by P(T = Q|CT = i).

We will now provide a more mathematical description of Bayes modeling. In other words, we wish to model P(CT = i|Q is sampled). We can use the Bayes rule in order to write this conditional probability in a way that can be estimated more easily from the underlying corpus. In other words, we can simplify as follows:

P(C_T = i \mid T = Q) = \frac{P(C_T = i) \cdot P(T = Q \mid C_T = i)}{P(T = Q)} = \frac{P(C_T = i) \cdot \prod_{t_j \in Q} P(t_j \in T \mid C_T = i) \cdot \prod_{t_j \notin Q} \left(1 - P(t_j \in T \mid C_T = i)\right)}{P(T = Q)}

We note that the last condition of the above sequence uses the naive independence assumption, because we are assuming that the probabilities of occurrence of the different terms are independent of one another. This is practically necessary, in order to transform the probability equations to a form which can be estimated from the underlying data. The class assigned to Q is the one with the highest posterior probability given Q. It is easy to see that this decision is not affected by the denominator, which is the marginal probability of observing Q. That is, we will assign the following class to Q:

\hat{i} = \arg\max_i P(C_T = i \mid T = Q) = \arg\max_i \; P(C_T = i) \cdot \prod_{t_j \in Q} P(t_j \in T \mid C_T = i) \cdot \prod_{t_j \notin Q} \left(1 - P(t_j \in T \mid C_T = i)\right)

It is important to note that all terms in the right hand side of the last equation can be estimated from the training corpus. The value of P(CT = i) is estimated as the global fraction of documents belonging to class i, the value of P(tj ∈ T|CT = i) is the fraction of documents in the ith class which contain term tj, and the value of P(tj ∈ T) is the fraction of documents (in the whole corpus) containing the term tj. We note that all of the above are maximum likelihood estimates of the corresponding probabilities. In practice, Laplacian smoothing [124] is used, in which small values are added to the frequencies of terms in order to avoid zero probabilities of sparsely present terms. In most applications of the Bayes classifier, we only care about the identity of the class with the highest probability value, rather than the actual probability value associated with it, which is why we do not need to compute the normalizer P(T = Q). In fact, in the case of binary classes, a number of simplifications are possible in computing these Bayes "probability" values by using the logarithm of the Bayes expression, and removing a number of terms which do not affect the ordering of class probabilities. We refer the reader to [108] for details.
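The estimation and scoring steps above are straightforward to implement directly. The sketch below estimates P(CT = i) and P(tj ∈ T|CT = i) from a toy labeled corpus with Laplacian smoothing, and scores a test document in log space under the multivariate Bernoulli model; the data are hypothetical and the code is a minimal illustration of the formulas, not the implementation of [82].

```python
import math
from collections import defaultdict

def train_bernoulli_nb(docs, labels, vocab, alpha=1.0):
    """docs: list of term sets. Returns class priors and smoothed P(t in T | class)."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = defaultdict(dict)
    for c in classes:
        class_docs = [d for d, lab in zip(docs, labels) if lab == c]
        for t in vocab:
            # Laplacian smoothing avoids zero probabilities for sparsely present terms.
            cond[c][t] = (sum(t in d for d in class_docs) + alpha) / (len(class_docs) + 2 * alpha)
    return prior, cond

def classify_bernoulli_nb(query, vocab, prior, cond):
    """Log of the unnormalized posterior, using both term presence and absence."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in vocab:
            p = cond[c][t]
            score += math.log(p) if t in query else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get), scores

docs = [{"cheap", "offer"}, {"cheap", "pills"}, {"meeting", "budget"}, {"meeting", "schedule"}]
labels = ["spam", "spam", "work", "work"]
vocab = sorted(set().union(*docs))
prior, cond = train_bernoulli_nb(docs, labels, vocab)
print(classify_bernoulli_nb({"cheap", "meeting"}, vocab, prior, cond)[0])
```

Working in log space mirrors the simplification mentioned above for comparing class "probability" values without computing the normalizer.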
Although for classification, we do not need to compute P(T = Q), some applications necessitate the exact computation of the posterior probability P(CT = i|T = Q). For example, in the case of supervised anomaly detection (or rare class detection), the exact posterior probability value P(CT = i|T = Q) is needed in order to fairly compare the probability value over different test instances, and rank them for their anomalous nature. In such cases, we would need to compute P(T = Q). One way to achieve this is simply to take a sum over all the classes:

P(T = Q) = \sum_i P(T = Q \mid C_T = i) \cdot P(C_T = i)

This is based on the conditional independence of features for each class. Since the parameter values are estimated for each class separately, we may face the problem of data sparseness. An alternative way of computing it, which may alleviate the data sparseness problem, is to further make the assumption of (global) independence of terms, and compute it as:

P(T = Q) = \prod_{t_j \in Q} P(t_j \in T) \cdot \prod_{t_j \notin Q} \left(1 - P(t_j \in T)\right)

where the term probabilities are based on global term distributions in all the classes.

A natural question arises, as to whether it is possible to design a Bayes classifier which does not use the naive assumption, and models the dependencies between the terms during the classification process. Methods which generalize the naive Bayes classifier by not using the independence assumption do not work well because of the higher computational costs and the inability to estimate the parameters accurately and robustly in the presence of limited data. The most interesting line of work in relaxing the independence assumption is provided in [112]. In this work, the tradeoffs in the spectrum of allowing different levels of dependence among the terms have been explored. On the one extreme, an assumption of complete dependence results in a Bayesian network model which turns out to be computationally very expensive. On the other hand, it has been shown that allowing limited levels of dependence can provide good tradeoffs between accuracy and computational costs. We note that while the independence assumption is a practical approximation, it has been shown in [29, 39] that the approach does have some theoretical merit. Indeed, extensive experimental tests have tended to show that the naive classifier works quite well in practice. A number of papers [19, 64, 74, 79, 108, 113] have used the naive Bayes approach for classification in a number of different application domains. The classifier has also been extended to modeling temporally aware training data, in which the importance of a document may decay with time [114].

As in the case of other statistical classifiers, the naive Bayes classifier [113] can easily incorporate domain-specific knowledge into the classification process. The particular domain that the work in [113] addresses is that of filtering junk email. For such a problem, we often have a lot of additional domain knowledge which helps us determine whether a particular email message is junk or not. For example, some common characteristics which make an email more or less likely to be junk are as follows:

The domain of the sender, such as .edu or .com, can make an email more or less likely to be junk.

Phrases such as "Free Money" or over-emphasized punctuation such as "!!!" can make an email more likely to be junk.

Whether the recipient of the message was a particular user, or a mailing list.
The Bayes method provides a natural way to incorporate such additional information into the classification process, by creating new features for each of these characteristics. The standard Bayes technique is then used in conjunction with this augmented representation for classification. The Bayes technique has also been used in conjunction with the incorporation of other kinds of domain knowledge, such as the incorporation of hyperlink information into the classification process [20, 104].

The Bayes method is also suited to hierarchical classification, when the training data is arranged in a taxonomy of topics. For example, the Open Directory Project (ODP), the Yahoo! taxonomy, and a variety of news sites have vast collections of documents which are arranged into hierarchical groups. The hierarchical structure of the topics can be exploited to perform more effective classification [19, 74], because it has been observed that context-sensitive feature selection can provide more useful classification results. In hierarchical classification, a Bayes classifier is built at each node, which then provides us with the next branch to follow for classification purposes. Two such methods are proposed in [19, 74], in which node-specific features are used for the classification process. Clearly, far fewer features are required at a particular node in the hierarchy, because the features which are picked are relevant to that branch. An example in [74] suggests that a branch of the taxonomy which is related to Computer may have no relationship with the word "cow". These node-specific features are referred to as signatures in [19]. Furthermore, it has been observed in [19] that in a given node, the most discriminative features for a given class may be different from those at its parent nodes. For example, the word "health" may be discriminative for the Yahoo! category @Health, but the word "baby" may be much more discriminative for the category @Health@Nursing. Thus, it is critical to have an appropriate feature selection process at each node of the classification tree.

The methods in [19, 74] use different approaches for this purpose. The work in [74] uses an information-theoretic approach [31] for feature selection which takes into account the dependencies between the attributes [112]. The algorithm greedily eliminates the features one by one so as to least disrupt the conditional class distribution at that node. In [19], the node-specific signatures are computed by calculating the ratio of intra-class variance to inter-class variance for the different words at each node of the tree. We note that this measure is the same as that optimized by the Fisher's discriminant, except that it is applied to the original set of words, rather than solved as a general optimization problem in which arbitrary directions in the data are picked. A Bayesian classifier is constructed at each node in order to determine the appropriate branch. One advantage of these methods is that Bayesian classifiers work much more effectively with a small number of context-sensitive features. Another major difference between the two methods is that the work in [74] uses the Bernoulli model, whereas that in [19] uses the multinomial model, which will be discussed in the next subsection.
This approach in [74] is referred to as the Pachinko Machine classifier and that in [19] is known as TAPER (Taxonomy and Path Enhanced Retrieval System). Other noteworthy methods for hierarchical classification are proposed in [11, 130, 95]. The work [11] addresses two common problems associated with hierarchical text classification: (1) error propagation; (2) non-linear decision surfaces. The problem of error propagation occurs when the classification mistakes made at a parent node are propagated to its children node. This problem was solved in [11] by using cross validation to obtain a training data set for a child node that is more similar 188 MINING TEXT DATA to the actual test data passed to the child node from its parent node than the training data set normally used for training a classifier at the child node. The problem of non-linear decision surfaces refers to that the decision boundary of a category at a higher level is often non-linear (since its members are the union of the members of its children nodes). This problem is addressed by using the tentative class labels obtained at the children nodes as features for use at a parent node. These are general strategies that can be applied to any base classifier, and the experimental results in [11] show that both strategies are effective. 5.2 Multinomial Distribution This class of techniques treats a document as a set of words with frequencies attached to each word. Thus, the set of words is allowed to have duplicate elements. As in the previous case, we assume that the set of words in document is denoted by Q, drawn from the vocabulary set V . The set Q contains the distinct terms {ti1 . . . tim } with associated frequencies F = {Fi1 . . . Fim }. We denote the terms and their frequencies by [Q, F]. The total number of terms in the document (or document length) is denoted by L = m j=1 F(ij). Then, our goal is to model the posterior probability that the document T belongs to class i, given that it contains the terms in Q with the associated frequencies F. The Bayes probability of class i can be modeled by using the following sampling process: If we sampled L terms sequentially from the term distribution of one of the randomly chosen classes (allowing repetitions) to create the term set T, and the final outcome for sampled set T is the set Q with the corresponding frequencies F, then what is the posterior probability that we had originally picked class i for sampling? The a-priori probability of picking class i is equal to its fractional presence in the collection. The aforementioned probability is denoted by P(CT = i|T = [Q, F]). An assumption which is commonly used in these models is that the length of the document is independent of the class label. While it is easily possible to generalize the method, so that the document length is used as a prior, independence is usually assumed for simplicity. As in the previous case, we need to estimate two values in order to compute the Bayes posterior. 1 What is the prior probability that a set T is a sample from the term distribution of class i? This probability is denoted by P(CT = i). A Survey of Text Classification Algorithms 189 2 If we sampled L terms from the term distribution of class i (with repetitions), then what is the probability that our sampled set T is the set Q with associated frequencies F? This probability is denoted by P(T = [Q, F]|CT = i). 
Then, the Bayes rule can be applied to this case as follows:

P(C_T = i \mid T = [Q, F]) = \frac{P(C_T = i) \cdot P(T = [Q, F] \mid C_T = i)}{P(T = [Q, F])} \propto P(C_T = i) \cdot P(T = [Q, F] \mid C_T = i) \quad (6.11)

As in the previous case, it is not necessary to compute the denominator, P(T = [Q, F]), for the purpose of deciding the class label for Q. The value of the probability P(CT = i) can be estimated as the fraction of documents belonging to class i. The computation of P([Q, F]|CT = i) is much more complicated. When we consider the sequential order of the L different samples, the number of possible ways to sample the different terms so as to result in the outcome [Q, F] is given by L! / \prod_{j=1}^{m} F_{i_j}!. The probability of each of these sequences is given by \prod_{t_j \in Q} P(t_j \in T \mid C_T = i)^{F_j}, where F_j denotes the frequency of term t_j in the document, by using the naive independence assumption. Therefore, we have:

P(T = [Q, F] \mid C_T = i) = \frac{L!}{\prod_{j=1}^{m} F_{i_j}!} \cdot \prod_{t_j \in Q} P(t_j \in T \mid C_T = i)^{F_j} \quad (6.12)

We can substitute Equation 6.12 into Equation 6.11 to obtain the class with the highest Bayes posterior probability, where the class priors are computed as in the previous case, and the probabilities P(tj ∈ T|CT = i) can also be easily estimated as previously with Laplacian smoothing [124]. We note that the probabilities of class absence are not present in the above equations because of the way in which the sampling is performed.

A number of different variations of the multinomial model have been proposed in [53, 70, 84, 95, 97, 103]. In the work [95], it is shown that a category hierarchy can be leveraged to improve the estimate of the multinomial parameters in the naive Bayes classifier and thereby significantly improve classification accuracy. The key idea is to apply shrinkage techniques to smooth the parameters for data-sparse child categories with their common parent nodes. As a result, the training data of related categories are essentially "shared" with each other in a weighted manner, which helps improve the robustness and accuracy of parameter estimation when there are insufficient training data for each individual child category. The work in [94] has performed an extensive comparison between the Bernoulli and the multinomial models on different corpora, and the following conclusions were presented:

The multi-variate Bernoulli model can sometimes perform better than the multinomial model at small vocabulary sizes.

The multinomial model outperforms the multi-variate Bernoulli model for large vocabulary sizes, and almost always beats the multi-variate Bernoulli model when the vocabulary size is chosen optimally for both. On average, a 27% reduction in error was reported in [94].

The afore-mentioned results seem to suggest that the two models may have different strengths, and may therefore be useful in different scenarios.

5.3 Mixture Modeling for Text Classification

We note that the afore-mentioned Bayes methods simply assume that each component of the mixture corresponds to the documents belonging to a class. A more general interpretation is one in which the components of the mixture are created by a clustering process, and the class membership probabilities are modeled in terms of this mixture. Mixture modeling is typically used for unsupervised (probabilistic) clustering or topic modeling, though the use of clustering can also help in enhancing the effectiveness of probabilistic classifiers [86, 103]. These methods are particularly useful in cases where the amount of training data is limited.
In particular, clustering can help in the following ways:

The Bayes method implicitly estimates the word probabilities P(ti ∈ T|CT = i) of a large number of terms in terms of their fractional presence in the corresponding component. This is clearly noisy. By treating the clusters as separate entities from the classes, we now only need to relate (a much smaller number of) cluster membership probabilities to class probabilities. This reduces the number of parameters and greatly improves classification accuracy [86].

The use of clustering can help in incorporating unlabeled documents into the training data for classification. The premise is that unlabeled data is much more copiously available than labeled data, and when labeled data is sparse, the unlabeled data should be used in order to assist the classification process. While such unlabeled documents do not contain class-specific information, they do contain a lot of information about the clustering behavior of the underlying data. This can be very useful for more robust modeling [103], when the amount of training data is low. This general approach is also referred to as co-training [9, 13, 37].

The common characteristic of both the methods [86, 103] is that they both use a form of supervised clustering for the classification process. While the goal is quite similar (limited training data), the approach used for this purpose is quite different. We will discuss both of these methods in this section.

In the method discussed in [86], the document corpus is modeled with the use of supervised word clusters. In this case, the k mixture components are clusters which are correlated to, but are distinct from, the k groups of documents belonging to the different classes. The main difference from the Bayes method is that the term probabilities are computed indirectly by using clustering as an intermediate step. For a sampled document T, we denote its class label by CT ∈ {1 . . . k}, and its mixture component by MT ∈ {1 . . . k}. The k different mixture components are essentially word-clusters whose frequencies are generated by using the frequencies of the terms in the k different classes. This ensures that the word clusters for the mixture components are correlated to the classes, but they are not assumed to be drawn from the same distribution. As in the previous case, let us assume that a document contains the set of words Q. Then, we would like to estimate the probability P(T = Q|CT = i) for each class i. An interesting variation of the work in [86] from the Bayes approach is that it does not attempt to determine the posterior probability P(CT = i|T = Q). Rather, it simply reports the class with the highest likelihood P(T = Q|CT = i). This is essentially equivalent to assuming in the Bayes approach that the prior distribution of each class is the same. The other difference of the approach is in terms of how the value of P(T = Q|CT = i) is computed. As before, we need to estimate the value of P(tj ∈ T|CT = i), according to the naive Bayes rule. However, unlike the standard Bayes classifier, this is done very indirectly with the use of mixture modeling. Since the mixture components do not directly correspond to the class, this term can only be estimated by summing up the expected value over all the mixture components:

P(t_j \in T \mid C_T = i) = \sum_{s=1}^{k} P(t_j \in T \mid M_T = s) \cdot P(M_T = s \mid C_T = i) \quad (6.13)

The value of P(tj ∈ T|MT = s) is easy to estimate by using the fractional presence of term tj in the sth mixture component.
The main unknown here are the set of model parameters P(MT = s|CT = i). 192 MINING TEXT DATA Since a total of k classes and k mixture-components are used, this requires the estimation of only k2 model parameters, which is typically quite modest for a small number of classes. An EM-approach has been used in [86] in order to estimate this small number of model parameters in a robust way. It is important to understand that the work in [86] is an interesting combination of supervised topic modeling (dimensionality reduction) and Bayes classification after reducing the effective dimensionality of the feature space to a much smaller value by clustering. The scheme works well because of the use of supervision in the topic modeling process, which ensures that the use of an intermediate clustering approach does not lose information for classification. We also note that in this model, the number of mixtures can be made to vary from the number of classes. While the work in [86] does not explore this direction, there is really no reason to assume that the number of mixture components is the same as the number of classes. Such an assumption can be particularly useful for data sets in which the classes may not be contiguous in the feature space, and a natural clustering may contain far more components than the number of classes. Next, we will discuss the second method [103] which uses unlabeled data. The approach is [103] uses the unlabeled data in order to improve the training model. Why should unlabeled data help in classification at all? In order to understand this point, recall that the Bayes classification process effectively uses k mixture components, which are assumed to be the k different classes. If we had an infinite amount of training data, it would be possible to create the mixture components, but it would not be possible to assign labels to these components. However, the most data-intensive part of modeling the mixture, is that of determining the shape of the mixture components. The actual assignment of mixture components to class labels can be achieved with a relatively small number of class labels. It has been shown in [24] that the accuracy of assigning components to classes increases exponentially with the number of labeled samples available. Therefore, the work in [103] designs an EM-approach [36] to simultaneously determine the relevant mixture model and its class assignment. It turns out that the EM-approach, as applied to this problem is quite simple to implement. It has been shown in [103] that the EMapproach is equivalent to the following iterative methodology. First, a naive Bayes classifier is constructed by estimating the model parameters from the labeled documents only. This is used in order to assign probabilistically-weighted class labels to the unlabeled documents. Then, the Bayes classifier is re-constructed, except that we also use the newly labeled documents in the estimation of the underlying model A Survey of Text Classification Algorithms 193 parameters. We again use this classifier to re-classify the (originally unlabeled) documents. The process is continually repeated till convergence is achieved. The ability to significantly improve the quality of text classification with a small amount of labeled data, and the use of clustering on a large amount of unlabeled data has been a recurring theme in the text mining literature. 
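A minimal sketch of this iterative methodology is shown below, with scikit-learn's MultinomialNB standing in for the naive Bayes component: the model is first fit on the labeled documents, the unlabeled documents then receive probabilistically-weighted class assignments, and the model is refit on the weighted union. The toy data, the dense conversion, and the fixed number of iterations are simplifying assumptions for illustration, not the exact procedure of [103].

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled = ["cheap loan offer", "meeting schedule today"]
y = np.array([0, 1])                               # toy labels: 0 = spam, 1 = work
unlabeled = ["cheap pills offer now", "budget meeting moved", "offer schedule"]

vec = CountVectorizer().fit(labeled + unlabeled)   # vocabulary from all available documents
X_lab = vec.transform(labeled).toarray()
X_unl = vec.transform(unlabeled).toarray()
n_classes = 2

clf = MultinomialNB().fit(X_lab, y)                # initial model from labeled data only
for _ in range(5):                                 # EM-style iterations (fixed count for simplicity)
    probs = clf.predict_proba(X_unl)               # E-step: probabilistic labels for unlabeled docs
    # M-step: refit on the labeled docs plus one weighted copy of each unlabeled doc per class.
    X_aug = np.vstack([X_lab] + [X_unl] * n_classes)
    y_aug = np.concatenate([y] + [np.full(len(unlabeled), c) for c in range(n_classes)])
    w_aug = np.concatenate([np.ones(len(y))] + [probs[:, c] for c in range(n_classes)])
    clf = MultinomialNB().fit(X_aug, y_aug, sample_weight=w_aug)

print(clf.predict(vec.transform(["cheap offer today"]).toarray()))
```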
For example, the method in [122] performs purely unsupervised clustering (with no knowledge of class labels), and then as a final step assigns all documents in the cluster to the dominant class label of that cluster (as an evaluation step for the unsupervised clustering process in terms of its ability to match clusters to known topics; in a supervised application, this last step would require only a small number of class labels in the cluster to be known in order to determine the dominant label very accurately). It has been shown that this approach is able to achieve a comparable accuracy of matching clusters to topics as a supervised Naive Bayes classifier trained over a small data set of about 1000 documents. Similar results were obtained in [47], where the quality of the unsupervised clustering process was shown to be comparable to that of an SVM classifier which was trained over a small data set.

6. Linear Classifiers

Linear Classifiers are those for which the output of the linear predictor is defined to be p = A · X + b, where X = (x1 . . . xn) is the normalized document word frequency vector, A = (a1 . . . an) is a vector of linear coefficients with the same dimensionality as the feature space, and b is a scalar. A natural interpretation of the predictor p = A · X + b in the discrete scenario (categorical class labels) would be as a separating hyperplane between the different classes. Support Vector Machines [30, 125] are a form of classifiers which attempt to determine "good" linear separators between the different classes. One characteristic of linear classifiers is that they are closely related to many feature transformation methods (such as the Fisher discriminant), which attempt to use these directions in order to transform the feature space, and then use other classifiers on this transformed feature space [51, 56, 99]. Thus, linear classifiers are intimately related to linear feature transformation methods as well.

Regression modeling (such as the least squares method) is a more direct and traditional statistical method for text classification. However, it is generally used in cases where the target variable to be learned is numerical rather than categorical. A number of methods have been proposed in the literature for adapting such methods to the case of text data classification [134]. A comparison of different linear regression techniques for classification, including SVM, may be found in [138]. Finally, simple neural networks are also a form of linear classifiers, since the function computed by a set of neurons is essentially linear. The simplest form of neural network, known as the perceptron (or single layer network), is essentially designed for linear separation, and works well for text. However, by using multiple layers of neurons, it is also possible to generalize the approach for non-linear separation. In this section, we will discuss the different linear methods for text classification.

6.1 SVM Classifiers

Support-vector machines were first proposed in [30, 124] for numerical data. The main principle of SVMs is to determine separators in the search space which can best separate the different classes. For example, consider the example illustrated in Figure 6.1, in which we have two classes denoted by 'x' and 'o' respectively. We have denoted three different separating hyperplanes, which are denoted by A, B, and C respectively.

Figure 6.1. What is the Best Separating Hyperplane?
It is evident that the hyperplane A provides the best separation between the different classes, because the normal distance of any of the data points from it is the largest. Therefore, the hyperplane A represents the maximum margin of separation. We note that the normal vector to this hyperplane (represented by the arrow in the figure) is a direction in the feature space along which we have the maximum discrimination. One advantage of the SVM method is that since it attempts to determine the optimum direction of discrimination in the feature space by examining the appropriate combination of features, it is quite robust to high dimen- A Survey of Text Classification Algorithms 195 sionality. It has been noted in [64] that text data is ideally suited for SVM classification because of the sparse high-dimensional nature of text, in which few features are irrelevant, but they tend to be correlated with one another and generally organized into linearly separable categories. We note that it is not necessary to use a linear function for the SVM classifier. Rather, with the kernel trick [6], SVM can construct a nonlinear decision surface in the original feature space by mapping the data instances non-linearly to an inner product space where the classes can be separated linearly with a hyperplane. However, in practice, linear SVM is used most often because of their simplicity and ease of interpretability. The first set of SVM classifiers, as adapted to the text domain were proposed in [64–66]. A deeper theoretical study of the SVM method has been provided in [67]. In particular, it has been shown why the SVM classifier is expected to work well under a wide variety of circumstances. This has also been demonstrated experimentally in a few different scenarios. For example, the work in [41] applied the method to email data for classifying it as spam or non-spam data. It was shown that the SVM method provides much more robust performance as compared to many other techniques such as boosting decision trees, the rule based RIPPER method, and the Rocchio method. The SVM method is flexible and can easily be combined with interactive user-feedback methods [107]. We note that the problem of finding the best separator is essentially an optimization problem, which can typically be reduced to a Quadratic Programming problem. For example, many of these methods use Newton’s method for iterative minimization of a convex function. This can sometimes be slow, especially for high dimensional domains such as text data. It has been shown [43] that by breaking a large Quadratic Programming problem (QP problem) to a set of smaller problems, an efficient solution can be derived for the task. The SVM approach has also been used successfully [44] in the context of a hierarchical organization of the classes, as often occurs in web data. In this approach, a different classifier is built at different positions of the hierarchy. The SVM classifier has also been shown to be useful in large scale scenarios in which a large amount of unlabeled data and a small amount of labeled data is available [120]. This is essentially a semi-supervised approach because of its use of unlabeled data in the classification process. This techniques is also quite scalable because of its use of a number of modified quasi-newton techniques, which tend to be efficient in practice. 196 MINING TEXT DATA 6.2 Regression-Based Classifiers Regression modeling is a method which is commonly used in order to learn the relationships between real-valued attributes. 
Typically, these methods are designed for real-valued attributes, as opposed to binary attributes. This is however not an impediment to their use in classification, because the binary value of a class may be treated as a rudimentary special case of a real value, and some regression methods such as logistic regression can also naturally model discrete response variables.

An early application of regression to text classification is the Linear Least Squares Fit (LLSF) method [134], which works as follows. Suppose the predicted class label is pi = A · Xi + b, and yi is known to be the true class label. Then, our aim is to learn the values of A and b, such that the linear least squares fit

\sum_{i=1}^{n} (p_i - y_i)^2

is minimized. In practice, the value of b is set to 0 for the learning process. Let P be a 1 × n vector of binary values indicating the binary class to which the corresponding document belongs. Thus, if X is the n × d term matrix, then we wish to determine the 1 × d vector of regression coefficients A for which ||A · X^T − P|| is minimized, where || · || represents the Frobenius norm. The problem can be easily generalized from the binary class scenario to the multi-class scenario with k classes, by using P as a k × n matrix of binary values. In this matrix, exactly one value in each column is 1, and the corresponding row identifier represents the class to which that instance belongs. Similarly, A is a k × d matrix in the multi-class scenario. The LLSF method has been compared to a variety of other methods [132, 134, 138], and has been shown to be very robust in practice.

A more natural way of modeling the classification problem with regression is the logistic regression classifier [102], which differs from the LLSF method in that the objective function to be optimized is the likelihood function. Specifically, instead of using pi = A · Xi + b directly to fit the true label yi, we assume that the probability of observing label yi is:

p(C = y_i \mid X_i) = \frac{\exp(A \cdot X_i + b)}{1 + \exp(A \cdot X_i + b)}

This gives us a conditional generative model for yi given Xi. Putting it in another way, we assume that the logit transformation of p(C = yi|Xi) can be modeled by a linear combination of the features of the instance Xi, i.e.,

\log \frac{p(C = y_i \mid X_i)}{1 - p(C = y_i \mid X_i)} = A \cdot X_i + b

Thus logistic regression is also a linear classifier, as the decision boundary is determined by a linear function of the features. In the case of binary classification, p(C = yi|Xi) can be used to determine the class label (e.g., using a threshold of 0.5). In the case of multi-class classification, we have p(C = yi|Xi) ∝ exp(A · Xi + b), and the class label with the highest value according to p(C = yi|Xi) would be assigned to Xi. Given a set of training data points {(X1, y1), . . . , (Xn, yn)}, the logistic regression classifier can be trained by choosing the parameters A to maximize the conditional likelihood \prod_{i=1}^{n} p(y_i \mid X_i).

In some cases, the domain knowledge may be of the form where some sets of words are more important than others for a classification problem. For example, in a classification application, we may know that certain domain words (Knowledge Words (KW)) may be more important to the classification of a particular target category than other words. In such cases, it has been shown [35] that it may be possible to encode such domain knowledge into the logistic regression model in the form of a prior on the model parameters, and to use Bayesian estimation of the model parameters.
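The sketch below makes the two formulations concrete on a toy term matrix: the LLSF coefficients are obtained by ordinary least squares (with b = 0, as above), and a logistic regression model is fit on the same features for comparison. The data and the use of NumPy/scikit-learn are illustrative assumptions, not the implementations compared in [134] or [138].

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap loan offer", "cheap pills offer", "meeting schedule", "budget meeting today"]
y = np.array([1, 1, 0, 0])                         # binary class indicators

X = CountVectorizer().fit_transform(docs).toarray().astype(float)

# LLSF: choose A to minimize sum_i (A . X_i - y_i)^2, i.e. ||X A - y||^2 with b = 0.
A, *_ = np.linalg.lstsq(X, y, rcond=None)
llsf_scores = X @ A                                # p_i = A . X_i; threshold (e.g. at 0.5) to classify

# Logistic regression: choose A, b to maximize the conditional likelihood of the labels.
logit = LogisticRegression().fit(X, y)

print(np.round(llsf_scores, 2), logit.predict(X))
```

A linear SVM (for example, scikit-learn's LinearSVC) could be dropped into the same pipeline, which is precisely the conceptual similarity discussed next.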
It is clear that the regression classifiers are extremely similar to the SVM model for classification. Indeed, since LLSF, Logistic Regression, and SVM are all linear classifiers, they are identical at a conceptual level; the main difference among them lies in the details of the optimization formulation and implementation. As in the case of SVM classifiers, training a regression classifier also requires an expensive optimization process. For example, fitting LLSF requires expensive matrix computations in the form of a singular value decomposition process.

6.3 Neural Network Classifiers

The basic unit in a neural network is a neuron or unit. Each unit receives a set of inputs, which are denoted by the vector Xi, which in this case correspond to the term frequencies in the ith document. Each neuron is also associated with a set of weights A, which are used in order to compute a function f(·) of its inputs. A typical function which is often used in the neural network is the linear function, as follows:

p_i = A \cdot X_i \quad (6.14)

Thus, for a vector Xi drawn from a lexicon of d words, the weight vector A should also contain d elements. Now consider a binary classification problem, in which all labels are drawn from {+1, −1}. We assume that the class label of Xi is denoted by yi. In that case, the sign of the predicted function pi yields the class label.

Figure 6.2. The sign of the projection onto the weight vector A yields the class label

In order to illustrate this point, let us consider a simple example in a 2-dimensional feature space, as illustrated in Figure 6.2. In this case, we have illustrated two different classes, and the plane corresponding to A · X = 0 is illustrated in the same figure. It is evident that the sign of the function A · Xi yields the class label. Thus, the goal of the approach is to learn the set of weights A with the use of the training data. The idea is that we start off with random weights and gradually update them when a mistake is made by applying the current function on the training example. The magnitude of the update is regulated by a learning rate μ. This forms the core idea of the perceptron algorithm, which is as follows:

Perceptron Algorithm
Inputs: Learning Rate μ; Training Data (Xi, yi) ∀i ∈ {1 . . . n}
Initialize the weight vector A to 0 or small random numbers
repeat
  Apply each training instance to the network and check whether the sign of A · Xi matches yi;
  if the sign of A · Xi does not match yi, then update the weights A based on the learning rate μ
until the weights in A converge

The weights in A are typically updated (increased or decreased) proportionally to μ · Xi, so as to reduce the error of the neuron. We further note that many different update rules have been proposed in the literature. For example, one may simply update each weight by μ, rather than by μ · Xi. This is particularly possible in domains such as text, in which all feature values take on small non-negative values of relatively similar magnitude.

Figure 6.3. Multi-Layered Neural Networks for Nonlinear Separation

A number of implementations of neural network methods for text data have been studied in [34, 90, 101, 117, 129]. A natural question arises, as to how a neural network may be used if all the classes may not be neatly separated from one another with a linear separator, as illustrated in Figure 6.2.
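Before turning to the multi-layer case, the perceptron loop outlined above can be written in a few lines. The toy data, the fixed epoch limit, and the specific update rule (adding or subtracting μ · Xi on a mistake) are illustrative choices among the variants mentioned in the text.

```python
import numpy as np

def perceptron_train(X, y, mu=0.1, epochs=100):
    """X: document-term matrix (one row per document); y: labels in {+1, -1}."""
    A = np.zeros(X.shape[1])                  # initialize the weight vector to 0
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if np.sign(A @ x_i) != y_i:       # misclassified: move A towards y_i * x_i
                A += mu * y_i * x_i
                mistakes += 1
        if mistakes == 0:                     # the weights have converged on the training data
            break
    return A

X = np.array([[2, 0, 1, 0], [1, 0, 2, 0], [0, 2, 0, 1], [0, 1, 0, 2]], dtype=float)
y = np.array([1, 1, -1, -1])
A = perceptron_train(X, y)
print(np.sign(X @ A))                         # the sign of the projection onto A yields the labels
```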
For example, in Figure 6.3, we have illustrated an example in which the classes may not be separated with the use of a single linear separator. The use of multiple layers of neurons can be used in order to induce such non-linear classification boundaries. The effect of such multiple layers is to induce multiple piece-wise linear boundaries, which can be used to approximate enclosed regions belonging to a particular class. In such a network, the outputs of the neurons in the earlier layers feed into the neurons in the later layers. The training process of such networks is more complex, as the errors need to be back-propagated over different layers. Some examples of such classifiers include those discussed in [75, 110, 126, 132]. However, the general observation [117, 129] for text has been that linear classifiers generally provide comparable results to non-linear data, and the improvements of non-linear classification methods are relatively small. This suggests that the additional complexity of building more involved non-linear models does not pay for itself in terms of significantly better classification. 6.4 Some Observations about Linear Classifiers While the different linear classifiers have been developed independently from one another in the research literature, they are surprisingly similar at a basic conceptual level. Interestingly, these different lines of work have also resulted in a number of similar conclusions in terms of 200 MINING TEXT DATA the effectiveness of the different classifiers. We note that the main difference between the different classifiers is in terms of the details of the objective function which is optimized, and the iterative approach used in order to determine the optimum direction of separation. For example, the SVM method uses a Quadratic Programming (QP) formulation, whereas the LLSF method uses a closed-form least-squares formulation. On the other hand, the perceptron method does not try to formulate a closed-form objective function, but works with a softer iterative hill climbing approach. This technique is essentially inherited from the iterative learning approach used by neural network algorithms. However, its goal remains quite similar to the other two methods. Thus, the differences between these methods are really at a detailed level, rather than a conceptual level, in spite of their very different research origins. Another general observation about these methods is that all of them can be implemented with non-linear versions of their classifiers. For example, it is possible to create non-linear decision surfaces with the SVM classifier, just as it is possible to create non-linear separation boundaries by using layered neurons in a neural network [132]. However, the general consensus has been that the linear versions of these methods work very well, and the additional complexity of non-linear classification does not tend to pay for itself, except for some special data sets. The reason for this is perhaps because text is a high dimensional domain with highly correlated features and small non-negative values on sparse features. For example, it is hard to easily create class structures such as that indicated in Figure 6.3 for a sparse domain such as text containing only small non-negative values on the features. On the other hand, the high dimensional nature of correlated text dimensions is especially suited to classifiers which can exploit the redundancies and relationships between the different features in separating out the different classes. 
Common text applications have generally resulted in class structures which are linearly separable over this high dimensional domain of data. This is one of the reasons that linear classifiers have shown an unprecedented success in text classification. 7. Proximity-based Classifiers Proximity-based classifiers essentially use distance-based measures in order to perform the classification. The main thesis is that documents which belong to the same class are likely to be close to one another based on similarity measures such as the dot product or the cosine metric [115]. In order to perform the classification for a given test instance, two possible methods can be used: A Survey of Text Classification Algorithms 201 We determine the k-nearest neighbors in the training data to the test instance. The majority (or most abundant) class from these k neighbors are reported as the class label. Some examples of such methods are discussed in [25, 54, 134]. The choice of k typically ranges between 20 and 40 in most of the afore-mentioned work, depending upon the size of the underlying corpus. We perform training data aggregation during pre-processing, in which clusters or groups of documents belonging to the same class are created. A representative meta-document is created from each group. The same k-nearest neighbor approach is applied as discussed above, except that it is applied to this new set of metadocuments (or generalized instances [76]) rather than to the original documents in the collection. A pre-processing phase of summarization is useful in improving the efficiency of the classifier, because it significantly reduces the number of distance computations. In some cases, it may also boost the accuracy of the technique, especially when the data set contains a large number of outliers. Some examples of such methods are discussed in [55, 76, 109]. A method for performing nearest neighbor classification in text data is the WHIRL method discussed in [25]. The WHIRL method is essentially a method for performing soft similarity joins on the basis of text attributes. By soft similarity joins, we refer to the fact that the two records may not be exactly the same on the joined attribute, but a notion of similarity used for this purpose. It has been observed in [25] that any method for performing a similarity-join can be adapted as a nearest neighbor classifier, by using the relevant text documents as the joined attributes. One observation in [134] about nearest neighbor classifiers was that feature selection and document representation play an important part in the effectiveness of the classification process. This is because most terms in large corpora may not be related to the category of interest. Therefore, a number of techniques were proposed in [134] in order to learn the associations between the words and the categories. These are then used to create a feature representation of the document, so that the nearest neighbor classifier is more sensitive to the classes in the document collection. A similar observation has been made in [54], in which it has been shown that the addition of weights to the terms (based on their class-sensitivity) significantly improves the underlying classifier performance. The nearest neighbor classifier has also been extended to the temporally-aware scenario [114], in which the timeliness of a training document plays a role in the model construction process. 
In order to incorporate such factors, a temporal weighting function has been introduced in [114], which allows the importance of a document to gracefully decay with time.

For the case of classifiers which use grouping techniques, the most basic among such methods is that proposed by Rocchio in [109]. In this method, a single representative meta-document is constructed from each of the representative classes. For a given class, the weight of the term t_k is the normalized frequency of the term t_k in documents belonging to that class, minus the normalized frequency of the term in documents which do not belong to that class. Specifically, let f_k^{p} be the expected weight of term t_k in a randomly picked document belonging to the positive class, and let f_k^{n} be the expected weight of term t_k in a randomly picked document belonging to the negative class. Then, for weighting parameters α_p and α_n, the weight f_k^{rocchio} is defined as follows:

f_k^{rocchio} = α_p · f_k^{p} − α_n · f_k^{n}    (6.15)

The weighting parameters α_p and α_n are picked so that the positive class has much greater weight than the negative class. For the relevant class, we now have a vector representation of the terms (f_1^{rocchio}, f_2^{rocchio}, ..., f_n^{rocchio}). This approach is applied separately to each of the classes, in order to create a separate meta-document for each class. For a given test document, the closest meta-document to the test document can be determined by using a vector-based dot product or other similarity metric. The corresponding class is then reported as the relevant label.

The main distinguishing characteristic of the Rocchio method is that it creates a single profile of the entire class. This class of methods is also referred to as the Rocchio framework. The main disadvantage of this method is that if a single class occurs in multiple disjoint clusters which are not very well connected in the data, then the centroid of these examples may not represent the class behavior very well. This is likely to be a source of inaccuracy for the classifier. The main advantage of this method is its extreme simplicity and efficiency; the training phase is linear in the corpus size, and the number of computations in the testing phase is linear in the number of classes, since all the documents have already been aggregated into a small number of meta-documents. An analysis of the Rocchio algorithm, along with a number of different variations, may be found in [64].

In order to handle the shortcomings of the Rocchio method, a number of classifiers have also been proposed [1, 14, 55, 76], which explicitly perform the clustering of each of the classes in the document collection. These clusters are used in order to generate class-specific profiles. These profiles are also referred to as generalized instances in [76]. For a given test instance, the label of the closest generalized instance is reported by the algorithm. The method in [14] is also a centroid-based classifier, but is specifically designed for the case of text documents. The work in [55] shows that there are some advantages in designing schemes in which the similarity computations take account of the dependencies between the terms of the different classes. We note that the nearest neighbor classifier can be used in order to generate a ranked list of categories for each document.
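Before turning to multi-label thresholding, the Rocchio profile of Equation (6.15) can be written down concretely. The following is a minimal sketch assuming TF-IDF document vectors; the particular values of α_p and α_n, and the use of per-class means as the expected term weights, are illustrative assumptions.

import numpy as np

def rocchio_profiles(X, y, alpha_p=16.0, alpha_n=4.0):
    # X: (documents x terms) matrix of term weights, y: class label of each document.
    # For each class, the profile is alpha_p * (mean weight inside the class)
    # minus alpha_n * (mean weight outside the class), as in Equation (6.15).
    profiles = {}
    for c in np.unique(y):
        f_pos = np.asarray(X[y == c].mean(axis=0)).ravel()
        f_neg = np.asarray(X[y != c].mean(axis=0)).ravel()
        profiles[c] = alpha_p * f_pos - alpha_n * f_neg
    return profiles

def rocchio_classify(x, profiles):
    # x: dense 1-D term-weight vector of the test document; report the class
    # whose meta-document has the largest dot product with x.
    return max(profiles, key=lambda c: float(np.dot(x, profiles[c])))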
In cases where a document is related to multiple categories, several of these categories can be reported for the document, as long as a thresholding method is available. The work in [136] studies a number of thresholding strategies for the k-nearest neighbor classifier. It has also been suggested in [136] that these strategies can be used to understand the thresholding behavior of other classifiers which produce ranked outputs.

8. Classification of Linked and Web Data

In recent years, the proliferation of the web and social network technologies has led to a tremendous amount of document data, which is expressed in the form of linked networks. The simplest example of this is the web, in which the documents are linked to one another with the use of hyperlinks. Social networks can also be considered a noisy example of such data, because the comments and text profiles of different users are connected to one another through a variety of links. Linkage information is quite relevant to the classification process, because documents on similar subjects are often linked together. This observation has been used widely in the collective classification literature [12], in which a subset of network nodes are labeled, and the remaining nodes are classified on the basis of the linkages among the nodes.

In general, a content-based network may be denoted by G = (N, A, C), where N is the set of nodes, A is the set of edges between the nodes, and C is a set of text documents. Each node in N corresponds to a text document in C, and it is possible for a document to be empty when the corresponding node does not contain any content. A subset of the nodes in N is labeled; this corresponds to the training data. The classification problem in this scenario is to determine the labels of the remaining nodes with the use of the training data. It is clear that both the content and the structure can play a useful and complementary role in the classification process.

An interesting method for combining linkage and content information for classification was discussed in [20]. In this paper, a hypertext categorization method was proposed, which uses the content and labels of neighboring web pages for the classification process. When the labels of all the nearest neighbors are available, then a Bayesian method can be adapted easily for classification purposes. Just as the presence of a word in a document can be considered a Bayesian feature for a text classifier, the presence of a link between the target page and a page for which the label is known can be considered a feature for the classifier. The real challenge arises when the labels of all the nearest neighbors are not available. In such cases, a relaxation labeling method was proposed in order to perform the classification. Two methods have been proposed in this work:

Fully Supervised Case of Radius-one Enhanced Linkage Analysis: In this case, it is assumed that all the neighboring class labels are known. A Bayesian approach is utilized in order to treat the labels on the nearest neighbors as features for classification purposes. In this case, the linkage information is the sole information which is used for classification purposes.

When the class labels of the nearest neighbors are not known: In this case, an iterative approach is used for combining text- and linkage-based classification.
Rather than using the predefined labels (which are not available), we perform a first labeling of the neighboring documents with the use of the document content. These labels are then used to classify the label of the target document, with the use of both the local text and the class labels of the neighbors. This approach is applied iteratively for re-defining the labels of both the target document and its neighbors until convergence is achieved.

The conclusion from the work in [20] is that a combination of text- and linkage-based classification always improves the accuracy of a text classifier. Even when none of the neighbors of the document have known classes, it always seemed beneficial to add link information to the classification process. When the class labels of all the neighbors are known, then the advantages of using the scheme are quite significant. An additional idea in the paper is the use of bridges in order to further improve the classification accuracy. The core idea in the use of a bridge is the use of 2-hop propagation for link-based classification. The results with the use of such an approach are somewhat mixed, as the accuracy seems to reduce with an increasing number of hops. The work in [20] shows results on a number of different kinds of data sets such as the Reuters database, the US patent database, and Yahoo!. Since the Reuters database contains the least amount of noise, pure text classifiers were able to do a good job on it. On the other hand, the US patent database and the Yahoo! database contain an increasing amount of noise, which reduces the accuracy of text classifiers. An interesting observation in [20] was that a scheme which simply absorbed the neighbor text into the current document performed significantly worse than a scheme which was based on pure text-based classification. This is because there are often significant cross-boundary linkages between topics, and such linkages are able to confuse the classifier. A publicly available implementation of this algorithm may be found in the NetKit tool kit available in [92].

Another relaxation labeling method for graph-based document classification is proposed in [4]. In this technique, the probability that the end points of a link take on a particular pair of class labels is quantified. We refer to this as the link-class pair probability. The posterior probability of classification of a node T into class i is expressed as the sum of the probabilities of pairing all possible class labels of the neighbors of T with class label i. We note that a significant percentage of these (exponentially many) possibilities are pruned, since only the currently most probable labelings are used in this approach (in the case of hard labeling, the single most likely labeling is used, whereas in the case of soft labeling, a small set of possibilities is used). For this purpose, it is assumed that the class labels of the different neighbors of T (while dependent on T) are independent of each other. This is similar to the naive independence assumption, which is often used in Bayes classifiers. Therefore, the probability for a particular combination of labels on the neighbors can be expressed as the product of the corresponding link-class pair probabilities. The approach starts off with the use of a standard content-based Bayes or SVM classifier in order to assign the initial labels to the nodes. Then, an iterative approach is used to refine the labels, by using the most probable label estimates from the previous iteration in order to refine the labels in the current iteration.
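Both [20] and [4] share this iterative structure: seed the labels with a content-only classifier, then repeatedly re-label each node using its own text together with the current labels of its neighbors. The following is a minimal sketch of that general recipe, not the exact procedure of either paper; the simple convex combination of the text score and the neighbor label distribution, and all parameter names, are illustrative assumptions.

import numpy as np

def iterative_link_text_labeling(text_probs, adj, labeled_mask, known_labels,
                                 n_iter=10, lam=0.5):
    # text_probs:   (n_nodes x n_classes) class probabilities from a text-only classifier
    # adj:          (n_nodes x n_nodes) 0/1 adjacency matrix of the document network
    # labeled_mask: boolean array marking the training nodes
    # known_labels: class indices of the training nodes (full-length integer array)
    n_nodes, n_classes = text_probs.shape
    labels = text_probs.argmax(axis=1)                  # seed with content-only predictions
    labels[labeled_mask] = known_labels[labeled_mask]
    for _ in range(n_iter):
        for i in np.where(~labeled_mask)[0]:
            neighbor_labels = labels[adj[i] > 0]
            counts = np.bincount(neighbor_labels, minlength=n_classes)
            neighbor_dist = counts / max(counts.sum(), 1)
            # combine the local text evidence with the neighbor label distribution
            labels[i] = np.argmax(lam * text_probs[i] + (1 - lam) * neighbor_dist)
    return labels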
We note that the link-class pair probabilities can be estimated as the smoothed fraction of edges in the last iteration which contain a particular pair of classes as their end points (hard labeling), or as the average product of node probabilities over all edges which take on that particular class pair (soft labeling). This approach is repeated to convergence.

Another method which uses a naive Bayes classifier to enhance link-based classification is proposed in [104]. This method incrementally assigns class labels, starting off with a temporary assignment and then gradually making the assignments permanent. The initial class assignment is based on a simple Bayes expression involving both the terms and the links in the document. In the final categorization, the method changes the term weights for Bayesian classification of the target document using the terms in the neighborhood of the current document. This method uses a broad framework which is similar to that in [20], except that it differentiates between the classes in the neighborhood of a document in terms of their influence on the class label of the current document. For example, documents for which the class label was either already available in the training data, or for which the algorithm has performed a final assignment, have a different confidence weighting factor than those documents for which the class label is currently only temporarily assigned. Similarly, documents which belong to a completely different subject (based on content) are also removed from consideration in the assignment. Then, the Bayesian classification is performed with the re-computed weights, so that the document can be assigned a final class label. By using this approach, the technique is able to compensate for the noise and inconsistencies in the link structures among different documents. One major difference between the work in [20] and [104] is that the former is focused on using link information in order to propagate the labels, whereas the latter attempts to use the content of the neighboring pages. Another work along this direction, which uses the content of the neighboring pages more explicitly, is proposed in [105]. In this case, the content of the neighboring pages is broken up into different fields such as titles, anchor text, and general text. The different fields are given different levels of importance, which are learned during the classification process. It was shown in [105] that the use of title fields and anchor fields is much more relevant than the general text. This accounts for much of the accuracy improvement demonstrated in [105].

The work in [2] proposes a method for dynamic classification in text networks with the use of a random-walk method. The key idea in the work is to transform the combination of structure and content in the network into a pure network containing only content. Thus, we transform the original network G = (N, A, C) into an augmented network G_A = (N ∪ N_c, A ∪ A_c), where N_c and A_c are an additional set of nodes and edges added to the original network. Each node in N_c corresponds to a distinct word in the lexicon. Thus, the augmented network contains the original structural nodes N, and a new set of word nodes N_c. The added edges in A_c are undirected edges between the structural nodes N and the word nodes N_c.
Specifically, an edge (i, j) is added to A_c if the word i ∈ N_c occurs in the text content corresponding to the node j ∈ N. Thus, this network is semi-bipartite, in that there are no edges between the different word nodes. An illustration of the semi-bipartite content-structure transformation is provided in Figure 6.4.

[Figure 6.4. The Semi-bipartite Transformation: structural nodes are connected to word nodes by dashed edges, which indicate word presence in the corresponding documents.]

It is important to note that once such a transformation has been performed, any of the collective classification methods [12] can be applied to the structural nodes. In the work in [2], a random-walk method has been used in order to perform the collective classification of the underlying nodes. In this method, repeated random walks are performed starting at the unlabeled nodes which need to be classified. The random walks are defined only on the structural nodes, and each hop may either be a structural hop or a content hop. We perform l different random walks, each of which contains h nodes. Thus, a total of l · h nodes are encountered in the different walks. The class label of the starting node is predicted to be the label with the highest frequency of presence among the l · h nodes encountered in the different walks. The error of this random walk-based sampling process has been bounded in [12]. In addition, the method in [12] can be adapted to dynamic content-based networks, in which the nodes, edges and their underlying content continuously evolve over time. The method in [2] has been compared to that proposed in [18] (based on the implementation in [92]), and it has been shown that the classification methods of [12] are significantly superior.

Another method for classification of linked text data is discussed in [139]. This method designs two separate regularization conditions; one is for the text-only classifier (also referred to as the local classifier), and the other is for the link information in the network structure. These regularizers are expressed in terms of the underlying kernels; the link regularizer is related to the standard graph regularizer used in the machine learning literature, and the text regularizer is expressed in terms of the kernel gram matrix. These two regularization conditions are combined in two possible ways. One can either use linear combinations of the regularizers, or linear combinations of the associated kernels. It was shown in [139] that both combination methods perform better than either pure structure-based or pure text-based methods. The method using a linear combination of regularizers was slightly more accurate and robust than the method which used a linear combination of the kernels.

A method in [32] designs a classifier which combines a naive Bayes classifier (on the text domain) and a rule-based classifier (on the structural domain). The idea is to invent a set of predicates, which are defined in the space of links, pages and words. A variety of predicates (or relations) are defined depending upon the presence of a word in a page, the linkages of pages to each other, the nature of the anchor text of a hyperlink, and the neighborhood words of a hyperlink. These essentially encode the graph structure of the documents in the form of boolean predicates, and can also be used to construct relational learners.
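Before turning to how such relational features are combined with a text classifier, the random-walk scheme of [2] described above can be summarized concretely. The following is a minimal sketch under assumed data structures, not the exact algorithm of [2]; in particular, the probability of choosing a structural hop over a content hop, and all parameter names, are illustrative assumptions.

import random
from collections import Counter

def random_walk_label(start, doc_links, doc_words, word_docs, labels,
                      l=100, h=10, p_structural=0.5):
    # doc_links[d]: list of documents linked to document d (structural edges)
    # doc_words[d]: list of words occurring in document d
    # word_docs[w]: list of documents containing word w
    # labels:       dict mapping labeled documents to their class (unlabeled ones absent)
    votes = Counter()
    for _ in range(l):                                    # l independent walks of h hops each
        node = start
        for _ in range(h):
            if doc_links.get(node) and random.random() < p_structural:
                node = random.choice(doc_links[node])     # structural hop
            elif doc_words.get(node):
                word = random.choice(doc_words[node])     # content hop:
                node = random.choice(word_docs[word])     # document -> word -> document
            if node in labels:
                votes[labels[node]] += 1                  # count labeled documents visited
    # the most frequently encountered label is reported for the starting node
    return votes.most_common(1)[0][0] if votes else None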
The main contribution of [32] is to combine the relational learners on the structural domain with the naive Bayes approach in the text domain. We refer the reader to [32, 33] for the details of the algorithm and the general philosophy of such relational learners.

One of the interesting methods for collective classification in the context of email networks was proposed in [23]. The technique in [23] is designed to classify speech acts in email. Speech acts essentially characterize whether an email refers to a particular kind of action (such as scheduling a meeting). It has been shown in [23] that the use of sequential thread-based information from the email is very useful for the classification process. An email system can be modeled as a network in several ways, one of which is to treat each email as a node, and the edges as the thread relationships between the different emails. In this sense, the work in [23] devises a network-based mining procedure which uses both the content and the structure of the email network. However, this work is rather specific to the case of email networks, and it is not clear whether the technique can be adapted (effectively) to more general networks.

A different line of solutions to such problems, which are defined on a heterogeneous feature space, is to use latent space methods in order to simultaneously homogenize the feature space and determine the latent factors in the underlying data. The resulting representation can be used in conjunction with any of the text classifiers which are designed for latent space representations. A method in [140] uses a matrix factorization approach in order to construct a latent space from the underlying data. Both supervised and unsupervised methods were proposed for constructing the latent space. It was then shown in [140] that this feature representation provides more accurate results when used in conjunction with an SVM classifier.

Finally, a method for web page classification is proposed in [119]. This method is designed for using intelligent agents in web page categorization. The overall approach relies on the design of two functions which correspond to scoring web pages and links respectively. An advice language is created, and a method is proposed for mapping advice to neural networks. It has been shown in [119] how this general-purpose system may be used in order to find home pages on the web.

9. Meta-Algorithms for Text Classification

Meta-algorithms play an important role in classification strategies because of their ability to enhance the accuracy of existing classification algorithms by combining them, or by making a general change to the different algorithms in order to achieve a specific goal. Typical examples of classifier meta-algorithms include bagging, stacking and boosting [42]. Some of these methods change the underlying distribution of the training data, others combine classifiers, and yet others change the algorithms in order to satisfy specific classification criteria. We will discuss these different classes of methods in this section.

9.1 Classifier Ensemble Learning

In this method, we use combinations of classifiers in conjunction with a voting mechanism in order to perform the classification. The idea is that since different classifiers are susceptible to different kinds of overtraining and errors, a combination classifier is likely to yield much more robust results.
This technique is also sometimes referred to as stacking or classifier committee construction. Ensemble learning has been used quite frequently in text categorization. Most methods simply use weighted combinations of classifier outputs (either in terms of scores or ranks) in order to provide the final classification result. For example, the work by Larkey and Croft [79] used weighted linear combinations of the classifier scores or ranks. The work by Hull [60] used linear combinations of probabilities for the same goal. A linear combination of the normalized scores was used for classification in [137]. The work in [87] used classifier selection techniques and voting in order to provide the final classification result. Some examples of such voting and selection techniques are as follows:

In a binary-class application, the class label which obtains the majority vote is reported as the final result.

For a given test instance, a specific classifier is selected, depending upon the performance of the classifiers on the training or validation instances which are closest to that test instance.

A weighted combination of the results from the different classifiers is used, where the weight is regulated by the performance of each classifier on validation instances which are most similar to the current test instance.

The last two methods above try to select the final classification in a smarter way by discriminating between the performances of the classifiers in different scenarios. The work in [77] used category-averaged features in order to construct a different classifier for each category. The major challenge in ensemble learning is to provide the appropriate combination of classifiers for a particular scenario. Clearly, this combination can vary significantly with the scenario and the data set. In order to achieve this goal, the work in [10] proposes a method for the probabilistic combination of text classifiers. The work introduces a number of variables known as reliability variables in order to regulate the importance of the different classifiers. These reliability variables are learned dynamically for each situation, so as to provide the best classification.

9.2 Data Centered Methods: Boosting and Bagging

While ensemble techniques focus on combining different classifiers, data-centered methods such as boosting and bagging typically focus on training the same classifier on different parts of the training data in order to create different models. For a given test instance, a combination of the results obtained from the use of these different models is reported. Another major difference between ensemble methods and boosting methods is that the training models in a boosting method are not constructed independently, but are constructed sequentially. Specifically, after i classifiers are constructed, the (i+1)-th classifier is constructed on those parts of the training data which the first i classifiers were unable to classify accurately. The results of these different classifiers are combined together carefully, where the weight of each classifier is typically a function of its error rate. The most well-known meta-algorithm for boosting is the AdaBoost algorithm [48]. Such boosting algorithms have been applied to a variety of scenarios such as decision tree learners, rule-based systems, and Bayesian classifiers [49, 61, 73, 100, 116, 118].
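As a concrete illustration of this data-centered recipe (not of any specific system cited above), standard boosting and bagging wrappers can be applied to vectorized text. The sketch below uses scikit-learn; the base learners, the number of rounds, and the toy corpus are illustrative assumptions, and the estimator parameter name assumes a recent scikit-learn version.

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

docs = ["cheap loans approved instantly", "win a free prize now",
        "meeting agenda for project review", "please review the attached report"]
y = [1, 1, 0, 0]
X = TfidfVectorizer().fit_transform(docs)

# Boosting: re-weights the training documents so that later rounds concentrate
# on the examples misclassified in earlier rounds.
boosted = AdaBoostClassifier(estimator=MultinomialNB(), n_estimators=25).fit(X, y)

# Bagging: trains each base model on a bootstrap sample of the documents and
# combines their votes.
bagged = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=25).fit(X, y)

print(boosted.predict(X), bagged.predict(X))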
We note that boosting is also a kind of ensemble learning methodology, except that we train the same model on different subsets of the data in order to create the ensemble. One major criticism of boosting is that in many data sets, some of the training records are noisy, and a classification model should be resistant to overtraining on such data. Since the boosting model tends to weight the error-prone examples more heavily in successive rounds, this can make the classification process more prone to overfitting. This is particularly noticeable in the case of noisy data sets. Some recent results have suggested that all convex boosting algorithms may perform poorly in the presence of noise [91]. These results tend to suggest that the choice of boosting algorithm may be critical for a successful outcome, depending upon the underlying data set.

Bagging methods [16] are generally designed to reduce the model overfitting error which arises during the learning process. The idea in bagging is to pick bootstrap samples (samples with replacement) from the underlying collection, and train a classifier on each of these samples. The classification results from these different samples are then combined together in order to yield the final result. Bagging methods are generally used in conjunction with decision trees, though these methods can be used in principle with any kind of classifier. The main criticism of the bagging method is that it can sometimes lead to a reduction in accuracy because of the smaller size of each individual training sample. Bagging is useful only if the model is unstable with respect to small variations in the training data, because it works by reducing the resulting overfitting error. An example of such an algorithm would be the decision tree model, which is highly sensitive to how the higher levels of the tree are constructed in a high-dimensional feature space such as text. Bagging methods have not been used frequently in text classification.

9.3 Optimizing Specific Measures of Accuracy

We note that the absolute classification accuracy is not the only measure which is relevant to classification algorithms. For example, in skewed-class scenarios, which often arise in the context of applications such as fraud detection and spam filtering, it is more costly to misclassify examples of one class than of the other. For example, while it may be tolerable to misclassify a few spam emails (thereby allowing them into the inbox), it is much more undesirable to incorrectly mark a legitimate email as spam. Cost-sensitive classification problems also naturally arise in cases in which one class is rarer than the other, and it is therefore more desirable to identify the rare examples. In such cases, it is desirable to optimize the cost-weighted accuracy of the classification process. We note that many of the broad techniques which have been designed for non-textual data [40, 42, 45] are also applicable to text data, because the specific feature representation is not material to how standard algorithms are modified to the cost-sensitive case. A good understanding of cost-sensitive classification, both for the textual and the non-textual case, may be found in [3, 40, 45]. Some examples of how classification algorithms may be modified in straightforward ways to incorporate cost-sensitivity are as follows:

In a decision tree, the split condition at a given node tries to maximize the accuracy of its children nodes.
In the cost-sensitive case, the split is engineered to maximize the cost-sensitive accuracy.

In rule-based classifiers, the rules are typically quantified and ordered by measures corresponding to their predictive accuracy. In the cost-sensitive case, the rules are quantified and ordered by their cost-weighted accuracy.

In Bayesian classifiers, the posterior probabilities are weighted by the cost of the class for which the prediction is made.

In linear classifiers, the optimum hyperplane separating the classes is determined in a cost-weighted sense. Such costs can typically be incorporated in the underlying objective function. For example, the least-squares error in the objective function of the LLSF method can be weighted by the underlying costs of the different classes.

In a k-nearest neighbor classifier, we report the cost-weighted majority class among the k nearest neighbors of the test instance.

We note that the use of a cost-sensitive approach is essentially a change of the objective function of classification, which can also be formulated as an optimization problem. While the standard classification problem generally tries to optimize accuracy, the cost-sensitive version tries to optimize a cost-weighted objective function. A more general approach was proposed in [50], in which a meta-algorithm was designed for optimizing a specific figure of merit such as accuracy, precision, recall, or the F1-measure. Thus, this approach generalizes this class of methods to any arbitrary objective function, making it essentially an objective-centered classification method. A generalized probabilistic descent algorithm (with the desired objective function) is used in conjunction with the classifier of interest in order to derive the class labels of the test instance. The work in [50] shows the advantages of using this technique over a standard SVM-based classifier.

10. Conclusions and Summary

The classification problem is one of the most fundamental problems in the machine learning and data mining literature. In the context of text data, the problem can also be considered similar to that of classification of discrete set-valued attributes, when the frequencies of the words are ignored. The domains of these sets are rather large, as they comprise the entire lexicon. Therefore, text mining techniques need to be designed to effectively manage large numbers of elements with varying frequencies. Almost all the known techniques for classification, such as decision trees, rules, Bayes methods, nearest neighbor classifiers, SVM classifiers, and neural networks, have been extended to the case of text data. Recently, a considerable amount of emphasis has been placed on linear classifiers such as neural networks and SVM classifiers, with the latter being particularly suited to the characteristics of text data. In recent years, the advancement of web and social network technologies has led to a tremendous interest in the classification of text documents containing links or other meta-information. Recent research has shown that the incorporation of linkage information into the classification process can significantly improve the quality of the underlying results.

References

[1] C. C. Aggarwal, S. C. Gates, P. S. Yu. On Using Partial Supervision for Text Categorization, IEEE Transactions on Knowledge and Data Engineering, 16(2), 245–255, 2004.

[2] C. C. Aggarwal, N. Li. On Node Classification in Dynamic Content-based Networks, SDM Conference, 2011.

[3] I.
Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, C. Spyropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. Workshop on Machine Learning in the New Information Age, in conjunction with ECML Conference, 2000. http://arxiv.org/PS_cache/cs/pdf/0006/0006013v1.pdf [4] R. Angelova, G. Weikum. Graph-based text classification: learn from your neighbors. ACM SIGIR Conference, 2006. [5] C. Apte, F. Damerau, S. Weiss. Automated Learning of Decision Rules for Text Categorization, ACM Transactions on Information Systems, 12(3), pp. 233–251, 1994. 214 MINING TEXT DATA [6] M. Aizerman, E. Braverman, L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning, Automation and Remote Control, 25: pp. 821–837, 1964. [7] L. Baker, A. McCallum. Distributional Clustering of Words for Text Classification, ACM SIGIR Conference, 1998. [8] R. Bekkerman, R. El-Yaniv, Y. Winter, N. Tishby. On Feature Distributional Clustering for Text Categorization. ACM SIGIR Conference, 2001. [9] S. Basu, A. Banerjee, R. J. Mooney. Semi-supervised Clustering by Seeding. ICML Conference, 2002. [10] P. Bennett, S. Dumais, E. Horvitz. Probabilistic Combination of Text Classifiers using Reliability Indicators: Models and Results. ACM SIGIR Conference, 2002. [11] P. Bennett, N. Nguyen. Refined experts: improving classification in large taxonomies. ACM SIGIR Conference, 2009. [12] S. Bhagat, G. Cormode, S. Muthukrishnan. Node Classification in Social Networks, Book Chapter in Social Network Data Analytics, Ed. Charu Aggarwal, Springer, 2011. [13] A. Blum, T. Mitchell. Combining labeled and unlabeled data with co-training. COLT, 1998. [14] D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore. Partitioning-based clustering for web document categorization. Decision Support Systems, Vol. 27, pp. 329–341, 1999. [15] L. Brieman, J. Friedman, R. Olshen, C. Stone. Classification and Regression Trees, Wadsworth Advanced Books and Software, CA, 1984. [16] L. Breiman. Bagging Predictors. Machine Learning, 24(2), pp. 123– 140, 1996. [17] L. Cai, T. Hofmann. Text categorization by boosting automatically extracted concepts. ACM SIGIR Conference, 2003. [18] S. Chakrabarti, S. Roy, M. Soundalgekar. Fast and Accurate Text Classification via Multiple Linear Discriminant Projections, VLDB Journal, 12(2), pp. 172–185, 2003. [19] S. Chakrabarti, B. Dom. R. Agrawal, P. Raghavan. Using taxonomy, discriminants and signatures for navigating in text databases, VLDB Conference, 1997. [20] S. Chakrabarti, B. Dom, P. Indyk. Enhanced hypertext categorization using hyperlinks. ACM SIGMOD Conference, 1998. A Survey of Text Classification Algorithms 215 [21] S. Chakraborti, R. Mukras, R. Lothian, N. Wiratunga, S. Watt, D. Harper. Supervised Latent Semantic Indexing using Adaptive Sprinkling, IJCAI, 2007. [22] D. Chickering, D. Heckerman, C. Meek. A Bayesian approach for learning Bayesian networks with local structure. Thirteenth Conference on Uncertainty in Artificial Intelligence, 1997. [23] V. R. de Carvalho, W. Cohen. On the collective classification of email ”speech acts”, ACM SIGIR Conference, 2005. [24] V. Castelli, T. M. Cover. On the exponential value of labeled samples. Pattern Recognition Letters, 16(1), pp. 105–111, 1995. [25] W. Cohen, H. Hirsh. Joins that generalize: text classification using Whirl. ACM KDD Conference, 1998. [26] W. Cohen, Y. Singer. Context-sensitive learning methods for text categorization. 
ACM Transactions on Information Systems, 17(2), pp. 141–173, 1999. [27] W. Cohen. Learning rules that classify e-mail. AAAI Conference, 1996. [28] W. Cohen. Learning with set-valued features. AAAI Conference, 1996. [29] W. Cooper. Some inconsistencies and misnomers in probabilistic information retrieval. ACM Transactions on Information Systems, 13(1), pp. 100–111, 1995. [30] C. Cortes, V. Vapnik. Support-vector networks. Machine Learning, 20: pp. 273–297, 1995. [31] T. M. Cover, J. A. Thomas. Elements of information theory. New York: John Wiley and Sons, 1991. [32] M. Craven, S. Slattery. Relational learning with statistical predicate invention: Better models for hypertext. Machine Learning, 43: pp. 97–119, 2001. [33] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the Worldwide Web. AAAI Conference, 1998. [34] I. Dagan, Y. Karov, D. Roth. Mistake-driven Learning in Text Categorization, Proceedings of EMNLP, 1997. [35] A. Dayanik, D. Lewis, D. Madigan, V. Menkov, A. Genkin. Constructing informative prior distributions from domain knowledge in text classification. ACM SIGIR Conference, 2006. [36] A. P. Dempster, N.M. Laird, D.B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B, 39(1): pp. 1–38, 1977. 216 MINING TEXT DATA [37] F. Denis, A. Laurent. Text Classification and Co-Training from Positive and Unlabeled Examples, ICML 2003 Workshop: The Continuum from Labeled to Unlabeled Data. http://www.grappa. univ-lille3.fr/ftp/reports/icmlws03.pdf. [38] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, R. Harshman. Indexing by Latent Semantic Analysis. JASIS, 41(6), pp. 391–407, 1990. [39] P. Domingos, M. J. Pazzani. On the the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2–3), pp. 103–130, 1997. [40] P. Domingos. MetaCost: A General Method for making Classifiers Cost-Sensitive. ACM KDD Conference, 1999. [41] H. Drucker, D. Wu, V. Vapnik. Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, 10(5), pp. 1048–1054, 1999. [42] R. Duda, P. Hart, W. Stork. Pattern Classification, Wiley Interscience, 2000. [43] S. Dumais, J. Platt, D. Heckerman, M. Sahami. Inductive learning algorithms and representations for text categorization. CIKM Conference, 1998. [44] S. Dumais, H. Chen. Hierarchical Classification of Web Content. ACM SIGIR Conference, 2000. [45] C. Elkan. The foundations of cost-sensitive learning, IJCAI Conference, 2001. [46] R. Fisher. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, pp. 179–188, 1936. [47] R. El-Yaniv, O. Souroujon. Iterative Double Clustering for Unsupervised and Semi-supervised Learning. NIPS Conference, 2002. [48] Y. Freund, R. Schapire. A decision-theoretic generalization of online learning and an application to boosting. In Proc. Second European Conference on Computational Learning Theory, pp. 23–37, 1995. [49] Y. Freund, R. Schapire, Y. Singer, M. Warmuth. Using and combining predictors that specialize. Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pp. 334–343, 1997. [50] S. Gao, W. Wu, C.-H. Lee, T.-S. Chua. A maximal figure-of-merit learning approach to text categorization. SIGIR Conference, 2003. [51] R. Gilad-Bachrach, A. Navot, N. Tishby. Margin based feature selection – theory and algorithms. ICML Conference, 2004. A Survey of Text Classification Algorithms 217 [52] S. Gopal, Y. Yang. 
Multilabel classification with meta-level features. ACM SIGIR Conference, 2010. [53] L. Guthrie, E. Walker. Document Classification by Machine: Theory and Practice. COLING, 1994. [54] E.-H. Han, G. Karypis, V. Kumar. Text Categorization using Weighted-Adjusted k-nearest neighbor classification, PAKDD Conference, 2001. [55] E.-H. Han, G. Karypis. Centroid-based Document Classification: Analysis and Experimental Results, PKDD Conference, 2000. [56] D. Hardin, I. Tsamardinos, C. Aliferis. A theoretical characterization of linear SVM-based feature selection. ICML Conference, 2004. [57] T. Hofmann. Probabilistic latent semantic indexing. ACM SIGIR Conference, 1999. [58] P. Howland, M. Jeon, H. Park. Structure Preserving Dimension Reduction for Clustered Text Data based on the Generalized Singular Value Decomposition. SIAM Journal of Matrix Analysis and Applications, 25(1): pp. 165–179, 2003. [59] P. Howland, H. Park. Generalizing discriminant analysis using the generalized singular value decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8), pp. 995–1006, 2004. [60] D. Hull, J. Pedersen, H. Schutze. Method combination for document filtering. ACM SIGIR Conference, 1996. [61] R. Iyer, D. Lewis, R. Schapire, Y. Singer, A. Singhal. Boosting for document routing. CIKM Conference, 2000. [62] M. James. Classification Algorithms, Wiley Interscience, 1985. [63] D. Jensen, J. Neville, B. Gallagher. Why collective inference improves relational classification. ACM KDD Conference, 2004. [64] T. Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. ICML Conference, 1997. [65] T. Joachims. Text categorization with support vector machines: learning with many relevant features. ECML Conference, 1998. [66] T. Joachims. Transductive inference for text classification using support vector machines. ICML Conference, 1999. [67] T. Joachims. A Statistical Learning Model of Text Classification for Support Vector Machines. ACM SIGIR Conference, 2001. [68] D. Johnson, F. Oles, T. Zhang, T. Goetz. A Decision Tree-based Symbolic Rule Induction System for Text Categorization, IBM Systems Journal, 41(3), pp. 428–437, 2002. 218 MINING TEXT DATA [69] I. T. Jolliffee. Principal Component Analysis. Springer, 2002. [70] T. Kalt, W. B. Croft. A new probabilistic model of text classification and retrieval. Technical Report IR-78, University of Massachusetts Center for Intelligent Information Retrieval, 1996. http://ciir. cs.umass.edu/publications/index.shtml [71] G. Karypis, E.-H. Han. Fast Supervised Dimensionality Reduction with Applications to Document Categorization and Retrieval, ACM CIKM Conference, 2000. [72] T. Kawatani. Topic difference factor extraction between two document sets and its application to text categorization. ACM SIGIR Conference, 2002. [73] Y.-H. Kim, S.-Y. Hahn, B.-T. Zhang. Text filtering by boosting naive Bayes classifiers. ACM SIGIR Conference, 2000. [74] D. Koller, M. Sahami. Hierarchically classifying documents with very few words, ICML Conference, 2007. [75] S. Lam, D. Lee. Feature reduction for neural network based text categorization. DASFAA Conference, 1999. [76] W. Lam, C. Y. Ho. Using a generalized instance set for automatic text categorization. ACM SIGIR Conference, 1998. [77] W. Lam, K.-Y. Lai. A meta-learning approach for text categorization. ACM SIGIR Conference, 2001. [78] K. Lang. Newsweeder: Learning to filter netnews. ICML Conference, 1995. [79] L. S. Larkey, W. B. Croft. Combining Classifiers in text categorization. 
ACM SIGIR Conference, 1996. [80] D. Lewis, J. Catlett. Heterogeneous uncertainty sampling for supervised learning. ICML Conference, 1994. [81] D. Lewis, M. Ringuette. A comparison of two learning algorithms for text categorization. SDAIR, 1994. [82] D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. ECML Conference, 1998. [83] D. Lewis. An Evaluation of Phrasal and Clustered Representations for the Text Categorization Task, ACM SIGIR Conference, 1992. [84] D. Lewis, W. Gale. A sequential algorithm for training text classifiers, SIGIR Conference, 1994. [85] D. Lewis, K. Knowles. Threading electronic mail: A preliminary study. Information Processing and Management, 33(2), pp. 209– 217, 1997. A Survey of Text Classification Algorithms 219 [86] H. Li, K. Yamanishi. Document classification using a finite mixture model. Annual Meeting of the Association for Computational Linguistics, 1997. [87] Y. Li, A. Jain. Classification of text documents. The Computer Journal, 41(8), pp. 537–546, 1998. [88] B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule Mining. ACM KDD Conference, 1998. [89] B. Liu, L. Zhang. A Survey of Opinion Mining and Sentiment Analysis. Book Chapter in Mining Text Data, Ed. C. Aggarwal, C. Zhai, Springer, 2011. [90] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2: pp. 285– 318, 1988. [91] P. Long, R. Servedio. Random Classification Noise defeats all Convex Potential Boosters. ICML Conference, 2008. [92] S. A. Macskassy, F. Provost. Classification in Networked Data: A Toolkit and a Univariate Case Study, Journal of Machine Learning Research, Vol. 8, pp. 935–983, 2007. [93] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu. edu/~mccallum/bow, 1996. [94] A. McCallum, K. Nigam. A Comparison of Event Models for Naive Bayes Text Classification. AAAI Workshop on Learning for Text Categorization, 1998. [95] A. McCallum, R. Rosenfeld, T. Mitchell, A. Ng. Improving text classification by shrinkage in a hierarchy of classes. ICML Conference, 1998. [96] McCallum, Andrew Kachites. ”MALLET: A Machine Learning for Language Toolkit.” http://mallet.cs.umass.edu. 2002. [97] T. M. Mitchell. Machine Learning. WCB/McGraw-Hill, 1997. [98] T. M. Mitchell. The role of unlabeled data in supervised learning. Proceedings of the Sixth International Colloquium on Cognitive Science, 1999. [99] D. Mladenic, J. Brank, M. Grobelnik, N. Milic-Frayling. Feature selection using linear classifier weights: interaction with classification models. ACM SIGIR Conference, 2004. [100] K. Myers, M. Kearns, S. Singh, M. Walker. A boosting approach to topic spotting on subdialogues. ICML Conference, 2000. 220 MINING TEXT DATA [101] H. T. Ng, W. Goh, K. Low. Feature selection, perceptron learning, and a usability case study for text categorization. ACM SIGIR Conference, 1997. [102] A. Y. Ng, M. I. Jordan. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. NIPS. pp. 841- 848, 2001. [103] K. Nigam, A. McCallum, S. Thrun, T. Mitchell. Learning to classify text from labeled and unlabeled documents. AAAI Conference, 1998. [104] H.-J. Oh, S.-H. Myaeng, M.-H. Lee. A practical hypertext categorization method using links and incrementally available class information. ACM SIGIR Conference, 2000. [105] X. Qi, B. Davison. 
Classifiers without borders: incorporating fielded text from neighboring web pages. ACM SIGIR Conference, 2008. [106] J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1(1), pp 81–106, 1986. [107] H. Raghavan, J. Allan. An interactive algorithm for asking and incorporating feature feedback into support vector machines. ACM SIGIR Conference, 2007. [108] S. E. Robertson, K. Sparck-Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27: pp. 129–146, 1976. [109] J. Rocchio. Relevance feedback information retrieval. The Smart Retrieval System- Experiments in Automatic Document Processing, G. Salton, Ed. Prentice Hall, Englewood Cliffs, NJ, pp 313–323, 1971. [110] M. Ruiz, P. Srinivasan. Hierarchical neural networks for text categorization. ACM SIGIR Conference, 1999. [111] F. Sebastiani. Machine Learning in Automated Text Categorization, ACM Computing Surveys, 34(1), 2002. [112] M. Sahami. Learning limited dependence Bayesian classifiers, ACM KDD Conference, 1996. [113] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz. A Bayesian approach to filtering junk e-mail. AAAI Workshop on Learning for Text Categorization. Tech. Rep. WS-98-05, AAAI Press. http:// robotics.stanford.edu/users/sahami/papers.html [114] T. Salles, L. Rocha, G. Pappa, G. Mourao, W. Meira Jr., M. Goncalves. Temporally-aware algorithms for document classification. ACM SIGIR Conference, 2010. A Survey of Text Classification Algorithms 221 [115] G. Salton. An Introduction to Modern Information Retrieval, Mc Graw Hill, 1983. [116] R. Schapire, Y. Singer. BOOSTEXTER: A Boosting-based System for Text Categorization, Machine Learning, 39(2/3), pp. 135–168, 2000. [117] H. Schutze, D. Hull, J. Pedersen. A comparison of classifiers and document representations for the routing problem. ACM SIGIR Conference, 1995. [118] R. Shapire, Y. Singer, A. Singhal. Boosting and Rocchio applied to text filtering. ACM SIGIR Conference, 1998. [119] J. Shavlik, T. Eliassi-Rad. Intelligent agents for web-based tasks: An advice-taking approach. AAAI-98 Workshop on Learning for Text Categorization. Tech. Rep. WS-98-05, AAAI Press, 1998. http://www.cs.wisc.edu/~shavlik/mlrg/publications.html [120] V. Sindhwani, S. S. Keerthi. Large scale semi-supervised linear SVMs. ACM SIGIR Conference, 2006. [121] N. Slonim, N. Tishby. The power of word clusters for text classification. European Colloquium on Information Retrieval Research (ECIR), 2001. [122] N. Slonim, N. Friedman, N. Tishby. Unsupervised document classification using sequential information maximization. ACM SIGIR Conference, 2002. [123] J.-T. Sun, Z. Chen, H.-J. Zeng, Y. Lu, C.-Y. Shi, W.-Y. Ma. Supervised Latent Semantic Indexing for Document Categorization. ICDM Conference, 2004. [124] V. Vapnik. Estimations of dependencies based on statistical data, Springer, 1982. [125] V. Vapnik. The Nature of Statistical Learning Theory, Springer, New York, 1995. [126] A. Weigand, E. Weiner, J. Pedersen. Exploiting hierarchy in text catagorization. Information Retrieval, 1(3), pp. 193–216, 1999. [127] S, M. Weiss, C. Apte, F. Damerau, D. Johnson, F. Oles, T. Goetz, T. Hampp. Maximizing text-mining performance. IEEE Intelligent Systems, 14(4), pp. 63–69, 1999. [128] S. M. Weiss, N. Indurkhya. Optimized Rule Induction, IEEE Exp., 8(6), pp. 61–69, 1993. [129] E. Wiener, J. O. Pedersen, A. S. Weigend. A Neural Network Approach to Topic Spotting. SDAIR, pp. 317–332, 1995. 222 MINING TEXT DATA [130] G.-R. Xue, D. Xing, Q. Yang, Y. Yu. 
Deep classification in largescale text hierarchies. ACM SIGIR Conference, 2008. [131] J. Yan, N. Liu, B. Zhang, S. Yan, Z. Chen, Q. Cheng, W. Fan, W.-Y. Ma. OCFS: optimal orthogonal centroid feature selection for text categorization. ACM SIGIR Conference, 2005. [132] Y. Yang, L. Liu. A re-examination of text categorization methods, ACM SIGIR Conference, 1999. [133] Y. Yang, J. O. Pederson. A comparative study on feature selection in text categorization, ACM SIGIR Conference, 1995. [134] Y. Yang, C.G. Chute. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3), 1994. [135] Y. Yang. Noise Reduction in a Statistical Approach to Text Categorization, ACM SIGIR Conference, 1995. [136] Y. Yang. A Study on Thresholding Strategies for Text Categorization. ACM SIGIR Conference, 2001. [137] Y. Yang, T. Ault, T. Pierce. Combining multiple learning strategies for effective cross-validation. ICML Conference, 2000. [138] J. Zhang, Y. Yang. Robustness of regularized linear classification methods in text categorization. ACM SIGIR Conference, 2003. [139] T. Zhang, A. Popescul, B. Dom. Linear prediction models with graph regularization for web-page categorization, ACM KDD Conference, 2006. [140] S. Zhu, K. Yu, Y. Chi, Y. Gong. Combining content and link for classification using matrix factorization. ACM SIGIR Conference, 2007.