MINING TEXT DATA

Edited by

CHARU C. AGGARWAL
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

CHENGXIANG ZHAI
University of Illinois at Urbana-Champaign, Urbana, IL, USA

Kluwer Academic Publishers, Boston/Dordrecht/London

Contents

1 An Introduction to Text Mining (Charu C. Aggarwal and ChengXiang Zhai) 1
  1. Introduction 1; 2. Algorithms for Text Mining 4; 3. Future Directions 8; References 10

2 Information Extraction from Text (Jing Jiang) 11
  1. Introduction 11; 2. Named Entity Recognition 15; 2.1 Rule-based Approach 16; 2.2 Statistical Learning Approach 17; 3. Relation Extraction 22; 3.1 Feature-based Classification 23; 3.2 Kernel Methods 26; 3.3 Weakly Supervised Learning Methods 29; 4. Unsupervised Information Extraction 30; 4.1 Relation Discovery and Template Induction 31; 4.2 Open Information Extraction 32; 5. Evaluation 33; 6. Conclusions and Summary 34; References 35

3 A Survey of Text Summarization Techniques (Ani Nenkova and Kathleen McKeown) 43
  1. How do Extractive Summarizers Work? 44; 2. Topic Representation Approaches 46; 2.1 Topic Words 46; 2.2 Frequency-driven Approaches 48; 2.3 Latent Semantic Analysis 52; 2.4 Bayesian Topic Models 53; 2.5 Sentence Clustering and Domain-dependent Topics 55; 3. Influence of Context 56; 3.1 Web Summarization 57; 3.2 Summarization of Scientific Articles 58; 3.3 Query-focused Summarization 58; 3.4 Email Summarization 59; 4. Indicator Representations and Machine Learning for Summarization 60; 4.1 Graph Methods for Sentence Importance 60; 4.2 Machine Learning for Summarization 62; 5. Selecting Summary Sentences 64; 5.1 Greedy Approaches: Maximal Marginal Relevance 64; 5.2 Global Summary Selection 65; 6. Conclusion 66; References 66

4 A Survey of Text Clustering Algorithms (Charu C. Aggarwal and ChengXiang Zhai) 77
  1. Introduction 77; 2. Feature Selection and Transformation Methods for Text Clustering 81; 2.1 Feature Selection Methods 81; 2.2 LSI-based Methods 84; 2.3 Non-negative Matrix Factorization 86; 3. Distance-based Clustering Algorithms 89; 3.1 Agglomerative and Hierarchical Clustering Algorithms 90; 3.2 Distance-based Partitioning Algorithms 92; 3.3 A Hybrid Approach: The Scatter-Gather Method 94; 4. Word and Phrase-based Clustering 99; 4.1 Clustering with Frequent Word Patterns 100; 4.2 Leveraging Word Clusters for Document Clusters 102; 4.3 Co-clustering Words and Documents 103; 4.4 Clustering with Frequent Phrases 105; 5. Probabilistic Document Clustering and Topic Models 107; 6. Online Clustering with Text Streams 110; 7. Clustering Text in Networks 115; 8. Semi-Supervised Clustering 118; 9. Conclusions and Summary 120; References 121

5 Dimensionality Reduction and Topic Modeling (Steven P. Crain, Ke Zhou, Shuang-Hong Yang and Hongyuan Zha) 129
  1. Introduction 130; 1.1 The Relationship Between Clustering, Dimension Reduction and Topic Modeling 131; 1.2 Notation and Concepts 132; 2. Latent Semantic Indexing 133; 2.1 The Procedure of Latent Semantic Indexing 134; 2.2 Implementation Issues 135; 2.3 Analysis 137; 3. Topic Models and Dimension Reduction 139; 3.1 Probabilistic Latent Semantic Indexing 140; 3.2 Latent Dirichlet Allocation 142; 4. Interpretation and Evaluation 148; 4.1 Interpretation 148; 4.2 Evaluation 149; 4.3 Parameter Selection 150; 4.4 Dimension Reduction 150; 5. Beyond Latent Dirichlet Allocation 151; 5.1 Scalability 151; 5.2 Dynamic Data 151; 5.3 Networked Data 152; 5.4 Adapting Topic Models to Applications 154; 6. Conclusion 155; References 156

6 A Survey of Text Classification Algorithms (Charu C. Aggarwal and ChengXiang Zhai) 163
  1. Introduction 163; 2. Feature Selection for Text Classification 167; 2.1 Gini Index 168; 2.2 Information Gain 169; 2.3 Mutual Information 169; 2.4 χ2-Statistic 170; 2.5 Feature Transformation Methods: Supervised LSI 171; 2.6 Supervised Clustering for Dimensionality Reduction 172; 2.7 Linear Discriminant Analysis 173; 2.8 Generalized Singular Value Decomposition 175; 2.9 Interaction of Feature Selection with Classification 175; 3. Decision Tree Classifiers 176; 4. Rule-based Classifiers 178; 5. Probabilistic and Naive Bayes Classifiers 181; 5.1 Bernoulli Multivariate Model 183; 5.2 Multinomial Distribution 188; 5.3 Mixture Modeling for Text Classification 190; 6. Linear Classifiers 193; 6.1 SVM Classifiers 194; 6.2 Regression-Based Classifiers 196; 6.3 Neural Network Classifiers 197; 6.4 Some Observations about Linear Classifiers 199; 7. Proximity-based Classifiers 200; 8. Classification of Linked and Web Data 203; 9. Meta-Algorithms for Text Classification 209; 9.1 Classifier Ensemble Learning 209; 9.2 Data Centered Methods: Boosting and Bagging 210; 9.3 Optimizing Specific Measures of Accuracy 211; 10. Conclusions and Summary 213; References 213

7 Transfer Learning for Text Mining (Weike Pan, Erheng Zhong and Qiang Yang) 223
  1. Introduction 224; 2. Transfer Learning in Text Classification 225; 2.1 Cross Domain Text Classification 225; 2.2 Instance-based Transfer 231; 2.3 Cross-Domain Ensemble Learning 232; 2.4 Feature-based Transfer Learning for Document Classification 235; 3. Heterogeneous Transfer Learning 239; 3.1 Heterogeneous Feature Space 241; 3.2 Heterogeneous Label Space 243; 3.3 Summary 244; 4. Discussion 245; 5. Conclusions 246; References 247

8 Probabilistic Models for Text Mining (Yizhou Sun, Hongbo Deng and Jiawei Han) 259
  1. Introduction 260; 2. Mixture Models 261; 2.1 General Mixture Model Framework 262; 2.2 Variations and Applications 263; 2.3 The Learning Algorithms 266; 3. Stochastic Processes in Bayesian Nonparametric Models 269; 3.1 Chinese Restaurant Process 269; 3.2 Dirichlet Process 270; 3.3 Pitman-Yor Process 274; 3.4 Others 275; 4. Graphical Models 275; 4.1 Bayesian Networks 276; 4.2 Hidden Markov Models 278; 4.3 Markov Random Fields 282; 4.4 Conditional Random Fields 285; 4.5 Other Models 286; 5. Probabilistic Models with Constraints 287; 6. Parallel Learning Algorithms 288; 7. Conclusions 289; References 290

9 Mining Text Streams (Charu C. Aggarwal) 297
  1. Introduction 297; 2. Clustering Text Streams 299; 2.1 Topic Detection and Tracking in Text Streams 307; 3. Classification of Text Streams 312; 4. Evolution Analysis in Text Streams 316; 5. Conclusions 317; References 318

10 Translingual Mining from Text Data (Jian-Yun Nie, Jianfeng Gao and Guihong Cao) 323
  1. Introduction 324; 2. Traditional Translingual Text Mining – Machine Translation 325; 2.1 SMT and Generative Translation Models 325; 2.2 Word-Based Models 327; 2.3 Phrase-Based Models 329; 2.4 Syntax-Based Models 333; 3. Automatic Mining of Parallel Texts 336; 3.1 Using Web Structure 337; 3.2 Matching Parallel Pages 339; 4. Using Translation Models in CLIR 341; 5. Collecting and Exploiting Comparable Texts 344; 6. Selecting Parallel Sentences, Phrases and Translation Words 347; 7. Mining Translingual Relations from Monolingual Texts 349; 8. Mining Using Hyperlinks 351; 9. Conclusions and Discussions 353; References 354

11 Text Mining in Multimedia (Zheng-Jun Zha, Meng Wang, Jialie Shen and Tat-Seng Chua) 361
  1. Introduction 362; 2. Surrounding Text Mining 364; 3. Tag Mining 366; 3.1 Tag Ranking 366; 3.2 Tag Refinement 367; 3.3 Tag Information Enrichment 369; 4. Joint Text and Visual Content Mining 370; 4.1 Visual Re-ranking 371; 5. Cross Text and Visual Content Mining 374; 6. Summary and Open Issues 377; References 379

12 Text Analytics in Social Media (Xia Hu and Huan Liu) 385
  1. Introduction 385; 2. Distinct Aspects of Text in Social Media 388; 2.1 A General Framework for Text Analytics 388; 2.2 Time Sensitivity 390; 2.3 Short Length 391; 2.4 Unstructured Phrases 392; 2.5 Abundant Information 393; 3. Applying Text Analytics to Social Media 393; 3.1 Event Detection 393; 3.2 Collaborative Question Answering 395; 3.3 Social Tagging 397; 3.4 Bridging the Semantic Gap 398; 3.5 Exploiting the Power of Abundant Information 399; 3.6 Related Efforts 401; 4. An Illustrative Example 402; 4.1 Seed Phrase Extraction 402; 4.2 Semantic Feature Generation 404; 4.3 Feature Space Construction 406; 5. Conclusion and Future Work 407; References 408

13 A Survey of Opinion Mining and Sentiment Analysis (Bing Liu and Lei Zhang) 415
  1. The Problem of Opinion Mining 416; 1.1 Opinion Definition 416; 1.2 Aspect-Based Opinion Summary 420; 2. Document Sentiment Classification 422; 2.1 Classification Based on Supervised Learning 422; 2.2 Classification Based on Unsupervised Learning 424; 3. Sentence Subjectivity and Sentiment Classification 426; 4. Opinion Lexicon Expansion 429; 4.1 Dictionary-based Approach 429; 4.2 Corpus-based Approach and Sentiment Consistency 430; 5. Aspect-Based Sentiment Analysis 432; 5.1 Aspect Sentiment Classification 433; 5.2 Basic Rules of Opinions 434; 5.3 Aspect Extraction 438; 5.4 Simultaneous Opinion Lexicon Expansion and Aspect Extraction 440; 6. Mining Comparative Opinions 441; 7. Some Other Problems 444; 8. Opinion Spam Detection 447; 8.1 Spam Detection Based on Supervised Learning 448; 8.2 Spam Detection Based on Abnormal Behaviors 449; 8.3 Group Spam Detection 450; 9. Utility of Reviews 451; 10. Conclusions 452; References 453

14 Biomedical Text Mining: A Survey of Recent Progress (Matthew S. Simpson and Dina Demner-Fushman) 465
  1. Introduction 466; 2. Resources for Biomedical Text Mining 467; 2.1 Corpora 467; 2.2 Annotation 469; 2.3 Knowledge Sources 470; 2.4 Supporting Tools 471; 3. Information Extraction 472; 3.1 Named Entity Recognition 473; 3.2 Relation Extraction 478; 3.3 Event Extraction 482; 4. Summarization 484; 5. Question Answering 488; 5.1 Medical Question Answering 489; 5.2 Biological Question Answering 491; 6. Literature-Based Discovery 492; 7. Conclusion 495; References 496

Index 519

Chapter 4
A SURVEY OF TEXT CLUSTERING ALGORITHMS

Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
charu@us.ibm.com

ChengXiang Zhai
University of Illinois at Urbana-Champaign
Urbana, IL
czhai@cs.uiuc.edu

Abstract: Clustering is a widely studied data mining problem in the text domain. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this chapter, we will provide a detailed survey of the problem of text clustering. We will study the key challenges of the clustering problem as it applies to the text domain. We will discuss the key methods used for text clustering, and their relative advantages. We will also discuss a number of recent advances in the area in the context of social networks and linked data.

Keywords: Text Clustering

1. Introduction

The problem of clustering has been studied widely in the database and statistics literature in the context of a wide variety of data mining tasks [50, 54].
The clustering problem is defined to be that of finding groups of similar objects in the data. The similarity between the objects is measured with the use of a similarity function. The problem of clustering can be very useful in the text domain, where the objects to be clustered can be of different granularities, such as documents, paragraphs, sentences or terms. Clustering is especially useful for organizing documents to improve retrieval and support browsing [11, 26].

The study of the clustering problem precedes its applicability to the text domain. Traditional methods for clustering have generally focused on the case of quantitative data [44, 71, 50, 54, 108], in which the attributes of the data are numeric. The problem has also been studied for the case of categorical data [10, 41, 43], in which the attributes may take on nominal values. A broad overview of clustering (as it relates to generic numerical and categorical data) may be found in [50, 54]. Implementations of common text clustering algorithms may be found in several toolkits, such as Lemur [114] and the BOW toolkit [64]. The problem of clustering finds applicability for a number of tasks:

Document Organization and Browsing: The hierarchical organization of documents into coherent categories can be very useful for systematic browsing of the document collection. A classical example of this is the Scatter/Gather method [25], which provides a systematic browsing technique with the use of the clustered organization of the document collection.

Corpus Summarization: Clustering techniques provide a coherent summary of the collection in the form of cluster-digests [83] or word-clusters [17, 18], which can be used in order to provide summary insights into the overall content of the underlying corpus. Variants of such methods, especially sentence clustering, can also be used for document summarization, a topic discussed in detail in Chapter 3. The problem of clustering is also closely related to that of dimensionality reduction and topic modeling. Such dimensionality reduction methods are all different ways of summarizing a corpus of documents, and are covered in Chapter 5.

Document Classification: While clustering is inherently an unsupervised learning method, it can be leveraged in order to improve the quality of the results in its supervised variant. In particular, word-clusters [17, 18] and co-training methods [72] can be used in order to improve the classification accuracy of supervised applications with the use of clustering techniques.

We note that many classes of algorithms, such as the k-means algorithm or hierarchical algorithms, are general-purpose methods which can be extended to any kind of data, including text data. A text document can be represented in the form of binary data, in which we use the presence or absence of a word in the document in order to create a binary vector. In such cases, it is possible to directly use a variety of categorical data clustering algorithms [10, 41, 43] on the binary representation. A more enhanced representation would include refined weighting methods based on the frequencies of the individual words in the document as well as the frequencies of words in the entire collection (e.g., TF-IDF weighting [82]). Quantitative data clustering algorithms [44, 71, 108] can be used in conjunction with these frequencies in order to determine the most relevant groups of objects in the data.
However, such naive techniques do not typically work well for clustering text data. This is because text data has a number of unique properties which necessitate the design of specialized algorithms for the task. The distinguishing characteristics of the text representation are as follows:

The dimensionality of the text representation is very large, but the underlying data is sparse. In other words, the lexicon from which the documents are drawn may be of the order of 10^5 words, but a given document may contain only a few hundred of them. This problem is even more serious when the documents to be clustered are very short (e.g., when clustering sentences or tweets).

While the lexicon of a given corpus of documents may be large, the words are typically correlated with one another. This means that the number of concepts (or principal components) in the data is much smaller than the size of the feature space. This necessitates the careful design of algorithms which can account for word correlations in the clustering process.

The number of words (or non-zero entries) in the different documents may vary widely. Therefore, it is important to normalize the document representations appropriately during the clustering task.

The sparse and high-dimensional representation of the different documents necessitates the design of text-specific algorithms for document representation and processing, a topic heavily studied in the information retrieval literature, where many techniques have been proposed to optimize document representation for improving the accuracy of matching a document with a query [82, 13]. Most of these techniques can also be used to improve document representation for clustering.

In order to enable an effective clustering process, the word frequencies need to be normalized in terms of their relative frequency of presence in the document and over the entire collection. In general, a common representation used for text processing is the vector-space based TF-IDF representation [81]. In the TF-IDF representation, the term frequency for each word is normalized by the inverse document frequency, or IDF. The inverse document frequency normalization reduces the weight of terms which occur more frequently in the collection. This reduces the importance of common terms in the collection, ensuring that the matching of documents is influenced more by the discriminative words which have relatively low frequencies in the collection. In addition, a sub-linear transformation function is often applied to the term frequencies in order to avoid the undesirable dominating effect of any single term that might be very frequent in a document. The work on document normalization is itself a vast area of research, and a variety of other techniques which discuss different normalization methods may be found in [86, 82].
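To make these normalization steps concrete, the following is a minimal sketch (not part of the original chapter) of a sparse TF-IDF representation with logarithmic term-frequency damping and length normalization; the smoothing constants and function names are illustrative assumptions rather than the specific weighting scheme of [81, 82].

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build L2-normalized TF-IDF vectors with sub-linear (logarithmic) TF damping.

    docs: list of token lists, e.g. [["text", "clustering", ...], ...]
    Returns one {term: weight} dictionary per document (a sparse representation).
    """
    n = len(docs)
    # Document frequency of each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1.0 + n) / (1.0 + d)) + 1.0 for term, d in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Sub-linear damping of the raw term frequency, then IDF weighting.
        weights = {t: (1.0 + math.log(f)) * idf[t] for t, f in tf.items()}
        # Length normalization, so that documents of different sizes are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

if __name__ == "__main__":
    docs = [["text", "clustering", "of", "documents"],
            ["clustering", "of", "data", "streams"],
            ["text", "mining", "of", "text", "data"]]
    for vec in tfidf_vectors(docs):
        print(sorted(vec.items()))
```

Any of the alternative damping and smoothing choices mentioned above can be substituted without changing the overall structure of such a representation.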
Text clustering algorithms are divided into a wide variety of different types, such as agglomerative clustering algorithms, partitioning algorithms, and standard parametric modeling based methods such as the EM-algorithm. Furthermore, text representations may also be treated as strings (rather than bags of words). These different representations necessitate the design of different classes of clustering algorithms. Different clustering algorithms have different tradeoffs in terms of effectiveness and efficiency. An experimental comparison of different clustering algorithms may be found in [90, 111].

In this chapter we will discuss a wide variety of algorithms which are commonly used for text clustering. We will also discuss text clustering algorithms for related scenarios, such as dynamic data, network-based text data and semi-supervised scenarios.

This chapter is organized as follows. In section 2, we will present feature selection and transformation methods for text clustering. Section 3 describes a number of common algorithms which are used for distance-based clustering of text documents. Section 4 contains the description of methods for clustering with the use of word patterns and phrases. Section 5 describes methods for probabilistic clustering of text data. Methods for online clustering of text streams are described in section 6. Section 7 contains a description of methods for clustering text which naturally occurs in the context of social or web-based networks. Section 8 discusses methods for semi-supervised clustering. Section 9 presents the conclusions and summary.

2. Feature Selection and Transformation Methods for Text Clustering

The quality of any data mining method, such as classification or clustering, is highly dependent on the noisiness of the features that are used for the clustering process. For example, commonly used words such as "the" may not be very useful in improving the clustering quality. Therefore, it is critical to select the features effectively, so that the noisy words in the corpus are removed before the clustering. In addition to feature selection, a number of feature transformation methods such as Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Analysis (PLSA), and Non-negative Matrix Factorization (NMF) are available to improve the quality of the document representation and make it more amenable to clustering. In these techniques (often called dimension reduction), the correlations among the words in the lexicon are leveraged in order to create features which correspond to the concepts or principal components in the data. In this section, we will discuss both classes of methods. A more in-depth discussion of dimension reduction can be found in Chapter 5.

2.1 Feature Selection Methods

Feature selection is more common and easier to apply in the problem of text categorization [99], in which supervision is available for the feature selection process. However, a number of simple unsupervised methods can also be used for feature selection in text clustering. Some examples of such methods are discussed below.

2.1.1 Document Frequency-based Selection. The simplest possible method for feature selection in document clustering is the use of document frequency to filter out irrelevant features. While the use of inverse document frequencies reduces the importance of such words, this may not alone be sufficient to reduce the noise effects of very frequent words. In other words, words which are too frequent in the corpus can be removed because they are typically common words such as "a", "an", "the", or "of", which are not discriminative from a clustering perspective. Such words are also referred to as stop words. A variety of methods are commonly available in the literature [76] for stop-word removal. Typically, commonly available stop word lists of about 300 to 400 words are used for the retrieval process. In addition, words which occur extremely infrequently can also be removed from the collection.
This is because such words do not add anything to the similarity computations which are used in most clustering methods. In some cases, such words may be misspellings or typographical errors in documents. Noisy text collections which are derived from the web, blogs or social networks are more likely to contain such terms. We note that some lines of research define document frequency based selection purely on the basis of very infrequent terms, because these terms contribute the least to the similarity calculations. However, it should be emphasized that very frequent words should also be removed, especially if they are not discriminative between clusters. Note that the TF-IDF weighting method can also naturally filter out very common words in a "soft" way. Clearly, the standard set of stop words provides a valid set of words to prune. Nevertheless, we would like a way of quantifying the importance of a term directly to the clustering process, which is essential for more aggressive pruning. We will discuss a number of such methods below.

2.1.2 Term Strength. A much more aggressive technique for stop-word removal is proposed in [94]. The core idea of this approach is to extend techniques which are used in supervised learning to the unsupervised case. The term strength is essentially used to measure how informative a word is for identifying two related documents. For example, for two related documents x and y, the term strength s(t) of term t is defined in terms of the following probability:

s(t) = P(t ∈ y | t ∈ x)    (4.1)

Clearly, the main issue is how one might define the documents x and y as related. One possibility is to use manual (or user) feedback to define when a pair of documents is related. This is essentially equivalent to utilizing supervision in the feature selection process, and may be practical in situations in which predefined categories of documents are available. On the other hand, it is not practical to manually create related pairs in large collections in a comprehensive way. It is therefore desirable to use an automated and purely unsupervised way to define the concept of when a pair of documents is related. It has been shown in [94] that it is possible to use automated similarity functions such as the cosine function [81] to define the relatedness of document pairs. A pair of documents is defined to be related if their cosine similarity is above a user-defined threshold. In such cases, the term strength s(t) can be estimated by randomly sampling a number of pairs of such related documents as follows:

s(t) = (Number of pairs in which t occurs in both documents) / (Number of pairs in which t occurs in the first document of the pair)    (4.2)

Here, the first document of the pair may simply be picked randomly. In order to prune features, the term strength may be compared to the expected strength of a term which is randomly distributed in the training documents with the same frequency. If the term strength of t is not at least two standard deviations greater than that of the random word, then it is removed from the collection.

One advantage of this approach is that it requires no initial supervision or training data for the feature selection, which is a key requirement in the unsupervised scenario. Of course, the approach can also be used for feature selection in either supervised clustering [4] or categorization [100], when such training data is indeed available.
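The following is a small illustrative sketch (not from the original chapter) of how the sampling-based estimate in equation (4.2) might be computed on top of normalized term-weight vectors; the similarity threshold, the sample size, and the helper names are assumptions made for illustration.

```python
import random
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two sparse (already L2-normalized) term-weight dictionaries."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def term_strength(docs, vectors, threshold=0.5, num_samples=10000, seed=0):
    """Estimate s(t) of equation (4.2) by sampling ordered pairs of related documents.

    docs:    list of token lists (used to test term presence)
    vectors: matching list of normalized term-weight dictionaries (used for similarity)
    """
    rng = random.Random(seed)
    first_count = defaultdict(int)   # pairs in which t occurs in the first document
    both_count = defaultdict(int)    # pairs in which t occurs in both documents

    n = len(docs)
    for _ in range(num_samples):
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j or cosine(vectors[i], vectors[j]) < threshold:
            continue  # keep only "related" pairs
        second = set(docs[j])
        for t in set(docs[i]):
            first_count[t] += 1
            if t in second:
                both_count[t] += 1

    return {t: both_count[t] / c for t, c in first_count.items() if c > 0}

if __name__ == "__main__":
    docs = [["text", "clustering"], ["text", "clustering", "survey"], ["image", "retrieval"]]
    # Unit-norm "vectors" built by hand for the demo; in practice TF-IDF weights would be used.
    vectors = [{t: 1.0 / len(set(d)) ** 0.5 for t in set(d)} for d in docs]
    print(term_strength(docs, vectors, threshold=0.3, num_samples=200))
```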
One observation about this approach to feature selection is that it is particularly suited to similarity-based clustering, because the discriminative nature of the underlying features is defined on the basis of similarities in the documents themselves.

2.1.3 Entropy-based Ranking. The entropy-based ranking approach was proposed in [27]. In this case, the quality of a term is measured by the entropy reduction when it is removed. Here the entropy E(t) of the term t in a collection of n documents is defined as follows:

E(t) = - \sum_{i=1}^{n} \sum_{j=1}^{n} ( S_{ij} · log(S_{ij}) + (1 - S_{ij}) · log(1 - S_{ij}) )    (4.3)

Here S_{ij} ∈ (0, 1) is the similarity between the ith and jth documents in the collection, after the term t is removed, and is defined as follows:

S_{ij} = 2^{ -dist(i,j) / \overline{dist} }    (4.4)

Here dist(i, j) is the distance between documents i and j after the term t is removed, and \overline{dist} is the average distance between the documents after the term t is removed. We note that the computation of E(t) for each term t requires O(n^2) operations. This is impractical for a very large corpus containing many terms. It has been shown in [27] how this method may be made much more efficient with the use of sampling methods.

2.1.4 Term Contribution. The concept of term contribution [62] is based on the fact that the results of text clustering are highly dependent on document similarity. Therefore, the contribution of a term can be viewed as its contribution to document similarity. For example, in the case of dot-product based similarity, the similarity between two documents is defined as the dot product of their normalized frequencies. Therefore, the contribution of a term to the similarity of two documents is the product of its normalized frequencies in the two documents. This needs to be summed over all pairs of documents in order to determine the term contribution. As in the previous case, this method requires O(n^2) time for each term, and therefore sampling methods may be required to speed up the computation. A major criticism of this method is that it tends to favor highly frequent words without regard to their specific discriminative power within a clustering process.

In most of these methods, the optimization of term selection is based on some pre-assumed similarity function (e.g., cosine). While this strategy makes these methods unsupervised, there is a concern that the term selection might be biased due to the potential bias of the assumed similarity function. That is, if a different similarity function is assumed, we may end up having different results for term selection. Thus the choice of an appropriate similarity function may be important for these methods.

2.2 LSI-based Methods

In feature selection, we attempt to explicitly select features from the original data set. Feature transformation is a different method, in which the new features are defined as a functional representation of the features in the original data set. The most common class of methods is that of dimensionality reduction [53], in which the documents are transformed to a new feature space of smaller dimensionality, in which the features are typically a linear combination of the features in the original data. Methods such as Latent Semantic Indexing (LSI) [28] are based on this common principle. The overall effect is to remove the many dimensions in the data which are noisy for similarity-based applications such as clustering. The removal of such dimensions also helps magnify the semantic effects in the underlying data.
Since LSI is closely related to the problem of Principal Component Analysis (PCA), or equivalently the Singular Value Decomposition (SVD), we will first discuss this method and its relationship to LSI. For a d-dimensional data set, PCA constructs the symmetric d × d covariance matrix C of the data, in which the (i, j)th entry is the covariance between dimensions i and j. This matrix is positive semi-definite, and can be diagonalized as follows:

C = P · D · P^T    (4.5)

Here P is a matrix whose columns contain the orthonormal eigenvectors of C, and D is a diagonal matrix containing the corresponding eigenvalues. We note that the eigenvectors represent a new orthonormal basis system along which the data can be represented. In this context, the eigenvalues correspond to the variance when the data is projected along this basis system. This basis system is also one in which the second-order covariances of the data are removed, and most of the variance in the data is captured by preserving the eigenvectors with the largest eigenvalues. Therefore, in order to reduce the dimensionality of the data, a common approach is to represent the data in this new basis system, which is further truncated by ignoring those eigenvectors for which the corresponding eigenvalues are small. This is because the variances along those dimensions are small, and the relative behavior of the data points is not significantly affected by removing them from consideration. In fact, it can be shown that the Euclidean distances between data points are not significantly affected by this transformation and the corresponding truncation. The method of PCA is commonly used for similarity search in database retrieval applications.

LSI is quite similar to PCA, except that we use an approximation of the covariance matrix C which is quite appropriate for the sparse and high-dimensional nature of text data. Specifically, let A be the n × d term-document matrix in which the (i, j)th entry is the normalized frequency of term j in document i. Then, A^T · A is a d × d matrix which is a close (scaled) approximation of the covariance matrix, in which the means have not been subtracted out. In other words, the value of A^T · A would be the same as a scaled version (by a factor of n) of the covariance matrix, if the data were mean-centered. While text representations are not mean-centered, the sparsity of text ensures that the use of A^T · A is quite a good approximation of the (scaled) covariances. As in the case of numerical data, we use the eigenvectors of A^T · A with the largest eigenvalues in order to represent the text. In typical collections, only about 300 to 400 eigenvectors are required for the representation. One excellent characteristic of LSI [28] is that the truncation of the dimensions removes the noise effects of synonymy and polysemy, and the similarity computations are more closely affected by the semantic concepts in the data. This is particularly useful for a semantic application such as text clustering. However, if finer granularity clustering is needed, such low-dimensional representations of text may not be sufficiently discriminative; in information retrieval, this problem is often solved by mixing the low-dimensional representation with the original high-dimensional word-based representation (see, e.g., [105]).

A similar technique to LSI, but based on probabilistic modeling, is Probabilistic Latent Semantic Analysis (PLSA) [49]. The similarity and equivalence of PLSA and LSI are discussed in [49].
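The following is a minimal sketch of the LSI projection described above, under the assumptions that documents are rows of a sparse matrix A and that a truncated SVD is used to obtain the leading eigenvectors of A^T · A; the random toy matrix and the choice of k are purely illustrative.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

def lsi_project(A, k=300):
    """Project the rows of the n x d term-document matrix A into a k-dimensional
    latent space using a truncated SVD (A ~= U * diag(s) * Vt).

    The rows of Vt span the dominant eigenvectors of A^T A, so A @ Vt.T gives the
    reduced representation of the documents.
    """
    k = min(k, min(A.shape) - 1)          # svds requires k < min(n, d)
    u, s, vt = svds(A, k=k)               # truncated singular triplets
    order = np.argsort(-s)                # re-order so the largest singular value comes first
    return (A @ vt[order].T), vt[order]

if __name__ == "__main__":
    # A small random sparse "term-document" matrix stands in for real normalized data.
    A = sparse_random(100, 500, density=0.02, format="csr", random_state=0)
    docs_k, basis = lsi_project(A, k=20)
    print(docs_k.shape, basis.shape)      # (100, 20) (500, 20)
```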
2.2.1 Concept Decomposition using Clustering. One interesting observation is that while feature transformation is often used as a pre-processing technique for clustering, the clustering itself can be used for a novel dimensionality reduction technique known as concept decomposition [2, 29]. This of course leads to the issue of circularity in the use of this technique for clustering, especially if clustering is required in order to perform the dimensionality reduction. Nevertheless, it is still possible to use this technique effectively for pre-processing with the use of two separate phases of clustering. The technique of concept decomposition uses any standard clustering technique [2, 29] on the original representation of the documents. The frequent terms in the centroids of these clusters are used as basis vectors which are almost orthogonal to one another. The documents can then be represented in a much more concise way in terms of these basis vectors. We note that this condensed conceptual representation allows for enhanced clustering as well as classification. Therefore, a second phase of clustering can be applied on this reduced representation in order to cluster the documents much more effectively. Such a method has also been tested in [87] by using word-clusters in order to represent documents. We will describe this method in more detail later in this chapter.

2.3 Non-negative Matrix Factorization

The non-negative matrix factorization (NMF) technique is a latent-space method, and is particularly suitable for clustering [97]. As in the case of LSI, the NMF scheme represents the documents in a new axis system which is based on an analysis of the term-document matrix. However, the NMF method has a number of critical differences from the LSI scheme from a conceptual point of view. In particular, the NMF scheme is a feature transformation method which is particularly suited to clustering. The main conceptual characteristics of the NMF scheme, which are very different from LSI, are as follows:

In LSI, the new basis system consists of a set of orthonormal vectors. This is not the case for NMF.

In NMF, the vectors in the basis system directly correspond to cluster topics. Therefore, the cluster membership for a document may be determined by examining the largest component of the document along any of the vectors. The coordinate of any document along a vector is always non-negative. The expression of each document as an additive combination of the underlying semantics makes a lot of sense from an intuitive perspective. Therefore, the NMF transformation is particularly suited to clustering, and it also provides an intuitive understanding of the basis system in terms of the clusters.

Let A be the n × d term-document matrix. Let us assume that we wish to create k clusters from the underlying document corpus. Then, the non-negative matrix factorization method attempts to determine the matrices U and V which minimize the following objective function:

J = (1/2) · ||A - U · V^T||    (4.6)

Here || · || represents the sum of the squares of all the elements in the matrix (the squared Frobenius norm), U is an n × k non-negative matrix, and V is a d × k non-negative matrix. We note that the columns of V provide the k basis vectors which correspond to the k different clusters. What is the significance of the above optimization problem?
Note that by minimizing J, we are attempting to factorize A approximately as:

A ≈ U · V^T    (4.7)

For each row a of A (a document vector), we can rewrite the above equation as:

a ≈ u · V^T    (4.8)

Here u is the corresponding row of U. Therefore, the document vector a can be rewritten as an approximate linear (non-negative) combination of the basis vectors which correspond to the k columns of V^T. If the value of k is relatively small compared to the corpus, this can only be done if the column vectors of V^T discover the latent structure in the data. Furthermore, the non-negativity of the matrices U and V ensures that the documents are expressed as a non-negative combination of the key concepts (or clustered regions) in the term-based feature space.

Next, we will discuss how the optimization problem for J above is actually solved. The squared norm of any matrix Q can be expressed as the trace of the matrix Q · Q^T. Therefore, we can express the objective function above as follows:

J = (1/2) · tr( (A - U · V^T) · (A - U · V^T)^T )
  = (1/2) · tr(A · A^T) - tr(A · V · U^T) + (1/2) · tr(U · V^T · V · U^T)

Thus, we have an optimization problem with respect to the matrices U = [u_{ij}] and V = [v_{ij}], the entries u_{ij} and v_{ij} of which are the variables with respect to which we need to optimize this problem. In addition, since the matrices are non-negative, we have the constraints that u_{ij} ≥ 0 and v_{ij} ≥ 0. This is a typical constrained non-linear optimization problem, and can be solved using the Lagrange method. Let α = [α_{ij}] and β = [β_{ij}] be matrices with the same dimensions as U and V respectively. The elements of the matrices α and β are the corresponding Lagrange multipliers for the non-negativity conditions on the different elements of U and V respectively. We note that tr(α · U^T) is simply equal to Σ_{i,j} α_{ij} · u_{ij}, and tr(β · V^T) is simply equal to Σ_{i,j} β_{ij} · v_{ij}. These correspond to the Lagrange expressions for the non-negativity constraints. Then, we can express the Lagrangian optimization problem as follows:

L = J + tr(α · U^T) + tr(β · V^T)    (4.9)

Then, we can express the partial derivatives of L with respect to U and V as follows, and set them to 0:

∂L/∂U = -A · V + U · V^T · V + α = 0
∂L/∂V = -A^T · U + V · U^T · U + β = 0

We can then multiply the (i, j)th entry of the above (two matrices of) conditions with u_{ij} and v_{ij} respectively. Using the Kuhn-Tucker conditions α_{ij} · u_{ij} = 0 and β_{ij} · v_{ij} = 0, we get the following:

(A · V)_{ij} · u_{ij} - (U · V^T · V)_{ij} · u_{ij} = 0
(A^T · U)_{ij} · v_{ij} - (V · U^T · U)_{ij} · v_{ij} = 0

We note that these conditions are independent of α and β. This leads to the following iterative update rules for u_{ij} and v_{ij}:

u_{ij} ← u_{ij} · (A · V)_{ij} / (U · V^T · V)_{ij}
v_{ij} ← v_{ij} · (A^T · U)_{ij} / (V · U^T · U)_{ij}

It has been shown in [58] that the objective function continuously improves under these update rules, and converges to an optimal solution.
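A compact numpy sketch of the multiplicative update rules derived above is shown below; it is illustrative only, and the random initialization, the small epsilon guarding against division by zero, and the fixed iteration count are assumptions not prescribed by the text.

```python
import numpy as np

def nmf(A, k, num_iters=200, eps=1e-9, seed=0):
    """Factor a non-negative n x d matrix A as A ~= U @ V.T using the
    multiplicative update rules sketched in the text.

    Returns U (n x k, document-cluster affinities) and V (d x k, cluster topics).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    U = rng.random((n, k)) + eps
    V = rng.random((d, k)) + eps

    for _ in range(num_iters):
        # u_ij <- u_ij * (A V)_ij / (U V^T V)_ij
        U *= (A @ V) / (U @ (V.T @ V) + eps)
        # v_ij <- v_ij * (A^T U)_ij / (V U^T U)_ij
        V *= (A.T @ U) / (V @ (U.T @ U) + eps)

    return U, V

if __name__ == "__main__":
    A = np.random.default_rng(1).random((20, 50))   # toy non-negative data
    U, V = nmf(A, k=4)
    labels = U.argmax(axis=1)       # cluster of each document: its largest coordinate
    error = np.linalg.norm(A - U @ V.T)
    print(labels, round(float(error), 3))
```

The cluster assignment in the usage example follows the observation above that the largest component of a document along the basis vectors indicates its cluster membership.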
One interesting observation about the matrix factorization technique is that it can also be used to determine word clusters instead of document clusters. Just as the columns of V provide a basis which can be used to discover document clusters, we can use the columns of U to discover a basis which corresponds to word clusters. As we will see later, document clusters and word clusters are closely related, and it is often useful to discover both simultaneously, as in frameworks such as co-clustering [30, 31, 75]. Matrix factorization provides a natural way of achieving this goal. It has also been shown both theoretically and experimentally [33, 93] that the matrix factorization technique is equivalent to another graph-structure based document clustering technique known as spectral clustering. An analogous technique called concept factorization was proposed in [98], which can also be applied to data points with negative values in them.

3. Distance-based Clustering Algorithms

Distance-based clustering algorithms are designed by using a similarity function to measure the closeness between the text objects. The most well known similarity function which is used commonly in the text domain is the cosine similarity function. Let U = (f(u_1) . . . f(u_k)) and V = (f(v_1) . . . f(v_k)) be the damped and normalized frequency term vectors of two different documents U and V. The values u_1 . . . u_k and v_1 . . . v_k represent the (normalized) term frequencies, and the function f(·) represents the damping function. Typical damping functions for f(·) could be either the square root or the logarithm [25]. Then, the cosine similarity between the two documents is defined as follows:

cosine(U, V) = [ \sum_{i=1}^{k} f(u_i) · f(v_i) ] / [ \sqrt{\sum_{i=1}^{k} f(u_i)^2} · \sqrt{\sum_{i=1}^{k} f(v_i)^2} ]    (4.10)

Computation of text similarity is a fundamental problem in information retrieval. Although most of the work in information retrieval has focused on how to assess the similarity of a keyword query and a text document, rather than the similarity between two documents, many weighting heuristics and similarity functions can also be applied to optimize the similarity function for clustering. Effective information retrieval models generally capture three heuristics, i.e., TF weighting, IDF weighting, and document length normalization [36]. One effective way to assign weights to terms when representing a document as a weighted term vector is the BM25 term weighting method [78], where the normalized TF not only addresses length normalization, but also has an upper bound which improves robustness, as it avoids overly rewarding the matching of any particular term. A document can also be represented with a probability distribution over words (i.e., a unigram language model), and the similarity can then be measured based on an information-theoretic measure such as cross entropy or Kullback-Leibler divergence [105]. For clustering, symmetric variants of such a similarity function may be more appropriate.

One challenge in clustering short segments of text (e.g., tweets or sentences) is that exact keyword matching may not work well. One general strategy for solving this problem is to expand the text representation by exploiting related text documents, which is related to smoothing of a document language model in information retrieval [105]. A specific technique, which leverages a search engine to expand the text representation, was proposed in [79]. A comparison of several simple measures for computing the similarity of short text segments can be found in [66]. These similarity functions can be used in conjunction with a wide variety of traditional clustering algorithms [50, 54]. In the next subsections, we will discuss some of these techniques.
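As a minimal sketch of the damped cosine measure of equation (4.10), the snippet below applies square-root damping to raw term frequencies; the damping choice and function names are assumptions for illustration, and any of the alternative weighting schemes mentioned above could be substituted.

```python
import math
from collections import Counter

def damped_cosine(doc_x, doc_y, damping=math.sqrt):
    """Cosine similarity of equation (4.10) between two token lists,
    with a sub-linear damping function f applied to the raw term frequencies."""
    fx = {t: damping(c) for t, c in Counter(doc_x).items()}
    fy = {t: damping(c) for t, c in Counter(doc_y).items()}
    dot = sum(w * fy.get(t, 0.0) for t, w in fx.items())
    norm_x = math.sqrt(sum(w * w for w in fx.values()))
    norm_y = math.sqrt(sum(w * w for w in fy.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

print(damped_cosine("text clustering of text data".split(),
                    "clustering of data streams".split()))
```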
3.1 Agglomerative and Hierarchical Clustering Algorithms

Hierarchical clustering algorithms have been studied extensively in the clustering literature [50, 54] for records of different kinds, including multidimensional numerical data, categorical data and text data. An overview of the traditional agglomerative and hierarchical clustering algorithms in the context of text data is provided in [69, 70, 92, 96]. An experimental comparison of different hierarchical clustering algorithms may be found in [110]. The method of agglomerative hierarchical clustering is particularly useful to support a variety of searching methods, because it naturally creates a tree-like hierarchy which can be leveraged for the search process. In particular, the effectiveness of this method in improving search efficiency over a sequential scan has been shown in [51, 77].

The general concept of agglomerative clustering is to successively merge documents into clusters based on their similarity with one another. Almost all hierarchical clustering algorithms successively merge groups based on the best pairwise similarity between these groups of documents. The main differences between these classes of methods are in terms of how this pairwise similarity is computed between the different groups of documents. For example, the similarity between a pair of groups may be computed as the best-case similarity, average-case similarity, or worst-case similarity between documents which are drawn from these pairs of groups. Conceptually, the process of agglomerating documents into successively higher levels of clusters creates a cluster hierarchy (or dendrogram), for which the leaf nodes correspond to individual documents, and the internal nodes correspond to the merged groups of clusters. When two groups are merged, a new node is created in this tree corresponding to this larger merged group. The two children of this node correspond to the two groups of documents which have been merged to create it. The different methods for merging groups of documents in the different agglomerative methods are as follows:

Single Linkage Clustering: In single linkage clustering, the similarity between two groups of documents is the greatest similarity between any pair of documents from these two groups. In single linkage clustering, we merge the two groups whose closest pair of documents has the highest similarity compared to any other pair of groups. The main advantage of single linkage clustering is that it is extremely efficient to implement in practice. This is because we can first compute all similarity pairs and sort them in order of reducing similarity. These pairs are processed in this pre-defined order, and a merge is performed successively whenever a pair belongs to different groups. It can be easily shown that this approach is equivalent to the single-linkage method. It is also essentially equivalent to a spanning tree algorithm on the complete graph of pairwise distances, which processes the edges of the graph in a certain order. It has been shown in [92] how Prim's minimum spanning tree algorithm can be adapted to single-linkage clustering. Another method in [24] designs the single-linkage method in conjunction with the inverted index method in order to avoid computing zero similarities. The main drawback of this approach is that it can lead to the phenomenon of chaining, in which a chain of similar documents leads to disparate documents being grouped into the same cluster. In other words, if A is similar to B and B is similar to C, it does not always imply that A is similar to C, because of the lack of transitivity in similarity computations. Single linkage clustering encourages the grouping of documents through such transitivity chains.
This can often lead to poor clusters, especially at the higher levels of the agglomeration. Effective methods for implementing single-linkage clustering for the case of document data may be found in [24, 92].

Group-Average Linkage Clustering: In group-average linkage clustering, the similarity between two clusters is the average similarity between the pairs of documents in the two clusters. Clearly, the average linkage clustering process is somewhat slower than single-linkage clustering, because we need to determine the average similarity between a large number of pairs in order to determine group-wise similarity. On the other hand, it is much more robust in terms of clustering quality, because it does not exhibit the chaining behavior of single linkage clustering. It is possible to speed up the average linkage clustering algorithm by approximating the average linkage similarity between two clusters C_1 and C_2 by the similarity between the mean document of C_1 and the mean document of C_2. While this approach does not work equally well for all data domains, it works particularly well for the case of text data. In this case, the running time can be reduced to O(n^2), where n is the total number of nodes. The method can be implemented quite efficiently in the case of document data, because the centroid of a cluster is simply the concatenation of the documents in that cluster.

Complete Linkage Clustering: In this technique, the similarity between two clusters is the worst-case similarity between any pair of documents in the two clusters. Complete-linkage clustering can also avoid chaining, because it avoids the placement of any pair of very disparate points in the same cluster. However, like group-average clustering, it is computationally more expensive than the single-linkage method. The complete linkage clustering method requires O(n^2) space and O(n^3) time. The space requirement can, however, be significantly lower in the case of the text data domain, because a large number of pairwise similarities are zero.

Hierarchical clustering algorithms have also been designed in the context of text data streams. A distributional modeling method for hierarchical clustering of streaming documents has been proposed in [80]. The main idea is to model the frequency of word presence in documents with the use of a multi-Poisson distribution. The parameters of this model are learned in order to assign documents to clusters. The method extends the COBWEB and CLASSIT algorithms [37, 40] to the case of text data. The work in [80] studies the different kinds of distributional assumptions for words in documents. We note that these distributional assumptions are required to adapt these algorithms to the case of text data; the approach essentially changes the distributional assumption so that the method can work effectively for text data.
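As a small illustration of the three linkage criteria described above (not part of the original chapter), the following sketch runs an off-the-shelf agglomerative clustering over dense TF-IDF vectors with cosine distance; the densification of the vectors and the tiny example corpus are simplifying assumptions made to keep the example short.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "text clustering of documents",
    "hierarchical clustering of text data",
    "image segmentation with neural networks",
    "deep neural networks for images",
]

# Dense TF-IDF vectors; for large corpora a sparse pipeline would be preferable.
X = TfidfVectorizer().fit_transform(docs).toarray()

for method in ("single", "average", "complete"):
    Z = linkage(X, method=method, metric="cosine")   # dendrogram as a merge table
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)
```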
3.2 Distance-based Partitioning Algorithms

Partitioning algorithms are widely used in the database literature in order to efficiently create clusters of objects. The two most widely used distance-based partitioning algorithms [50, 54] are as follows:

k-medoid clustering algorithms: In k-medoid clustering algorithms, we use a set of points from the original data as the anchors (or medoids) around which the clusters are built. The key aim of the algorithm is to determine an optimal set of representative documents from the original corpus around which the clusters are built. Each document is assigned to its closest representative from the collection. This creates a running set of clusters from the corpus, which are successively improved by a randomized process. The algorithm works with an iterative approach in which the set of k representatives is successively improved with the use of randomized interchanges. Specifically, we use the average similarity of each document in the corpus to its closest representative as the objective function which needs to be improved during this interchange process. In each iteration, we replace a randomly picked representative in the current set of medoids with a randomly picked document from the collection, if doing so improves the clustering objective function. This approach is applied until convergence is achieved. There are two main disadvantages of the use of k-medoids based clustering algorithms, one of which is specific to the case of text data. One general disadvantage of k-medoids clustering algorithms is that they require a large number of iterations in order to achieve convergence, and are therefore quite slow. This is because each iteration requires the computation of an objective function whose time requirement is proportional to the size of the underlying corpus. The second key disadvantage is that k-medoid algorithms do not work very well for sparse data such as text. This is because a large fraction of document pairs do not have many words in common, and the similarities between such document pairs are small (and noisy) values. Therefore, a single document medoid often does not contain all the concepts required in order to effectively build a cluster around it. This characteristic is specific to the case of the information retrieval domain, because of the sparse nature of the underlying text data.

k-means clustering algorithms: The k-means clustering algorithm also uses a set of k representatives around which the clusters are built. However, these representatives are not necessarily obtained from the original data, and are refined somewhat differently than in a k-medoids approach. The simplest form of the k-means approach is to start off with a set of k seeds from the original corpus, and assign documents to these seeds on the basis of closest similarity. In the next iteration, the centroid of the points assigned to each seed replaces the seed from the last iteration. In other words, the new seed is defined so that it is a better central point for this cluster. This approach is continued until convergence. One of the advantages of the k-means method over the k-medoids method is that it requires an extremely small number of iterations in order to converge. Observations from [25, 83] seem to suggest that for many large data sets, it is sufficient to use five or fewer iterations for an effective clustering. The main disadvantage of the k-means method is that it is still quite sensitive to the initial set of seeds picked during the clustering. Secondly, the centroid for a given cluster of documents may contain a large number of words. This will slow down the similarity calculations in the next iteration. A number of methods are used to reduce these effects, which will be discussed later on in this chapter. (A short illustrative sketch of the basic k-means iteration appears after this list.)
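The following is a minimal sketch of the basic k-means iteration on cosine-normalized document vectors; the use of unit-length rows, random seed selection, and a fixed small number of iterations are assumptions made for illustration, and the seed-selection refinements discussed next are deliberately omitted.

```python
import numpy as np

def kmeans_cosine(X, k, num_iters=5, seed=0):
    """Spherical k-means on a matrix X whose rows are L2-normalized document vectors.

    Because the rows are unit length, the dot product X @ centroids.T is the
    cosine similarity used in the assignment step.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random seeds

    for _ in range(num_iters):
        # Assignment step: each document goes to its most similar centroid.
        labels = np.argmax(X @ centroids.T, axis=1)
        # Update step: recompute each centroid and re-normalize it.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centroids[j] = c / (np.linalg.norm(c) or 1.0)
    return labels, centroids

if __name__ == "__main__":
    X = np.random.default_rng(1).random((50, 20))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    labels, _ = kmeans_cosine(X, k=4)
    print(np.bincount(labels))
```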
The initial choice of seeds affects the quality of k-means clustering, especially in the case of document clustering. Therefore, a number of techniques are used in order to improve the quality of the initial seeds which are picked for the clustering process. For example, another lightweight clustering method, such as an agglomerative clustering technique, can be used in order to decide the initial set of seeds. This is at the core of the method discussed in [25] for effective document clustering. We will discuss this method in detail in the next subsection. A second method for improving the initial set of seeds is to use some form of partial supervision in the process of initial seed creation. This form of partial supervision can also be helpful in creating clusters which are designed for particular application-specific criteria. An example of such an approach is discussed in [4], in which the initial set of seeds is picked as the centroids of the documents crawled from particular categories of the Yahoo! taxonomy. This also has the effect that the final set of clusters is grouped by the coherence of content within the different Yahoo! categories. The approach has been shown to be quite effective for use in a number of applications such as text categorization. Such semi-supervised techniques are particularly useful for information organization in cases where the starting set of categories is somewhat noisy, but contains enough information in order to create clusters which satisfy a pre-defined kind of organization.

3.3 A Hybrid Approach: The Scatter-Gather Method

While hierarchical clustering methods tend to be more robust because of their tendency to compare all pairs of documents, they are generally not very efficient, because of their tendency to require at least O(n^2) time. On the other hand, k-means type algorithms are more efficient than hierarchical algorithms, but may sometimes not be very effective because of their tendency to rely on a small number of seeds.

The method in [25] uses both hierarchical and partitional clustering algorithms to good effect. Specifically, it uses a hierarchical clustering algorithm on a sample of the corpus in order to find a robust initial set of seeds. This robust set of seeds is used in conjunction with a standard k-means clustering algorithm in order to determine good clusters. The size of the sample in the initial phase is carefully tailored so as to provide the best possible effectiveness without this phase becoming a bottleneck in algorithm execution. There are two alternative methods for creating the initial set of seeds, which are referred to as buckshot and fractionation respectively. They are described as follows:

Buckshot: Let k be the number of clusters to be found and n be the number of documents in the corpus. Instead of picking the k seeds randomly from the collection, the buckshot scheme picks an overestimate √(k·n) of the seeds, and then agglomerates these to k seeds. Standard agglomerative hierarchical clustering algorithms (requiring quadratic time) are applied to this initial sample of √(k·n) seeds. Since we use quadratically scalable algorithms in this phase, this approach requires O(k·n) time. We note that this seed set is much more robust than one which simply samples k seeds, because of the summarization of a large document sample into a robust set of k seeds.

Fractionation: The fractionation algorithm initially breaks up the corpus into n/m buckets, each of size m > k. An agglomerative algorithm is applied to each of these buckets to reduce them by a factor of ν.
Thus, at the end of this phase, we have a total of ν·n agglomerated points. The process is repeated by treating each of these agglomerated points as an individual record. This is achieved by merging the different documents within an agglomerated cluster into a single document. The approach terminates when a total of k seeds remain. We note that the agglomerative clustering of each group of m documents in the first iteration of the fractionation algorithm requires O(m^2) time, which sums to O(n·m) over the n/m different groups. Since the number of individuals reduces geometrically by a factor of ν in each iteration, the total running time over all iterations is O(n·m·(1 + ν + ν^2 + ...)). For constant ν < 1, the running time over all iterations is still O(n·m). By picking m = O(k), we can ensure a running time of O(n·k) for the initialization procedure.

The buckshot and fractionation procedures require O(k·n) time, which is also equivalent to the running time of one iteration of the k-means algorithm. Each iteration of the k-means algorithm also requires O(k·n) time, because we need to compute the similarity of the n documents to the k different seeds.

We further note that the fractionation procedure can be applied to a random grouping of the documents into n/m different buckets. Of course, one can also replace the random grouping approach with a more carefully designed procedure for more effective results. One such procedure is to sort the documents by the index of the jth most common word in the document. Here j is chosen to be a small number, such as 3, which corresponds to medium frequency words in the data. The documents are then partitioned into groups based on this sort order by segmenting out contiguous groups of m documents. This approach ensures that the groups created have at least a few common words in them, and are therefore not completely random. This can sometimes provide a better quality of the centers which are determined by the fractionation algorithm.

Once the initial cluster centers have been determined with the use of the buckshot or fractionation algorithms, we can apply a standard k-means partitioning algorithm. Specifically, each document is assigned to the nearest of the k cluster centers. The centroid of each such cluster is determined as the concatenation of the different documents in the cluster. These centroids replace the sets of seeds from the last iteration. This process can be repeated in an iterative approach in order to successively refine the centers for the clusters. Typically, only a small number of iterations are required, because the greatest improvements occur in the first few iterations.
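To make the two-phase idea concrete, here is a small illustrative sketch (not the original implementation of [25]) of buckshot seeding followed by the k-means style refinement described above; the use of group-average agglomeration on the sample, the cosine-normalized vectors, and the helper names are all assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def buckshot_seeds(X, k, seed=0):
    """Pick sqrt(k*n) random documents, agglomerate them into (at most) k groups
    with group-average linkage on cosine distance, and return the group centroids."""
    rng = np.random.default_rng(seed)
    n = len(X)
    sample = X[rng.choice(n, size=min(n, int(np.sqrt(k * n))), replace=False)]
    labels = fcluster(linkage(sample, method="average", metric="cosine"),
                      t=k, criterion="maxclust")
    seeds = np.vstack([sample[labels == j].mean(axis=0) for j in np.unique(labels)])
    return seeds / np.maximum(np.linalg.norm(seeds, axis=1, keepdims=True), 1e-12)

def scatter_gather(X, k, num_iters=5, seed=0):
    """Buckshot initialization followed by a few k-means refinement passes
    on L2-normalized document vectors X."""
    centers = buckshot_seeds(X, k, seed)
    for _ in range(num_iters):
        labels = np.argmax(X @ centers.T, axis=1)        # assign to the nearest center
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)                  # "concatenate" the cluster
                centers[j] = c / np.linalg.norm(c)
    return labels

if __name__ == "__main__":
    X = np.random.default_rng(2).random((200, 30))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    print(np.bincount(scatter_gather(X, k=5)))
```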
In order to measure the coherence of a group, we compute the self-similarity of a cluster. This self-similarity provides us with an understanding of the underlying coherence. This quantity can be computed either in terms of the similarity of the documents in a cluster to its centroid, or in terms of the similarity of the cluster documents to each other. The split criterion can then be applied selectively only to those clusters which have low self-similarity. This helps in creating more coherent clusters.

Join Operation: The join operation attempts to merge similar clusters into a single cluster. In order to perform the merge, we compute the topical words of each cluster by examining the most frequent words of the centroid. Two clusters are considered similar if there is significant overlap between the topical words of the two clusters.

We note that the method is often referred to as the Scatter-Gather clustering method, but this is more because of how the clustering method has been presented in terms of its use for browsing large collections in the original paper [25]. The scatter-gather approach can be used for organized browsing of large document collections, because it creates a natural hierarchy of similar documents. In particular, a user may wish to browse the hierarchy of clusters in an interactive way in order to understand topics of different levels of granularity in the collection. One possibility is to perform a hierarchical clustering a-priori; however, such an approach has the disadvantage that it is unable to merge and re-cluster related branches of the tree hierarchy on-the-fly when a user may need it. A method for constant-interaction-time browsing with the use of the scatter-gather approach has been presented in [26]. This approach presents the keywords associated with the different clusters to a user. The user may pick one or more of these keywords, which also correspond to one or more clusters. The documents in these clusters are merged and re-clustered to a finer granularity on-the-fly. This finer granularity of clustering is presented to the user for further exploration. The set of documents which is picked by the user for exploration is referred to as the focus set. Next, we explain how this focus set is further explored and re-clustered on the fly in constant time.

The key assumption which enables this approach is the cluster refinement hypothesis. This hypothesis states that documents which belong to the same cluster in a significantly finer granularity partitioning will also occur together in a partitioning with coarser granularity. The first step is to create a hierarchy of the documents in the clusters. A variety of agglomerative algorithms, such as the buckshot method, can be used for this purpose. We note that each (internal) node of this tree can be viewed as a meta-document corresponding to the concatenation of all the documents in the leaves of this subtree. The cluster-refinement hypothesis allows us to work with a smaller set of meta-documents rather than the entire set of documents in a particular subtree. The idea is to pick a constant M which represents the maximum number of meta-documents that we are willing to re-cluster with the use of the interactive approach. The tree nodes in the focus set are then expanded (with priority to the branches with largest degree), to a maximum of M nodes. These M nodes are then re-clustered on-the-fly with the scatter-gather approach.
This requires constant time because of the use of a constant number M of meta-documents in the clustering process. Thus, by working with at most M meta-documents, we implicitly assume the cluster-refinement hypothesis for all nodes of the subtree below the corresponding level. Clearly, a larger value of M does not assume the cluster-refinement hypothesis quite as strongly, but also comes at a higher cost. The details of the algorithm are described in [26]. Some extensions of this approach are also presented in [85], in which it has been shown how this approach can be used to cluster arbitrary corpus subsets of the documents in constant time. Another recent online clustering algorithm called LAIR2 [55] provides constant interaction time for Scatter/Gather browsing. The parallelization of this algorithm is significantly faster than a corresponding version of the buckshot algorithm. It has also been suggested that the LAIR2 algorithm leads to better quality clusters in the data.

3.3.1 Projections for Efficient Document Clustering.

One of the challenges of the scatter-gather algorithm is that, even though the algorithm is designed to balance the running times of the agglomerative and partitioning phases quite well, it sometimes suffers a slowdown in large document collections because of the massive number of distinct terms that a given cluster centroid may contain. Recall that a cluster centroid in the scatter-gather algorithm is defined as the concatenation of all the documents in that cluster. When the number of documents in the cluster is large, this leads to a large number of distinct terms in the centroid, which in turn slows down a number of critical computations such as similarity calculations between documents and cluster centroids. An interesting solution to this problem has been proposed in [83]. The idea is to use the concept of projection in order to reduce the dimensionality of the document representation. Such a reduction in dimensionality leads to significant speedups, because the similarity computations become much more efficient. The work in [83] proposes three kinds of projections:

Global Projection: In global projection, the dimensionality of the original data set is reduced in order to remove the least important (weighted) terms from the data. The weight of a term is defined as the aggregate of the (normalized and damped) frequencies of the term in the documents.

Local Projection: In local projection, the dimensionality of the documents in each cluster is reduced with a locally specific approach for that cluster. Thus, the terms in each cluster centroid are truncated separately. Specifically, the least-weight terms in the different cluster centroids are removed. Thus, the terms removed from each centroid may be different, depending upon their local importance.

Latent Semantic Indexing: In this case, the document space is transformed with an LSI technique, and the clustering is applied to the transformed document space. We note that the LSI technique can also be applied either globally to the whole document collection, or locally to each cluster if desired.

It has been shown in [83] that the projection approaches provide competitive results in terms of effectiveness while retaining an extremely high level of efficiency with respect to all the competing approaches. In this sense, the clustering methods are different from similarity search, because they show little degradation in quality when projections are performed.
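To make the projection idea concrete, the following is a minimal sketch of global and local term truncation for centroid-based document clustering. It is an illustration only, not the exact procedure of [83]: the simple frequency-based weighting and the number of retained terms are assumptions.

```python
# A minimal sketch of term truncation ("projection") for centroid-based
# clustering. X is a sparse n x d document-term matrix; a centroid is a
# dense d-dimensional term-weight vector. The number of retained terms is
# an assumed parameter.
import numpy as np
from scipy.sparse import csr_matrix

def global_projection(X, num_terms):
    """Keep only the globally heaviest terms (columns) of X."""
    term_weights = np.asarray(X.sum(axis=0)).ravel()      # aggregate weight per term
    keep = np.sort(np.argsort(term_weights)[-num_terms:])
    return X[:, keep], keep

def local_projection(centroid, num_terms):
    """Keep only the heaviest terms of a single cluster centroid."""
    centroid = np.asarray(centroid, dtype=float).copy()
    if np.count_nonzero(centroid) > num_terms:
        cutoff = np.sort(centroid)[-num_terms]            # weight of the num_terms-th heaviest term
        centroid[centroid < cutoff] = 0.0
    return centroid

# Example: project a tiny random corpus down to its 100 heaviest terms.
X = csr_matrix(np.random.rand(5, 300))
X_reduced, kept_terms = global_projection(X, 100)
```

In practice, the truncation is applied to the centroids (local projection) or to the corpus representation (global projection) before the similarity computations of each scatter-gather phase, which is where the speedup comes from.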
One of the reasons for this robustness is that clustering is a much coarser-grained application than similarity search, and therefore there is no perceptible difference in quality even when we work with a truncated feature space.

4. Word and Phrase-based Clustering

Since text documents are drawn from an inherently high-dimensional domain, it can be useful to view the problem in a dual way, in which important clusters of words may be found and utilized for finding clusters of documents. In a corpus containing d terms and n documents, one may view a term-document matrix as an n × d matrix, in which the (i, j)th entry is the frequency of the jth term in the ith document. This matrix is extremely sparse, since a given document contains an extremely small fraction of the universe of words. We note that the problem of clustering rows in this matrix is that of clustering documents, whereas that of clustering columns is that of clustering words. In reality, the two problems are closely related, as good clusters of words may be leveraged in order to find good clusters of documents and vice-versa. For example, the work in [16] determines frequent itemsets of words in the document collection, and uses them to determine compact clusters of documents. This is somewhat analogous to the use of clusters of words [87] for determining clusters of documents. The most general technique for simultaneous word and document clustering is referred to as co-clustering [30, 31]. This approach simultaneously clusters the rows and columns of the term-document matrix, in order to create such clusters. This can also be considered equivalent to the problem of re-ordering the rows and columns of the term-document matrix so as to create dense rectangular blocks of non-zero entries in this matrix. In some cases, the ordering information among words may be used in order to determine good clusters. The work in [103] determines the frequent phrases in the collection and leverages them in order to determine document clusters.

It is important to understand that word clustering and document clustering are essentially dual problems which are closely related to one another. The former is related to dimensionality reduction, whereas the latter is related to traditional clustering. The boundary between the two problems is quite fluid, because good word clusters provide hints for finding good document clusters and vice-versa. For example, a more general probabilistic framework which determines word clusters and document clusters simultaneously is referred to as topic modeling [49]. Topic modeling is a more general framework than either clustering or dimensionality reduction. We will introduce the method of topic modeling in a later section of this chapter. A more detailed treatment is also provided in the next chapter of this book, which is on dimensionality reduction, and in Chapter 8, where a more general discussion of probabilistic models for text mining is given.

4.1 Clustering with Frequent Word Patterns

Frequent pattern mining [8] is a technique which has been widely used in the data mining literature in order to determine the most relevant patterns in transactional data. The clustering approach in [16] is designed on the basis of such frequent pattern mining algorithms. A frequent itemset in the context of text data is also referred to as a frequent term set, because we are dealing with documents rather than transactions.
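To make this correspondence concrete, the following is a small illustrative sketch of treating documents as "transactions" of distinct terms, so that frequent term sets can be mined with ordinary itemset-counting techniques. The toy corpus, the minimum support, and the restriction to term sets of size at most two are assumptions made purely for illustration.

```python
# A small sketch of treating documents as "transactions" of distinct terms,
# so that frequent term sets can be mined by simple support counting.
# The toy corpus, min_support and max_size values are illustrative assumptions.
from itertools import combinations
from collections import Counter

docs = ["stars and planets", "planets orbit stars",
        "stock market prices", "market prices fall"]
transactions = [frozenset(d.split()) for d in docs]   # one term set per document

def frequent_term_sets(transactions, min_support=2, max_size=2):
    counts = Counter()
    for terms in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(terms), size):
                counts[frozenset(itemset)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

print(frequent_term_sets(transactions))
# e.g. frozenset({'planets', 'stars'}) and frozenset({'market', 'prices'})
# appear with support 2 and already suggest two document groups.
```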
The main idea of the approach is to not cluster the high-dimensional document data set directly, but to consider low-dimensional frequent term sets as cluster candidates. This essentially means that a frequent term set is a description of a cluster which corresponds to all the documents containing that frequent term set. Since a frequent term set can be considered a description of a cluster, a set of carefully chosen frequent term sets can be considered a clustering. The appropriate choice of this set of frequent term sets is defined on the basis of the overlaps between the supporting documents of the different frequent term sets. The notion of clustering defined in [16] does not necessarily use a strict partitioning in order to define the clusters of documents, but allows a certain level of overlap. This is a natural property of many term- and phrase-based clustering algorithms, because one does not directly control the assignment of documents to clusters during the algorithm execution. Allowing some level of overlap between clusters may sometimes be more appropriate, because it recognizes the fact that documents are complex objects and it is impossible to cleanly partition documents into specific clusters, especially when some of the clusters are partially related to one another. The clustering definition of [16] assumes that each document is covered by at least one frequent term set.

Let R be the set of chosen frequent term sets which define the clustering. Let f_i be the number of frequent term sets in R which are contained in the ith document. The value of f_i is at least one in order to ensure complete coverage, but we would otherwise like it to be as low as possible in order to minimize overlap. Therefore, we would like the average value of (f_i − 1) for the documents in a given cluster to be as low as possible. We can compute the average value of (f_i − 1) for the documents in the cluster and try to pick frequent term sets such that this value is as low as possible. However, such an approach would tend to favor frequent term sets containing very few terms. This is because if a term set contains m terms, then all subsets of it would also be covered by the document, as a result of which this standard overlap measure would be increased. For this reason, an entropy-based measure of overlap is used instead. The entropy overlap of a given cluster candidate is essentially the sum of the values of −(1/f_i) · log(1/f_i) over all documents in the cluster. This value is 0 when each document has f_i = 1, and increases monotonically with increasing f_i values.

It then remains to describe how the frequent term sets are selected from the collection. Two algorithms are described in [16], one of which corresponds to a flat clustering, and the other to a hierarchical clustering. We first describe the method for flat clustering. Clearly, the search space of frequent term sets is exponential, and therefore a reasonable solution is to utilize a greedy algorithm to select the frequent term sets. In each iteration of the greedy algorithm, we pick the frequent term set with a cover having the minimum overlap with the other cluster candidates; a sketch of this greedy selection is given below. The documents covered by the selected frequent term set are removed from the database, and the overlap in the next iteration is computed with respect to the remaining documents. The hierarchical version of the algorithm is similar to the broad idea in flat clustering, with the main difference that each level of the clustering is applied to a set of term sets containing a fixed number k of terms.
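The following is a minimal sketch of the flat greedy selection just described, under the assumption that the candidate frequent term sets (and the sets of documents covering them) have already been mined; the exact bookkeeping of the published algorithm is not reproduced.

```python
# A sketch of the flat greedy selection of cluster candidates using an
# entropy-based overlap measure. `candidates` maps each frequent term set
# (a frozenset of terms) to the set of ids of documents that contain it;
# mining these candidates is assumed to have been done already.
import math

def entropy_overlap(cover, f):
    """Sum of -(1/f_d) * log(1/f_d) over the documents d in a candidate's
    cover, where f_d counts how many remaining candidates cover d."""
    return sum(-(1.0 / f[d]) * math.log(1.0 / f[d]) for d in cover if f.get(d, 0) > 0)

def greedy_select(candidates, num_docs):
    candidates = dict(candidates)            # work on a copy
    remaining = set(range(num_docs))
    chosen = []
    while remaining and candidates:
        # f[d]: number of still-unchosen candidates covering document d
        f = {d: 0 for d in remaining}
        for cover in candidates.values():
            for d in cover & remaining:
                f[d] += 1
        # pick the candidate whose (remaining) cover has minimum entropy overlap
        best = min(candidates,
                   key=lambda ts: entropy_overlap(candidates[ts] & remaining, f))
        chosen.append(best)
        remaining -= candidates.pop(best)    # covered documents are removed
    return chosen
```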
In the hierarchical version, we thus work only with frequent patterns of length k for the selection process at a given level. The resulting clusters are then further partitioned by applying the approach to (k + 1)-term sets. For further partitioning a given cluster, we use only those (k + 1)-term sets which contain the frequent k-term set defining that cluster. More details of the approach may be found in [16].

4.2 Leveraging Word Clusters for Document Clusters

A two-phase clustering procedure is discussed in [87], which uses the following steps to perform document clustering:

In the first phase, we determine word clusters from the documents in such a way that most of the mutual information between words and documents is preserved when we represent the documents in terms of word clusters rather than words.

In the second phase, we use the condensed representation of the documents in terms of word clusters in order to perform the final document clustering. Specifically, we replace the word occurrences in documents with word-cluster occurrences in order to perform the document clustering.

One advantage of this two-phase procedure is the significant reduction in the noise in the representation. Let X = x_1 ... x_n be the random variables corresponding to the rows (documents), and let Y = y_1 ... y_d be the random variables corresponding to the columns (words). We would like to partition X into k clusters, and Y into l clusters. Let the clusters be denoted by ˆX = ˆx_1 ... ˆx_k and ˆY = ˆy_1 ... ˆy_l. In other words, we wish to find the maps C_X and C_Y which define the clustering:

C_X : x_1 ... x_n → ˆx_1 ... ˆx_k
C_Y : y_1 ... y_d → ˆy_1 ... ˆy_l

In the first phase of the procedure, we cluster Y to ˆY, so that most of the information in I(X, Y) is preserved in I(X, ˆY). In the second phase, we perform the clustering again from X to ˆX using exactly the same procedure, so that as much information as possible from I(X, ˆY) is preserved in I(ˆX, ˆY). Details of how each phase of the clustering is performed are provided in [87].

How to discover interesting word clusters (which can be leveraged for document clustering) has itself attracted attention in the natural language processing research community, with particular interest in discovering word clusters that can characterize word senses [34] or a semantic concept [21]. In [34], for example, the Markov clustering algorithm was applied to discover corpus-specific word senses in an unsupervised way. Specifically, a word association graph is first constructed in which related words are connected with an edge. For a given word that potentially has multiple senses, we can then isolate the subgraph representing its neighbors. These neighbors are expected to form clusters according to different senses of the target word; thus, by grouping together neighbors that are well connected with each other, we can discover word clusters that characterize different senses of the target word. In [21], an n-gram class language model was proposed to cluster words based on minimizing the loss of mutual information between adjacent words, which can achieve the effect of grouping together words that share similar context in natural language text.

4.3 Co-clustering Words and Documents

In many cases, it is desirable to simultaneously cluster the rows and columns of the contingency table, and to explore the interplay between word clusters and document clusters during the clustering process.
Since the clusters among words and documents are clearly related, it is often desirable to cluster both simultaneously, even when clusters are required along only one of the two dimensions. Such an approach is referred to as co-clustering [30, 31]. Co-clustering is defined as a pair of maps from rows to row-cluster indices and from columns to column-cluster indices. These maps are determined simultaneously by the algorithm in order to optimize the corresponding cluster representations. We further note that the matrix factorization approach [58] discussed earlier in this chapter can naturally be used for co-clustering, because it discovers word clusters and document clusters simultaneously. In that section, we also discussed how matrix factorization can be viewed as a co-clustering technique. While matrix factorization has not been widely used as a technique for co-clustering, we point out this natural connection as a possible direction for future comparison with other co-clustering methods. Some recent work [60] has shown how matrix factorization can be used in order to transform knowledge from word space to document space in the context of document clustering techniques.

The problem of co-clustering is also closely related to the problems of subspace clustering [7] and projected clustering [5] for quantitative data in the database literature. In these problems, the data is clustered by simultaneously associating it with a set of points and subspaces in multi-dimensional space. The concept of co-clustering is a natural application of this broad idea to data domains which can be represented as sparse high-dimensional matrices in which most of the entries are 0. Therefore, traditional methods for subspace clustering can also be extended to the problem of co-clustering. For example, an adaptive iterative subspace clustering method for documents was proposed in [59]. We note that subspace clustering or co-clustering can be considered a form of local feature selection, in which the features selected are specific to each cluster. A natural question arises as to whether the features can be selected as a linear combination of dimensions, as in the case of traditional dimensionality reduction techniques such as PCA [53]. This is also known as local dimensionality reduction [22] or generalized projected clustering [6] in the traditional database literature. In this method, PCA-based techniques are used in order to generate subspace representations which are specific to each cluster, and these are leveraged in order to achieve a better clustering. In particular, such an approach has recently been designed [32], and has been shown to work well with document data. In this section, we will study two well-known methods for document co-clustering which are commonly used in the document clustering literature. One of these methods uses graph-based term-document representations [30] and the other uses information theory [31]. We discuss both of these methods below.

4.3.1 Co-clustering with graph partitioning.

The core idea in this approach [30] is to represent the term-document matrix as a bipartite graph G = (V_1 ∪ V_2, E), where V_1 and V_2 represent the vertex sets in the two bipartite portions of this graph, and E represents the edge set. Each node in V_1 corresponds to one of the n documents, and each node in V_2 corresponds to one of the d terms. An undirected edge exists between node i ∈ V_1 and node j ∈ V_2 if document i contains the term j.
We note that there are no edges in E directly between terms, or directly between documents; therefore, the graph is bipartite. The weight of each edge is the corresponding normalized term frequency. We note that a word partitioning in this bipartite graph induces a document partitioning and vice-versa. Given a partitioning of the documents in this graph, we can associate each word with the document cluster to which it is connected with the greatest edge weight. Note that this criterion also minimizes the weight of the edges across the partitions. Similarly, given a word partitioning, we can associate each document with the word partition to which it is connected with the greatest edge weight. Therefore, a natural solution to this problem is to simultaneously perform a k-way partitioning of this graph which minimizes the total weight of the edges across the partitions. This is of course a classical problem in the graph partitioning literature. In [30], it has been shown how a spectral partitioning algorithm can be used effectively for this purpose. Another method discussed in [75] uses an isometric bipartite graph-partitioning approach for the clustering process.

4.3.2 Information-Theoretic Co-clustering.

In [31], the optimal clustering has been defined to be one which maximizes the mutual information between the clustered random variables. The normalized non-negative contingency table is treated as a joint probability distribution between two discrete random variables which take values over rows and columns. Let X = x_1 ... x_n be the random variables corresponding to the rows, and let Y = y_1 ... y_d be the random variables corresponding to the columns. We would like to partition X into k clusters, and Y into l clusters. Let the clusters be denoted by ˆX = ˆx_1 ... ˆx_k and ˆY = ˆy_1 ... ˆy_l. In other words, we wish to find the maps C_X and C_Y which define the clustering:

C_X : x_1 ... x_n → ˆx_1 ... ˆx_k
C_Y : y_1 ... y_d → ˆy_1 ... ˆy_l

The partition functions C_X and C_Y are allowed to depend on the joint probability distribution p(X, Y). We note that since ˆX and ˆY are higher-level clusterings of X and Y, there is a loss of mutual information in the higher-level representations. In other words, the distribution p(ˆX, ˆY) contains less information than p(X, Y), and the mutual information I(ˆX, ˆY) is lower than the mutual information I(X, Y). Therefore, the optimal co-clustering problem is to determine the mapping which minimizes the loss in mutual information. In other words, we wish to find a co-clustering for which I(X, Y) − I(ˆX, ˆY) is as small as possible. An iterative algorithm for finding a co-clustering which minimizes the mutual information loss is proposed in [29].

4.4 Clustering with Frequent Phrases

One of the key differences of this method from other text clustering methods is that it treats a document as a string as opposed to a bag of words. Specifically, each document is treated as a string of words, rather than characters. The main difference between the string representation and the bag-of-words representation is that the former also retains ordering information for the clustering process. As is the case with many clustering methods, the approach uses an indexing method in order to organize the phrases in the document collection, and then uses this organization to create the clusters [103, 104].
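Before detailing the steps of the phrase-based method, the bipartite formulation of co-clustering from the previous subsection can be illustrated with a simplified sketch. The normalization and post-processing choices below follow the generic spectral-relaxation recipe and are assumptions rather than a faithful reproduction of any specific published algorithm.

```python
# A simplified sketch of spectral co-clustering of the bipartite
# document-term graph: normalize the matrix by document and term degrees,
# take leading singular vectors, and run k-means on the stacked embeddings
# of documents and terms. A is assumed to be a dense nonnegative n x d array.
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def spectral_cocluster(A, k):
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1) + 1e-12                    # document degrees
    d2 = A.sum(axis=0) + 1e-12                    # term degrees
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, _, Vt = svds(An, k=k)                      # k leading singular vector pairs
    Z = np.vstack([U / np.sqrt(d1)[:, None],      # document embeddings
                   Vt.T / np.sqrt(d2)[:, None]])  # term embeddings
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
    return labels[:A.shape[0]], labels[A.shape[0]:]

# Example on a tiny random matrix with 2 co-clusters requested.
doc_labels, term_labels = spectral_cocluster(np.random.rand(6, 8), 2)
```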
Several steps are used in order to create the clusters in this phrase-based method:

(1) The first step is to clean the strings representing the documents. A light stemming algorithm is used by deleting word prefixes and suffixes and reducing plural to singular. Sentence boundaries are marked and non-word tokens are stripped.

(2) The second step is the identification of base clusters. These are defined by the frequent phrases in the collection, which are represented in the form of a suffix tree. A suffix tree [45] is essentially a trie which contains all the suffixes of the entire collection. Each node of the suffix tree represents a group of documents and a phrase which is common to all these documents. Since each node of the suffix tree corresponds to a group of documents, it also corresponds to a base clustering. Each base cluster is given a score which is essentially the product of the number of documents in that cluster and a non-decreasing function of the length of the underlying phrase. Therefore, clusters containing a large number of documents, and which are defined by a relatively long phrase, are more desirable.

(3) An important characteristic of the base clusters created by the suffix tree is that they do not define a strict partitioning and have overlaps with one another. For example, the same document may contain multiple phrases in different parts of the suffix tree, and will therefore be included in the corresponding document groups. The third step of the algorithm merges the clusters based on the similarity of their underlying document sets. Let P and Q be the document sets corresponding to two clusters. The base similarity BS(P, Q) is defined as follows:

BS(P, Q) = ⌊ |P ∩ Q| / max{|P|, |Q|} + 0.5 ⌋    (4.11)

This base similarity is either 0 or 1, depending upon whether the two groups have at least 50% of their documents in common. Then, we construct a graph structure in which the nodes represent the base clusters, and an edge exists between two cluster nodes if the corresponding base similarity between that pair of groups is 1. The connected components in this graph define the final clusters. Specifically, the union of the groups of documents in each connected component is used as the final set of clusters. We note that the final set of clusters has much less overlap, but it still does not define a strict partitioning. This is sometimes the case with clustering algorithms in which modest overlaps are allowed to enable better clustering quality.

5. Probabilistic Document Clustering and Topic Models

A popular method for probabilistic document clustering is that of topic modeling. The idea of topic modeling is to create a probabilistic generative model for the text documents in the corpus. The main approach is to represent a corpus as a function of hidden random variables, the parameters of which are estimated using a particular document collection. The primary assumptions in any topic modeling approach (together with the corresponding random variables) are as follows:

The n documents in the corpus are assumed to have a probability of belonging to each of k topics. Thus, a given document may have a probability of belonging to multiple topics, and this reflects the fact that the same document may contain a multitude of subjects. For a given document D_i and a set of topics T_1 ... T_k, the probability that the document D_i belongs to the topic T_j is given by P(T_j|D_i).
We note that the topics are essentially analogous to clusters, and the value of P(T_j|D_i) provides the probability of membership of the ith document in the jth cluster. In non-probabilistic clustering methods, the membership of documents to clusters is deterministic in nature, and therefore the clustering is typically a clean partitioning of the document collection. However, this often creates challenges when there are overlaps in document subject matter across multiple clusters. The use of soft cluster membership in terms of probabilities is an elegant solution to this dilemma. In this scenario, the determination of the membership of the documents to clusters is a secondary goal to that of finding the latent topical clusters in the underlying text collection. Therefore, this area of research is referred to as topic modeling, and while it is related to the clustering problem, it is often studied as a distinct area of research from clustering. The value of P(T_j|D_i) is estimated using the topic modeling approach, and is one of the primary outputs of the algorithm. The value of k is one of the inputs to the algorithm and is analogous to the number of clusters.

Each topic is associated with a probability vector, which quantifies the probability of the different terms in the lexicon for that topic. Let t_1 ... t_d be the d terms in the lexicon. Then, for a document that belongs completely to topic T_j, the probability that the term t_l occurs in it is given by P(t_l|T_j). The value of P(t_l|T_j) is another important parameter which needs to be estimated by the topic modeling approach.

Note that the number of documents is denoted by n, the number of topics by k, and the lexicon size (number of terms) by d. Most topic modeling methods attempt to learn the above parameters using maximum likelihood methods, so that the probabilistic fit to the given corpus of documents is as large as possible. There are two basic methods used for topic modeling: Probabilistic Latent Semantic Indexing (PLSI) [49] and Latent Dirichlet Allocation (LDA) [20]. In this section, we will focus on the probabilistic latent semantic indexing method. Note that the random variables P(T_j|D_i) and P(t_l|T_j) allow us to model the probability of a term t_l occurring in any document D_i. Specifically, the probability P(t_l|D_i) of the term t_l occurring in document D_i can be expressed in terms of the afore-mentioned parameters as follows:

P(t_l|D_i) = sum_{j=1}^{k} P(t_l|T_j) · P(T_j|D_i)    (4.12)

Thus, for each term t_l and document D_i, we can generate an n × d matrix of probabilities in terms of these parameters, where n is the number of documents and d is the number of terms. For a given corpus, we also have the n × d term-document occurrence matrix X, which tells us which terms actually occur in each document, and how many times each term occurs. In other words, X(i, l) is the number of times that term t_l occurs in document D_i. Therefore, we can use a maximum likelihood estimation approach which maximizes the product of the probabilities of the terms that are observed in each document in the entire collection. The logarithm of this objective can be expressed as a weighted sum of the logarithms of the terms in Equation 4.12, where the weight of the (i, l)th term is its frequency count X(i, l).
This is a constrained optimization problem which maximizes the log-likelihood sum_{i,l} X(i, l) · log(P(t_l|D_i)), subject to the constraints that the probability values over each of the topic-document and term-topic spaces must sum to 1:

sum_{l} P(t_l|T_j) = 1   for each topic T_j    (4.13)
sum_{j} P(T_j|D_i) = 1   for each document D_i    (4.14)

The value of P(t_l|D_i) in the objective function is expanded and expressed in terms of the model parameters with the use of Equation 4.12. We note that a Lagrangian method can be used to solve this constrained problem. This is quite similar to the approach that we discussed for the non-negative matrix factorization problem in this chapter. The Lagrangian solution essentially leads to a set of iterative update equations for the corresponding parameters which need to be estimated. It can be shown that these parameters can be estimated [49] with the iterative update of two matrices [P1]_{k×n} and [P2]_{d×k} containing the topic-document probabilities and term-topic probabilities respectively. We start off by initializing these matrices randomly, and normalize each of them so that the probability values in their columns sum to one. Then, we iteratively perform the following steps on P1 and P2 respectively:

for each entry (j, i) in P1 do
    update P1(j, i) ← P1(j, i) · sum_{r=1}^{d} [ P2(r, j) · X(i, r) / ( sum_{v=1}^{k} P1(v, i) · P2(r, v) ) ]
Normalize each column of P1 to sum to 1;
for each entry (l, j) in P2 do
    update P2(l, j) ← P2(l, j) · sum_{q=1}^{n} [ P1(j, q) · X(q, l) / ( sum_{v=1}^{k} P1(v, q) · P2(l, v) ) ]
Normalize each column of P2 to sum to 1;

The process is iterated to convergence. The output of this approach consists of the two matrices P1 and P2, the entries of which provide the topic-document and term-topic probabilities respectively.

The second well-known method for topic modeling is that of Latent Dirichlet Allocation (LDA). In this method, the term-topic probabilities and topic-document probabilities are modeled with a Dirichlet distribution as a prior. Thus, the LDA method is the Bayesian version of the PLSI technique. It can also be shown that the PLSI method is equivalent to the LDA technique when applied with a uniform Dirichlet prior [42]. The method of LDA was first introduced in [20]. Subsequently, it has generally been used much more extensively than the PLSI method. Its main advantage over the PLSI method is that it is not quite as susceptible to overfitting. This is generally true of Bayesian methods which reduce the number of model parameters to be estimated, and which therefore work much better for smaller data sets. Even for larger data sets, PLSI has the disadvantage that the number of model parameters grows linearly with the size of the collection. It has been argued [20] that the PLSI model is not a fully generative model, because there is no accurate way to model the topical distribution of a document which is not included in the current data set. For example, one can use the current set of topical distributions to perform the modeling of a new document, but this is likely to be much more inaccurate because of the overfitting inherent in PLSI. A Bayesian model, which uses a small number of parameters in the form of a well-chosen prior distribution, such as a Dirichlet, is likely to be much more robust in modeling new documents. Thus, the LDA method can also be used in order to model the topic distribution of a new document more robustly, even if it is not present in the original data set.
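The iterative updates described above for P1 and P2 can be written compactly in matrix form. The following is a minimal sketch; the fixed number of iterations and the random initialization are assumptions, and a practical implementation would monitor the log-likelihood for convergence.

```python
# A minimal sketch of the iterative PLSI-style updates described above.
# P1 is k x n (topic-document), P2 is d x k (term-topic), and X is the
# n x d term-frequency matrix.
import numpy as np

def plsi(X, k, iters=50, seed=0):
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    P1 = rng.random((k, n)); P1 /= P1.sum(axis=0, keepdims=True)
    P2 = rng.random((d, k)); P2 /= P2.sum(axis=0, keepdims=True)
    for _ in range(iters):
        denom = P2 @ P1 + 1e-12      # d x n matrix of P(t_l | D_i), cf. Equation 4.12
        R = X.T / denom              # entry (l, i) is X(i, l) / P(t_l | D_i)
        P1 *= P2.T @ R               # P1(j, i) *= sum_r P2(r, j) * X(i, r) / sum_v P1(v, i) P2(r, v)
        P1 /= P1.sum(axis=0, keepdims=True)
        denom = P2 @ P1 + 1e-12
        R = X.T / denom
        P2 *= R @ P1.T               # P2(l, j) *= sum_q P1(j, q) * X(q, l) / sum_v P1(v, q) P2(l, v)
        P2 /= P2.sum(axis=0, keepdims=True)
    return P1, P2

# Example: X = np.random.randint(0, 3, size=(20, 50)); P1, P2 = plsi(X, k=4)
```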
Despite the theoretical advantages of LDA over PLSI, a recent study has shown that their task performance in clustering, categorization and retrieval tends to be similar [63]. The area of topic models is quite vast, and will be treated in more depth in Chapter 5 and Chapter 8 of this book; the purpose of this section is simply to acquaint the reader with the basics of this area and its natural connection to clustering. We note that the EM concepts which are used for topic modeling are quite general, and can be used for different variations on the text clustering task, such as text classification [72] or incorporating user feedback into clustering [46]. For example, the work in [72] uses an EM approach in order to perform supervised clustering (and classification) of the documents when a mixture of labeled and unlabeled data is available. A more detailed discussion is provided in Chapter 6 on text classification.

6. Online Clustering with Text Streams

The problem of streaming text clustering is particularly challenging because the clusters need to be continuously maintained in real time. One of the earliest methods for streaming text clustering was proposed in [112]. This technique is referred to as the Online Spherical k-Means Algorithm (OSKM), which reflects the broad approach used by the methodology. This technique divides up the incoming stream into small segments, each of which can be processed effectively in main memory. A set of k-means iterations is applied to each such data segment in order to cluster it. The advantage of using a segment-wise approach for clustering is that since each segment can be held in main memory, we can process each data point multiple times as long as it is held there. In addition, the centroids from the previous segment are used in the next iteration for clustering purposes. A decay factor is introduced in order to age out the old documents, so that the new documents are considered more important from a clustering perspective. This approach has been shown to be extremely effective in clustering massive text streams in [112].

A different method for clustering massive text and categorical data streams is discussed in [3]. The method discussed in [3] uses an approach which examines the relationship between outliers, emerging trends, and clusters in the underlying data. Old clusters may become inactive, and eventually get replaced by new clusters. Similarly, when newly arriving data points do not naturally fit in any particular cluster, they need to be initially classified as outliers. However, as time progresses, these new points may create a distinctive pattern of activity which can be recognized as a new cluster. The temporal locality of the data stream is manifested by these new clusters. For example, the first web page belonging to a particular category in a crawl may be recognized as an outlier, but may later form a cluster of documents of its own. On the other hand, the new outliers may not necessarily result in the formation of new clusters. Such outliers are true short-term abnormalities in the data, since they do not result in the emergence of sustainable patterns. The approach discussed in [3] recognizes new clusters by first recognizing them as outliers. This approach works with the use of a summarization methodology, in which the concept of condensed droplets [3] is used in order to create concise representations of the underlying clusters.
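A minimal sketch of the segment-wise online spherical k-means idea described above is shown below. The segment size, the decay value, and the single assignment pass per segment are illustrative assumptions rather than the exact design of the published algorithm.

```python
# A sketch, in the spirit of online spherical k-means: the stream is consumed
# in small segments, unit-normalized documents are assigned to the nearest
# (cosine) centroid, and old centroid mass is decayed as each segment arrives.
import numpy as np

def online_spherical_kmeans(segments, k, dim, decay=0.9, seed=0):
    rng = np.random.default_rng(seed)
    C = rng.random((k, dim))
    C /= np.linalg.norm(C, axis=1, keepdims=True)            # unit-norm centroids
    for segment in segments:                                  # each segment: m x dim array
        S = segment / (np.linalg.norm(segment, axis=1, keepdims=True) + 1e-12)
        assign = np.argmax(S @ C.T, axis=1)                   # nearest centroid by cosine
        for j in range(k):
            members = S[assign == j]
            if len(members):
                C[j] = decay * C[j] + members.sum(axis=0)     # age out old mass, add new
        C /= np.linalg.norm(C, axis=1, keepdims=True) + 1e-12
    return C
```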
As in the case of the OSKM algorithm, the method in [3] ensures that recent data points are given greater importance than older data points. This is achieved by creating a time-sensitive weight for each data point. It is assumed that each data point has a time-dependent weight defined by the function f(t), which is also referred to as the fading function. The fading function f(t) is a monotonically decreasing function of time t. The half life t_0 of a data point is the time at which its weight has decayed to half of its initial value, i.e., f(t_0) = (1/2)·f(0). The aim of defining a half life is to quantify the rate of decay of the importance of each data point in the stream clustering process. The decay rate is defined as the inverse of the half life of the data stream, and is denoted by λ = 1/t_0. The weight function of each point in the data stream is defined as f(t) = 2^(−λ·t). From the perspective of the clustering process, the weight of each data point is f(t). It is easy to see that this decay function creates a half life of 1/λ. It is also evident that by changing the value of λ, it is possible to change the rate at which the importance of the historical information in the data stream decays.

When a cluster is created during the streaming process by a newly arriving data point, it is allowed to remain as a trend-setting outlier for at least one half life. During that period, if at least one more data point arrives, then the cluster becomes an active and mature cluster. On the other hand, if no new points arrive during a half life, then the trend-setting outlier is recognized as a true anomaly in the data stream. At this point, this anomaly is removed from the list of current clusters. We refer to the process of removal as cluster death. Thus, a new cluster containing one data point dies when the (weighted) number of points in the cluster is 0.5. The same criterion is used to define the death of mature clusters. A necessary condition for this criterion to be met is that the inactivity period in the cluster has exceeded the half life 1/λ. The greater the number of points in the cluster, the greater the level by which the inactivity period would need to exceed its half life in order to meet the criterion. This is a natural solution, since it is intuitively desirable to have stronger requirements (a longer inactivity period) for the death of a cluster containing a larger number of points.

The statistics of the data points are captured in summary statistics, which are referred to as condensed droplets. These represent the word distributions within a cluster, and can be used in order to compute the similarity of an incoming data point to the cluster. The overall algorithm proceeds as follows. At the beginning of algorithmic execution, we start with an empty set of clusters. As new data points arrive, unit clusters containing individual data points are created. Once a maximum number k of such clusters has been created, we can begin the process of online cluster maintenance. Thus, we initially start off with a trivial set of k clusters. These clusters are updated over time with the arrival of new data points. When a new data point X arrives, its similarity to each cluster droplet is computed. In the case of text data sets, the cosine similarity between the word-frequency statistics of the droplet and the incoming document X is used. The similarity value S(X, C_j) is computed from the incoming document X to every cluster C_j. The cluster with the maximum value of S(X, C_j) is chosen as the relevant cluster for data insertion. Let us assume that this cluster is C_mindex.
We use a threshold denoted by thresh in order to determine whether the incoming data point is an outlier. If the value of S(X, C_mindex) is larger than the threshold thresh, then the point X is assigned to the cluster C_mindex. Otherwise, we check whether some inactive cluster exists in the current set of cluster droplets. If no such inactive cluster exists, then the data point X is added to C_mindex anyway. On the other hand, when an inactive cluster does exist, a new cluster is created containing the solitary data point X. This newly created cluster replaces the inactive cluster. We note that this new cluster is either a potential true outlier or the beginning of a new trend of data points. Further understanding of this new cluster may only be obtained with the progress of the data stream. In the event that X is inserted into the cluster C_mindex, we update the statistics of the cluster in order to reflect the insertion of the data point, along with the temporal decay statistics. Otherwise, we replace the most inactive cluster by a new cluster containing the solitary data point X. In particular, the replaced cluster is the least recently updated cluster among all inactive clusters. This process is continuously performed over the life of the data stream, as new documents arrive over time. The work in [3] also presents a variety of other applications of the stream clustering technique, such as evolution and correlation analysis.

A different way of utilizing the temporal evolution of text documents in the clustering process is described in [48]. Specifically, the work in [48] uses bursty features as markers of new topic occurrences in the data stream. This is because the semantics of an up-and-coming topic are often reflected in the frequent presence of a few distinctive words in the text stream. At a given period in time, the nature of relevant topics could lead to bursts in specific features of the data stream. Clearly, such features are extremely important from a clustering perspective. Therefore, the method discussed in [48] uses a new representation, which is referred to as the bursty feature representation, for mining text streams. In this representation, a time-varying weight is associated with each feature depending upon its burstiness. This also reflects the varying importance of the feature to the clustering process. Thus, it is important to remember that a particular document representation is dependent upon the particular instant in time at which it is constructed. Another issue which is handled effectively in this approach is an implicit reduction in the dimensionality of the underlying collection. Text is inherently a high-dimensional data domain, and the pre-selection of some of the features on the basis of their burstiness can be a natural way to reduce the dimensionality of the document representation. This can help both the effectiveness and the efficiency of the underlying algorithm.

The first step in the process is to identify the bursty features in the data stream. In order to achieve this goal, the approach uses Kleinberg’s 2-state finite automaton model [57]. Once these features have been identified, the bursty features are associated with weights which depend upon their level of burstiness. Subsequently, a bursty feature representation is defined in order to reflect the underlying weight of each feature. Both the identification and the weight of a bursty feature depend upon its underlying frequency.
A standard k-means approach is applied to the new representation in order to construct the clustering. It was shown in [48] that the approach of using burstiness improves the cluster quality. One criticism of the work in [48] is that it is mostly focused on the issue of improving effectiveness with the use of temporal characteristics of the data stream, and does not address the issue of efficient clustering of the underlying data stream.

In general, it is evident that feature extraction is important for all clustering algorithms. While the work in [48] focuses on using temporal characteristics of the stream for feature extraction, the work in [61] focuses on using phrase extraction for effective feature selection. This work is also related to the concept of topic modeling, which was discussed in detail in the previous section, because the different topics in a collection can be related to the clusters in a collection. The work in [61] uses topic-modeling techniques for clustering. The core idea in the work of [61] is that individual words are not very effective for a clustering algorithm because they miss the context in which the word is used. For example, the word “star” may either refer to a celestial body or to an entertainer. On the other hand, when the phrase “fixed star” is used, it becomes evident that the word “star” refers to a celestial body. The phrases which are extracted from the collection are also referred to as topic signatures. The use of such phrasal clarification for improving the quality of the clustering is referred to as semantic smoothing, because it reduces the noise which is associated with semantic ambiguity. Therefore, a key part of the approach is to extract phrases from the underlying data stream. After phrase extraction, the training process determines a translation probability of the phrase to terms in the vocabulary. For example, the word “planet” may have a high probability of association with the phrase “fixed star”, because both refer to celestial bodies. Therefore, for a given document, a rational probability count may also be assigned to all terms. For each document, it is assumed that all terms in it are generated either by a topic-signature model or by a background collection model.

The approach in [61] works by modeling the soft probability p(w|C_j) for word w and cluster C_j. The probability p(w|C_j) is modeled as a linear combination of two factors: (a) a maximum likelihood model which computes the probabilities of generating specific words for each cluster; (b) an indirect (translated) word-membership probability which first determines the maximum likelihood probability for each topic signature, and then multiplies it with the conditional probability of each word, given the topic signature. We note that we can use p(w|C_j) in order to estimate p(d|C_j) by using the product over the constituent words of the document. For this purpose, we use the frequency f(w, d) of word w in document d:

p(d|C_j) = prod_{w ∈ d} p(w|C_j)^{f(w,d)}    (4.15)

We note that in the static case, it is also possible to add a background model in order to improve the robustness of the estimation process. This is however not possible in a data stream, because the background collection model may require multiple passes to build effectively. The work in [61] maintains these probabilities in an online fashion with the use of a cluster profile that weights the probabilities with a fading function.
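The following is a minimal sketch of such a fading cluster profile: time-decayed word counts are maintained per cluster and used to estimate p(w|C_j) and, via Equation 4.15, p(d|C_j). The decay rate and the Laplace-style smoothing constant are assumptions, and the phrase/topic-signature translation component of [61] is not reproduced here.

```python
# A minimal sketch of a decayed ("fading") cluster profile: time-weighted
# word counts are maintained per cluster and used to estimate p(w|C_j).
import math
from collections import defaultdict

class FadingClusterProfile:
    def __init__(self, decay_rate, vocab_size):
        self.lam = decay_rate              # lambda = 1 / half-life
        self.V = vocab_size
        self.counts = defaultdict(float)   # decayed word counts
        self.total = 0.0
        self.last_update = 0.0

    def _decay_to(self, t):
        factor = 2.0 ** (-self.lam * (t - self.last_update))
        for w in self.counts:
            self.counts[w] *= factor
        self.total *= factor
        self.last_update = t

    def add_document(self, word_freqs, t):
        self._decay_to(t)
        for w, f in word_freqs.items():
            self.counts[w] += f
            self.total += f

    def word_prob(self, w, smoothing=1.0):
        return (self.counts[w] + smoothing) / (self.total + smoothing * self.V)

    def log_doc_prob(self, word_freqs):
        # log p(d|C_j) = sum_w f(w, d) * log p(w|C_j)   (cf. Equation 4.15)
        return sum(f * math.log(self.word_prob(w)) for w, f in word_freqs.items())
```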
We note that the concept of a cluster profile is analogous to the concept of the condensed droplet introduced in [3]. The key algorithm (denoted OCTS) maintains a dynamic set of clusters into which documents are progressively assigned with the use of similarity computations. It has been shown in [61] how the cluster profile can be used in order to efficiently compute p(d|C_j) for each incoming document. This value is then used to determine the similarity of the documents to the different clusters, and to assign the documents to their closest cluster. We note that the methods in [3, 61] share a number of similarities in terms of (a) maintenance of cluster profiles, (b) use of cluster profiles (or condensed droplets) to compute similarity and assign documents to their most similar clusters, and (c) the rules used to decide when a new singleton cluster should be created, or when one of the older clusters should be replaced. The main difference between the two algorithms is the technique which is used to compute cluster similarity. The OCTS algorithm uses the probabilistic computation p(d|C_j) to compute cluster similarity, which takes the phrasal information into account during the computation process. One observation about OCTS is that it may allow very similar clusters to co-exist in the current set. This reduces the space available for distinct cluster profiles. A second algorithm called OCTSM is also proposed in [61], which allows for the merging of very similar clusters. Before each assignment, it checks whether pairs of similar clusters can be merged on the basis of similarity. If this is the case, then the similar clusters and their corresponding cluster profiles are merged. Detailed experimental results on the different clustering algorithms and their effectiveness are presented in [61]. A closely related area to clustering is that of topic modeling, which we discussed in an earlier section. Recently, the topic modeling method has also been extended to the dynamic case, which is helpful for topic modeling of text streams [107].

7. Clustering Text in Networks

Many social networks contain both text content in the nodes, as well as links between the different nodes. Clearly, the links provide useful cues in understanding the related nodes in the network. The impact of different link types on the quality of the clustering has been studied in [109], and it has been shown that many forms of implicit and explicit links improve clustering quality, because they encode human knowledge. Therefore, a natural choice is to combine these two factors in the process of clustering the different nodes. In this section, we will discuss a number of such techniques.

In general, links may be considered a kind of side information, which can be represented in the form of attributes. A general approach for incorporating side attributes into the clustering process has been proposed in [1]. This algorithm uses a combination of a k-means approach on the text attributes and Bayesian probability estimations on the side attributes for the clustering process. The idea is to identify those attributes which are helpful for the clustering process, and to use them in order to enhance the quality of the clustering. However, this approach is really designed for general attributes of any kind, rather than link-based attributes, in which an underlying graph structure is implied by the document-to-document linkages.
In spite of this, it has been shown in [1] that it is possible to significantly enhance the quality of clustering by treating linkage information as side attributes. Many other techniques, which will be discussed in this section, have been proposed specifically for the case of text documents which are linked together in a network structure.

The earliest methods for combining text and link information for the clustering process are proposed in [12]. Two different methods were proposed in this work for the clustering process. The first method uses the link information in the neighbors of a node in order to bias the term weights in a document. Term weights which are common between a document and its neighbors are given more importance in the clustering process. One advantage of such an approach is that we can use any of the existing clustering algorithms, because the link information is implicitly encoded in the modified term weights. The second method proposed in [12] is a graph-based approach which directly uses the links in the clustering process. In this case, the approach attempts to model the probability that a particular document belongs to a given cluster for a particular set of links and content. This is essentially a soft clustering, in which a probability of assignment is determined for each cluster. The cluster with the largest probability of assignment is considered the most relevant cluster. A Markov Random Field (MRF) technique is used in order to perform the clustering. An iterative technique called relaxation labeling is used in order to compute the maximum likelihood parameters of this MRF. More details of this approach may be found in [12].

A recent method to perform clustering with both structural and attribute similarities is proposed in [113]. The techniques of this paper can be applied to both relational and text attributes. This paper integrates structural and attribute-based clustering by adding attribute vertices to the network in addition to the original structural vertices. In the context of text data, this implies that a vertex exists for each word in the lexicon. Therefore, in addition to the original set of vertices V in the graph G = (V, E), we now have the augmented vertex set V ∪ V_1, such that V_1 contains one vertex for each word. We also augment the edge set, in order to add to the original set of structural edges E. We add an edge between a structural vertex i ∈ V and an attribute vertex j ∈ V_1 if word j is contained in the node i. This new set of edges is denoted by E_1. Therefore, we now have an augmented graph G_1 = (V ∪ V_1, E ∪ E_1) which is semi-bipartite. A neighborhood random walk model is used in order to determine the closeness of vertices, and this closeness measure is used in order to perform the clustering. The main challenge in the algorithm is to determine the relative importance of the structural and attribute components in the clustering process. In the context of the random walk model, this translates to determining the appropriate weights of different edges during the random walk process. A learning model has been proposed in [113] in order to learn these weights and leverage them for an effective clustering process.

The problem of clustering network content is often encountered in the context of community detection in social networks. The text content in the social network graph may be attached either to the nodes [101] of the network, or to the edges [74].
The node-based approach is generally more common, and most of the afore-mentioned techniques in this chapter can be modeled in terms of content attached to the nodes. In the method proposed in [101], the following link-based and content-based steps are combined for effective community detection:

A conditional model is proposed for link analysis, in which the conditional probability for the destination of a given link is modeled. A hidden variable is introduced in order to capture the popularity of a node in terms of the likelihood of that node being cited by other nodes.

A discriminative content model is introduced in order to reduce the impact of noisy content attributes. In this model, the attributes are weighted by their ability to discriminate between the different communities.

The two models are combined into a unified framework with the use of a two-stage optimization algorithm for maximum likelihood inference. One interesting characteristic of this broad framework is that it can also be used in the context of other complementary approaches. The details of the algorithm are discussed in [101].

For the case of edge-based community detection, it is assumed that the text content in the network is attached to the edges [74]. This is common in applications which involve extensive communication between the different nodes. For example, in email networks or online chat networks, the text in the network is associated with the communications between the different entities. In such cases, the text is associated with an edge in the underlying network. The presence of content associated with edges allows for a much more nuanced approach to community detection, because a given node may participate in communities of different kinds. The content associated with edges helps in separating out these different associations of the same individual with different communities. The work in [74] uses a matrix-factorization methodology in order to jointly model the content and structure for the community detection process. The matrix factorization method is used to transform the representation into a multi-dimensional representation, which can easily be clustered by a simple algorithm such as the k-means algorithm. It was shown in [74] that the use of such an approach can provide much more effective results than a pure content- or link-based clustering methodology.

A closely related area to clustering is that of topic modeling, in which we attempt to model the probability of a document belonging to a particular cluster. A natural approach to network-based topic modeling is to add a network-based regularization constraint to traditional topic models, as in NetPLSA [65]. The relational topic model (RTM) proposed in [23] models the generation of documents and links sequentially. The first step for generating the documents is the same as in LDA. Subsequently, the model predicts links based on the similarity of the topic mixtures used in two documents. Thus, this method can be used both for topic modeling and for predicting missing links. A more unified model is proposed in the iTopicModel [91] framework, which creates a Markov Random Field model in order to create a generative model which simultaneously captures both text and links. Experimental results have shown this approach to be more general and superior to previously existing methods. A number of other methods for incorporating network information into topic modeling are discussed in the next chapter on dimensionality reduction.
8. Semi-Supervised Clustering

In some applications, prior knowledge may be available about the kinds of clusters that are present in the underlying data. This prior knowledge may take the form of labels attached to the documents, which indicate their underlying topics. For example, if we wish to use the broad distribution of topics in the Yahoo! taxonomy in order to supervise the clustering process of a new web collection, one way to perform supervision would be to add some labeled pages from the Yahoo! taxonomy to the collection. Typically such pages would contain labels of the form @Science@Astronomy or @Arts@Painting, which indicate the subject area of the added pages. Such knowledge can be very useful in creating significantly more coherent clusters, especially when the total number of clusters is large. The process of using such labels to guide the clustering process is referred to as semi-supervised clustering. This form of learning is a bridge between the clustering and classification problems, because it uses the underlying class structure, but it is not completely tied down by the specific structure. As a result, such an approach finds applicability in both the clustering and classification scenarios.

The most natural method for incorporating supervision into the clustering process is to do so in partitional clustering methods such as k-means. This is because the supervision can be easily incorporated by changing the seeds in the clustering process. For example, the work in [4] uses the initial seeds in the k-means clustering process as the centroids of the original classes in the underlying data. A similar approach has also been used in [15], except that a wider variety of ways in which the seeds may be selected is explored. A number of probabilistic frameworks have also been designed for semi-supervised clustering [72, 14]. The work in [72] uses an iterative EM-approach in which the unlabeled documents are assigned labels using a naive Bayes approach on the currently labeled documents. These newly labeled documents are then again used for re-training a Bayes classifier. This process is iterated to convergence. The iterative labeling approach in [72] can be considered a partially supervised approach for clustering the unlabeled documents. The work in [14] uses a Heterogeneous Markov Random Field (HMRF) model for the clustering process.

A graph-based method for incorporating prior knowledge into the clustering process has been proposed in [52]. In this method, the documents are modeled as a graph, in which nodes represent documents and edges represent the similarity between them. New edges may also be added to this graph, which correspond to the prior knowledge. Specifically, an edge is added to the graph when it is known, on the basis of prior knowledge, that the two documents are similar. A normalized cut algorithm [84] is then applied to this graph in order to create the final clustering. This approach implicitly uses the prior knowledge because of the augmented graph representation which is used for the clustering.

Since semi-supervised clustering forms a natural bridge between the clustering and classification problems, it is natural that semi-supervised methods can be used for classification as well [68]. This is also referred to as co-training, because it involves the use of unsupervised document clustering in order to assist the training process.
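As a concrete illustration of the seeding strategy of [4, 15] described above, the following is a minimal sketch in which the labeled documents supply the initial centroids and standard k-means then runs over the full collection. The vectorization step and all names are illustrative rather than taken from those papers.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def seeded_kmeans(labeled_docs, labels, unlabeled_docs):
    """Semi-supervised clustering: seed k-means with class centroids."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(labeled_docs + unlabeled_docs).toarray()
    X_labeled = X[:len(labeled_docs)]

    # One seed per class: the centroid of the labeled documents
    # belonging to that class, as in the seeding approach of [4, 15].
    classes = sorted(set(labels))
    y = np.asarray(labels)
    seeds = np.vstack([X_labeled[y == c].mean(axis=0) for c in classes])

    # Standard k-means, initialized at the supervised seeds rather
    # than at random points.
    km = KMeans(n_clusters=len(classes), init=seeds, n_init=1)
    return km.fit_predict(X)
```

Because only the initialization is changed, any refinement of the seeding strategy (such as the variations explored in [15]) can be substituted without altering the rest of the procedure.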
Since semi-supervised methods use both the clustering structure in the feature space and the class information, they are sometimes more robust in classification scenarios, especially in cases where the amount of available labeled data is small. It has been shown in [72] how a partially supervised co-training approach which mixes supervised and unsupervised data may yield more effective classification results when the amount of training data available is small. The work in [72] uses a partially supervised EM-algorithm which iteratively assigns labels to the unlabeled documents and refines them over time as convergence is achieved. A number of similar methods in this spirit are proposed in [4, 14, 35, 47, 89] with varying levels of supervision in the clustering process. Partially supervised clustering methods are also used for feature transformation in classification, as discussed in [17, 18, 88]. The idea is that the clustering structure provides a compressed feature space, which captures the relevant classification structure very well, and can therefore be helpful for classification.

Partially supervised methods can also be used in conjunction with pre-existing categorical hierarchies (or prototype hierarchies) [4, 56, 67]. A typical example of a prototype hierarchy would be the Yahoo! taxonomy, the Open Directory Project, or the Reuters collection. The idea is that such hierarchies provide a good general idea of the clustering structure, but also have considerable noise and overlaps in them because of their typically manual origins. The partial supervision is able to correct the noise and overlaps, and this results in a relatively clean and coherent clustering structure.

An unusual kind of supervision for document clustering is the use of a universum of documents which are known not to belong to any cluster [106]. This is essentially the background distribution, which cannot be naturally clustered into any particular group. The intuition is that the universum of examples provides an effective way of avoiding mistakes in the clustering process, since it provides a background of examples against which a cluster can be compared.

9. Conclusions and Summary

In this chapter, we presented a survey of clustering algorithms for text data. A good clustering of text requires effective feature selection and a proper choice of the algorithm for the task at hand. Among the different classes of algorithms, the distance-based methods are among the most popular in a wide variety of applications. In recent years, the main trend in research in this area has been in the context of two kinds of text data:

Dynamic Applications: The large amounts of text data being created by dynamic applications such as social networks or online chat applications have created an immense need for streaming text clustering applications. Such streaming applications need to be applicable in the case of text which is not very clean, as is often the case for applications such as social networks.

Heterogeneous Applications: Text applications increasingly arise in heterogeneous settings in which the text is available in the context of links and other heterogeneous multimedia data. For example, in social networks such as Flickr the clustering often needs to be applied in such scenarios. Therefore, it is critical to effectively adapt text-based algorithms to heterogeneous multimedia scenarios.
We note that the field of text clustering is too vast to cover comprehensively in a single chapter. Some methods such as committee-based clustering [73] cannot even be neatly incorporated into any class of methods, since they use a combination of the different clustering methods in order to create a final clustering result. The main purpose of this chapter is to provide a comprehensive overview of the main algorithms which are often used in the area, as a starting point for further study. References [1] C. C. Aggarwal, Y. Zhao, P. S. Yu. On Text Clustering with Side Information, ICDE Conference, 2012. [2] C. C. Aggarwal, P. S. Yu. On Effective Conceptual Indexing and Similarity Search in Text, ICDM Conference, 2001. [3] C. C. Aggarwal, P. S. Yu. A Framework for Clustering Massive Text and Categorical Data Streams, SIAM Conference on Data Mining, 2006. [4] C. C. Aggarwal, S. C. Gates, P. S. Yu. On Using Partial Supervision for Text Categorization, IEEE Transactions on Knowledge and Data Engineering, 16(2), 245–255, 2004. [5] C. C. Aggarwal, C. Procopiuc, J. Wolf, P. S. Yu, J.-S. Park. Fast Algorithms for Projected Clustering, ACM SIGMOD Conference, 1999. 122 MINING TEXT DATA [6] C. C. Aggarwal, P. S. Yu. Finding Generalized Projected Clusters in High Dimensional Spaces, ACM SIGMOD Conference, 2000. [7] R. Agrawal, J. Gehrke, P. Raghavan. D. Gunopulos. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, ACM SIGMOD Conference, 1999. [8] R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases, VLDB Conference, 1994. [9] J. Allan, R. Papka, V. Lavrenko. Online new event detection and tracking. ACM SIGIR Conference, 1998. [10] P. Andritsos, P. Tsaparas, R. Miller, K. Sevcik. LIMBO: Scalable Clustering of Categorical Data. EDBT Conference, 2004. [11] P. Anick, S. Vaithyanathan. Exploiting Clustering and Phrases for Context-Based Information Retrieval. ACM SIGIR Conference, 1997. [12] R. Angelova, S. Siersdorfer. A neighborhood-based approach for clustering of linked document collections. CIKM Conference, 2006. [13] R. A. Baeza-Yates, B. A. Ribeiro-Neto, Modern Information Retrieval - the concepts and technology behind search, Second edition, Pearson Education Ltd., Harlow, England, 2011. [14] S. Basu, M. Bilenko, R. J. Mooney. A probabilistic framework for semi-supervised clustering. ACM KDD Conference, 2004. [15] S. Basu, A. Banerjee, R. J. Mooney. Semi-supervised Clustering by Seeding. ICML Conference, 2002. [16] F. Beil, M. Ester, X. Xu. Frequent term-based text clustering, ACM KDD Conference, 2002. [17] L. Baker, A. McCallum. Distributional Clustering of Words for Text Classification, ACM SIGIR Conference, 1998. [18] R. Bekkerman, R. El-Yaniv, Y. Winter, N. Tishby. On Feature Distributional Clustering for Text Categorization. ACM SIGIR Conference, 2001. [19] D. Blei, J. Lafferty. Dynamic topic models. ICML Conference, 2006. [20] D. Blei, A. Ng, M. Jordan. Latent Dirichlet allocation, Journal of Machine Learning Research, 3: pp. 993–1022, 2003. [21] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J/ C. Lai. Class-based n-gram models of natural language, Computational Linguistics, 18, 4 (December 1992), 467-479. [22] K. Chakrabarti, S. Mehrotra. Local Dimension reduction: A new Approach to Indexing High Dimensional Spaces, VLDB Conference, 2000. A Survey of Text Clustering Algorithms 123 [23] J. Chang, D. Blei. Topic Models for Document Networks. AISTASIS, 2009. [24] W. B. Croft. 
Clustering large files of documents using the single-link method. Journal of the American Society of Information Science, 28: pp. 341–344, 1977. [25] D. Cutting, D. Karger, J. Pedersen, J. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. ACM SIGIR Conference, 1992. [26] D. Cutting, D. Karger, J. Pederson. Constant Interaction-time Scatter/Gather Browsing of Large Document Collections, ACM SIGIR Conference, 1993. [27] M. Dash, H. Liu. Feature Selection for Clustering, PAKDD Conference, pp. 110–121, 1997. [28] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, R. Harshman. Indexing by Latent Semantic Analysis. JASIS, 41(6), pp. 391–407, 1990. [29] I. Dhillon, D. Modha. Concept Decompositions for Large Sparse Data using Clustering, 42(1), pp. 143–175, 2001. [30] I. Dhillon. Co-clustering Documents and Words using bipartite spectral graph partitioning, ACM KDD Conference, 2001. [31] I. Dhillon, S. Mallela, D. Modha. Information-theoretic CoClustering, ACM KDD Conference, 2003. [32] C. Ding, X. He, H. Zha, H. D. Simon. Adaptive Dimension Reduction for Clustering High Dimensional Data, ICDM Conference, 2002. [33] C. Ding, X. He, H. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. SDM Conference, 2005. [34] B. Dorow, D. Widdows. Discovering corpus-specific word senses, Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 2 (EACL ’03), pages 79-82, 2003. [35] R. El-Yaniv, O. Souroujon. Iterative Double Clustering for Unsupervised and Semi-supervised Learning. NIPS Conference, 2002. [36] H. Fang, T. Tao, C. Zhai, A formal study of information retrieval heuristics, Proceedings of ACM SIGIR 2004, 2004. [37] D. Fisher. Knowledge Acquisition via incremental conceptual clustering. Machine Learning, 2: pp. 139–172, 1987. [38] M. Franz, T. Ward, J. McCarley, W.-J. Zhu. Unsupervised and supervised clustering for topic tracking. ACM SIGIR Conference, 2001. 124 MINING TEXT DATA [39] G. P. C. Fung, J. X. Yu, P. Yu, H. Lu. Parameter Free Bursty Events Detection in Text Streams, VLDB Conference, 2005. [40] J. H. Gennari, P. Langley, D. Fisher. Models of incremental concept formation. Journal of Artificial Intelligence, 40 pp. 11–61, 1989. [41] D. Gibson, J. Kleinberg, P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems, VLDB Conference, 1998. [42] M. Girolami, A Kaban. On the Equivalance between PLSI and LDA, SIGIR Conference, pp. 433–434, 2003. [43] S. Guha, R. Rastogi, K. Shim. ROCK: a robust clustering algorithm for categorical attributes, International Conference on Data Engineering, 1999. [44] S. Guha, R. Rastogi, K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases. ACM SIGMOD Conference, 1998. [45] D. Gusfield. Algorithms for strings, trees and sequences, Cambridge University Press, 1997. [46] Y. Huang, T. Mitchell. Text clustering with extended user feedback. ACM SIGIR Conference, 2006. [47] H. Li, K. Yamanishi. Document classification using a finite mixture model. Annual Meeting of the Association for Computational Linguistics, 1997. [48] Q. He, K. Chang, E.-P. Lim, J. Zhang. Bursty feature representation for clustering text streams. SDM Conference, 2007. [49] T. Hofmann. Probabilistic Latent Semantic Indexing. ACM SIGIR Conference, 1999. [50] A. Jain, R. C. Dubes. Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1998. [51] N. Jardine, C. J.van Rijsbergen. 
The use of hierarchical clustering in information retrieval, Information Storage and Retrieval, 7: pp. 217–240, 1971. [52] X. Ji, W. Xu. Document clustering with prior knowledge. ACM SIGIR Conference, 2006. [53] I. T. Jolliffee. Principal Component Analysis. Springer, 2002. [54] L. Kaufman, P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Interscience, 1990. [55] W. Ke, C. Sugimoto, J. Mostafa. Dynamicity vs. effectiveness: studying online clustering for scatter/gather. ACM SIGIR Conference, 2009. A Survey of Text Clustering Algorithms 125 [56] H. Kim, S. Lee. A Semi-supervised document clustering technique for information organization, CIKM Conference, 2000. [57] J. Kleinberg, Bursty and hierarchical structure in streams, ACM KDD Conference, pp. 91–101, 2002. [58] D. D. Lee, H. S. Seung. Learning the parts of objects by nonnegative matrix factorization, Nature, 401: pp. 788–791, 1999. [59] T. Li, S. Ma, M. Ogihara, Document Clustering via Adaptive Subspace Iteration, ACM SIGIR Conference, 2004. [60] T. Li, C. Ding, Y. Zhang, B. Shao. Knowledge transformation from word space to document space. ACM SIGIR Conference, 2008. [61] Y.-B. Liu, J.-R. Cai, J. Yin, A. W.-C. Fu. Clustering Text Data Streams, Journal of Computer Science and Technology, Vol. 23(1), pp. 112–128, 2008. [62] T. Liu, S. Lin, Z. Chen, W.-Y. Ma. An Evaluation on Feature Selection for Text Clustering, ICML Conference, 2003. [63] Y. Lu, Q. Mei, C. Zhai. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA, Information Retrieval, 14(2): 178-203 (2011). [64] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu. edu/~mccallum/bow, 1996. [65] Q. Mei, D. Cai, D. Zhang, C.-X. Zhai. Topic Modeling with Network Regularization. WWW Conference, 2008. [66] D. Metzler, S. T. Dumais, C. Meek, Similarity Measures for Short Segments of Text, Proceedings of ECIR 2007, 2007. [67] Z. Ming, K. Wang, T.-S. Chua. Prototype hierarchy-based clustering for the categorization and navigation of web collections. ACM SIGIR Conference, 2010. [68] T. M. Mitchell. The role of unlabeled data in supervised learning. Proceedings of the Sixth International Colloquium on Cognitive Science, 1999. [69] F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms, The Computer Journal, 26(4), pp. 354–359, 1983. [70] F. Murtagh. Complexities of Hierarchical Clustering Algorithms: State of the Art, Computational Statistics Quarterly, 1(2), pp. 101– 113, 1984. [71] R. Ng, J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB Conference, 1994. 126 MINING TEXT DATA [72] K. Nigam, A. McCallum, S. Thrun, T. Mitchell. Learning to classify text from labeled and unlabeled documents. AAAI Conference, 1998. [73] P. Pantel, D. Lin. Document Clustering with Committees, ACM SIGIR Conference, 2002. [74] G. Qi, C. Aggarwal, T. Huang. Community Detection with Edge Content in Social Media Networks, ICDE Conference, 2012. [75] M. Rege, M. Dong, F. Fotouhi. Co-clustering Documents and Words Using Bipartite Isoperimetric Graph Partitioning. ICDM Conference, pp. 532–541, 2006. [76] C. J. van Rijsbergen. Information Retrieval, Butterworths, 1975. [77] C. J.van Rijsbergen, W. B. Croft. Document Clustering: An Evaluation of some experiments with the Cranfield 1400 collection, Information Processing and Management, 11, pp. 171–182, 1975. [78] S. E. Robertson and S. Walker. 
Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In SIGIR, pages 232–241, 1994. [79] M. Sahami, T. D. Heilman, A web-based kernel function for measuring the similarity of short text snippets, Proceedings of WWW 2006, pages 377-386, 2006. [80] N. Sahoo, J. Callan, R. Krishnan, G. Duncan, R. Padman. Incremental Hierarchical Clustering of Text Documents, ACM CIKM Conference, 2006. [81] G. Salton. An Introduction to Modern Information Retrieval, Mc Graw Hill, 1983. [82] G. Salton, C. Buckley. Term Weighting Approaches in Automatic Text Retrieval, Information Processing and Management, 24(5), pp. 513–523, 1988. [83] H. Schutze, C. Silverstein. Projections for Efficient Document Clustering, ACM SIGIR Conference, 1997. [84] J. Shi, J. Malik. Normalized cuts and image segmentation. IEEE Transaction on Pattern Analysis and Machine Intelligence, 2000. [85] C. Silverstein, J. Pedersen. Almost-constant time clustering of arbitrary corpus subsets. ACM SIGIR Conference, pp. 60–66, 1997. [86] A. Singhal, C. Buckley, M. Mitra. Pivoted Document Length Normalization. ACM SIGIR Conference, pp. 21–29, 1996. [87] N. Slonim, N. Tishby. Document Clustering using word clusters via the information bottleneck method, ACM SIGIR Conference, 2000. A Survey of Text Clustering Algorithms 127 [88] N. Slonim, N. Tishby. The power of word clusters for text classification. European Colloquium on Information Retrieval Research (ECIR), 2001. [89] N. Slonim, N. Friedman, N. Tishby. Unsupervised document classification using sequential information maximization. ACM SIGIR Conference, 2002. [90] M. Steinbach, G. Karypis, V. Kumar. A Comparison of Document Clustering Techniques, KDD Workshop on text mining, 2000. [91] Y. Sun, J. Han, J. Gao, Y. Yu. iTopicModel: Information Network Integrated Topic Modeling, ICDM Conference, 2009. [92] E. M. Voorhees. Implementing Agglomerative Hierarchical Clustering for use in Information Retrieval,Technical Report TR86–765, Cornell University, Ithaca, NY, July 1986. [93] F. Wang, C. Zhang, T. Li. Regularized clustering for documents. ACM SIGIR Conference, 2007. [94] J. Wilbur, K. Sirotkin. The automatic identification of stopwords, J. Inf. Sci., 18: pp. 45–55, 1992. [95] P. Willett. Document Clustering using an inverted file approach. Journal of Information Sciences, 2: pp. 223–231, 1980. [96] P. Willett. Recent Trends in Hierarchical Document Clustering: A Critical Review. Information Processing and Management, 24(5): pp. 577–597, 1988. [97] W. Xu, X. Liu, Y. Gong. Document Clustering based on nonnegative matrix factorization, ACM SIGIR Conference, 2003. [98] W. Xu, Y. Gong. Document clustering by concept factorization. ACM SIGIR Conference, 2004. [99] Y. Yang, J. O. Pederson. A comparative study on feature selection in text categorization, ACM SIGIR Conference, 1995. [100] Y. Yang. Noise Reduction in a Statistical Approach to Text Categorization, ACM SIGIR Conference, 1995. [101] T. Yang, R. Jin, Y. Chi, S. Zhu. Combining link and content for community detection: a discriminative approach. ACM KDD Conference, 2009. [102] L. Yao, D. Mimno, A. McCallum. Efficient methods for topic model inference on streaming document collections, ACM KDD Conference, 2009. [103] O. Zamir, O. Etzioni. Web Document Clustering: A Feasibility Demonstration, ACM SIGIR Conference, 1998. 128 MINING TEXT DATA [104] O. Zamir, O. Etzioni, O. Madani, R. M. Karp. Fast and Intuitive Clustering of Web Documents, ACM KDD Conference, 1997. [105] C. 
Zhai, Statistical Language Models for Information Retrieval (Synthesis Lectures on Human Language Technologies), Morgan & Claypool Publishers, 2008. [106] D. Zhang, J. Wang, L. Si. Document clustering with universum. ACM SIGIR Conference, 2011. [107] J. Zhang, Z. Ghahramani, Y. Yang. A probabilistic model for online document clustering with application to novelty detection. In Saul L., Weiss Y., Bottou L. (eds) Advances in Neural Information Processing Letters, 17, 2005. [108] T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. ACM SIGMOD Conference, 1996. [109] X. Zhang, X. Hu, X. Zhou. A comparative evaluation of different link types on enhancing document clustering. ACM SIGIR Conference, 2008. [110] Y. Zhao, G. Karypis. Evaluation of hierarchical clustering algorithms for document data set, CIKM Conference, 2002. [111] Y. Zhao, G. Karypis. Empirical and Theoretical comparisons of selected criterion functions for document clustering, Machine Learning, 55(3), pp. 311–331, 2004. [112] S. Zhong. Efficient Streaming Text Clustering. Neural Networks, Volume 18, Issue 5–6, 2005. [113] Y. Zhou, H. Cheng, J. X. Yu. Graph Clustering based on Structural/Attribute Similarities, VLDB Conference, 2009. [114] http://www.lemurproject.org/ Chapter 6 A SURVEY OF TEXT CLASSIFICATION ALGORITHMS Charu C. Aggarwal IBM T. J. Watson Research Center Yorktown Heights, NY charu@us.ibm.com ChengXiang Zhai University of Illinois at Urbana-Champaign Urbana, IL czhai@cs.uiuc.edu Abstract The problem of classification has been widely studied in the data mining, machine learning, database, and information retrieval communities with applications in a number of diverse domains, such as target marketing, medical diagnosis, news group filtering, and document organization. In this paper we will provide a survey of a wide variety of text classification algorithms. Keywords: Text Classification 1. Introduction The problem of classification has been widely studied in the database, data mining, and information retrieval communities. The problem of classification is defined as follows. We have a set of training records D = {X1, . . . , XN }, such that each record is labeled with a class value drawn from a set of k different discrete values indexed by {1 . . . k}. The training data is used in order to construct a classification model, which relates the features in the underlying record to one of the class labels. For a given test instance for which the class is unknown, the training model 164 MINING TEXT DATA is used to predict a class label for this instance. In the hard version of the classification problem, a particular label is explicitly assigned to the instance, whereas in the soft version of the classification problem, a probability value is assigned to the test instance. Other variations of the classification problem allow ranking of different class choices for a test instance, or allow the assignment of multiple labels [52] to a test instance. The classification problem assumes categorical values for the labels, though it is also possible to use continuous values as labels. The latter is referred to as the regression modeling problem. The problem of text classification is closely related to that of classification of records with set-valued features [28]; however, this model assumes that only information about the presence or absence of words is used in a document. 
In reality, the frequency of words also plays a helpful role in the classification process, and the typical domain-size of text data (the entire lexicon size) is much greater than a typical set-valued classification problem. A broad survey of a wide variety of classification methods may be found in [42, 62], and a survey which is specific to the text domain may be found in [111]. A relative evaluation of different kinds of text classification methods may be found in [132]. A number of the techniques discussed in this chapter have also been converted into software and are publicly available through multiple toolkits such as the BOW toolkit [93], Mallot [96], WEKA 1, and LingPipe 2. The problem of text classification finds applications in a wide variety of domains in text mining. Some examples of domains in which text classification is commonly used are as follows: News filtering and Organization: Most of the news services today are electronic in nature in which a large volume of news articles are created very single day by the organizations. In such cases, it is difficult to organize the news articles manually. Therefore, automated methods can be very useful for news categorization in a variety of web portals [78]. This application is also referred to as text filtering. Document Organization and Retrieval: The above application is generally useful for many applications beyond news filtering and organization. A variety of supervised methods may be used for document organization in many domains. These include large digital libraries of documents, web collections, scientific literature, 1http://www.cs.waikato.ac.nz/ml/weka/ 2http://alias-i.com/lingpipe/ A Survey of Text Classification Algorithms 165 or even social feeds. Hierarchically organized document collections can be particularly useful for browsing and retrieval [19]. Opinion Mining: Customer reviews or opinions are often short text documents which can be mined to determine useful information from the review. Details on how classification can be used in order to perform opinion mining are discussed in [89] and Chapter 13 in this book. Email Classification and Spam Filtering: It is often desirable to classify email [23, 27, 85] in order to determine either the subject or to determine junk email [113] in an automated way. This is also referred to as spam filtering or email filtering. A wide variety of techniques have been designed for text classification. In this chapter, we will discuss the broad classes of techniques, and their uses for classification tasks. We note that these classes of techniques also generally exist for other data domains such as quantitative or categorical data. Since text may be modeled as quantitative data with frequencies on the word attributes, it is possible to use most of the methods for quantitative data directly on text. However, text is a particular kind of data in which the word attributes are sparse, and high dimensional, with low frequencies on most of the words. Therefore, it is critical to design classification methods which effectively account for these characteristics of text. In this chapter, we will focus on the specific changes which are applicable to the text domain. Some key methods, which are commonly used for text classification are as follows: Decision Trees: Decision trees are designed with the use of a hierarchical division of the underlying data space with the use of different text features. 
The hierarchical division of the data space is designed in order to create class partitions which are more skewed in terms of their class distribution. For a given text instance, we determine the partition that it is most likely to belong to, and use it for the purposes of classification. Pattern (Rule)-based Classifiers: In rule-based classifiers we determine the word patterns which are most likely to be related to the different classes. We construct a set of rules, in which the lefthand side corresponds to a word pattern, and the right-hand side corresponds to a class label. These rules are used for the purposes of classification. SVM Classifiers: SVM Classifiers attempt to partition the data space with the use of linear or non-linear delineations between the 166 MINING TEXT DATA different classes. The key in such classifiers is to determine the optimal boundaries between the different classes and use them for the purposes of classification. Neural Network Classifiers: Neural networks are used in a wide variety of domains for the purposes of classification. In the context of text data, the main difference for neural network classifiers is to adapt these classifiers with the use of word features. We note that neural network classifiers are related to SVM classifiers; indeed, they both are in the category of discriminative classifiers, which are in contrast with the generative classifiers [102]. Bayesian (Generative) Classifiers: In Bayesian classifiers (also called generative classifiers), we attempt to build a probabilistic classifier based on modeling the underlying word features in different classes. The idea is then to classify text based on the posterior probability of the documents belonging to the different classes on the basis of the word presence in the documents. Other Classifiers: Almost all classifiers can be adapted to the case of text data. Some of the other classifiers include nearest neighbor classifiers, and genetic algorithm-based classifiers. We will discuss some of these different classifiers in some detail and their use for the case of text data. The area of text categorization is so vast that it is impossible to cover all the different algorithms in detail in a single chapter. Therefore, our goal is to provide the reader with an overview of the most important techniques, and also the pointers to the different variations of these techniques. Feature selection is an important problem for text classification. In feature selection, we attempt to determine the features which are most relevant to the classification process. This is because some of the words are much more likely to be correlated to the class distribution than others. Therefore, a wide variety of methods have been proposed in the literature in order to determine the most important features for the purpose of classification. These include measures such as the gini-index or the entropy, which determine the level of which the presence of a particular feature skews the class distribution in the underlying data. We will also discuss the different feature selection methods which are commonly used for text classification. The rest of this chapter is organized as follows. In the next section, we will discuss methods for feature selection in text classification. In section A Survey of Text Classification Algorithms 167 3, we will describe decision tree methods for text classification. Rulebased classifiers are described in detail in section 4. We discuss naive Bayes classifiers in section 5. 
In section 6, we will discuss a number of linear classifiers, such as the SVM classifier, direct regression modeling and the neural network classifier. The nearest neighbor classifier is discussed in section 7. A discussion of how the classification methods can be adapted to text and web data containing hyperlinks is provided in section 8. In section 9, we discuss a number of different meta-algorithms for classification, such as boosting, bagging and ensemble learning. Section 10 contains the conclusions and summary.

2. Feature Selection for Text Classification

Before any classification task, one of the most fundamental tasks that needs to be accomplished is that of document representation and feature selection. While feature selection is also desirable in other classification tasks, it is especially important in text classification due to the high dimensionality of text features and the existence of irrelevant (noisy) features. In general, text can be represented in two separate ways. The first is as a bag of words, in which a document is represented as a set of words together with their associated frequencies in the document. Such a representation is essentially independent of the sequence of words in the document. The second method is to represent text directly as strings, in which each document is a sequence of words. Most text classification methods use the bag-of-words representation because of its simplicity for classification purposes. In this section, we will discuss some of the methods which are used for feature selection in text classification.

The most common feature selection which is used in both supervised and unsupervised applications is that of stop-word removal and stemming. In stop-word removal, we determine the common words in the documents which are not specific or discriminatory to the different classes. In stemming, different forms of the same word are consolidated into a single word. For example, singular, plural and different tenses are consolidated into a single word. We note that these methods are not specific to the case of the classification problem, and are often used in a variety of unsupervised applications such as clustering and indexing. In the case of the classification problem, it makes sense to supervise the feature selection process with the use of the class labels. This kind of selection process ensures that those features which are highly skewed towards the presence of a particular class label are picked for the learning process. A wide variety of feature selection methods are discussed in [133, 135]. Many of these feature selection methods have been compared with one another, and the experimental results are presented in [133]. We will discuss each of these feature selection methods in this section.

2.1 Gini Index

One of the most common methods for quantifying the discrimination level of a feature is the use of a measure known as the gini-index. Let p1(w) . . . pk(w) be the fractions of class-label presence of the k different classes for the word w. In other words, pi(w) is the conditional probability that a document belongs to class i, given the fact that it contains the word w. Therefore, we have:

\[ \sum_{i=1}^{k} p_i(w) = 1 \qquad (6.1) \]

Then, the gini-index for the word w, denoted by G(w), is defined as follows:

\[ G(w) = \sum_{i=1}^{k} p_i(w)^2 \qquad (6.2) \]

(The gini-index is also sometimes defined as \( 1 - \sum_{i=1}^{k} p_i(w)^2 \), with lower values indicating greater discriminative power of the feature w.) The value of the gini-index G(w) always lies in the range [1/k, 1]. Higher values of the gini-index G(w) indicate a greater discriminative power of the word w.
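As a concrete illustration of the measure just defined, the following sketch computes G(w) for every word of a labeled corpus from a binary document-term matrix; the scikit-learn vectorization step and the variable names are illustrative additions rather than part of the above description.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def gini_index(docs, labels):
    """Compute G(w) = sum_i p_i(w)^2 for every word w in the corpus."""
    vec = CountVectorizer(binary=True).fit(docs)
    D = vec.transform(docs).toarray()          # n_docs x n_words, 0/1 presence
    y = np.asarray(labels)
    classes = np.unique(y)

    # For each word, count how many documents of each class contain it.
    counts = np.vstack([D[y == c].sum(axis=0) for c in classes]).astype(float)
    totals = counts.sum(axis=0)                 # documents containing each word
    totals[totals == 0] = 1.0                   # avoid division by zero

    # p_i(w): conditional probability of class i given that w is present;
    # the columns of p sum to one, so G(w) lies in [1/k, 1].
    p = counts / totals
    return dict(zip(vec.get_feature_names_out(), (p ** 2).sum(axis=0)))
```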
For example, when all documents which contain word w belong to a particular class, the value of G(w) is 1. On the other hand, when documents containing word w are evenly distributed among the k different classes, the value of G(w) is 1/k. One criticism of this approach is that the global class distribution may be skewed to begin with, and therefore the above measure may sometimes not accurately reflect the discriminative power of the underlying attributes. Therefore, it is possible to construct a normalized gini-index in order to reflect the discriminative power of the attributes more accurately. Let P1 . . . Pk represent the global distributions of the documents in the different classes. Then, we determine the normalized probability value p'_i(w) as follows:

\[ p'_i(w) = \frac{p_i(w)/P_i}{\sum_{j=1}^{k} p_j(w)/P_j} \qquad (6.3) \]

Then, the gini-index is computed in terms of these normalized probability values:

\[ G(w) = \sum_{i=1}^{k} p'_i(w)^2 \qquad (6.4) \]

The use of the global probabilities Pi ensures that the gini-index more accurately reflects class-discrimination in the case of biased class distributions in the whole document collection. For a document corpus containing n documents, d words, and k classes, the complexity of the gini-index computation is O(n · d · k). This is because the computation of the term pi(w) for all the different words and classes requires O(n · d · k) time.

2.2 Information Gain

Another related measure which is commonly used for text feature selection is that of information gain or entropy. Let Pi be the global probability of class i, and pi(w) be the probability of class i, given that the document contains the word w. Let F(w) be the fraction of the documents containing the word w. The information gain measure I(w) for a given word w is defined as follows:

\[ I(w) = -\sum_{i=1}^{k} P_i \cdot \log(P_i) + F(w) \cdot \sum_{i=1}^{k} p_i(w) \cdot \log(p_i(w)) + (1 - F(w)) \cdot \sum_{i=1}^{k} (1 - p_i(w)) \cdot \log(1 - p_i(w)) \]

The greater the value of the information gain I(w), the greater the discriminatory power of the word w. For a document corpus containing n documents and d words, the complexity of the information gain computation is O(n · d · k).

2.3 Mutual Information

The mutual information measure is derived from information theory [31], and provides a formal way to model the mutual information between the features and the classes. The pointwise mutual information Mi(w) between the word w and the class i is defined on the basis of the level of co-occurrence between the class i and word w. We note that the expected co-occurrence of class i and word w on the basis of mutual independence is given by Pi · F(w). The true co-occurrence is of course given by F(w) · pi(w). In practice, the value of F(w) · pi(w) may be much larger or smaller than Pi · F(w), depending upon the level of correlation between the class i and word w. The mutual information is defined in terms of the ratio between these two values. Specifically, we have:

\[ M_i(w) = \log\left( \frac{F(w) \cdot p_i(w)}{F(w) \cdot P_i} \right) = \log\left( \frac{p_i(w)}{P_i} \right) \qquad (6.5) \]

Clearly, the word w is positively correlated to the class i when Mi(w) > 0, and the word w is negatively correlated to class i when Mi(w) < 0. We note that Mi(w) is specific to a particular class i. We need to compute the overall mutual information as a function of the mutual information of the word w with the different classes.
These are defined with the use of the average and maximum values of Mi(w) over the different classes:

\[ M_{avg}(w) = \sum_{i=1}^{k} P_i \cdot M_i(w) \]

\[ M_{max}(w) = \max_i \{ M_i(w) \} \]

Either of these measures may be used in order to determine the relevance of the word w. The second measure is particularly useful when it is more important to determine high levels of positive correlation of the word w with any one of the classes.

2.4 χ2-Statistic

The χ2-statistic is a different way to compute the lack of independence between the word w and a particular class i. Let n be the total number of documents in the collection, pi(w) be the conditional probability of class i for documents which contain w, Pi be the global fraction of documents belonging to class i, and F(w) be the global fraction of documents which contain the word w. The χ2-statistic between word w and class i is defined as follows:

\[ \chi_i^2(w) = \frac{n \cdot F(w)^2 \cdot (p_i(w) - P_i)^2}{F(w) \cdot (1 - F(w)) \cdot P_i \cdot (1 - P_i)} \qquad (6.6) \]

As in the case of the mutual information, we can compute a global χ2-statistic from the class-specific values. We can use either the average or the maximum value in order to create the composite value:

\[ \chi_{avg}^2(w) = \sum_{i=1}^{k} P_i \cdot \chi_i^2(w) \]

\[ \chi_{max}^2(w) = \max_i \chi_i^2(w) \]

We note that the χ2-statistic and mutual information are different ways of measuring the correlation between terms and categories. One major advantage of the χ2-statistic over the mutual information measure is that it is a normalized value, and therefore these values are more comparable across terms in the same category.

2.5 Feature Transformation Methods: Supervised LSI

While feature selection attempts to reduce the dimensionality of the data by picking from the original set of attributes, feature transformation methods create a new (and smaller) set of features as a function of the original set of features. A typical example of such a feature transformation method is Latent Semantic Indexing (LSI) [38], and its probabilistic variant PLSA [57]. The LSI method transforms the text space of a few hundred thousand word features to a new axis-system (of size about a few hundred), whose dimensions are linear combinations of the original word features. In order to achieve this goal, Principal Component Analysis techniques [69] are used to determine the axis-system which retains the greatest level of information about the variations in the underlying attribute values. The main disadvantage of using techniques such as LSI is that they are unsupervised techniques which are blind to the underlying class-distribution. Thus, the features found by LSI are not necessarily the directions along which the class-distribution of the underlying documents can be best separated. A modest level of success has been obtained in improving classification accuracy by using boosting techniques in conjunction with the conceptual features obtained from the unsupervised pLSA method [17]. A more recent study has systematically compared pLSA and LDA (a Bayesian version of pLSA) in terms of their effectiveness in transforming features for text categorization, and has drawn a similar conclusion, finding that pLSA and LDA tend to perform similarly. A number of techniques have also been proposed to perform feature transformation by using the class labels for effective supervision. The most natural method is to adapt LSI in order to make it work more effectively for the supervised case. A number of different methods have been proposed in this direction.
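Before turning to the supervised adaptations discussed next, the following is a minimal sketch of the unsupervised LSI step that they build upon: a truncated SVD of the weighted document-term matrix, which reduces a vocabulary-sized feature space to a few hundred latent dimensions. The scikit-learn calls and the choice of 300 components are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

def lsi_features(train_docs, test_docs, n_components=300):
    """Project documents onto an LSI (truncated SVD) axis system."""
    # Term weighting followed by a rank-k SVD of the document-term matrix.
    lsi = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        TruncatedSVD(n_components=n_components, random_state=0),
    )
    X_train = lsi.fit_transform(train_docs)   # axis system learned on training data
    X_test = lsi.transform(test_docs)         # test data projected onto the same axes
    return X_train, X_test
```

The supervised variants below modify this basic step, for example by performing it separately on each class or by adding artificial class-label terms to the documents before the decomposition.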
One common approach is to perform local LSI on the subsets of data representing the individual classes, and to identify the discriminative eigenvectors from the different reductions with the use of an iterative approach [123]. This method is known as SLSI (Supervised Latent Semantic Indexing). The advantages of the method seem to be modest, because the experiments in [123] show only limited improvements over a standard SVM classifier which did not use any dimensionality reduction. The work in [129] uses a combination of class-specific LSI and global analysis. As in the case of [123], class-specific LSI representations are created. Test documents are compared against each LSI representation in order to create the most discriminative reduced space. One problem with this approach is that the different local LSI representations use different subspaces, and therefore it is difficult to compare the similarities of the different documents across the different subspaces. Furthermore, both the methods in [123, 129] tend to be computationally expensive. A method called sprinkling is proposed in [21], in which artificial terms corresponding to the class labels are added to (or "sprinkled" into) the documents. In other words, we create a term corresponding to the class label, and add it to the document. LSI is then performed on the document collection with these added terms. The sprinkled terms can then be removed from the representation, once the eigenvectors have been determined. The sprinkled terms help in making the LSI more sensitive to the class distribution during the reduction process. It has also been proposed in [21] that it can be generally useful to make the sprinkling process adaptive, in which all classes are not necessarily treated equally, but the relationships between the classes are used in order to regulate the sprinkling process. Furthermore, methods have also been proposed in [21] to make the sprinkling process adaptive to the use of a particular kind of classifier.

2.6 Supervised Clustering for Dimensionality Reduction

One technique which is commonly used for feature transformation is that of text clustering [7, 71, 83, 121]. In these techniques, the clusters are constructed from the underlying text collection, with the use of supervision from the class distribution. The exception is [83], in which supervision is not used. In the simplest case, each class can be treated as a separate cluster, though better results may be obtained by using the classes for supervision of the clustering process. The frequently occurring words in these supervised clusters can be used in order to create the new set of dimensions. The classification can be performed with respect to this new feature representation. One advantage of such an approach is that it retains interpretability with respect to the original words of the document collection. The disadvantage is that the optimum directions of separation may not necessarily be represented in the form of clusters of words. Furthermore, the underlying axes are not necessarily orthonormal to one another. The use of supervised methods [1, 7, 71, 121] has generally led to good results, either in terms of improved classification accuracy, or in terms of significant performance gains at the expense of a small reduction in accuracy.
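A minimal sketch of the simplest variant described above, in which each class is treated as a single word cluster and documents are re-represented by one aggregate count per cluster, is given below. It is intended only to illustrate the idea of cluster-based feature compression, not to reproduce any particular method from [1, 7, 71, 121]; all names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def class_cluster_features(train_docs, train_labels, docs):
    """Map documents to k features: one aggregate count per word cluster,
    where each cluster is the set of words most associated with one class."""
    vec = CountVectorizer(stop_words="english").fit(train_docs)
    X_train = vec.transform(train_docs).toarray()
    classes = sorted(set(train_labels))
    y = np.asarray(train_labels)

    # Assign each word to the class (cluster) in which it occurs most often;
    # a refinement could keep only the most frequent words of each cluster.
    class_counts = np.vstack([X_train[y == c].sum(axis=0) for c in classes])
    word_cluster = class_counts.argmax(axis=0)        # cluster id per word

    # Indicator matrix (words x clusters), then sum counts within clusters.
    indicator = np.equal.outer(word_cluster,
                               np.arange(len(classes))).astype(float)
    X = vec.transform(docs).toarray()
    return X @ indicator                              # documents x clusters
```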
The results with the use of unsupervised clustering [83, 87] are mixed. For example, the work in [83] suggests that the use of unsupervised term-clusters and phrases is generally not helpful for the classification process. The key observation in [83] is that the loss of granularity associated with the use of phrases and term clusters is not necessarily advantageous for the classification process. The work in [8] has shown that the use of the information bottleneck method for feature distributional clustering can create clustered pseudo-word representations which are quite effective for text classification.

2.7 Linear Discriminant Analysis

Another method for feature transformation is the use of linear discriminants, which explicitly try to construct directions in the feature space along which the different classes are best separated. A common method is Fisher's linear discriminant [46]. The main idea in Fisher's discriminant method is to determine the directions in the data along which the points are as well separated as possible. The subspace of lower dimensionality is constructed by iteratively finding such unit vectors αi in the data, where αi is determined in the ith iteration. We would also like to ensure that the different values of αi are orthonormal to one another. In each step, we determine this vector αi by discriminant analysis, and project the data onto the remaining orthonormal subspace. The next vector αi+1 is found in this orthonormal subspace. The quality of the vector αi is measured by an objective function which measures the separation of the different classes. This objective function reduces in each iteration, since the value of αi in a given iteration is the optimum discriminant in that subspace, and the vector found in the next iteration is the optimal one from a smaller search space. The process of finding linear discriminants is continued until the class separation, as measured by the objective function, falls below a given threshold for the vector determined in the current iteration. The power of such a dimensionality reduction approach has been illustrated in [18], in which it has been shown that a simple decision tree classifier can perform much more effectively on this transformed data, as compared to more sophisticated classifiers.

Next, we discuss how the Fisher's discriminant is actually constructed. First, we will set up the objective function J(α), which determines the level of separation of the different classes along a given direction (unit vector) α. This sets up the crisp optimization problem of determining the value of α which maximizes J(α). For simplicity, let us assume the case of binary classes. Let D1 and D2 be the two sets of documents belonging to the two classes. Then, the projection of a document X ∈ D1 ∪ D2 along α is given by X · α. Therefore, the squared class separation S(D1, D2, α) along the direction α is given by:

\[ S(D_1, D_2, \alpha) = \left( \sum_{X \in D_1} \frac{\alpha \cdot X}{|D_1|} - \sum_{X \in D_2} \frac{\alpha \cdot X}{|D_2|} \right)^2 \qquad (6.7) \]

In addition, we need to normalize this absolute class separation with the use of the underlying class variances. Let Var(D1, α) and Var(D2, α) be the individual class variances along the direction α. In other words, we have:

\[ Var(D_1, \alpha) = \sum_{X \in D_1} \frac{(X \cdot \alpha)^2}{|D_1|} - \left( \sum_{X \in D_1} \frac{X \cdot \alpha}{|D_1|} \right)^2 \qquad (6.8) \]

The value of Var(D2, α) can be defined in a similar way. Then, the normalized class-separation J(α) is defined as follows:

\[ J(\alpha) = \frac{S(D_1, D_2, \alpha)}{Var(D_1, \alpha) + Var(D_2, \alpha)} \qquad (6.9) \]

The optimal value of α needs to be determined subject to the constraint that α is a unit vector.
Let μ1 and μ2 be the means of the two data sets D1 and D2, and let C1 and C2 be the corresponding covariance matrices. It can be shown that the optimal (unscaled) direction α = α∗ can be expressed in closed form, and is given by the following:

\[ \alpha^* = \left( \frac{C_1 + C_2}{2} \right)^{-1} (\mu_1 - \mu_2) \qquad (6.10) \]

The main difficulty in computing the above equation is that it requires the inversion of the covariance matrix, which is sparse and computationally difficult to invert in the high-dimensional text domain. Therefore, a gradient descent approach can be used in order to determine the value of α in a more computationally effective way. Details of the approach are presented in [18].

Another related method which attempts to determine projection directions that maximize the topical differences between different classes is the Topical Difference Factor Analysis method proposed in [72]. The problem has been shown to be solvable as a generalized eigenvalue problem. The method was used in conjunction with a k-nearest neighbor classifier, and it was shown that the use of this approach significantly improves the accuracy over a classifier which uses the original set of features.

2.8 Generalized Singular Value Decomposition

While the method discussed above finds one vector αi at a time in order to determine the relevant dimension transformation, it is possible to be much more direct in finding the optimal subspaces simultaneously by using a generalized version of dimensionality reduction [58, 59]. It is important to note that this method has really been proposed in [58, 59] as an unsupervised method which preserves the underlying clustering structure, assuming the data has already been clustered in a pre-processing phase. Thus, the generalized dimensionality reduction method has been proposed as a much more aggressive dimensionality reduction technique, which preserves the underlying clustering structure rather than the individual points. This method can however also be used as a supervised technique in which the different classes are used as input to the dimensionality reduction algorithm, instead of the clusters constructed in the pre-processing phase [131]. This method is known as the Optimal Orthogonal Centroid Feature Selection Algorithm (OCFS), and it directly targets the maximization of inter-class scatter. The algorithm is shown to have promising results for supervised feature selection in [131].

2.9 Interaction of Feature Selection with Classification

Since the classification and feature selection processes are dependent upon one another, it is interesting to test how the feature selection process interacts with the underlying classification algorithms. In this context, two questions are relevant:

Can the feature-specific insights obtained from the intermediate results of some of the classification algorithms be used for creating feature selection methods that can be used more generally by other classification algorithms?

Do the different feature selection methods work better or worse with different kinds of classifiers?

Both these issues were explored in some detail in [99]. In regard to the first question, it was shown in [99] that feature selection derived from linear classifiers provided very effective results. In regard to the second question, it was shown in [99] that the sophistication of the feature selection process itself was more important than the specific pairing between the feature selection process and the classifier.
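A minimal sketch of the weight-based selection strategy referenced above, and elaborated in the next paragraph, is as follows: a linear SVM is trained on the training corpus and the words with the largest absolute coefficients are retained for use by any downstream classifier. The scikit-learn calls and the cut-off parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def select_features_by_svm_weights(train_docs, train_labels, top_k=1000):
    """Rank words by |a_j| from a linear SVM and keep the top_k of them."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(train_docs)
    svm = LinearSVC().fit(X, train_labels)

    # For multi-class problems coef_ has one weight vector per class;
    # score each word by its largest absolute coefficient over classes.
    scores = np.abs(svm.coef_).max(axis=0)
    keep = np.argsort(scores)[::-1][:top_k]
    vocab = vec.get_feature_names_out()
    return vocab[keep]        # selected words, usable by any other classifier
```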
Linear Classifiers are those for which the output of the linear predictor is defined to be p = A · X + b, where X = (x1 . . . xn) is the normalized document word frequency vector, A = (a1 . . . an) is a vector of linear coefficients with the same dimensionality as the feature space, and b is a scalar. Both the basic neural network and the basic SVM classifiers [65] (which will be discussed later in this chapter) belong to this category. The idea here is that if the coefficient ai is close to zero, then the corresponding feature does not have a significant effect on the classification process. On the other hand, since large absolute values of ai may significantly influence the classification process, such features should be selected for classification. In the context of the SVM method, which attempts to determine linear planes of separation between the different classes, the vector A is essentially the normal vector to the corresponding plane of separation between the different classes. This intuitively explains the choice of selecting features with large values of |ai|. It was shown in [99] that this class of feature selection methods was quite robust, and performed well even for classifiers such as the Naive Bayes method, which were unrelated to the linear classifiers from which these features were derived. Further discussions on how SVM and maximum margin techniques can be used for feature selection may be found in [51, 56].

3. Decision Tree Classifiers

A decision tree [106] is essentially a hierarchical decomposition of the (training) data space, in which a predicate or a condition on the attribute value is used in order to divide the data space hierarchically. In the context of text data, such predicates are typically conditions on the presence or absence of one or more words in the document. The division of the data space is performed recursively in the decision tree, until the leaf nodes contain a certain minimum number of records, or satisfy certain conditions on class purity. The majority class label (or cost-weighted majority label) in the leaf node is used for the purposes of classification. For a given test instance, we apply the sequence of predicates at the nodes in order to traverse a path of the tree in top-down fashion and determine the relevant leaf node. In order to further reduce overfitting, some of the nodes may be pruned by holding out a part of the data, which is not used to construct the tree. The portion of the data which is held out is used in order to determine whether or not the constructed leaf node should be pruned. In particular, if the class distribution in the training data (for decision tree construction) is very different from the class distribution in the training data which is used for pruning, then it is assumed that the node overfits the training data. Such a node can be pruned. A detailed discussion of decision tree methods may be found in [15, 42, 62, 106].

In the particular case of text data, the predicates for the decision tree nodes are typically defined in terms of the terms in the underlying text collection. For example, a node may be partitioned into its children nodes depending upon the presence or absence of a particular term in the document. We note that different nodes at the same level of the tree may use different terms for the partitioning process. Many other kinds of predicates are possible.
It may not be necessary to use individual terms for partitioning, but one may measure the similarity of documents to correlated sets of terms. These correlated sets of terms may be used to further partition the document collection, based on the similarity of the document to them. The different kinds of splits are as follows: Single Attribute Splits: In this case, we use the presence or absence of particular words (or even phrases) at a particular node in the tree in order to perform the split. At any given level, we pick the word which provides the maximum discrimination between the different classes. Measures such as the gini-index or information gain can be used in order to determine the level of entropy. For example, the DT-min10 algorithm [81] is based on this approach. Similarity-based multi-attribute split: In this case, we use documents (or meta-documents such as frequent word clusters), and use the similarity of the documents to these words clusters in order to perform the split. For the selected word cluster, the documents are further partitioned into groups by rank ordering the documents by similarity value, and splitting at a particular threshold. We select the word-cluster for which rank-ordering by similarity provides the best separation between the different classes. Discriminant-based multi-attribute split: For the multi-attribute case, a natural choice for performing the split is to use discriminants such as the Fisher discriminant for performing the split. Such discriminants provide the directions in the data along which the classes are best separated. The documents are projected on this discriminant vector for rank ordering, and then split at a particular coordinate. The choice of split point is picked in order to maximize the discrimination between the different classes. 178 MINING TEXT DATA The work in [18] uses a discriminant-based split, though this is done indirectly because of the use of a feature transformation to the discriminant representation, before building the classifier. Some of the earliest implementation of classifiers may be found in [80, 81, 87, 127]. The last of these is really a rule-based classifier, which can be interpreted either as a decision tree or a rule-based classifier. Most of the decision tree implementations in the text literature tend to be small variations on standard packages such as ID3 and C4.5, in order to adapt the model to text classification. Many of these classifiers are typically designed as baselines for comparison with other learning models [65]. A well known implementation of the decision tree classifier is based on the C4.5 taxonomy of algorithms [106] is presented in [87]. More specifically, the work in [87] uses the successor to the C4.5 algorithm, which is also known as the C5 algorithm. This algorithm uses single-attribute splits at each node, where the feature with the highest information gain [31] is used for the purpose of the split. Decision trees have also been used in conjunction with boosting techniques. An adaptive boosting technique [48] is used in order to improve the accuracy of classification. In this technique, we use n different classifiers. The ith classifier is constructed by examining the errors of the (i − 1)th classifier. A voting scheme is applied among these classifiers in order to report the final label. Other boosting techniques for improving decision tree classification accuracy are proposed in [116]. 
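The single-attribute splits described above correspond closely to what an off-the-shelf decision tree learner produces when it is given binary term-presence features and an entropy (information gain) splitting criterion. The sketch below is a generic illustration of that setup under these assumptions, not a reimplementation of any specific system such as DT-min10 or C4.5/C5.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

def text_decision_tree(train_docs, train_labels):
    """Decision tree over the presence/absence of individual terms."""
    model = make_pipeline(
        # binary=True yields 0/1 term presence, so every split is a test
        # on the presence or absence of a single word.
        CountVectorizer(binary=True, stop_words="english"),
        # criterion="entropy" selects the split with the highest
        # information gain; min_samples_leaf limits overfitting in lieu
        # of the hold-out pruning discussed in the text.
        DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5),
    )
    return model.fit(train_docs, train_labels)

# Usage: clf = text_decision_tree(train_docs, train_labels)
#        predictions = clf.predict(test_docs)
```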
The work in [43] presents a decision tree algorithm based on the Bayesian approach developed in [22]. In this classifier, the decision tree is grown by recursive greedy splits, where the splits are chosen using Bayesian posterior probability of model structure. The structural prior penalizes additional model parameters at each node. The output of the process is a class probability rather than a deterministic class label for the test instance. 4. Rule-based Classifiers Decision trees are also generally related to rule-based classifiers. In rule-based classifiers, the data space is modeled with a set of rules, in which the left hand side is a condition on the underlying feature set, and the right hand side is the class label. The rule set is essentially the model which is generated from the training data. For a given test instance, we determine the set of rules for which the test instance satisfies the condition on the left hand side of the rule. We determine the predicted class label as a function of the class labels of the rules which are satisfied by the test instance. We will discuss more on this issue slightly later. A Survey of Text Classification Algorithms 179 In its most general form, the left hand side of the rule is a boolean condition, which is expressed in Disjunctive Normal Form (DNF). However, in most cases, the condition on the left hand side is much simpler and represents a set of terms, all of which must be present in the document for the condition to be satisfied. The absence of terms is rarely used, because such rules are not likely to be very informative for sparse text data, in which most words in the lexicon will typically not be present in it by default (sparseness property). Also, while the set intersection of conditions on term presence is used often, the union of such conditions is rarely used in a single rule. This is because such rules can be split into two separate rules, each of which is more informative on its own. For example, the rule Honda ∪ Toyota ⇒ Cars can be replaced by two separate rules Honda ⇒ Cars and Toyota ⇒ Cars without any loss of information. In fact, since the confidence of each of the two rules can now be measured separately, this can be more useful. On the other hand, the rule Honda ∩ Toyota ⇒ Cars is certainly much more informative than the individual rules. Thus, in practice, for sparse data sets such as text, rules are much more likely to be expressed as a simple conjunction of conditions on term presence. We note that decision trees and decision rules both tend to encode rules on the feature space, except that the decision tree tends to achieve this goal with a hierarchical approach. In fact, the original work on decision tree construction in C4.5 [106] studied the decision tree problem and decision rule problem within a single framework. This is because a particular path in the decision tree can be considered a rule for classification of the text instance. The main difference is that the decision tree framework is a strict hierarchical partitioning of the data space, whereas rule-based classifiers allow for overlaps in the decision space. The general principle is to create a rule set, such that all points in the decision space are covered by at least one rule. In most cases, this is achieved by generating a set of targeted rules which are related to the different classes, and one default catch-all rule, which can cover all the remaining instances. A number of criteria can be used in order to generate the rules from the training data. 
Two of the most common conditions which are used for rule generation are those of support and confidence. These conditions are common to all rule-based pattern classifiers [88] and may be defined as follows: Support: This quantifies the absolute number of instances in the training data set which are relevant to the rule. For example, in a corpus containing 100,000 documents, a rule in which both the left-hand set and right-hand side are satisfied by 50,000 documents 180 MINING TEXT DATA is more important than a rule which is satisfied by 20 documents. Essentially, this quantifies the statistical volume which is associated with the rule. However, it does not encode the strength of the rule. Confidence: This quantifies the conditional probability that the right hand side of the rule is satisfied, if the left-hand side is satisfied. This is a more direct measure of the strength of the underlying rule. We note that the afore-mentioned measures are not the only measures which are possible, but are widely used in the data mining and machine learning literature [88] for both textual and non-textual data, because of their intuitive nature and simplicity of interpretation. One criticism of the above measures is that they do not normalize for the a-priori presence of different terms and features, and are therefore prone to misinterpretation, when the feature distribution or class-distribution in the underlying data set is skewed. The training phase constructs all the rules, which are based on measures such as the above. For a given test instance, we determine all the rules which are relevant to the test instance. Since we allow overlaps, it is possible that more than one rule may be relevant to the test instance. If the class labels on the right hand sides of all these rules are the same, then it is easy to pick this class as the relevant label for the test instance. On the other hand, the problem becomes more challenging when there are conflicts between these different rules. A variety of different methods are used to rank-order the different rules [88], and report the most relevant rule as a function of these different rules. For example, a common approach is to rank-order the rules by their confidence, and pick the top-k rules as the most relevant. The class label on the right-hand side of the most number of these rules is reported as the relevant one. Am interesting rule-based classifier for the case of text data has been proposed in [5]. This technique uses an iterative methodology, which was first proposed in [128] for generating rules. Specifically, the method determines the single best rule related to any particular class in the training data. The best rule is defined in terms of the confidence of the rule, as defined above. This rule along with its corresponding instances are removed from the training data set. This approach is continuously repeated, until it is no longer possible to find strong rules in the training data, and complete predictive value is achieved. The transformation of decision trees to rule-based classifiers is discussed generally in [106], and for the particular case of text data in [68]. For each path in the decision tree a rule can be generated, which repre- A Survey of Text Classification Algorithms 181 sents the conjunction of the predicates along that path. 
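Returning to the support and confidence measures defined above, the short sketch below computes both quantities for a candidate rule of the form {set of terms} ⇒ class over a toy labeled corpus; the corpus and the rule are hypothetical and serve only to make the definitions concrete.

```python
def support_and_confidence(docs, labels, rule_terms, rule_class):
    """docs: list of token sets; labels: one class label per document.
    A document satisfies the left-hand side if it contains all rule_terms."""
    lhs = [terms.issuperset(rule_terms) for terms in docs]
    n_lhs = sum(lhs)                                   # documents satisfying the left-hand side
    n_both = sum(m and (lab == rule_class)             # ...that also carry the rule's class label
                 for m, lab in zip(lhs, labels))
    support = n_both                                   # absolute number of instances relevant to the rule
    confidence = n_both / n_lhs if n_lhs else 0.0      # P(right-hand side | left-hand side)
    return support, confidence

docs = [{"honda", "toyota", "dealer"}, {"honda", "price"},
        {"stock", "price", "market"}, {"toyota", "recall"}]
labels = ["cars", "cars", "finance", "cars"]
print(support_and_confidence(docs, labels, {"honda"}, "cars"))   # -> (2, 1.0)
```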
One advantage of the rule-based classifier over a decision tree is that it is not restricted to a strict hierarchical partitioning of the feature space, and it allows for overlaps and inconsistencies among the different rules. Therefore, if a new set of training examples are encountered, which are related to a new class or new part of the feature space, then it is relatively easy to modify the rule set for these new examples. Furthermore, rule-based classifiers also allow for a tremendous interpretability of the underlying decision space. In cases in which domain-specific expert knowledge is known, it is possible to encode this into the classification process by manual addition of rules. In many practical scenarios, rule-based techniques are more commonly used because of their ease of maintenance and interpretability. One of the most common rule-based techniques is the RIPPER technique discussed in [26–28]. The RIPPER technique essentially determines frequent combinations of words which are related to a particular class. The RIPPER method has been shown to be especially effective in scenarios where the number of training examples is relatively small [25]. Another method called sleeping experts [26, 49] generates rules which take the placement of the words in the documents into account. Most of the classifiers such as RIPPER [26–28] treat documents as set-valued objects, and generate rules based on the co-presence of the words in the documents. The rules in sleeping experts are different from most of the other classifiers in this respect. In this case [49, 26], the left hand side of the rule consists of a sparse phrase, which is a group of words close to one another in the document (though not necessarily completely sequential). Each such rule has a weight, which depends upon its classification specificity in the training data. For a given test example, we determine the sparse phrases which are present in it, and perform the classification by combining the weights of the different rules that are fired. The sleeping experts and RIPPER systems have been compared in [26], and have been shown to have excellent performance on a variety of text collections. 5. Probabilistic and Naive Bayes Classifiers Probabilistic classifiers are designed to use an implicit mixture model for generation of the underlying documents. This mixture model typically assumes that each class is a component of the mixture. Each mixture component is essentially a generative model, which provides the probability of sampling a particular term for that component or class. This is why this kind of classifiers are often also called generative classifier. The naive Bayes classifier is perhaps the simplest and also the most 182 MINING TEXT DATA commonly used generative classifers. It models the distribution of the documents in each class using a probabilistic model with independence assumptions about the distributions of different terms. Two classes of models are commonly used for naive Bayes classification. Both models essentially compute the posterior probability of a class, based on the distribution of the words in the document. These models ignore the actual position of the words in the document, and work with the “bag of words” assumption. 
The major difference between these two models is the assumption in terms of taking (or not taking) word frequencies into account, and the corresponding approach for sampling the probability space: Multivariate Bernoulli Model: In this model, we use the presence or absence of words in a text document as features to represent a document. Thus, the frequencies of the words are not used for the modeling a document, and the word features in the text are assumed to be binary, with the two values indicating presence or absence of a word in text. Since the features to be modeled are binary, the model for documents in each class is a multivariate Bernoulli model. Multinomial Model: In this model, we captuer the frequencies of terms in a document by representing a document with a bag of words. The documents in each class can then be modeled as samples drawn from a multinomial word distribution. As a result, the conditional probability of a document given a class is simply a product of the probability of each observed word in the corresponding class. No matter how we model the documents in each class (be it a multivariate Bernoulli model or a multinomial model), the component class models (i.e., generative models for documents in each class) can be used in conjunction with the Bayes rule to compute the posterior probability of the class for a given document, and the class with the highest posterior probability can then be assigned to the document. There has been considerable confusion in the literature on the differences between the multivariate Bernoulli model and the multinomial model. A good exposition of the differences between these two models may be found in [94]. In the following, we describe these two models in more detail. A Survey of Text Classification Algorithms 183 5.1 Bernoulli Multivariate Model This class of techniques treats a document as a set of distinct words with no frequency information, in which an element (term) may be either present or absent. The seminal work on this approach may be found in [82]. Let us assume that the lexicon from which the terms are drawn are denoted by V = {t1 . . . tn}. Let us assume that the bag-of-words (or text document) in question contains the terms Q = {ti1 . . . tim }, and the class is drawn from {1 . . . k}. Then, our goal is to model the posterior probability that the document (which is assumed to be generated from the term distributions of one of the classes) belongs to class i, given that it contains the terms Q = {ti1 . . . tim }. The best way to understand the Bayes method is by understanding it as a sampling/generative process from the underlying mixture model of classes. The Bayes probability of class i can be modeled by sampling a set of terms T from the term distribution of the classes: If we sampled a term set T of any size from the term distribution of one of the randomly chosen classes, and the final outcome is the set Q, then what is the posterior probability that we had originally picked class i for sampling? The a-priori probability of picking class i is equal to its fractional presence in the collection. We denote the class of the sampled set T by CT and the corresponding posterior probability by P(CT = i|T = Q). This is essentially what we are trying to find. It is important to note that since we do not allow replacement, we are essentially picking a subset of terms from V with no frequencies attached to the picked terms. Therefore, the set Q may not contain duplicate elements. 
Under the naive Bayes assumption of independence between terms, this is essentially equivalent to either selecting or not selecting each term with a probability that depends upon the underlying term distribution. Furthermore, it is also important to note that this model has no restriction on the number of terms picked. As we will see later, these assumptions are the key differences with the multinomial Bayes model. The Bayes approach classifies a given set Q based on the posterior probability that Q is a sample from the data distribution of class i, i.e., P(CT = i|T = Q), and it requires us to compute the following two probabilities in order to achieve this:

1. What is the prior probability that a set T is a sample from the term distribution of class i? This probability is denoted by P(CT = i).

2. If we sampled a set T of any size from the term distribution of class i, then what is the probability that our sample is the set Q? This probability is denoted by P(T = Q|CT = i).

We will now provide a more mathematical description of Bayes modeling. In other words, we wish to model P(CT = i|Q is sampled). We can use the Bayes rule in order to write this conditional probability in a way that can be estimated more easily from the underlying corpus. In other words, we can simplify as follows:

P(C_T = i \mid T = Q) = \frac{P(C_T = i) \cdot P(T = Q \mid C_T = i)}{P(T = Q)} = \frac{P(C_T = i) \cdot \prod_{t_j \in Q} P(t_j \in T \mid C_T = i) \cdot \prod_{t_j \notin Q} \left(1 - P(t_j \in T \mid C_T = i)\right)}{P(T = Q)}

We note that the last condition of the above sequence uses the naive independence assumption, because we are assuming that the probabilities of occurrence of the different terms are independent of one another. This is practically necessary, in order to transform the probability equations to a form which can be estimated from the underlying data. The class assigned to Q is the one with the highest posterior probability given Q. It is easy to see that this decision is not affected by the denominator, which is the marginal probability of observing Q. That is, we will assign the following class to Q:

\hat{i} = \arg\max_i P(C_T = i \mid T = Q) = \arg\max_i \; P(C_T = i) \cdot \prod_{t_j \in Q} P(t_j \in T \mid C_T = i) \cdot \prod_{t_j \notin Q} \left(1 - P(t_j \in T \mid C_T = i)\right)

It is important to note that all terms in the right hand side of the last equation can be estimated from the training corpus. The value of P(CT = i) is estimated as the global fraction of documents belonging to class i, the value of P(tj ∈ T|CT = i) is the fraction of documents in the ith class which contain term tj, and the value of P(tj ∈ T) is the fraction of documents (in the whole corpus) containing the term tj. We note that all of the above are maximum likelihood estimates of the corresponding probabilities. In practice, Laplacian smoothing [124] is used, in which small values are added to the frequencies of terms in order to avoid zero probabilities of sparsely present terms. In most applications of the Bayes classifier, we only care about the identity of the class with the highest probability value, rather than the actual probability value associated with it, which is why we do not need to compute the normalizer P(T = Q). In fact, in the case of binary classes, a number of simplifications are possible in computing these Bayes "probability" values by using the logarithm of the Bayes expression, and removing a number of terms which do not affect the ordering of class probabilities. We refer the reader to [108] for details.
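The estimation and scoring steps above are straightforward to implement directly. The sketch below estimates P(CT = i) and P(tj ∈ T|CT = i) from a toy labeled corpus with Laplacian smoothing, and scores a test document in log space under the multivariate Bernoulli model; the data are hypothetical and the code is a minimal illustration of the formulas, not the implementation of [82].

```python
import math
from collections import defaultdict

def train_bernoulli_nb(docs, labels, vocab, alpha=1.0):
    """docs: list of term sets. Returns class priors and smoothed P(t in T | class)."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = defaultdict(dict)
    for c in classes:
        class_docs = [d for d, lab in zip(docs, labels) if lab == c]
        for t in vocab:
            # Laplacian smoothing avoids zero probabilities for sparsely present terms.
            cond[c][t] = (sum(t in d for d in class_docs) + alpha) / (len(class_docs) + 2 * alpha)
    return prior, cond

def classify_bernoulli_nb(query, vocab, prior, cond):
    """Log of the unnormalized posterior, using both term presence and absence."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in vocab:
            p = cond[c][t]
            score += math.log(p) if t in query else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get), scores

docs = [{"cheap", "offer"}, {"cheap", "pills"}, {"meeting", "budget"}, {"meeting", "schedule"}]
labels = ["spam", "spam", "work", "work"]
vocab = sorted(set().union(*docs))
prior, cond = train_bernoulli_nb(docs, labels, vocab)
print(classify_bernoulli_nb({"cheap", "meeting"}, vocab, prior, cond)[0])
```

Working in log space mirrors the simplification mentioned above for comparing class "probability" values without computing the normalizer.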
Although for classification, we do not need to compute P(T = Q), some applications necessitate the exact computation of the posterior probability P(CT = i|T = Q). For example, in the case of supervised anomaly detection (or rare class detection), the exact posterior probability value P(CT = i|T = Q) is needed in order to fairly compare the probability value over different test instances, and rank them for their anomalous nature. In such cases, we would need to compute P(T = Q). One way to achieve this is simply to take a sum over all the classes:

P(T = Q) = \sum_i P(T = Q \mid C_T = i) \cdot P(C_T = i)

This is based on the conditional independence of features for each class. Since the parameter values are estimated for each class separately, we may face the problem of data sparseness. An alternative way of computing it, which may alleviate the data sparseness problem, is to further make the assumption of (global) independence of terms, and compute it as:

P(T = Q) = \prod_{t_j \in Q} P(t_j \in T) \cdot \prod_{t_j \notin Q} \left(1 - P(t_j \in T)\right)

where the term probabilities are based on global term distributions in all the classes.

A natural question arises, as to whether it is possible to design a Bayes classifier which does not use the naive assumption, and models the dependencies between the terms during the classification process. Methods which generalize the naive Bayes classifier by not using the independence assumption do not work well because of the higher computational costs and the inability to estimate the parameters accurately and robustly in the presence of limited data. The most interesting line of work in relaxing the independence assumption is provided in [112]. In this work, the tradeoffs in the spectrum of allowing different levels of dependence among the terms have been explored. On the one extreme, an assumption of complete dependence results in a Bayesian network model which turns out to be computationally very expensive. On the other hand, it has been shown that allowing limited levels of dependence can provide good tradeoffs between accuracy and computational costs. We note that while the independence assumption is a practical approximation, it has been shown in [29, 39] that the approach does have some theoretical merit. Indeed, extensive experimental tests have tended to show that the naive classifier works quite well in practice. A number of papers [19, 64, 74, 79, 108, 113] have used the naive Bayes approach for classification in a number of different application domains. The classifier has also been extended to modeling temporally aware training data, in which the importance of a document may decay with time [114].

As in the case of other statistical classifiers, the naive Bayes classifier [113] can easily incorporate domain-specific knowledge into the classification process. The particular domain that the work in [113] addresses is that of filtering junk email. For such a problem, we often have a lot of additional domain knowledge which helps us determine whether a particular email message is junk or not. For example, some common characteristics which make an email more or less likely to be junk are as follows:

The domain of the sender, such as .edu or .com, can make an email more or less likely to be junk.

Phrases such as "Free Money" or over-emphasized punctuation such as "!!!" can make an email more likely to be junk.

Whether the recipient of the message was a particular user, or a mailing list.
The Bayes method provides a natural way to incorporate such additional information into the classification process, by creating new features for each of these characteristics. The standard Bayes technique is then used in conjunction with this augmented representation for classification. The Bayes technique has also been used in conjunction with the incorporation of other kinds of domain knowledge, such as the incorporation of hyperlink information into the classification process [20, 104].

The Bayes method is also suited to hierarchical classification, when the training data is arranged in a taxonomy of topics. For example, the Open Directory Project (ODP), the Yahoo! taxonomy, and a variety of news sites have vast collections of documents which are arranged into hierarchical groups. The hierarchical structure of the topics can be exploited to perform more effective classification [19, 74], because it has been observed that context-sensitive feature selection can provide more useful classification results. In hierarchical classification, a Bayes classifier is built at each node, which then provides us with the next branch to follow for classification purposes. Two such methods are proposed in [19, 74], in which node-specific features are used for the classification process. Clearly, far fewer features are required at a particular node in the hierarchy, because the features which are picked are relevant to that branch. An example in [74] suggests that a branch of the taxonomy which is related to Computer may have no relationship with the word "cow". These node-specific features are referred to as signatures in [19]. Furthermore, it has been observed in [19] that in a given node, the most discriminative features for a given class may be different from those at its parent nodes. For example, the word "health" may be discriminative for the Yahoo! category @Health, but the word "baby" may be much more discriminative for the category @Health@Nursing. Thus, it is critical to have an appropriate feature selection process at each node of the classification tree.

The methods in [19, 74] use different approaches for this purpose. The work in [74] uses an information-theoretic approach [31] for feature selection which takes into account the dependencies between the attributes [112]. The algorithm greedily eliminates the features one by one so as to least disrupt the conditional class distribution at that node. In [19], the node-specific signatures are computed by calculating the ratio of intra-class variance to inter-class variance for the different words at each node of the tree. We note that this measure is the same as that optimized by the Fisher's discriminant, except that it is applied to the original set of words, rather than solved as a general optimization problem in which arbitrary directions in the data are picked. A Bayesian classifier is constructed at each node in order to determine the appropriate branch. One advantage of these methods is that Bayesian classifiers work much more effectively with a small number of context-sensitive features. Another major difference between the two methods is that the work in [74] uses the Bernoulli model, whereas that in [19] uses the multinomial model, which will be discussed in the next subsection.
This approach in [74] is referred to as the Pachinko Machine classifier and that in [19] is known as TAPER (Taxonomy and Path Enhanced Retrieval System). Other noteworthy methods for hierarchical classification are proposed in [11, 130, 95]. The work [11] addresses two common problems associated with hierarchical text classification: (1) error propagation; (2) non-linear decision surfaces. The problem of error propagation occurs when the classification mistakes made at a parent node are propagated to its children node. This problem was solved in [11] by using cross validation to obtain a training data set for a child node that is more similar 188 MINING TEXT DATA to the actual test data passed to the child node from its parent node than the training data set normally used for training a classifier at the child node. The problem of non-linear decision surfaces refers to that the decision boundary of a category at a higher level is often non-linear (since its members are the union of the members of its children nodes). This problem is addressed by using the tentative class labels obtained at the children nodes as features for use at a parent node. These are general strategies that can be applied to any base classifier, and the experimental results in [11] show that both strategies are effective. 5.2 Multinomial Distribution This class of techniques treats a document as a set of words with frequencies attached to each word. Thus, the set of words is allowed to have duplicate elements. As in the previous case, we assume that the set of words in document is denoted by Q, drawn from the vocabulary set V . The set Q contains the distinct terms {ti1 . . . tim } with associated frequencies F = {Fi1 . . . Fim }. We denote the terms and their frequencies by [Q, F]. The total number of terms in the document (or document length) is denoted by L = m j=1 F(ij). Then, our goal is to model the posterior probability that the document T belongs to class i, given that it contains the terms in Q with the associated frequencies F. The Bayes probability of class i can be modeled by using the following sampling process: If we sampled L terms sequentially from the term distribution of one of the randomly chosen classes (allowing repetitions) to create the term set T, and the final outcome for sampled set T is the set Q with the corresponding frequencies F, then what is the posterior probability that we had originally picked class i for sampling? The a-priori probability of picking class i is equal to its fractional presence in the collection. The aforementioned probability is denoted by P(CT = i|T = [Q, F]). An assumption which is commonly used in these models is that the length of the document is independent of the class label. While it is easily possible to generalize the method, so that the document length is used as a prior, independence is usually assumed for simplicity. As in the previous case, we need to estimate two values in order to compute the Bayes posterior. 1 What is the prior probability that a set T is a sample from the term distribution of class i? This probability is denoted by P(CT = i). A Survey of Text Classification Algorithms 189 2 If we sampled L terms from the term distribution of class i (with repetitions), then what is the probability that our sampled set T is the set Q with associated frequencies F? This probability is denoted by P(T = [Q, F]|CT = i). 
Then, the Bayes rule can be applied to this case as follows:

P(C_T = i \mid T = [Q, F]) = \frac{P(C_T = i) \cdot P(T = [Q, F] \mid C_T = i)}{P(T = [Q, F])} \propto P(C_T = i) \cdot P(T = [Q, F] \mid C_T = i) \quad (6.11)

As in the previous case, it is not necessary to compute the denominator, P(T = [Q, F]), for the purpose of deciding the class label for Q. The value of the probability P(CT = i) can be estimated as the fraction of documents belonging to class i. The computation of P([Q, F]|CT = i) is much more complicated. When we consider the sequential order of the L different samples, the number of possible ways to sample the different terms so as to result in the outcome [Q, F] is given by L! / \prod_{j=1}^{m} F_{i_j}!. The probability of each of these sequences is given by \prod_{t_j \in Q} P(t_j \in T \mid C_T = i)^{F_j}, where F_j denotes the frequency of term t_j in the document, by using the naive independence assumption. Therefore, we have:

P(T = [Q, F] \mid C_T = i) = \frac{L!}{\prod_{j=1}^{m} F_{i_j}!} \cdot \prod_{t_j \in Q} P(t_j \in T \mid C_T = i)^{F_j} \quad (6.12)

We can substitute Equation 6.12 into Equation 6.11 to obtain the class with the highest Bayes posterior probability, where the class priors are computed as in the previous case, and the probabilities P(tj ∈ T|CT = i) can also be easily estimated as previously with Laplacian smoothing [124]. We note that the probabilities of class absence are not present in the above equations because of the way in which the sampling is performed.

A number of different variations of the multinomial model have been proposed in [53, 70, 84, 95, 97, 103]. In the work [95], it is shown that a category hierarchy can be leveraged to improve the estimate of the multinomial parameters in the naive Bayes classifier and thereby significantly improve classification accuracy. The key idea is to apply shrinkage techniques to smooth the parameters for data-sparse child categories with their common parent nodes. As a result, the training data of related categories are essentially "shared" with each other in a weighted manner, which helps improve the robustness and accuracy of parameter estimation when there are insufficient training data for each individual child category. The work in [94] has performed an extensive comparison between the Bernoulli and the multinomial models on different corpora, and the following conclusions were presented:

The multi-variate Bernoulli model can sometimes perform better than the multinomial model at small vocabulary sizes.

The multinomial model outperforms the multi-variate Bernoulli model for large vocabulary sizes, and almost always beats the multi-variate Bernoulli model when the vocabulary size is chosen optimally for both. On average, a 27% reduction in error was reported in [94].

The afore-mentioned results seem to suggest that the two models may have different strengths, and may therefore be useful in different scenarios.

5.3 Mixture Modeling for Text Classification

We note that the afore-mentioned Bayes methods simply assume that each component of the mixture corresponds to the documents belonging to a class. A more general interpretation is one in which the components of the mixture are created by a clustering process, and the class membership probabilities are modeled in terms of this mixture. Mixture modeling is typically used for unsupervised (probabilistic) clustering or topic modeling, though the use of clustering can also help in enhancing the effectiveness of probabilistic classifiers [86, 103]. These methods are particularly useful in cases where the amount of training data is limited.
In particular, clustering can help in the following ways:

The Bayes method implicitly estimates the word probabilities P(ti ∈ T|CT = i) of a large number of terms in terms of their fractional presence in the corresponding component. This is clearly noisy. By treating the clusters as separate entities from the classes, we now only need to relate (a much smaller number of) cluster membership probabilities to class probabilities. This reduces the number of parameters and greatly improves classification accuracy [86].

The use of clustering can help in incorporating unlabeled documents into the training data for classification. The premise is that unlabeled data is much more copiously available than labeled data, and when labeled data is sparse, the unlabeled data should be used in order to assist the classification process. While such unlabeled documents do not contain class-specific information, they do contain a lot of information about the clustering behavior of the underlying data. This can be very useful for more robust modeling [103], when the amount of training data is low. This general approach is also referred to as co-training [9, 13, 37].

The common characteristic of both the methods [86, 103] is that they both use a form of supervised clustering for the classification process. While the goal is quite similar (limited training data), the approach used for this purpose is quite different. We will discuss both of these methods in this section.

In the method discussed in [86], the document corpus is modeled with the use of supervised word clusters. In this case, the k mixture components are clusters which are correlated to, but are distinct from, the k groups of documents belonging to the different classes. The main difference from the Bayes method is that the term probabilities are computed indirectly by using clustering as an intermediate step. For a sampled document T, we denote its class label by CT ∈ {1 . . . k}, and its mixture component by MT ∈ {1 . . . k}. The k different mixture components are essentially word-clusters whose frequencies are generated by using the frequencies of the terms in the k different classes. This ensures that the word clusters for the mixture components are correlated to the classes, but they are not assumed to be drawn from the same distribution. As in the previous case, let us assume that a document contains the set of words Q. Then, we would like to estimate the probability P(T = Q|CT = i) for each class i. An interesting variation of the work in [86] from the Bayes approach is that it does not attempt to determine the posterior probability P(CT = i|T = Q). Rather, it simply reports the class with the highest likelihood P(T = Q|CT = i). This is essentially equivalent to assuming in the Bayes approach that the prior distribution of each class is the same. The other difference of the approach is in terms of how the value of P(T = Q|CT = i) is computed. As before, we need to estimate the value of P(tj ∈ T|CT = i), according to the naive Bayes rule. However, unlike the standard Bayes classifier, this is done very indirectly with the use of mixture modeling. Since the mixture components do not directly correspond to the class, this term can only be estimated by summing up the expected value over all the mixture components:

P(t_j \in T \mid C_T = i) = \sum_{s=1}^{k} P(t_j \in T \mid M_T = s) \cdot P(M_T = s \mid C_T = i) \quad (6.13)

The value of P(tj ∈ T|MT = s) is easy to estimate by using the fractional presence of term tj in the sth mixture component.
The main unknown here are the set of model parameters P(MT = s|CT = i). 192 MINING TEXT DATA Since a total of k classes and k mixture-components are used, this requires the estimation of only k2 model parameters, which is typically quite modest for a small number of classes. An EM-approach has been used in [86] in order to estimate this small number of model parameters in a robust way. It is important to understand that the work in [86] is an interesting combination of supervised topic modeling (dimensionality reduction) and Bayes classification after reducing the effective dimensionality of the feature space to a much smaller value by clustering. The scheme works well because of the use of supervision in the topic modeling process, which ensures that the use of an intermediate clustering approach does not lose information for classification. We also note that in this model, the number of mixtures can be made to vary from the number of classes. While the work in [86] does not explore this direction, there is really no reason to assume that the number of mixture components is the same as the number of classes. Such an assumption can be particularly useful for data sets in which the classes may not be contiguous in the feature space, and a natural clustering may contain far more components than the number of classes. Next, we will discuss the second method [103] which uses unlabeled data. The approach is [103] uses the unlabeled data in order to improve the training model. Why should unlabeled data help in classification at all? In order to understand this point, recall that the Bayes classification process effectively uses k mixture components, which are assumed to be the k different classes. If we had an infinite amount of training data, it would be possible to create the mixture components, but it would not be possible to assign labels to these components. However, the most data-intensive part of modeling the mixture, is that of determining the shape of the mixture components. The actual assignment of mixture components to class labels can be achieved with a relatively small number of class labels. It has been shown in [24] that the accuracy of assigning components to classes increases exponentially with the number of labeled samples available. Therefore, the work in [103] designs an EM-approach [36] to simultaneously determine the relevant mixture model and its class assignment. It turns out that the EM-approach, as applied to this problem is quite simple to implement. It has been shown in [103] that the EMapproach is equivalent to the following iterative methodology. First, a naive Bayes classifier is constructed by estimating the model parameters from the labeled documents only. This is used in order to assign probabilistically-weighted class labels to the unlabeled documents. Then, the Bayes classifier is re-constructed, except that we also use the newly labeled documents in the estimation of the underlying model A Survey of Text Classification Algorithms 193 parameters. We again use this classifier to re-classify the (originally unlabeled) documents. The process is continually repeated till convergence is achieved. The ability to significantly improve the quality of text classification with a small amount of labeled data, and the use of clustering on a large amount of unlabeled data has been a recurring theme in the text mining literature. 
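A minimal sketch of this iterative methodology is shown below, with scikit-learn's MultinomialNB standing in for the naive Bayes component: the model is first fit on the labeled documents, the unlabeled documents then receive probabilistically-weighted class assignments, and the model is refit on the weighted union. The toy data, the dense conversion, and the fixed number of iterations are simplifying assumptions for illustration, not the exact procedure of [103].

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled = ["cheap loan offer", "meeting schedule today"]
y = np.array([0, 1])                               # toy labels: 0 = spam, 1 = work
unlabeled = ["cheap pills offer now", "budget meeting moved", "offer schedule"]

vec = CountVectorizer().fit(labeled + unlabeled)   # vocabulary from all available documents
X_lab = vec.transform(labeled).toarray()
X_unl = vec.transform(unlabeled).toarray()
n_classes = 2

clf = MultinomialNB().fit(X_lab, y)                # initial model from labeled data only
for _ in range(5):                                 # EM-style iterations (fixed count for simplicity)
    probs = clf.predict_proba(X_unl)               # E-step: probabilistic labels for unlabeled docs
    # M-step: refit on the labeled docs plus one weighted copy of each unlabeled doc per class.
    X_aug = np.vstack([X_lab] + [X_unl] * n_classes)
    y_aug = np.concatenate([y] + [np.full(len(unlabeled), c) for c in range(n_classes)])
    w_aug = np.concatenate([np.ones(len(y))] + [probs[:, c] for c in range(n_classes)])
    clf = MultinomialNB().fit(X_aug, y_aug, sample_weight=w_aug)

print(clf.predict(vec.transform(["cheap offer today"]).toarray()))
```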
For example, the method in [122] performs purely unsupervised clustering (with no knowledge of class labels), and then as a final step assigns all documents in the cluster to the dominant class label of that cluster (as an evaluation step for the unsupervised clustering process in terms of its ability to match clusters to known topics; in a supervised application, this last step would require only a small number of class labels in the cluster to be known in order to determine the dominant label very accurately). It has been shown that this approach is able to achieve a comparable accuracy of matching clusters to topics as a supervised Naive Bayes classifier trained over a small data set of about 1000 documents. Similar results were obtained in [47], where the quality of the unsupervised clustering process was shown to be comparable to that of an SVM classifier which was trained over a small data set.

6. Linear Classifiers

Linear Classifiers are those for which the output of the linear predictor is defined to be p = A · X + b, where X = (x1 . . . xn) is the normalized document word frequency vector, A = (a1 . . . an) is a vector of linear coefficients with the same dimensionality as the feature space, and b is a scalar. A natural interpretation of the predictor p = A · X + b in the discrete scenario (categorical class labels) would be as a separating hyperplane between the different classes. Support Vector Machines [30, 125] are a form of classifiers which attempt to determine "good" linear separators between the different classes. One characteristic of linear classifiers is that they are closely related to many feature transformation methods (such as the Fisher discriminant), which attempt to use these directions in order to transform the feature space, and then use other classifiers on this transformed feature space [51, 56, 99]. Thus, linear classifiers are intimately related to linear feature transformation methods as well.

Regression modeling (such as the least squares method) is a more direct and traditional statistical method for text classification. However, it is generally used in cases where the target variable to be learned is numerical rather than categorical. A number of methods have been proposed in the literature for adapting such methods to the case of text data classification [134]. A comparison of different linear regression techniques for classification, including SVM, may be found in [138]. Finally, simple neural networks are also a form of linear classifiers, since the function computed by a set of neurons is essentially linear. The simplest form of neural network, known as the perceptron (or single layer network), is essentially designed for linear separation, and works well for text. However, by using multiple layers of neurons, it is also possible to generalize the approach for non-linear separation. In this section, we will discuss the different linear methods for text classification.

6.1 SVM Classifiers

Support-vector machines were first proposed in [30, 124] for numerical data. The main principle of SVMs is to determine separators in the search space which can best separate the different classes. For example, consider the example illustrated in Figure 6.1, in which we have two classes denoted by 'x' and 'o' respectively. We have denoted three different separating hyperplanes, which are denoted by A, B, and C respectively.

Figure 6.1. What is the Best Separating Hyperplane?
It is evident that the hyperplane A provides the best separation between the different classes, because the normal distance of any of the data points from it is the largest. Therefore, the hyperplane A represents the maximum margin of separation. We note that the normal vector to this hyperplane (represented by the arrow in the figure) is a direction in the feature space along which we have the maximum discrimination. One advantage of the SVM method is that since it attempts to determine the optimum direction of discrimination in the feature space by examining the appropriate combination of features, it is quite robust to high dimen- A Survey of Text Classification Algorithms 195 sionality. It has been noted in [64] that text data is ideally suited for SVM classification because of the sparse high-dimensional nature of text, in which few features are irrelevant, but they tend to be correlated with one another and generally organized into linearly separable categories. We note that it is not necessary to use a linear function for the SVM classifier. Rather, with the kernel trick [6], SVM can construct a nonlinear decision surface in the original feature space by mapping the data instances non-linearly to an inner product space where the classes can be separated linearly with a hyperplane. However, in practice, linear SVM is used most often because of their simplicity and ease of interpretability. The first set of SVM classifiers, as adapted to the text domain were proposed in [64–66]. A deeper theoretical study of the SVM method has been provided in [67]. In particular, it has been shown why the SVM classifier is expected to work well under a wide variety of circumstances. This has also been demonstrated experimentally in a few different scenarios. For example, the work in [41] applied the method to email data for classifying it as spam or non-spam data. It was shown that the SVM method provides much more robust performance as compared to many other techniques such as boosting decision trees, the rule based RIPPER method, and the Rocchio method. The SVM method is flexible and can easily be combined with interactive user-feedback methods [107]. We note that the problem of finding the best separator is essentially an optimization problem, which can typically be reduced to a Quadratic Programming problem. For example, many of these methods use Newton’s method for iterative minimization of a convex function. This can sometimes be slow, especially for high dimensional domains such as text data. It has been shown [43] that by breaking a large Quadratic Programming problem (QP problem) to a set of smaller problems, an efficient solution can be derived for the task. The SVM approach has also been used successfully [44] in the context of a hierarchical organization of the classes, as often occurs in web data. In this approach, a different classifier is built at different positions of the hierarchy. The SVM classifier has also been shown to be useful in large scale scenarios in which a large amount of unlabeled data and a small amount of labeled data is available [120]. This is essentially a semi-supervised approach because of its use of unlabeled data in the classification process. This techniques is also quite scalable because of its use of a number of modified quasi-newton techniques, which tend to be efficient in practice. 196 MINING TEXT DATA 6.2 Regression-Based Classifiers Regression modeling is a method which is commonly used in order to learn the relationships between real-valued attributes. 
Typically, these methods are designed for real-valued attributes, as opposed to binary attributes. This is however not an impediment to their use in classification, because the binary value of a class may be treated as a rudimentary special case of a real value, and some regression methods such as logistic regression can also naturally model discrete response variables.

An early application of regression to text classification is the Linear Least Squares Fit (LLSF) method [134], which works as follows. Suppose the predicted class label is pi = A · Xi + b, and yi is known to be the true class label. Then, our aim is to learn the values of A and b, such that the linear least squares fit

\sum_{i=1}^{n} (p_i - y_i)^2

is minimized. In practice, the value of b is set to 0 for the learning process. Let P be a 1 × n vector of binary values indicating the binary class to which the corresponding document belongs. Thus, if X is the n × d term matrix, then we wish to determine the 1 × d vector of regression coefficients A for which ||A · X^T − P|| is minimized, where || · || represents the Frobenius norm. The problem can be easily generalized from the binary class scenario to the multi-class scenario with k classes, by using P as a k × n matrix of binary values. In this matrix, exactly one value in each column is 1, and the corresponding row identifier represents the class to which that instance belongs. Similarly, A is a k × d matrix in the multi-class scenario. The LLSF method has been compared to a variety of other methods [132, 134, 138], and has been shown to be very robust in practice.

A more natural way of modeling the classification problem with regression is the logistic regression classifier [102], which differs from the LLSF method in that the objective function to be optimized is the likelihood function. Specifically, instead of using pi = A · Xi + b directly to fit the true label yi, we assume that the probability of observing label yi is:

p(C = y_i \mid X_i) = \frac{\exp(A \cdot X_i + b)}{1 + \exp(A \cdot X_i + b)}

This gives us a conditional generative model for yi given Xi. Putting it in another way, we assume that the logit transformation of p(C = yi|Xi) can be modeled by a linear combination of the features of the instance Xi, i.e.,

\log \frac{p(C = y_i \mid X_i)}{1 - p(C = y_i \mid X_i)} = A \cdot X_i + b

Thus logistic regression is also a linear classifier, as the decision boundary is determined by a linear function of the features. In the case of binary classification, p(C = yi|Xi) can be used to determine the class label (e.g., using a threshold of 0.5). In the case of multi-class classification, we have p(C = yi|Xi) ∝ exp(A · Xi + b), and the class label with the highest value according to p(C = yi|Xi) would be assigned to Xi. Given a set of training data points {(X1, y1), . . . , (Xn, yn)}, the logistic regression classifier can be trained by choosing the parameters A to maximize the conditional likelihood \prod_{i=1}^{n} p(y_i \mid X_i).

In some cases, the domain knowledge may be of the form where some sets of words are more important than others for a classification problem. For example, in a classification application, we may know that certain domain words (Knowledge Words (KW)) may be more important to the classification of a particular target category than other words. In such cases, it has been shown [35] that it may be possible to encode such domain knowledge into the logistic regression model in the form of a prior on the model parameters, and to use Bayesian estimation of the model parameters.
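The sketch below makes the two formulations concrete on a toy term matrix: the LLSF coefficients are obtained by ordinary least squares (with b = 0, as above), and a logistic regression model is fit on the same features for comparison. The data and the use of NumPy/scikit-learn are illustrative assumptions, not the implementations compared in [134] or [138].

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap loan offer", "cheap pills offer", "meeting schedule", "budget meeting today"]
y = np.array([1, 1, 0, 0])                         # binary class indicators

X = CountVectorizer().fit_transform(docs).toarray().astype(float)

# LLSF: choose A to minimize sum_i (A . X_i - y_i)^2, i.e. ||X A - y||^2 with b = 0.
A, *_ = np.linalg.lstsq(X, y, rcond=None)
llsf_scores = X @ A                                # p_i = A . X_i; threshold (e.g. at 0.5) to classify

# Logistic regression: choose A, b to maximize the conditional likelihood of the labels.
logit = LogisticRegression().fit(X, y)

print(np.round(llsf_scores, 2), logit.predict(X))
```

A linear SVM (for example, scikit-learn's LinearSVC) could be dropped into the same pipeline, which is precisely the conceptual similarity discussed next.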
It is clear that the regression classifiers are extremely similar to the SVM model for classification. Indeed, since LLSF, Logistic Regression, and SVM are all linear classifiers, they are identical at a conceptual level; the main difference among them lies in the details of the optimization formulation and implementation. As in the case of SVM classifiers, training a regression classifier also requires an expensive optimization process. For example, fitting LLSF requires expensive matrix computations in the form of a singular value decomposition process.

6.3 Neural Network Classifiers

The basic unit in a neural network is a neuron or unit. Each unit receives a set of inputs, which are denoted by the vector Xi, which in this case correspond to the term frequencies in the ith document. Each neuron is also associated with a set of weights A, which are used in order to compute a function f(·) of its inputs. A typical function which is often used in the neural network is the linear function, as follows:

p_i = A \cdot X_i \quad (6.14)

Thus, for a vector Xi drawn from a lexicon of d words, the weight vector A should also contain d elements. Now consider a binary classification problem, in which all labels are drawn from {+1, −1}. We assume that the class label of Xi is denoted by yi. In that case, the sign of the predicted function pi yields the class label.

Figure 6.2. The sign of the projection onto the weight vector A yields the class label

In order to illustrate this point, let us consider a simple example in a 2-dimensional feature space, as illustrated in Figure 6.2. In this case, we have illustrated two different classes, and the plane corresponding to A · X = 0 is illustrated in the same figure. It is evident that the sign of the function A · Xi yields the class label. Thus, the goal of the approach is to learn the set of weights A with the use of the training data. The idea is that we start off with random weights and gradually update them when a mistake is made by applying the current function on the training example. The magnitude of the update is regulated by a learning rate μ. This forms the core idea of the perceptron algorithm, which is as follows:

Perceptron Algorithm
Inputs: Learning Rate μ; Training Data (Xi, yi) ∀i ∈ {1 . . . n}
Initialize the weight vector A to 0 or small random numbers
repeat
  Apply each training instance to the network and check whether the sign of A · Xi matches yi;
  if the sign of A · Xi does not match yi, then update the weights A based on the learning rate μ
until the weights in A converge

The weights in A are typically updated (increased or decreased) proportionally to μ · Xi, so as to reduce the error of the neuron. We further note that many different update rules have been proposed in the literature. For example, one may simply update each weight by μ, rather than by μ · Xi. This is particularly possible in domains such as text, in which all feature values take on small non-negative values of relatively similar magnitude.

Figure 6.3. Multi-Layered Neural Networks for Nonlinear Separation

A number of implementations of neural network methods for text data have been studied in [34, 90, 101, 117, 129]. A natural question arises, as to how a neural network may be used if all the classes may not be neatly separated from one another with a linear separator, as illustrated in Figure 6.2.
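Before turning to the multi-layer case, the perceptron loop outlined above can be written in a few lines. The toy data, the fixed epoch limit, and the specific update rule (adding or subtracting μ · Xi on a mistake) are illustrative choices among the variants mentioned in the text.

```python
import numpy as np

def perceptron_train(X, y, mu=0.1, epochs=100):
    """X: document-term matrix (one row per document); y: labels in {+1, -1}."""
    A = np.zeros(X.shape[1])                  # initialize the weight vector to 0
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if np.sign(A @ x_i) != y_i:       # misclassified: move A towards y_i * x_i
                A += mu * y_i * x_i
                mistakes += 1
        if mistakes == 0:                     # the weights have converged on the training data
            break
    return A

X = np.array([[2, 0, 1, 0], [1, 0, 2, 0], [0, 2, 0, 1], [0, 1, 0, 2]], dtype=float)
y = np.array([1, 1, -1, -1])
A = perceptron_train(X, y)
print(np.sign(X @ A))                         # the sign of the projection onto A yields the labels
```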
For example, in Figure 6.3, we have illustrated an example in which the classes may not be separated with the use of a single linear separator. The use of multiple layers of neurons can be used in order to induce such non-linear classification boundaries. The effect of such multiple layers is to induce multiple piece-wise linear boundaries, which can be used to approximate enclosed regions belonging to a particular class. In such a network, the outputs of the neurons in the earlier layers feed into the neurons in the later layers. The training process of such networks is more complex, as the errors need to be back-propagated over different layers. Some examples of such classifiers include those discussed in [75, 110, 126, 132]. However, the general observation [117, 129] for text has been that linear classifiers generally provide comparable results to non-linear data, and the improvements of non-linear classification methods are relatively small. This suggests that the additional complexity of building more involved non-linear models does not pay for itself in terms of significantly better classification. 6.4 Some Observations about Linear Classifiers While the different linear classifiers have been developed independently from one another in the research literature, they are surprisingly similar at a basic conceptual level. Interestingly, these different lines of work have also resulted in a number of similar conclusions in terms of 200 MINING TEXT DATA the effectiveness of the different classifiers. We note that the main difference between the different classifiers is in terms of the details of the objective function which is optimized, and the iterative approach used in order to determine the optimum direction of separation. For example, the SVM method uses a Quadratic Programming (QP) formulation, whereas the LLSF method uses a closed-form least-squares formulation. On the other hand, the perceptron method does not try to formulate a closed-form objective function, but works with a softer iterative hill climbing approach. This technique is essentially inherited from the iterative learning approach used by neural network algorithms. However, its goal remains quite similar to the other two methods. Thus, the differences between these methods are really at a detailed level, rather than a conceptual level, in spite of their very different research origins. Another general observation about these methods is that all of them can be implemented with non-linear versions of their classifiers. For example, it is possible to create non-linear decision surfaces with the SVM classifier, just as it is possible to create non-linear separation boundaries by using layered neurons in a neural network [132]. However, the general consensus has been that the linear versions of these methods work very well, and the additional complexity of non-linear classification does not tend to pay for itself, except for some special data sets. The reason for this is perhaps because text is a high dimensional domain with highly correlated features and small non-negative values on sparse features. For example, it is hard to easily create class structures such as that indicated in Figure 6.3 for a sparse domain such as text containing only small non-negative values on the features. On the other hand, the high dimensional nature of correlated text dimensions is especially suited to classifiers which can exploit the redundancies and relationships between the different features in separating out the different classes. 
Common text applications have generally resulted in class structures which are linearly separable over this high dimensional domain of data. This is one of the reasons that linear classifiers have shown an unprecedented success in text classification. 7. Proximity-based Classifiers Proximity-based classifiers essentially use distance-based measures in order to perform the classification. The main thesis is that documents which belong to the same class are likely to be close to one another based on similarity measures such as the dot product or the cosine metric [115]. In order to perform the classification for a given test instance, two possible methods can be used: A Survey of Text Classification Algorithms 201 We determine the k-nearest neighbors in the training data to the test instance. The majority (or most abundant) class from these k neighbors are reported as the class label. Some examples of such methods are discussed in [25, 54, 134]. The choice of k typically ranges between 20 and 40 in most of the afore-mentioned work, depending upon the size of the underlying corpus. We perform training data aggregation during pre-processing, in which clusters or groups of documents belonging to the same class are created. A representative meta-document is created from each group. The same k-nearest neighbor approach is applied as discussed above, except that it is applied to this new set of metadocuments (or generalized instances [76]) rather than to the original documents in the collection. A pre-processing phase of summarization is useful in improving the efficiency of the classifier, because it significantly reduces the number of distance computations. In some cases, it may also boost the accuracy of the technique, especially when the data set contains a large number of outliers. Some examples of such methods are discussed in [55, 76, 109]. A method for performing nearest neighbor classification in text data is the WHIRL method discussed in [25]. The WHIRL method is essentially a method for performing soft similarity joins on the basis of text attributes. By soft similarity joins, we refer to the fact that the two records may not be exactly the same on the joined attribute, but a notion of similarity used for this purpose. It has been observed in [25] that any method for performing a similarity-join can be adapted as a nearest neighbor classifier, by using the relevant text documents as the joined attributes. One observation in [134] about nearest neighbor classifiers was that feature selection and document representation play an important part in the effectiveness of the classification process. This is because most terms in large corpora may not be related to the category of interest. Therefore, a number of techniques were proposed in [134] in order to learn the associations between the words and the categories. These are then used to create a feature representation of the document, so that the nearest neighbor classifier is more sensitive to the classes in the document collection. A similar observation has been made in [54], in which it has been shown that the addition of weights to the terms (based on their class-sensitivity) significantly improves the underlying classifier performance. The nearest neighbor classifier has also been extended to the temporally-aware scenario [114], in which the timeliness of a training document plays a role in the model construction process. 
In order to incorporate such factors, a temporal weighting function has been introduced in [114], which allows the importance of a document to gracefully decay with time.

For the case of classifiers which use grouping techniques, the most basic among such methods is that proposed by Rocchio in [109]. In this method, a single representative meta-document is constructed from each of the representative classes. For a given class, the weight of the term t_k is the normalized frequency of the term t_k in documents belonging to that class, minus the normalized frequency of the term in documents which do not belong to that class. Specifically, let f_k^{p} be the expected weight of term t_k in a randomly picked document belonging to the positive class, and let f_k^{n} be the expected weight of term t_k in a randomly picked document belonging to the negative class. Then, for weighting parameters α_p and α_n, the weight f_k^{rocchio} is defined as follows:

f_k^{rocchio} = α_p · f_k^{p} − α_n · f_k^{n}    (6.15)

The weighting parameters α_p and α_n are picked so that the positive class has much greater weight than the negative class. For the relevant class, we now have a vector representation of the terms (f_1^{rocchio}, f_2^{rocchio}, ..., f_n^{rocchio}). This approach is applied separately to each of the classes, in order to create a separate meta-document for each class. For a given test document, the closest meta-document to the test document can be determined by using a vector-based dot product or other similarity metric. The corresponding class is then reported as the relevant label.

The main distinguishing characteristic of the Rocchio method is that it creates a single profile of the entire class. This class of methods is also referred to as the Rocchio framework. The main disadvantage of this method is that if a single class occurs in multiple disjoint clusters which are not very well connected in the data, then the centroid of these examples may not represent the class behavior very well. This is likely to be a source of inaccuracy for the classifier. The main advantage of this method is its extreme simplicity and efficiency; the training phase is linear in the corpus size, and the number of computations in the testing phase is linear in the number of classes, since all the documents have already been aggregated into a small number of meta-documents. An analysis of the Rocchio algorithm, along with a number of different variations, may be found in [64].

In order to handle the shortcomings of the Rocchio method, a number of classifiers have also been proposed [1, 14, 55, 76], which explicitly perform the clustering of each of the classes in the document collection. These clusters are used in order to generate class-specific profiles. These profiles are also referred to as generalized instances in [76]. For a given test instance, the label of the closest generalized instance is reported by the algorithm. The method in [14] is also a centroid-based classifier, but is specifically designed for the case of text documents. The work in [55] shows that there are some advantages in designing schemes in which the similarity computations take account of the dependencies between the terms of the different classes. We note that the nearest neighbor classifier can be used in order to generate a ranked list of categories for each document.
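Before turning to multi-label thresholding, the Rocchio profile of Equation (6.15) can be written down concretely. The following is a minimal sketch assuming TF-IDF document vectors; the particular values of α_p and α_n, and the use of per-class means as the expected term weights, are illustrative assumptions.

import numpy as np

def rocchio_profiles(X, y, alpha_p=16.0, alpha_n=4.0):
    # X: (documents x terms) matrix of term weights, y: class label of each document.
    # For each class, the profile is alpha_p * (mean weight inside the class)
    # minus alpha_n * (mean weight outside the class), as in Equation (6.15).
    profiles = {}
    for c in np.unique(y):
        f_pos = np.asarray(X[y == c].mean(axis=0)).ravel()
        f_neg = np.asarray(X[y != c].mean(axis=0)).ravel()
        profiles[c] = alpha_p * f_pos - alpha_n * f_neg
    return profiles

def rocchio_classify(x, profiles):
    # x: dense 1-D term-weight vector of the test document; report the class
    # whose meta-document has the largest dot product with x.
    return max(profiles, key=lambda c: float(np.dot(x, profiles[c])))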
In cases where a document is related to multiple categories, several of these categories can be reported for the document, as long as a thresholding method is available. The work in [136] studies a number of thresholding strategies for the k-nearest neighbor classifier. It has also been suggested in [136] that these strategies can be used to understand the thresholding behavior of other classifiers which produce ranked outputs.

8. Classification of Linked and Web Data

In recent years, the proliferation of the web and social network technologies has led to a tremendous amount of document data, which is expressed in the form of linked networks. The simplest example of this is the web, in which the documents are linked to one another with the use of hyperlinks. Social networks can also be considered a noisy example of such data, because the comments and text profiles of different users are connected to one another through a variety of links. Linkage information is quite relevant to the classification process, because documents on similar subjects are often linked together. This observation has been used widely in the collective classification literature [12], in which a subset of network nodes are labeled, and the remaining nodes are classified on the basis of the linkages among the nodes.

In general, a content-based network may be denoted by G = (N, A, C), where N is the set of nodes, A is the set of edges between the nodes, and C is a set of text documents. Each node in N corresponds to a text document in C, and it is possible for a document to be empty when the corresponding node does not contain any content. A subset of the nodes in N is labeled; this corresponds to the training data. The classification problem in this scenario is to determine the labels of the remaining nodes with the use of the training data. It is clear that both the content and the structure can play a useful and complementary role in the classification process.

An interesting method for combining linkage and content information for classification was discussed in [20]. In this paper, a hypertext categorization method was proposed, which uses the content and labels of neighboring web pages for the classification process. When the labels of all the nearest neighbors are available, then a Bayesian method can be adapted easily for classification purposes. Just as the presence of a word in a document can be considered a Bayesian feature for a text classifier, the presence of a link between the target page and a page for which the label is known can be considered a feature for the classifier. The real challenge arises when the labels of all the nearest neighbors are not available. In such cases, a relaxation labeling method was proposed in order to perform the classification. Two methods have been proposed in this work:

Fully Supervised Case of Radius-one Enhanced Linkage Analysis: In this case, it is assumed that all the neighboring class labels are known. A Bayesian approach is utilized in order to treat the labels on the nearest neighbors as features for classification purposes. In this case, the linkage information is the sole information which is used for classification purposes.

When the class labels of the nearest neighbors are not known: In this case, an iterative approach is used for combining text- and linkage-based classification.
Rather than using the predefined labels (which are not available), we perform a first labeling of the neighboring documents with the use of the document content. These labels are then used to classify the label of the target document, with the use of both the local text and the class labels of the neighbors. This approach is applied iteratively for re-defining the labels of both the target document and its neighbors until convergence is achieved.

The conclusion from the work in [20] is that a combination of text- and linkage-based classification always improves the accuracy of a text classifier. Even when none of the neighbors of the document have known classes, it always seemed beneficial to add link information to the classification process. When the class labels of all the neighbors are known, then the advantages of using the scheme are quite significant. An additional idea in the paper is the use of bridges in order to further improve the classification accuracy. The core idea in the use of a bridge is the use of 2-hop propagation for link-based classification. The results with the use of such an approach are somewhat mixed, as the accuracy seems to reduce with an increasing number of hops. The work in [20] shows results on a number of different kinds of data sets such as the Reuters database, the US patent database, and Yahoo!. Since the Reuters database contains the least amount of noise, pure text classifiers were able to do a good job on it. On the other hand, the US patent database and the Yahoo! database contain an increasing amount of noise, which reduces the accuracy of text classifiers. An interesting observation in [20] was that a scheme which simply absorbed the neighbor text into the current document performed significantly worse than a scheme which was based on pure text-based classification. This is because there are often significant cross-boundary linkages between topics, and such linkages are able to confuse the classifier. A publicly available implementation of this algorithm may be found in the NetKit tool kit available in [92].

Another relaxation labeling method for graph-based document classification is proposed in [4]. In this technique, the probability that the end points of a link take on a particular pair of class labels is quantified. We refer to this as the link-class pair probability. The posterior probability of classification of a node T into class i is expressed as the sum of the probabilities of pairing all possible class labels of the neighbors of T with class label i. We note that a significant percentage of these (exponentially many) possibilities are pruned, since only the currently most probable labelings are used in this approach (in the case of hard labeling, the single most likely labeling is used, whereas in the case of soft labeling, a small set of possibilities is used). For this purpose, it is assumed that the class labels of the different neighbors of T (while dependent on T) are independent of each other. This is similar to the naive independence assumption, which is often used in Bayes classifiers. Therefore, the probability for a particular combination of labels on the neighbors can be expressed as the product of the corresponding link-class pair probabilities. The approach starts off with the use of a standard content-based Bayes or SVM classifier in order to assign the initial labels to the nodes. Then, an iterative approach is used to refine the labels, by using the most probable label estimates from the previous iteration in order to refine the labels in the current iteration.
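Both [20] and [4] share this iterative structure: seed the labels with a content-only classifier, then repeatedly re-label each node using its own text together with the current labels of its neighbors. The following is a minimal sketch of that general recipe, not the exact procedure of either paper; the simple convex combination of the text score and the neighbor label distribution, and all parameter names, are illustrative assumptions.

import numpy as np

def iterative_link_text_labeling(text_probs, adj, labeled_mask, known_labels,
                                 n_iter=10, lam=0.5):
    # text_probs:   (n_nodes x n_classes) class probabilities from a text-only classifier
    # adj:          (n_nodes x n_nodes) 0/1 adjacency matrix of the document network
    # labeled_mask: boolean array marking the training nodes
    # known_labels: class indices of the training nodes (full-length integer array)
    n_nodes, n_classes = text_probs.shape
    labels = text_probs.argmax(axis=1)                  # seed with content-only predictions
    labels[labeled_mask] = known_labels[labeled_mask]
    for _ in range(n_iter):
        for i in np.where(~labeled_mask)[0]:
            neighbor_labels = labels[adj[i] > 0]
            counts = np.bincount(neighbor_labels, minlength=n_classes)
            neighbor_dist = counts / max(counts.sum(), 1)
            # combine the local text evidence with the neighbor label distribution
            labels[i] = np.argmax(lam * text_probs[i] + (1 - lam) * neighbor_dist)
    return labels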
We note that the link-class pair probabilities can be estimated as the smoothed fraction of edges in the last iteration which contain a particular pair of classes as their end points (hard labeling), or as the average product of node probabilities over all edges which take on that particular class pair (soft labeling). This approach is repeated to convergence.

Another method which uses a naive Bayes classifier to enhance link-based classification is proposed in [104]. This method incrementally assigns class labels, starting off with a temporary assignment and then gradually making the assignments permanent. The initial class assignment is based on a simple Bayes expression involving both the terms and the links in the document. In the final categorization, the method changes the term weights for Bayesian classification of the target document using the terms in the neighborhood of the current document. This method uses a broad framework which is similar to that in [20], except that it differentiates between the classes in the neighborhood of a document in terms of their influence on the class label of the current document. For example, documents for which the class label was either already available in the training data, or for which the algorithm has performed a final assignment, have a different confidence weighting factor than those documents for which the class label is currently only temporarily assigned. Similarly, documents which belong to a completely different subject (based on content) are also removed from consideration in the assignment. Then, the Bayesian classification is performed with the re-computed weights, so that the document can be assigned a final class label. By using this approach, the technique is able to compensate for the noise and inconsistencies in the link structures among different documents. One major difference between the work in [20] and [104] is that the former is focused on using link information in order to propagate the labels, whereas the latter attempts to use the content of the neighboring pages. Another work along this direction, which uses the content of the neighboring pages more explicitly, is proposed in [105]. In this case, the content of the neighboring pages is broken up into different fields such as titles, anchor text, and general text. The different fields are given different levels of importance, which are learned during the classification process. It was shown in [105] that the use of title fields and anchor fields is much more relevant than the general text. This accounts for much of the accuracy improvement demonstrated in [105].

The work in [2] proposes a method for dynamic classification in text networks with the use of a random-walk method. The key idea in the work is to transform the combination of structure and content in the network into a pure network containing only content. Thus, we transform the original network G = (N, A, C) into an augmented network G_A = (N ∪ N_c, A ∪ A_c), where N_c and A_c are an additional set of nodes and edges added to the original network. Each node in N_c corresponds to a distinct word in the lexicon. Thus, the augmented network contains the original structural nodes N, and a new set of word nodes N_c. The added edges in A_c are undirected edges between the structural nodes N and the word nodes N_c.
Specifically, an edge (i, j) is added to A_c if the word i ∈ N_c occurs in the text content corresponding to the node j ∈ N. Thus, this network is semi-bipartite, in that there are no edges between the different word nodes. An illustration of the semi-bipartite content-structure transformation is provided in Figure 6.4.

[Figure 6.4. The Semi-bipartite Transformation: structural nodes are connected to word nodes by dashed edges, which indicate word presence in the corresponding documents.]

It is important to note that once such a transformation has been performed, any of the collective classification methods [12] can be applied to the structural nodes. In the work in [2], a random-walk method has been used in order to perform the collective classification of the underlying nodes. In this method, repeated random walks are performed starting at the unlabeled nodes which need to be classified. The random walks are defined only on the structural nodes, and each hop may either be a structural hop or a content hop. We perform l different random walks, each of which contains h nodes. Thus, a total of l · h nodes are encountered in the different walks. The class label of the starting node is predicted to be the label with the highest frequency of presence among the l · h nodes encountered in the different walks. The error of this random walk-based sampling process has been bounded in [12]. In addition, the method in [12] can be adapted to dynamic content-based networks, in which the nodes, edges and their underlying content continuously evolve over time. The method in [2] has been compared to that proposed in [18] (based on the implementation in [92]), and it has been shown that the classification methods of [12] are significantly superior.

Another method for classification of linked text data is discussed in [139]. This method designs two separate regularization conditions; one is for the text-only classifier (also referred to as the local classifier), and the other is for the link information in the network structure. These regularizers are expressed in terms of the underlying kernels; the link regularizer is related to the standard graph regularizer used in the machine learning literature, and the text regularizer is expressed in terms of the kernel gram matrix. These two regularization conditions are combined in two possible ways. One can either use linear combinations of the regularizers, or linear combinations of the associated kernels. It was shown in [139] that both combination methods perform better than either pure structure-based or pure text-based methods. The method using a linear combination of regularizers was slightly more accurate and robust than the method which used a linear combination of the kernels.

A method in [32] designs a classifier which combines a naive Bayes classifier (on the text domain) and a rule-based classifier (on the structural domain). The idea is to invent a set of predicates, which are defined in the space of links, pages and words. A variety of predicates (or relations) are defined depending upon the presence of a word in a page, the linkages of pages to each other, the nature of the anchor text of a hyperlink, and the neighborhood words of a hyperlink. These essentially encode the graph structure of the documents in the form of boolean predicates, and can also be used to construct relational learners.
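Before turning to how such relational features are combined with a text classifier, the random-walk scheme of [2] described above can be summarized concretely. The following is a minimal sketch under assumed data structures, not the exact algorithm of [2]; in particular, the probability of choosing a structural hop over a content hop, and all parameter names, are illustrative assumptions.

import random
from collections import Counter

def random_walk_label(start, doc_links, doc_words, word_docs, labels,
                      l=100, h=10, p_structural=0.5):
    # doc_links[d]: list of documents linked to document d (structural edges)
    # doc_words[d]: list of words occurring in document d
    # word_docs[w]: list of documents containing word w
    # labels:       dict mapping labeled documents to their class (unlabeled ones absent)
    votes = Counter()
    for _ in range(l):                                    # l independent walks of h hops each
        node = start
        for _ in range(h):
            if doc_links.get(node) and random.random() < p_structural:
                node = random.choice(doc_links[node])     # structural hop
            elif doc_words.get(node):
                word = random.choice(doc_words[node])     # content hop:
                node = random.choice(word_docs[word])     # document -> word -> document
            if node in labels:
                votes[labels[node]] += 1                  # count labeled documents visited
    # the most frequently encountered label is reported for the starting node
    return votes.most_common(1)[0][0] if votes else None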
The main contribution of [32] is to combine the relational learners on the structural domain with the naive Bayes approach in the text domain. We refer the reader to [32, 33] for the details of the algorithm and the general philosophy of such relational learners.

One of the interesting methods for collective classification in the context of email networks was proposed in [23]. The technique in [23] is designed to classify speech acts in email. Speech acts essentially characterize whether an email refers to a particular kind of action (such as scheduling a meeting). It has been shown in [23] that the use of sequential thread-based information from the email is very useful for the classification process. An email system can be modeled as a network in several ways, one of which is to treat each email as a node, and the edges as the thread relationships between the different emails. In this sense, the work in [23] devises a network-based mining procedure which uses both the content and the structure of the email network. However, this work is rather specific to the case of email networks, and it is not clear whether the technique can be adapted (effectively) to more general networks.

A different line of solutions to such problems, which are defined on a heterogeneous feature space, is to use latent space methods in order to simultaneously homogenize the feature space and determine the latent factors in the underlying data. The resulting representation can be used in conjunction with any of the text classifiers which are designed for latent space representations. A method in [140] uses a matrix factorization approach in order to construct a latent space from the underlying data. Both supervised and unsupervised methods were proposed for constructing the latent space. It was then shown in [140] that this feature representation provides more accurate results when used in conjunction with an SVM classifier.

Finally, a method for web page classification is proposed in [119]. This method is designed for using intelligent agents in web page categorization. The overall approach relies on the design of two functions which correspond to scoring web pages and links respectively. An advice language is created, and a method is proposed for mapping advice to neural networks. It has been shown in [119] how this general-purpose system may be used in order to find home pages on the web.

9. Meta-Algorithms for Text Classification

Meta-algorithms play an important role in classification strategies because of their ability to enhance the accuracy of existing classification algorithms by combining them, or by making a general change to the different algorithms in order to achieve a specific goal. Typical examples of classifier meta-algorithms include bagging, stacking and boosting [42]. Some of these methods change the underlying distribution of the training data, others combine classifiers, and yet others change the algorithms in order to satisfy specific classification criteria. We will discuss these different classes of methods in this section.

9.1 Classifier Ensemble Learning

In this method, we use combinations of classifiers in conjunction with a voting mechanism in order to perform the classification. The idea is that since different classifiers are susceptible to different kinds of overtraining and errors, a combination classifier is likely to yield much more robust results.
This technique is also sometimes referred to as stacking or classifier committee construction. Ensemble learning has been used quite frequently in text categorization. Most methods simply use weighted combinations of classifier outputs (either in terms of scores or ranks) in order to provide the final classification result. For example, the work by Larkey and Croft [79] used weighted linear combinations of the classifier scores or ranks. The work by Hull [60] used linear combinations of probabilities for the same goal. A linear combination of the normalized scores was used for classification in [137]. The work in [87] used classifier selection techniques and voting in order to provide the final classification result. Some examples of such voting and selection techniques are as follows:

In a binary-class application, the class label which obtains the majority vote is reported as the final result.

For a given test instance, a specific classifier is selected, depending upon the performance of the classifiers on the training or validation instances which are closest to that test instance.

A weighted combination of the results from the different classifiers is used, where the weight is regulated by the performance of each classifier on validation instances which are most similar to the current test instance.

The last two methods above try to select the final classification in a smarter way by discriminating between the performances of the classifiers in different scenarios. The work in [77] used category-averaged features in order to construct a different classifier for each category. The major challenge in ensemble learning is to provide the appropriate combination of classifiers for a particular scenario. Clearly, this combination can vary significantly with the scenario and the data set. In order to achieve this goal, the work in [10] proposes a method for the probabilistic combination of text classifiers. The work introduces a number of variables known as reliability variables in order to regulate the importance of the different classifiers. These reliability variables are learned dynamically for each situation, so as to provide the best classification.

9.2 Data Centered Methods: Boosting and Bagging

While ensemble techniques focus on combining different classifiers, data-centered methods such as boosting and bagging typically focus on training the same classifier on different parts of the training data in order to create different models. For a given test instance, a combination of the results obtained from the use of these different models is reported. Another major difference between ensemble methods and boosting methods is that the training models in a boosting method are not constructed independently, but are constructed sequentially. Specifically, after i classifiers are constructed, the (i+1)-th classifier is constructed on those parts of the training data which the first i classifiers were unable to classify accurately. The results of these different classifiers are combined together carefully, where the weight of each classifier is typically a function of its error rate. The most well-known meta-algorithm for boosting is the AdaBoost algorithm [48]. Such boosting algorithms have been applied to a variety of scenarios such as decision tree learners, rule-based systems, and Bayesian classifiers [49, 61, 73, 100, 116, 118].
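As a concrete illustration of this data-centered recipe (not of any specific system cited above), standard boosting and bagging wrappers can be applied to vectorized text. The sketch below uses scikit-learn; the base learners, the number of rounds, and the toy corpus are illustrative assumptions, and the estimator parameter name assumes a recent scikit-learn version.

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

docs = ["cheap loans approved instantly", "win a free prize now",
        "meeting agenda for project review", "please review the attached report"]
y = [1, 1, 0, 0]
X = TfidfVectorizer().fit_transform(docs)

# Boosting: re-weights the training documents so that later rounds concentrate
# on the examples misclassified in earlier rounds.
boosted = AdaBoostClassifier(estimator=MultinomialNB(), n_estimators=25).fit(X, y)

# Bagging: trains each base model on a bootstrap sample of the documents and
# combines their votes.
bagged = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=25).fit(X, y)

print(boosted.predict(X), bagged.predict(X))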
We note that boosting is also a kind of ensemble learning methodology, except that we train the same model on different subsets of the data in order to create the ensemble. One major criticism of boosting is that in many data sets, some of the training records are noisy, and a classification model should be resistant to overtraining on such data. Since the boosting model tends to weight the error-prone examples more heavily in successive rounds, this can make the classification process more prone to overfitting. This is particularly noticeable in the case of noisy data sets. Some recent results have suggested that all convex boosting algorithms may perform poorly in the presence of noise [91]. These results tend to suggest that the choice of boosting algorithm may be critical for a successful outcome, depending upon the underlying data set.

Bagging methods [16] are generally designed to reduce the model overfitting error which arises during the learning process. The idea in bagging is to pick bootstrap samples (samples with replacement) from the underlying collection, and train a classifier on each of these samples. The classification results from these different samples are then combined together in order to yield the final result. Bagging methods are generally used in conjunction with decision trees, though these methods can be used in principle with any kind of classifier. The main criticism of the bagging method is that it can sometimes lead to a reduction in accuracy because of the smaller size of each individual training sample. Bagging is useful only if the model is unstable with respect to small variations in the training data, because it works by reducing the resulting overfitting error. An example of such an algorithm would be the decision tree model, which is highly sensitive to how the higher levels of the tree are constructed in a high-dimensional feature space such as text. Bagging methods have not been used frequently in text classification.

9.3 Optimizing Specific Measures of Accuracy

We note that the absolute classification accuracy is not the only measure which is relevant to classification algorithms. For example, in skewed-class scenarios, which often arise in the context of applications such as fraud detection and spam filtering, it is more costly to misclassify examples of one class than of the other. For example, while it may be tolerable to misclassify a few spam emails (thereby allowing them into the inbox), it is much more undesirable to incorrectly mark a legitimate email as spam. Cost-sensitive classification problems also naturally arise in cases in which one class is rarer than the other, and it is therefore more desirable to identify the rare examples. In such cases, it is desirable to optimize the cost-weighted accuracy of the classification process. We note that many of the broad techniques which have been designed for non-textual data [40, 42, 45] are also applicable to text data, because the specific feature representation is not material to how standard algorithms are modified to the cost-sensitive case. A good understanding of cost-sensitive classification, both for the textual and the non-textual case, may be found in [3, 40, 45]. Some examples of how classification algorithms may be modified in straightforward ways to incorporate cost-sensitivity are as follows:

In a decision tree, the split condition at a given node tries to maximize the accuracy of its children nodes.
In the cost-sensitive case, the split is engineered to maximize the cost-sensitive accuracy.

In rule-based classifiers, the rules are typically quantified and ordered by measures corresponding to their predictive accuracy. In the cost-sensitive case, the rules are quantified and ordered by their cost-weighted accuracy.

In Bayesian classifiers, the posterior probabilities are weighted by the cost of the class for which the prediction is made.

In linear classifiers, the optimum hyperplane separating the classes is determined in a cost-weighted sense. Such costs can typically be incorporated in the underlying objective function. For example, the least-squares error in the objective function of the LLSF method can be weighted by the underlying costs of the different classes.

In a k-nearest neighbor classifier, we report the cost-weighted majority class among the k nearest neighbors of the test instance.

We note that the use of a cost-sensitive approach is essentially a change of the objective function of classification, which can also be formulated as an optimization problem. While the standard classification problem generally tries to optimize accuracy, the cost-sensitive version tries to optimize a cost-weighted objective function. A more general approach was proposed in [50], in which a meta-algorithm was designed for optimizing a specific figure of merit such as accuracy, precision, recall, or the F1-measure. Thus, this approach generalizes this class of methods to any arbitrary objective function, making it essentially an objective-centered classification method. A generalized probabilistic descent algorithm (with the desired objective function) is used in conjunction with the classifier of interest in order to derive the class labels of the test instance. The work in [50] shows the advantages of using this technique over a standard SVM-based classifier.

10. Conclusions and Summary

The classification problem is one of the most fundamental problems in the machine learning and data mining literature. In the context of text data, the problem can also be considered similar to that of classification of discrete set-valued attributes, when the frequencies of the words are ignored. The domains of these sets are rather large, as they comprise the entire lexicon. Therefore, text mining techniques need to be designed to effectively manage large numbers of elements with varying frequencies. Almost all the known techniques for classification, such as decision trees, rules, Bayes methods, nearest neighbor classifiers, SVM classifiers, and neural networks, have been extended to the case of text data. Recently, a considerable amount of emphasis has been placed on linear classifiers such as neural networks and SVM classifiers, with the latter being particularly suited to the characteristics of text data. In recent years, the advancement of web and social network technologies has led to a tremendous interest in the classification of text documents containing links or other meta-information. Recent research has shown that the incorporation of linkage information into the classification process can significantly improve the quality of the underlying results.

References

[1] C. C. Aggarwal, S. C. Gates, P. S. Yu. On Using Partial Supervision for Text Categorization, IEEE Transactions on Knowledge and Data Engineering, 16(2), 245–255, 2004.

[2] C. C. Aggarwal, N. Li. On Node Classification in Dynamic Content-based Networks, SDM Conference, 2011.

[3] I.
Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, C. Spyropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. Workshop on Machine Learning in the New Information Age, in conjunction with ECML Conference, 2000. http://arxiv.org/PS_cache/cs/pdf/0006/0006013v1.pdf [4] R. Angelova, G. Weikum. Graph-based text classification: learn from your neighbors. ACM SIGIR Conference, 2006. [5] C. Apte, F. Damerau, S. Weiss. Automated Learning of Decision Rules for Text Categorization, ACM Transactions on Information Systems, 12(3), pp. 233–251, 1994. 214 MINING TEXT DATA [6] M. Aizerman, E. Braverman, L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning, Automation and Remote Control, 25: pp. 821–837, 1964. [7] L. Baker, A. McCallum. Distributional Clustering of Words for Text Classification, ACM SIGIR Conference, 1998. [8] R. Bekkerman, R. El-Yaniv, Y. Winter, N. Tishby. On Feature Distributional Clustering for Text Categorization. ACM SIGIR Conference, 2001. [9] S. Basu, A. Banerjee, R. J. Mooney. Semi-supervised Clustering by Seeding. ICML Conference, 2002. [10] P. Bennett, S. Dumais, E. Horvitz. Probabilistic Combination of Text Classifiers using Reliability Indicators: Models and Results. ACM SIGIR Conference, 2002. [11] P. Bennett, N. Nguyen. Refined experts: improving classification in large taxonomies. ACM SIGIR Conference, 2009. [12] S. Bhagat, G. Cormode, S. Muthukrishnan. Node Classification in Social Networks, Book Chapter in Social Network Data Analytics, Ed. Charu Aggarwal, Springer, 2011. [13] A. Blum, T. Mitchell. Combining labeled and unlabeled data with co-training. COLT, 1998. [14] D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore. Partitioning-based clustering for web document categorization. Decision Support Systems, Vol. 27, pp. 329–341, 1999. [15] L. Brieman, J. Friedman, R. Olshen, C. Stone. Classification and Regression Trees, Wadsworth Advanced Books and Software, CA, 1984. [16] L. Breiman. Bagging Predictors. Machine Learning, 24(2), pp. 123– 140, 1996. [17] L. Cai, T. Hofmann. Text categorization by boosting automatically extracted concepts. ACM SIGIR Conference, 2003. [18] S. Chakrabarti, S. Roy, M. Soundalgekar. Fast and Accurate Text Classification via Multiple Linear Discriminant Projections, VLDB Journal, 12(2), pp. 172–185, 2003. [19] S. Chakrabarti, B. Dom. R. Agrawal, P. Raghavan. Using taxonomy, discriminants and signatures for navigating in text databases, VLDB Conference, 1997. [20] S. Chakrabarti, B. Dom, P. Indyk. Enhanced hypertext categorization using hyperlinks. ACM SIGMOD Conference, 1998. A Survey of Text Classification Algorithms 215 [21] S. Chakraborti, R. Mukras, R. Lothian, N. Wiratunga, S. Watt, D. Harper. Supervised Latent Semantic Indexing using Adaptive Sprinkling, IJCAI, 2007. [22] D. Chickering, D. Heckerman, C. Meek. A Bayesian approach for learning Bayesian networks with local structure. Thirteenth Conference on Uncertainty in Artificial Intelligence, 1997. [23] V. R. de Carvalho, W. Cohen. On the collective classification of email ”speech acts”, ACM SIGIR Conference, 2005. [24] V. Castelli, T. M. Cover. On the exponential value of labeled samples. Pattern Recognition Letters, 16(1), pp. 105–111, 1995. [25] W. Cohen, H. Hirsh. Joins that generalize: text classification using Whirl. ACM KDD Conference, 1998. [26] W. Cohen, Y. Singer. Context-sensitive learning methods for text categorization. 
ACM Transactions on Information Systems, 17(2), pp. 141–173, 1999. [27] W. Cohen. Learning rules that classify e-mail. AAAI Conference, 1996. [28] W. Cohen. Learning with set-valued features. AAAI Conference, 1996. [29] W. Cooper. Some inconsistencies and misnomers in probabilistic information retrieval. ACM Transactions on Information Systems, 13(1), pp. 100–111, 1995. [30] C. Cortes, V. Vapnik. Support-vector networks. Machine Learning, 20: pp. 273–297, 1995. [31] T. M. Cover, J. A. Thomas. Elements of information theory. New York: John Wiley and Sons, 1991. [32] M. Craven, S. Slattery. Relational learning with statistical predicate invention: Better models for hypertext. Machine Learning, 43: pp. 97–119, 2001. [33] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the Worldwide Web. AAAI Conference, 1998. [34] I. Dagan, Y. Karov, D. Roth. Mistake-driven Learning in Text Categorization, Proceedings of EMNLP, 1997. [35] A. Dayanik, D. Lewis, D. Madigan, V. Menkov, A. Genkin. Constructing informative prior distributions from domain knowledge in text classification. ACM SIGIR Conference, 2006. [36] A. P. Dempster, N.M. Laird, D.B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B, 39(1): pp. 1–38, 1977. 216 MINING TEXT DATA [37] F. Denis, A. Laurent. Text Classification and Co-Training from Positive and Unlabeled Examples, ICML 2003 Workshop: The Continuum from Labeled to Unlabeled Data. http://www.grappa. univ-lille3.fr/ftp/reports/icmlws03.pdf. [38] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, R. Harshman. Indexing by Latent Semantic Analysis. JASIS, 41(6), pp. 391–407, 1990. [39] P. Domingos, M. J. Pazzani. On the the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2–3), pp. 103–130, 1997. [40] P. Domingos. MetaCost: A General Method for making Classifiers Cost-Sensitive. ACM KDD Conference, 1999. [41] H. Drucker, D. Wu, V. Vapnik. Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, 10(5), pp. 1048–1054, 1999. [42] R. Duda, P. Hart, W. Stork. Pattern Classification, Wiley Interscience, 2000. [43] S. Dumais, J. Platt, D. Heckerman, M. Sahami. Inductive learning algorithms and representations for text categorization. CIKM Conference, 1998. [44] S. Dumais, H. Chen. Hierarchical Classification of Web Content. ACM SIGIR Conference, 2000. [45] C. Elkan. The foundations of cost-sensitive learning, IJCAI Conference, 2001. [46] R. Fisher. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, pp. 179–188, 1936. [47] R. El-Yaniv, O. Souroujon. Iterative Double Clustering for Unsupervised and Semi-supervised Learning. NIPS Conference, 2002. [48] Y. Freund, R. Schapire. A decision-theoretic generalization of online learning and an application to boosting. In Proc. Second European Conference on Computational Learning Theory, pp. 23–37, 1995. [49] Y. Freund, R. Schapire, Y. Singer, M. Warmuth. Using and combining predictors that specialize. Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pp. 334–343, 1997. [50] S. Gao, W. Wu, C.-H. Lee, T.-S. Chua. A maximal figure-of-merit learning approach to text categorization. SIGIR Conference, 2003. [51] R. Gilad-Bachrach, A. Navot, N. Tishby. Margin based feature selection – theory and algorithms. ICML Conference, 2004. A Survey of Text Classification Algorithms 217 [52] S. Gopal, Y. Yang. 
Multilabel classification with meta-level features. ACM SIGIR Conference, 2010. [53] L. Guthrie, E. Walker. Document Classification by Machine: Theory and Practice. COLING, 1994. [54] E.-H. Han, G. Karypis, V. Kumar. Text Categorization using Weighted-Adjusted k-nearest neighbor classification, PAKDD Conference, 2001. [55] E.-H. Han, G. Karypis. Centroid-based Document Classification: Analysis and Experimental Results, PKDD Conference, 2000. [56] D. Hardin, I. Tsamardinos, C. Aliferis. A theoretical characterization of linear SVM-based feature selection. ICML Conference, 2004. [57] T. Hofmann. Probabilistic latent semantic indexing. ACM SIGIR Conference, 1999. [58] P. Howland, M. Jeon, H. Park. Structure Preserving Dimension Reduction for Clustered Text Data based on the Generalized Singular Value Decomposition. SIAM Journal of Matrix Analysis and Applications, 25(1): pp. 165–179, 2003. [59] P. Howland, H. Park. Generalizing discriminant analysis using the generalized singular value decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8), pp. 995–1006, 2004. [60] D. Hull, J. Pedersen, H. Schutze. Method combination for document filtering. ACM SIGIR Conference, 1996. [61] R. Iyer, D. Lewis, R. Schapire, Y. Singer, A. Singhal. Boosting for document routing. CIKM Conference, 2000. [62] M. James. Classification Algorithms, Wiley Interscience, 1985. [63] D. Jensen, J. Neville, B. Gallagher. Why collective inference improves relational classification. ACM KDD Conference, 2004. [64] T. Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. ICML Conference, 1997. [65] T. Joachims. Text categorization with support vector machines: learning with many relevant features. ECML Conference, 1998. [66] T. Joachims. Transductive inference for text classification using support vector machines. ICML Conference, 1999. [67] T. Joachims. A Statistical Learning Model of Text Classification for Support Vector Machines. ACM SIGIR Conference, 2001. [68] D. Johnson, F. Oles, T. Zhang, T. Goetz. A Decision Tree-based Symbolic Rule Induction System for Text Categorization, IBM Systems Journal, 41(3), pp. 428–437, 2002. 218 MINING TEXT DATA [69] I. T. Jolliffee. Principal Component Analysis. Springer, 2002. [70] T. Kalt, W. B. Croft. A new probabilistic model of text classification and retrieval. Technical Report IR-78, University of Massachusetts Center for Intelligent Information Retrieval, 1996. http://ciir. cs.umass.edu/publications/index.shtml [71] G. Karypis, E.-H. Han. Fast Supervised Dimensionality Reduction with Applications to Document Categorization and Retrieval, ACM CIKM Conference, 2000. [72] T. Kawatani. Topic difference factor extraction between two document sets and its application to text categorization. ACM SIGIR Conference, 2002. [73] Y.-H. Kim, S.-Y. Hahn, B.-T. Zhang. Text filtering by boosting naive Bayes classifiers. ACM SIGIR Conference, 2000. [74] D. Koller, M. Sahami. Hierarchically classifying documents with very few words, ICML Conference, 2007. [75] S. Lam, D. Lee. Feature reduction for neural network based text categorization. DASFAA Conference, 1999. [76] W. Lam, C. Y. Ho. Using a generalized instance set for automatic text categorization. ACM SIGIR Conference, 1998. [77] W. Lam, K.-Y. Lai. A meta-learning approach for text categorization. ACM SIGIR Conference, 2001. [78] K. Lang. Newsweeder: Learning to filter netnews. ICML Conference, 1995. [79] L. S. Larkey, W. B. Croft. Combining Classifiers in text categorization. 
ACM SIGIR Conference, 1996. [80] D. Lewis, J. Catlett. Heterogeneous uncertainty sampling for supervised learning. ICML Conference, 1994. [81] D. Lewis, M. Ringuette. A comparison of two learning algorithms for text categorization. SDAIR, 1994. [82] D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. ECML Conference, 1998. [83] D. Lewis. An Evaluation of Phrasal and Clustered Representations for the Text Categorization Task, ACM SIGIR Conference, 1992. [84] D. Lewis, W. Gale. A sequential algorithm for training text classifiers, SIGIR Conference, 1994. [85] D. Lewis, K. Knowles. Threading electronic mail: A preliminary study. Information Processing and Management, 33(2), pp. 209– 217, 1997. A Survey of Text Classification Algorithms 219 [86] H. Li, K. Yamanishi. Document classification using a finite mixture model. Annual Meeting of the Association for Computational Linguistics, 1997. [87] Y. Li, A. Jain. Classification of text documents. The Computer Journal, 41(8), pp. 537–546, 1998. [88] B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule Mining. ACM KDD Conference, 1998. [89] B. Liu, L. Zhang. A Survey of Opinion Mining and Sentiment Analysis. Book Chapter in Mining Text Data, Ed. C. Aggarwal, C. Zhai, Springer, 2011. [90] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2: pp. 285– 318, 1988. [91] P. Long, R. Servedio. Random Classification Noise defeats all Convex Potential Boosters. ICML Conference, 2008. [92] S. A. Macskassy, F. Provost. Classification in Networked Data: A Toolkit and a Univariate Case Study, Journal of Machine Learning Research, Vol. 8, pp. 935–983, 2007. [93] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu. edu/~mccallum/bow, 1996. [94] A. McCallum, K. Nigam. A Comparison of Event Models for Naive Bayes Text Classification. AAAI Workshop on Learning for Text Categorization, 1998. [95] A. McCallum, R. Rosenfeld, T. Mitchell, A. Ng. Improving text classification by shrinkage in a hierarchy of classes. ICML Conference, 1998. [96] McCallum, Andrew Kachites. ”MALLET: A Machine Learning for Language Toolkit.” http://mallet.cs.umass.edu. 2002. [97] T. M. Mitchell. Machine Learning. WCB/McGraw-Hill, 1997. [98] T. M. Mitchell. The role of unlabeled data in supervised learning. Proceedings of the Sixth International Colloquium on Cognitive Science, 1999. [99] D. Mladenic, J. Brank, M. Grobelnik, N. Milic-Frayling. Feature selection using linear classifier weights: interaction with classification models. ACM SIGIR Conference, 2004. [100] K. Myers, M. Kearns, S. Singh, M. Walker. A boosting approach to topic spotting on subdialogues. ICML Conference, 2000. 220 MINING TEXT DATA [101] H. T. Ng, W. Goh, K. Low. Feature selection, perceptron learning, and a usability case study for text categorization. ACM SIGIR Conference, 1997. [102] A. Y. Ng, M. I. Jordan. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. NIPS. pp. 841- 848, 2001. [103] K. Nigam, A. McCallum, S. Thrun, T. Mitchell. Learning to classify text from labeled and unlabeled documents. AAAI Conference, 1998. [104] H.-J. Oh, S.-H. Myaeng, M.-H. Lee. A practical hypertext categorization method using links and incrementally available class information. ACM SIGIR Conference, 2000. [105] X. Qi, B. Davison. 
Classifiers without borders: incorporating fielded text from neighboring web pages. ACM SIGIR Conference, 2008. [106] J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1(1), pp 81–106, 1986. [107] H. Raghavan, J. Allan. An interactive algorithm for asking and incorporating feature feedback into support vector machines. ACM SIGIR Conference, 2007. [108] S. E. Robertson, K. Sparck-Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27: pp. 129–146, 1976. [109] J. Rocchio. Relevance feedback information retrieval. The Smart Retrieval System- Experiments in Automatic Document Processing, G. Salton, Ed. Prentice Hall, Englewood Cliffs, NJ, pp 313–323, 1971. [110] M. Ruiz, P. Srinivasan. Hierarchical neural networks for text categorization. ACM SIGIR Conference, 1999. [111] F. Sebastiani. Machine Learning in Automated Text Categorization, ACM Computing Surveys, 34(1), 2002. [112] M. Sahami. Learning limited dependence Bayesian classifiers, ACM KDD Conference, 1996. [113] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz. A Bayesian approach to filtering junk e-mail. AAAI Workshop on Learning for Text Categorization. Tech. Rep. WS-98-05, AAAI Press. http:// robotics.stanford.edu/users/sahami/papers.html [114] T. Salles, L. Rocha, G. Pappa, G. Mourao, W. Meira Jr., M. Goncalves. Temporally-aware algorithms for document classification. ACM SIGIR Conference, 2010. A Survey of Text Classification Algorithms 221 [115] G. Salton. An Introduction to Modern Information Retrieval, Mc Graw Hill, 1983. [116] R. Schapire, Y. Singer. BOOSTEXTER: A Boosting-based System for Text Categorization, Machine Learning, 39(2/3), pp. 135–168, 2000. [117] H. Schutze, D. Hull, J. Pedersen. A comparison of classifiers and document representations for the routing problem. ACM SIGIR Conference, 1995. [118] R. Shapire, Y. Singer, A. Singhal. Boosting and Rocchio applied to text filtering. ACM SIGIR Conference, 1998. [119] J. Shavlik, T. Eliassi-Rad. Intelligent agents for web-based tasks: An advice-taking approach. AAAI-98 Workshop on Learning for Text Categorization. Tech. Rep. WS-98-05, AAAI Press, 1998. http://www.cs.wisc.edu/~shavlik/mlrg/publications.html [120] V. Sindhwani, S. S. Keerthi. Large scale semi-supervised linear SVMs. ACM SIGIR Conference, 2006. [121] N. Slonim, N. Tishby. The power of word clusters for text classification. European Colloquium on Information Retrieval Research (ECIR), 2001. [122] N. Slonim, N. Friedman, N. Tishby. Unsupervised document classification using sequential information maximization. ACM SIGIR Conference, 2002. [123] J.-T. Sun, Z. Chen, H.-J. Zeng, Y. Lu, C.-Y. Shi, W.-Y. Ma. Supervised Latent Semantic Indexing for Document Categorization. ICDM Conference, 2004. [124] V. Vapnik. Estimations of dependencies based on statistical data, Springer, 1982. [125] V. Vapnik. The Nature of Statistical Learning Theory, Springer, New York, 1995. [126] A. Weigand, E. Weiner, J. Pedersen. Exploiting hierarchy in text catagorization. Information Retrieval, 1(3), pp. 193–216, 1999. [127] S, M. Weiss, C. Apte, F. Damerau, D. Johnson, F. Oles, T. Goetz, T. Hampp. Maximizing text-mining performance. IEEE Intelligent Systems, 14(4), pp. 63–69, 1999. [128] S. M. Weiss, N. Indurkhya. Optimized Rule Induction, IEEE Exp., 8(6), pp. 61–69, 1993. [129] E. Wiener, J. O. Pedersen, A. S. Weigend. A Neural Network Approach to Topic Spotting. SDAIR, pp. 317–332, 1995. 222 MINING TEXT DATA [130] G.-R. Xue, D. Xing, Q. Yang, Y. Yu. 
Deep classification in largescale text hierarchies. ACM SIGIR Conference, 2008. [131] J. Yan, N. Liu, B. Zhang, S. Yan, Z. Chen, Q. Cheng, W. Fan, W.-Y. Ma. OCFS: optimal orthogonal centroid feature selection for text categorization. ACM SIGIR Conference, 2005. [132] Y. Yang, L. Liu. A re-examination of text categorization methods, ACM SIGIR Conference, 1999. [133] Y. Yang, J. O. Pederson. A comparative study on feature selection in text categorization, ACM SIGIR Conference, 1995. [134] Y. Yang, C.G. Chute. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3), 1994. [135] Y. Yang. Noise Reduction in a Statistical Approach to Text Categorization, ACM SIGIR Conference, 1995. [136] Y. Yang. A Study on Thresholding Strategies for Text Categorization. ACM SIGIR Conference, 2001. [137] Y. Yang, T. Ault, T. Pierce. Combining multiple learning strategies for effective cross-validation. ICML Conference, 2000. [138] J. Zhang, Y. Yang. Robustness of regularized linear classification methods in text categorization. ACM SIGIR Conference, 2003. [139] T. Zhang, A. Popescul, B. Dom. Linear prediction models with graph regularization for web-page categorization, ACM KDD Conference, 2006. [140] S. Zhu, K. Yu, Y. Chi, Y. Gong. Combining content and link for classification using matrix factorization. ACM SIGIR Conference, 2007.