Document-term matrix

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in each document in a collection. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. It is a specific instance of a document-feature matrix, where "features" may refer to properties of a document other than terms. It is also common to encounter the transpose, the term-document matrix, where documents are the columns and terms are the rows. Both forms are widely used in natural language processing and computational text analysis.

While the value of each cell is commonly the raw count of a given term, there are various schemes for weighting the raw counts, such as row normalizing (i.e. relative frequency or proportions) and tf-idf.

Terms are commonly single words separated by whitespace or punctuation on either side (a.k.a. unigrams). In such a case, this is also referred to as a "bag of words" representation because the counts of individual words are retained, but not the order of the words in the document.
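The loss of word order can be illustrated with a minimal sketch. The `bag_of_words` helper, its lowercasing step, and its regular expression are assumptions made for this example rather than a standard tokenizer; real pipelines often tokenize differently.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase unigrams; the order of the words is discarded."""
    # Hypothetical tokenizer: runs of letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

# Both sentences contain the same words the same number of times,
# so their bag-of-words representations compare equal.
print(a == b)
```

The two sentences mean different things, yet they are indistinguishable once reduced to unigram counts.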

When creating a dataset of terms that appear in a corpus of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. Each cell (i, j), then, is the number of times word j occurs in document i. As such, each row is a vector of term counts that represents the content of the document corresponding to that row. For instance, if one has the following two (short) documents:

then the document-term matrix would be:

which shows which documents contain which terms and how many times they appear. Note that, unlike representing a document as just a token-count list, the document-term matrix includes all terms in the corpus (i.e. the corpus vocabulary), which is why there are zero counts for terms that appear elsewhere in the corpus but not in a specific document. For this reason, document-term matrices are usually stored in a sparse matrix format.
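The construction described above can be sketched in a few lines of plain Python. The helper names (`tokenize`, `document_term_matrix`, `to_sparse`), the alphabetical column order, and the two sample documents are all assumptions chosen for illustration. The dictionary-of-nonzero-cells layout is only one simple way to store a sparse matrix.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def document_term_matrix(docs):
    """Return (vocabulary, matrix) with rows = documents, columns = terms."""
    tokenized = [tokenize(d) for d in docs]
    # The vocabulary covers every term in the corpus, sorted for stable columns.
    vocab = sorted({t for toks in tokenized for t in toks})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenized):
        for t in toks:
            matrix[i][column[t]] += 1  # cell (i, j) counts term j in document i
    return vocab, matrix

def to_sparse(matrix):
    """Keep only the nonzero cells as {(row, column): count}."""
    return {(i, j): c
            for i, row in enumerate(matrix)
            for j, c in enumerate(row) if c}

# Illustrative documents (an assumption for this sketch).
docs = ["I like databases", "I dislike databases"]
vocab, dtm = document_term_matrix(docs)
sparse = to_sparse(dtm)

print(vocab)
for row in dtm:
    print(row)
print(len(sparse), "nonzero cells out of", len(dtm) * len(vocab))
```

In a realistic corpus the vocabulary runs to many thousands of terms while each document uses only a small fraction of them, so storing only the nonzero cells saves far more space than this toy example suggests.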

Because token frequencies in nearly every corpus follow a power-law distribution (see Zipf's law), it is common to weight the counts. This can be as simple as dividing counts by the total number of tokens in a document (called relative frequency or proportions), dividing by the maximum frequency in each document (called prop max), or taking the log of frequencies (called log count). If one wants to give more weight to the words most distinctive of an individual document as compared to the corpus as a whole, it is common to use tf-idf, which scales the term frequency by the inverse of the term's document frequency, usually by multiplying the term frequency by the logarithm of the number of documents divided by the number of documents containing the term.
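The weighting schemes just described can be sketched as follows. The function names are hypothetical. Using log(1 + count) for the log count, so that zero counts stay at zero, is an assumption. The tf-idf variant shown (raw count times the natural log of N over document frequency, with no smoothing) is one common formulation among several.

```python
import math

def relative_frequency(row):
    """Divide each count by the total number of tokens in the document."""
    total = sum(row)
    return [c / total if total else 0.0 for c in row]

def prop_max(row):
    """Divide each count by the largest count in the document."""
    peak = max(row, default=0)
    return [c / peak if peak else 0.0 for c in row]

def log_count(row):
    """Log-scale the counts; log(1 + c) is an assumption that keeps zeros at zero."""
    return [math.log1p(c) for c in row]

def tf_idf(matrix):
    """Weight each count by log(N / df), where df is the term's document frequency."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[c * idf[j] for j, c in enumerate(row)] for row in matrix]

# Illustrative raw counts: 2 documents (rows) by 4 terms (columns).
counts = [[1, 0, 1, 1],
          [1, 1, 1, 0]]

print([relative_frequency(row) for row in counts])
print([prop_max(row) for row in counts])
print([log_count(row) for row in counts])
print(tf_idf(counts))
```

With this unsmoothed variant, a term that occurs in every document receives a weight of zero, because log(N / N) = 0. Terms confined to a single document receive the largest weight.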

The document-term matrix emerged in the earliest years of the computerization of text. The increasing capacity for storing documents created the problem of retrieving a given document in an efficient manner. While previously the work of classifying and indexing was accomplished by hand, researchers explored the possibility of doing this automatically using word frequency information.
