Compute word frequency table


This analysis job computes a detailed table of word frequency information.

To start this job, select the questions: “What’s the frequency of word use within a given set of articles? What are the ‘most important’ or ‘most frequent’ words used in a given set?”

This job yields a highly detailed word frequency table, customizable in a variety of ways (see the Options section below). You may then choose to divide the text into segments, a common requirement for other analysis algorithms. You can create these segments either by setting an explicit number of words per segment, or by setting the number of blocks you would like to appear in the final result. These blocks can be produced either within each individual journal article or across journal article boundaries (i.e., segmented after the articles are concatenated into one large stream of text).

A variety of results are then reported. Within each segmented block of text, you receive the following statistics for each word (or n-gram):

  • How many times that word appears within the block
  • That absolute count divided by the number of words within the block (i.e., the fraction of the block that this word constitutes)
  • TF/IDF (term frequency-inverse document frequency) of this term within the dataset
  • TF/IDF of this term within the corpus as a whole (not available for n-grams)

You can also see the number of types and tokens for each segment. And for the entire dataset, you receive the following statistics for each word (see the sketch after this list):

  • How many times that word appears within the entire dataset
  • That absolute count divided by the number of words within the dataset (i.e., the fraction of the dataset that this word constitutes)
  • DF (document frequency) of this term within the entire corpus (i.e., the number of documents in the entire database in which this term appears; not available for n-grams)
  • TF/IDF of this term within the entire corpus (not available for n-grams)
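
To make these definitions concrete, here is a minimal sketch in Python of how such statistics can be computed. The toy documents and the particular tf-idf weighting are assumptions for illustration only; they are not the RLetters implementation.

    import math
    from collections import Counter

    # Toy data, assumed for illustration: a document is a list of word tokens,
    # a dataset is a list of documents, and the corpus contains the dataset.
    dataset = [
        ["the", "selfish", "gene", "is", "a", "gene"],
        ["selection", "acts", "on", "the", "gene"],
    ]
    corpus = dataset + [
        ["an", "unrelated", "article", "about", "climate"],
    ]

    # Types and tokens, plus the absolute count and proportion of each word.
    dataset_tokens = [w for doc in dataset for w in doc]
    counts = Counter(dataset_tokens)
    print("tokens:", len(dataset_tokens), "types:", len(counts))
    proportions = {w: c / len(dataset_tokens) for w, c in counts.items()}

    def document_frequency(term, documents):
        """DF: the number of documents in which the term appears."""
        return sum(1 for doc in documents if term in doc)

    def tf_idf(term, documents):
        """One common tf-idf weighting; the job's exact formula may differ."""
        df = document_frequency(term, documents)
        return counts[term] * math.log(len(documents) / df) if df else 0.0

    for term in ("gene", "the"):
        print(term, counts[term], round(proportions[term], 3),
              document_frequency(term, corpus), round(tf_idf(term, corpus), 3))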

In addition to supplying the raw input for a wide variety of textual analysis algorithms that the user can run on their own, this data can immediately answer a variety of interesting questions:

How often are certain words used within a given dataset? (Input: a domain of interest, looking at the proportion value for the terms at issue)

Does a body of literature use certain words more often than the rest of the culture at large? (Input: a domain of interest, comparing the proportion value for the terms at issue to proportion values queried from the Google Ngram Viewer)

What are the “interesting” or “unusual” words in this particular dataset, with respect to the rest of the corpus? (Input: a domain of interest, looking at the TF/IDF values of terms in the entire dataset against the corpus – large values indicate that a term is “unusual” for the corpus at large but occurs often within the dataset)
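
For the last question, one way a user might reproduce this kind of ranking on their own exported data is to weight terms with an off-the-shelf tf-idf implementation and sort. A sketch using scikit-learn follows; note that its weighting is an assumption and is not identical to the values this job reports.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus; the first two "articles" form the dataset of interest.
    corpus = [
        "the selfish gene and natural selection",
        "fitness landscapes shape the selfish gene",
        "climate models improve weather prediction",
    ]
    dataset_rows = [0, 1]

    vectorizer = TfidfVectorizer()
    weights = vectorizer.fit_transform(corpus)          # documents x terms
    dataset_weights = weights[dataset_rows].sum(axis=0).A1

    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, dataset_weights), key=lambda t: t[1], reverse=True)
    print(ranked[:5])   # terms characteristic of the dataset relative to the corpus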

Options

The word frequency analyzer has many configurable options – it is one of the most powerful analysis jobs in RLetters.

First, you can choose whether you want to analyze the frequencies of single words, or the frequencies of multiple-word phrases (called n-grams). You can analyze n-grams of any size from 2 to 20 words.
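
For instance, extracting n-grams from a stream of word tokens is just a sliding window; a minimal sketch (the sample sentence and the whitespace tokenization are assumptions):

    from collections import Counter

    def ngrams(tokens, n):
        """Return all n-grams (as tuples) found in a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "natural selection acts on the selfish gene".split()
    print(Counter(ngrams(tokens, 2)).most_common(3))   # the three most common bigrams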

Once you have chosen whether to analyze single words or n-grams, you have further options to refine which words or n-grams will be returned.

For single words: You can receive an analysis for the n most frequent words, for all words in the dataset, or for an explicit list of words that you provide. For the first two options, you can also exclude words: either the most common words (“stop words”) in a variety of languages, or an explicit list of words that you ask the job to ignore.
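
A rough sketch of how these inclusion and exclusion choices combine (the tiny stop list here stands in for the language-specific lists the job offers):

    from collections import Counter

    tokens = "the gene is the unit of selection and the gene persists".split()

    # Most frequent words, after excluding stop words.
    stop_words = {"the", "is", "of", "and"}      # stand-in for a real stop list
    counts = Counter(w for w in tokens if w not in stop_words)
    print(counts.most_common(2))                 # e.g. [('gene', 2), ('unit', 1)]

    # Or: an explicit list of words supplied by the user.
    wanted = {"gene", "selection"}
    print({w: c for w, c in Counter(tokens).items() if w in wanted})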

For n-grams: You can receive an analysis either for the n most frequent n-grams or for all n-grams in the dataset. Either way, you can refine this list further by returning only n-grams that contain certain words, or only n-grams that do not contain certain words.
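
The analogous n-gram filters keep or drop whole phrases according to the words they contain; a minimal sketch using bigrams:

    from collections import Counter

    tokens = "natural selection acts on the selfish gene in natural populations".split()
    bigrams = [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

    include = {"selection", "gene"}   # keep only n-grams containing one of these
    exclude = {"the"}                 # drop any n-gram containing one of these

    kept = [ng for ng in bigrams
            if any(w in include for w in ng) and not any(w in exclude for w in ng)]
    print(Counter(kept).most_common())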

For either single words or n-grams, you can choose to either stem or lemmatize the words in the documents. Lemmatization attempts to convert inflected forms of words to their base form so that they can be analyzed together (for example, the verbs “were” and “are” both become “be”). Stemming simply strips word endings, and thus groups words together in a slightly different way (“temptation” becomes “temptat”).
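
One way to see the difference is with NLTK's stemmer and lemmatizer; this is an assumption about tooling (the lemmatizer also requires NLTK's WordNet data to be downloaded), not the library RLetters uses.

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("temptation"))             # crude suffix stripping (e.g. "temptat")
    print(lemmatizer.lemmatize("were", pos="v"))  # maps the inflected form to "be"
    print(lemmatizer.lemmatize("are", pos="v"))   # "be"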

Finally, you have a variety of options that allow you to determine word frequencies within blocks of text, a common requirement for many language analysis algorithms.

You can choose to split blocks either by giving the number of words that should be in each block, or by giving the number of blocks you would like the dataset split into. If you split the dataset into blocks of a specified number of words, you will also have to choose what to do with the remainder: you can make a large last block (adding the leftover words to the last full block), make a small last block (making a block from just the leftover words), truncate the leftover words (discarding them), or truncate every document to the specified length, producing only one block per document. Lastly, you can choose whether or not these blocks will split across article boundaries.
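
A rough sketch of this block-splitting logic follows; the option names used here are illustrative, not the job's actual settings.

    def split_into_blocks(tokens, block_size, remainder="big"):
        """Split a list of word tokens into blocks of block_size words.

        remainder controls the leftover words: "big" adds them to the last full
        block, "small" makes a separate short block from them, "truncate"
        discards them, and "truncate_all" keeps only the first block.
        """
        blocks = [tokens[i:i + block_size]
                  for i in range(0, len(tokens), block_size)]
        if len(blocks) > 1 and len(blocks[-1]) < block_size:
            leftover = blocks.pop()
            if remainder == "big":
                blocks[-1] += leftover
            elif remainder == "small":
                blocks.append(leftover)
            # "truncate": the leftover words are simply dropped
        if remainder == "truncate_all":
            blocks = blocks[:1]
        return blocks

    words = list("abcdefghij")    # stand-in for a document's word tokens
    print(split_into_blocks(words, 4, remainder="big"))     # 2 blocks: 4 + 6 words
    print(split_into_blocks(words, 4, remainder="small"))   # 3 blocks: 4 + 4 + 2 words
    # To split across article boundaries instead, concatenate all of the
    # documents' tokens into one stream before calling split_into_blocks.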

Here’s how to create a few commonly used block patterns:

  • One block for each document in the database: Choose to split text blocks by number of blocks, choose 1 block, and uncheck the option to split blocks across documents.
  • One block for the entire database: Choose to split text blocks by number of blocks, choose 1 block, and check the option to split blocks across documents.
  • Blocks of a given constant size for use in text analysis: Choose to split text blocks by number of words, and enter the desired number of words (say, 250). Choose to truncate leftover words, and uncheck the option to split blocks across documents.