Analyze cooccurrences


This analysis job finds statistically significant pairs of words that occur near one another in the text, though not necessarily side by side.

To start this job, select the question: “What pairs of words often appear in the same sentence, paragraph, section, or article?”

In natural language processing, a cooccurrence is a statistically significant association between a pair of words, where those words need not appear immediately next to one another. For example, paragraphs that mention the United Nations are also likely to mention the General Assembly or the Security Council.

(If you would like to determine statistically significant associations between words that are immediate neighbors, check out the collocation analysis.)

Once the job is finished, you can download the resulting list of cooccurrences. This job can answer questions such as:

What concepts are often invoked together in a body of literature? (Input: a domain of interest, selecting one of the first three analysis methods and then searching for concepts of interest)

Options

You can choose among several tests for determining the significance of cooccurrence pairs.

  • Mutual information, which measures the extent to which being informed about the first of a pair of words provides information about the second member of the pair.

  • One-tailed t-test, which determines whether or not there is significant support for the hypothesis that a given pair of words is correlated over the null hypothesis that words are independently distributed.

    If you are already experienced with t-tests, you will notice that p-values for cooccurrences are very small; you can no longer use, for example, the rule of thumb that p < 0.05 means that a cooccurrence is significant. This is to be expected, because words in natural language are far from independently distributed, even when they are not correlated with one another in the linguistic sense.

  • Log-likelihood ratio, which compares how likely the observed counts are if the two words are independent with how likely they are if the two words are dependent.
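As a rough illustration of the three tests above, the sketch below computes each one from four block counts. It uses common textbook formulations, which are an assumption here and not necessarily the exact formulas used by this job. The names n (total blocks), c1 and c2 (blocks containing each word), and c12 (blocks containing both) are hypothetical, and the p-value uses a normal approximation that is only reasonable when n is large.

```python
import math


def mutual_information(n, c1, c2, c12):
    """Pointwise mutual information in bits (assumes c12 > 0)."""
    return math.log2(c12 * n / (c1 * c2))


def t_score(n, c1, c2, c12):
    """t statistic comparing the observed joint rate with the rate
    expected if the two words were independent (assumes 0 < c12 < n)."""
    observed = c12 / n
    expected = (c1 / n) * (c2 / n)
    variance = observed * (1 - observed)  # variance of a yes/no outcome
    return (observed - expected) / math.sqrt(variance / n)


def one_tailed_p(t):
    """One-tailed p-value, using a normal approximation (an assumption)."""
    return 0.5 * math.erfc(t / math.sqrt(2))


def log_likelihood_ratio(n, c1, c2, c12):
    """G-squared statistic over the 2x2 table of presence and absence."""
    # Rows: word 1 present/absent. Columns: word 2 present/absent.
    table = [[c12, c1 - c12], [c2 - c12, n - c1 - c2 + c12]]
    rows = [sum(row) for row in table]
    cols = [table[0][j] + table[1][j] for j in range(2)]
    total = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = rows[i] * cols[j] / n
            if observed > 0:  # an empty cell contributes nothing
                total += observed * math.log(observed / expected)
    return 2 * total
```

With 1000 blocks in which each word appears in 100, a pair that shares exactly 10 blocks matches the independence expectation, so all three statistics come out at (or within rounding of) zero. Larger shared counts push all three upward.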

The most important parameter for this analysis is the window size, the distance in words within which cooccurrences are detected. The cooccurrence algorithm checks for significant correlations between words that occur within blocks of a length controlled by this parameter. To emulate “phrase-level” cooccurrence, use a window of 5 words. For “sentence-level” cooccurrence, try 20. For “paragraph-level” cooccurrence, use 200. The maximum window is the entire article; to search for article-level cooccurrence, set the window size to a large number.
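To make the role of this parameter concrete, here is a minimal sketch that assumes the text is cut into consecutive, non-overlapping blocks of the given length. That blocking scheme, and the function names, are illustrative assumptions; the actual job may divide the text differently.

```python
def split_into_blocks(tokens, window):
    """Cut a token list into consecutive blocks of at most `window` words."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]


def block_counts(tokens, window, word1, word2):
    """Return (total blocks, blocks with word1, blocks with word2,
    blocks containing both words)."""
    blocks = [set(block) for block in split_into_blocks(tokens, window)]
    n = len(blocks)
    c1 = sum(word1 in block for block in blocks)
    c2 = sum(word2 in block for block in blocks)
    c12 = sum(word1 in block and word2 in block for block in blocks)
    return n, c1, c2, c12


tokens = ("the security council met while the general assembly "
          "waited and the council voted").split()

# A small window gives several short blocks; a very large window
# treats the whole text as one article-level block.
phrase_level = block_counts(tokens, 5, "council", "security")
article_level = block_counts(tokens, 1000, "council", "security")
```

A larger window yields fewer blocks, each more likely to contain both words, which is why the same pair can look very different at the phrase level and at the article level.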

You may return either a given number of the most significant cooccurrences or, without any increase in computation time, all cooccurrences regardless of their significance values.

You must specify either a single word of interest or a comma-separated list of words. If you specify a single word, you will receive a list of all cooccurrences that include this word (or the n most significant, as specified above). If you instead specify a comma-separated list of words, you will receive only the significance values for cooccurrences among those words, taken pairwise.
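The difference between the two input modes can be sketched as follows. The function name and the vocabulary argument are hypothetical, used only to show which pairs would be scored in each case.

```python
from itertools import combinations


def candidate_pairs(words_field, vocabulary):
    """List the word pairs to score for a given words-of-interest field."""
    words = [w.strip() for w in words_field.split(",") if w.strip()]
    if len(words) == 1:
        # Single word: pair it with every other word in the vocabulary.
        focus = words[0]
        return [(focus, other) for other in sorted(vocabulary)
                if other != focus]
    # Several words: score only the pairs among the listed words.
    return list(combinations(words, 2))
```

For instance, the field "united, nations, council" produces three pairs regardless of how large the vocabulary is, while the single word "council" produces one pair for every other word in the vocabulary.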

Finally, you can choose either to stem or to lemmatize the documents before searching for cooccurrences. Lemmatization attempts to convert inflected forms of words (such as the verb forms “were” and “are”) to their base form (“be”) so that they can be analyzed together. Stemming simply removes word endings, and thus groups words together in a slightly different way (“temptation” becomes “temptat”).
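The toy sketch below shows how the two options group words differently. The lookup table and suffix list are made up for illustration; real stemmers and lemmatizers rely on much larger rule sets and dictionaries, and nothing here reflects the tools this job actually uses.

```python
# Hypothetical, deliberately tiny lemma table for illustration only.
TOY_LEMMAS = {"were": "be", "are": "be", "is": "be", "was": "be"}

# Hypothetical suffix list; a real stemmer applies many ordered rules.
TOY_SUFFIXES = ("ion", "ing", "ed", "s")


def toy_lemmatize(word):
    """Map a known inflected form to its base form, else leave it alone."""
    return TOY_LEMMAS.get(word, word)


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word
```

Under these toy rules, lemmatizing groups “were” and “are” under “be”, which suffix stripping cannot do, while stemming reduces “temptation” to “temptat” and the lemma table leaves it untouched.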

All content copyright © 2010–2018 Charles Pence. All web content is released under CC-BY-NC-SA 3.0.
Code is released under the MIT License. RLetters logo taken from a photograph by Leo Reynolds.