This analysis job produces a list of statistically significant pairs of words that occur near one another but need not be adjacent.

To start this job, select the question: “What pairs of words often appear in the same sentence, paragraph, section, or article?”

In natural language processing, a cooccurrence is a statistically significant association between a pair of words, where those words need not appear immediately next to one another. For example, paragraphs that mention the United Nations are also likely to mention the General Assembly or the Security Council.

(If you would like to determine statistically significant associations between words that are immediate neighbors, check out the collocation analysis.)

Once the job is finished, the requested cooccurrences are available for download. This job can answer questions such as:

What concepts are often invoked together in a body of literature? (Input: a domain of interest. Select one of the first three analysis methods, then search for the concepts of interest.)


You can choose among several tests for determining the significance values of cooccurrence pairs.
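This page does not name the available tests, so the following is only an illustration of what a significance score for a word pair can look like. It uses pointwise mutual information, a common association measure, and is not necessarily one of the tests this job offers. The function name `pmi` and its count arguments are hypothetical.

```python
import math

def pmi(n_both, n_a, n_b, n_blocks):
    """Pointwise mutual information (base 2) for a word pair.

    Illustrative only; all names here are hypothetical.
    n_both   -- number of blocks containing both words
    n_a, n_b -- number of blocks containing each word
    n_blocks -- total number of blocks examined
    """
    if n_both == 0:
        return float("-inf")  # the pair never occurs together
    # log2( P(a, b) / (P(a) * P(b)) ), rearranged to use raw counts
    return math.log2(n_both * n_blocks / (n_a * n_b))

# Words that share blocks exactly as often as chance predicts score 0;
# words that share blocks more often than chance score above 0.
chance_level = pmi(n_both=1, n_a=10, n_b=10, n_blocks=100)
associated = pmi(n_both=5, n_a=10, n_b=10, n_blocks=100)
```

A higher score means the two words share a block more often than their individual frequencies would predict.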

The most important parameter for this analysis is the window size (the distance) used to detect cooccurrences. The cooccurrence algorithm checks for significant correlations between words that occur within blocks whose length, in words, is set by this parameter. To emulate “phrase-level” cooccurrence, use a distance of 5 words. For “sentence-level” cooccurrence, try 20. For “paragraph-level” cooccurrence, use 200. The largest possible window is the whole article: set the distance to a large number to search for article-level cooccurrence.
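The effect of the window size can be sketched in a few lines. This is a minimal illustration that assumes the text is split into consecutive, non-overlapping blocks of `window` words; the job's actual block construction is not described on this page, and the function name `block_pair_counts` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def block_pair_counts(tokens, window):
    """Count, for each unordered word pair, how many blocks contain both words.

    Assumes non-overlapping blocks of `window` tokens (an illustrative choice).
    """
    pair_counts = Counter()
    for start in range(0, len(tokens), window):
        block = set(tokens[start:start + window])
        # Sort so each pair is always stored in the same order
        for pair in combinations(sorted(block), 2):
            pair_counts[pair] += 1
    return pair_counts

text = "the general assembly met today and the security council met later"
tokens = text.split()

# With a phrase-level window, "general" and "council" never share a block.
counts = block_pair_counts(tokens, window=5)
print(counts[("assembly", "general")])  # 1
print(counts[("council", "general")])   # 0

# With a window larger than the text, every word shares one block.
whole_text = block_pair_counts(tokens, window=100)
print(whole_text[("council", "general")])  # 1
```

Raw counts like these are only the input to a significance test; on their own they say nothing about whether a pair occurs together more often than chance.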

You may return either a given number of the most significant cooccurrences or, without any increase in computation time, all cooccurrences regardless of their significance values.
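One way to see why returning everything costs no extra time: every pair has to be scored before the pairs can be ranked, so limiting the output only truncates a list that already exists. A minimal sketch follows, assuming the scores are held in a dictionary; the name `top_results` and the sample values are hypothetical.

```python
def top_results(scores, n=None):
    """Return (pair, score) tuples, most significant first.

    scores -- dict mapping a word pair to its significance value (hypothetical)
    n      -- how many results to keep, or None to keep all of them
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if n is None else ranked[:n]

# Made-up scores, for illustration only
example_scores = {
    ("general", "assembly"): 3.1,
    ("security", "council"): 2.7,
    ("council", "today"): 0.2,
}

best_two = top_results(example_scores, n=2)
everything = top_results(example_scores)
```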

You must specify either a single word of interest or a comma-separated list of words. If you specify a single word, you will receive a list of all (or the n most significant, as specified above) cooccurrences that include this word. If you specify a comma-separated list of words, you will receive only the significance values for cooccurrences among those words, taken pairwise.
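The difference between the two input modes can be sketched as a filter over already-scored pairs. This is an illustration under stated assumptions, not the job's implementation: the function `select_pairs`, the dictionary layout, and the sample scores are all hypothetical.

```python
from itertools import combinations

def select_pairs(words_field, scored_pairs):
    """Filter scored pairs according to the words the user entered.

    words_field  -- the raw input, e.g. "general" or "general, assembly"
    scored_pairs -- dict mapping frozenset({a, b}) to a score (hypothetical layout)
    """
    words = [w.strip() for w in words_field.split(",") if w.strip()]
    if len(words) == 1:
        # Single word: keep every pair that includes it
        focal = words[0]
        return {pair: s for pair, s in scored_pairs.items() if focal in pair}
    # Several words: keep only pairs formed from the listed words
    wanted = {frozenset(p) for p in combinations(words, 2)}
    return {pair: s for pair, s in scored_pairs.items() if pair in wanted}

# Made-up scores, for illustration only
scores = {
    frozenset({"general", "assembly"}): 3.1,
    frozenset({"security", "council"}): 2.7,
    frozenset({"general", "council"}): 0.4,
}

single = select_pairs("general", scores)
pairwise = select_pairs("general, assembly, security", scores)
```

Here `single` keeps the two pairs that contain “general”, while `pairwise` keeps only the one pair whose two words both appear in the list.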

Finally, you can choose either to stem or to lemmatize the documents before searching for cooccurrences. Lemmatization attempts to convert inflected forms of words (for example, the verbs “were” and “are”) to their base form (“be”) so that they can be analyzed together. Stemming simply removes word endings, and thus groups words together in a slightly different way (“temptation” becomes “temptat”).
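The contrast between the two options can be shown with a deliberately tiny sketch. The suffix list and the lookup table below are toy assumptions made up for this example. Real stemmers and lemmatizers use far larger rule sets and dictionaries, and nothing here reflects the ones this job actually uses.

```python
# Toy suffix list and lemma table, made up for this illustration
SUFFIXES = ("ion", "ing", "ed", "s")
LEMMAS = {"were": "be", "are": "be", "is": "be"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a table of base forms; unknown words pass through."""
    return LEMMAS.get(word, word)

print(toy_stem("temptation"))   # temptat
print(toy_lemmatize("were"))    # be
print(toy_lemmatize("are"))     # be
print(toy_stem("were"))         # were (no suffix rule applies)
```

The last line shows the practical difference: stripping endings cannot connect “were” to “are”, whereas a dictionary of base forms can.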