Analyze collocations


This analysis job finds a list of statistically significant pairs of words that appear immediately next to one another.

To start this job, select the questions: “What pairs of words often appear directly together? What technical terms or phrases appear in the literature?”

In natural language processing, a collocation is a statistically significant association between a pair of words that appear directly next to one another. For example, while English speakers use the phrases “strong tea” and “powerful computers,” it would not be idiomatic English to use “powerful tea” or “strong computers.”

(If you would like to determine statistically significant associations between words that are farther apart than immediate neighbors, check out the cooccurrence analysis.)
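
To make the idea concrete, pointwise mutual information (one of the significance measures described below) compares how often a pair occurs together with how often it would occur if the two words were independent. The sketch below works that comparison through in Python for the “strong tea” example; the counts are invented purely for illustration and are not drawn from any real corpus.

    import math

    # Invented counts for illustration only (not from any real corpus).
    total_words = 1_000_000          # size of the hypothetical corpus
    count = {
        "strong": 1_500,             # occurrences of each single word
        "powerful": 1_200,
        "tea": 800,
    }
    pair_count = {
        ("strong", "tea"): 60,       # "strong tea" appears together often
        ("powerful", "tea"): 1,      # "powerful tea" is essentially accidental
    }

    def pmi(w1, w2):
        """Pointwise mutual information: log2( P(w1, w2) / (P(w1) * P(w2)) )."""
        p_w1 = count[w1] / total_words
        p_w2 = count[w2] / total_words
        p_pair = pair_count[(w1, w2)] / total_words
        return math.log2(p_pair / (p_w1 * p_w2))

    print(pmi("strong", "tea"))      # large positive value: a likely collocation
    print(pmi("powerful", "tea"))    # near zero: no meaningful association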

You can specify how many of the most significant collocations to keep, and the resulting list is made available for download. This job can answer a variety of interesting questions:

What concepts are often invoked together in a body of literature? (Input: a domain of interest, selecting one of the first three analysis methods and then searching for concepts of interest)

What technical terms or phrases are often used in a discipline? (Input: a domain of interest, selecting the parts-of-speech analysis method)

Options

You can choose from several tests for determining the significance of collocation pairs; a brief sketch illustrating these measures and the part-of-speech filter follows the list.

  • Mutual information, which measures the extent to which knowing the first word of a pair provides information about the second.

  • One-tailed t-test, which determines whether there is significant support for the hypothesis that a given pair of words is correlated, against the null hypothesis that the words are independently distributed.

    If you are already experienced with t-tests, you will notice that the p-values for collocations are very small; you can no longer use, for example, the rule of thumb that p < 0.05 means a collocation is significant. This is to be expected, because natural language is far from independently distributed, even when words are not in fact correlated with one another in the linguistic sense.

  • Log-likelihood ratio, which compares the probability that the two words are independent with the probability that they are dependent.

  • Frequency, biased by parts of speech, which sorts bigrams and trigrams by their raw frequency counts and then filters them according to their parts of speech. Justeson and Katz proposed a set of part-of-speech filters that are likely to separate useful and interesting collocations from those that merely involve stop words. (Part-of-speech tagging is performed by the Stanford POS Tagger.) The part-of-speech patterns that are kept are:

    • Adjective Noun
    • Noun Noun
    • Adjective Adjective Noun
    • Adjective Noun Noun
    • Noun Adjective Noun
    • Noun Noun Noun
    • Noun Preposition Noun
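
RLetters computes all of these measures internally. Purely as an illustration of what they do, the sketch below scores bigrams from a toy word list with Python's NLTK library, which happens to expose the same four measures, and then shows a simple Justeson-and-Katz-style pattern check over already-tagged bigrams. The toy text, the hand-assigned Penn Treebank tags, and the mapping of “Adjective”, “Noun”, and “Preposition” to those tags are assumptions made for this example, not details of RLetters itself.

    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures

    # A toy, whitespace-tokenised "corpus"; real input would be far larger.
    words = ("we drank strong tea and more strong tea "
             "while the powerful computers ran strong tea experiments").split()

    measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)

    # The same four scoring approaches described above, as exposed by NLTK.
    for name, measure in [("mutual information", measures.pmi),
                          ("t-test", measures.student_t),
                          ("log-likelihood ratio", measures.likelihood_ratio),
                          ("raw frequency", measures.raw_freq)]:
        top = finder.nbest(measure, 3)      # three highest-scoring bigrams
        print(f"{name:22s} {top}")

    # A Justeson-and-Katz-style part-of-speech filter over tagged bigrams.
    # Only the two bigram patterns from the list above are checked here; the
    # trigram patterns would be handled the same way. Tags are hand-assigned
    # Penn Treebank tags for illustration; RLetters uses the Stanford POS Tagger.
    KEPT_BIGRAM_PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}
    SIMPLIFIED = {"JJ": "ADJ", "NN": "NOUN", "NNS": "NOUN", "IN": "PREP"}

    def keep_bigram(tagged_bigram):
        """Keep only bigrams whose tag pattern is Adjective-Noun or Noun-Noun."""
        pattern = tuple(SIMPLIFIED.get(tag, "OTHER") for _, tag in tagged_bigram)
        return pattern in KEPT_BIGRAM_PATTERNS

    print(keep_bigram((("strong", "JJ"), ("tea", "NN"))))   # True
    print(keep_bigram((("tea", "NN"), ("and", "CC"))))      # False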

You may return either a given number of the most significant collocations or, with no increase in computation time, all collocations regardless of their significance values. Finally, you can choose to filter the list by a particular word, returning only the pairs in which one of the two words is the word you provide.
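
These post-processing options amount to simple operations on the scored list: keep the N highest-scoring pairs (or keep all of them), and optionally drop every pair that does not contain the chosen word. The scores below are invented placeholders, not real output from the job.

    # Invented (bigram, score) pairs standing in for the job's scored output.
    scored = [(("strong", "tea"), 9.1),
              (("powerful", "computers"), 8.4),
              (("drank", "tea"), 3.2),
              (("the", "computers"), 0.4)]

    # Keep only the N most significant collocations (or skip this step to keep all).
    top_n = sorted(scored, key=lambda item: item[1], reverse=True)[:2]

    # Keep only pairs in which one of the two words is the word provided.
    focus = "tea"
    filtered = [(pair, score) for pair, score in scored if focus in pair]

    print(top_n)     # [(('strong', 'tea'), 9.1), (('powerful', 'computers'), 8.4)]
    print(filtered)  # [(('strong', 'tea'), 9.1), (('drank', 'tea'), 3.2)]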
