Word frequency laws

In 1945 the linguist George Zipf observed two strange word frequency phenomena: the longer a word is, the less common it tends to be; and the most common word is used about twice as often as the second most common, three times as often as the third, and so on.

Image: Brockhaus_Lexikon.jpg: Jvdc; derivative work: Veinarde / CC BY-SA

Take a collection of language samples (in linguistics this is called a corpus). Count how many times each word appears in that corpus. The resulting list shows you the word frequencies in that language. The corpus used by the Oxford English Dictionary – one of the largest English corpora in the world, by the way – lists the following words as the most common: the, to, of, and, and a. (Be comes in second, but that includes all the variations like is and was.) Notice how they’re all one syllable long, and three letters or fewer? All of the most common words are short. Going down that frequency list, you get to #45 before you find a two-syllable word (about). Zipf dug a little further, and found something remarkable.
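If you want to try the counting step yourself, here is a minimal sketch in Python. The sample sentence and the function name word_frequencies are made up for illustration, and the tokenisation is deliberately crude; a real corpus needs far more care (and, unlike the Oxford list, this does not group is and was under be).

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word occurs, ignoring case."""
    # Treat runs of letters (plus apostrophes) as words. This is a crude
    # assumption; real corpora need more careful tokenisation.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# A tiny stand-in "corpus", invented for this example.
sample = "The cat sat on the mat. The dog sat by the door."
freq = word_frequencies(sample)

# Print the start of the frequency list, most common word first.
for rank, (word, count) in enumerate(freq.most_common(3), start=1):
    print(rank, word, count)
```

Feed it a few novels instead of one sentence and the short words should float to the top of the list.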

As a general rule, the shorter a word is, the more often it appears, and the longer a word is, the less often it appears. This is true not just in one corpus but in pretty much every one of sufficient size, and not just in English but in every language we’ve checked. It’s not even limited to humans! Apparently the vocalisations of dolphins, macaques, marmosets, and bats follow the same pattern.

This relationship is known as the brevity law (or Zipf’s law of abbreviation). Looking at the same word frequency lists, Zipf went even further and pointed out another interesting pattern in the distribution of words. The most common word occurred about twice as often as the second most common word, three times as often as the third, four times as often as the fourth, and so on down the list.
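To make the "twice as often, three times as often" pattern concrete, here is a small sketch of the idealised version: the word at rank r is expected to appear about (count of the top word) / r times. The figure of 60,000 is a hypothetical number chosen for round results, not taken from any corpus, and real frequency lists only follow this curve approximately.

```python
def zipf_expected(top_count, ranks):
    """Idealised rank-frequency counts: rank r appears top_count / r times."""
    return [top_count / r for r in ranks]

# Hypothetical figure: suppose the most common word appears 60,000 times.
expected = zipf_expected(60_000, range(1, 6))

# Expected counts for the five most common words under the idealised rule.
for rank, count in enumerate(expected, start=1):
    print(rank, round(count))
```

Comparing these idealised numbers with the counts from a real frequency list is a quick way to see how closely a corpus follows the pattern.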

This rank-frequency pattern is known as Zipf’s law – although he was not the first to make this observation. It too seems to hold in all languages, even the made-up ones. What are we to make of these two laws? One possible conclusion is that languages follow the path of least resistance: the most common words are ground down and abbreviated by use and time, and people want to communicate effectively using as few words as possible, which leads to a Zipf-like distribution.

That’s one explanation. The Wikipedia article mentions another curious experiment: someone generated a string of random text with spaces randomly distributed between the letters. This leaves you with a long list of random collections of letters that form pseudo-“words.” A frequency list from this random text also follows a Zipf-like distribution. I imagine that this happens because there are fewer possible letter combinations for short words than there are for longer ones, so any particular short combination turns up more often… which means this may just be a natural mathematical pattern and not some deeper truth about language. Perhaps? Add to this the fact that Zipf’s law has been observed in other phenomena – note distributions in music, city populations in large countries, income distribution (pre-Bezos, I assume) – and the waters become even more muddied.
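The random-text experiment is easy to imitate. The sketch below is my own rough version, not the original experiment: the four-letter alphabet, the 20% chance of a space, and the function name random_text_words are all arbitrary assumptions picked to keep the run small.

```python
import random
from collections import Counter

def random_text_words(n_chars, alphabet="abcd", space_prob=0.2, seed=0):
    """Type n_chars random symbols; spaces split the stream into pseudo-words."""
    # alphabet and space_prob are arbitrary choices for illustration.
    rng = random.Random(seed)
    chars = [
        " " if rng.random() < space_prob else rng.choice(alphabet)
        for _ in range(n_chars)
    ]
    # split() with no argument drops the empty strings between repeated spaces.
    return "".join(chars).split()

words = random_text_words(200_000)
freq = Counter(words)

# Sample a few ranks spread across the frequency list.
ranked = freq.most_common()
for rank in (1, 10, 100, 1000):
    if rank <= len(ranked):
        word, count = ranked[rank - 1]
        print(rank, word, count)
```

With these settings the top of the list should be filled by the one-letter pseudo-words, followed by the two-letter ones, with counts dropping steadily as rank rises. That is only a rough, step-like cousin of the smooth curve seen in real language, but it does suggest that short-and-frequent can arise with no meaning involved at all.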
