Learning Languages with Zipf's Law

Learning a new language can be a daunting task. However, word frequencies in many languages approximately follow Zipf's law: the most frequent word occurs about twice as often as the second most frequent, three times as often as the third, and so on. As a result, a relatively small number of words makes up the majority of the spoken and written corpus, and learning roughly 500 words is enough to recognize ~75% of the words in common speech.
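Written as a formula, this idealized rank-frequency relationship looks as follows. The symbols are notation introduced here for illustration: $f(r)$ is the frequency of the word at rank $r$, and $s$ is an exponent that equals 1 in the textbook form of the law. Real corpora only follow this approximately.

$$
f(r) \propto \frac{1}{r^{s}}, \qquad s = 1 \;\Rightarrow\; \frac{f(1)}{f(2)} = 2, \quad \frac{f(1)}{f(3)} = 3 .
$$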

Here I use subtitles in English, German, French, Spanish, and Russian to explore this. The subtitles are from opensubtitles.org, provided by Hermit Dave as ranked word lists with a count for each word. Code for this analysis can be found in the .Rmd file. This project is an expansion of work by Tomi Mester.

Plotting the cumulative frequency of the top N words shows that, depending on the language, the 500 most frequent words cover ~75% of all word occurrences, the top 1,000 cover ~80%, and the top 2,000 cover ~85%.
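The cumulative figures come down to a running sum over a ranked list of word counts. The Python sketch below illustrates the idea and is not the analysis code from the .Rmd file. The function names and the toy counts are made up for this example.

```python
from itertools import accumulate


def cumulative_coverage(counts):
    """Share of all word occurrences covered by the top 1..N words."""
    ranked = sorted(counts, reverse=True)  # most frequent word first
    total = sum(ranked)
    if total == 0:
        raise ValueError("counts must contain at least one positive value")
    return [running / total for running in accumulate(ranked)]


def words_needed(counts, target):
    """Smallest N whose top-N words cover at least `target` of all occurrences."""
    for n, share in enumerate(cumulative_coverage(counts), start=1):
        if share >= target:
            return n
    return len(counts)  # target not reachable (e.g. target > 1)


# Toy counts, invented for illustration (not taken from the subtitle data).
toy_counts = [50, 25, 15, 5, 3, 2]
coverage = cumulative_coverage(toy_counts)
print(coverage)
print(words_needed(toy_counts, 0.8))
```

With a real frequency list, `counts` would be the count column of the ranked word file, and `words_needed(counts, 0.75)` would give the vocabulary size needed for 75% coverage.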

We can test whether a discrete power law (Zipf) fits the English data well. The red line shows the fitted power law with α = 1.6.
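For a rough sense of how such an exponent can be estimated, the sketch below uses the closed-form approximation to the discrete maximum-likelihood estimate from Clauset, Shalizi and Newman (2009): α ≈ 1 + n / Σ ln(x_i / (x_min − 0.5)). This is an assumption about one possible approach, not necessarily the method used in the .Rmd file. The function name, the `x_min` cutoff, and the toy counts are hypothetical, and the approximation is known to be less accurate for small `x_min`.

```python
import math


def approx_discrete_alpha(counts, x_min=1):
    """Approximate MLE of the exponent alpha for a discrete power law.

    Uses alpha ~= 1 + n / sum(ln(x_i / (x_min - 0.5))) over all x_i >= x_min.
    """
    tail = [c for c in counts if c >= x_min]  # only values at or above the cutoff
    if not tail:
        raise ValueError("no counts at or above x_min")
    log_sum = sum(math.log(c / (x_min - 0.5)) for c in tail)
    return 1.0 + len(tail) / log_sum


# Toy counts, invented for illustration (not taken from the subtitle data).
toy_counts = [10, 20, 40]
alpha = approx_discrete_alpha(toy_counts, x_min=10)
print(round(alpha, 3))
```

An estimate like this says nothing on its own about whether the power law is a good fit; that needs a separate goodness-of-fit check against the observed counts.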

And what text analysis would be complete without a word cloud? Here is one of the 1,000 most frequent words.

Caveats