
Data Science for Linguistics

The impact of data science on linguistics has been profound. All areas of the field depend on having a rich picture of the true range of variation, within dialects, across dialects, and among different languages. The subfield of corpus linguistics is arguably as old as the field itself and, with the advent of computers, gave rise to many core techniques in data science. When the first big search engines launched in the late 1990s, linguists immediately began using them to find examples, study trends, and uncover subtle patterns in usage that emerge only at a very large scale. This is arguably the first application of big data in the field; even the largest corpora seemed small by comparison with the open Web. Borrowing techniques from information retrieval, linguists began to piece together an ever-richer picture of how language is used.

Since then, techniques from statistics and data science have grown increasingly central to the field. For example, using large annotated corpora, we have established deep connections between frequency and human sentence processing; we have uncovered new associations between language use, social bonds, and non-linguistic aspects of communication like gesture and body language; and we have mapped patterns of historical development to a previously unimaginable level of detail.

Within psycholinguistics, data science work has emerged as an important counterpart to traditional experiments, often serving to inform the creation of designs and materials, and to obtain naturalistic evidence to bolster the results of controlled manipulations. Within sociolinguistics, messages exchanged on large social networks are fueling new approaches to dialectology and social meaning. Within the subfields of structural linguistics, which focus on form and meaning, large-scale efforts to create parsed corpora in multiple languages have reshaped our understanding of form and variation across multiple language families.
And all of these developments have reunited linguistics and natural language processing; after a period in which relatively few results were shared between these fields, we now have flourishing collaborations between them, with Stanford arguably the foremost institution for such interdisciplinary work.


Dan Jurafsky
Linguistics & Computer Science
Christopher Potts