Module 40: Language

How language works, where it came from, and what it does to thought

Part A · what language is — the miracle of communication

about this module

Language is the most sophisticated thing human beings do. Every time you speak or read, you are performing a feat of combinatorial computation that no other species can replicate and that linguists have spent a century struggling to explain. There are roughly 7,000 living languages, structured in wildly different ways, yet they share deep organisational principles that suggest something profound about the human mind. This module gives you the conceptual toolkit to think clearly about what language actually is.

Six parts cover: the features that make language unique (A), its structural levels from sound to meaning (B), how we acquire and vary it (C), language and thought (D), major world languages in depth (E), and language in the digital age (F). A closing Q&A (G) revisits the main questions.

what makes language language

design features of human language

The linguist Charles Hockett identified 13 "design features" that characterise human language. Three of them matter most here, because no other known communication system combines all three. Displacement lets us talk about things not present in time or space: you can describe yesterday, tomorrow, or Alpha Centauri. Productivity (also called openness) means there is no upper limit on new sentences: many of the utterances you produce have almost certainly never been said before in exactly that form. Duality of patterning means meaningless units (sounds) combine into meaningful units (words), which then combine into larger meaningful units (sentences). This two-level combinatorial system gives language its explosive expressive power. Bee dances have a limited form of displacement. Birdsong has productivity of a kind. No other system has all three together.
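
The combinatorial point behind duality of patterning can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, the inventory size of 44 is borrowed from the rough English phoneme count, and it ignores the rules every language has about which sound sequences are allowed, so it gives an upper bound rather than a count of real words.

```python
def possible_forms(inventory_size: int, max_length: int) -> int:
    """Count every sequence of 1..max_length units drawn from the inventory.

    Assumption: any unit may follow any other, which no real language allows,
    so the result is an upper bound on possible word shapes.
    """
    return sum(inventory_size ** n for n in range(1, max_length + 1))


if __name__ == "__main__":
    # Illustrative inventory of 44 meaningless sound units.
    for length in (1, 2, 3, 4, 5):
        print(f"up to {length} sounds: {possible_forms(44, length):,} possible forms")
```

Even with short sequences the count grows exponentially, which is why a few dozen meaningless sounds are enough to keep tens of thousands of words distinct.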

animal communication vs. language — an explorer

the numbers — languages, speakers, scripts

living languages

~7,168

Ethnologue 2024 count. The true number is contested — "language vs dialect" has no clean answer.

endangered languages

~3,000

About 42% of all languages have fewer than 1,000 speakers. One language dies roughly every 40 days.

speakers of top 10 languages

~5.2 bn

The top 10 languages account for about 63% of humanity. The bottom 6,000+ share the remaining 37%.

distinct writing systems

~300+

Unicode 15 encodes 161 scripts. Most languages are unwritten or newly written using adapted Latin script.

top 10 languages by total speakers (L1 + L2, 2024 estimate)

English: 1.46 billion
Mandarin: 1.16 billion
Hindi: 820 million
Spanish: 760 million
French: 580 million
Modern Arabic: 550 million
Bengali: 340 million
Russian: 320 million
Portuguese: 310 million
Urdu: 240 million

Bars show total speakers (L1 native + L2 fluent). English leads on this measure; Mandarin leads on native speakers alone (~940 million).

language families — how languages are related

the major language families

Languages descended from a shared ancestor are grouped into families. The comparative method — matching regular sound correspondences across languages — lets linguists reconstruct ancestor languages (proto-languages) spoken thousands of years ago, long before writing existed. The Indo-European family is the most studied: its proto-ancestor, Proto-Indo-European (PIE), was spoken on the Pontic-Caspian steppe around 4,000 BCE and eventually gave rise to over 400 languages including English, Hindi, Russian, and Greek.

share of world's L1 speakers by language family

Source: Ethnologue 2024. "Other" includes ~100 smaller families plus language isolates (languages with no known relatives).

the indo-european family — a schematic
Proto-Indo-European
  Germanic: English, German, Dutch, Afrikaans, Swedish, Norwegian
  Romance: French, Italian, Spanish, Portuguese, Romanian, Catalan
  Slavic: Russian, Ukrainian, Polish, Czech, Serbian, Bulgarian
  Indo-Iranian: Hindi, Urdu, Bengali, Punjabi, Nepali, Persian, Pashto
  Other branches: Greek, Celtic, Baltic, Albanian, Armenian, Tocharian (extinct)

PIE spoken c. 4000 BCE on Pontic-Caspian steppe · 3 billion speakers across all branches today · reconstructed entirely from comparative evidence

Proto-Indo-European was reconstructed without a single written text in the language itself. Linguists compared regular sound patterns across Latin, Sanskrit, Greek, and Gothic, noticing, for instance, that Latin "pater," Sanskrit "pitar," and Greek "pater" all mean "father" and point to a common ancestral form, *ph₂tḗr. This comparative method, developed in the 19th century, remains one of the most impressive intellectual achievements in the humanities.

language isolates — the ones with no known relatives

Some languages resist all attempts at family classification. Basque, spoken by about 750,000 people around the western Pyrenees, is the only surviving language in western Europe that predates the arrival of Indo-European languages from the steppe. Korean is often treated as an isolate or as a very small family of its own (the Altaic hypothesis, which grouped it with Turkic, Mongolic, Tungusic, and sometimes Japanese, has been largely abandoned). Zuni in New Mexico and Ainu in Japan have no established relatives either. These languages are windows into lineages of human language that developed separately from their neighbours.

Part B · how language is structured — linguistics fundamentals
the levels of linguistic structure

phonetics and phonology — the sounds of language

phonemes, allophones, and the IPA

A phoneme is the smallest unit of sound that distinguishes meaning in a language. English has roughly 44 phonemes despite only 26 letters, which is one reason English spelling is notoriously unreliable. The International Phonetic Alphabet (IPA), first published in 1888, aims to provide a distinct symbol for each distinctive speech sound found in the world's languages. A key insight: phonemes are not physical sounds but abstract categories. The "p" in "pin" and the "p" in "spin" are acoustically different (the first is aspirated, with a puff of air; the second is not), yet English speakers hear them as the same sound. In Hindi, these two sounds are distinct phonemes: they can distinguish meaning, as in "phal" (fruit) vs "pal" (moment). Each language divides the continuous acoustic space of speech into its own set of categories.
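
The test for phoneme status described above is the minimal pair: two words that differ in exactly one sound and mean different things. The sketch below shows that test on a toy lexicon. The function names are hypothetical and the transcriptions are simplified assumptions made for illustration, not a reference analysis of Hindi.

```python
from itertools import combinations


def is_minimal_pair(a, b):
    """True if two sound sequences have equal length and differ in exactly one slot."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def minimal_pairs(lexicon):
    """Return every pair of words in the lexicon that forms a minimal pair."""
    return [
        (w1, w2)
        for (w1, s1), (w2, s2) in combinations(lexicon.items(), 2)
        if is_minimal_pair(s1, s2)
    ]


# Simplified, assumed transcriptions: aspirated "pʰ" and plain "p" are separate symbols.
HINDI = {
    "phal": ("pʰ", "ə", "l"),  # fruit
    "pal": ("p", "ə", "l"),    # moment
}

if __name__ == "__main__":
    print(minimal_pairs(HINDI))
```

Because "phal" and "pal" differ only in aspiration and have different meanings, the pair counts as evidence that Hindi treats the two sounds as separate phonemes. English "pin" and "spin" cannot form such a pair, since the two "p" sounds never contrast in the same position.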

tonal languages — more common than you think

Tonal languages: ~70% of world's languages
Non-tonal languages: ~30%

Mandarin uses 4 tones; Cantonese has 6; Vietnamese has 6; Hmong has 8. English and most European languages are non-tonal. Tone languages are concentrated in sub-Saharan Africa, Southeast Asia, and Mesoamerica.

morphology — building words from pieces

morphological types — a comparison scale

Scale: Analytic (no inflection) · Synthetic (moderate) · Polysynthetic (extreme)
Languages, ordered from most analytic to most polysynthetic: Mandarin, English, French, Russian, Turkish, Inuktitut

Turkish "evlerinizden" (from your houses) packs four or five morphemes, depending on the analysis, into one word: ev-ler-iniz-den (house, plural, your, from). The Inuktitut word "ᑕᑯᒃᓴᐅᓂᕐᒧᑦ" can encode what English needs an entire sentence to express.

the Inuktitut example — a case study in polysynthesis

A famous example from Yupik (related to Inuktitut): "Tuntussuqatarniksaitengqiggtuq" means "He had not yet said again that he was going to hunt reindeer." That is a single grammatical word composed of many morphemes, each adding a layer of meaning. This is not just "packing more in": it reflects a different logic of how meaning is organised. No approach is cognitively superior; the types are equally expressive but differently structured. Languages also tend to drift between types over centuries: Old English (roughly 450 to 1150 CE) was far more inflected than the English of today, which has shed most of its case endings.

syntax — word order and sentence structure

basic word order types across world's languages

SOV (Subject-Object-Verb): ~45% of languages
SVO (Subject-Verb-Object): ~42%
VSO: ~9%
VOS / OVS / OSV: ~4%

English and Mandarin are SVO ("She eats rice"). Japanese, Turkish, and Korean are SOV ("She rice eats"). Classical Arabic and Welsh are VSO ("Eats she rice"). Free word-order languages (Latin, Russian, Hungarian) use case endings rather than position to signal grammatical roles.

Noam Chomsky's proposal, developed from "Syntactic Structures" (1957) onward, that humans have an innate "Universal Grammar" (a set of syntactic principles hard-wired in the brain) transformed linguistics. The most debated piece of evidence is the claim that all languages allow recursion, the ability to embed clauses within clauses indefinitely ("The man who knew the woman who found the cat that ate the mouse..."). Critics including Daniel Everett counter that the Amazonian language Pirahã lacks recursion, a claim that remains fiercely contested.
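
Recursion in the sense used above means that a rule for building a phrase can call itself. The sketch below illustrates this with a hypothetical mini-grammar, in which a noun phrase is either a bare noun or a noun followed by a linker and another noun phrase. The data and function name are assumptions made for illustration; it is not a model of English syntax.

```python
# Hypothetical mini-grammar: NP -> noun | noun + linker + NP
CHAIN = [
    ("the man", "who knew"),
    ("the woman", "who found"),
    ("the cat", "that ate"),
    ("the mouse", None),
]


def noun_phrase(chain, start=0):
    """Build a noun phrase by recursively embedding the rest of the chain."""
    noun, linker = chain[start]
    is_last = start == len(chain) - 1
    if linker is None or is_last:
        return noun  # base case: a bare noun ends the recursion
    # Recursive case: the relative clause contains another full noun phrase.
    return f"{noun} {linker} {noun_phrase(chain, start + 1)}"


if __name__ == "__main__":
    for depth in range(1, len(CHAIN) + 1):
        print(noun_phrase(CHAIN[:depth]))
```

Nothing in the rule limits how long the chain can be. That open-endedness is the property at stake in the debate over whether every language has recursion.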
semantics and pragmatics — what we mean vs what we say

Grice's cooperative principle and its four maxims

The philosopher Paul Grice argued in 1975 that conversation works because speakers tacitly agree to be cooperative — and that we infer meaning from apparent violations of this agreement. His four maxims: be truthful (Quality), say as much but not more than needed (Quantity), be relevant (Relation), and be clear (Manner). When someone says "Can you pass the salt?" they are technically asking a yes/no question about physical ability. But because we understand that a cooperative speaker would not ask something so trivially answerable, we infer the real intent: please pass the salt. This gap between sentence meaning and speaker meaning — what Grice called "conversational implicature" — explains most of how language actually functions in social life.

writing systems — from pictograms to alphabets

Part C · how we acquire and use language
first language acquisition — the developmental timeline

milestones in first language acquisition

These timelines are approximate norms; there is enormous individual variation. What is remarkable is that every neurotypical child, regardless of intelligence or parental coaching, passes through the same stages in the same order. Children cannot be taught to skip stages.

the nativist vs. empiricist debate

Chomsky's nativism holds that children are born with an innate language acquisition device (LAD) containing the principles of Universal Grammar, which would explain why acquisition is so fast, uniform, and robust to impoverished input (the "poverty of the stimulus" argument). The behaviourist alternative (Skinner's "Verbal Behavior," 1957) held that language is learned through reinforcement and imitation. Chomsky's 1959 review of Skinner is one of the most cited reviews in the field and did much to discredit the behaviourist account. The modern debate is more nuanced: statistical learning and input frequency clearly matter, but so does something innately linguistic in the human brain. Some of the most striking evidence cited for the nativist position is the emergence of new sign languages: when deaf children in Nicaragua were brought together for the first time in the 1980s, they spontaneously created a new, fully grammatical sign language (Nicaraguan Sign Language) without any adult model.

second language acquisition — the age effect

critical period hypothesis — explore the effect of age on ultimate attainment

sociolinguistics — language as social identity

dialect vs accent vs register

An accent is a difference in pronunciation only. A dialect involves differences in vocabulary and grammar as well as pronunciation. There is no linguistic basis for calling one variety "the" language and another a dialect — the old joke is that "a language is a dialect with an army and a navy." Mandarin and Cantonese are officially dialects of Chinese but are mutually unintelligible; Serbian and Croatian are officially different languages but are largely mutually intelligible. The distinction is political, not linguistic. Register is the style variation within a single speaker's repertoire: the same person uses different vocabulary, sentence complexity, and politeness markers when talking to their employer versus their friends — code-switching not between languages but between modes of the same language.

William Labov and the social stratification of language

William Labov's 1963 study of Martha's Vineyard showed that islanders who strongly identified with their community centralised the first element of the diphthongs in words like "right" and "house", an unconscious phonological marker of local identity. His 1966 New York City study found that the pronunciation of post-vocalic "r" (the r in "car" or "fourth") correlated with social class and, crucially, with social aspiration: in formal speech, lower-middle-class speakers hypercorrected, using the prestige form even more often than upper-middle-class speakers because they associated it with upward mobility. Language can encode social meaning down to the level of a single phoneme.

how languages change — and die

mechanisms of language change

Languages change constantly through four main mechanisms. Sound change is regular and systematic: Grimm's Law (c. 500 BCE) shifted the Proto-Indo-European stops in Germanic, so that PIE "p" regularly became "f", which is why Latin "pater" corresponds to English "father." Semantic shift changes word meanings: "nice" once meant foolish (from Latin "nescius," ignorant); "awful" meant inspiring awe; "bimbo" was a common term for a male idiot in 1920s American slang. Borrowing is a major source of new vocabulary: English borrowed wholesale from French (after 1066) and Latin (Renaissance), and it continues to borrow from languages around the world. Grammaticalisation turns content words into function words: the English future auxiliary "will" derives from Old English "willan" (to want).

A language is considered "dead" when it has no more native speakers. It is "extinct" when it is no longer spoken or known at all. Latin is dead but not extinct; Cornish died in 1777 (with the death of Dolly Pentreath) and was revived in the 20th century. About 10 languages die every year, most of them undocumented — taking with them irreplaceable knowledge about human cognition, local ecology, and the range of what language can be.
Part D · language and thought — the big questions
does language shape thought? — the whorfian hypothesis

strong vs weak Whorfianism

Benjamin Lee Whorf, a fire prevention engineer turned amateur linguist, proposed that the language you speak determines the thoughts you can have: what you cannot name, you cannot think. This strong version is almost certainly false. Thought is not identical with language, as shown by deaf people who think without spoken language, by mathematicians who think in notation, and by the fact that new words can be coined for previously unnamed concepts (which you could presumably already think). The weak version, that language influences (but does not determine) certain kinds of cognition, is supported by good evidence. Russian speakers, who have separate basic words for light blue ("goluboy") and dark blue ("siniy"), are measurably faster at telling two shades apart when the shades fall on opposite sides of that boundary than when both fall within the same category. The effect is real but modest: it biases perception, it does not imprison thought.

the Pirahã challenge — a language without numbers or recursion?

Daniel Everett's decades-long fieldwork with the Pirahã people of Brazil produced startling claims: that Pirahã has no exact number words (only terms meaning roughly "a few" and "many"), no basic colour terms, no creation myths, and possibly no recursion in its syntax. By Everett's account, Pirahã adults did not learn to count even after eight months of training. If this is right (and many linguists dispute Everett's analyses), it suggests that the content of a culture's language and the cognitive capabilities emphasised in it co-evolve. Lera Boroditsky's cross-linguistic studies offer cleaner experimental evidence: Kuuk Thaayorre speakers, who use absolute compass directions rather than relative terms like left/right, show remarkably accurate spatial orientation and can usually point to north even indoors or in unfamiliar surroundings.

metaphor and conceptual structure

Lakoff and Johnson's conceptual metaphor theory

In "Metaphors We Live By" (1980), George Lakoff and Mark Johnson argued that metaphor is not a poetic decoration but the basic structure of conceptual thought. We do not merely speak of arguments using war metaphors — we actually conceptualise arguments as war: we attack positions, demolish arguments, shoot down ideas, defend our views. An alternative conceptual metaphor — ARGUMENT IS A DANCE — would produce entirely different linguistic and cognitive responses: we would seek harmony, find satisfying moves, create something together. Political language exploits conceptual metaphors systematically. "Tax relief" frames taxation as an affliction from which citizens need rescuing. "Illegal alien" frames undocumented immigrants as threats from outside. The frame shapes what policy options seem natural.

language and power — framing, euphemism, and control

Orwell's diagnosis — and its limits

George Orwell's 1946 essay "Politics and the English Language" argued that political language is designed to make lies sound truthful and murder respectable. His examples remain instructive: "pacification" for bombing villages; "transfer of population" for mass forced displacement. The appendix to 1984 describes Newspeak — a language engineered to make dissent literally unthinkable by eliminating the vocabulary required to formulate it. The insight is real but partly overstated: people subjected to Newspeak-like language control (Soviet Russia, Maoist China) managed to think and communicate resistance anyway, through irony, private language, and subtext. Language constrains thought at the edges; it does not fully determine it.

untranslatable words — explorer

Part E · major world languages — a guided tour
major world languages — explorer

the history of english — a period chart

major periods and their defining influences

English has borrowed from over 350 languages. Its vocabulary is roughly 29% French, 29% Latin, 26% Germanic, and 6% Greek, with the remaining 10% or so drawn from other languages and proper names. This hybrid make-up helps explain both its richness and its orthographic irregularity.

Arabic diglossia — one language, two registers

Arabic is the clearest living example of diglossia: the coexistence of two varieties of the same language used in different social contexts. Modern Standard Arabic (MSA/Fusha), derived directly from Classical Quranic Arabic, is used in newspapers, formal speeches, legal documents, and education across all 22 Arab-majority countries. But nobody grows up speaking MSA at home. Every Arab child learns a regional dialect as their first language — Egyptian, Moroccan Darija, Gulf Arabic, Levantine — which differ from MSA and from each other as much as Spanish differs from Portuguese. An educated Egyptian and an educated Moroccan can communicate in MSA but may struggle to follow each other's everyday dialect speech. The Quran's status means Classical Arabic has been uniquely preserved for 1,400 years — it is still largely comprehensible to modern Arabic speakers in a way that Chaucer's English is not to modern English speakers.

Part F · language in the digital age
machine translation — from Cold War project to neural networks

evolution of machine translation

what machine translation still fails at

Google Translate and DeepL produce impressive output for high-resource language pairs (English-French, English-Spanish) but still fail in characteristic ways. Idiom and metaphor are handled by pattern-matching, not understanding — "kick the bucket" may or may not be correctly translated depending on whether the idiom appeared frequently in training data. Pragmatic register — the difference between a formal and an informal tone in Japanese honorifics — is often flattened. Low-resource languages (Yoruba, Navajo, Tibetan) translate poorly because there is little training data. Ambiguity resolution requiring world knowledge is still problematic: "I saw the man with the telescope" has two readings that require context to disambiguate.

LLMs and language — what AI "knows" about language

word embeddings and distributional semantics

Modern language AI builds on the distributional hypothesis, articulated by linguist John Firth in 1957: "You shall know a word by the company it keeps." Word2Vec (2013) and its successors represent words as vectors in high-dimensional space, where semantically similar words cluster near each other. In this geometry, the vector for "king" minus "man" plus "woman" is approximately the vector for "queen." Large language models extend this to context-sensitive representations — the same word "bank" has different vector representations in "river bank" and "bank account." The crucial question debated since Emily Bender and Timnit Gebru's 2021 "stochastic parrots" paper is whether this distributional knowledge amounts to anything like understanding, or whether it is extraordinarily sophisticated pattern matching without grounding in the world.
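
The "king minus man plus woman" arithmetic can be reproduced on a toy scale. The sketch below uses hand-made three-dimensional vectors invented purely for illustration; they are not output from Word2Vec or any trained model, and the function names are hypothetical. Real embeddings have hundreds of dimensions learned from text, and the analogy holds only approximately.

```python
import math

# Hand-made toy vectors (an assumption for illustration, not trained embeddings).
# The three dimensions loosely stand for royalty, maleness, and femaleness.
VECTORS = {
    "king": (0.9, 0.9, 0.1),
    "queen": (0.9, 0.1, 0.9),
    "man": (0.1, 0.9, 0.1),
    "woman": (0.1, 0.1, 0.9),
    "apple": (0.0, 0.2, 0.2),
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, and c."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


if __name__ == "__main__":
    print(analogy("king", "man", "woman", VECTORS))
```

Excluding the three input words from the candidates is a common convention in analogy tests. Without it, the nearest neighbour is often one of the inputs.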

language loss — a projection calculator

how long before a language community reaches crisis level?

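Given a language community's current speaker count and annual decline rate, you can estimate when it will fall to 50 speakers. The figure of 50 is used here as an illustrative crisis threshold; UNESCO's "critically endangered" category is defined mainly by the breakdown of transmission between generations rather than by a fixed headcount.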
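
The projection can be sketched in a few lines. The example below is a minimal sketch that assumes a constant percentage decline each year, so that the speaker count after t years is N × (1 − r)^t, and it solves for the t at which that count reaches the threshold. The function name and the constant-rate model are assumptions made for illustration; real communities rarely decline this smoothly.

```python
import math


def years_until_threshold(speakers, annual_decline_pct, threshold=50):
    """Years until a population shrinking by a fixed percentage reaches the threshold.

    Assumed model: speakers * (1 - r) ** t = threshold, solved for t.
    """
    if speakers <= threshold:
        return 0.0  # already at or below the threshold
    if not 0 < annual_decline_pct < 100:
        raise ValueError("annual_decline_pct must be between 0 and 100 (exclusive)")
    rate = annual_decline_pct / 100
    return math.log(threshold / speakers) / math.log(1 - rate)


if __name__ == "__main__":
    # Hypothetical community: 10,000 speakers declining by 5% a year.
    print(round(years_until_threshold(10_000, 5), 1))
```

For that hypothetical community the estimate comes out at roughly 103 years. Doubling the decline rate roughly halves the time, which is why small changes in transmission between generations matter so much.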

the Hebrew miracle — the only successful language revitalisation

Hebrew is the only language in history to be revived from a purely liturgical state to a full first language of a nation. By 1880, Hebrew had not been a spoken vernacular for approximately 1,800 years; it existed only in religious texts and scholarly correspondence. Eliezer Ben-Yehuda, who immigrated to Palestine in 1881, decided that a Jewish state required a Jewish language and raised his son Ben-Zion as the first native Hebrew speaker in millennia. Modern Hebrew required over 30,000 new words for concepts that did not exist in Biblical Hebrew. Today, around 9 million people speak it, a majority of them as a first language. Linguists consider this achievement unrepeated and perhaps unrepeatable: it required a unique combination of ideological motivation, a concentrated immigrant population, and deliberate institutional support that no other revitalisation effort has fully replicated.

Welsh and Irish — partial success stories

Welsh is the most successful modern language revitalisation still underway. From a low of just under 20% of the population of Wales speaking Welsh at the 1991 census, active government policy (bilingual education, S4C Welsh television launched in 1982, Welsh Language Acts, and a cultural renaissance) has worked to stabilise the language. Household survey estimates now put speakers at about 29%, although the 2021 census recorded a lower figure of about 18%. Welsh is now required in all schools to age 16. Irish is a more cautionary example: despite being a co-official language of Ireland since 1922, it has continued to decline as a community language, with only about 73,000 people speaking it daily outside the education system, even though 40% of the population claim some knowledge. The difference is that Welsh-medium education has created new speakers in a way that Irish policy largely has not.

share of endangered languages by world region

The Americas and Pacific together account for over half of all endangered languages despite relatively small total speaker populations. Colonial language imposition in the Americas and island isolation in the Pacific both contribute to this concentration.

Part G · Q&A

If bees can communicate location and distance with their waggle dance, what does human language have that bee dances don't?

Bee dances are restricted to a single topic: the location of food. They cannot express doubt about the distance, ask another bee a question, discuss abstract concepts like loyalty or weather forecasting, or generate new message types that the species has never used before. Human language has all three of Hockett's critical features: displacement (talking about things not present), productivity (unlimited new sentences), and duality of patterning (sound units combining into meaning units combining into sentence units). The bee dance has a crude version of displacement, since it points to food that is out of sight, but no productivity and no duality. The gap is not one of degree but of kind.

How do we know Proto-Indo-European existed if nobody ever wrote it down?

The comparative method works by identifying regular sound correspondences across related languages. If Latin consistently uses "p" where Germanic uses "f" (pater/father, piscis/fish, pes/foot), that regularity is not coincidence — it is a systematic sound change. By working backwards from dozens of documented languages following predictable rules, linguists can reconstruct probable ancestor forms. The 1786 observation by William Jones that Sanskrit, Latin, and Greek shared "some common source which, perhaps, no longer exists" launched this project. Modern computational phylogenetics, treating word forms like gene sequences, can even estimate branching dates to within a few centuries.
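
The first step of the comparative method, tallying which sounds line up across likely cognates, can be sketched as follows. This is a deliberately simplified illustration: it compares spelling rather than phonetic transcription, looks only at the first letter, and uses just the three word pairs mentioned above. The function names are hypothetical.

```python
from collections import Counter

# Latin and English word pairs taken from the examples in the text.
COGNATES = [("pater", "father"), ("piscis", "fish"), ("pes", "foot")]


def initial_correspondences(pairs):
    """Count how often each pair of initial letters lines up across the word pairs."""
    return Counter((a[0], b[0]) for a, b in pairs)


def is_regular(pairs, source, target):
    """True if every word starting with `source` is matched by one starting with `target`."""
    relevant = [(a, b) for a, b in pairs if a.startswith(source)]
    return bool(relevant) and all(b.startswith(target) for _, b in relevant)


if __name__ == "__main__":
    print(initial_correspondences(COGNATES))
    print(is_regular(COGNATES, "p", "f"))
```

A real analysis would use hundreds of items in phonetic transcription and would treat a correspondence as meaningful only when it recurs far more often than chance resemblance predicts.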

Why doesn't English spelling match its pronunciation — and is this a flaw?

English spelling is essentially a phonological fossil. It largely reflects how the language was pronounced in the late Middle English period, before the Great Vowel Shift (roughly 1400-1700) had run its course and dramatically altered the pronunciation of long vowels. "Knight" was once pronounced with all its letters: k-n-ight with a guttural fricative. The spelling is not a flaw so much as a record of history. However, this creates real costs: English-speaking children take about twice as long to reach functional literacy as children learning Italian or Finnish, where spelling-to-sound correspondences are near-perfect. The perennial English spelling reform movement has never succeeded, partly because spelling also carries etymological information: "sign" and "signature" look related because they are, even though "sign" is now pronounced without the g.

Is sign language a "real" language, or is it just manual English?

Sign languages are fully fledged natural languages. They are not codes for spoken languages, and they are not universal. American Sign Language (ASL) and British Sign Language (BSL) are mutually unintelligible despite both being used in English-speaking countries. They have their own phonology (handshape, movement, and location play the role that sounds do), their own morphology, and their own syntax, which in ASL is typically topic-comment rather than SVO. William Stokoe demonstrated ASL's linguistic status in 1960, to considerable initial resistance from the deaf community itself, which had internalised the prejudice that sign was "not really" language. Crucially, children born to deaf signing parents acquire sign language on the same developmental timeline as hearing children acquire spoken language, and the critical period hypothesis applies in the same way.

Is there really a critical period for language, and what happens after it closes?

Yes, but the critical period is more of a gradient than a cliff. Native-like phonological attainment drops sharply for learners who begin after about age 8; grammatical attainment declines more gradually through the teens; lexical acquisition remains relatively robust into adulthood. The most dramatic evidence comes from feral children and late learners of sign language: Genie, isolated until age 13 with no language input, never acquired full grammar despite intensive intervention. Adults can become highly proficient in a second language (Joseph Conrad, a native Polish speaker, wrote novels in English that rank among the classics of the language), but fully native-like pronunciation is extremely rare for learners who start after puberty. Brain imaging shows that late L2 learners activate different regions than early learners for grammatical processing, suggesting the critical period reflects neurological constraints.

If languages are constantly changing, how does anyone understand each other across generations?

Because change is gradual — fast enough to accumulate over centuries but slow enough that adjacent generations remain mutually intelligible. Modern English speakers can read Shakespeare (1600) with some effort, struggle with Chaucer's Middle English (1390), and find Old English (Beowulf, 1000 CE) completely foreign without special training — yet it is genetically the same language. Writing also acts as a conservative brake on change: written standard forms, school systems, and mass media slow but do not stop the drift. The key insight is that language change is not decay — there is no such thing as a language deteriorating linguistically. Every generation innovates in vocabulary, phonology, and grammar, and those innovations are neither improvements nor corruptions.

If you think in language, is bilingualism just having two separate thought systems?

Not quite — bilinguals do not maintain two fully separate cognitive systems. Research using brain imaging shows that the two languages share neural substrate, especially when learned early, though they do activate different patterns. More interestingly, bilinguals report that they feel different personalities or emotional registers in each language: the language of childhood often carries stronger emotional charge, while a second language can feel emotionally cooler and more analytical. Bilinguals also show measurable advantages on tasks requiring selective attention and the suppression of irrelevant information (the "bilingual advantage"), though this effect has been contested by large-scale replications and appears more modest than early claims suggested. What bilinguals certainly have is a richer metalinguistic awareness — a stronger sense that language choices are choices.

Are some languages more suited to science or poetry than others?

No — this is one of the most persistent myths in language study, often used to justify linguistic hierarchies. Every natural language is equally capable of expressing any human thought; the apparent limitations are about vocabulary and register, which can be developed. Scientific Arabic was the most sophisticated technical vocabulary on Earth in the 10th century; Classical Chinese carried millennia of philosophical complexity; Sanskrit's phonological and grammatical analysis tradition (Panini, c. 4th century BCE) was not surpassed in depth until the 20th century. What matters for science or poetry is the community of speakers, the tradition of discourse, and the available vocabulary — all of which can be built in any language. The sense that English is "naturally" suited to science is survivorship bias.

How did English become the world's lingua franca — and could it be displaced?

English became a global lingua franca through a combination of British colonial reach (which seeded it across five continents) and 20th-century American economic and cultural dominance. The decisive moment was probably World War II, after which American economic primacy, the Marshall Plan, Hollywood, and eventually the internet created overwhelming incentives to learn English. No language has ever achieved comparable global reach. Could it be displaced? The historical precedents (Latin, French as the diplomatic lingua franca, Classical Chinese as the written lingua franca of East Asia) suggest that it could eventually, but the network effects of English are now so strong that any displacement is likely centuries away. Mandarin is the most plausible long-run challenger but faces the barrier of tonal phonology and character writing, which are harder for most foreign learners.

Is Chinese really one language, or is that a political fiction?

It is largely a political category, though a defensible one. "Chinese" encompasses around 7-14 major varieties — Mandarin, Cantonese, Hokkien, Shanghainese (Wu), Hakka, and others — that are mutually unintelligible in speech. They share a common writing system, which is partly why Chinese governments and traditions have classified them as dialects of one language. Putonghua (Standard Mandarin, based on the Beijing dialect) has been promoted since 1955 and is now spoken by about 80% of the mainland population. Cantonese has roughly 80 million native speakers and is the dominant language in Hong Kong and large diaspora communities worldwide. The line between "language" and "dialect" is genuinely blurry here — as it always is in politics.

Do large language models actually understand language, or are they "just" predicting the next word?

This is the hardest question in contemporary AI and linguistics. LLMs are trained to predict the next token — but critics who say "just predicting" underestimate what that requires. To predict language reliably, a model must learn the distribution of facts about the world encoded in text, the pragmatic conventions of conversation, and something like logical inference. LLMs pass tasks that behaviourists would have called evidence of understanding: analogy, novel grammatical generalisation, common-sense reasoning in text form. The strongest objection — from Bender, Gebru, and others — is that language models have no grounding: they manipulate symbols without connections to perception, action, or the world those symbols refer to. A child learns that "hot" refers to a painful sensation by touching things; an LLM learns it from the statistical context of other words. Whether grounding is necessary for genuine understanding is the question that the "stochastic parrots" debate crystallised but has not settled.

What do we actually lose when a language disappears?

Several things of real consequence. First, linguistic diversity is a repository of solutions to the problem of describing the world: Amazonian languages encode hundreds of plant and animal distinctions that have no names in European languages; Australian Aboriginal languages encode spatial information that has proven scientifically useful in biology and navigation research. Second, each language is a window into a different cognitive possibility — a different way of organising time, space, and social relation that enriches our understanding of what language and thought can be. Third, language death typically accompanies cultural death, meaning the loss of oral literature, historical memory, and traditional ecological knowledge that was never written down. The linguist Ken Hale called the loss of a language like losing a Louvre. The analogy is imperfect but its urgency is not.

Will machine translation make learning foreign languages pointless?

Almost certainly not, for several reasons. Translation, even perfect translation, is fundamentally reactive — it mediates a conversation rather than enabling spontaneous participation in it. People who speak a language can joke, take risks, misunderstand and repair in real time, read body language in sync with words, and build the kind of trust that only direct communication creates. Beyond utility, learning a language is learning a culture and a way of dividing experience: the difference between vous and tu in French encodes a theory of social distance that translation cannot carry over. What machine translation will likely do is reduce the economic incentive for the purely instrumental learning of a language — but it will leave intact the deeper cultural and cognitive rewards of genuine fluency. The analogy is GPS and navigation: GPS eliminated the need to memorise routes, but it did not make understanding geography pointless.