There's a lot going on with lang, the global attribute that defines the language an element contains. Information about it is pretty scattered, so I've collected here what I've found.

Values for language codes

lang doesn't have a real spec for its attribute values. There's a bunch of RFCs, an IANA registry, and other miscellaneous standards. The most definitive one is BCP (Best Current Practice) 47.

These values are officially known as language tags, and they're made of hyphen-separated pieces called language subtags. Those say how the language is written, what dialect it is, and other distinctions:

  1. Language subtag — the base human language. Always required.
  2. Script subtag — the writing system used.
  3. Region subtag — the language as spoken in a certain region.
  4. Variant subtag — special versions, like specific time periods or cultural differences.
  5. Extension subtag — a way for third-parties to tag their own data.

You won't need most of them. The vast majority of use is one or two subtags — three at worst. For example:

  • en for English
  • pt-BR for Brazilian Portuguese
  • zh-Hans for Simplified Chinese
  • es-Brai-ES for Spanish as spoken in Spain, written in Braille

But if you're morbidly curious about how bad it can theoretically get...

en-GB-Cyrl-u-kn-true-x-unproof-t-jp-032-Zxxx-x-matsu
I won't say what this means, partially because I challenge you to figure it out, but mostly because you wouldn't believe me.

Capitalization doesn't matter; language tags are case-insensitive. Informally, subtags are cased differently to be easier to scan.

However, the order of the subtags is important. They become more specific as they progress to the right. Programs process subtags individually, so complex tags still have value if programs only recognize parts of them.
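To make the ordering concrete, here's a rough Python sketch, offered as an illustration only. It splits a tag into pieces purely by shape: 2 or 3 letters for the language, 4 letters for a script, 2 letters or 3 digits for a region, and (simplifying BCP 47's grammar) 5 to 8 characters, or 4 starting with a digit, for variants. The names TAG_SHAPE and split_tag are invented for this example, extensions and private-use parts are lumped together as "rest", and nothing here checks whether a subtag is actually registered.

```python
import re

# Shape-only pattern: it knows the order and lengths of subtags,
# not whether a given subtag exists in the IANA registry.
TAG_SHAPE = re.compile(
    r"""
    ^(?P<language>[a-z]{2,3})
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<rest>(?:-[a-z0-9](?:-[a-z0-9]{1,8})+)*)$
    """,
    re.IGNORECASE | re.VERBOSE,  # tags are case-insensitive
)


def split_tag(tag):
    """Return a dict of subtags, or None if the shape or order is wrong."""
    match = TAG_SHAPE.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # Turn "-nedis-foo" into ["nedis", "foo"]; an empty string becomes [].
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    # Extensions and private use, kept as one undivided string.
    parts["rest"] = parts["rest"].lstrip("-")
    return parts
```

With this sketch, split_tag("sl-Latn-IT-nedis") finds all four pieces, while split_tag("en-GB-Cyrl") returns None, because a script subtag can't come after a region.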

That's the structure of a language tag, but how do you get the individual pieces?

Subtag values

All subtags are listed in IANA's Language Subtag Registry, from which all things lang flow.

However, it's a big ol' .txt file, which is hard to read and harder to search. That's why there's an unofficial explorer and validator. It's as legit as unofficial gets, though; it's maintained by Richard Ishida, a W3C Internationalization member, and it searches the official .txt file.

I highly recommend the unofficial tool, because it validates language tags with friendly suggestions and links to more info.

Language subtags

Yeah, a "language tag" and a "language subtag" are two different things. Delightful, right?

There used to be a complicated system where some languages had 2-letter codes, and others had 3, and some had both, and still others were assembled from a language subtag and an "extlang subtag." That was all rightly decided to be terrible and error-prone and thus promptly washed away.

Today, all languages have 2- or 3-letter codes, with the same process for both: look it up, and it doesn't matter how many letters it has. Some examples:

  • ja for Japanese
  • yo for Yoruba
  • bdz for Badeshi
  • ids for Idesa

Conventionally, language subtags are lowercase.

Script subtags

Script subtags, also known as "writing-system" subtags, say how the language is written. This is critically important for languages with multiple alphabets, like Japanese.

They're always 4 letters long, and typically in Sentence Case. Some examples:

  • ru-Cyrl is Russian written in the Cyrillic alphabet
  • en-Brai is English written in Braille
  • ja-Kana is Japanese written in Katakana

You'll probably know if you need a script subtag with a language you speak. But if you're unfamiliar with the language, or unsure, the Language Registry mentions redundant language/script combinations.

For example, fr-Latn means French written with the Latin alphabet. That's nothing special, because it's the overwhelmingly common way to write French. If you look up fr in the registry, it says this:

Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn

The "Suppress-Script" field holds the script subtag that can be safely assumed for that language. So a script subtag of Latn on a language subtag of fr is unnecessary.
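As a sketch of how a tool might act on that field, the Python below parses a record like the one above and drops a script subtag that the record says is redundant. It's a simplification under stated assumptions: parse_record and drop_redundant_script are hypothetical names, and real registry records can repeat fields or wrap long lines, which this ignores.

```python
# The fr record from the registry, as quoted above.
RECORD = """\
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
"""


def parse_record(text):
    """Turn 'Key: value' lines into a dict (one value per key only)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return fields


def drop_redundant_script(tag, record):
    """Remove the script subtag if it matches the record's Suppress-Script."""
    subtags = tag.split("-")
    suppress = record.get("Suppress-Script", "")
    if (
        len(subtags) > 1
        and suppress
        and subtags[0].lower() == record.get("Subtag", "").lower()
        and subtags[1].lower() == suppress.lower()
    ):
        del subtags[1]
    return "-".join(subtags)
```

Given the fr record, drop_redundant_script("fr-Latn-CA", parse_record(RECORD)) gives "fr-CA", while "fr-Brai" is left alone because Braille isn't the assumed script.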

Region subtags

Region subtags come after the script subtag, but you can omit the script subtag if you don't need it. fr-Latn-CA and fr-CA are equivalent.

They're for particular regional dialects of a language. They're either 2-letter country codes in UPPERCASE, or 3-digit region codes. The numeric region codes are for multiple countries, parts of countries, and other geographical mishmashes that defy nice clean categorization.

The canonical example is Spain's variety of Spanish (es-ES) vs. Central American (es-013). Some other examples:

  • en-GB for English as spoken in Great Britain
  • nl-014 for Dutch as spoken in eastern Africa
  • sv-Blis-GL for Swedish as spoken in Greenland, written in Blissymbols

If you don't need to distinguish a region, feel free to leave the region subtag off. For an international subset of Spanish, you can use only es. Similarly, ja-JP doesn't say anything that ja doesn't; one can safely assume Japanese as spoken in Japan. There isn't an official way of knowing what's a necessary language + regional subtag combination, so ask your translator.

When regional variants differ in slang, spelling, or pronunciation, it helps to mark them. Screen readers, spell-checkers, and other software appreciate it.

Variant subtags

You will probably never use these unless you're writing about linguistics. They're mostly about languages during specific time periods; en-emodeng refers to Early Modern English (1500–1700). Others are kind of a grab bag.

Variant subtags follow language, script, and region subtags, like sl-Latn-IT-nedis. They vary in length, and are mostly lowercase.

Most are unique to one language subtag. Check the validator for what certain variants need as a "prefix." For example, Boontling's prefix is always en, because it's an English variant.
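The casing conventions for language, script, region, and variant subtags can be reproduced without consulting the registry. The Python sketch below follows the rule of thumb given in RFC 5646 (part of BCP 47): everything is lowercase, except that 2-letter subtags become UPPERCASE and 4-letter subtags get an initial capital, unless they start the tag or come after a single-letter subtag. The function name conventional_case is invented here, and treating everything after the first singleton as lowercase is this sketch's reading of that rule.

```python
def conventional_case(tag):
    """Re-case a language tag the way humans usually write it."""
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index > 0 and not after_singleton and len(subtag) == 2:
            result.append(subtag.upper())        # region, e.g. "CA"
        elif index > 0 and not after_singleton and len(subtag) == 4:
            result.append(subtag.capitalize())   # script, e.g. "Latn"
        else:
            result.append(subtag.lower())        # language, variants, the rest
        if len(subtag) == 1:
            # Extension or private-use territory: lowercase from here on.
            after_singleton = True
    return "-".join(result)
```

Since tags are case-insensitive anyway, this only matters for readability: conventional_case("ZH-hans-cn") comes back as "zh-Hans-CN".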

Extension subtags

When official subtags aren't enough, you can extend language tags. These extension subtags are marked with a "singleton," which is a letter surrounded by hyphens. Like -x-.

Unicode locale extensions

Unicode has very granular language classifications. Language tags can reuse them with the -u- extension.

The official word on these Unicode extensions is RFC 6067, with more information at the Unicode Common Locale Data Repository.

I won't break this down because it's not part of BCP 47, and it's practically its own minilanguage. If you need them, you'll know; JavaScript's internationalization functions accept them, for example.

Transformed content extensions

This extension indicates text that has been translated/transliterated. You put the end-product language tag first, and then attach the "original" language tag after a -t-. Specifics in RFC 6497.

Private use extensions

If all else fails, anyone can extend language tags. Private use extensions use -x- and are ideally 5–8 letters long, but they're already private so whatever.

The W3C doesn't recommend -x-:

Because these subtags are only meaningful within private agreements and cannot be used interoperably across the Web, they should be used with great care, and avoided whenever possible.

But since programs ignore subtags they don't understand, stapling -x-butter or whatever to the end won't mess anything up. It's just weird and not useful to anyone. Not that that's ever happened.
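One way software ends up "ignoring" trailing subtags is the lookup scheme described in RFC 4647: keep chopping subtags off the right until something matches. Here's a hedged Python sketch of that idea. fallback_chain and best_match are names made up for this example, and real implementations have more rules than this.

```python
def fallback_chain(tag):
    """List a tag and its progressively shorter fallbacks."""
    subtags = tag.lower().split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A trailing singleton like "x" means nothing by itself; drop it too.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def best_match(tag, supported):
    """Return the most specific supported tag for `tag`, or None."""
    available = {s.lower(): s for s in supported}
    for candidate in fallback_chain(tag):
        if candidate in available:
            return available[candidate]
    return None
```

So a site that only has "fr", "en-GB", and "en" translations would serve "en-GB" to a request for en-GB-x-butter; the private-use part simply falls off.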

Exceptions and quirks

With such a messy subject as human language, no classifications fit perfectly. Here are some edge cases where the right thing may not be clear.

Screwing up the attribute

Let's say you don't indicate languages with lang, or use the wrong language tag. This isn't ideal, but nothing will blow up. User-generated content almost never bothers with lang.

However, it will mess up screen-readers/voice browsers/other assistive technologies, confuse translation software (automated or helpers for human translators), produce weird font choices, and cause other boogums. WCAG 2.0 requires the right lang for compliance.

Unknown languages

If you don't know a document's language, then don't use the lang attribute at all. (Or a Content-Language header.) But what if you know its general language, but not some parts, or if you make an interface in a known language, but can't predict the language of user-generated content? Say, a forum.

To mark an element's language as unknown, give the lang attribute an empty value: lang="". It's like how alt="" means an <img> is decorative. Don't write the attribute name by itself, like <span lang>. You can also do <html lang=""> to override incorrect guesses by Content-Language.

Annoyingly, XML 1.0 forbids the empty value for xml:lang. XML 1.1 fixed that, but some XML dialects don't allow XML 1.1 (like SSML). In that case, you can use xml:lang="und".

Non-languages

For text that isn't a human language, there's a special language subtag: zxx.

This would apply for text such as type samples, part numbers, illustrations of binary data, etc.

This lets screen-readers read out the characters individually. It also prevents the CSS hyphens property from inserting hyphens, since an extra hyphen in a serial number could cause problems.

You don't need this on <code> elements; that tag's semantics already mean it's not a human language.

Multiple languages

The special language subtag mul is short for "multiple." It's useless, just as bad as no lang at all. If you have attributes in different languages, like:

  <img src="2r23452.png" alt="A man with a plan" title="Un hombre con un proyecto" lang="mul">

Try instead:

  <span title="Un hombre con un proyecto" lang="es"><img src="2r23452.png" alt="A man with a plan" lang="en"></span>

More verbose, but more accessible and useful.

Nonstandard languages

The special subtag mis is short for "miscellaneous." It's for when you know the language, but a subtag for it doesn't exist. It's not very useful by itself, but you can use additional subtags for what alphabet it uses and the region it's used in.

Unwritten languages

If you specifically want to indicate that content is not written, there is a subtag for that. For example, you could use en-Zxxx to make it clear that an audio recording in English is not written content.

I don't know what this would indicate that the presence of <audio> wouldn't. Pretty weird.

Programming languages

Do not use lang to indicate the computer language an element contains. The lang attribute on <code> would indicate human language within the code, such as variable names, comments, etc.

There isn't an official way of marking up programming languages.

Artificial/fictional languages

Some constructed languages, like Esperanto and Lojban, actually have language tags defined. In that case, great, use it.

For the rest, there's a special language subtag: art, for "artificial." You extend it with -x- to identify the language. So Solresol is art-x-solresol.

Using language tags

That's how to write 'em, but where do we stick 'em? On the Web, there are 3 main places: the server level, the document level, and the element level.

Server-level language semantics

HTTP has a Content-Language header, for the language a served resource uses. It's often wrong.

Its primary use is content negotiation. Browsers send an Accept-Language header with the user's preferred languages. In theory, users configure what they speak with degrees of desire:

Each language-range MAY be given an associated quality value which represents an estimate of the user's preference for the languages specified by that range. The quality value defaults to "q=1". For example,

  Accept-Language: da, en-gb;q=0.8, en;q=0.7

would mean: "I prefer Danish, but will accept British English and other types of English."

Almost nobody does that, except the sorts of users who get very miffed if you don't respect it. For everybody else, the browser guesses from the operating system's language settings.

It's handy for guessing what language to send, but let users switch languages anyway. They could be on somebody else's computer, have realized that a particular translation on your site is bad, or, you know, life attacked.
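For the guessing part, a server has to turn the Accept-Language header into a ranked list first. Here's a minimal Python sketch of that step. The function name parse_accept_language is invented for this example, and it only covers the basics: a missing q counts as 1, a q of 0 means "don't send me this", and ties keep the order the browser sent.

```python
def parse_accept_language(header):
    """Return the header's language ranges, most preferred first."""
    preferences = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        tag = parts[0]
        if not tag:
            continue
        quality = 1.0  # the default when no q is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # unreadable q: treat as unwanted
        if quality > 0:
            # Negative quality so that sorting puts the highest q first;
            # position breaks ties in the browser's original order.
            preferences.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(preferences)]
```

Fed the example header from the quote above, this gives da, then en-gb, then en. The result is still only a starting guess, so a visible language switcher remains a good idea.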

Pages shouldn't rely on the Content-Language header either, because it only works when the web page is transmitted from a properly-configured server. Browsers infer the language from it, but for multilanguage pages, HTML saved to disk, or language-dependent features in CSS, the lang attribute is better. The W3C prefers lang in the markup for such reasons.

Document-level language semantics

In HTML, all you need is the lang attribute. In XML or inline SVG/MathML, it's xml:lang. If you're using XHTML (why) you'll need both. For readability, I'll refer to both attributes as just lang.

The first thing to do is set lang on the root element, usually <html>. This defines the default language for the document, and overrides the Content-Language header.

Don't use this ancient forbidden technique:

  <meta http-equiv="Content-Language" content="zh-Hant">

The HTML5 spec recommends against it. lang on <html> is less code and doesn't risk declaring the language after content (like <title>), so there's no reason to use the legacy method.

A top-level lang is enough for most pages. But you can go deeper.

Element-level language semantics

In multilingual documents, use lang on elements where the language changes. For example, for a bilingual Swahili/Yoruba conversation, you could declare the <html lang> as Yoruba, and mark up the Swahili passages. Or vice-versa.

lang goes on any element, even void ones like <img>. It indicates those elements' human-readable attributes are in that language. As for what attributes are human-readable, the HTML5 spec lists them.

It doesn't mention ARIA attributes, but at least aria-label and aria-description should also be translated.

Believe it or not, it may be best to mark snippets of foreign languages with <i>. HTML5 redefined <i> as:

The i element represents a span of text offset from its surrounding content without conveying any extra emphasis or importance, and for which the conventional typographic presentation is italic text; for example, a taxonomic designation, a technical term, an idiomatic phrase from another language, a thought, or a ship name.

If you don't want italics and the idea of i { font-style: normal } offends you, a <span> is fine too. If the text direction also changes, you can totally slap lang onto <bdo> or <bdi> and have them pull double-duty.

Related attributes

That was a lot of words on the lang attribute, and now there's more? Don't worry, they're easy.

hreflang

lang on <a> only says what language is within its tags. The hreflang attribute is for what the link points at.

Google uses <link rel="alternate" hreflang="whatever"> to indicate site translations. As a result, this is a de-facto standard for doing so. Note the special value x-default:

Finally, the reserved value "x-default" is used for indicating language selectors/redirectors which are not specific to one language or region, e.g. your homepage showing a clickable map of the world.

srclang

Unique to the <track> element, it works like hreflang: the language that the track's src uses. If the <track> has kind="subtitles", you must include srclang.

translate

HTML5 introduced the translate attribute, for proper nouns, example text (like a manual describing interface labels), and whatever else that shouldn't be translated.

For a page on French linguistics, you could mark example French text like this:

  <i class="example" lang="fr" translate="no">sacré bleu</i>

Translating that would be pretty confusing, since the surrounding text describes it as its French form.

Effects and applications

"Yeah, okay, but what does all this semantic nambery-pambery do?" A fine question; why bother? Here's what I've found.

Smarter font display

lang helps browsers make smarter font choices. This can be as subtle as switching to a font with better support for certain ligatures, or as vital as choosing a font necessary to display the language correctly. It's extremely important to do this for CJK (Chinese/Japanese/Korean) languages, because they share some Unicode characters, and need the proper font to render the language correctly. (This is a touchy subject, by the way.)

[Figure: the chosen font changes the appearance of the same character, depending on what language family is defined. Example from the W3C.]

lang also affects typographic details like OpenType features, kerning, ordered list markers, and other bitmabobs. As Web typography advances, more and more subtleties will be introduced: intelligent justification, case-normalization, hyphens, and hanging punctuation, to name a few. Many of them require lang to function.

<q>

Quotation marks differ between languages, so what the browser automatically inserts around the <q> element changes too. It's not the most useful element, but hey.

This also applies to CSS's quotes property, which is useful for stylized blockquotes.

::first-letter

Defining first-letter is tricky: do you include leading punctuation? What about letter combinations, such as AE or IJ? The answers are language-specific, so CSS needs the right lang for this pseudo-element to work correctly.

text-transform

Surprise!
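The surprise is that changing case is language-specific, and text-transform is supposed to respect it. The classic example is Turkish, where a lowercase i uppercases to a dotted İ, and the dotless ı uppercases to a plain I. Python's built-in upper() knows nothing about language, which makes it a handy way to show the difference. The sketch below is an illustration only: upper_for is an invented name, and it handles just the Turkish and Azerbaijani i rule, not every language-specific mapping.

```python
def upper_for(text, lang):
    """Uppercase text, applying the dotted/dotless i rule for tr and az."""
    primary = lang.lower().split("-")[0]
    if primary in ("tr", "az"):
        # i -> İ (U+0130) and ı (U+0131) -> I, before the generic mapping.
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()
```

Without the right language, "istanbul" becomes ISTANBUL; with tr, it becomes İSTANBUL, which is what a Turkish reader expects.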

<input type="number">

Regions differ when punctuating numbers. America uses 1,000.23, and Europe often goes for 1.000,23. It gets hairier.

The lang attribute tells <input type="number">s which punctuation to supply in mobile numeric keyboards, and how to validate. Browser support isn't perfect, but it's a nice bit of removed friction. Do read the linked article if you're serious about form internationalization, because like all things forms, it's complicated and annoying.

I haven't seen <input type="date"> do this in browsers yet, but it's only a matter of time.

The spellcheck attribute

The spellcheck attribute requires a lang to function. How do you proofread an unknown language?

The :lang() CSS pseudo-class

You can use the [attribute] selector for styling elements based on language, but the important difference is that :lang() inherits. You wouldn't want something like [lang|="ar"] * { direction: rtl }, right? Yeck. Not to mention the more advanced matching :lang() performs.
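The inheritance difference can be sketched in a few lines of Python. Assume each element is described by the lang values along its ancestor chain, outermost first, with None wherever the attribute is absent. The names effective_lang and lang_matches are hypothetical, and this covers only the basic prefix match, not the more advanced matching.

```python
def effective_lang(ancestor_langs):
    """The nearest declared lang wins, like inheritance in the document."""
    for value in reversed(ancestor_langs):
        if value is not None:
            return value
    return ""  # nothing declared anywhere: language unknown


def lang_matches(language_range, ancestor_langs):
    """Roughly what :lang(language_range) asks about an element."""
    lang = effective_lang(ancestor_langs).lower()
    wanted = language_range.lower()
    # Exact match, or the range followed by a hyphen ("ar" matches "ar-EG").
    return lang == wanted or lang.startswith(wanted + "-")
```

A <span> with no lang of its own, inside <p lang="ar-EG">, inside <html lang="en">, is described by ["en", "ar-EG", None], and lang_matches("ar", ...) is true for it. An attribute selector on that same <span> would find nothing to match.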

Language-specific styling isn't too common, but use-cases pop up if you look. For example, <strong> normally looks bold, but small bold Chinese body text looks terrible. So something like:

  strong:lang(zh) {
    font-weight: normal;
    background-color: yellow;
  }

...would be more readable.

Possibly the biggest use-case is different fonts for different languages; system fonts vary wildly in how well they handle various alphabets. You could even load separate @font-faces based on language.

Here's how Wikipedia uses it in the real world:

  a:lang(ar), /* Arabic */
  a:lang(kk-arab), /* Kazakh (Arabic script) */
  a:lang(mzn), /* Mazanderani */
  a:lang(ps), /* Pashto */
  a:lang(ur) /* Urdu */ {
    text-decoration: none;
  }

These languages make heavy use of descenders, so underlines make them much harder to read.

The hyphens CSS property

hyphens has the browser break lines more intelligently, using built-in hyphenation dictionaries. It won't work at all without a lang value to inherit.

Text-to-speech

Screen-readers and voice browsers guess if they can't find language information for a page, which can go badly. Imagine a screen-reader mangling Italian with Spanish pronunciation — their robotic voices already have tone issues. lang improves things considerably. This is especially important for mixed languages.

With the Apple Watch, Siri/Alexa's Web answers, and other chatty computers, interfaces reading aloud is becoming more common. Voice browsing benefits considerably from proper language annotations.

Defining the region is also useful. For example, JAWS, VoiceOver, and NVDA use different voices between en-US and en-GB, which clears up differences like color/colour, organize/organise, and the temperature beer should be served at.

Other assistive technology

lang is part of accessibility standards for a reason. Beyond screen-readers, Braille output is also bettered by proper language tags.

Language transformation is far more useful when it can be sure what language it's working on in the first place. And assistive technology is like, 95% language transformation.

Search engines and other web crawlers

Search engines historically don't use language tags, because people on the internet can't be trusted. When you're Google, this isn't a problem. But other crawlers appreciate the hint. (And as mentioned earlier, Google does use hreflang, so...)

Computers so far are really, really bad at figuring out language changes, so they could use the help. For example: loanwords, cross-language homonyms, and borrowed phrases. C'est la vie.

Application manifests

The App Manifests that Google and the W3C are working on include a "lang": property to help with localization, then "default_locale": and a "locale": array for worldwide distribution. All of them use language tags.

SVG's <switch> element

One of the "conditional processing attributes" that <switch> uses is systemLanguage. I've seen it used to change a character's hand gestures, since what's innocuous in some regions can be insulting in others.

