There’s lots going on with lang, the global attribute that defines the language an element contains. Information about it is scattered, so I’ve collected here what I’ve found.

Values for language codes

lang doesn’t have a real spec for its attribute values. There’s a bunch of RFCs, an IANA registry, and other miscellaneous standards. The most definitive one is BCP 47.

These values are officially known as language tags, made of hyphen-separated language subtags. Those say how the language is written, what dialect it is, and other information:

  1. Language subtag — the base language. Required.
  2. Script subtag — the writing system used.
  3. Region subtag — the geographic area, if the dialect is specific to one.
  4. Variant subtag — special versions, like time periods or cultural differences.
  5. Extension subtag — ways for third parties to tag their own data.

You won’t need most of them. The vast majority of tags have 1 or 2 subtags, 3 at worst. Examples:

  • en — English
  • pt-BR — Brazilian Portuguese
  • zh-Hans — Simplified Chinese
  • es-Brai-ES — Spanish as spoken in Spain, written in Braille

But if you’re curious about how bad it could get…

en-GB-Cyrl-u-kn-true-x-unproof-t-jp-032-Zxxx-x-matsu
I won’t say what this means, partially because I challenge you to figure it out, but mostly because you wouldn’t believe me.

Capitalization doesn’t matter: language tags are case-insensitive. Informally, subtags are cased differently to be easier to scan.
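Since case carries no meaning, a program can normalize it however it likes. Here’s a minimal sketch of the informal casing conventions. It assumes a simple tag without extensions, and it guesses each subtag’s role from its length alone. The name conventionalCase is made up for this example.

```typescript
// Hypothetical helper: applies the informal casing conventions to a simple
// tag of the form language[-script][-region][-variant]. No extension handling.
function conventionalCase(tag: string): string {
  return tag
    .split("-")
    .map((subtag, index) => {
      const lower = subtag.toLowerCase();
      if (index === 0) return lower; // language: lowercase
      if (/^[a-z]{4}$/.test(lower)) {
        // script: first letter capitalized
        return lower.charAt(0).toUpperCase() + lower.slice(1);
      }
      if (/^[a-z]{2}$/.test(lower)) return lower.toUpperCase(); // region: UPPERCASE
      return lower; // numeric regions and variants: lowercase
    })
    .join("-");
}

console.log(conventionalCase("ZH-hans")); // zh-Hans
console.log(conventionalCase("PT-br")); // pt-BR
```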

However, the order of the subtags is important. They start general and become more specific as they progress to the right. Programs process subtags individually, so complex tags still have value if programs only recognize parts of them.
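One way a program can get value out of a tag it only partly recognizes is to chop subtags off the right until something matches. The sketch below is loosely modeled on the lookup scheme in RFC 4647. The function names and the list of supported languages are assumptions for illustration, not anybody’s real API.

```typescript
// Hypothetical helper: lists a tag followed by its progressively more general
// forms, dropping one subtag from the right at each step.
function fallbackChain(tag: string): string[] {
  const subtags = tag.split("-");
  const chain: string[] = [];
  while (subtags.length > 0) {
    chain.push(subtags.join("-"));
    subtags.pop();
    // A trailing single-character subtag (like "x" or "u") only introduces
    // an extension, so it is dropped as well.
    while (subtags.length > 0 && subtags[subtags.length - 1].length === 1) {
      subtags.pop();
    }
  }
  return chain;
}

// Returns the first entry of `supported` that matches the tag or one of its
// more general forms, comparing case-insensitively.
function bestSupported(tag: string, supported: string[]): string | undefined {
  const known = supported.map((s) => s.toLowerCase());
  for (const candidate of fallbackChain(tag)) {
    const index = known.indexOf(candidate.toLowerCase());
    if (index !== -1) return supported[index];
  }
  return undefined;
}

console.log(fallbackChain("zh-Hant-TW").join(" > ")); // zh-Hant-TW > zh-Hant > zh
console.log(bestSupported("zh-Hant-TW", ["en", "zh"])); // zh
```

A program that only knows zh still does something sensible with zh-Hant-TW, which is the whole point of ordering subtags from general to specific.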

That’s the structure of a language tag, but how do you get the individual pieces?

Subtag values

All subtags are listed in IANA’s Language Subtag Registry, from which all things lang flow.

However, it's a big ol’ .txt file, which is hard to read and harder to search. That’s why there’s an unofficial explorer and validator. It’s legit, though: maintained by Richard Ishida, a W3C Internationalization member, and it searches the official .txt file.

I highly recommend the unofficial tool, because it validates language tags with friendly suggestions and links to more info.

Language subtags

Yeah, a “language tag” and a “language subtag” are two different things. Delightful, right?

There used to be a complicated system where some languages had 2-letter codes, and others had 3, and some had both, and still others were assembled from a language subtag and an “extlang subtag”. That was terrible and error-prone and thus promptly washed away.

Today, all languages have 2 or 3-letter codes, with the same process for both: look it up. Some examples:

  • ja — Japanese
  • yo — Yoruba
  • bdz — Badeshi
  • ids — Idesa

Conventionally, language subtags are lowercase.

Script subtags

Script subtags, also known as “writing-system” subtags, say how the language is written. This is for languages with multiple alphabets, like Japanese.

They’re always 4 letters long, and typically capitalize the first letter. Examples:

  • ru-Cyrl — Russian written in the Cyrillic alphabet
  • en-Brai — English written in Braille
  • ja-Kana — Japanese written in Katakana

You’ll probably know if you need a script subtag with a language that you speak. But if you’re unfamiliar with the language, the Language Registry mentions redundant language/script combinations.

For example, fr-Latn means French written in the Latin alphabet. That’s nothing special; it’s the overwhelmingly common way to write French. Under fr in the registry, it says:

Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn

The “Suppress-Script” field holds the script subtag safely assumed for that language. That means a script subtag of Latn on a language subtag of fr is unnecessary.
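If you’d rather have a program make that call, here’s a rough sketch that reads a record like the one above and drops a redundant script subtag. It’s simplified: it assumes one value per field and no wrapped lines, which isn’t true of every record in the real registry. Both function names are invented for this example.

```typescript
// Parses one "Field: value" record into an object keyed by field name.
// Simplified: one value per field, no wrapped lines.
function parseRegistryRecord(record: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of record.split("\n")) {
    const separator = line.indexOf(":");
    if (separator === -1) continue; // not a "Field: value" line
    fields[line.slice(0, separator).trim()] = line.slice(separator + 1).trim();
  }
  return fields;
}

// Drops the script subtag when the language's record says it can be assumed.
function dropRedundantScript(tag: string, languageRecord: Record<string, string>): string {
  const language = (languageRecord["Subtag"] || "").toLowerCase();
  const assumedScript = (languageRecord["Suppress-Script"] || "").toLowerCase();
  const subtags = tag.split("-");
  const redundant =
    assumedScript !== "" &&
    subtags.length > 1 &&
    subtags[0].toLowerCase() === language &&
    subtags[1].toLowerCase() === assumedScript;
  if (redundant) subtags.splice(1, 1);
  return subtags.join("-");
}

const frenchRecord = parseRegistryRecord(
  [
    "Type: language",
    "Subtag: fr",
    "Description: French",
    "Added: 2005-10-16",
    "Suppress-Script: Latn",
  ].join("\n")
);

console.log(dropRedundantScript("fr-Latn-CA", frenchRecord)); // fr-CA
console.log(dropRedundantScript("fr-Brai", frenchRecord)); // fr-Brai
```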

Region subtags

Region subtags come after the script subtag, but you can omit the script subtag if you don’t need it: fr-Latn-CA and fr-CA are equivalent.

They’re for particular regional dialects of a language. Either 2-letter country codes in UPPERCASE, or 3-digit region codes. The numeric region codes are for multiple countries, parts of countries, or other geographical mishmashes that defy categorization.

The canonical example is Spain’s variety of Spanish (es-ES) vs. Central American (es-013). Some other examples:

  • en-GB — English as spoken in Great Britain
  • nl-014 — Dutch as spoken in eastern Africa
  • sv-Blis-GL — Swedish as spoken in Greenland, written in Blissymbols

If you don’t need a region, leave the region subtag off. For an international subset of Spanish, use only es. Similarly, ja-JP doesn't say anything that ja doesn’t; one can safely assume Japanese as spoken in Japan. There isn’t an official way of knowing what’s a necessary language/region combination, so ask your translator.

When regional variants differ in slang, spelling, or pronunciation, mark it. Screen readers, spell-checking, and other software appreciate it.

Variant subtags

You will probably never use these unless you write about linguistics. They’re mostly languages during time periods; en-emodeng refers to Early Modern English (1500–1700). The rest are a grab bag.

Variant subtags go after language, script, and region subtags, like sl-Latn-IT-nedis. They vary in length.
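Because each kind of subtag has its own shape and position, a program can tell them apart without consulting the registry. Here’s a sketch of that idea, using the shapes described so far plus my reading of BCP 47’s grammar for variants (5 to 8 letters or digits, or 4 characters starting with a digit). It only checks shape, not whether a subtag is actually registered, and it ignores extensions. All names are made up for illustration.

```typescript
interface TagParts {
  language: string;
  script?: string;
  region?: string;
  variants: string[];
}

// Hypothetical parser for tags of the form language[-script][-region][-variant...].
function splitLanguageTag(tag: string): TagParts {
  const subtags = tag.split("-");
  const parts: TagParts = { language: subtags[0].toLowerCase(), variants: [] };
  if (!/^[a-z]{2,3}$/.test(parts.language)) {
    throw new Error("Expected a 2 or 3-letter language subtag: " + tag);
  }
  let position = 1;
  // Script: exactly 4 letters, directly after the language.
  if (position < subtags.length && /^[A-Za-z]{4}$/.test(subtags[position])) {
    parts.script = subtags[position];
    position += 1;
  }
  // Region: 2 letters or 3 digits, after the optional script.
  if (position < subtags.length && /^([A-Za-z]{2}|[0-9]{3})$/.test(subtags[position])) {
    parts.region = subtags[position];
    position += 1;
  }
  // Variants: everything that is left must have a variant's shape.
  for (; position < subtags.length; position += 1) {
    if (!/^([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})$/.test(subtags[position])) {
      throw new Error("Unrecognized subtag: " + subtags[position]);
    }
    parts.variants.push(subtags[position]);
  }
  return parts;
}

const slovenianParts = splitLanguageTag("sl-Latn-IT-nedis");
console.log(slovenianParts.language, slovenianParts.script, slovenianParts.region, slovenianParts.variants.join(","));
// sl Latn IT nedis
```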

Most are unique to one language. Check the validator for the “prefix” a variant requires. For example, Boontling’s prefix is always en, because it’s an English variant.

Extension subtags

When official subtags aren’t enough, you can extend language tags. These extension subtags are marked with a “singleton”, a letter surrounded by hyphens. Like -x-.
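To make the singleton idea concrete, here’s a small sketch that cuts a tag into its base and its extension sections. The tag in the example is only an illustration, and splitExtensions is a hypothetical name. It lowercases everything for simplicity, which is harmless because tags are case-insensitive.

```typescript
interface TagSections {
  base: string; // everything before the first singleton
  extensions: Record<string, string>; // singleton letter -> the subtags after it
}

function splitExtensions(tag: string): TagSections {
  const subtags = tag.toLowerCase().split("-");
  const base: string[] = [];
  const extensions: Record<string, string> = {};
  let singleton = ""; // empty while still reading the base tag
  for (const subtag of subtags) {
    // After "x" everything is private use, so later single characters are
    // ordinary private-use subtags rather than new singletons.
    if (subtag.length === 1 && singleton !== "x") {
      singleton = subtag;
      extensions[singleton] = "";
    } else if (singleton === "") {
      base.push(subtag);
    } else {
      extensions[singleton] =
        extensions[singleton] === "" ? subtag : extensions[singleton] + "-" + subtag;
    }
  }
  return { base: base.join("-"), extensions: extensions };
}

const extendedSections = splitExtensions("de-DE-u-co-phonebk-x-butter");
console.log(extendedSections.base); // de-de
console.log(extendedSections.extensions["u"]); // co-phonebk
console.log(extendedSections.extensions["x"]); // butter
```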

Unicode locale extensions

Unicode has very granular language classifications. Language tags can reuse them with -u- extensions.

The official word on Unicode extensions is RFC 6067, with more at the Unicode Common Locale Data Repository.

I won’t break this down because it’s not part of BCP 47, and it’s practically its own minilanguage. If you need them, you’ll know, like for JavaScript’s internationalization API.
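For a taste of what that looks like, here’s a small sketch using Intl.Locale and Intl.Collator, assuming a runtime that ships both (current browsers and Node do). The kn key asks for numeric ordering. The variable names are mine.

```typescript
// "kn-true" is a Unicode extension key/value asking for numeric ordering.
const localeWithExtension = new Intl.Locale("en-GB-u-kn-true");

console.log(localeWithExtension.language); // en
console.log(localeWithExtension.region); // GB
console.log(localeWithExtension.baseName); // en-GB
console.log(localeWithExtension.numeric); // true

// The collator picks the extension up from the tag, so digit runs are
// compared as numbers instead of character by character.
const numericCollator = new Intl.Collator("en-u-kn-true");
const sortedNumerically = ["10", "2", "1"].sort(numericCollator.compare);
console.log(sortedNumerically.join(", ")); // 1, 2, 10
```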

Transformed content extensions

These extensions indicate text that has been translated/transliterated. Put the end-product language tag first, then attach the “original” language tag after -t-. Specifics in RFC 6497.

Private use extensions

Private use extensions use -x- and are ideally 5–8 characters long, but they’re already private so whatever.

The W3C doesn’t recommend -x-:

Because these subtags are only meaningful within private agreements and cannot be used interoperably across the Web, they should be used with great care, and avoided whenever possible.

But since programs ignore subtags they don’t understand, -x-butter or whatever won’t mess anything up. It’s just weird and not useful. Not that that’s ever happened.

Exceptions and quirks

With such a messy subject as human language, no classifications fit perfectly. Here are some edge cases where the right thing may not be clear.

Screwing it up

Let’s say you forget lang, or use the wrong value. Not ideal, but nothing will blow up. User-generated content almost never bothers with lang.

However, it will mess up screen-readers/voice browsers/other assistive technologies, confuse translation software (automated or helpers for human translators), produce weird font choices, and cause other boogums. WCAG 2.0 requires the right lang for compliance.

Unknown languages

If you don’t know a document’s language, then don’t use the lang attribute at all. (Or a Content-Language header.) But what if you know the general language, but not some parts, or if you make an interface in a known language, but can’t predict the language of user-generated content? Say, a forum.

To mark an element’s language as unknown, use an empty value: lang="". It’s like how alt="" means an <img> is decorative. Don’t write the attribute name alone, like <span lang>. You can also do <html lang=""> to override incorrect guesses by Content-Language.

Annoyingly, XML 1.0 forbids the empty value for xml:lang. XML 1.1 fixed that, but some XML dialects don’t allow XML 1.1 (like SSML). In that case, use xml:lang="und".

Non-languages

For text that isn’t human language, there’s a special language subtag: zxx.

This would apply for text such as type samples, part numbers, illustrations of binary data, etc.

This lets screen-readers read out the characters individually. It also prevents the CSS hyphens property from inserting hyphens, since an extra hyphen in a serial number/address/ID value could cause problems.

You don’t need this on <code> elements; that tag’s semantics already mean it’s not human language.

Multiple languages

The special language subtag mul is short for “multiple”. It’s useless, just as bad as no lang at all. If you have attributes in different languages, like:

  <img src="2r23452.png" alt="A man with a plan" title="Un hombre con un proyecto" lang="mul">

Try instead:

  <span title="Un hombre con un proyecto" lang="es"><img src="2r23452.png" alt="A man with a plan" lang="en"></span>

More verbose, but more useful.

Nonstandard languages

The special subtag mis is short for “miscellaneous”. It’s for when you know the language, but a subtag for it doesn’t exist. Not very useful by itself, but you can use additional subtags for the alphabet it uses and the region it’s from.

Unwritten languages

If you specifically want to indicate that content is not written, there is a subtag for that. For example, you could use en-Zxxx to make it clear that an audio recording in English is not written content.

I don’t know what this would indicate that the presence of <audio> wouldn’t. Weird.

Programming languages

Do not use lang to indicate the computer language an element contains. The lang attribute on <code> would indicate human language within the code, such as variable names, comments, etc.

There isn’t an official way to mark up which programming language something contains.

Artificial/fictional languages

Some constructed languages, like Esperanto and Lojban, have language tags defined. In that case, great, use them.

For the rest, there’s a special language subtag: art, for “artificial”. You extend it with -x- to identify the language. So Solresol is art-x-solresol.

Using language tags

That's how to write ‘em, but where do we stick ‘em? On the Web, there are 3 places: the server level, the document level, and the element level.

Server-level language semantics

HTTP has a Content-Language header, for the language a served URL uses. It’s often wrong.

Its primary use is content negotiation. Browsers send an Accept-Language header with the user’s preferred languages. In theory, users configure what they speak ranked by desire:

Each language-range MAY be given an associated quality value which represents an estimate of the user's preference for the languages specified by that range. The quality value defaults to “q=1”. For example,

  Accept-Language: da, en-gb;q=0.8, en;q=0.7

would mean: “I prefer Danish, but will accept British English and other types of English.”

Nobody does that, except the sorts of users who get miffed if you don’t respect it. For everybody else, the browser guesses from the operating system’s language settings.

It’s handy for a first guess at the language, but please let users switch languages afterwards. They could be on somebody else’s device, have realized that a particular translation on your site is bad, or, you know, life attacked.
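If you do want to read Accept-Language on the server, here’s a minimal sketch of parsing it into a ranked list. It assumes a well-behaved header and quietly skips anything it can’t read. The function and type names are invented for this example.

```typescript
interface LanguagePreference {
  range: string; // a language tag, or "*"
  quality: number; // 0 to 1, higher is more preferred
}

function parseAcceptLanguage(header: string): LanguagePreference[] {
  const preferences: LanguagePreference[] = [];
  for (const item of header.split(",")) {
    const pieces = item.trim().split(";");
    const range = pieces[0].trim();
    if (range === "") continue;
    let quality = 1; // the default when no q-value is given
    for (const parameter of pieces.slice(1)) {
      const match = /^\s*q\s*=\s*([0-9.]+)\s*$/i.exec(parameter);
      if (match) quality = parseFloat(match[1]);
    }
    if (!isNaN(quality)) preferences.push({ range: range, quality: quality });
  }
  // Sorting is stable in modern engines, so ranges with equal quality keep
  // the order the browser sent them in.
  return preferences.sort((a, b) => b.quality - a.quality);
}

const parsedPreferences = parseAcceptLanguage("da, en-gb;q=0.8, en;q=0.7");
console.log(parsedPreferences.map((p) => p.range).join(" > ")); // da > en-gb > en
```

Treat the result as a starting guess, then remember whatever the user picks from your language switcher.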

Pages shouldn’t rely on the Content-Language header either, because it only works on properly-configured servers. Browsers infer the language from it, but for multilanguage pages, HTML saved to disk, or certain features in CSS, the lang attribute is better. The W3C prefers lang in the markup for such reasons.

Document-level language semantics

In HTML, all you need is the lang attribute. In XML or inline SVG/MathML, it’s xml:lang. If you're using XHTML (why?) you need both. For readability, I’ll refer to both attributes as lang.

The first thing to do is to set lang on the root element, usually <html>. This defines the default language and overrides the Content-Language header.

Don’t use this ancient forbidden technique:

  <meta http-equiv="Content-Language" content="zh-Hant">

The HTML5 spec recommends against it. lang on <html> is less code and doesn’t risk declaring the language after content (like <title>), so there’s no reason to use the legacy method.

A top-level lang is enough for most pages. But you can go deeper.

Element-level language semantics

In multilingual documents, use lang on elements where the language changes. For example, for a Swahili/Yoruba conversation, you could declare the <html lang> as Yoruba, and mark up the Swahili passages. Or vice-versa.

lang goes on any element, even void ones like <img> that can’t have content. It indicates those elements’ human-readable attributes are in that language. As for what attributes are human-readable, the spec lists them.

It doesn’t mention ARIA attributes, but at least aria-label and aria-roledescription should be translated.

Believe it or not, it’s best to mark snippets of foreign languages with <i>. HTML5 redefined <i> as:

The i element represents a span of text offset from its surrounding content without conveying any extra emphasis or importance, and for which the conventional typographic presentation is italic text; for example, a taxonomic designation, a technical term, an idiomatic phrase from another language, a thought, or a ship name.

If you don’t want italics and the idea of i { font-style: normal } offends you, <span> is fine too. If the text direction also changes, you can slap lang onto <bdo> or <bdi> and have them pull double-duty.

Related attributes

That was a lot of words on the lang attribute, and now there’s more? Don’t worry, they’re easy.

hreflang

lang on <a> only says what language is within its tags. The hreflang attribute is for what the link points at.

Google uses <link rel="alternate" hreflang="whatever"> to indicate site translations. This has become a de-facto standard. Note the special value x-default:

Finally, the reserved value “x-default” is used for indicating language selectors/redirectors which are not specific to one language or region, e.g. your homepage showing a clickable map of the world.
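If your translations live in some kind of lookup table, generating those tags is a few lines of code. This is a sketch under assumptions: the input shape, the function name, and the example.com URLs are all placeholders of mine.

```typescript
// Hypothetical input: maps each language tag (or "x-default") to the URL of
// that translation.
function alternateLinks(translations: Record<string, string>): string[] {
  // Escapes the two characters that could break a double-quoted attribute.
  const escapeAttribute = (value: string): string =>
    value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
  return Object.keys(translations).map(
    (tag) =>
      '<link rel="alternate" hreflang="' +
      escapeAttribute(tag) +
      '" href="' +
      escapeAttribute(translations[tag]) +
      '">'
  );
}

const generatedLinks = alternateLinks({
  en: "https://example.com/en/",
  "pt-BR": "https://example.com/pt-br/",
  "x-default": "https://example.com/",
});
console.log(generatedLinks.join("\n"));
```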

srclang

Unique to the <track> element, it works like hreflang: the language that the track’s src uses. If the <track> has kind="subtitles", you must include srclang.

translate

HTML5 introduced the translate attribute, for proper nouns, example text (like a manual describing interface labels), and whatever else that shouldn’t be translated.

For a page on French linguistics, you could mark example French text like this:

  <i class="example" lang="fr" translate="no">sacré bleu</i>

Effects and applications

“Yeah, okay, but what does all this semantic nambery-pambery do?” A fine question; why bother? Here’s what I found.

Smarter font display

lang helps browsers make smarter font choices. This can be as subtle as switching to fonts with better support for certain ligatures, or as vital as a font necessary to display the language correctly. It’s extremely important to do this for CJK (Chinese/Japanese/Korean) languages, because they share Unicode code points, and need the proper font to render correctly. (This is a touchy subject, by the way.)

Figure: the chosen font changes the appearance of the same character, depending on what language is defined. (Example from the W3C.)

lang also affects details like OpenType features, kerning, <ol> markers, and other bitmabobs. As Web typography advances, more and more subtleties will be introduced: intelligent justification, case-normalization, hyphens, and hanging punctuation, to name a few. Many require lang.

<q>

Quotation marks differ between languages, so what the browser automatically inserts around the <q> element changes too. This also applies to the quotes property, useful for stylized blockquotes.

::first-letter

Defining first-letter is tricky: do you include leading punctuation? What about letter combinations, such as AE or IJ? The answers are language-specific, so CSS needs the right lang for this pseudo-element to work correctly.

text-transform

Surprise!

<input type="number">

Regions differ on number punctuation. America uses “1,000.23”, and Europe likes “1.000,23”. It gets hairier.

The lang attribute tells <input type="number"> which punctuation to supply in mobile numeric keyboards, and how to validate. Browser support isn’t perfect, but it’s a nice bit of removed friction. Read the linked article if you’re serious about form internationalization, because like all things forms, it’s complicated and annoying.
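To see the regional difference for yourself, JavaScript’s Intl.NumberFormat takes the same language tags. This sketch shows the formatting conventions only; it says nothing about what any particular browser does with <input type="number">. It assumes a runtime with full locale data, which current browsers and Node have, and decimalSeparator is a name I made up.

```typescript
const sampleAmount = 1000.23;

console.log(new Intl.NumberFormat("en-US").format(sampleAmount)); // 1,000.23
console.log(new Intl.NumberFormat("de-DE").format(sampleAmount)); // 1.000,23

// Hypothetical helper: asks the formatter which decimal separator a
// language tag implies, falling back to "." if none is reported.
function decimalSeparator(languageTag: string): string {
  const parts = new Intl.NumberFormat(languageTag).formatToParts(1.5);
  for (const part of parts) {
    if (part.type === "decimal") return part.value;
  }
  return ".";
}

console.log(decimalSeparator("en-US")); // .
```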

I haven’t seen <input type="date"> do this in browsers yet, but it’s only a matter of time.

The spellcheck attribute

spellcheck requires lang to function. How do you proofread an unknown language?

The :lang() CSS pseudo-class

You can use the [attribute] selector for styling elements based on language, but the important difference is that :lang() respects inheritance: it matches every element whose language is Arabic, even when the lang attribute sits on an ancestor. You wouldn’t want something like [lang|="ar"] * { direction: rtl }, right? Yeck. Not to mention the more advanced matching :lang() performs.
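Here’s a rough model of that difference, using a pared-down stand-in for DOM elements. The attribute selector only looks at the element itself, while :lang() works from the element’s effective language, found by walking up to the nearest lang. This sketch covers only simple prefix matching, not the fancier matching mentioned above, and every name in it is hypothetical.

```typescript
// A stand-in for a DOM element: just a lang attribute and a parent.
interface LangNode {
  lang?: string;
  parentNode?: LangNode;
}

// The nearest lang attribute on the element or its ancestors wins.
function effectiveLanguage(node: LangNode): string {
  let current: LangNode | undefined = node;
  while (current !== undefined) {
    if (current.lang !== undefined) return current.lang;
    current = current.parentNode;
  }
  return ""; // unknown
}

// Prefix matching: "ar" matches "ar" and "ar-EG", but not "arc".
function matchesLanguageRange(languageTag: string, range: string): boolean {
  const tag = languageTag.toLowerCase();
  const wanted = range.toLowerCase();
  return tag === wanted || tag.indexOf(wanted + "-") === 0;
}

const articleNode: LangNode = { lang: "ar-EG" };
const paragraphNode: LangNode = { parentNode: articleNode }; // no lang of its own
const quoteNode: LangNode = { lang: "en", parentNode: paragraphNode };

console.log(matchesLanguageRange(effectiveLanguage(paragraphNode), "ar")); // true
console.log(matchesLanguageRange(effectiveLanguage(quoteNode), "ar")); // false
```

The paragraph has no lang attribute, so [lang|="ar"] would skip it, yet it still counts as Arabic for :lang(ar). The English quote inside it correctly opts out.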

Language-specific styling isn’t too common, but crops up if you look. For example, <strong> is normally bold, but bold Chinese body text looks terrible. So something like:

  strong:lang(zh) {
    font-weight: normal;
    background-color: yellow;
  }

…would be more readable.

Possibly the biggest use-case is different fonts; system fonts vary wildly in how well they handle various alphabets. You could even load separate @font-faces based on language.

Here’s how Wikipedia uses it in the real world:

  a:lang(ar), /* Arabic */
  a:lang(kk-arab), /* Kazakh in Arabic script */
  a:lang(mzn), /* Mazanderani */
  a:lang(ps), /* Pashto */
  a:lang(ur) /* Urdu */ {
    text-decoration: none;
  }

These languages make heavy use of descenders, so underlines make them much harder to read.

The hyphens CSS property

hyphens has the browser break lines more intelligently, using built-in hyphenation dictionaries. It works poorly without lang.

Text-to-speech

Screen-readers and voice browsers guess if they can’t find language information for a page, which can go badly. Imagine a screen-reader mangling Italian with Spanish pronunciation — their robotic voices already have tone issues. lang improves things considerably. This is especially important for multilingual documents.

With the Apple Watch, Siri/Alexa’s Web answers, and other chatterbots, computers reading aloud is becoming common. Voice browsing benefits considerably from proper language annotations.

Defining the region is handy. For example, JAWS, VoiceOver, and NVDA use different voices for en-US vs. en-GB, which clears up differences like color/colour, organize/organise, and the temperature beer should be served at.

Other assistive technology

lang is part of accessibility standards for a reason. Beyond screen-readers, Braille output is bettered by language tags.

Language transformation is far more useful when it can be sure what language it’s working on in the first place. And assistive technology is like 95% language transformation.

Search engines and other web crawlers

Search engines historically don’t use language tags, because people on the internet can’t be trusted. When you’re Google, this isn’t a problem. But other crawlers appreciate the hint. (And as mentioned earlier, Google does use hreflang, so…)

Computers so far are really, really bad at figuring out language changes, so they could use the help. For example: loanwords, cross-language homonyms, and borrowed phrases. C’est la vie.

Application manifests

The App Manifests that Google and the W3C are pushing include a "lang": property to help with localization, then "default_locale": and a "locale": array for worldwide distribution. All of them use language tags.

SVG's <switch> element

One of the “conditional processing attributes” that <switch> uses is systemLanguage. I’ve seen it used to change a character’s hand gestures, since what’s innocuous in some regions can be insulting in others.

