Unicode Codepoints Explained: How Every Character Gets a Number

Learn what Unicode codepoints are, how they map to characters, the difference between Unicode and UTF-8, and how to decode codepoints yourself β€” with tables, examples, and common pitfalls.

The Quick Answer

A Unicode codepoint is a unique number assigned to every character. It is written in hexadecimal with the prefix U+.

Character Codepoint Decimal Description
A U+0041 65 Latin uppercase A
€ U+20AC 8364 Euro sign
δΈ­ U+4E2D 20013 CJK ideograph "middle"
πŸ˜€ U+1F600 128512 Grinning face emoji
Ο€ U+03C0 960 Greek lowercase pi

Example: "Hello" β†’ U+0048 U+0065 U+006C U+006C U+006F

To convert codepoints to characters or characters to codepoints, use our Unicode Codepoint Decoder.


Why Unicode Exists

Before Unicode, every region had its own character encoding. ASCII covered English (128 characters). ISO 8859-1 added Western European characters. Shift_JIS handled Japanese. GB2312 handled Chinese. Windows-1251 handled Cyrillic.

The problem: open a file encoded in Shift_JIS with a Latin-1 decoder, and you get garbage. There was no universal standard.

Unicode solved this by assigning one unique number to every character across all writing systems. One standard. One number per character. No ambiguity.

As of Unicode 16.0 (released 2024), there are over 154,000 assigned characters covering virtually every script in active use, plus thousands of historical scripts, symbols, and emoji.


How Codepoints Work

Every Unicode character has exactly one codepoint: a number from 0 to 1,114,111 (hex: 0 to 10FFFF).

The notation U+ followed by 4 to 6 hex digits is the standard way to write a codepoint:

  • U+0041 β†’ A
  • U+00E9 β†’ Γ©
  • U+4E2D β†’ δΈ­
  • U+1F600 β†’ πŸ˜€

For codepoints in the Basic Multilingual Plane (U+0000–U+FFFF), 4 hex digits are used. For codepoints above U+FFFF, 5 or 6 hex digits are used.

Unicode Planes

The 1,114,112 possible codepoints are organized into 17 planes, each containing 65,536 codepoints:

Plane Range Name What's There
0 U+0000–U+FFFF Basic Multilingual Plane (BMP) Latin, Greek, Cyrillic, CJK, Arabic, Hebrew, common symbols
1 U+10000–U+1FFFF Supplementary Multilingual Plane Emoji, musical notation, historic scripts, mathematical alphanumerics
2 U+20000–U+2FFFF Supplementary Ideographic Plane Rare CJK characters
14 U+E0000–U+EFFFF Supplementary Special-purpose Plane Tag characters, variation selectors
15–16 U+F0000–U+10FFFF Private Use Areas Application-defined (not standardized)

Planes 3 through 13 are mostly unassigned and reserved for future expansion.

Most characters you encounter daily live in Plane 0 (the BMP). Emoji and less common scripts live in Plane 1.


Unicode vs UTF-8 vs UTF-16

This is the single most common source of confusion in character encoding.

  • Unicode is the character set β€” which number maps to which character.
  • UTF-8, UTF-16, UTF-32 are encodings β€” how those numbers are stored as bytes.

Think of it like this: Unicode says "A is number 65." UTF-8 says "I'll store the number 65 as the byte 0x41." UTF-16 says "I'll store it as the bytes 0x00 0x41."

UTF-8 Encoding

UTF-8 is the dominant encoding on the web (over 98% of web pages). It uses 1 to 4 bytes per character:

Codepoint Range UTF-8 Bytes Bit Pattern Example
U+0000–U+007F 1 byte 0xxxxxxx A β†’ 41
U+0080–U+07FF 2 bytes 110xxxxx 10xxxxxx Γ© β†’ C3 A9
U+0800–U+FFFF 3 bytes 1110xxxx 10xxxxxx 10xxxxxx € β†’ E2 82 AC
U+10000–U+10FFFF 4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx πŸ˜€ β†’ F0 9F 98 80

The key advantage of UTF-8: ASCII text (the first 128 characters) is stored identically to plain ASCII. This makes UTF-8 backward-compatible and efficient for English-heavy text.

UTF-16 Encoding

UTF-16 uses 2 bytes for BMP characters and 4 bytes (a surrogate pair) for characters above U+FFFF.

JavaScript strings are internally UTF-16. This matters because:

"A".length      // 1  (one 16-bit code unit)
"πŸ˜€".length     // 2  (two 16-bit code units = surrogate pair)
[..."πŸ˜€"].length // 1  (one actual character)

If you're working with JavaScript and need the real character count, use [...str].length or Array.from(str).length instead of .length.

Quick Comparison

Encoding ASCII character Γ© (U+00E9) € (U+20AC) πŸ˜€ (U+1F600)
UTF-8 1 byte 2 bytes 3 bytes 4 bytes
UTF-16 2 bytes 2 bytes 2 bytes 4 bytes
UTF-32 4 bytes 4 bytes 4 bytes 4 bytes

How Emoji Codepoints Work

Most simple emoji are a single codepoint: πŸ˜€ is U+1F600, πŸŽ‰ is U+1F389, 🌍 is U+1F30D.

But many modern emoji are actually sequences of multiple codepoints:

Skin Tone Modifiers

A base emoji followed by a skin tone modifier (U+1F3FB through U+1F3FF):

Emoji Sequence Explanation
πŸ‘‹ U+1F44B Base waving hand
πŸ‘‹πŸ» U+1F44B U+1F3FB + Light skin tone
πŸ‘‹πŸ½ U+1F44B U+1F3FD + Medium skin tone
πŸ‘‹πŸΏ U+1F44B U+1F3FF + Dark skin tone

Zero Width Joiner (ZWJ) Sequences

Multiple emoji joined by U+200D (Zero Width Joiner):

Emoji Sequence Components
πŸ‘©β€πŸ’» U+1F469 U+200D U+1F4BB Woman + ZWJ + Laptop
πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦ U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466 Man + ZWJ + Woman + ZWJ + Girl + ZWJ + Boy

Flag Sequences

Two Regional Indicator symbols (U+1F1E6 through U+1F1FF) combine to form a flag:

Flag Sequence Letters
πŸ‡ΊπŸ‡Έ U+1F1FA U+1F1F8 Regional Indicator U + Regional Indicator S
πŸ‡©πŸ‡ͺ U+1F1E9 U+1F1EA Regional Indicator D + Regional Indicator E

Variation Selectors

Some characters can appear as either text or emoji depending on a trailing variation selector:

  • ❀ (U+2764) β€” default text presentation
  • ❀️ (U+2764 U+FE0F) β€” emoji presentation (with Variation Selector-16)

The Unicode Codepoint Decoder shows every codepoint in a sequence, which is useful for debugging emoji rendering issues.


Unicode vs ASCII

ASCII was the original character encoding standard (1963). It defines 128 characters using 7 bits:

Range Content
0–31 Control characters (tab, newline, etc.)
32–126 Printable characters (letters, digits, punctuation)
127 Delete

Unicode's first 128 codepoints (U+0000–U+007F) are identical to ASCII. This was deliberate β€” it means every valid ASCII text is also valid Unicode (in UTF-8 encoding).

The difference: ASCII stops at 128 characters. Unicode goes to 1,114,112. ASCII handles English. Unicode handles every writing system.

For a full ASCII reference, see our ASCII Table.


Normalization: When the Same Character Has Different Codepoints

Unicode has a subtle but important concept: some characters can be represented in more than one way.

The letter "Γ©" can be:

  1. U+00E9 β€” a single precomposed character (Γ©)
  2. U+0065 U+0301 β€” the letter "e" followed by a combining acute accent (Γ©)

Both look identical on screen. But they are different byte sequences. This means string comparison can fail:

"Γ©" === "Γ©"  // might be false if one is precomposed and the other is decomposed

Unicode normalization converts between these forms:

Form Name Example for Γ©
NFC Composed U+00E9 (single codepoint)
NFD Decomposed U+0065 U+0301 (two codepoints)

When comparing strings that might come from different sources, normalize them first. In JavaScript: str.normalize('NFC').
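In JavaScript this looks like:

```javascript
const precomposed = "\u00E9"; // Γ© as one codepoint (NFC)
const decomposed = "e\u0301"; // e + combining acute accent (NFD)

console.log(precomposed === decomposed);                  // false
console.log(precomposed === decomposed.normalize("NFC")); // true
console.log(precomposed.normalize("NFD") === decomposed); // true
```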


Common Mistakes with Unicode

  1. Assuming 1 character = 1 byte. Only true for ASCII in UTF-8. A CJK character takes 3 bytes. An emoji takes 4.

  2. Using .length in JavaScript for character count. "πŸ˜€".length returns 2 because JavaScript counts UTF-16 code units, not characters. Use [..."πŸ˜€"].length instead.

  3. Slicing strings through surrogate pairs. "πŸ˜€".slice(0, 1) produces a broken half of a surrogate pair. Use [..."πŸ˜€text"].slice(0, 1).join("") for safe slicing.

  4. Confusing codepoints with bytes. U+00E9 is a codepoint (the number 233). The UTF-8 bytes for this codepoint are 0xC3 0xA9 (two bytes that together encode the number 233 in UTF-8's scheme). They are different numbers.

  5. Ignoring normalization. Two strings that look identical may have different codepoint sequences. Always normalize before comparing.

  6. Hardcoding maximum byte counts. A single Unicode character can be up to 4 bytes in UTF-8. A single user-perceived "character" (grapheme cluster) like a flag emoji or family emoji can be 20+ bytes.


Practical Uses for Codepoint Lookup

  • Debugging encoding issues: When text shows as "é" instead of "Γ©", the bytes C3 A9 (UTF-8 for U+00E9) are being decoded as Latin-1 instead of UTF-8. Knowing the codepoint helps diagnose the mismatch.
  • Working with special characters in code: Need to insert a non-breaking space? That's U+00A0. A zero-width space? U+200B. Knowing the codepoint lets you use escape sequences like \u00A0 in JavaScript.
  • Understanding emoji composition: When an emoji doesn't render correctly, decomposing it into codepoints reveals whether skin tone modifiers, ZWJ sequences, or variation selectors are present or missing.
  • Database and API debugging: Character encoding mismatches between systems often corrupt specific codepoint ranges. Identifying which codepoints are affected narrows down the encoding mismatch.
  • Accessibility and internationalization: Verifying that text contains the correct codepoints for a given language (e.g., distinguishing Cyrillic "Π°" U+0430 from Latin "a" U+0061, which look identical in many fonts).

Frequently Asked Questions

What is the difference between a codepoint and a glyph?

A codepoint is a number (e.g., U+0041). A glyph is the visual shape drawn on screen for that number. The same codepoint can look different in different fonts. Multiple codepoints can combine into a single glyph (like combining accents). One codepoint can produce different glyphs depending on context (like Arabic letter shaping).

How many Unicode characters are there?

As of Unicode 16.0, there are over 154,000 assigned characters. The maximum possible is 1,114,112 codepoints (U+0000–U+10FFFF), though many ranges are reserved or unassigned.

Is Unicode the same as UTF-8?

No. Unicode is the standard that assigns numbers to characters. UTF-8 is one way to encode those numbers as bytes. Other encodings include UTF-16 and UTF-32. When people say "Unicode file" they usually mean "UTF-8 encoded file."

What is the BOM (Byte Order Mark)?

The BOM is the character U+FEFF placed at the start of a file to signal its encoding. In UTF-8, the BOM is the byte sequence EF BB BF. It's optional in UTF-8 (and often discouraged), but in UTF-16 it is the standard way to indicate byte order (big-endian vs little-endian) when the encoding label doesn't specify one.
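Because the BOM is just the codepoint U+FEFF, detecting and stripping it from decoded text is straightforward; a minimal sketch (the function name is illustrative):

```javascript
// Strip a leading U+FEFF if present (harmless no-op otherwise)
function stripBOM(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

console.log(stripBOM("\uFEFFhello")); // hello
console.log(stripBOM("hello"));       // hello
```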

Can I use Unicode codepoints in HTML?

Yes. Use &#x followed by the hex codepoint and a semicolon: &#xE9; produces Γ©. Or use the decimal form: &#233;. Named entities like &eacute; work for some characters but not for emoji or obscure symbols.

What are Private Use Area codepoints?

Three ranges are reserved for application-specific characters: U+E000–U+F8FF (BMP Private Use Area), U+F0000–U+FFFFD (Supplementary Private Use Area A), and U+100000–U+10FFFD (Supplementary Private Use Area B). Custom icon fonts like Font Awesome often use BMP Private Use Area codepoints.

How do I type a Unicode character by codepoint?

On Windows: hold Alt, type + and the hex codepoint on the numpad (requires registry setting). On macOS: press Ctrl+Cmd+Space for the character viewer, or enable the Unicode Hex Input keyboard and type the hex code while holding Option. On Linux: press Ctrl+Shift+U, type the hex code, press Enter. Or use our Unicode Codepoint Decoder to convert and copy.

What is a grapheme cluster?

A grapheme cluster is what a user perceives as a single "character." It may consist of multiple codepoints. For example, the flag emoji πŸ‡ΊπŸ‡Έ is two codepoints (U+1F1FA U+1F1F8) but appears as one symbol. The family emoji πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦ is seven codepoints but one visual unit. When counting characters for user-facing purposes (like a character limit), count grapheme clusters, not codepoints.


Related Tools
