Unicode Codepoints Explained: How Every Character Gets a Number

Learn what Unicode codepoints are, how they map to characters, the difference between Unicode and UTF-8, and how to decode codepoints yourself β€” with tables, examples, and common pitfalls.

The Quick Answer

A Unicode codepoint is a unique number assigned to every character. It is written in hexadecimal with the prefix U+.

Character Codepoint Decimal Description
A U+0041 65 Latin uppercase A
€ U+20AC 8364 Euro sign
δΈ­ U+4E2D 20013 CJK ideograph "middle"
πŸ˜€ U+1F600 128512 Grinning face emoji
Ο€ U+03C0 960 Greek lowercase pi

Example: "Hello" β†’ U+0048 U+0065 U+006C U+006C U+006F

To convert codepoints to characters or characters to codepoints, use our Unicode Codepoint Decoder.


Why Unicode Exists

Before Unicode, every region had its own character encoding. ASCII covered English (128 characters). ISO 8859-1 added Western European characters. Shift_JIS handled Japanese. GB2312 handled Chinese. Windows-1251 handled Cyrillic.

The problem: open a file encoded in Shift_JIS with a Latin-1 decoder, and you get garbage. There was no universal standard.

Unicode solved this by assigning one unique number to every character across all writing systems. One standard. One number per character. No ambiguity.

As of Unicode 16.0 (released 2024), there are over 154,000 assigned characters covering virtually every script in active use, plus thousands of historical scripts, symbols, and emoji.


How Codepoints Work

Every Unicode character has exactly one codepoint: a number from 0 to 1,114,111 (hex: 0 to 10FFFF).

The notation U+ followed by 4 to 6 hex digits is the standard way to write a codepoint:

  • U+0041 β†’ A
  • U+00E9 β†’ Γ©
  • U+4E2D β†’ δΈ­
  • U+1F600 β†’ πŸ˜€

For codepoints in the Basic Multilingual Plane (U+0000–U+FFFF), 4 hex digits are used. For codepoints above U+FFFF, 5 or 6 hex digits are used.

Unicode Planes

The 1,114,112 possible codepoints are organized into 17 planes, each containing 65,536 codepoints:

Plane Range Name What's There
0 U+0000–U+FFFF Basic Multilingual Plane (BMP) Latin, Greek, Cyrillic, CJK, Arabic, Hebrew, common symbols
1 U+10000–U+1FFFF Supplementary Multilingual Plane Emoji, musical notation, historic scripts, mathematical alphanumerics
2 U+20000–U+2FFFF Supplementary Ideographic Plane Rare CJK characters
14 U+E0000–U+EFFFF Supplementary Special-purpose Plane Tag characters, variation selectors
15–16 U+F0000–U+10FFFF Private Use Areas Application-defined (not standardized)

Planes 3 through 13 are mostly unassigned and reserved for future expansion.

Most characters you encounter daily live in Plane 0 (the BMP). Emoji and less common scripts live in Plane 1.


Unicode vs UTF-8 vs UTF-16

This is the single most common source of confusion in character encoding.

  • Unicode is the character set β€” which number maps to which character.
  • UTF-8, UTF-16, UTF-32 are encodings β€” how those numbers are stored as bytes.

Think of it like this: Unicode says "A is number 65." UTF-8 says "I'll store the number 65 as the byte 0x41." UTF-16 says "I'll store it as the bytes 0x00 0x41."

UTF-8 Encoding

UTF-8 is the dominant encoding on the web (over 98% of web pages). It uses 1 to 4 bytes per character:

Codepoint Range UTF-8 Bytes Bit Pattern Example
U+0000–U+007F 1 byte 0xxxxxxx A β†’ 41
U+0080–U+07FF 2 bytes 110xxxxx 10xxxxxx Γ© β†’ C3 A9
U+0800–U+FFFF 3 bytes 1110xxxx 10xxxxxx 10xxxxxx € β†’ E2 82 AC
U+10000–U+10FFFF 4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx πŸ˜€ β†’ F0 9F 98 80

The key advantage of UTF-8: ASCII text (the first 128 characters) is stored identically to plain ASCII. This makes UTF-8 backward-compatible and efficient for English-heavy text.

UTF-16 Encoding

UTF-16 uses 2 bytes for BMP characters and 4 bytes (a surrogate pair) for characters above U+FFFF.

JavaScript strings are internally UTF-16. This matters because:

"A".length      // 1  (one 16-bit code unit)
"πŸ˜€".length     // 2  (two 16-bit code units = surrogate pair)
[..."πŸ˜€"].length // 1  (one actual character)

If you're working with JavaScript and need the real character count, use [...str].length or Array.from(str).length instead of .length.

Quick Comparison

Encoding ASCII character Γ© (U+00E9) € (U+20AC) πŸ˜€ (U+1F600)
UTF-8 1 byte 2 bytes 3 bytes 4 bytes
UTF-16 2 bytes 2 bytes 2 bytes 4 bytes
UTF-32 4 bytes 4 bytes 4 bytes 4 bytes

How Emoji Codepoints Work

Most simple emoji are a single codepoint: πŸ˜€ is U+1F600, πŸŽ‰ is U+1F389, 🌍 is U+1F30D.

But many modern emoji are actually sequences of multiple codepoints:

Skin Tone Modifiers

A base emoji followed by a skin tone modifier (U+1F3FB through U+1F3FF):

Emoji Sequence Explanation
πŸ‘‹ U+1F44B Base waving hand
πŸ‘‹πŸ» U+1F44B U+1F3FB + Light skin tone
πŸ‘‹πŸ½ U+1F44B U+1F3FD + Medium skin tone
πŸ‘‹πŸΏ U+1F44B U+1F3FF + Dark skin tone

Zero Width Joiner (ZWJ) Sequences

Multiple emoji joined by U+200D (Zero Width Joiner):

Emoji Sequence Components
πŸ‘©β€πŸ’» U+1F469 U+200D U+1F4BB Woman + ZWJ + Laptop
πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦ U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466 Man + ZWJ + Woman + ZWJ + Girl + ZWJ + Boy

Flag Sequences

Two Regional Indicator symbols (U+1F1E6 through U+1F1FF) combine to form a flag:

Flag Sequence Letters
πŸ‡ΊπŸ‡Έ U+1F1FA U+1F1F8 Regional Indicator U + Regional Indicator S
πŸ‡©πŸ‡ͺ U+1F1E9 U+1F1EA Regional Indicator D + Regional Indicator E

Variation Selectors

Some characters can appear as either text or emoji depending on a trailing variation selector:

  • ❀ (U+2764) β€” default text presentation
  • ❀️ (U+2764 U+FE0F) β€” emoji presentation (with Variation Selector-16)

The Unicode Codepoint Decoder shows every codepoint in a sequence, which is useful for debugging emoji rendering issues.


Unicode vs ASCII

ASCII was the original character encoding standard (1963). It defines 128 characters using 7 bits:

Range Content
0–31 Control characters (tab, newline, etc.)
32–126 Printable characters (letters, digits, punctuation)
127 Delete

Unicode's first 128 codepoints (U+0000–U+007F) are identical to ASCII. This was deliberate β€” it means every valid ASCII text is also valid Unicode (in UTF-8 encoding).

The difference: ASCII stops at 128 characters. Unicode goes to 1,114,112. ASCII handles English. Unicode handles every writing system.

For a full ASCII reference, see our ASCII Table.


Normalization: When the Same Character Has Different Codepoints

Unicode has a subtle but important concept: some characters can be represented in more than one way.

The letter "Γ©" can be:

  1. U+00E9 β€” a single precomposed character (Γ©)
  2. U+0065 U+0301 β€” the letter "e" followed by a combining acute accent (Γ©)

Both look identical on screen. But they are different byte sequences. This means string comparison can fail:

"Γ©" === "Γ©"  // might be false if one is precomposed and the other is decomposed

Unicode normalization converts between these forms:

Form Name Example for Γ©
NFC Composed U+00E9 (single codepoint)
NFD Decomposed U+0065 U+0301 (two codepoints)

When comparing strings that might come from different sources, normalize them first. In JavaScript: str.normalize('NFC').
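In JavaScript this looks like:

```javascript
const precomposed = "\u00E9"; // Γ© as one codepoint (NFC)
const decomposed = "e\u0301"; // e + combining acute accent (NFD)

console.log(precomposed === decomposed);                  // false
console.log(precomposed === decomposed.normalize("NFC")); // true
console.log(precomposed.normalize("NFD") === decomposed); // true
```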


Common Mistakes with Unicode

  1. Assuming 1 character = 1 byte. Only true for ASCII in UTF-8. A CJK character takes 3 bytes. An emoji takes 4.

  2. Using .length in JavaScript for character count. "πŸ˜€".length returns 2 because JavaScript counts UTF-16 code units, not characters. Use [..."πŸ˜€"].length instead.

  3. Slicing strings through surrogate pairs. "πŸ˜€".slice(0, 1) produces a broken half of a surrogate pair. Use [..."πŸ˜€text"].slice(0, 1).join("") for safe slicing.

  4. Confusing codepoints with bytes. U+00E9 is a codepoint (the number 233). The UTF-8 bytes for this codepoint are 0xC3 0xA9 (two bytes that together encode the number 233 in UTF-8's scheme). They are different numbers.

  5. Ignoring normalization. Two strings that look identical may have different codepoint sequences. Always normalize before comparing.

  6. Hardcoding maximum byte counts. A single Unicode character can be up to 4 bytes in UTF-8. A single user-perceived "character" (grapheme cluster) like a flag emoji or family emoji can be 20+ bytes.


Practical Uses for Codepoint Lookup

  • Debugging encoding issues: When text shows as "é" instead of "Γ©", the bytes C3 A9 (UTF-8 for U+00E9) are being decoded as Latin-1 instead of UTF-8. Knowing the codepoint helps diagnose the mismatch.
  • Working with special characters in code: Need to insert a non-breaking space? That's U+00A0. A zero-width space? U+200B. Knowing the codepoint lets you use escape sequences like \u00A0 in JavaScript.
  • Understanding emoji composition: When an emoji doesn't render correctly, decomposing it into codepoints reveals whether skin tone modifiers, ZWJ sequences, or variation selectors are present or missing.
  • Database and API debugging: Character encoding mismatches between systems often corrupt specific codepoint ranges. Identifying which codepoints are affected narrows down the encoding mismatch.
  • Accessibility and internationalization: Verifying that text contains the correct codepoints for a given language (e.g., distinguishing Cyrillic "Π°" U+0430 from Latin "a" U+0061, which look identical in many fonts).

Frequently Asked Questions

What is the difference between a codepoint and a glyph?

A codepoint is a number (e.g., U+0041). A glyph is the visual shape drawn on screen for that number. The same codepoint can look different in different fonts. Multiple codepoints can combine into a single glyph (like combining accents). One codepoint can produce different glyphs depending on context (like Arabic letter shaping).

How many Unicode characters are there?

As of Unicode 16.0, there are over 154,000 assigned characters. The maximum possible is 1,114,112 codepoints (U+0000–U+10FFFF), though many ranges are reserved or unassigned.

Is Unicode the same as UTF-8?

No. Unicode is the standard that assigns numbers to characters. UTF-8 is one way to encode those numbers as bytes. Other encodings include UTF-16 and UTF-32. When people say "Unicode file" they usually mean "UTF-8 encoded file."

What is the BOM (Byte Order Mark)?

The BOM is the character U+FEFF placed at the start of a file to signal its encoding. In UTF-8, the BOM is the byte sequence EF BB BF. It's optional in UTF-8 (and often discouraged), but in UTF-16 it is the standard way to indicate byte order (big-endian vs little-endian) when the encoding label doesn't specify one.
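Because the BOM is just the codepoint U+FEFF, detecting and stripping it from decoded text is straightforward; a minimal sketch (the function name is illustrative):

```javascript
// Strip a leading U+FEFF if present (harmless no-op otherwise)
function stripBOM(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

console.log(stripBOM("\uFEFFhello")); // hello
console.log(stripBOM("hello"));       // hello
```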

Can I use Unicode codepoints in HTML?

Yes. Use &#x followed by the hex codepoint and a semicolon: &#xE9; produces Γ©. Or use the decimal form: &#233;. Named entities like &eacute; work for some characters but not for emoji or obscure symbols.

What are Private Use Area codepoints?

Three ranges are reserved for application-specific characters: U+E000–U+F8FF (BMP Private Use Area), U+F0000–U+FFFFD (Supplementary Private Use Area A), and U+100000–U+10FFFD (Supplementary Private Use Area B). Custom icon fonts like Font Awesome often use BMP Private Use Area codepoints.

How do I type a Unicode character by codepoint?

On Windows: hold Alt, type + and the hex codepoint on the numpad (requires registry setting). On macOS: press Ctrl+Cmd+Space for the character viewer, or enable the Unicode Hex Input keyboard and type the hex code while holding Option. On Linux: press Ctrl+Shift+U, type the hex code, press Enter. Or use our Unicode Codepoint Decoder to convert and copy.

What is a grapheme cluster?

A grapheme cluster is what a user perceives as a single "character." It may consist of multiple codepoints. For example, the flag emoji πŸ‡ΊπŸ‡Έ is two codepoints (U+1F1FA U+1F1F8) but appears as one symbol. The family emoji πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦ is seven codepoints but one visual unit. When counting characters for user-facing purposes (like a character limit), count grapheme clusters, not codepoints.


Related Tools
