What is a Unicode code point?

A code point is the abstract numeric identity of a character, written like U+0041 for the letter A. It is independent of how the character is stored in bytes. The Unicode code space runs from U+0000 to U+10FFFF, which gives 1,114,112 possible code points; only a fraction of them are currently assigned to characters.

Why does an emoji count as one code point but two UTF-16 units?

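Code points above U+FFFF live in the supplementary ("astral") planes and are stored in UTF-16 as a surrogate pair: two 16-bit units. Most emoji fall in that range. This tool iterates the string by code point, so an emoji that is a single code point gets one row even though the UTF-16 column shows two units. An emoji built from several code points, such as a flag or a sequence joined with zero-width joiners, gets one row per code point.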
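
The surrogate-pair arithmetic can be sketched in a few lines of JavaScript. This is an illustration of the standard UTF-16 rule, not the tool's actual source; the function name toUtf16Units is hypothetical.

```javascript
// Split a code point into the UTF-16 code units that represent it.
// Hypothetical helper for illustration only.
function toUtf16Units(codePoint) {
  if (codePoint <= 0xffff) {
    return [codePoint]; // Basic Multilingual Plane: a single 16-bit unit
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const emoji = "\u{1F600}"; // GRINNING FACE

console.log(emoji.length);      // 2 (UTF-16 units)
console.log([...emoji].length); // 1 (code points)
console.log(toUtf16Units(0x1f600).map((u) => u.toString(16))); // d83d, de00
```

The string's length property counts UTF-16 units, while spreading the string (or using for...of) walks it by code point, which is why the two counts differ.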

What does the general category mean?

Every Unicode character has a two-letter category such as Lu (uppercase letter), Nd (decimal digit), or Sc (currency symbol). It classifies the character's role and is used by regular expressions, word breaking, and validation rules.
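
One possible way to look up a category in JavaScript is to test the character against a Unicode property escape for each of the 30 two-letter categories. The sketch below assumes an engine that supports \p{...} with the u flag; generalCategory and the table names are hypothetical, and the result depends on the Unicode version the engine ships.

```javascript
// The 30 two-letter general categories defined by Unicode.
const CATEGORY_NAMES = [
  "Lu", "Ll", "Lt", "Lm", "Lo",
  "Mn", "Mc", "Me",
  "Nd", "Nl", "No",
  "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
  "Sm", "Sc", "Sk", "So",
  "Zs", "Zl", "Zp",
  "Cc", "Cf", "Cs", "Co", "Cn",
];

// Build one anchored regular expression per category, e.g. /^\p{Lu}$/u.
const CATEGORY_TESTS = CATEGORY_NAMES.map((name) => [
  name,
  new RegExp(`^\\p{${name}}$`, "u"),
]);

// Hypothetical helper: return the category of a single code point.
function generalCategory(codePoint) {
  const char = String.fromCodePoint(codePoint);
  for (const [name, pattern] of CATEGORY_TESTS) {
    if (pattern.test(char)) return name;
  }
  return null; // not expected: every code point has exactly one category
}

console.log(generalCategory(0x41));    // Lu (A)
console.log(generalCategory(0x37));    // Nd (7)
console.log(generalCategory(0x24));    // Sc ($)
console.log(generalCategory(0x1f600)); // So (grinning face emoji)
```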

How is the UTF-8 encoding shown?

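The tool computes the actual UTF-8 byte sequence for each code point. ASCII characters (U+0000 to U+007F), including the unaccented Latin letters, take one byte. Code points from U+0080 to U+07FF, which cover accented Latin, Greek, Cyrillic, Hebrew, and Arabic letters, take two. The rest of the Basic Multilingual Plane, including most Indic scripts and CJK, takes three. Code points above U+FFFF, including most emoji, take four. The bytes are shown in hexadecimal.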
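
As a rough illustration of how UTF-8 bytes can be derived from a code point, here is a minimal sketch of the standard encoding rule. It is not the tool's actual code; utf8Bytes and toHex are hypothetical names, and the sketch assumes the input is a valid code point that is not a lone surrogate.

```javascript
// Encode one code point as UTF-8 bytes (hypothetical helper).
// Assumes 0 <= cp <= 0x10FFFF and cp is not a surrogate (U+D800..U+DFFF).
function utf8Bytes(cp) {
  if (cp <= 0x7f) {
    return [cp]; // 0xxxxxxx
  }
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    return [
      0xe0 | (cp >> 12),          // 1110xxxx
      0x80 | ((cp >> 6) & 0x3f),  // 10xxxxxx
      0x80 | (cp & 0x3f),         // 10xxxxxx
    ];
  }
  return [
    0xf0 | (cp >> 18),            // 11110xxx
    0x80 | ((cp >> 12) & 0x3f),   // 10xxxxxx
    0x80 | ((cp >> 6) & 0x3f),    // 10xxxxxx
    0x80 | (cp & 0x3f),           // 10xxxxxx
  ];
}

// Format bytes as two-digit uppercase hexadecimal.
function toHex(bytes) {
  return bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

console.log(toHex(utf8Bytes(0x41)));    // 41          (A)
console.log(toHex(utf8Bytes(0xe9)));    // C3 A9       (é)
console.log(toHex(utf8Bytes(0x20ac)));  // E2 82 AC    (€)
console.log(toHex(utf8Bytes(0x1f600))); // F0 9F 98 80 (grinning face emoji)
```

In environments that provide TextEncoder, new TextEncoder().encode(text) yields the same bytes and is a convenient cross-check.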

Can it identify the Unicode block?

Yes. The code point is matched against the standard block ranges, so you can see whether a character belongs to Basic Latin, Cyrillic, Arabic, CJK Unified Ideographs, the Emoticons block, and so on.
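
A block lookup of this kind can be sketched as a search over a range table. The table below is a small hand-picked sample, not the full list of blocks and not the tool's own data; BLOCKS and blockOf are hypothetical names.

```javascript
// A few standard block ranges: [first code point, last code point, name].
// This is a sample only; the full standard defines several hundred blocks.
const BLOCKS = [
  [0x0000, 0x007f, "Basic Latin"],
  [0x0080, 0x00ff, "Latin-1 Supplement"],
  [0x0370, 0x03ff, "Greek and Coptic"],
  [0x0400, 0x04ff, "Cyrillic"],
  [0x0600, 0x06ff, "Arabic"],
  [0x4e00, 0x9fff, "CJK Unified Ideographs"],
  [0x1f600, 0x1f64f, "Emoticons"],
];

// Return the block name for a code point, or null if the sample table
// has no matching range.
function blockOf(codePoint) {
  const hit = BLOCKS.find(([first, last]) => codePoint >= first && codePoint <= last);
  return hit ? hit[2] : null;
}

console.log(blockOf("A".codePointAt(0)));      // Basic Latin
console.log(blockOf("\u0430".codePointAt(0))); // Cyrillic (the look-alike letter а)
console.log(blockOf(0x1f600));                 // Emoticons
```

With a complete table, sorting the ranges and using binary search would be a reasonable refinement; a linear scan is enough for a sketch.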

Unicode Code Point Inspector

Mystery characters in text, such as an invisible control character, a look-alike Cyrillic letter, or an emoji that breaks a database column, are easy to misread. This inspector breaks any string into its individual Unicode code points and shows the full identity of each one.

How it works

The tool iterates the string by code point rather than by UTF-16 unit, so emoji and other astral characters are treated as single characters. For each one it reports:

the code point in U+XXXX notation via codePointAt,
the general category (such as Lu, Nd, or So), derived from the browser’s Unicode property escapes like \p{Lu},
the Unicode block, matched against the standard range table,
the UTF-8 bytes, computed directly from the code point, and
the UTF-16 units that make up the JavaScript string.
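
The iteration step described above might look roughly like the following. This is a minimal sketch covering only the code point and UTF-16 columns; the function name inspect and the shape of the returned rows are assumptions, not the tool's real interface.

```javascript
// Format a number as uppercase hex, padded to at least four digits.
function hex4(value) {
  return value.toString(16).toUpperCase().padStart(4, "0");
}

// Hypothetical helper: one row per code point of the input string.
function inspect(text) {
  const rows = [];
  // for...of walks a string by code point, so a surrogate pair
  // arrives as one two-unit string rather than two halves.
  for (const char of text) {
    const utf16 = [];
    for (let i = 0; i < char.length; i++) {
      utf16.push(hex4(char.charCodeAt(i)));
    }
    rows.push({
      char,
      codePoint: "U+" + hex4(char.codePointAt(0)),
      utf16,
    });
  }
  return rows;
}

console.log(inspect("A\u{1F600}"));
// Row 1: char "A", codePoint "U+0041", utf16 ["0041"]
// Row 2: the emoji, codePoint "U+1F600", utf16 ["D83D", "DE00"]
```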

Tips and notes

Use the UTF-8 column to debug encoding bugs: when one expected character shows up as two or three unrelated rows (for example Ã followed by © where é was intended), the text was probably double-encoded, meaning its UTF-8 bytes were misread as Latin-1 or Windows-1252 and then encoded as UTF-8 again. The category column helps when writing regular expressions, since \p{Nd} matches any decimal digit across scripts, not just 0-9. Watch for control characters (category Cc), which display as a ctrl marker here because they have no visible glyph but can still corrupt files and break parsers.
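
The two regular-expression tips can be tried out with a short sketch. It assumes an engine that supports Unicode property escapes with the u flag; findControlChars is a hypothetical helper name.

```javascript
// \p{Nd} matches decimal digits in any script; [0-9] matches ASCII only.
const asciiDigits = /^[0-9]+$/;
const anyDigits = /^\p{Nd}+$/u;

const arabicIndic = "\u0661\u0662\u0663"; // Arabic-Indic digits one, two, three

console.log(asciiDigits.test("123"));       // true
console.log(asciiDigits.test(arabicIndic)); // false
console.log(anyDigits.test(arabicIndic));   // true

// Hypothetical helper: list control characters (category Cc) in a string,
// with their position (in UTF-16 units) and code point.
function findControlChars(text) {
  return [...text.matchAll(/\p{Cc}/gu)].map((match) => ({
    index: match.index,
    codePoint: "U+" + match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  }));
}

console.log(findControlChars("a\u0000b\tc"));
// Two hits: index 1 (U+0000) and index 3 (U+0009)
```

Whether to accept non-ASCII digits is a validation decision: \p{Nd} is appropriate for recognizing digits in user-facing text, while [0-9] is safer when the value must later be parsed as an ASCII number.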