
🔬 String Inspector

Analyse any string: length, bytes, Unicode codepoints, frequency analysis

Complete Guide

📊 Key Data Points

U+XXXX notation

Unicode code points written as U+ followed by 4-6 uppercase hex digits

UTF-8 vs UTF-16

JavaScript strings are UTF-16 internally: most emoji take 2 code units (4 bytes) in JavaScript

Zero-width chars

Invisible characters that cause string comparison and display bugs

String Inspector: Unicode and Byte Analysis Guide

Encoding bugs are among the hardest to debug because strings look correct visually. A string containing a zero-width space looks identical to one without it, yet the two fail an equality comparison. A letter followed by a combining accent renders the same as its precomposed equivalent but is stored as different bytes. This inspector makes every character visible.

Runs entirely in your browser — no data transmitted.

For encoding operations, pair with Character Encoder and URL Encoder.

🔬 How This Calculator Works

The inspector analyzes a string at the character level. It shows each character with its Unicode code point (U+XXXX), UTF-8 byte count, HTML entity equivalent, and category (letter, digit, punctuation, space, control, emoji). It also reports the total byte length in UTF-8, UTF-16, and UTF-32. This is useful for debugging encoding issues, unexpected whitespace, zero-width characters, and invisible Unicode control characters.
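A per-character breakdown like the one described above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's actual implementation: the `inspect` helper and its simplified category names are assumptions made for this example.

```javascript
// Hypothetical helper: describe every code point in a string.
// The category grouping is a simplification chosen for this sketch.
function inspect(str) {
  const encoder = new TextEncoder(); // always encodes to UTF-8
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(str, (ch) => {
    const cp = ch.codePointAt(0);
    const hex = cp.toString(16).toUpperCase();
    let category = "other";
    if (/\p{Extended_Pictographic}/u.test(ch)) category = "emoji";
    else if (/\p{L}/u.test(ch)) category = "letter";
    else if (/\p{Nd}/u.test(ch)) category = "digit";
    else if (/\p{P}/u.test(ch)) category = "punctuation";
    else if (/\p{Z}/u.test(ch)) category = "space";
    else if (/\p{Cc}/u.test(ch)) category = "control";
    else if (/\p{Cf}/u.test(ch)) category = "format";
    return {
      char: ch,
      codePoint: "U+" + hex.padStart(4, "0"),
      utf8Bytes: encoder.encode(ch).length,
      htmlEntity: `&#x${hex};`,
      category,
    };
  });
}

// "a", an invisible zero-width space, then "1"
console.log(inspect("a\u200B1"));
```

The zero-width space in the middle is reported as `U+200B` with category `format`, which is how an otherwise invisible character becomes visible.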

✅ What You Can Calculate

Character-level Unicode breakdown

Shows each character with its Unicode code point, UTF-8 byte sequence, HTML entity, and category — makes invisible characters visible.

Multiple encoding byte counts

Shows string length in UTF-8, UTF-16, and UTF-32 bytes separately. JavaScript strings are UTF-16 internally: most emoji take 2 code units (4 bytes).

Zero-width character detection

Zero-width space (U+200B), zero-width non-joiner (U+200C), and left-to-right mark (U+200E) are invisible in most editors but cause string comparison failures.

Normalization form detection

Shows whether e-with-accent is a precomposed character (U+00E9) or decomposed (U+0065 + U+0301) — visually identical but different byte sequences.
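The normalization difference is easy to reproduce with the standard `String.prototype.normalize` method. The `normalizationForm` helper below is a hypothetical name used only for this sketch.

```javascript
const precomposed = "\u00E9"; // é as a single code point
const decomposed = "e\u0301"; // e followed by a combining acute accent

console.log(precomposed === decomposed); // false
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true

// Hypothetical helper: report which normalization form a string is already in.
function normalizationForm(str) {
  const isNFC = str === str.normalize("NFC");
  const isNFD = str === str.normalize("NFD");
  if (isNFC && isNFD) return "NFC and NFD";
  if (isNFC) return "NFC";
  if (isNFD) return "NFD";
  return "neither";
}

console.log(normalizationForm(precomposed)); // NFC
console.log(normalizationForm(decomposed));  // NFD
```

Normalizing both sides to the same form before comparing is the usual way to make visually identical strings compare equal.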

🎯 Real Scenarios & Use Cases

Debugging encoding issues

A string looks correct but fails comparison. The inspector reveals hidden characters: zero-width spaces, Unicode normalization differences, and invisible control characters.

Database byte limit verification

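Whether VARCHAR(255) means 255 characters or 255 bytes depends on the database and column settings. Where the limit is in bytes, a string can overflow well before 255 characters: in UTF-8, most emoji use 4 bytes each, so the worst case is 4 bytes per code point. Check the UTF-8 byte size of your Unicode strings here.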

PDF text extraction debugging

Text copied from PDF files sometimes includes control characters (backspace, form feed) that cause unexpected behavior. The inspector shows these invisible characters explicitly.

API key character verification

Verify that a pasted API key contains only the expected alphanumeric characters — no invisible Unicode that looks correct but fails authentication.
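For the API key scenario, an allow-list check is one possible approach. The sketch below assumes keys consist of ASCII letters, digits, hyphens, and underscores; both that allow-list and the `findUnexpectedChars` name are assumptions for illustration, so adjust the pattern to your provider's actual key format.

```javascript
// Hypothetical helper: list every character that falls outside an allow-list.
// The default pattern (ASCII letters, digits, "-" and "_") is an assumption.
function findUnexpectedChars(key, allowed = /^[A-Za-z0-9_-]$/) {
  const problems = [];
  let index = 0; // UTF-16 offset of the current character
  for (const ch of key) {
    if (!allowed.test(ch)) {
      const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
      problems.push({ index, codePoint: "U+" + hex });
    }
    index += ch.length;
  }
  return problems;
}

// A key with a zero-width space pasted into the middle
console.log(findUnexpectedChars("abc\u200B123"));
// → [ { index: 3, codePoint: 'U+200B' } ]
```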

💡 Pro Tips for Accurate Results

Zero-width characters are invisible. Zero-width space (U+200B) and zero-width non-joiner (U+200C) appear invisible in most editors but cause string comparison failures. The inspector reveals them.

UTF-8 vs UTF-16 byte length. JavaScript strings are UTF-16 internally. Most emoji take 2 UTF-16 code units (4 bytes). If you store strings in a database with a byte limit, check the UTF-8 byte length here.

Normalization forms. e-with-accent can be a single precomposed character (U+00E9) or the letter e followed by a combining accent (U+0065 + U+0301) — visually identical but different byte sequences.

Control characters in pasted text. Text copied from PDF files sometimes includes control characters (backspace, form feed) that cause unexpected behavior.
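The byte lengths mentioned in these tips can be computed directly in the browser or in Node.js. A minimal sketch, assuming no byte order mark is written; `byteLengths` is a hypothetical helper name.

```javascript
// Hypothetical helper: byte length of a string in three encodings (no BOM).
function byteLengths(str) {
  return {
    utf8: new TextEncoder().encode(str).length, // TextEncoder always emits UTF-8
    utf16: str.length * 2,                      // 2 bytes per UTF-16 code unit
    utf32: Array.from(str).length * 4,          // 4 bytes per code point
  };
}

// "héllo 🚀" written with escapes so the source file's encoding does not matter
console.log(byteLengths("h\u00E9llo \u{1F680}"));
// → { utf8: 11, utf16: 16, utf32: 28 }
```

The same seven visible characters take 11, 16, or 28 bytes depending on the encoding, which is why a byte limit should always be checked against the encoding the storage layer actually uses.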

🏁 Bottom Line

String encoding issues are among the hardest bugs to debug because the characters look correct visually. This inspector makes every character visible at the byte level. For encoding operations, use Character Encoder and URL Encoder.

What information does the string inspector show for each character?

For each character in the input, this tool shows: the character itself, its Unicode code point (U+XXXX notation), the character name (from the Unicode database), the UTF-8 byte sequence (how many bytes this character occupies in UTF-8), the UTF-16 code unit(s), the decimal ASCII value (for ASCII characters), and the character category (letter, digit, punctuation, symbol, whitespace, control). This is invaluable for debugging encoding issues, invisible characters, and unexpected whitespace that causes string comparison failures.

Why does JavaScript report different string lengths for emoji?

JavaScript strings use UTF-16 internally. Characters in the Basic Multilingual Plane (U+0000 to U+FFFF) use one UTF-16 code unit — string.length counts 1. Characters above U+FFFF (most emoji: rocket U+1F680, face U+1F600) use two UTF-16 code units (a surrogate pair) — string.length counts 2. So '🚀'.length === 2 in JavaScript, not 1. For accurate character counting: use Array.from(str).length or [...str].length which counts Unicode code points rather than UTF-16 code units. Some combined emoji (family, skin tone variations) consist of multiple code points joined by ZWJ (Zero Width Joiner) and count even higher.
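The difference between code units and code points can be checked in any JavaScript console. The `countGraphemes` helper is a hypothetical addition that relies on `Intl.Segmenter`, which is not available in every runtime, so it is guarded and its result depends on the runtime's Unicode data.

```javascript
const rocket = "\u{1F680}"; // 🚀
console.log(rocket.length);      // 2 (UTF-16 code units)
console.log([...rocket].length); // 1 (code points)

// Family emoji: man + ZWJ + woman + ZWJ + girl
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(family.length);      // 8 (UTF-16 code units)
console.log([...family].length); // 5 (code points)

// Hypothetical helper: count user-perceived characters where supported.
function countGraphemes(str) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return undefined; // runtime has no grapheme segmentation
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

console.log(countGraphemes(family));
```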

How do I find invisible characters causing string comparison failures?

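Common invisible characters that cause === comparisons to fail despite strings looking identical: U+200B (Zero Width Space), U+00A0 (Non-Breaking Space, which looks like a regular space but is a different character), U+FEFF (Byte Order Mark, often at the start of files), U+200D (Zero Width Joiner, used in emoji sequences), U+200C (Zero Width Non-Joiner). Paste your string into this inspector and examine each position: invisible characters show their code point even though they display as blank. Removal: str.replace(/[\u200B\u200C\uFEFF]/g, '') strips the zero-width characters. Replace U+00A0 with a regular space rather than deleting it (str.replace(/\u00A0/g, ' ')) so that words do not run together, and leave U+200D in place if the text may contain emoji sequences. For ASCII-only cleanup, str.replace(/[^\x20-\x7E]/g, '') is broader, but it also deletes tabs, newlines, and every non-ASCII letter.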
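The search for these characters can also be scripted. The sketch below covers only the five code points listed above; the `INVISIBLE` table and both helper names are assumptions for this example, and a real cleanup may need a longer list.

```javascript
// Code points from the list above, with their common names.
const INVISIBLE = new Map([
  [0x200b, "ZERO WIDTH SPACE"],
  [0x200c, "ZERO WIDTH NON-JOINER"],
  [0x200d, "ZERO WIDTH JOINER"],
  [0x00a0, "NO-BREAK SPACE"],
  [0xfeff, "BYTE ORDER MARK"],
]);

// Hypothetical helper: report the position and name of each invisible character.
function findInvisible(str) {
  const hits = [];
  for (let i = 0; i < str.length; i++) {
    const name = INVISIBLE.get(str.charCodeAt(i));
    if (name) hits.push({ index: i, name });
  }
  return hits;
}

// Hypothetical helper: conservative cleanup.
// NBSP becomes a regular space; ZWJ is kept so emoji sequences survive.
function cleanInvisible(str) {
  return str.replace(/\u00A0/g, " ").replace(/[\u200B\u200C\uFEFF]/g, "");
}

console.log(findInvisible("a\u200Bb"));
// → [ { index: 1, name: 'ZERO WIDTH SPACE' } ]
console.log(cleanInvisible("\uFEFFfoo\u00A0bar\u200B") === "foo bar"); // true
```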

What is the difference between Unicode code points and UTF-8 encoding?

A Unicode code point is a number identifying a character (U+0041 = Latin letter A, U+1F680 = Rocket). It is abstract — not yet stored bytes. UTF-8 is an encoding scheme that converts code points to actual bytes for storage and transmission. ASCII characters (U+0000 to U+007F) use 1 byte in UTF-8. Latin extended, Greek, Cyrillic (U+0080 to U+07FF) use 2 bytes. Most East Asian characters (U+0800 to U+FFFF) use 3 bytes. Emoji and less common symbols (U+10000+) use 4 bytes. The string inspector shows both the code point (abstract) and UTF-8 bytes (concrete storage representation).
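The range-to-byte-count mapping above translates into a short function. In this sketch, `utf8Length` and `utf8Hex` are hypothetical names, and `TextEncoder` is used to show the actual bytes alongside the predicted count.

```javascript
// Number of UTF-8 bytes needed for a code point, following the ranges above.
function utf8Length(codePoint) {
  if (codePoint <= 0x7f) return 1;   // ASCII
  if (codePoint <= 0x7ff) return 2;  // Latin extended, Greek, Cyrillic
  if (codePoint <= 0xffff) return 3; // rest of the Basic Multilingual Plane
  return 4;                          // U+10000 and above
}

// Hex dump of the UTF-8 bytes of a string.
function utf8Hex(str) {
  const bytes = new TextEncoder().encode(str);
  return Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

for (const ch of ["A", "\u00E9", "\u4E2D", "\u{1F680}"]) {
  console.log(ch, utf8Length(ch.codePointAt(0)), utf8Hex(ch));
}
// A 1 41
// é 2 C3 A9
// 中 3 E4 B8 AD
// 🚀 4 F0 9F 9A 80
```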

What are control characters and why do they appear in strings?

Control characters are non-printable characters in the ASCII range 0-31 and 127. Common ones: 0 = NUL (C string terminator), 7 = BEL (terminal bell), 8 = BS (backspace), 9 = TAB (horizontal tab), 10 = LF (line feed / Unix newline), 13 = CR (carriage return / Windows CRLF pair), 27 = ESC (terminal escape sequences). They appear in strings from: copy-pasting from terminal output (ANSI escape codes), reading files with mixed line endings (CRLF showing \r at end of lines in Unix), data from legacy systems, and user input from certain mobile keyboards.
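One way to make control characters visible in your own code is to replace each one with a bracketed label. The `CONTROL_NAMES` table covers only the codes named above plus DEL, and `showControls` is a hypothetical helper; other control codes fall back to a hex label.

```javascript
// Labels for the control codes listed above (plus 127 = DEL).
const CONTROL_NAMES = {
  0: "NUL", 7: "BEL", 8: "BS", 9: "TAB", 10: "LF", 13: "CR", 27: "ESC", 127: "DEL",
};

// Hypothetical helper: replace ASCII control characters with visible labels.
function showControls(str) {
  return str.replace(/[\x00-\x1F\x7F]/g, (ch) => {
    const code = ch.charCodeAt(0);
    const hex = "0x" + code.toString(16).toUpperCase().padStart(2, "0");
    return `<${CONTROL_NAMES[code] ?? hex}>`;
  });
}

// A CRLF line ending followed by an ANSI colour escape sequence
console.log(showControls("line one\r\n\x1B[31mred"));
// → line one<CR><LF><ESC>[31mred
```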

How do I detect and remove all whitespace-like characters?

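Standard whitespace: \s in regex matches space (0x20), tab (0x09), newline (0x0A), carriage return (0x0D), form feed (0x0C), vertical tab (0x0B). Whether \s also matches Unicode whitespace depends on the engine. In JavaScript it does: \s matches U+00A0 non-breaking space, U+2000-U+200A various width spaces, U+2028 line separator, U+2029 paragraph separator, U+3000 ideographic space, and U+FEFF. In Java and PCRE, \s is ASCII-only unless a Unicode mode is enabled. Zero-width characters such as U+200B are format characters, not whitespace, so neither \s nor \p{Z} matches them. For comprehensive removal: use a Unicode-aware regex with the \p{Z} property where it is supported (JavaScript with the u flag, Java, PCRE; Python's built-in re module does not support \p{...}, though the third-party regex module does), and explicitly list any zero-width code points to remove. The Regex Tester on this site can test whitespace detection patterns.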
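The behaviour of `\s` and `\p{Z}` in JavaScript can be checked directly. The `stripWhitespaceLike` helper is a hypothetical name, and its list of zero-width code points is an assumption you may want to extend or shorten.

```javascript
console.log(/\s/.test("\u00A0"));     // true  (non-breaking space)
console.log(/\s/.test("\u3000"));     // true  (ideographic space)
console.log(/\s/.test("\u200B"));     // false (zero-width space is not whitespace)
console.log(/\p{Z}/u.test("\u200B")); // false (it is a format character)
console.log(/\p{Z}/u.test("\t"));     // false (tab is a control character)

// Hypothetical helper: remove whitespace plus common zero-width characters.
// Caution: stripping U+200D (ZWJ) splits joined emoji sequences into parts.
function stripWhitespaceLike(str) {
  return str.replace(/[\s\p{Z}\u200B-\u200D\u2060\uFEFF]/gu, "");
}

console.log(stripWhitespaceLike(" a\u00A0b\u3000c\u200Bd\t\n")); // abcd
```

Combining `\s` with `\p{Z}` and an explicit zero-width list covers control-character whitespace, separator characters, and format characters, which no single class matches on its own.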

What other text and encoding tools are on this site?

The Character Encoder looks up entity references and encoding values for specific characters. The Binary to Text Converter shows the binary representation of text. The HTML Encoder escapes characters for safe HTML insertion. The Regex Tester finds specific Unicode characters or character ranges in text. The Duplicate Remover and Line Sorter work with text that has been cleaned using the inspector's findings. All are in the Dev Tools section.