🔬 String Inspector
Analyse any string: length, bytes, Unicode codepoints, frequency analysis
Characters: 26 (Unicode-aware)
JS .length: 27 (UTF-16 units)
UTF-8 Bytes: 38
Words: 5 (whitespace split)
Lines: 1 (newline count)
Unique Chars: 18 (distinct)
Non-ASCII: 10 (Unicode chars)
UTF-16 Bytes: 54 (estimated)
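As a rough illustration of how summary figures like these can be derived, here is a minimal sketch. The function name `summarize` and the exact counting rules (for example, lines as newline count plus one) are assumptions, not the tool's actual implementation.

```javascript
// Hypothetical helper: summary statistics similar to the panel above.
// Assumes a runtime with a global TextEncoder (modern browsers, Node.js).
function summarize(str) {
  const codePoints = Array.from(str); // iterates by code point, not UTF-16 unit
  const trimmed = str.trim();
  return {
    characters: codePoints.length,                       // Unicode code points
    jsLength: str.length,                                // UTF-16 code units
    utf8Bytes: new TextEncoder().encode(str).length,     // encoded size in UTF-8
    utf16Bytes: str.length * 2,                          // 2 bytes per code unit
    words: trimmed === '' ? 0 : trimmed.split(/\s+/).length,
    lines: str.split('\n').length,                       // newline count + 1 (assumed rule)
    uniqueChars: new Set(codePoints).size,
    nonAscii: codePoints.filter((ch) => ch.codePointAt(0) > 0x7f).length,
  };
}

console.log(summarize('h\u00E9llo \u{1F680}'));
```

The gap between `characters` and `jsLength` is the number of characters stored as surrogate pairs.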
📊 Key Data Points
U+XXXX notation
Unicode code points written as U+ followed by 4-6 uppercase hex digits
UTF-8 vs UTF-16
JavaScript strings are UTF-16 internally; most emoji (those above U+FFFF) use 2 code units (4 bytes)
Zero-width chars
Invisible characters that cause string comparison and display bugs
String Inspector: Unicode and Byte Analysis (Complete USA Guide 2026)
Encoding bugs are among the hardest to debug because the strings look correct on screen. A string containing a zero-width space looks identical to one without it, yet the two fail an equality comparison. A string with a combining accent has different bytes than its precomposed equivalent. This inspector makes every character visible.
Runs entirely in your browser — no data transmitted.
For encoding operations, pair with Character Encoder and URL Encoder.
🔬 How This Inspector Works
Analyzes a string at the character level, showing each character with its Unicode code point (U+XXXX), UTF-8 byte count, HTML entity equivalent, and category (letter, digit, punctuation, space, control, emoji). It also reports the total byte length in UTF-8, UTF-16, and UTF-32. Useful for debugging encoding issues, unexpected whitespace, zero-width characters, and invisible Unicode control characters.
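A per-character breakdown of this kind can be sketched in a few lines of JavaScript. This is an illustrative approximation, not the tool's source: the names `inspect` and `categorize` are hypothetical, the extra `format` category (for characters such as U+200B) is an assumption, and the category tests rely on Unicode property escapes being available in the regex engine.

```javascript
// Hypothetical sketch of a per-character breakdown; not the tool's actual code.
const encoder = new TextEncoder();

// Map a single character to a coarse category using Unicode property escapes.
function categorize(ch) {
  if (/\p{L}/u.test(ch)) return 'letter';
  if (/\p{Nd}/u.test(ch)) return 'digit';
  if (/\p{Cc}/u.test(ch)) return 'control';
  if (/\p{Cf}/u.test(ch)) return 'format'; // assumed extra bucket: ZWSP, ZWJ, BOM
  if (/\p{Z}/u.test(ch)) return 'space';
  if (/\p{P}/u.test(ch)) return 'punctuation';
  if (/\p{Emoji_Presentation}/u.test(ch)) return 'emoji';
  return 'other';
}

// One record per code point: the character, U+XXXX notation, UTF-8 size, category.
function inspect(str) {
  return Array.from(str, (ch) => {
    const cp = ch.codePointAt(0);
    return {
      char: ch,
      codePoint: 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'),
      utf8Bytes: encoder.encode(ch).length,
      category: categorize(ch),
    };
  });
}

console.table(inspect('A\u200B\u{1F680}'));
```

Iterating with `Array.from` rather than indexing keeps surrogate pairs together, so an emoji appears as one row instead of two broken halves.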
✅ What You Can Inspect
Character-level Unicode breakdown
Shows each character with its Unicode code point, UTF-8 byte sequence, HTML entity, and category — makes invisible characters visible.
Multiple encoding byte counts
Shows string length in UTF-8, UTF-16, and UTF-32 bytes separately. JavaScript strings are UTF-16 internally, so most emoji (those above U+FFFF) use 2 code units (4 bytes).
Zero-width character detection
Zero-width space (U+200B), zero-width non-joiner (U+200C), and left-to-right mark (U+200E) are invisible in most editors but cause string comparison failures.
Normalization form detection
Shows whether é is a precomposed character (U+00E9, as in NFC) or decomposed (U+0065 + U+0301, as in NFD): visually identical but different byte sequences.
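The normalization difference is easy to reproduce with the standard `String.prototype.normalize` method. The helper `normalizationForms` below is a hypothetical name used only for illustration.

```javascript
const precomposed = '\u00E9'; // é as a single code point
const decomposed = 'e\u0301'; // e followed by a combining acute accent

console.log(precomposed === decomposed);                  // false
console.log(precomposed.normalize('NFD') === decomposed); // true
console.log(decomposed.normalize('NFC') === precomposed); // true

// Hypothetical helper: report which normalization forms a string is already in.
function normalizationForms(str) {
  return {
    nfc: str === str.normalize('NFC'),
    nfd: str === str.normalize('NFD'),
  };
}

console.log(normalizationForms(decomposed)); // { nfc: false, nfd: true }
```

Normalizing both sides to the same form before comparing is usually enough to make visually identical strings compare equal.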
🎯 Real Scenarios & Use Cases
Debugging encoding issues
A string looks correct but fails comparison. The inspector reveals hidden characters: zero-width spaces, Unicode normalization differences, and invisible control characters.
Database byte limit verification
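Whether a VARCHAR(255) limit counts bytes or characters depends on the database: SQL Server's varchar(n) and Oracle's default VARCHAR2 semantics count bytes, while MySQL and PostgreSQL count characters. Where the limit is in bytes, an emoji above U+FFFF takes 4 UTF-8 bytes, so far fewer than 255 characters may fit. Calculate the worst-case byte size of your Unicode strings here.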
PDF text extraction debugging
Text copied from PDF files sometimes includes control characters (backspace, form feed) that cause unexpected behavior. The inspector shows these invisible characters explicitly.
API key character verification
Verify that a pasted API key contains only the expected alphanumeric characters — no invisible Unicode that looks correct but fails authentication.
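Two of these scenarios, the byte-limit check and the API key check, can be approximated in code. This is a minimal sketch under stated assumptions: the function names are hypothetical, the limit is assumed to be in UTF-8 bytes, and the key is assumed to be strictly ASCII alphanumeric.

```javascript
// Hypothetical helpers; the names and the alphanumeric-only rule are assumptions.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

// True if the string fits in a column whose limit is measured in UTF-8 bytes.
function fitsByteLimit(str, maxBytes) {
  return utf8ByteLength(str) <= maxBytes;
}

// List every character outside the allowed set, with its code point index.
function unexpectedChars(key, allowed = /[A-Za-z0-9]/) {
  return Array.from(key).flatMap((ch, index) =>
    allowed.test(ch)
      ? []
      : [{
          index,
          codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
        }]
  );
}

console.log(fitsByteLimit('\u{1F680}'.repeat(64), 255)); // false: 64 emoji need 256 bytes
console.log(unexpectedChars('abc\u200B123'));            // [ { index: 3, codePoint: 'U+200B' } ]
```

An empty array from `unexpectedChars` means the pasted key contains nothing outside the expected set.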
💡 Pro Tips for Accurate Results
Zero-width characters. Zero-width space (U+200B) and zero-width non-joiner (U+200C) are invisible in most editors but cause string comparison failures. The inspector reveals them.
UTF-8 vs UTF-16 byte length. JavaScript strings are UTF-16 internally. Most emoji (those above U+FFFF) use 2 UTF-16 code units (4 bytes). If you store strings in a database with a byte limit, check the UTF-8 byte length here.
Normalization forms. e-with-accent can be a single precomposed character (U+00E9) or the letter e followed by a combining accent (U+0065 + U+0301) — visually identical but different byte sequences.
Control characters in pasted text. Text copied from PDF files sometimes includes control characters (backspace, form feed) that cause unexpected behavior.
🏁 Bottom Line
String encoding issues are among the hardest bugs to debug because the characters look correct visually. This inspector makes every character visible at the byte level. For encoding operations, use the Character Encoder and URL Encoder.
What information does the string inspector show for each character?
For each character in the input, this tool shows: the character itself, its Unicode code point (U+XXXX notation), the character name (from the Unicode database), the UTF-8 byte sequence (how many bytes this character occupies in UTF-8), the UTF-16 code unit(s), the decimal ASCII value (for ASCII characters), and the character category (letter, digit, punctuation, symbol, whitespace, control). This is invaluable for debugging encoding issues, invisible characters, and unexpected whitespace that causes string comparison failures.
Why does JavaScript report different string lengths for emoji?
JavaScript strings use UTF-16 internally. Characters in the Basic Multilingual Plane (U+0000 to U+FFFF) use one UTF-16 code unit, so string.length counts 1. Characters above U+FFFF (most emoji: rocket U+1F680, face U+1F600) use two UTF-16 code units (a surrogate pair), so string.length counts 2. So '🚀'.length === 2 in JavaScript, not 1. To count code points instead, use Array.from(str).length or [...str].length, which count Unicode code points rather than UTF-16 code units. Some combined emoji (family, skin tone variations) consist of multiple code points joined by ZWJ (Zero Width Joiner) and count even higher; to count those as one user-perceived character, segment the string by grapheme cluster, for example with Intl.Segmenter.
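The three ways of counting can be compared side by side. The helper name `lengths` is hypothetical, and the grapheme count is only filled in when the runtime provides `Intl.Segmenter`, which older environments may lack.

```javascript
// Hypothetical helper comparing three notions of "length".
function lengths(str) {
  const result = {
    utf16Units: str.length,               // what .length reports
    codePoints: Array.from(str).length,   // surrogate pairs count once
  };
  // Grapheme clusters approximate user-perceived characters.
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    result.graphemes = Array.from(segmenter.segment(str)).length;
  }
  return result;
}

const rocket = '\u{1F680}';
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'; // man + ZWJ + woman + ZWJ + girl

console.log(lengths(rocket)); // utf16Units: 2, codePoints: 1
console.log(lengths(family)); // utf16Units: 8, codePoints: 5
```

Which count is "right" depends on the purpose: storage limits care about code units or bytes, while cursor movement and display limits care about graphemes.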
How do I find invisible characters causing string comparison failures?
Common invisible characters that cause === comparisons to fail despite strings looking identical: U+200B (Zero Width Space), U+00A0 (Non-Breaking Space, which looks like a space but is a different character), U+FEFF (BOM, the Byte Order Mark, often at the start of files), U+200D (Zero Width Joiner, used in emoji sequences), U+200C (Zero Width Non-Joiner). Paste your string into this inspector and examine each position: invisible characters show their code point even though they display as blank. Removal: str.replace(/[\u200B\uFEFF]/g, '').replace(/\u00A0/g, ' ') strips the zero-width characters and turns non-breaking spaces into ordinary spaces so that words stay separated. For ASCII-only cleanup, str.replace(/[^\x20-\x7E]/g, '') is broader, but it also deletes tabs, newlines, and every non-ASCII letter.
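To locate these characters programmatically rather than by eye, a small lookup table is enough. The names `INVISIBLE`, `findInvisible`, and `clean` are hypothetical, and the choice to leave ZWJ and ZWNJ in place during cleanup is an assumption made here because removing them can break emoji sequences and some scripts.

```javascript
// Hypothetical lookup of the invisible characters listed above (all in the BMP).
const INVISIBLE = new Map([
  [0x200b, 'ZERO WIDTH SPACE'],
  [0x200c, 'ZERO WIDTH NON-JOINER'],
  [0x200d, 'ZERO WIDTH JOINER'],
  [0x00a0, 'NO-BREAK SPACE'],
  [0xfeff, 'ZERO WIDTH NO-BREAK SPACE (BOM)'],
]);

// Report the UTF-16 index and name of each invisible character found.
function findInvisible(str) {
  const hits = [];
  for (let i = 0; i < str.length; i++) {
    const name = INVISIBLE.get(str.charCodeAt(i));
    if (name) hits.push({ index: i, name });
  }
  return hits;
}

// Conservative cleanup: drop ZWSP and BOM, turn NBSP into a plain space,
// and leave ZWJ/ZWNJ untouched (assumed policy).
function clean(str) {
  return str.replace(/[\u200B\uFEFF]/g, '').replace(/\u00A0/g, ' ');
}

console.log(findInvisible('a\u200Bb\u00A0c'));
console.log(clean('\uFEFFa\u200Bb\u00A0c') === 'ab c'); // true
```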
What is the difference between Unicode code points and UTF-8 encoding?
A Unicode code point is a number identifying a character (U+0041 = Latin letter A, U+1F680 = Rocket). It is abstract — not yet stored bytes. UTF-8 is an encoding scheme that converts code points to actual bytes for storage and transmission. ASCII characters (U+0000 to U+007F) use 1 byte in UTF-8. Latin extended, Greek, Cyrillic (U+0080 to U+07FF) use 2 bytes. Most East Asian characters (U+0800 to U+FFFF) use 3 bytes. Emoji and less common symbols (U+10000+) use 4 bytes. The string inspector shows both the code point (abstract) and UTF-8 bytes (concrete storage representation).
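The range table above translates directly into code. In this sketch, `utf8Length` applies the ranges by hand and `utf8Hex` asks the platform's `TextEncoder` for the real bytes; both function names are hypothetical.

```javascript
// Number of UTF-8 bytes for one code point, following the ranges above.
function utf8Length(codePoint) {
  if (codePoint <= 0x7f) return 1;   // ASCII
  if (codePoint <= 0x7ff) return 2;  // Latin extended, Greek, Cyrillic
  if (codePoint <= 0xffff) return 3; // rest of the BMP
  return 4;                          // U+10000 and above
}

// Actual UTF-8 bytes of a string as space-separated uppercase hex.
function utf8Hex(str) {
  return Array.from(new TextEncoder().encode(str), (b) =>
    b.toString(16).toUpperCase().padStart(2, '0')
  ).join(' ');
}

for (const ch of ['A', '\u00E9', '\u20AC', '\u{1F680}']) {
  console.log(ch, utf8Length(ch.codePointAt(0)), utf8Hex(ch));
}
// A  1  41
// é  2  C3 A9
// €  3  E2 82 AC
// 🚀 4  F0 9F 9A 80
```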
What are control characters and why do they appear in strings?
Control characters are non-printable characters in the ASCII range 0-31 and 127. Common ones: 0 = NUL (C string terminator), 7 = BEL (terminal bell), 8 = BS (backspace), 9 = TAB (horizontal tab), 10 = LF (line feed / Unix newline), 13 = CR (carriage return / Windows CRLF pair), 27 = ESC (terminal escape sequences). They appear in strings from: copy-pasting from terminal output (ANSI escape codes), reading files with mixed line endings (CRLF showing \r at end of lines in Unix), data from legacy systems, and user input from certain mobile keyboards.
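A scan for these characters needs only the numeric ranges given above. The helper `findControlChars` and its name table are hypothetical; the table covers just the common codes listed in this answer plus DEL, and anything else is labeled generically.

```javascript
// Names for the common control codes mentioned above (partial table).
const CONTROL_NAMES = {
  0: 'NUL', 7: 'BEL', 8: 'BS', 9: 'TAB', 10: 'LF', 13: 'CR', 27: 'ESC', 127: 'DEL',
};

// Report every ASCII control character (0-31 and 127) with its position.
function findControlChars(str) {
  const hits = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code <= 31 || code === 127) {
      hits.push({ index: i, code, name: CONTROL_NAMES[code] || 'control' });
    }
  }
  return hits;
}

// A line with a Windows line ending followed by an ANSI reset sequence.
console.log(findControlChars('ok\r\n\x1b[0m'));
// finds CR at index 2, LF at index 3, ESC at index 4
```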
How do I detect and remove all whitespace-like characters?
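ASCII whitespace: space (0x20), tab (0x09), newline (0x0A), carriage return (0x0D), form feed (0x0C), vertical tab (0x0B). Unicode whitespace: U+00A0 non-breaking space, U+2000-U+200A various width spaces, U+2028 line separator, U+2029 paragraph separator, U+3000 ideographic space. Whether \s matches the Unicode ones depends on the regex engine: in JavaScript, \s already matches all of the above plus U+FEFF, while Java and PCRE match only ASCII whitespace unless Unicode mode is enabled (UNICODE_CHARACTER_CLASS in Java, UCP in PCRE). In no engine does \s match zero-width characters such as U+200B. For comprehensive removal: combine \s with the \p{Z} separator property where it is supported (JavaScript with the u flag, Java, PCRE, and the third-party regex module for Python; the built-in re module does not support \p{...}), and explicitly list any zero-width code points to remove. Note that \p{Z} alone excludes tab and newline. The Regex Tester on this site can test whitespace detection patterns.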
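In JavaScript specifically, the behaviour can be checked directly. The function `stripWhitespaceLike` is a hypothetical name, and the set of zero-width code points it removes (U+200B to U+200D, U+2060, U+FEFF) is an assumed policy; removing ZWJ will break emoji sequences, so narrow the class if that matters.

```javascript
// What JavaScript's \s and \p{Z} do and do not match.
console.log(/\s/.test('\u00A0'));    // true: \s covers non-breaking space in JS
console.log(/\s/.test('\u3000'));    // true: ideographic space
console.log(/\s/.test('\u200B'));    // false: zero-width space is not whitespace
console.log(/\p{Z}/u.test('\t'));    // false: tab is a control character, not a separator

// Hypothetical helper: remove whitespace plus common zero-width characters.
function stripWhitespaceLike(str) {
  return str.replace(/[\s\p{Z}\u200B-\u200D\u2060\uFEFF]/gu, '');
}

console.log(stripWhitespaceLike('a\u00A0b\u3000c\u200Bd\te')); // 'abcde'
```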
What other text and encoding tools are on this site?
The Character Encoder looks up entity references and encoding values for specific characters. The Binary to Text Converter shows the binary representation of text. The HTML Encoder escapes characters for safe HTML insertion. The Regex Tester finds specific Unicode characters or character ranges in text. The Duplicate Remover and Line Sorter work with text that has been cleaned using the inspector's findings. All are in the Dev Tools section.