Remove non-UTF8 characters from string with PHP

If you have come across the cursed ‘Invalid Character’ error while using PHP’s XML or JSON parser, then you may be interested in this.

Unfortunately, PHP’s XML and JSON parsers do not skip over bytes that are not valid UTF-8; they stop and report a rather unhelpful error. I found a number of suggested solutions that did not work for me, namely iconv and utf8_encode.
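To make the failure concrete, here is a minimal sketch of what the JSON functions do with a single bad byte. It assumes PHP 7 or later (older versions behave differently), and the sample string is made up for illustration.

```php
<?php
// "caf\xE9" is "café" in ISO-8859-1; a lone 0xE9 byte is not valid UTF-8.
$bad = "caf\xE9";

// Encoding fails outright instead of skipping the bad byte.
$encoded = json_encode(['name' => $bad]);
$encodeError = json_last_error();

// Decoding a document that contains the same byte fails as well.
$decoded = json_decode('{"name":"' . $bad . '"}');
$decodeError = json_last_error();

var_dump($encoded === false);                // bool(true)
var_dump($encodeError === JSON_ERROR_UTF8);  // bool(true)
var_dump($decoded === null);                 // bool(true)
var_dump($decodeError === JSON_ERROR_UTF8);  // bool(true)
```

In both directions the whole call fails because of one byte, which is why the string has to be cleaned before it reaches the parser.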

Then I found this excellent explanation of using UTF8 with PHP, which is well worth a read.

Encoding gives me a headache, but from this explanation, this is how I see it.

I had some characters that the parser did not know how to interpret because their bytes did not form valid UTF-8 sequences. Some of the PHP functions, like iconv, still let some of these invalid bytes through, which breaks the parser. The preg_replace simply finds any non-UTF8 character by its byte sequence and replaces it with a question mark.
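Before cleaning anything, it helps to know whether a string is valid in the first place. A small sketch of one way to check, relying on the fact that PCRE validates the subject when the /u modifier is set; the helper name is_valid_utf8 is my own and not from the article.

```php
<?php
// Hypothetical helper: returns true when $s is well-formed UTF-8.
function is_valid_utf8(string $s): bool
{
    // With the /u modifier, preg_match() returns false (not 0) when the
    // subject is not valid UTF-8. The empty pattern matches any valid string.
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true): "é" as two UTF-8 bytes
var_dump(is_valid_utf8("caf\xE9"));     // bool(false): lone ISO-8859-1 byte
var_dump(is_valid_utf8("\xC0\xAF"));    // bool(false): overly long encoding of "/"
```

A check like this only tells you that something is wrong; it does not repair the string.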

From that article above, I use the following code to remove any non-UTF8 characters.

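//reject most ASCII control characters, stray continuation bytes, overly long 2 byte sequences
//and 2 or 3 byte sequences with the wrong number of continuation bytes,
//as well as every 4 byte sequence (characters U+10000 and above), and replace with ?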
$some_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
 '|[\x00-\x7F][\x80-\xBF]+'.
 '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
 '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
 '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
 '?', $some_string );

//reject overly long 3 byte sequences and UTF-16 surrogates and replace with ?
$some_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
 '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $some_string );
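For reuse, the two passes can be wrapped in one function. This is only a sketch: the function name replace_invalid_utf8 and the exception on failure are my own additions, while the patterns are copied unchanged from the code above.

```php
<?php
// Hypothetical wrapper around the two preg_replace passes shown above.
function replace_invalid_utf8(string $input): string
{
    // Pass 1: control characters, stray or miscounted continuation bytes,
    // overly long 2 byte sequences and all 4 byte sequences.
    $pass1 = preg_replace(
        '/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'
        . '|[\x00-\x7F][\x80-\xBF]+'
        . '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'
        . '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'
        . '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
        '?',
        $input
    );
    if ($pass1 === null) {
        throw new RuntimeException('preg_replace failed in pass 1');
    }

    // Pass 2: overly long 3 byte sequences and UTF-16 surrogates.
    $pass2 = preg_replace(
        '/\xE0[\x80-\x9F][\x80-\xBF]|\xED[\xA0-\xBF][\x80-\xBF]/S',
        '?',
        $pass1
    );
    if ($pass2 === null) {
        throw new RuntimeException('preg_replace failed in pass 2');
    }

    return $pass2;
}

echo replace_invalid_utf8("caf\xE9"), "\n";              // caf?
echo json_encode(replace_invalid_utf8("caf\xE9")), "\n"; // "caf?"
```

Two side effects are worth knowing. Valid 4 byte characters such as emoji are also replaced, because the first pass rejects every sequence starting with \xF0 or higher. And when a stray continuation byte follows an ASCII character, that character is replaced along with it, so "ab\x80" becomes "a?".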

16 thoughts on “Remove non-UTF8 characters from string with PHP”

  1. Pingback: Removing non-UTF8 characters from strings with PHP | Joseph Scott

  2. Pingback: Don’t use strlen() - WordPress Blog Man

  3. You, kind sir, have saved my life. I was at the end of my wits with massive data entry operations failing because many of our strings in JSON were going to NULL.

    Thank you very much, and keep blogging :)

  4. Hello, there!

    I need to filter all non-printable characters out of text, so I tried your solution. Unfortunately, some characters were left.

    After a couple of hours of testing I arrived at this solution:

    <?php
    // Remove all byte sequences that are not valid UTF-8
    $text = htmlspecialchars_decode(htmlspecialchars($text, ENT_IGNORE, 'UTF-8'));
    // Collapse runs of whitespace, including non-breaking and other non-standard spaces, into a single space
    $text = preg_replace('~\s+~u', ' ', $text);
    // Replace runs of control and other non-printable symbols with "?"
    $text = preg_replace('~\p{C}+~u', '?', $text);
