Remove non-UTF8 characters from string with PHP

If you have come across the cursed ‘Invalid Character’ error while using PHP’s XML or JSON parser, then you may be interested in this.

Unfortunately, PHP’s XML and JSON parsers do not skip over non-UTF8 characters; they stop and report a rather unhelpful error. I found a number of suggested solutions, namely iconv and utf8_encode, but they did not work for me.
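To make the failure concrete, here is a small sketch of what the JSON functions do when they meet a single invalid byte. The sample string is my own, and the behaviour shown assumes a reasonably recent PHP (5.5 or later for the json_encode part); older versions reported the problem differently.

```php
<?php
// "caf\xE9" ends with 0xE9, which is "é" in ISO-8859-1 but is not valid UTF-8 on its own.
$bad = "caf\xE9";

// json_encode() refuses the whole value and returns false.
$encoded     = json_encode(['name' => $bad]);
$encodeError = json_last_error();

// json_decode() likewise gives up on a document that contains the invalid byte.
$decoded     = json_decode('{"name":"caf' . "\xE9" . '"}');
$decodeError = json_last_error();

var_dump($encoded);                               // false
var_dump($decoded);                               // NULL
var_dump($encodeError === JSON_ERROR_UTF8);       // true
var_dump($decodeError === JSON_ERROR_UTF8);       // true
echo json_last_error_msg(), "\n";                 // wording varies by PHP version
```

Checking json_last_error() right after each call is what turns a silent false or NULL into something you can act on.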

Then I found this excellent explanation of using UTF8 with PHP, which is well worth a read.

Encoding gives me a headache but from this explanation this is how I see it.

I had some characters that the parser did not know how to interpret because their bytes did not form a valid UTF-8 sequence. Some of the PHP functions, like iconv, still let some non-UTF8 characters through, which breaks the parser. The preg_replace approach just rips out any non-UTF8 character based on its byte sequence and replaces it with a question mark.
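Before cleaning anything, it can help to confirm that a string really is the problem. One way to do that, sketched below, is to lean on the /u pattern modifier: preg_match returns false when the subject is not valid UTF-8. The helper name is_valid_utf8 and the sample strings are my own, not something from the article.

```php
<?php
// is_valid_utf8() is a hypothetical helper name, not a PHP built-in.
// With the /u modifier, preg_match() returns false if the subject is not valid UTF-8;
// on a valid subject the empty pattern matches once and it returns 1.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$samples = [
    'plain ASCII'           => "hello",
    'valid 2-byte sequence' => "h\xC3\xA9llo",      // "e" with acute accent, UTF-8 encoded
    'valid 4-byte sequence' => "\xF0\x9F\x99\x82",  // an emoji
    'lone ISO-8859-1 byte'  => "caf\xE9",
    'overlong 2-byte'       => "\xC0\xAF",
    'UTF-16 surrogate'      => "\xED\xA0\x80",
];

foreach ($samples as $label => $bytes) {
    printf("%-24s %s\n", $label, is_valid_utf8($bytes) ? 'valid' : 'invalid');
}
```

The first three samples should be reported as valid and the last three as invalid.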

From that article above, I use the following code to remove any non-UTF8 characters.

//reject some control characters, stray or missing continuation bytes, overly long 2 byte sequences, and 4 byte sequences (characters at U+10000 and above), and replace with ?
$some_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
 '|[\x00-\x7F][\x80-\xBF]+'.
 '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
 '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
 '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
 '?', $some_string );

//reject overly long 3 byte sequences and UTF-16 surrogates and replace with ?
$some_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
 '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $some_string );
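To see what the two passes actually do, here is the same code wrapped in a function and run over a few sample strings. The function name strip_invalid_utf8 and the samples are my own; the patterns are copied unchanged from above.

```php
<?php
// strip_invalid_utf8() is a hypothetical wrapper name; the two patterns are the ones shown above.
function strip_invalid_utf8($s)
{
    // Pass 1: control characters, stray/missing continuation bytes,
    // overlong 2-byte sequences and anything starting with 0xF0-0xFF.
    $s = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
        '|[\x00-\x7F][\x80-\xBF]+'.
        '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
        '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
        '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
        '?', $s);

    // Pass 2: overlong 3-byte sequences and UTF-16 surrogates.
    return preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
        '|\xED[\xA0-\xBF][\x80-\xBF]/S', '?', $s);
}

var_dump(strip_invalid_utf8("h\xC3\xA9llo") === "h\xC3\xA9llo"); // valid 2-byte text is untouched
var_dump(strip_invalid_utf8("\xE2\x82\xAC") === "\xE2\x82\xAC"); // valid 3-byte text is untouched
var_dump(strip_invalid_utf8("caf\xE9 au lait"));                 // "caf? au lait"
var_dump(strip_invalid_utf8("\xC0\xAF"));                        // "?" (overlong sequence)
var_dump(strip_invalid_utf8("\xED\xA0\x80"));                    // "?" (surrogate)
var_dump(strip_invalid_utf8("\xF0\x9F\x99\x82"));                // "?" (valid emoji, still replaced)
var_dump(strip_invalid_utf8("a\x80b"));                          // "?b"
```

Two things stand out from tracing the patterns. Valid 4 byte characters such as emoji are replaced too, which matches the comment about characters at U+10000 and above. And when a stray continuation byte follows an ASCII character, that ASCII character is swallowed along with it, as the last sample shows.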

Update: If the string you are parsing is serialized data, the string lengths stored in that data will need to be updated after you remove non-UTF8 characters; otherwise you will not be able to unserialize it. It just so happens that I came across this issue recently and have a fix.
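As an illustration of the problem (not necessarily the fix referred to above), here is a rough sketch that recomputes the length prefix of every string entry in a serialized value. The helper name fix_serialized_lengths is my own, the str_replace call is only a stand-in for the cleanup step, and the pattern assumes that no string value contains the sequence `";` itself, so treat it as a starting point rather than a robust solution.

```php
<?php
// fix_serialized_lengths() is a hypothetical helper, not a PHP built-in.
// It rewrites each s:N:"..."; entry so that N matches the byte length of the string.
// Assumption: no string value contains the two characters ";
function fix_serialized_lengths(string $serialized): string
{
    $fixed = preg_replace_callback(
        '/s:\d+:"(.*?)";/s',
        function (array $m) {
            return 's:' . strlen($m[1]) . ':"' . $m[1] . '";';
        },
        $serialized
    );

    return $fixed ?? $serialized; // fall back to the input if the regex engine fails
}

// A value holding a 4-byte character, serialized with a length of 7 bytes.
$serialized = serialize(['note' => "ok \xF0\x9F\x99\x82"]);

// Stand-in for the cleanup step: the 4-byte character becomes a single "?".
$cleaned = str_replace("\xF0\x9F\x99\x82", '?', $serialized);

var_dump(@unserialize($cleaned));   // false, because the stored length is now wrong

$fixed = fix_serialized_lengths($cleaned);
var_dump($fixed);                   // a:1:{s:4:"note";s:4:"ok ?";}
var_dump(unserialize($fixed));      // the array comes back with 'note' => 'ok ?'
```

Serialized string lengths are byte counts, which is why strlen is used here rather than a multibyte-aware length function.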

26 thoughts on “Remove non-UTF8 characters from string with PHP”

  1. You, kind sir, have saved my life. I was at the end of my wits with massive data entry operations failing because many of our strings in JSON were going to NULL.

    Thank you very much, and keep blogging 🙂

  2. Hello, there!

    I need to filter all non-printable characters out of text, so I tried your solution. Unfortunately, some characters were left.

    After a couple of hours of testing, I arrived at this solution:

    <?php
    // Remove all non-UTF-8 symbols
    $text = htmlspecialchars_decode(htmlspecialchars($text, ENT_IGNORE, 'UTF-8'));
    // Replace non-breaking spaces and other non-standard whitespace with a single space
    $text = preg_replace('~\s+~u', ' ', $text);
    // Replace control symbols (Unicode category C) with "?"
    $text = preg_replace('~\p{C}+~u', '?', $text);

  3. Wow, thank you. I’ve been trying to find a solution to this problem for the last 4 hours. It works perfectly!

  4. My good sir, you just saved me an entire day of torture looking for a solution. Hats Off!
