Remove non-UTF8 characters from string with PHP

If you have come across the cursed “Invalid Character” error while using PHP’s XML or JSON parser, then you may be interested in this.

Unfortunately, PHP’s XML and JSON parsers do not skip non-UTF8 characters; they stop and report a rather unhelpful error. I found a number of suggested solutions that did not work for me, namely iconv and utf8_encode.
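
To see the failure concretely, here is a minimal sketch. The input string is made up for illustration, and it shows the JSON case only; the XML parser has its own error reporting.

```php
<?php
// Made-up input: syntactically fine JSON, but the string holds a lone 0xE9 byte
// (Latin-1 "é"), which is not a valid UTF-8 sequence.
$json = "{\"name\": \"caf\xE9\"}";

$result = json_decode($json, true);

var_dump($result);                                // NULL
var_dump(json_last_error() === JSON_ERROR_UTF8);  // bool(true)

// preg_match() with the /u modifier returns false when the subject is not
// valid UTF-8, which makes it a cheap validity check before parsing.
var_dump(preg_match('//u', $json) !== false);     // bool(false)
```

Checking json_last_error() after a NULL result is what tells you the problem is the encoding rather than the JSON syntax.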

Then I found this excellent explanation of using UTF8 with PHP, which is well worth a read.

Encoding gives me a headache but from this explanation this is how I see it.

I had some character that the parser did not know how to interpret because its bytes fell outside what the UTF8 format allows. Some of the PHP functions, like iconv, still let some non-UTF8 characters through, which breaks the parser. The preg_replace just rips out any non-UTF8 character based on its byte sequence and replaces it with a question mark.

From that article above, I use the following code to remove any non-UTF8 characters.

//reject overly long 2 byte sequences, as well as characters above U+10000 and replace with ?
$some_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
 '|[\x00-\x7F][\x80-\xBF]+'.
 '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
 '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
 '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
 '?', $some_string );

//reject overly long 3 byte sequences and UTF-16 surrogates and replace with ?
$some_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
 '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $some_string );
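
For reuse, the two calls above can be wrapped in a function. This is only a sketch: the name replace_invalid_utf8 and the sample inputs are my own, and the patterns are copied unchanged from the code above.

```php
<?php
// Hypothetical wrapper around the two preg_replace() calls shown above.
function replace_invalid_utf8(string $s): string
{
    // First pass: stray bytes, overlong 2-byte forms, 4-byte lead bytes.
    $s = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
        '|[\x00-\x7F][\x80-\xBF]+'.
        '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
        '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
        '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
        '?', $s) ?? '';

    // Second pass: overlong 3-byte forms and UTF-16 surrogates.
    return preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
        '|\xED[\xA0-\xBF][\x80-\xBF]/S', '?', $s) ?? '';
}

var_dump(replace_invalid_utf8("caf\xE9!"));      // string(5) "caf?!"
var_dump(replace_invalid_utf8("\xED\xA0\x80"));  // string(1) "?"

// A valid 2-byte sequence (U+00E9) passes through untouched.
var_dump(replace_invalid_utf8("caf\xC3\xA9") === "caf\xC3\xA9"); // bool(true)
```

One thing to be aware of: the second alternative of the first pattern matches an ASCII character together with the stray continuation bytes that follow it, so "a\x80" becomes "?" and the "a" is lost as well.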

Update: If the string you are parsing is serialized data, the string lengths recorded in that data will need to be updated after you remove non-UTF8 characters; otherwise you will not be able to unserialize it. It just so happens I came across this issue recently and have a fix.
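
As a rough illustration of what such a length fix can look like (this is a common workaround, not necessarily the fix referred to above), the sketch below recomputes the byte count in every s:N:"..."; token. The function name fix_serialized_lengths and the sample data are assumptions, and the approach is fragile: it breaks if a string value itself contains the sequence `";`.

```php
<?php
// Hypothetical helper: rewrite each s:N:"..."; token with the real byte length.
function fix_serialized_lengths(string $data): string
{
    return preg_replace_callback(
        '/s:\d+:"(.*?)";/s',
        function (array $m): string {
            // strlen() counts bytes, which is what unserialize() expects.
            return 's:' . strlen($m[1]) . ':"' . $m[1] . '";';
        },
        $data
    ) ?? $data;
}

// Made-up example: a 5-byte value was shortened to "x?y" (3 bytes) by the
// cleanup, but the recorded length still says 5.
$broken = 'a:1:{s:1:"k";s:5:"x?y";}';

$fixed = fix_serialized_lengths($broken);
var_dump($fixed);               // string(24) "a:1:{s:1:"k";s:3:"x?y";}"
var_dump(unserialize($fixed));  // array with key "k" and value "x?y"
```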

26 comments on “Remove non-UTF8 characters from string with PHP”

  1. You, kind sir, have saved my life. I was at the end of my wits with massive data entry operations failing because many of our strings in JSON were going to NULL.

    Thank you very much, and keep blogging 🙂

  2. Hello, there!

    I need to filter all non-printable chars in text, so I tried your solution – unfortunately some chars were left.

    After a couple of hours of testing I got a solution:

    <?php
    // Remove all non-UTF-8 symbols
    $text = htmlspecialchars_decode(htmlspecialchars($text, ENT_IGNORE, 'UTF-8'));
    // collapse non-breaking spaces and other non-standard whitespace into a single space
    $text = preg_replace('~\s+~u', ' ', $text);
    // replace control symbols with "?"
    $text = preg_replace('~\p{C}+~u', '?', $text);
