If you have come across the cursed ‘Invalid Character’ error while using PHP’s XML or JSON parser, then you may be interested in this.
Unfortunately, PHP’s XML and JSON parsers do not ignore non-UTF8 characters, but rather they stop and throw a rather unhelpful error. I found a number of solutions to this that did not work for me, namely using iconv and utf8_encode.
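To make the failure concrete, here is a minimal sketch of how the JSON functions react to a stray ISO-8859-1 byte. It assumes PHP 7 or later; the sample string and variable names are made up for illustration.

```php
<?php
// "\xE9" is "é" in ISO-8859-1; on its own it is not a valid UTF-8 sequence.
$latin1 = "caf\xE9";

// Encoding fails outright instead of skipping the bad byte.
$encoded = json_encode($latin1);
$encode_error = json_last_error();

// Decoding a JSON document that contains the same byte fails too.
$decoded = json_decode("\"caf\xE9\"");
$decode_error = json_last_error();

var_dump($encoded);                           // bool(false)
var_dump($encode_error === JSON_ERROR_UTF8);  // bool(true)
var_dump($decoded);                           // NULL
var_dump($decode_error === JSON_ERROR_UTF8);  // bool(true)
```

Calling json_last_error_msg() right after the failing call gives the human-readable version of the same error.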
Then I found this excellent explanation of using UTF8 with PHP, which is well worth a read.
Encoding gives me a headache, but from this explanation, this is how I see it.
I had some characters that the parser did not know how to interpret because they fell outside the byte ranges that UTF-8 allows. Some of the PHP functions, like iconv, still let some non-UTF8 characters through, which breaks the parser. The preg_replace below just rips out any non-UTF8 character based on its byte sequence and replaces it with a question mark.
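As a rough illustration of what "based on its byte sequence" means, the sketch below compares the same letter in two encodings. The variable names are hypothetical, and the empty pattern with the /u modifier is just one convenient way to ask PCRE whether a string is well-formed UTF-8.

```php
<?php
$utf8   = "\xC3\xA9"; // "é" in UTF-8: lead byte C3 followed by continuation byte A9
$latin1 = "\xE9";     // "é" in ISO-8859-1: a single byte

$utf8_hex   = bin2hex($utf8);
$latin1_hex = bin2hex($latin1);

// With the /u modifier, preg_match() returns false when the subject is not
// valid UTF-8. E9 is a lead byte that announces a 3-byte sequence, so a lone
// E9 is malformed.
$utf8_ok   = preg_match('//u', $utf8) === 1;
$latin1_ok = preg_match('//u', $latin1) === 1;

var_dump($utf8_hex, $latin1_hex, $utf8_ok, $latin1_ok);
```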
From that article above, I use the following code to remove any non-UTF8 characters.
//reject overly long 2 byte sequences, as well as characters at or above U+10000, and replace with ?
$some_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
'|[\x00-\x7F][\x80-\xBF]+'.
'|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
'|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
'|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
'?', $some_string );
//reject overly long 3 byte sequences and UTF-16 surrogates and replace with ?
$some_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
'|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $some_string );
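For reuse, the two passes above can be wrapped in a function. This is only a sketch: the function name replace_invalid_utf8 is hypothetical, the patterns are copied unchanged from the snippet above, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper; both patterns are the ones from the post, unchanged.
function replace_invalid_utf8(string $input): string
{
    // Pass 1: some control characters, stray continuation bytes,
    // overlong 2-byte leads, 4-byte leads, malformed 2- and 3-byte sequences.
    $pass1 = '/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'
        . '|[\x00-\x7F][\x80-\xBF]+'
        . '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'
        . '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'
        . '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S';

    // Pass 2: overlong 3-byte sequences and UTF-16 surrogates.
    $pass2 = '/\xE0[\x80-\x9F][\x80-\xBF]'
        . '|\xED[\xA0-\xBF][\x80-\xBF]/S';

    $output = preg_replace($pass1, '?', $input);
    if ($output !== null) {
        $output = preg_replace($pass2, '?', $output);
    }
    if ($output === null) {
        // preg_replace() returns null when PCRE itself reports an error.
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }
    return $output;
}

$clean = replace_invalid_utf8("caf\xE9 ok");
echo $clean, PHP_EOL;              // caf? ok
echo json_encode($clean), PHP_EOL; // "caf? ok"
```

Two details are worth knowing before relying on this. The second alternative in the first pattern also consumes the ASCII byte in front of a stray continuation byte, so "a\x80" becomes "?" rather than "a?". And a stray continuation byte at the very start of a string is not matched by either pattern, so a final validity check on the result can still be worthwhile.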
You, kind sir, have saved my life. I was at the end of my wits with massive data entry operations failing because many of our strings in JSON were going to NULL.
Thank you very much, and keep blogging
That is some clever stuff right there… thanks for the guide
I am not an expert in PHP. I have a database with problems and I don’t understand how to use this script. Where should I insert this code, and how do I run it to clean the database? The database is downloaded in SQL format.
Thank you so much! I broke my brain before I found your post!
Thanks!
Saved my life.
Thank you. This was exactly what I was looking for!
Thank you, this is better than all the other Stack Overflow answers on this, and you simply explained the reasoning behind your theory.
+1 on what Phil C. said. Thank you.