If you have come across the cursed ‘Invalid Character’ error while using PHP’s XML or JSON parser, then you may be interested in this.
Unfortunately, PHP’s XML and JSON parsers do not skip over non-UTF8 characters; they stop and throw a rather unhelpful error. I found a number of suggested solutions, namely using iconv and utf8_encode, but they did not work for me.
Then I found this excellent explanation of using UTF8 with PHP, which is well worth a read.
Encoding gives me a headache but from this explanation this is how I see it.
I had some characters that the parser did not know how to interpret because their bytes did not form a valid UTF8 sequence. Some of the PHP functions, like iconv, still let some non-UTF8 characters through, which breaks the parser. The preg_replace just rips out any non-UTF8 character based on its byte sequence and replaces it with a question mark.
From that article above, I use the following code to remove any non-UTF8 characters.
//reject overly long 2 byte sequences, as well as characters above U+10000 and replace with ?
$some_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
 '|[\x00-\x7F][\x80-\xBF]+'.
 '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
 '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
 '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
 '?', $some_string );

//reject overly long 3 byte sequences and UTF-16 surrogates and replace with ?
$some_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
 '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $some_string );
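To make it easier to see what the two passes do, here is a minimal sketch that wraps them in a function and runs them on a string json_encode refuses. The function name strip_invalid_utf8 and the sample string are my own assumptions for illustration; the patterns are exactly the ones quoted above.

```php
<?php
// Hypothetical helper name; the two patterns are copied unchanged from above.
function strip_invalid_utf8(string $s): string
{
    // Pass 1: some control bytes, orphaned continuation bytes, the invalid
    // lead bytes C0/C1, lead bytes F0-FF, and malformed 2- and 3-byte sequences.
    $s = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
        '|[\x00-\x7F][\x80-\xBF]+'.
        '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
        '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
        '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
        '?', $s);

    // Pass 2: overly long 3-byte sequences and UTF-16 surrogates.
    $s = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
        '|\xED[\xA0-\xBF][\x80-\xBF]/S', '?', $s);

    return $s;
}

// "\xE9" is "é" in ISO-8859-1; on its own it is not valid UTF-8.
$bad = "caf\xE9 ok";

$before = json_encode($bad);            // false, json_last_error() is JSON_ERROR_UTF8
$clean  = strip_invalid_utf8($bad);     // "caf? ok"
$after  = json_encode($clean);          // '"caf? ok"'
```

Note that valid 2- and 3-byte characters (for example "é" encoded as \xC3\xA9) pass through untouched, while anything starting with \xF0 or higher is collapsed into a single question mark.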
Update: If the string you are parsing is serialized data, the string lengths recorded in that data will need to be updated after you remove non-UTF8 characters, otherwise you will not be able to unserialize it. It just so happens I came across this issue recently and have a fix.
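The fix mentioned above is not reproduced here, but one common approach can be sketched as follows: recompute the byte length in every s:N:"..."; token with preg_replace_callback. The function name fix_serialized_lengths and the sample data are assumptions for illustration, and the lazy match will misfire if a stored string itself contains the two characters "; so treat it as a starting point rather than a complete repair.

```php
<?php
// Hypothetical helper: rewrite each s:N:"value"; token so that N matches
// the actual byte length of value after characters were replaced.
function fix_serialized_lengths(string $serialized): string
{
    return preg_replace_callback(
        '/s:\d+:"(.*?)";/s',
        static function (array $m): string {
            return 's:' . strlen($m[1]) . ':"' . $m[1] . '";';
        },
        $serialized
    );
}

// "café" is 5 bytes in UTF-8, so serialize() records s:5.
$original = serialize(['name' => "caf\xC3\xA9"]);

// Swapping the 2-byte "é" for a 1-byte "?" leaves the recorded length stale.
$damaged = str_replace("\xC3\xA9", '?', $original);
$broken  = @unserialize($damaged);      // false: declared length no longer fits

$fixed    = fix_serialized_lengths($damaged);
$restored = unserialize($fixed);        // ['name' => 'caf?']
```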
You, kind sir, have saved my life. I was at the end of my wits with massive data entry operations failing because many of our strings in JSON were going to NULL.
Thank you very much, and keep blogging
That is some clever stuff right there… thanks for the guide
I am not an expert in PHP. I have a database with problems and I don’t understand how to use this script. Where should I insert this code, and how do I run it to clean the database? The database is downloaded in SQL format.
Thank you so much! I broke my brain before I found your post!
Thanks!
Saved my life.
Thank you. This was exactly what I was looking for!
Thank you, this is better than all the other Stack Overflow answers on this, and you simply explained the reasoning behind your theory.
+1 on what Phil C. said. Thank you.
Hello, there!
I need to filter all non-printable chars in text, so I tried to use your solution; unfortunately, some chars were left.
After a couple of hours of testing I got a solution:
<?php
// Remove all non-UTF-8 symbols
$text = htmlspecialchars_decode(htmlspecialchars($text, ENT_IGNORE, 'UTF-8'));
// Replace non-breaking spaces and other non-standard spaces with a single space
$text = preg_replace('~\s+~u', ' ', $text);
// Replace control symbols with "?"
$text = preg_replace('~\p{C}+~u', '?', $text);
Thank you very much, sir. This is really awesome. It helps with all UTF-8 problems. Thank you.
Thanks, has worked.
Wow, thank you. I’ve been trying to find a solution to this problem for the last 4 hours. It works perfectly!
My good sir, you just saved me an entire day of torture looking for a solution. Hats Off!
You just saved my day
Wonderful! Thank you.
My life was solved!!
Thank you so much
Thank you so much for your help
It's perfect, thank you so much
Thank you very much for this article
Thank you very much, Eoin
Thanks mate!! Been looking over an hour for a solution.
How can I also replace “Emoticons (1F601 – 1F64F)”?
For example: U+1F60F \xF0\x9F\x98\x8F 😏