<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Magp.ie &#187; UTF8</title>
	<atom:link href="http://magp.ie/tag/utf8/feed/" rel="self" type="application/rss+xml" />
	<link>http://magp.ie</link>
	<description>A nest for the random, shiny, online tidbits I stumble across...</description>
	<lastBuildDate>Tue, 31 Jan 2012 19:01:48 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='magp.ie' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://0.gravatar.com/blavatar/061e340c5da13b5a41ae8016bee03aa8?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>Magp.ie &#187; UTF8</title>
		<link>http://magp.ie</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://magp.ie/osd.xml" title="Magp.ie" />
	<atom:link rel='hub' href='http://magp.ie/?pushpress=hub'/>
		<item>
		<title>Remove non-UTF8 characters from string with PHP</title>
		<link>http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/</link>
		<comments>http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/#comments</comments>
		<pubDate>Thu, 06 Jan 2011 21:58:22 +0000</pubDate>
		<dc:creator>Eoin</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[encoding]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[UTF8]]></category>

		<guid isPermaLink="false">http://magp.ie/?p=447</guid>
		<description><![CDATA[If you have come across the cursed &#8216;Invalid Character&#8217; error while using PHP&#8217;s XML or JSON parser then you may be interested in this. Unfortunately, PHP&#8217;s XML and JSON parsers do not ignore non-UTF8 characters, but rather they stop and &#8230; <a href="http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>If you have come across the cursed &#8216;<a href="http://petewarden.typepad.com/searchbrowser/2008/04/illegal-charact.html">Invalid Character</a>&#8217; error while using PHP&#8217;s <a href="http://ie2.php.net/manual/en/book.xml.php">XML</a> or <a href="http://www.php.net/manual/en/book.json.php">JSON</a> parser then you may be interested in this.<br />
<span id="more-447"></span><br />
Unfortunately, PHP&#8217;s XML and JSON parsers do not ignore non-UTF8 characters, but rather they stop and throw an unhelpful error. I found a number of suggested solutions, namely <a href="http://www.php.net/manual/en/function.iconv.php">iconv</a> and <a href="http://www.php.net/manual/en/function.utf8-encode.php">utf8_encode</a>, but they did not work for me.</p>
<p>Then I found this <a href="http://webcollab.sourceforge.net/unicode.html">excellent explanation</a> of using UTF8 with PHP, which is well worth a read.</p>
<p>Encoding gives me a headache, but from this explanation, this is how I see it.</p>
<p>My string contained some bytes that the parser did not know how to interpret because they fell outside the byte sequences that the <a href="http://en.wikipedia.org/wiki/UTF-8">UTF8</a> format allows. Some of the PHP functions, like iconv, still let some non-UTF8 characters through, which breaks the parser. The <a href="http://www.php.net/manual/en/function.preg-replace.php">preg_replace</a> call just rips out any non-UTF8 character based on its byte sequence and replaces it with a question mark.</p>
<p>From that article above, I use the following code to remove any non-UTF8 characters.</p>
<p><pre class="brush: php;">
//reject stray control characters, malformed or overly long 2 and 3 byte sequences, and characters above U+FFFF (4 byte sequences); replace each with ?
$some_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
 '|[\x00-\x7F][\x80-\xBF]+'.
 '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
 '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
 '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
 '?', $some_string );

//reject overly long 3 byte sequences and UTF-16 surrogates and replace with ?
$some_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
 '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $some_string );
</pre></p>
<br />Filed under: <a href='http://magp.ie/category/code/'>Code</a> Tagged: <a href='http://magp.ie/tag/encoding/'>encoding</a>, <a href='http://magp.ie/tag/php/'>php</a>, <a href='http://magp.ie/tag/regex/'>regex</a>, <a href='http://magp.ie/tag/utf8/'>UTF8</a>]]></content:encoded>
			<wfw:commentRss>http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<georss:point>53.734750 -8.989992</georss:point>
		<geo:lat>53.734750</geo:lat>
		<geo:long>-8.989992</geo:long>
		<media:content url="http://1.gravatar.com/avatar/72dd449e5e79e046c1c09ed8712b525a?s=96&#38;d=monsterid&#38;r=PG" medium="image">
			<media:title type="html">eoigal</media:title>
		</media:content>
	</item>
		<item>
		<title>Javascript encoding</title>
		<link>http://magp.ie/2010/02/05/javascript-encoding/</link>
		<comments>http://magp.ie/2010/02/05/javascript-encoding/#comments</comments>
		<pubDate>Fri, 05 Feb 2010 11:15:34 +0000</pubDate>
		<dc:creator>Eoin</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[character set]]></category>
		<category><![CDATA[encoding]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[UTF8]]></category>

		<guid isPermaLink="false">http://blogalhost.wordpress.com/?p=34</guid>
		<description><![CDATA[We encode our polls in UTF8 so all sites will be able to render them. But there was an issue with our polls at one stage: when the poll dynamically loaded content into the site, like after a vote, the &#8230; <a href="http://magp.ie/2010/02/05/javascript-encoding/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>We encode our polls in <a href="http://www.utf8.com/" target="_blank">UTF8</a> so all sites will be able to render them. But there was an issue with our polls at one stage: when the poll dynamically loaded content into the site, like after a vote, the text in the poll would render all screwed up.</p>
<p>For people unfamiliar with how widgets work, I&#8217;ll briefly explain. We give you an embed code (JavaScript) that you place in your site&#8217;s HTML where you want to display the widget. When the page loads, the JavaScript runs and calls a URL on our site to retrieve HTML. The code then places this HTML into a div container in your site&#8217;s HTML using the <a href="http://www.javascriptkit.com/javatutors/dom2.shtml" target="_blank">DOM</a> and <a href="http://www.tizag.com/javascriptT/javascript-innerHTML.php" target="_blank">innerHTML</a>. Hey presto, the widget is rendered into your site.</p>
<p>The encoding problem occurred when a site used a character encoding other than UTF8. The HTML we send is encoded in UTF8, and the first time the poll widget loaded, it rendered and looked fine. But after a vote, which triggers another request for HTML, the HTML seemed to adopt the site&#8217;s encoding, causing the text in the poll to look a mess.</p>
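<p>To see why the text ends up looking a mess, here is a minimal sketch of UTF8 bytes being read with the wrong character set. It assumes the host page uses ISO-8859-1, which is an assumption for illustration only, and it uses the Node.js <code>Buffer</code> API to stand in for the decoding a browser does internally. The sample string is made up.</p>

```javascript
// Illustrative only: simulate a page that reads UTF-8 bytes as ISO-8859-1.
// Buffer is Node.js-specific; a browser performs the equivalent decoding
// internally when it assumes the wrong charset for a response.
const original = 'Café';

// What the server sends: the UTF-8 bytes of the text.
const utf8Bytes = Buffer.from(original, 'utf8');

// What a Latin-1 page shows if it decodes those bytes with its own charset.
const garbled = utf8Bytes.toString('latin1');

// What the page shows once the bytes are decoded as UTF-8.
const correct = utf8Bytes.toString('utf8');

console.log(garbled); // CafÃ©
console.log(correct); // Café
```

<p>The bytes are identical in both cases; only the character set used to decode them differs.</p>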
<p>This bugged the hell out of us, and like all the best solutions, the fix was a simple one. You simply need to add a charset attribute to the script tag that references or encloses your JavaScript code.</p>
<p>So something like&#8230;</p>
<p><pre class="brush: jscript;">
&lt;script type=&quot;text/javascript&quot; language=&quot;javascript&quot; charset=&quot;utf-8&quot; src=&quot;http://static.polldaddy.com/p/2064343.js&quot;&gt;&lt;/script&gt;
</pre></p>
<p>Here&#8217;s a decent <a href="http://www.capitolacomputing.com/intl_js_charset.htm">description</a> of the problem.</p>
<br />Filed under: <a href='http://magp.ie/category/code/'>Code</a> Tagged: <a href='http://magp.ie/tag/character-set/'>character set</a>, <a href='http://magp.ie/tag/encoding/'>encoding</a>, <a href='http://magp.ie/tag/javascript/'>javascript</a>, <a href='http://magp.ie/tag/utf8/'>UTF8</a>]]></content:encoded>
			<wfw:commentRss>http://magp.ie/2010/02/05/javascript-encoding/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<georss:point>53.734750 -8.989992</georss:point>
		<geo:lat>53.734750</geo:lat>
		<geo:long>-8.989992</geo:long>
		<media:content url="http://1.gravatar.com/avatar/72dd449e5e79e046c1c09ed8712b525a?s=96&#38;d=monsterid&#38;r=PG" medium="image">
			<media:title type="html">eoigal</media:title>
		</media:content>
	</item>
	</channel>
</rss>
