7 Answers

  1. Charles

    2019-11-16

    Try this:

    $str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
        return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
    }, $str);
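
    A quick way to check the snippet above is to wrap it in a function. This is only a minimal sketch: the name `decode_bmp_escapes` is made up for illustration, and because UCS-2 has no surrogate pairs it should only be expected to handle characters in the BMP (U+0000 to U+FFFF).

    ```php
    <?php
    // Hypothetical wrapper around the UCS-2BE snippet above.
    function decode_bmp_escapes($str)
    {
        return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
            // pack('H*', ...) turns the four hex digits into two raw big-endian bytes.
            return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
        }, $str);
    }

    var_dump(decode_bmp_escapes('caf\u00e9') === "caf\xC3\xA9"); // bool(true)
    ```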
    

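    In case it's UTF-16 based C/C++/Java/JSON-style, where characters outside the BMP are written as two consecutive escapes (a surrogate pair), convert each run of escapes in one go so that the pair is decoded together:

    $str = preg_replace_callback('/(?:\\\\u[0-9a-fA-F]{4})+/', function ($match) {
        // Strip the "\u" prefixes, leaving one hex string of UTF-16BE code units.
        $hex = str_replace('\\u', '', $match[0]);
        return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
    }, $str);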
    
  2. Chris

    2019-11-16

    print_r(json_decode('{"t":"\u00ed"}')); // -> stdClass Object ( [t] => í )
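
    The same json_decode() behaviour can be used on a string that is not a complete JSON document by handing it only the runs of escapes. The following is a sketch under assumptions, not a definitive implementation: `json_unescape_unicode` is a hypothetical name, and a run that json_decode() rejects (for example an unpaired surrogate) is simply left as it was.

    ```php
    <?php
    // Hypothetical helper: decode \uXXXX runs by passing each run to json_decode().
    function json_unescape_unicode($str)
    {
        return preg_replace_callback('/(?:\\\\u[0-9a-fA-F]{4})+/', function ($m) {
            // Wrap the run in quotes so it is a valid JSON string literal.
            $decoded = json_decode('"' . $m[0] . '"');
            // json_decode() returns null on failure; keep the original text then.
            return is_string($decoded) ? $decoded : $m[0];
        }, $str);
    }

    var_dump(json_unescape_unicode('cat\ud83d\ude38') === "cat\xF0\x9F\x98\xB8"); // bool(true)
    ```

    Because only the escapes reach json_decode(), quotes and other characters elsewhere in the string do not need to be escaped first.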
    
  3. Cosmo

    2019-11-16

    $str = '\u0063\u0061\u0074'.'\ud83d\ude38';
    $str2 = '\u0063\u0061\u0074'.'\ud83d';
    
    // U+1F638
    var_dump(
        "cat\xF0\x9F\x98\xB8" === escape_sequence_decode($str),
        "cat\xEF\xBF\xBD" === escape_sequence_decode($str2)
    );
    
    function escape_sequence_decode($str) {
    
        // [U+D800 - U+DBFF][U+DC00 - U+DFFF]|[U+0000 - U+FFFF]
        $regex = '/\\\u([dD][89abAB][\da-fA-F]{2})\\\u([dD][c-fC-F][\da-fA-F]{2})
                  |\\\u([\da-fA-F]{4})/sx';
    
        return preg_replace_callback($regex, function($matches) {
    
            if (isset($matches[3])) {
                $cp = hexdec($matches[3]);
            } else {
                $lead = hexdec($matches[1]);
                $trail = hexdec($matches[2]);
    
                // http://unicode.org/faq/utf_bom.html#utf16-4
                $cp = ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00;
            }
    
            // https://tools.ietf.org/html/rfc3629#section-3
            // Characters between U+D800 and U+DFFF are not allowed in UTF-8
            if ($cp > 0xD7FF && 0xE000 > $cp) {
                $cp = 0xFFFD;
            }
    
            // https://github.com/php/php-src/blob/php-5.6.4/ext/standard/html.c#L471
            // php_utf32_utf8(unsigned char *buf, unsigned k)
    
            if ($cp < 0x80) {
                return chr($cp);
            } else if ($cp < 0xA0) {
                return chr(0xC0 | $cp >> 6).chr(0x80 | $cp & 0x3F);
            }
    
            return html_entity_decode('&#'.$cp.';', ENT_QUOTES, 'UTF-8');
        }, $str);
    }
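
    The surrogate pair arithmetic in the function above can be checked in isolation. This is a small worked example; `surrogate_pair_to_codepoint` is a hypothetical name, and mb_chr() needs PHP 7.2 or later with the mbstring extension.

    ```php
    <?php
    // Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
    function surrogate_pair_to_codepoint($lead, $trail)
    {
        // Equivalent to ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00:
        // the lead contributes the high 10 bits, the trail the low 10 bits.
        return (($lead - 0xD800) << 10) + ($trail - 0xDC00) + 0x10000;
    }

    $cp = surrogate_pair_to_codepoint(0xD83D, 0xDE38);
    printf("U+%X\n", $cp);                                 // U+1F638
    var_dump(mb_chr($cp, 'UTF-8') === "\xF0\x9F\x98\xB8"); // bool(true)
    ```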
    
  4. Daniel

    2019-11-16

    This is a sledgehammer approach to replacing raw Unicode escapes with HTML entities. I haven't seen any other place to put this solution, but I assume others have had this problem.

    Apply this function to the raw JSON, before doing anything else.

    function unicode2html($str){
        $i=65535;
        while($i>0){
            // Zero-pad to four digits so that e.g. 237 becomes "00ed", matching \u00ed.
            $hex=sprintf('%04x',$i);
            // Case-insensitive, because escapes may be written as \u00ED or \u00ed.
            $str=str_ireplace('\u'.$hex,"&#$i;",$str);
            $i--;
        }
        return $str;
    }
    

    This won't take as long as you think, and it will replace any four-digit \uXXXX escape with an HTML entity. Characters outside the BMP arrive as two escapes (a surrogate pair), which this loop does not combine.

    Of course this can be reduced if you know which Unicode ranges are being returned in the JSON.

    For example, my code was getting lots of arrows and dingbat Unicode. These are between 8448 and 11263. So my production code looks like:

    $i=11263;
    while($i>=8448){
        ...etc...
    

    You can look up the blocks of Unicode by type here: http://unicode-table.com/en/. If you know you're translating Arabic or Telugu or whatever, you can just replace those codes, not all 65,000.

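    You could apply this same sledgehammer to simple encoding, with the caveat that chr() returns a single byte, so this only gives correct output for code points below 128 (or below 256 if the target encoding is ISO-8859-1):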

     $str=str_replace("\u$hex",chr($i),$str);
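
    For comparison, the same escape-to-entity replacement can be sketched as a single pass over the string instead of one pass per code point. This is an illustration only; `unicode_escapes_to_entities` is a hypothetical name, and like the loop it does not combine surrogate pairs.

    ```php
    <?php
    // Hypothetical single-pass variant: turn each \uXXXX escape into &#N;.
    function unicode_escapes_to_entities($str)
    {
        return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($m) {
            // hexdec() accepts upper- and lower-case hex digits.
            return '&#' . hexdec($m[1]) . ';';
        }, $str);
    }

    echo unicode_escapes_to_entities('\u2192 \u00ED'), "\n"; // &#8594; &#237;
    ```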
    
  5. Dennis

    2019-11-16

    There is also a solution:
    http://www.welefen.com/php-unicode-to-utf8.html

    function entity2utf8onechar($unicode_c){
        $unicode_c_val = intval($unicode_c);
        $f=0x80; // 10000000
        $str = "";
        // U-00000000 - U-0000007F:   0xxxxxxx
        if($unicode_c_val <= 0x7F){
            $str = chr($unicode_c_val);
        }
        // U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
        else if($unicode_c_val >= 0x80 && $unicode_c_val <= 0x7FF){
            $h=0xC0; // 11000000
            $c1 = $unicode_c_val >> 6 | $h;
            $c2 = ($unicode_c_val & 0x3F) | $f;
            $str = chr($c1).chr($c2);
        }
        // U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
        else if($unicode_c_val >= 0x800 && $unicode_c_val <= 0xFFFF){
            $h=0xE0; // 11100000
            $c1 = $unicode_c_val >> 12 | $h;
            $c2 = (($unicode_c_val & 0xFC0) >> 6) | $f;
            $c3 = ($unicode_c_val & 0x3F) | $f;
            $str = chr($c1).chr($c2).chr($c3);
        }
        // U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        else if($unicode_c_val >= 0x10000 && $unicode_c_val <= 0x1FFFFF){
            $h=0xF0; // 11110000
            $c1 = $unicode_c_val >> 18 | $h;
            $c2 = (($unicode_c_val & 0x3F000) >> 12) | $f;
            $c3 = (($unicode_c_val & 0xFC0) >> 6) | $f;
            $c4 = ($unicode_c_val & 0x3F) | $f;
            $str = chr($c1).chr($c2).chr($c3).chr($c4);
        }
        // U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        else if($unicode_c_val >= 0x200000 && $unicode_c_val <= 0x3FFFFFF){
            $h=0xF8; // 11111000
            $c1 = $unicode_c_val >> 24 | $h;
            $c2 = (($unicode_c_val & 0xFC0000) >> 18) | $f;
            $c3 = (($unicode_c_val & 0x3F000) >> 12) | $f;
            $c4 = (($unicode_c_val & 0xFC0) >> 6) | $f;
            $c5 = ($unicode_c_val & 0x3F) | $f;
            $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5);
        }
        // U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        else if($unicode_c_val >= 0x4000000 && $unicode_c_val <= 0x7FFFFFFF){
            $h=0xFC; // 11111100
            $c1 = $unicode_c_val >> 30 | $h;
            $c2 = (($unicode_c_val & 0x3F000000) >> 24) | $f;
            $c3 = (($unicode_c_val & 0xFC0000) >> 18) | $f;
            $c4 = (($unicode_c_val & 0x3F000) >> 12) | $f;
            $c5 = (($unicode_c_val & 0xFC0) >> 6) | $f;
            $c6 = ($unicode_c_val & 0x3F) | $f;
            $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5).chr($c6);
        }
        return $str;
    }
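    function entities2utf8($unicode_c){
        // The /e modifier was removed in PHP 7, so use a callback instead.
        // Decimal references such as &#128568; are matched; intval() cannot read hex digits.
        $unicode_c = preg_replace_callback('/&#(\d+);/', function ($m) {
            return entity2utf8onechar($m[1]);
        }, $unicode_c);
        return $unicode_c;
    }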
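
    If the mbstring extension is available, the byte-level work can be handed to mb_chr(). The sketch below assumes PHP 7.2 or later; `numeric_entities_to_utf8` is a hypothetical name. It accepts decimal and hexadecimal references and leaves anything it cannot encode untouched.

    ```php
    <?php
    // Hypothetical helper: decode numeric references such as &#237; or &#xED; to UTF-8.
    function numeric_entities_to_utf8($html)
    {
        $regex = '/&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));/';
        return preg_replace_callback($regex, function ($m) {
            // Hex match: group 2 is absent. Decimal match: group 1 is ''.
            $cp = isset($m[2]) ? (int) $m[2] : hexdec($m[1]);
            $char = mb_chr($cp, 'UTF-8');
            // mb_chr() returns false for surrogates and values above U+10FFFF.
            return $char === false ? $m[0] : $char;
        }, $html);
    }

    var_dump(numeric_entities_to_utf8('&#237; &#x1F638;') === "\xC3\xAD \xF0\x9F\x98\xB8"); // bool(true)
    ```

    Named entities such as &amp;amp; are deliberately not touched here.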
    
  6. Derek

    2019-11-16

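    This fixes JSON values whose \u escapes have lost their backslash: it puts the \ back before each uXXXX sequence in the value and then decodes it.

    $item = preg_replace_callback('/"(.+?)":"(u.+?)",/', function ($matches) {
        // Restore the backslash only where "u" is followed by four hex digits.
        $matches[2] = preg_replace('/u([0-9a-fA-F]{4})/', '\\\\u$1', $matches[2]);
        $matches[2] = preg_replace('/(")/', '&quot;', $matches[2]);
        $matches[2] = json_decode('"' . $matches[2] . '"');
        return '"' . $matches[1] . '":"' . $matches[2] . '",';
    }, $item);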
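
    The idea can be tried on a single value outside of any JSON text. A minimal sketch, with `restore_and_decode` as a hypothetical name; note that any ordinary word containing "u" followed by four hex digits would be treated as an escape too, so this is only safe on input known to be damaged in this way.

    ```php
    <?php
    // Hypothetical helper: put the backslash back before each uXXXX, then decode.
    function restore_and_decode($value)
    {
        // '\\\\u$1' is the replacement text \\u$1, which yields a literal \u plus the digits.
        $escaped = preg_replace('/u([0-9a-fA-F]{4})/', '\\\\u$1', $value);
        $decoded = json_decode('"' . $escaped . '"');
        // Fall back to the input if the result is not a valid JSON string.
        return is_string($decoded) ? $decoded : $value;
    }

    var_dump(restore_and_decode('Mu00fcnchen') === "M\xC3\xBCnchen"); // bool(true)
    ```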
    
