Tuesday, September 22, 2015

JavaScript DataView to UTF-8 encoding

JavaScript's DataView offers more flexibility than a TypedArray for working with binary data held in an ArrayBuffer. It provides getter and setter methods for reading and writing values at arbitrary byte offsets in the buffer. Reading numeric values is easy with getters such as getUint32() or getFloat64(), but there is no getter like getString(), and the problem gets harder when the text uses a multi-byte encoding such as UTF-8 or UTF-16. UTF-8 is a variable-length encoding: each character takes a minimum of 1 byte and a maximum of 4 bytes.
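As a quick illustration of the numeric getters and setters mentioned above, here is a minimal sketch. The buffer size, offsets, and values are arbitrary choices for the example, not something from a real data source.

```javascript
// Allocate 16 bytes and wrap them in a DataView.
var buffer = new ArrayBuffer(16);
var view = new DataView(buffer);

// Write a 32-bit unsigned integer at byte offset 0 and a 64-bit float at offset 8.
// DataView reads and writes big-endian unless told otherwise.
view.setUint32(0, 0xCAFEBABE);
view.setFloat64(8, 3.5);

// Read the values back with the matching getters.
var asUint32 = view.getUint32(0);
var asFloat64 = view.getFloat64(8);

// The same bytes can also be inspected one at a time.
var firstByteOfInt = view.getUint8(0); // most significant byte of 0xCAFEBABE
```

Nothing comparable exists for text, which is the gap getString() is meant to fill.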

In this article we will write a getString() function and add it to the DataView prototype so that it can be used in the same way as getUint32(). First, look at the binary format of a UTF-8 byte sequence.

Binary format of bytes in sequence
1st Byte   2nd Byte   3rd Byte   4th Byte   Number of Free Bits   Maximum Expressible Unicode Value
0xxxxxxx                                    7                     007F hex (127)
110xxxxx   10xxxxxx                         (5+6)=11              07FF hex (2047)
1110xxxx   10xxxxxx   10xxxxxx              (4+6+6)=16            FFFF hex (65535)
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   (3+6+6+6)=21          10FFFF hex (1,114,111)

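As a worked example of the bit layout in the table above, the sketch below decodes the euro sign by hand. The choice of character is just for illustration; the byte values E2 82 AC are its standard UTF-8 encoding.

```javascript
// "€" (U+20AC) is stored in UTF-8 as the three bytes E2 82 AC.
var bytes = [0xE2, 0x82, 0xAC];

// Lead byte 1110xxxx: keep the low 4 bits.
// Continuation bytes 10xxxxxx: keep the low 6 bits of each.
var codePoint = ((bytes[0] & 0x0F) << 12) |
                ((bytes[1] & 0x3F) << 6) |
                 (bytes[2] & 0x3F);

// 4 + 6 + 6 = 16 free bits, reassembled into one UTF-16 code unit.
var euro = String.fromCharCode(codePoint);
```
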
The value of each individual byte indicates its UTF-8 function, as follows:
  • 00 to 7F hex (0 to 127): first and only byte of a sequence.
  • 80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
  • C2 to DF hex (194 to 223): first byte of a two-byte sequence.
  • E0 to EF hex (224 to 239): first byte of a three-byte sequence.
  • F0 to F4 hex (240 to 244): first byte of a four-byte sequence.
  • C0, C1, and F5 to FF hex: never appear in valid UTF-8.
For more detail on the UTF-8 format, click here.
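The byte ranges listed above can be sketched as a small lookup function. The name utf8SequenceLength is hypothetical and is not part of the article's source code. This sketch is also stricter than a plain range check: it rejects C0, C1, and F5 to FF as lead bytes, because they would produce overlong encodings or values above 10FFFF.

```javascript
// Hypothetical helper: returns how many bytes the sequence starting with
// `byte` occupies, or 0 if `byte` cannot start a sequence.
function utf8SequenceLength(byte) {
    if (byte <= 0x7F) { return 1; }                 // 00-7F: first and only byte
    if (byte >= 0xC2 && byte <= 0xDF) { return 2; } // C2-DF: two-byte lead
    if (byte >= 0xE0 && byte <= 0xEF) { return 3; } // E0-EF: three-byte lead
    if (byte >= 0xF0 && byte <= 0xF4) { return 4; } // F0-F4: four-byte lead
    // 80-BF are continuation bytes; C0, C1 and F5-FF are never valid leads.
    return 0;
}
```

A decoder can call such a helper on the first byte to know how many continuation bytes to read next.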

I will not go through the code line by line. Here is the final code for getString().
You can download the source code from GitHub.


 DataView.prototype.getString = function (offset, length) {
        var self = this, charCodes = [], firstByte, codePoint, end;
        offset = offset || 0;
        // Default to everything from offset to the end of the view.
        if (length === undefined) {
            length = self.byteLength - offset;
        }
        end = offset + length;
        while (offset < end) {
            firstByte = self.getUint8(offset++);
            if (firstByte <= 0x7F) {
                // 0xxxxxxx: single-byte (ASCII) character
                charCodes.push(firstByte);
            }
            else if (firstByte <= 0xDF) {
                // 110xxxxx 10xxxxxx: two-byte sequence
                charCodes.push(((firstByte & 0x1F) << 6) | (self.getUint8(offset++) & 0x3F));
            }
            else if (firstByte <= 0xEF) {
                // 1110xxxx 10xxxxxx 10xxxxxx: three-byte sequence (4 free bits in the lead byte)
                charCodes.push(((firstByte & 0x0F) << 12) |
                    ((self.getUint8(offset++) & 0x3F) << 6) |
                    (self.getUint8(offset++) & 0x3F));
            }
            else {
                // 11110xxx plus three continuation bytes: four-byte sequence
                codePoint = ((firstByte & 0x07) << 18) |
                    ((self.getUint8(offset++) & 0x3F) << 12) |
                    ((self.getUint8(offset++) & 0x3F) << 6) |
                    (self.getUint8(offset++) & 0x3F);
                // Code points above FFFF hex are stored as a UTF-16 surrogate pair.
                codePoint -= 0x10000;
                charCodes.push((codePoint >> 10) + 0xD800, (codePoint & 0x3FF) + 0xDC00);
            }
        }
        // No validation is done: malformed input yields wrong characters or a RangeError.
        // Very large inputs can exceed the argument limit of apply().
        return String.fromCharCode.apply(null, charCodes);
    };
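The four-byte branch of getString() is the least obvious part, because JavaScript strings are UTF-16 and a code point above FFFF hex has to be split into two code units. The sketch below isolates that step. The helper name toSurrogatePair is hypothetical, and U+1F600 is just a sample code point.

```javascript
// Hypothetical helper: split a code point above 0xFFFF into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
    var reduced = codePoint - 0x10000;       // leaves a 20-bit value
    var high = (reduced >> 10) + 0xD800;     // top 10 bits
    var low = (reduced & 0x3FF) + 0xDC00;    // low 10 bits (same as % 0x400)
    return [high, low];
}

// U+1F600 is encoded in UTF-8 as the four bytes F0 9F 98 80.
var pair = toSurrogatePair(0x1F600);
var emoji = String.fromCharCode(pair[0], pair[1]);
```
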

Now we can read a UTF-8 encoded string as shown below. I am assuming a buffer is already available from a WebSocket or Ajax response.
 var dataview = new DataView(buffer);  
 dataview.getString(0,100); // decode 100 bytes starting at offset 0 (the buffer must hold at least 100 bytes)
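For comparison, environments that ship the standard TextEncoder and TextDecoder APIs (current browsers and Node.js) can decode a DataView directly. This is a different technique from the hand-written getString() above, shown here only as a sketch for checking results. The sample string stands in for data that would normally arrive over the network.

```javascript
// Build a sample buffer; in practice it would come from a WebSocket or Ajax response.
var sample = "h\u00E9llo \u20AC";                 // "héllo €"
var encoded = new TextEncoder().encode(sample);   // Uint8Array of UTF-8 bytes
var sampleView = new DataView(encoded.buffer, encoded.byteOffset, encoded.byteLength);

// TextDecoder accepts an ArrayBuffer or any view on one, including a DataView.
var decoded = new TextDecoder("utf-8").decode(sampleView);
```

Decoding the same bytes with both approaches is a simple way to test a custom decoder against a known-good one.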
