lucene-java-user mailing list archives

From "Dennis Kubes" <nutch-...@dragonflymc.com>
Subject writeChars method in IndexOutput
Date Thu, 30 Mar 2006 17:09:07 GMT
I was reading up on the conversion of characters to UTF-8, and I now understand
why it writes out UTF-8 (to be able to support most of the world's
languages with minimal space?). But after reading up on the conversion
algorithm, as given below, does the writeChars method not support the
U+10000→U+10FFFF range, or am I misreading the code?

 


Character Range       Bit Encoding
U+0000→U+007F         0xxxxxxx
U+0080→U+07FF         110xxxxx 10xxxxxx
U+0800→U+FFFF         1110xxxx 10xxxxxx 10xxxxxx
U+10000→U+10FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
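For reference, the four ranges in the table can be sketched in Java roughly as follows. This is only an illustration of the table, not Lucene code; the class name Utf8TableSketch and the method encode are made up for this example, and it takes a full code point (an int) rather than a char.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: encodes one Unicode code point
// following the four rows of the table above.
final class Utf8TableSketch {

    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        return new byte[] {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    public static void main(String[] args) {
        // Compare against the JDK's own UTF-8 encoder for one sample per row.
        int[] samples = { 0x41, 0xE9, 0x20AC, 0x1D11E };
        for (int cp : samples) {
            byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(Integer.toHexString(cp) + " matches JDK: "
                + Arrays.equals(expected, encode(cp)));
        }
    }
}
```

The sketch does not reject the surrogate range U+D800→U+DFFF, which a strict encoder would.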

 

  public void writeChars(String s, int start, int length)
    throws IOException {

    final int end = start + length;
    for (int i = start; i < end; i++) {
      final int code = (int)s.charAt(i);

      if (code >= 0x01 && code <= 0x7F)
        writeByte((byte)code);
      else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
        writeByte((byte)(0xC0 | (code >> 6)));
        writeByte((byte)(0x80 | (code & 0x3F)));
      }
      else {
        writeByte((byte)(0xE0 | (code >>> 12)));
        writeByte((byte)(0x80 | ((code >> 6) & 0x3F)));
        writeByte((byte)(0x80 | (code & 0x3F)));
      }
    }
  }
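
One way to check the behaviour in question is to copy the loop into a standalone class and feed it a character above U+FFFF. The sketch below does that under a few assumptions: SurrogateCheck, encodeLikeWriteChars, and hex are hypothetical names (not Lucene API), and the bytes go to a ByteArrayOutputStream instead of writeByte so they can be inspected. It relies on the fact that String.charAt returns a single UTF-16 code unit, so a supplementary character reaches the loop as two surrogate chars.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical test harness, not part of Lucene.
final class SurrogateCheck {

    // Same branching as the writeChars loop quoted above, but collecting the
    // bytes in memory instead of calling writeByte.
    static byte[] encodeLikeWriteChars(String s, int start, int length) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        final int end = start + length;
        for (int i = start; i < end; i++) {
            final int code = s.charAt(i); // one UTF-16 code unit, never above 0xFFFF
            if (code >= 0x01 && code <= 0x7F) {
                out.write(code);
            } else if ((code >= 0x80 && code <= 0x7FF) || code == 0) {
                out.write(0xC0 | (code >> 6));
                out.write(0x80 | (code & 0x3F));
            } else {
                out.write(0xE0 | (code >>> 12));
                out.write(0x80 | ((code >> 6) & 0x3F));
                out.write(0x80 | (code & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D11E is stored in a Java String as the surrogate pair D834 DD1E.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println("chars in string:  " + clef.length());
        System.out.println("loop output:      " + hex(encodeLikeWriteChars(clef, 0, clef.length())));
        System.out.println("JDK UTF-8 output: " + hex(clef.getBytes(StandardCharsets.UTF_8)));
    }
}
```

If I am reading it right, each surrogate falls into the final else branch, so the character comes out as two three-byte sequences (six bytes) rather than the single four-byte sequence from the last row of the table. The code == 0 case in the second branch likewise writes U+0000 as two bytes instead of one.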

