From: Marvin Humphrey
Subject: Re: Hacking Luke for bytecount-based strings
Date: Wed, 17 May 2006 13:33:22 -0700
To: java-dev@lucene.apache.org
In-Reply-To: <446B668F.60300@apache.org>
On May 17, 2006, at 11:08 AM, Doug Cutting wrote:

> Marvin Humphrey wrote:
>> What I'd like to do is augment my existing patch by making it
>> possible to specify a particular encoding, both for Lucene and Luke.
>
> What ensures that all documents in fact use the same encoding?

In KinoSearch at this moment, zilch.

Lucene would still need to read everything into Java chars and then write it out using the specified encoding. If we opt for output buffering rather than output counting (the patch currently does counting, but that would have to change if we're flexible about encoding in the index), then String.getBytes(encoding) would guarantee it.

> The current approach of converting everything to Unicode and then
> writing UTF-8 to indexes makes indexes portable and simplifies the
> construction of search user interfaces, since only indexing code
> needs to know about other character sets and encodings.

Sure. OTOH, it's not so good for CJK users. I also opted against it in KinoSearch because A) compatibility with the current Java Lucene file format wasn't going to happen anyway, and B) not all Perlers use or require valid UTF-8.

I've considered adding a UTF8Enforcer Analyzer subclass, but it hasn't been an issue. Right now, if your source docs are mucked up, they'll be mucked up when you retrieve them after searching. If you want to fix that, you preprocess. Ensuring consistent encoding is the application developer's responsibility.

> If a collection has invalidly encoded text, how does it help to
> detect that later rather than sooner?

I *think* that whether it was invalidly encoded or not wouldn't impact searching -- it doesn't in KinoSearch. It should only affect display. Detecting invalidly encoded text later doesn't help anything in and of itself; lifting the requirement that everything be converted to Unicode early on opens up some options.
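As an aside -- not part of the patch, just a standalone sketch using the modern StandardCharsets constant for brevity -- here's an illustration of the String.getBytes(charset) route, and of why unsigned bytewise comparison of UTF-8 tracks code-point order while Java char order does not always do so:

```java
import java.nio.charset.StandardCharsets;

public class Utf8OrderDemo {
    // Unsigned bytewise comparison, mirroring what a byte-oriented
    // TermBuffer does.
    static int compareBytes(byte[] a, byte[] b) {
        int end = Math.min(a.length, b.length);
        for (int k = 0; k < end; k++) {
            int b1 = a[k] & 0xFF;   // mask to treat bytes as unsigned
            int b2 = b[k] & 0xFF;
            if (b1 != b2)
                return b1 - b2;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Explicit charset: the encoding of the output is guaranteed.
        String s1 = "caf\u00e9";    // e-acute, U+00E9
        String s2 = "caf\u00fa";    // u-acute, U+00FA
        byte[] u1 = s1.getBytes(StandardCharsets.UTF_8);
        byte[] u2 = s2.getBytes(StandardCharsets.UTF_8);

        // UTF-8 preserves code-point order under unsigned byte comparison:
        System.out.println(Integer.signum(compareBytes(u1, u2)));                  // -1
        System.out.println(Integer.signum(s1.codePointAt(3) - s2.codePointAt(3))); // -1

        // Where Java char order and code-point order diverge: U+FFFF sorts
        // after U+10000 by chars (surrogates are below 0xFFFF), but before
        // it by UTF-8 bytes, which match code-point order.
        String a = "\uFFFF";
        String b = new String(Character.toChars(0x10000)); // surrogate pair
        System.out.println(a.compareTo(b) > 0);            // true: char order
        System.out.println(compareBytes(
            a.getBytes(StandardCharsets.UTF_8),
            b.getBytes(StandardCharsets.UTF_8)) < 0);      // true: byte order
    }
}
```
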
>> Searches will continue to work regardless because the patched
>> TermBuffer compares raw bytes. (A comparison based on
>> Term.compareTo() would likely fail because raw bytes translated
>> to UTF-8 may not produce the same results.)
>
> UTF-8 has the property that bytewise lexicographic order is the
> same as Unicode character order.

Yes. I'm suggesting that an unpatched TermBuffer would have problems with my index, with its corrupt character data, because the sort order by bytestring may not be the same as the sort order by Unicode code point. However, the patched TermBuffer uses compareBytes() rather than compareChars(), so TermInfosReader should work fine.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

   public final int compareTo(TermBuffer other) {
     if (field == other.field)      // fields are interned
-      return compareChars(text, textLength, other.text, other.textLength);
+      return compareBytes(bytes, bytesLength, other.bytes, other.bytesLength);
     else
       return field.compareTo(other.field);
   }

-  private static final int compareChars(char[] v1, int len1,
-                                        char[] v2, int len2) {
+  private static final int compareBytes(byte[] bytes1, int len1,
+                                        byte[] bytes2, int len2) {
     int end = Math.min(len1, len2);
     for (int k = 0; k < end; k++) {
-      char c1 = v1[k];
-      char c2 = v2[k];
-      if (c1 != c2) {
-        return c1 - c2;
+      int b1 = (bytes1[k] & 0xFF);
+      int b2 = (bytes2[k] & 0xFF);
+      if (b1 != b2) {
+        return b1 - b2;
       }
     }
     return len1 - len2;
   }

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org