From: "Uwe Schindler"
Reply-To: java-user@lucene.apache.org
Subject: RE: which unicode version is supported with lucene
Date: Fri, 25 Feb 2011 14:53:56 +0100

Which API are you using to communicate with Solr? If you are sending XML,
you may be limited by the XML parser in use. If you are using SolrJ with
the binary request handler, the characters should go through in all cases.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> Sent: Friday, February 25, 2011 2:48 PM
> To: java-user@lucene.apache.org
> Subject: Re: which unicode version is supported with lucene
>
>
> So Solr trunk should already handle Unicode above BMP for field type string?
> Strange...
>
> Regards,
> Bernd
>
> On 25.02.2011 14:40, Uwe Schindler wrote:
> > Solr trunk is using Lucene trunk since Lucene and Solr are merged.
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >> -----Original Message-----
> >> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> >> Sent: Friday, February 25, 2011 2:19 PM
> >> To: simon.willnauer@gmail.com
> >> Cc: java-user@lucene.apache.org
> >> Subject: Re: which unicode version is supported with lucene
> >>
> >> Hi Simon,
> >>
> >> actually I'm working with Solr from trunk but followed the problem
> >> all the way down to Lucene. I think Solr trunk is built with Lucene 3.0.3.
> >>
> >> My field is:
> >> stored="true" />
> >>
> >> No analysis is done at all, the content is just stored for result display.
> >> But the result is unpredictable and can end up as invalid UTF-8.
> >>
> >> Regards,
> >> Bernd
> >>
> >>
> >> On 25.02.2011 13:43, Simon Willnauer wrote:
> >>> On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
> >>> wrote:
> >>>> Hi Simon,
> >>>>
> >>>> thanks for the details.
> >>>>
> >>>> My platform supports and uses code points above the BMP (0x10000 and up).
> >>>> So the limit is Lucene.
> >>>> I don't know how to handle this problem.
> >>>> Maybe by deleting all code points above the BMP...???
> >>>
> >>> the code will work fine even if they are in your text. It will just
> >>> not respect them, maybe throw them away during tokenization etc., so
> >>> it really depends on what you are using on the analyzer side. Maybe you
> >>> can give us a little more detail on what you use for analysis. One
> >>> option would be to build 3.1 from the source and use the analyzers
> >>> from there?!
> >>>
> >>>>
> >>>> Good to hear that Lucene 3.1 will come soon.
> >>>> Any rough estimate of when Lucene 3.1 will be available?
> >>>
> >>> I hope it will happen within the next 4 weeks
> >>>
> >>> simon
> >>>
> >>>>
> >>>> Regards,
> >>>> Bernd
> >>>>
> >>>> On 25.02.2011 12:04, Simon Willnauer wrote:
> >>>>> Hey Bernd,
> >>>>>
> >>>>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
> >>>>> wrote:
> >>>>>> Dear list,
> >>>>>>
> >>>>>> a very basic question about lucene, which version of unicode can
> >>>>>> be handled (indexed and searched) with lucene?
> >>>>>
> >>>>> if you ask for what the indexer / query can handle then it is
> >>>>> really what UTF-8 can handle. Strings passed to the writer /
> >>>>> reader are converted to UTF-8 internally (rough picture). On trunk
> >>>>> we are indexing bytes only (UTF-8 bytes by default). So the
> >>>>> question is really what your platform supports in terms of
> >>>>> utilities / operations on characters and strings. Since Lucene 3.0
> >>>>> we are on Java 1.5 and have the possibility to respect code points
> >>>>> which are above the BMP.
> >>>>> Lucene 2.9 still has Java 1.4 system requirements that prevented
> >>>>> us from moving forward to Unicode 4.0. If you look at
> >>>>> Character.java, the methods have been extended to operate on
> >>>>> UTF-32 code points instead of only the UTF-16 code units of Java 1.4.
> >>>>>
> >>>>> Since 3.0 is a Java Generics / move to Java 1.5 only release, these
> >>>>> APIs are not in use yet in the latest released version. Lucene 3.1
> >>>>> holds a largely converted Analyzer / TokenFilter / Tokenizer
> >>>>> codebase (I think there are one or two which still have problems,
> >>>>> I should check... Robert, did we fix all NGram stuff?).
> >>>>>
> >>>>> So long story short: 3.0 / 2.9 Tokenizer and TokenFilter will only
> >>>>> support characters within the BMP (<= 0xFFFF). 3.1 (to be released
> >>>>> soon I hope) will fix most of the problems and includes ICU-based
> >>>>> analysis for full Unicode 5 support.
> >>>>>
> >>>>> hope that helps
> >>>>>
> >>>>> simon
> >>>>>>
> >>>>>> It looks like lucene can only handle the very old Unicode 2.0 but
> >>>>>> not the newer 3.1 version (4 byte utf-8 unicode).
> >>>>>>
> >>>>>> Is that true?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Bernd
> >>>>>>
> >>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
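
For readers of the archive, the distinction discussed in this thread (characters within the BMP versus code points at 0x10000 and up) can be illustrated with a small Java sketch. This is not Lucene or Solr code. The class name SupplementaryCodePoints and its helper methods are made up for the example, and it assumes Java 7 or later for StandardCharsets (the code point methods themselves exist since Java 1.5). It shows that one code point above the BMP occupies two Java chars and four UTF-8 bytes, and that char-based processing can split it into a lone surrogate, which is one plausible way to end up with broken output.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene or Solr.
final class SupplementaryCodePoints {

    /** Number of Unicode code points in s; a surrogate pair counts once. */
    public static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Walks the string one code point at a time instead of one char at a time. */
    public static int[] toCodePoints(String s) {
        int[] out = new int[codePointCount(s)];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp); // advances by 2 for code points above the BMP
        }
        return out;
    }

    /** Length in bytes of the UTF-8 encoding of s. */
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies above the BMP.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());        // 2 (UTF-16 code units)
        System.out.println(codePointCount(clef)); // 1 (code point)
        System.out.println(utf8Length(clef));     // 4 (UTF-8 bytes)

        // A char-based cut splits the pair and leaves a lone high surrogate.
        String broken = clef.substring(0, 1);
        System.out.println(Character.isHighSurrogate(broken.charAt(0))); // true
    }
}
```

Code that iterates with charAt and a plain index behaves like the first println, while code that uses codePointAt and Character.charCount behaves like toCodePoints above.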