lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: which unicode version is supported with lucene
Date Fri, 25 Feb 2011 13:40:34 GMT
Solr trunk is using Lucene trunk since Lucene and Solr are merged.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> Sent: Friday, February 25, 2011 2:19 PM
> To: simon.willnauer@gmail.com
> Cc: java-user@lucene.apache.org
> Subject: Re: which unicode version is supported with lucene
> 
> Hi Simon,
> 
> actually I'm working with Solr from trunk but followed the problem all the
> way down to Lucene. I think Solr trunk is build with Lucene 3.0.3.
> 
> My field is:
> <field name="dcdescription" type="string" indexed="false" stored="true" />
> 
> No analysis done at all, just stored the content for result display.
> But the result is unpredictable and can end in invalid utf-8 code.
> 
> Regards,
> Bernd
> 
> 
> Am 25.02.2011 13:43, schrieb Simon Willnauer:
> > On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
> > <bernd.fehling@uni-bielefeld.de> wrote:
> >> Hi Simon,
> >>
> >> thanks for the details.
> >>
> >> My platform supports and uses code above BMP (0x10000 and up).
> >> So the limit is Lucene.
> >> Don't know how to handle this problem.
> >> May be deleting all code above BMP...???
> >
> > the code will work fine even if they are in you text. It will just not
> > respect them maybe throw them away during tokenization etc. so it
> > really depends what you are using on the analyzer side. maybe you can
> > give us little more details on what you use for analysis. One option
> > would be to build 3.1 from the source and use the analyzers from
> > there?!
> >
> >>
> >> Good to hear that Lucene 3.1 will come soon.
> >> Any rough estimation when Lucene 3.1 will be available?
> >
> > I hope it will happen within the next 4 weeks
> >
> > simon
> >
> >>
> >> Regards,
> >> Bernd
> >>
> >> Am 25.02.2011 12:04, schrieb Simon Willnauer:
> >>> Hey Bernd,
> >>>
> >>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
> >>> <bernd.fehling@uni-bielefeld.de> wrote:
> >>>> Dear list,
> >>>>
> >>>> a very basic question about lucene, which version of unicode can be
> >>>> handled (indexed and searched) with lucene?
> >>>
> >>> if you ask for what the indexer / query can handle then it is really
> >>> what UTF-8 can handle. Strings passed to the writer / reader are
> >>> converted to UTF-8 internally (rough picture). On Trunk we are
> >>> indexing bytes only (UTF-8 bytes by default). so the question is
> >>> really what you platform supports in terms of utilities / operations
> >>> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
> >>> have the possibility to respect code points which are above the BMP.
> >>> Lucene 2.9 still has Java 1.4 System Requirements that prevented us
> >>> from moving forward to Unicode 4.0. If you look at Character.java
> >>> all methods have been converted to operate on UTF-32 code points
> >>> instead of UTF-16 code points in Java 1.4.
> >>>
> >>> Since 3.0 is a Java Generics / move to Java 1.5 only release these
> >>> APIs are not in use yet in the latest released version. Lucene 3.1
> >>> holds a largely converted Analyzer / TokenFilter / Tokenizer
> >>> codebase (I think there are one or two which still have problems, I
> >>> should check... Robert did we fix all NGram stuff?).
> >>>
> >>> So long story short 3.0 / 2.9 Tokenizer and TokenFilter will only
> >>> support characters within the BMP <= 0xFFFF. 3.1 (to be released
> >>> soon I hope) will fix most of the problems and includes ICU based
> >>> analysis for full Unicode 5 support.
> >>>
> >>> hope that helps
> >>>
> >>> simon
> >>>>
> >>>> It looks like lucene can only handle the very old Unicode 2.0 but
> >>>> not the newer 3.1 version (4 byte utf-8 unicode).
> >>>>
> >>>> Is that true?
> >>>>
> >>>> Regards,
> >>>> Bernd
> >>>>
> >>
> 
> --
> **********************************************************
> ***
> Bernd Fehling                Universit├Ątsbibliothek Bielefeld
> Dipl.-Inform. (FH)                        Universit├Ątsstr. 25
> Tel. +49 521 106-4060                   Fax. +49 521 106-4052
> bernd.fehling@uni-bielefeld.de                33615 Bielefeld
> 
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> **********************************************************
> ***
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message