lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: which unicode version is supported with lucene
Date Fri, 25 Feb 2011 13:53:56 GMT
What APIs are you using to communicate with Solr? If you are using XML, it may be limited
by the XML parser used... If you are using SolrJ with the binary request handler, it should
go through in all cases.
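
For illustration, indexing through SolrJ with the binary (javabin) request writer would look
roughly like this. The URL, id and field values are placeholders, and the class names
(CommonsHttpSolrServer, BinaryRequestWriter) are those of the SolrJ releases of that period:

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BinaryIndexingSketch {
    public static void main(String[] args) throws Exception {
        // Talk to Solr via the javabin format, so no XML parser sits in the path.
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.setRequestWriter(new BinaryRequestWriter());

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        // A value containing a supplementary character (U+1D50A, a surrogate pair in Java):
        doc.addField("dcdescription", "text with \uD835\uDD0A inside");
        server.add(doc);
        server.commit();
    }
}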

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> Sent: Friday, February 25, 2011 2:48 PM
> To: java-user@lucene.apache.org
> Subject: Re: which unicode version is supported with lucene
> 
> 
> So Solr trunk should already handle Unicode above the BMP for field type string?
> Strange...
> 
> Regards,
> Bernd
> 
> On 25.02.2011 14:40, Uwe Schindler wrote:
> > Solr trunk is using Lucene trunk, since Lucene and Solr development has been merged.
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >> -----Original Message-----
> >> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> >> Sent: Friday, February 25, 2011 2:19 PM
> >> To: simon.willnauer@gmail.com
> >> Cc: java-user@lucene.apache.org
> >> Subject: Re: which unicode version is supported with lucene
> >>
> >> Hi Simon,
> >>
> >> Actually I'm working with Solr from trunk but followed the problem
> >> all the way down to Lucene. I think Solr trunk is built with Lucene 3.0.3.
> >>
> >> My field is:
> >> <field name="dcdescription" type="string" indexed="false"
> >> stored="true" />
> >>
> >> No analysis is done at all; the content is just stored for result display.
> >> But the result is unpredictable and can end up as invalid UTF-8.
> >>
> >> Regards,
> >> Bernd
> >>
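As an aside: one common cause of such invalid UTF-8 in stored-only fields is an unpaired
UTF-16 surrogate in the Java String handed to Solr; when the value is later serialized, the
lone surrogate cannot be encoded cleanly. A minimal sketch of a pre-index check (the class
and method names are made up for illustration):

public final class SurrogateCheck {
    // Returns true if the string contains a high surrogate without a following
    // low surrogate, or a low surrogate without a preceding high surrogate.
    public static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 == s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return true;   // lone high surrogate
                }
                i++;               // skip the valid pair
            } else if (Character.isLowSurrogate(c)) {
                return true;       // lone low surrogate
            }
        }
        return false;
    }
}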
> >>
> >> On 25.02.2011 13:43, Simon Willnauer wrote:
> >>> On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
> >>> <bernd.fehling@uni-bielefeld.de> wrote:
> >>>> Hi Simon,
> >>>>
> >>>> thanks for the details.
> >>>>
> >>>> My platform supports and uses code points above the BMP (0x10000 and up).
> >>>> So the limit is Lucene.
> >>>> I don't know how to handle this problem.
> >>>> Maybe by deleting all code points above the BMP...???
> >>>
> >>> The code will work fine even if they are in your text. It will just
> >>> not respect them and may throw them away during tokenization etc., so
> >>> it really depends on what you are using on the analyzer side. Maybe
> >>> you can give us a little more detail on what you use for analysis.
> >>> One option would be to build 3.1 from source and use the analyzers
> >>> from there?!
> >>>
> >>>>
> >>>> Good to hear that Lucene 3.1 will come soon.
> >>>> Any rough estimation when Lucene 3.1 will be available?
> >>>
> >>> I hope it will happen within the next 4 weeks
> >>>
> >>> simon
> >>>
> >>>>
> >>>> Regards,
> >>>> Bernd
> >>>>
> >>>> On 25.02.2011 12:04, Simon Willnauer wrote:
> >>>>> Hey Bernd,
> >>>>>
> >>>>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
> >>>>> <bernd.fehling@uni-bielefeld.de> wrote:
> >>>>>> Dear list,
> >>>>>>
> >>>>>> a very basic question about lucene, which version of unicode can
> >>>>>> be handled (indexed and searched) with lucene?
> >>>>>
> >>>>> If you ask what the indexer / query can handle, then it is really
> >>>>> what UTF-8 can handle. Strings passed to the writer / reader are
> >>>>> converted to UTF-8 internally (rough picture). On trunk we are
> >>>>> indexing bytes only (UTF-8 bytes by default). So the question is
> >>>>> really what your platform supports in terms of utilities /
> >>>>> operations on characters and strings. Since Lucene 3.0 we are on
> >>>>> Java 1.5 and have the possibility to respect code points which are
> >>>>> above the BMP. Lucene 2.9 still has Java 1.4 system requirements,
> >>>>> which prevented us from moving forward to Unicode 4.0. If you look
> >>>>> at Character.java, all methods have been converted to operate on
> >>>>> UTF-32 code points instead of the UTF-16 code units that Java 1.4
> >>>>> was limited to.
> >>>>>
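A minimal sketch of the Java 1.5 code-point APIs mentioned above (the sample string is made
up; the methods shown are java.lang.String.codePointAt / codePointCount and
java.lang.Character.charCount):

public class CodePointSketch {
    public static void main(String[] args) {
        String s = "a\uD835\uDD0Ab";   // 'a', a supplementary code point (U+1D50A), 'b'
        System.out.println(s.length());                      // 4 -- UTF-16 code units (the pair counts as 2)
        System.out.println(s.codePointCount(0, s.length())); // 3 -- Unicode code points
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);                        // full code point, also above the BMP
            System.out.printf("U+%04X%n", cp);                // prints U+0061, U+1D50A, U+0062
            i += Character.charCount(cp);                     // advance by one or two chars
        }
    }
}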
> >>>>> Since 3.0 was a Java-generics / move-to-Java-1.5-only release, these
> >>>>> APIs are not yet used in the latest released version. Lucene 3.1
> >>>>> holds a largely converted Analyzer / TokenFilter / Tokenizer
> >>>>> codebase (I think there are one or two which still have problems,
> >>>>> I should check... Robert, did we fix all the NGram stuff?).
> >>>>>
> >>>>> So, long story short: the 3.0 / 2.9 Tokenizers and TokenFilters will
> >>>>> only support characters within the BMP (<= 0xFFFF). 3.1 (to be
> >>>>> released soon, I hope) will fix most of the problems and includes
> >>>>> ICU-based analysis for full Unicode 5 support.
> >>>>>
> >>>>> hope that helps
> >>>>>
> >>>>> simon
> >>>>>>
> >>>>>> It looks like lucene can only handle the very old Unicode 2.0 but
> >>>>>> not the newer 3.1 version (4 byte utf-8 unicode).
> >>>>>>
> >>>>>> Is that true?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Bernd
> >>>>>>
> >>>>
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

