lucene-java-user mailing list archives

From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: which unicode version is supported with lucene
Date Fri, 25 Feb 2011 12:43:35 GMT
On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
<bernd.fehling@uni-bielefeld.de> wrote:
> Hi Simon,
>
> thanks for the details.
>
> My platform supports and uses code points above the BMP (0x10000 and up).
> So the limit is Lucene.
> I don't know how to handle this problem.
> Maybe by deleting all code points above the BMP...???

The code will work fine even if they are in your text. It will just not
respect them, and may throw them away during tokenization etc., so it
really depends on what you are using on the analyzer side. Maybe you can
give us a little more detail on what you use for analysis. One option
would be to build 3.1 from source and use the analyzers from
there.
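
To illustrate what "thrown away during tokenization" can look like, here
is a minimal sketch in plain Java (no Lucene classes involved). The class
and method names are hypothetical and exist only for this example. It
contrasts a char-by-char letter filter, which sees each surrogate half on
its own, with a code-point-aware one using the Java 1.5 APIs:

```java
public class CodePointDemo {

    // Char-based filter: tests each UTF-16 code unit on its own.
    // A surrogate half is not a letter, so characters above the BMP
    // are silently dropped.
    public static String keepLettersByChar(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Code-point-based filter: reads a full code point (one or two chars)
    // and advances by the number of chars it occupies.
    public static String keepLettersByCodePoint(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+10400 (DESERET CAPITAL LETTER LONG I) written as a surrogate pair
        String text = "a\uD801\uDC00b";
        System.out.println(keepLettersByChar(text).length());      // 2
        System.out.println(keepLettersByCodePoint(text).length()); // 4
    }
}
```

The char-based variant keeps only "a" and "b", while the code-point
variant keeps the supplementary letter as well. Whether a given 3.0
analyzer behaves like the first variant depends on the analyzer.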

>
> Good to hear that Lucene 3.1 will come soon.
> Any rough estimation when Lucene 3.1 will be available?

I hope it will happen within the next 4 weeks.

simon

>
> Regards,
> Bernd
>
> Am 25.02.2011 12:04, schrieb Simon Willnauer:
>> Hey Bernd,
>>
>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
>> <bernd.fehling@uni-bielefeld.de> wrote:
>>> Dear list,
>>>
>>> a very basic question about lucene, which version of
>>> unicode can be handled (indexed and searched) with lucene?
>>
>> If you are asking what the indexer / query can handle, then it is really
>> what UTF-8 can handle. Strings passed to the writer / reader are
>> converted to UTF-8 internally (rough picture). On trunk we are
>> indexing bytes only (UTF-8 bytes by default). So the question is
>> really what your platform supports in terms of utilities / operations
>> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
>> have the possibility to respect code points above the BMP.
>> Lucene 2.9 still has Java 1.4 system requirements, which prevented us
>> from moving forward to Unicode 4.0. If you look at Character.java in
>> Java 1.5, the methods have gained variants that operate on UTF-32 code
>> points instead of only the UTF-16 code units (chars) of Java 1.4.
>>
>> Since 3.0 is only a Java generics / move-to-Java-1.5 release, these
>> APIs are not yet in use in the latest released version. Lucene 3.1
>> holds a largely converted Analyzer / TokenFilter / Tokenizer codebase
>> (I think there are one or two which still have problems, I should
>> check... Robert, did we fix all the NGram stuff?).
>>
>> So, long story short: the 3.0 / 2.9 Tokenizer and TokenFilter classes
>> will only support characters within the BMP (<= 0xFFFF). 3.1 (to be
>> released soon, I hope) will fix most of the problems and includes
>> ICU-based analysis for full Unicode 5 support.
>>
>> hope that helps
>>
>> simon
>>>
>>> It looks like lucene can only handle the very old Unicode 2.0
>>> but not the newer 3.1 version (4 byte utf-8 unicode).
>>>
>>> Is that true?
>>>
>>> Regards,
>>> Bernd
>>>
>
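
As a footnote to the "4 byte utf-8" remark in the original question: the
following is a small sketch, assuming only the standard JDK (Java 7 or
later for StandardCharsets) and a hypothetical class name. It shows that
a code point above the BMP is two chars in a Java String but a single
4-byte sequence once encoded as UTF-8, and that it survives the round
trip:

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {

    // Number of bytes the string occupies when encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Encode to UTF-8 and decode again; returns true if nothing was lost.
    public static boolean roundTrips(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        String supplementary = "\uD801\uDC00"; // U+10400 as a surrogate pair
        System.out.println(supplementary.length());       // 2 chars
        System.out.println(utf8Length(supplementary));    // 4 bytes
        System.out.println(roundTrips(supplementary));    // true
    }
}
```

So the byte-level encoding itself is not the limiting factor; as
discussed above, the limit lies in char-based analysis code.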

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

