lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Simon Willnauer <>
Subject Re: which unicode version is supported with lucene
Date Fri, 25 Feb 2011 11:04:54 GMT
Hey Bernd,

On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
<> wrote:
> Dear list,
> a very basic question about lucene, which version of
> unicode can be handled (indexed and searched) with lucene?

if you ask for what the indexer / query can handle then it is really
what UTF-8 can handle. Strings passed to the writer / reader are
converted to UTF-8 internally (rough picture). On Trunk we are
indexing bytes only (UTF-8 bytes by default). so the question is
really what you platform supports in terms of utilities / operations
on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
have the possibility to respect code points which are above the BMP.
Lucene 2.9 still has Java 1.4 System Requirements that prevented us
from moving forward to Unicode 4.0. If you look at all
methods have been converted to operate on UTF-32 code points instead
of UTF-16 code points in Java 1.4.

Since 3.0 is a Java Generics / move to Java 1.5 only release these
APIs are not in use yet in the latest released version. Lucene 3.1
holds a largely converted Analyzer / TokenFilter / Tokenizer codebase
(I think there are one or two which still have problems, I should
check... Robert did we fix all NGram stuff?).

So long story short 3.0 / 2.9 Tokenizer and TokenFilter will only
support characters within the BMP <= 0xFFFF. 3.1 (to be released soon
I hope) will fix most of the problems and includes ICU based analysis
for full Unicode 5 support.

hope that helps

> It looks like lucene can only handle the very old Unicode 2.0
> but not the newer 3.1 version (4 byte utf-8 unicode).
> Is that true?
> Regards,
> Bernd
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message