Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
From: "Uwe Schindler" <uwe@thetaphi.de>
To: <java-user@lucene.apache.org>
References: 
 <14212_1298629418_ZZh0k3sw3i7S6.00_4D67831F.90801@uni-bielefeld.de>
 <AANLkTi=M6oAO2ZApNB-Cr+fpHL5Oct3PY+mCkbFbUVd9@mail.gmail.com>
 <23538_1298635379_ZZh0k6skodisY.00_4D679A50.1000308@uni-bielefeld.de>
 <AANLkTinc3Rm=sQCvm=tYUHFThFSC+DF8xNYB81jmdatY@mail.gmail.com>
 <23505_1298639997_ZZh0k5s9GBkrP.00_4D67AC5A.7080708@uni-bielefeld.de>
 <006e01cbd4f1$94590560$bd0b1020$@thetaphi.de>
 <23510_1298641750_ZZh0k1svNf89G.00_4D67B322.1010403@uni-bielefeld.de>
In-Reply-To: 
 <23510_1298641750_ZZh0k1svNf89G.00_4D67B322.1010403@uni-bielefeld.de>
Subject: RE: which unicode version is supported with lucene
Date: Fri, 25 Feb 2011 14:53:56 +0100
Message-ID: <006f01cbd4f3$756a6700$603f3500$@thetaphi.de>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Thread-index: 
 AQJgT4j57ILItKNXquSmYbd5SLswTwDA6WRnAP3RwU0DHZnPRwDIHJzeAjA/VdMBTPweRpKgOiVw
Content-language: de

Which API are you using to communicate with Solr? If you are using XML,
you may be limited by the XML parser in use. If you are using SolrJ with
the binary request handler, it should go through in all cases.
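
To rule out the data itself before suspecting the transport, you could check
your strings on the client side. The following is only a minimal sketch; the
class and method names are made up for illustration and are not part of
Lucene, Solr or SolrJ. It assumes that "invalid UTF-8" shows up when a
surrogate pair has been split somewhere, which is a guess about your case. It
only uses java.lang.Character and String methods from the JDK
(StandardCharsets needs Java 7 or later).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene/Solr.
class SupplementaryCheck {

    /** True if the string contains at least one code point above the BMP (> U+FFFF). */
    public static boolean hasSupplementary(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return false;
    }

    /**
     * True if the string survives a UTF-8 encode/decode round trip unchanged.
     * An unpaired surrogate cannot be encoded and is replaced on encoding,
     * so the round trip fails for it.
     */
    public static boolean roundTripsAsUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above the BMP.
        String clef = new String(Character.toChars(0x1D11E));
        // A lone high surrogate, e.g. left over after cutting a string mid-pair.
        String broken = String.valueOf((char) 0xD834);

        System.out.println("clef has supplementary: " + hasSupplementary(clef));
        System.out.println("clef UTF-8 length:      "
                + clef.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("clef round trips:       " + roundTripsAsUtf8(clef));
        System.out.println("broken round trips:     " + roundTripsAsUtf8(broken));
    }
}
```

If a stored value fails the round-trip check before it is sent, the problem
is in the input rather than in the XML parser or in Lucene.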

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> Sent: Friday, February 25, 2011 2:48 PM
> To: java-user@lucene.apache.org
> Subject: Re: which unicode version is supported with lucene
>
>
> So Solr trunk should already handle Unicode above the BMP for field type
> string?
> Strange...
>
> Regards,
> Bernd
>=20
> On 25.02.2011 14:40, Uwe Schindler wrote:
> > Solr trunk is using Lucene trunk since Lucene and Solr are merged.
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >> -----Original Message-----
> >> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> >> Sent: Friday, February 25, 2011 2:19 PM
> >> To: simon.willnauer@gmail.com
> >> Cc: java-user@lucene.apache.org
> >> Subject: Re: which unicode version is supported with lucene
> >>
> >> Hi Simon,
> >>
> >> Actually, I'm working with Solr from trunk, but I followed the problem
> >> all the way down to Lucene. I think Solr trunk is built with Lucene 3.0.3.
> >>
> >> My field is:
> >> <field name="dcdescription" type="string" indexed="false"
> >> stored="true" />
> >>
> >> No analysis is done at all; the content is just stored for result display.
> >> But the result is unpredictable and can end up as invalid UTF-8.
> >>
> >> Regards,
> >> Bernd
> >>
> >>
> >> On 25.02.2011 13:43, Simon Willnauer wrote:
> >>> On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
> >>> <bernd.fehling@uni-bielefeld.de> wrote:
> >>>> Hi Simon,
> >>>>
> >>>> thanks for the details.
> >>>>
> >>>> My platform supports and uses code points above the BMP (0x10000 and up).
> >>>> So the limit is Lucene.
> >>>> I don't know how to handle this problem.
> >>>> Maybe delete all code points above the BMP...???
> >>>
> >>> The code will work fine even if they are in your text. It will just
> >>> not respect them, and may throw them away during tokenization etc., so
> >>> it really depends on what you are using on the analyzer side. Maybe you
> >>> can give us a little more detail on what you use for analysis. One
> >>> option would be to build 3.1 from source and use the analyzers
> >>> from there?!
> >>>
> >>>>
> >>>> Good to hear that Lucene 3.1 will come soon.
> >>>> Any rough estimation when Lucene 3.1 will be available?
> >>>
> >>> I hope it will happen within the next 4 weeks
> >>>
> >>> simon
> >>>
> >>>>
> >>>> Regards,
> >>>> Bernd
> >>>>
> >>>> On 25.02.2011 12:04, Simon Willnauer wrote:
> >>>>> Hey Bernd,
> >>>>>
> >>>>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
> >>>>> <bernd.fehling@uni-bielefeld.de> wrote:
> >>>>>> Dear list,
> >>>>>>
> >>>>>> a very basic question about Lucene: which version of Unicode can
> >>>>>> be handled (indexed and searched) with Lucene?
> >>>>>
> >>>>> If you are asking what the indexer / query can handle, then it is
> >>>>> really what UTF-8 can handle. Strings passed to the writer /
> >>>>> reader are converted to UTF-8 internally (rough picture). On trunk
> >>>>> we are indexing bytes only (UTF-8 bytes by default). So the
> >>>>> question is really what your platform supports in terms of
> >>>>> utilities / operations on characters and strings. Since Lucene 3.0
> >>>>> we are on Java 1.5 and have the possibility to respect code points
> >>>>> which are above the BMP.
> >>>>> Lucene 2.9 still has Java 1.4 system requirements, which prevented
> >>>>> us from moving forward to Unicode 4.0. If you look at
> >>>>> Character.java, Java 1.5 added methods that operate on UTF-32
> >>>>> code points (int) instead of the UTF-16 code units (char) that
> >>>>> Java 1.4 was limited to.
> >>>>>
> >>>>> Since 3.0 is a Java generics / move-to-Java-1.5-only release, these
> >>>>> APIs are not in use yet in the latest released version. Lucene 3.1
> >>>>> holds a largely converted Analyzer / TokenFilter / Tokenizer
> >>>>> codebase (I think there are one or two which still have problems,
> >>>>> I should check... Robert, did we fix all the NGram stuff?).
> >>>>>
> >>>>> So, long story short: the 3.0 / 2.9 Tokenizer and TokenFilter will only
> >>>>> support characters within the BMP (<= 0xFFFF). 3.1 (to be released
> >>>>> soon, I hope) will fix most of the problems and includes ICU-based
> >>>>> analysis for full Unicode 5 support.
> >>>>>
> >>>>> hope that helps
> >>>>>
> >>>>> simon
> >>>>>>
> >>>>>> It looks like Lucene can only handle the very old Unicode 2.0 but
> >>>>>> not the newer 3.1 version (4-byte UTF-8 Unicode).
> >>>>>>
> >>>>>> Is that true?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Bernd
> >>>>>>
> >>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
