lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: How to query/search unicoded docs in lucene using unicode text as query?
Date Thu, 21 May 2009 14:10:47 GMT
Hello, your example (Hindi) is probably suffering from a number of search
issues:

I don't recommend StandardAnalyzer for this example: it will break words
around the dependent vowels, the nukta dot, etc.
WhitespaceAnalyzer might be a good start.
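
A rough sketch of that failure mode, using only the JDK and not Lucene's own tokenizers: it assumes a tokenizer that keeps only characters for which Character.isLetter is true, and compares that with a plain whitespace split. The class and method names here are made up for illustration. Devanagari dependent vowels and the nukta are combining marks, not letters, so the letter-based split cuts each word into fragments.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSplitSketch {
    // Hypothetical stand-in for a letter-based tokenizer: a token is a
    // maximal run of characters for which Character.isLetter is true.
    static List<String> splitOnNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A combining mark or space ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Hypothetical stand-in for a whitespace tokenizer.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The query from the original message.
        String query = "\u0938\u0941\u0939\u093E\u0928\u093E\u0020\u0938\u092B\u093C\u0930";
        System.out.println(splitOnNonLetters(query)); // 5 fragments
        System.out.println(splitOnWhitespace(query)); // 2 whole words
    }
}
```

With the whitespace split, both words survive intact, marks included, which is why it is the safer starting point here.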

Also, is it possible to apply Unicode normalization to your text before
indexing it?
Normalization will standardize how equivalent character sequences are
encoded, which matters for Indian languages.

In your example, the pha + nukta dot you queried on is the normalized form,
but I wonder if in your text it's encoded as fa (U+095E).
If you apply normalization form NFC, it will standardize to pha + nukta dot.
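
A minimal sketch of that normalization step, assuming Java 6 or later so that java.text.Normalizer is available. The class name, the toNfc helper and the sample strings are made up for illustration. The same call would need to be applied to both the indexed text and the query text before they reach the analyzer.

```java
import java.text.Normalizer;

public class NfcSketch {
    // Hypothetical helper: normalize text to NFC before indexing or querying.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Second word of the query, with fa encoded as the single code point U+095E.
        String precomposed = "\u0938\u095E\u0930";
        // Same word with pha (U+092B) followed by the nukta (U+093C), as in the query.
        String decomposed = "\u0938\u092B\u093C\u0930";

        // Without normalization the two strings differ, so the terms will not match.
        System.out.println(precomposed.equals(decomposed));        // false
        // U+095E is excluded from composition, so NFC yields pha + nukta.
        System.out.println(toNfc(precomposed).equals(decomposed)); // true
        // The pha + nukta form is already normalized and is left unchanged.
        System.out.println(toNfc(decomposed).equals(decomposed));  // true
    }
}
```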

On Thu, May 21, 2009 at 9:26 AM, KK <dioxide.software@gmail.com> wrote:

> Hi All,
> I've indexed some docs [non-English] in Unicode UTF-8 format. For both
> indexing and searching/querying I'm using SimpleAnalyzer. For English
> texts, when I tried single words it worked, so I thought of trying
> non-English texts. I wrote those words [multiple words] in BabelMap [a
> Unicode converter], got the Unicode escapes for the text string, and tried
> that as the query, but it didn't work. Earlier I used the same method to
> query a Solr index, which uses Lucene at the backend. I tried, say, this
> query:
> \u0938\u0941\u0939\u093E\u0928\u093E\u0020\u0938\u092B\u093C\u0930
> which is the escaped form of some non-English text, but this gives me zero
> search results in Lucene. I want to know what's going wrong. As far as I
> know, at the end of the day Lucene writes my non-English texts in Unicode,
> so if I read the index it'll have these kinds of characters on disk,
> right? So when I query using the same thing it should work. This used to
> work perfectly well with Solr, where I was indexing all docs in UTF-8
> encoding and the query was also escaped as shown above. Can someone point
> out what is going wrong here?
> Maybe I have to look at the analyzer Solr was using in the default
> setting [I used the default setting only, and I'm pretty sure it was using
> a lot of analyzers/filter factories]. Thanks for all your time and
> appreciation.
>
> Thanks,
> KK.
>



-- 
Robert Muir
rcmuir@gmail.com
