lucene-java-user mailing list archives

From: Robert Muir <rcm...@gmail.com>
Subject: Re: Non-English Languages Search
Date: Wed, 11 May 2011 13:24:14 GMT
On Mon, May 9, 2011 at 5:32 PM, Provalov, Ivan
<Ivan.Provalov@cengage.com> wrote:
> We are planning to ingest some non-English content into our application. All content
> is OCR'ed, and there are a lot of misspellings and garbage terms because of this. Each
> document has one primary language, with some exceptions (e.g. a few English terms mixed
> in with primarily non-English document terms).
>

Sounds like you should talk to Tom Burton-West!

> 1. Does it make sense to mix two or more different Latin-based languages in the same
> index directory in Lucene (e.g. Spanish/French/English)?

I think it depends on the application. If the user is specifying the
language via the UI somehow, then it's probably simplest to just use a
separate index for each collection.
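
Here's a minimal sketch of that separate-index setup, assuming Lucene 3.1
(FrenchAnalyzer comes from the contrib analyzers module; the class name,
language codes, and directory paths are invented for illustration):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerLanguageIndexes {
  public static void main(String[] args) throws Exception {
    // one analyzer per collection; the language code would come from the UI
    Map<String, Analyzer> analyzers = new HashMap<String, Analyzer>();
    analyzers.put("en", new StandardAnalyzer(Version.LUCENE_31));
    analyzers.put("fr", new FrenchAnalyzer(Version.LUCENE_31));

    // write each collection to its own index directory
    for (Map.Entry<String, Analyzer> e : analyzers.entrySet()) {
      Directory dir = FSDirectory.open(new File("index-" + e.getKey()));
      IndexWriter writer = new IndexWriter(dir,
          new IndexWriterConfig(Version.LUCENE_31, e.getValue()));
      // ... writer.addDocument(doc) for that collection's documents ...
      writer.close();
    }
  }
}

At query time you then just open the one index matching the language the
user picked.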

> 2. What about mixing Latin and non-Latin languages? We ran tests on English and
> Chinese collections mixed together and didn't see any negative impact
> (precision/recall). Any other potential issues?

Right, none of the terms would overlap here... the only "issue" would
be a skewed maxDoc, but this is probably not a big deal at all. But
what's the benefit of mixing them?
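
For intuition about the maxDoc skew, here's a back-of-the-envelope sketch
using the idf formula from Lucene's DefaultSimilarity:

  idf(t) = 1 + log(numDocs / (docFreq(t) + 1))

numDocs comes from maxDoc, so merging the collections inflates it for every
term; but docFreq of, say, a Chinese term still only counts the Chinese
documents it occurs in. Every Chinese term's idf therefore shifts up by
roughly the same constant (the log of the collection-size ratio), which is
why the skew mostly washes out in ranking.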

> 3. Any recommendations for an Urdu analyzer?
>

You can always start with StandardAnalyzer, since it will tokenize Urdu text.
You might also be able to make use of resources such as
http://www.crulp.org/software/ling_resources/UrduClosedClassWordsList.htm
and http://www.crulp.org/software/ling_resources/UrduHighFreqWords.htm
as a stoplist.
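
Here's a minimal sketch of that combination, again assuming Lucene 3.1
(urdu-stopwords.txt is a hypothetical file you'd build yourself from the
CRULP lists above, one word per line):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.Set;

import org.apache.lucene.analysis.WordlistLoader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class UrduAnalyzerSketch {
  public static void main(String[] args) throws Exception {
    // read the stopword file explicitly as UTF-8, since the CRULP
    // lists are Arabic-script text
    Set<String> urduStopwords = WordlistLoader.getWordSet(
        new InputStreamReader(new FileInputStream("urdu-stopwords.txt"), "UTF-8"));

    // as of 3.1, StandardAnalyzer tokenizes on UAX#29 word boundaries,
    // which copes with Arabic-script text like Urdu; the supplied set
    // replaces the default English stoplist
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31, urduStopwords);
    // ... hand `analyzer` to IndexWriterConfig / QueryParser as usual ...
  }
}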


