On Mon, May 9, 2011 at 5:32 PM, Provalov, Ivan
<Ivan.Provalov@cengage.com> wrote:
> We are planning to ingest some non-English content into our application. All content
> is OCR'ed and there are a lot of misspellings and garbage terms because of this. Each document
> has one primary language with some exceptions (e.g. a few English terms mixed in with primarily
> non-English document terms).
>
sounds like you should talk to Tom Burton-West!
> 1. Does it make sense to mix two or more different Latin-based languages in the same
> index directory in Lucene (e.g. Spanish/French/English)?
I think it depends upon the application. If the user is specifying the
language via the UI somehow, then it's probably simplest to just use
a different index for each collection.
> 2. What about mixing Latin and non-Latin languages? We ran tests on English and Chinese
> collections mixed together and didn't see any negative impact (precision/recall). Any other
> potential issues?
Right, none of the terms would overlap here... the only "issue" would
be a skewed maxDoc (the document count that feeds into idf), but this
is probably not a big deal at all. But what's the benefit to mixing them?
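
To put a rough number on the maxDoc skew, here is a small sketch using the
classic idf formula from Lucene 3.x DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)).
The class name IdfSkew and all document counts are made up for illustration;
this is plain Java, not a call into Lucene.

```java
public class IdfSkew {
    // Classic idf as in Lucene 3.x DefaultSimilarity:
    // 1 + ln(numDocs / (docFreq + 1)).
    public static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int englishDocs = 1_000_000; // hypothetical collection sizes
        int chineseDocs = 1_000_000;
        int docFreq = 1_000;         // an English term found only in English docs

        double separate = idf(docFreq, englishDocs);
        double mixed = idf(docFreq, englishDocs + chineseDocs);

        // With non-overlapping vocabularies, docFreq stays the same and only
        // numDocs grows, so every term's idf shifts by the same constant.
        System.out.printf("separate=%.4f mixed=%.4f shift=%.4f%n",
                separate, mixed, mixed - separate);
    }
}
```

Under these assumptions the shift is ln(2) for every English term, regardless
of its docFreq, which is why the skew is unlikely to matter much in practice.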
> 3. Any recommendations for an Urdu analyzer?
>
You can always start with StandardAnalyzer, as it will tokenize Urdu
text... you might also be able to make use of resources such as
http://www.crulp.org/software/ling_resources/UrduClosedClassWordsList.htm
and http://www.crulp.org/software/ling_resources/UrduHighFreqWords.htm
as a stoplist.
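
As a rough sketch of the stoplist idea: the code below assumes the word lists
have been saved by hand as plain text, one word per line (the pages above are
HTML, so some cleanup would be needed first). The class and method names are
hypothetical, the sample words are common Urdu function words chosen for
illustration rather than copied from those lists, and the whitespace split is
only a stand-in for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UrduStoplistSketch {
    // Parse a stoplist given as text with one word per line.
    // Blank lines and lines starting with '#' are skipped.
    public static Set<String> parseStoplist(String text) {
        Set<String> stop = new HashSet<>();
        for (String line : text.split("\\R")) {
            String word = line.trim();
            if (!word.isEmpty() && !word.startsWith("#")) {
                stop.add(word);
            }
        }
        return stop;
    }

    // Split on whitespace and drop stopwords (stand-in for tokenizer + stop filter).
    public static List<String> removeStopwords(String text, Set<String> stop) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && !stop.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = parseStoplist("# sample entries\nکا\nکی\nاور\nہے\n");
        System.out.println(removeStopwords("لاہور کا موسم اور کراچی", stop));
    }
}
```

In Lucene itself, a set built this way would be handed to the analyzer as its
stop set instead of filtering by hand; check the analyzer constructors in the
version you are running.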