lucene-java-user mailing list archives

From "Burton-West, Tom" <tburt...@umich.edu>
Subject RE: Non-English Languages Search
Date Fri, 13 May 2011 15:38:05 GMT
Hi Ivan and Robert,

>> sounds like you should talk to Tom Burton-West!
Ok, I'll bite.

A few questions:

1. Are you planning to have separate fields for each language, or the same
fields with content in different languages?
2. If the latter, are you planning to have a field indicating the language so
you can do filter queries? (A sketch of what I mean is below.)
3. Do you need to accommodate searches where you don't know what language the
user is searching in?
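
For #2, a minimal sketch of the kind of filter query I mean, assuming Lucene
3.x and a not-analyzed "lang" field (field and variable names here are just
for illustration):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    // Assumes each doc was indexed with something like:
    //   doc.add(new Field("lang", "fr", Field.Store.NO, Field.Index.NOT_ANALYZED));
    // and that 'searcher' is an already-open IndexSearcher.
    Query userQuery = new TermQuery(new Term("text", "bonjour"));
    Filter frenchOnly = new CachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("lang", "fr"))));
    TopDocs hits = searcher.search(userQuery, frenchOnly, 10);

Caching the filter is worthwhile if the same language restriction is applied
to many queries.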

>> 2. What about mixing Latin and non-Latin languages?  We ran tests on English
>> and Chinese collections mixed together and didn't see any negative impact
>> (precision/recall).

Interesting.  I've wondered whether mixing languages would cause any issues
with idf stats in the ranking formula, especially if the number of documents
in each language is significantly different.
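
A back-of-the-envelope illustration with made-up numbers: DefaultSimilarity
computes idf as 1 + ln(numDocs/(docFreq+1)), and numDocs counts every document
in the index regardless of language, so the same term can score quite
differently once another language's documents are mixed in:

    import org.apache.lucene.search.DefaultSimilarity;

    DefaultSimilarity sim = new DefaultSimilarity();
    // A term in 50k docs of a 100k-doc single-language index...
    float idfAlone = sim.idf(50000, 100000);    // ~1.69
    // ...vs. the same term after 900k docs in another language are added.
    float idfMixed = sim.idf(50000, 1000000);   // ~4.00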

This may not be relevant to your use case. We found that dirty OCR combined
with multiple languages can produce a very large number of unique terms.  In a
large enough index, this makes multi-term queries (e.g. prefix and wildcard
queries) computationally expensive.  It can also seriously increase memory
use.  We started by changing the termInfosIndexDivisor to deal with this at
search time (http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again),
but when we were re-indexing we discovered that the termInfosIndexDivisor
doesn't currently affect the IndexReader opened during indexing, so we changed
the termIndexInterval from 128 to 1024.  This took our memory use from over
18GB down to under 4GB and also eliminated long stop-the-world garbage
collection pauses.  (Index size is about 350GB.)
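
In case it's useful to anyone else, the two changes amount to something like
this (Lucene 3.x API; the path and the divisor value of 4 are just examples,
not what I'd recommend universally):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.*;
    import org.apache.lucene.util.Version;
    import java.io.File;

    Directory dir = FSDirectory.open(new File("/path/to/index"));

    // Index time: sample the term dictionary every 1024 terms instead of 128.
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_31,
        new StandardAnalyzer(Version.LUCENE_31));
    cfg.setTermIndexInterval(1024);
    IndexWriter writer = new IndexWriter(dir, cfg);

    // Search time: load only every 4th indexed term into RAM.
    IndexReader reader = IndexReader.open(dir, null, true, 4);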

Tom

-----Original Message-----
On Mon, May 9, 2011 at 5:32 PM, Provalov, Ivan
<Ivan.Provalov@cengage.com> wrote:
> We are planning to ingest some non-English content into our application.  All
> content is OCR'ed, and there are a lot of misspellings and garbage terms
> because of this.  Each document has one primary language, with some exceptions
> (e.g. a few English terms mixed in with primarily non-English document terms).
>

sounds like you should talk to Tom Burton-West!

> 1. Does it make sense to mix two or more different Latin-based languages in
> the same index directory in Lucene (e.g. Spanish/French/English)?

I think it depends upon the application. If the user is specifying the
language via the UI somehow, then it's probably simplest to just use
different indexes for each collection.

> 2. What about mixing Latin and non-Latin languages?  We ran tests on English
> and Chinese collections mixed together and didn't see any negative impact
> (precision/recall).  Any other potential issues?

Right, none of the terms would overlap here... the only "issue" would be a
skewed maxDoc, but this is probably not a big deal at all. But what's the
benefit of mixing them?

> 3. Any recommendations for an Urdu analyzer?
>

You can always start with StandardAnalyzer, as it will tokenize the text...
you might be able to make use of resources such as
http://www.crulp.org/software/ling_resources/UrduClosedClassWordsList.htm
and http://www.crulp.org/software/ling_resources/UrduHighFreqWords.htm
as a stoplist.
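
e.g. something like this (Lucene 3.x; the filename is made up, you'd build
the file from the lists above, one word per line):

    import org.apache.lucene.analysis.WordlistLoader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;
    import java.io.File;
    import java.util.Set;

    // Load the Urdu stopword file and hand it to StandardAnalyzer.
    Set<String> urduStopwords =
        WordlistLoader.getWordSet(new File("urdu_stopwords.txt"));
    StandardAnalyzer analyzer =
        new StandardAnalyzer(Version.LUCENE_31, urduStopwords);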

