lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@syr.edu>
Subject Re: Subject indexing and seraching documents with multiple languages
Date Tue, 09 May 2006 11:49:44 GMT
pbatcoi@gmx.net wrote:
> Grant,
>
> considering the answer from Karl, it seems that we have to choice to put all
> the documents in one index or use an index for each language. You are using
> an index for each language. We are currently discussing the pros and cons
> for both solutions. Thus we would be very interested to find out about your
> reasons to use a separate index for each language.
>
>   
Hmmm, it was a while ago and I am not 100% convinced I would make the 
same decision again, but I seem to recall a few reasons:

1. I thought it would be easier to manage them separately, as they all 
come from distinct collections.  We can easily take one language/index 
off line w/o affecting the other indexes.  I think this has been proved 
out over time in our case.  We often index the same set of documents 
several different ways (different stemmers, adding proper nouns, 
phrases, transliteration, translation, etc.), trying to evaluate which 
one gives us the best results.  Having them all in the same collection 
makes this harder.  What works for you probably depends on how often you 
update/delete, etc.

2. I am not certain of what it means to have an English query match 
against Arabic documents that contain English in them (or any other 
language).  On the surface this seems fine, since it is a term match, 
but I am not sure if it is meaningful when it comes to meeting a user's 
information need.  This is just a hunch and I am not sure it is a 
correct one.  I think it would probably warrant more study from a user's 
perspective.  I would like to hear other's opinions on this.

3. Logically, to me, our collections our separate.  They come from 
different sources, they are in different languages, they have different 
styles, different authors, etc.  It just _feels_ like they should be 
kept separate.  You may not care about this distinction.

One question I have always had is whether storing everything in the same 
index skews the IDF values by giving a term more importance than it 
really warrants.  My guess is it doesn't b/c this would happen evenly 
for all terms.  However, some of the times you have terms that occur in 
both collections, so the IDF for these may be smaller than it would be 
relative to having indexed the collection separately.  Is this valid or 
am I talking crazy?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message