lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tansley, Robert" <robert.tans...@hp.com>
Subject RE: Indexing multiple languages
Date Thu, 02 Jun 2005 19:42:51 GMT
Thanks all for the useful comments.

It seems that there are even more options --

4/ One index, with a separate Lucene document for each (item,language) combination, with one
field that specifies the language
5/ One index, one Lucene document per item, with field names that include the language (e.g.
title_en, title_cn)

I quite like 4, because you can search with no language constraint, or with one as Paul suggests
below.  However, some "non language-specific" data might need to be repeated (e.g. dates),
unless we had an extra Lucene document for all that.  I wonder what the various pros and cons
in terms of index size and performance would be in each case?  I really don't have enough
knowledge of Lucene to have any idea...

 Robert Tansley / Digital Media Systems Programme / HP Labs
  http://www.hpl.hp.com/personal/Robert_Tansley/

> -----Original Message-----
> From: Paul Libbrecht [mailto:paul@activemath.org] 
> Sent: 01 June 2005 04:10
> To: java-user@lucene.apache.org
> Subject: Re: Indexing multiple languages
> 
> Le 1 juin 05, à 01:12, Erik Hatcher a écrit :
> >> 1/ one index for all languages
> >> 2/ one index for all languages, with an extra language field so 
> >> searches
> >> can be constrained to a particular language
> >> 3/ separate indices for each language?
> > I would vote for option #2 as it gives the most flexibilty 
> - you can 
> > query with or without concern for language.
> 
> The way I've solved this is to make a different field-name 
> per-language 
> as our documents can be multilingual.
> What's then done is query expansion at query time: given a term-query 
> for text, I duplicate it for each accepted language of the 
> user with a 
> factor related to the preference of the language (e.g. the q 
> factor in 
> Accept-Language http header). Presumably I could be using solution 2/ 
> as well if my queries become too big, making several 
> documents for each 
> language of the document.
> 
> I think it's very important to care about guessing the accepted 
> languages of the user. Typically, the default behaviour of 
> Google is to 
> only give you matches in your primary language but then allow 
> expansion 
> in any language.
> 
> >> On the other hand, if people are searching for proper nouns in 
> >> metadata
> >> (e.g. "DSpace") it may be advantageous to search all languages at 
> >> once.
> 
> This one may need particular treatment.
> 
> Tell us your success!
> 
> paul
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message