Subject: RE: Indexing multiple languages
Date: Thu, 2 Jun 2005 15:42:51 -0400
From: "Tansley, Robert" <robert.tansley@hp.com>
To: <java-user@lucene.apache.org>

Thanks all for the useful comments.

It seems that there are even more options --

4/ One index, with a separate Lucene document for each (item, language)
combination, with one field that specifies the language
5/ One index, one Lucene document per item, with field names that
include the language (e.g. title_en, title_cn)

I quite like 4, because you can search with no language constraint, or
with one as Paul suggests below.  However, some "non language-specific"
data might need to be repeated (e.g. dates), unless we had an extra
Lucene document for all that.  I wonder what the various pros and cons
in terms of index size and performance would be in each case?  I really
don't have enough knowledge of Lucene to have any idea...
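
To make the difference between 4 and 5 concrete, here is a minimal sketch
of the two layouts. It models a document as a plain map of field name to
value rather than using Lucene's own classes, and the class name, method
names, the "id" and "date" fields and the sample titles are all invented
for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: documents are modelled as field-name -> value maps,
// not as real Lucene documents.
class LanguageLayouts {

    // Option 4: one document per (item, language), with a "language" field.
    // Non language-specific data such as the date is repeated in each document.
    static List<Map<String, String>> perLanguageDocs(
            String itemId, String date, Map<String, String> titlesByLanguage) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map.Entry<String, String> e : titlesByLanguage.entrySet()) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("id", itemId);
            doc.put("language", e.getKey());
            doc.put("title", e.getValue());
            doc.put("date", date); // repeated once per language
            docs.add(doc);
        }
        return docs;
    }

    // Option 5: one document per item, with the language in the field name.
    // The date is stored only once.
    static Map<String, String> perFieldDoc(
            String itemId, String date, Map<String, String> titlesByLanguage) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", itemId);
        doc.put("date", date);
        for (Map.Entry<String, String> e : titlesByLanguage.entrySet()) {
            doc.put("title_" + e.getKey(), e.getValue());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> titles = new LinkedHashMap<>();
        titles.put("en", "Digital libraries");
        titles.put("cn", "Shuzi tushuguan");

        System.out.println(perLanguageDocs("item-1", "2005-06-02", titles));
        System.out.println(perFieldDoc("item-1", "2005-06-02", titles));
    }
}
```

With two languages, option 4 yields two documents that each carry the
date, while option 5 yields a single document with title_en and title_cn.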

 Robert Tansley / Digital Media Systems Programme / HP Labs
  http://www.hpl.hp.com/personal/Robert_Tansley/

> -----Original Message-----
> From: Paul Libbrecht [mailto:paul@activemath.org]
> Sent: 01 June 2005 04:10
> To: java-user@lucene.apache.org
> Subject: Re: Indexing multiple languages
>
> On 1 June 05, at 01:12, Erik Hatcher wrote:
> >> 1/ one index for all languages
> >> 2/ one index for all languages, with an extra language field so
> >> searches can be constrained to a particular language
> >> 3/ separate indices for each language?
> > I would vote for option #2 as it gives the most flexibility - you can
> > query with or without concern for language.
>
> The way I've solved this is to make a different field-name per language,
> as our documents can be multilingual.
> What's then done is query expansion at query time: given a term-query
> for text, I duplicate it for each accepted language of the user with a
> factor related to the preference of the language (e.g. the q factor in
> the Accept-Language HTTP header). Presumably I could use solution 2/
> as well if my queries become too big, making a separate document for
> each language of the document.
>
> I think it's very important to care about guessing the accepted
> languages of the user. Typically, the default behaviour of Google is to
> only give you matches in your primary language but then allow expansion
> in any language.
>
> >> On the other hand, if people are searching for proper nouns in
> >> metadata (e.g. "DSpace") it may be advantageous to search all
> >> languages at once.
>
> This one may need particular treatment.
>
> Tell us your success!
>
> paul
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
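
The query expansion Paul describes could look roughly like the sketch
below. It only builds a query string in Lucene's query-parser syntax
(field:term^boost) from an Accept-Language header; it is not Paul's
actual code. The class and method names and the title_<lang> field
naming are assumptions, the term is assumed to be a single word with no
characters that need escaping, and a malformed q value would throw a
NumberFormatException.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class LanguageQueryExpansion {

    // Parses e.g. "en-US, fr;q=0.8, de;q=0" into primary language -> q factor.
    // A language without an explicit q gets 1.0; the "*" wildcard is skipped.
    static Map<String, Double> parseAcceptLanguage(String header) {
        Map<String, Double> prefs = new LinkedHashMap<>();
        for (String part : header.split(",")) {
            String[] pieces = part.trim().split(";");
            // Keep only the primary subtag, so "en-US" becomes "en".
            String lang = pieces[0].trim().toLowerCase(Locale.ROOT).split("-")[0];
            if (lang.isEmpty() || lang.equals("*")) {
                continue;
            }
            double q = 1.0;
            for (int i = 1; i < pieces.length; i++) {
                String param = pieces[i].trim();
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2));
                }
            }
            // If a language appears twice (e.g. "en-US" and "en"), keep the higher q.
            prefs.merge(lang, q, Math::max);
        }
        return prefs;
    }

    // Duplicates the term once per accepted language, boosted by its q factor.
    static String expand(String baseField, String term, Map<String, Double> prefs) {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, Double> e : prefs.entrySet()) {
            if (e.getValue() <= 0.0) {
                continue; // q=0 marks a language the user does not accept
            }
            if (query.length() > 0) {
                query.append(" OR ");
            }
            query.append(baseField).append('_').append(e.getKey())
                 .append(':').append(term)
                 .append('^').append(e.getValue());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> prefs = parseAcceptLanguage("en-US, fr;q=0.8, de;q=0");
        System.out.println(expand("title", "dspace", prefs));
    }
}
```

The resulting string would still have to be handed to a query parser, or
the same loop could build the query objects directly. For proper nouns
such as "DSpace", the boosts could simply be dropped so that all
languages count equally.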
