From: "Tansley, Robert"
Date: Thu, 2 Jun 2005 15:42:51 -0400
Subject: RE: Indexing multiple languages

Thanks all for the useful comments. It seems that there are even more
options:

4/ One index, with a separate Lucene document for each (item, language)
   combination, with one field that specifies the language
5/ One index, one Lucene document per item, with field names that
   include the language (e.g. title_en, title_cn)

I quite like 4, because you can search with no language constraint, or
with one as Paul suggests below. However, some "non language-specific"
data might need to be repeated (e.g. dates), unless we had an extra
Lucene document for all that. I wonder what the various pros and cons
in terms of index size and performance would be in each case? I really
don't have enough knowledge of Lucene to have any idea...

 Robert Tansley / Digital Media Systems Programme / HP Labs
  http://www.hpl.hp.com/personal/Robert_Tansley/

> -----Original Message-----
> From: Paul Libbrecht [mailto:paul@activemath.org]
> Sent: 01 June 2005 04:10
> To: java-user@lucene.apache.org
> Subject: Re: Indexing multiple languages
>
> On 1 June 05, at 01:12, Erik Hatcher wrote:
> >> 1/ one index for all languages
> >> 2/ one index for all languages, with an extra language field so
> >> searches can be constrained to a particular language
> >> 3/ separate indices for each language?
> > I would vote for option #2 as it gives the most flexibility - you can
> > query with or without concern for language.
>
> The way I've solved this is to make a different field-name per-language
> as our documents can be multilingual.
> What's then done is query expansion at query time: given a term-query
> for text, I duplicate it for each accepted language of the user with a
> factor related to the preference of the language (e.g. the q factor in
> the Accept-Language HTTP header).
> Presumably I could be using solution 2/ as well if my queries become
> too big, making several documents for each language of the document.
>
> I think it's very important to care about guessing the accepted
> languages of the user. Typically, the default behaviour of Google is to
> only give you matches in your primary language but then allow expansion
> in any language.
>
> >> On the other hand, if people are searching for proper nouns in
> >> metadata (e.g. "DSpace") it may be advantageous to search all
> >> languages at once.
>
> This one may need particular treatment.
>
> Tell us your success!
>
> paul
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
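
For anyone wanting to try the query expansion Paul describes in the quoted message, here is a rough sketch of the idea. It is not his code. It assumes per-language field names of the form title_en, title_fr (as in option 5), and it only builds a query string in the classic Lucene query parser syntax (field:term^boost) rather than calling any Lucene API. The class and method names (LanguageQueryExpansion, parseAcceptLanguage, expand) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch: expand one term over per-language fields, boosted by Accept-Language q values. */
final class LanguageQueryExpansion {

    /** One accepted language and its preference weight (the HTTP "q" value). */
    static final class LanguagePreference {
        final String language;
        final double q;

        LanguagePreference(String language, double q) {
            this.language = language;
            this.q = q;
        }
    }

    /** Parses a header such as "en, fr;q=0.8"; a missing q defaults to 1.0. */
    static List<LanguagePreference> parseAcceptLanguage(String header) {
        List<LanguagePreference> result = new ArrayList<>();
        for (String part : header.split(",")) {
            String[] pieces = part.trim().split(";");
            String tag = pieces[0].trim().toLowerCase(Locale.ROOT);
            if (tag.isEmpty() || tag.equals("*")) {
                continue; // the wildcard names no concrete field, so skip it
            }
            double q = 1.0;
            for (int i = 1; i < pieces.length; i++) {
                String param = pieces[i].trim();
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2).trim());
                }
            }
            // Keep only the primary subtag ("en-GB" -> "en") to match field names like title_en.
            int dash = tag.indexOf('-');
            String language = dash < 0 ? tag : tag.substring(0, dash);
            if (q > 0.0) {
                result.add(new LanguagePreference(language, q)); // q=0 means "not acceptable"
            }
        }
        return result;
    }

    /** Duplicates the term once per accepted language, using the q value as the boost. */
    static String expand(String baseField, String term, List<LanguagePreference> preferences) {
        StringBuilder query = new StringBuilder();
        for (LanguagePreference p : preferences) {
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append(baseField).append('_').append(p.language)
                 .append(':').append(term)
                 .append('^').append(p.q);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        List<LanguagePreference> prefs = parseAcceptLanguage("en, fr;q=0.8");
        // prints: title_en:dspace^1.0 title_fr:dspace^0.8
        System.out.println(expand("title", "dspace", prefs));
    }
}
```

The sketch does not escape query-syntax characters in the term, does not merge duplicate languages (e.g. "en-GB, en"), and will throw on a malformed q value, so treat it as a starting point only. With the single-field layout of options 2 and 4, the same preferences could instead drive a filter or boost on the language field.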