Subject: Re: Best practices for multiple languages?
From: Paul Libbrecht <paul@hoplahup.net>
Date: Thu, 20 Jan 2011 22:56:20 +0100
Cc: Bill Janssen <janssen@parc.com>
To: java-user@lucene.apache.org

Isn't this approach somewhat bad for term-frequency statistics?

Words that appear in several languages would be a lot more frequent
(hence less significant).

I still prefer the split-field method with proper query expansion.
This way, term frequency is evaluated on the corpus of a single language.
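
To make the concern concrete, here is a minimal sketch, not anything taken from Lucene itself. The toy corpus, the field names (`text`, `text_en`, `text_de`) and the helper functions are all made up for illustration, and the IDF formula is just the classic `1 + ln(N / (df + 1))` shape, assumed here for the sake of the example. It compares the weight of the word "die" (a rare English verb, a very common German article) when everything goes into one field versus one field per language.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus: (language, text) pairs.
DOCS = [
    ("en", "the batteries die quickly in cold weather"),
    ("en", "the weather report for the weekend"),
    ("en", "cold storage for batteries"),
    ("de", "die batterie und die kälte"),
    ("de", "die wettervorhersage für das wochenende"),
    ("de", "die lagerung der batterie"),
]


def idf(doc_freq, num_docs):
    # Classic TF-IDF style inverse document frequency (assumed formula).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_stats(docs, field_of):
    """Return {field: (num_docs, {term: doc_freq})}.

    field_of maps a document's language to the field it is indexed in.
    """
    num_docs = defaultdict(int)
    doc_freq = defaultdict(lambda: defaultdict(int))
    for lang, text in docs:
        field = field_of(lang)
        num_docs[field] += 1
        # Count each term once per document (document frequency).
        for term in set(text.split()):
            doc_freq[field][term] += 1
    return {f: (num_docs[f], dict(doc_freq[f])) for f in num_docs}


def term_idf(stats, field, term):
    n, df = stats[field]
    return idf(df.get(term, 0), n)


# Single-field approach: every language lands in the same field.
merged = field_stats(DOCS, lambda lang: "text")
# Split-field approach: one field per language.
split = field_stats(DOCS, lambda lang: "text_" + lang)

idf_merged = term_idf(merged, "text", "die")
idf_en = term_idf(split, "text_en", "die")
idf_de = term_idf(split, "text_de", "die")

print(f"merged={idf_merged:.3f} en={idf_en:.3f} de={idf_de:.3f}")
```

In this toy data the English-only field gives "die" the highest weight, the German-only field the lowest, and the merged field lands in between: the English query term is dragged down by German documents it has nothing to do with.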

Dominique, in your case, at least if you are on the web, you have:
- the user's preferred language (if defined in a profile)
- the list of languages the browser says it accepts
And that list can easily be limited to around 8 languages, so that you
cover any language the user is expecting to search in.
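
As a rough sketch of what I mean by query expansion over these languages: the function names and the `text_<lang>` field naming below are hypothetical, and the header parsing is deliberately simplified (it keeps only the primary language subtag and the `q` weight). The output is a plain `field:term OR field:term` string; in a real setup each clause would of course go through that language's own analyzer first.

```python
def parse_accept_language(header):
    """Return primary language subtags from an Accept-Language header, best first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        pieces = part.strip().split(";")
        tag = pieces[0].strip().lower()
        if not tag or tag == "*":
            continue
        quality = 1.0  # default weight when no q parameter is given
        for param in pieces[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality <= 0:
            continue  # q=0 means "not acceptable"
        # Sort by descending quality, then by position in the header.
        ranked.append((-quality, position, tag.split("-")[0]))
    ranked.sort()
    return [lang for _, _, lang in ranked]


def candidate_languages(profile_language, accept_language, limit=8):
    """Profile language first, then browser languages, de-duplicated and capped."""
    ordered = []
    if profile_language:
        ordered.append(profile_language.lower())
    ordered.extend(parse_accept_language(accept_language))
    unique = []
    for lang in ordered:
        if lang not in unique:
            unique.append(lang)
    return unique[:limit]


def expand_query(term, languages, field_prefix="text_"):
    """Build one clause per language field (hypothetical field naming)."""
    return " OR ".join(f"{field_prefix}{lang}:{term}" for lang in languages)


langs = candidate_languages("fr", "de-DE,de;q=0.9,en;q=0.8,fr;q=0.7")
query = expand_query("hotel", langs)
print(langs)
print(query)
```

With the profile language "fr" and that header, the candidate list comes out as fr, de, en, and the query only touches those three fields instead of every language in the index.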

paul


On 20 Jan 2011, at 10:46, Dominique Bejean wrote:

> Hi,
>
> During a recent Solr project we needed to index documents in a lot of
> languages. The natural solution with Lucene and Solr is to define one
> field per language. Each field is configured in the schema.xml file to
> use language-specific processing (tokenizing, stop words, stemmer,
> ...). This is really not easy to manage if you have a lot of languages,
> and it means that 1) the search interface needs to know in which
> language you are searching and 2) the search interface can't search in
> all languages at the same time.
>
> So, I decided that the only solution was to index all languages in
> only one field.
>
> Obviously, each language needs to be processed specifically. For this,
> I developed an analyzer that is in charge of redirecting content to the
> correct tokenizer, filters and stemmer according to its language.
> This analyzer is also used at query time. If the user specifies the
> language of the query, the query is processed by the appropriate
> tokenizer, filters and stemmer; otherwise it is processed by a default
> tokenizer, filters and stemmer.
>
> With this solution:
>
> 1. I only need one field (or two if I want both stemmed and unstemmed
> processing)
> 2. The user can search in all documents regardless of their language
>
> I hope this helps.
>
> Dominique
> www.zoonix.fr
> www.crawl-anywhere.com
>
> On 20/01/11 00:29, Bill Janssen wrote:
>> Paul Libbrecht <paul@hoplahup.net> wrote:
>>
>>> I made several changes of this sort and the precision and recall
>>> measures improved, in particular in the presence of
>>> language-indication failures, which happened to be very common in
>>> our authoring environment.
>> There are two kinds of failures:  no language, or wrong language.
>>
>> For no language, I fall back to StandardAnalyzer, so I should have
>> results similar to yours.  For wrong language, well, I'm using OTS
>> trigram-based language guessers, and they're pretty good these days.
>>
>>>>> Wouldn't it be better to prefer precise matches (a field that is
>>>>> analyzed with StandardAnalyzer for example) but also allow matches
>>>>> that are stemmed?
>> Yes, I think it might improve things, but again, by how much?
>> Stemming is better than no stemming, in terms of recall.  But this
>> approach would also improve precision.
>>
>> Bill
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org