Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Subject: Re: Indexing of multilingual labels
Mime-Version: 1.0 (Apple Message framework v1082)
Content-Type: text/plain; charset=iso-8859-1
From: Paul Libbrecht <paul@hoplahup.net>
In-Reply-To: <AANLkTing9656NYf2A+bd5L=2R0qDwbm5m+Yk8TXSpKot@mail.gmail.com>
Date: Mon, 14 Mar 2011 14:49:26 +0100
Cc: Erick Erickson <erickerickson@gmail.com>
Content-Transfer-Encoding: quoted-printable
Message-Id: <4120CCC9-340D-46C9-A8C9-E48CC32C860A@hoplahup.net>
References: <AANLkTimFT3cheCPptbk2JwkNhjOU-X1wDw9k-O2jU8=n@mail.gmail.com>
 <AANLkTinExquAVyzKSPi4joHkOqdtPOyAfZ1DeK_DNiMg@mail.gmail.com>
 <AANLkTing9656NYf2A+bd5L=2R0qDwbm5m+Yk8TXSpKot@mail.gmail.com>
To: java-user@lucene.apache.org

Stephane,

I think that you have the freedom to put what you want in the stored
value of a field.

The simplest approach would be to keep the fields you use for display
separate from the indexed fields: store the display fields preformatted
(XML-ished, OWL-ified, or JSON-ized) and leave them non-indexed, while
the indexed fields carry only the plain text you want to search.
For display purposes, payloads seem to do much the same job as a
separate stored, non-indexed field.
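
A minimal sketch of that layout in plain Java, with no Lucene
dependency. FieldSpec, ConceptDocument, and the field names "display"
and "label_<lang>" are hypothetical names chosen for illustration; in a
real index each FieldSpec would become a field with the matching
stored/indexed flags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an index field; this is not a Lucene class. */
final class FieldSpec {
    final String name;
    final String value;
    final boolean stored;   // kept verbatim and returned with hits, for display
    final boolean indexed;  // analyzed and searchable

    FieldSpec(String name, String value, boolean stored, boolean indexed) {
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
    }
}

final class ConceptDocument {
    /**
     * Builds one stored, non-indexed display field holding the preformatted
     * concept, plus one indexed, non-stored field per label and language.
     */
    static List<FieldSpec> build(String displayJson,
                                 Map<String, List<String>> labelsByLang) {
        List<FieldSpec> fields = new ArrayList<>();
        fields.add(new FieldSpec("display", displayJson, true, false));
        for (Map.Entry<String, List<String>> entry : labelsByLang.entrySet()) {
            for (String label : entry.getValue()) {
                // Assumed naming convention: the language code is the suffix.
                fields.add(new FieldSpec("label_" + entry.getKey(), label, false, true));
            }
        }
        return fields;
    }
}
```

The display field can carry every label in every language, so showing a
concept never requires reassembling it from the searchable fields.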

The best approach I have found so far is a multiplexing analyzer
(analyzers are only called for indexed fields anyway) that recognizes
the language from the suffix of the field name.
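
A rough sketch of the dispatch logic in plain Java, under stated
assumptions: field names look like "label_fr", and each "analysis
chain" is reduced to a function from text to tokens. SuffixMultiplexer
and its helpers are hypothetical names, not Lucene classes; with Lucene
the same dispatch would live inside an Analyzer implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class SuffixMultiplexer {
    private final Map<String, Function<String, List<String>>> byLanguage = new HashMap<>();
    private final Function<String, List<String>> fallback;

    /** The fallback chain handles fields with no (or an unknown) language suffix. */
    SuffixMultiplexer(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    SuffixMultiplexer register(String lang, Function<String, List<String>> chain) {
        byLanguage.put(lang, chain);
        return this;
    }

    /** Returns the part after the last '_' in the field name, or "" if there is none. */
    static String languageOf(String fieldName) {
        int i = fieldName.lastIndexOf('_');
        return i < 0 ? "" : fieldName.substring(i + 1);
    }

    /** Picks the chain from the field-name suffix and applies it to the text. */
    List<String> analyze(String fieldName, String text) {
        return byLanguage.getOrDefault(languageOf(fieldName), fallback).apply(text);
    }

    /** Toy chain: split on whitespace, leave tokens untouched. */
    static List<String> whitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Toy chain: lowercase, then split on whitespace (a real one would also stem). */
    static List<String> lowercase(String text) {
        return whitespace(text.toLowerCase(Locale.ROOT));
    }
}
```

Usage would look like
new SuffixMultiplexer(SuffixMultiplexer::whitespace).register("en", SuffixMultiplexer::lowercase),
after which analyze("label_en", ...) goes through the English chain and
any unsuffixed field falls back to plain whitespace tokenization.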

As for the choice between one index with several fields and one field
in several indices, I think it is just a programming difference. The tf
and idf are always computed at the term level, so they make no
difference.

I tend to prefer multiple fields because it is easier to expand a query.
Take a query for, say, Fourier sent by a browser that says English but
also accepts French and German. It expands into:
- a query for Fourier in the whitespace-tokenized track (always prefer
that one)
- a query for fouri in the French field
- a query for fourier in the English and German fields
In my experience so far, many users speak, or claim to speak, many
languages (they do, a little bit).
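
The expansion above can be sketched in plain Java as follows. This is
only an illustration under assumptions: QueryExpander is a hypothetical
class, the field names "label_ws" and "label_<lang>" are invented, and
the "stemmers" are toy placeholders (lowercasing, plus stripping a final
"er" for French) standing in for real per-language analysis. The
Accept-Language parsing keeps header order and ignores q-values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

final class QueryExpander {
    // Toy placeholders, NOT real stemmers; they only reproduce the example terms.
    private static final Map<String, UnaryOperator<String>> STEMMERS = new HashMap<>();
    static {
        STEMMERS.put("en", s -> s.toLowerCase(Locale.ROOT));
        STEMMERS.put("de", s -> s.toLowerCase(Locale.ROOT));
        STEMMERS.put("fr", s -> {
            String lower = s.toLowerCase(Locale.ROOT);
            return lower.endsWith("er") ? lower.substring(0, lower.length() - 2) : lower;
        });
    }

    /** Returns the text before the first occurrence of c, or all of s if c is absent. */
    private static String before(String s, char c) {
        int i = s.indexOf(c);
        return i < 0 ? s : s.substring(0, i);
    }

    /** Extracts distinct primary language codes from an Accept-Language header, in header order. */
    static List<String> languages(String acceptLanguage) {
        List<String> langs = new ArrayList<>();
        for (String part : acceptLanguage.split(",")) {
            String tag = before(part, ';').trim().toLowerCase(Locale.ROOT); // drop ";q=..."
            String primary = before(tag, '-');                              // "en-us" -> "en"
            if (!primary.isEmpty() && !primary.equals("*") && !langs.contains(primary)) {
                langs.add(primary);
            }
        }
        return langs;
    }

    /** Builds one "field:term" clause per track: the unstemmed one first, then one per language. */
    static List<String> expand(String term, String acceptLanguage) {
        List<String> clauses = new ArrayList<>();
        clauses.add("label_ws:" + term); // whitespace-tokenized track, always preferred
        for (String lang : languages(acceptLanguage)) {
            UnaryOperator<String> stem = STEMMERS.get(lang);
            if (stem != null) { // skip languages that have no field in the index
                clauses.add("label_" + lang + ":" + stem.apply(term));
            }
        }
        return clauses;
    }
}
```

For the term Fourier and a header such as "en-US,en;q=0.9,fr;q=0.8,de;q=0.5"
this yields four clauses: label_ws:Fourier, label_en:fourier,
label_fr:fouri, and label_de:fourier. In a real setup the clauses would
be combined as optional sub-queries, with the whitespace track boosted.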

Hope it helps.

paul

PS: not that my code is ideal but here are the ones I have:
 - i2geo, based on an ontology of concepts in OWL,
   http://i2geo.net/xwiki/bin/view/About/GeoSkills
   and http://svn.activemath.org/intergeo/Platform/SearchI2G/
 - ActiveMath, fed by XML,
   http://www.activemath.org/javadoc/org/activemath/omdocjdom/index/package-summary.html and


On 11 March 2011 at 16:35, Stephane Fellah wrote:

> Erick,
>
> I am trying to index multilingual taxonomies such as SKOS, WordNet, and
> EuroWordNet. Taxonomies are composed of concepts which have preferred
> and alternative labels in different languages. Some labels have the
> same lexical form in different languages. I want to index these
> concepts in Lucene so that I can search concepts by their label in one
> or several languages. I also want to be able to display a concept
> definition with all the alternative labels in different languages. My
> question is: could we use the payload mechanism to store the language
> assigned to the word (I read somewhere that Google was using payloads
> to store information such as font, for example, so why not language)?
> Wouldn't that be a better approach than using one field per language
> or one index per language?
>
> Regards
> Stephane
>
> On Fri, Mar 11, 2011 at 7:52 AM, Erick Erickson <erickerickson@gmail.com> wrote:
>
>> It's not so much a matter of problems with indexing/searching
>> as it is with search behavior. The reason these strategies
>> are implemented is that using English stemming, say, on
>> other languages will produce "interesting" results.
>>
>> There's no a priori reason you can't index multiple languages
>> in the same field.
>>
>> So I don't see what you would accomplish by using payloads
>> to indicate which language the term is in. Could you expand
>> a bit on what you're trying to accomplish here? Maybe there
>> are better solutions....
>>
>> Best
>> Erick
>>
>>
>> On Thu, Mar 10, 2011 at 10:29 PM, Stephane Fellah
>> <sfellah@smartrealm.com> wrote:
>>> I am trying to index in Lucene a field that could hold labels of
>>> concepts in different languages. Most of the approaches I have seen
>>> so far are:
>>>
>>> - Use a single index, where each document has a field for each
>>>   language it uses, or
>>> - Use M indexes, M being the number of languages in the corpus.
>>>
>>> Lucene 2.9+ has a feature called Payload that allows attaching
>>> attributes to terms. Is anyone using this mechanism to store language
>>> (or other attributes such as datatypes) information? Does this
>>> approach work if labels are the same in different languages (does it
>>> break the inverted index)? How is performance compared to the two
>>> other approaches? Any pointer to source code showing how it is done
>>> would help.
>>>
>>> Thanks
>>>
>>> --
>>> Stephane Fellah, M.Sc, B.Sc
>>> Principal Engineer/Product Manager
>>> smartRealm LLC
>>> 201 Loudoun St. SW
>>> Leesburg, VA 20175
>>> Tel: 703 669 5514
>>> Cell: 571 502 8478
>>> Fax: 703 669 5515
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>

