lucene-java-user mailing list archives

From Paul Libbrecht <p...@hoplahup.net>
Subject Re: Indexing of multilingual labels
Date Mon, 14 Mar 2011 13:49:26 GMT
Stephane,

I think that you have the freedom to put what you want in the stored value of a field.

The simplest approach would be to make the fields that you want to use for display
stored (preformatted, XML-ished, OWL-ified, or JSON-ized) and keep them separate from the
indexed fields (where you are only interested in the plain text).
Payloads seem to do much the same job as a separate stored, non-indexed field.

The best approach I have found so far is to use a multiplexing analyzer (which is only called
for indexed fields anyway) that recognizes the language from the suffix of the field name.
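
To make the idea concrete, here is a minimal plain-Java sketch of the dispatch logic only. It does not use the Lucene API. The class name SuffixMultiplexer, the "label_fr" style of field naming, and the toy whitespace and lowercase analyzers are all assumptions for illustration; in a real index each entry would be an actual Lucene Analyzer for that language.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Sketch: pick an analysis chain from the language suffix of the field name. */
class SuffixMultiplexer {
    // Hypothetical stand-ins for per-language analyzers, keyed by suffix ("en", "fr", ...).
    private final Map<String, Function<String, List<String>>> byLanguage = new HashMap<>();
    private final Function<String, List<String>> fallback;

    SuffixMultiplexer(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    void register(String languageSuffix, Function<String, List<String>> analyzer) {
        byLanguage.put(languageSuffix, analyzer);
    }

    /** Returns "fr" for "label_fr", or "" when the field name has no underscore. */
    static String languageOf(String fieldName) {
        int i = fieldName.lastIndexOf('_');
        return i < 0 ? "" : fieldName.substring(i + 1);
    }

    /** Delegates to the analyzer registered for the field's suffix, else the fallback. */
    List<String> analyze(String fieldName, String text) {
        return byLanguage.getOrDefault(languageOf(fieldName), fallback).apply(text);
    }

    /** Toy analyzer: split on whitespace, keep case. */
    static List<String> whitespace(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? List.of() : Arrays.asList(trimmed.split("\\s+"));
    }

    /** Toy analyzer: split on whitespace and lowercase (no real stemming). */
    static List<String> lowercase(String text) {
        List<String> out = new ArrayList<>();
        for (String token : whitespace(text)) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        SuffixMultiplexer m = new SuffixMultiplexer(SuffixMultiplexer::whitespace);
        m.register("en", SuffixMultiplexer::lowercase);
        System.out.println(m.analyze("label_en", "Fourier Transform"));
        System.out.println(m.analyze("label", "Fourier Transform"));
    }
}
```

The point of keeping the language in the field name is that the same lookup can be reused at query time, so index-time and query-time analysis stay in step.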

As to the difference between one index with several fields and one field in several indices,
I think it is just a programming difference. The tf and idf are always computed at the term
level, so the choice makes no difference to them.

I tend to prefer multiple fields because it is easier to expand a query for, say, Fourrier,
sent by a browser that announces English but also accepts French and German, into:
- a query for Fourrier in the whitespace-tokenized track (always prefer that one)
- a query for fouri in the French field
- a query for fourier in the English and German fields
My current experience is that many users speak, or claim to speak, several languages (they do,
a little bit).
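
The expansion listed above could be sketched roughly as follows, in plain Java with no Lucene dependency. The field names ("label_ws" for the whitespace-tokenized track, "label_<lang>" for the per-language fields), the boost of 2.0 used to prefer the exact track, and the toy stemmers are all assumptions for illustration. The toy stemmers will not reproduce what a real French or English stemmer outputs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Sketch: expand one user term into one clause per language the browser accepts. */
class MultilingualQueryExpander {
    // Hypothetical name of the whitespace-tokenized (unstemmed) field.
    static final String EXACT_FIELD = "label_ws";

    // Hypothetical stand-ins for per-language stemmers, keyed by language code.
    private final Map<String, UnaryOperator<String>> stemmers = new LinkedHashMap<>();

    void register(String lang, UnaryOperator<String> stemmer) {
        stemmers.put(lang, stemmer);
    }

    /** Returns the part of s before the first occurrence of c, or s itself. */
    private static String before(String s, char c) {
        int i = s.indexOf(c);
        return i < 0 ? s : s.substring(0, i);
    }

    /** Parses e.g. "en-US, fr;q=0.8" into primary language codes, in header order. */
    static List<String> acceptedLanguages(String acceptLanguageHeader) {
        List<String> langs = new ArrayList<>();
        for (String part : acceptLanguageHeader.split(",")) {
            String tag = before(part, ';').trim().toLowerCase(Locale.ROOT);
            String primary = before(tag, '-');
            if (primary.isEmpty() || primary.equals("*") || langs.contains(primary)) {
                continue;
            }
            langs.add(primary);
        }
        return langs;
    }

    /** Builds clauses in Lucene query syntax: the boosted exact track first, then per language. */
    List<String> expand(String userTerm, String acceptLanguageHeader) {
        List<String> clauses = new ArrayList<>();
        clauses.add(EXACT_FIELD + ":" + userTerm + "^2.0");
        for (String lang : acceptedLanguages(acceptLanguageHeader)) {
            UnaryOperator<String> stemmer = stemmers.get(lang);
            if (stemmer != null) {
                clauses.add("label_" + lang + ":" + stemmer.apply(userTerm));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        MultilingualQueryExpander expander = new MultilingualQueryExpander();
        expander.register("en", s -> s.toLowerCase(Locale.ROOT));
        expander.register("fr", s -> s.toLowerCase(Locale.ROOT).replaceAll("er$", ""));
        System.out.println(String.join(" OR ", expander.expand("Fourrier", "en, fr;q=0.8, de;q=0.5")));
    }
}
```

Languages without a registered stemmer are simply skipped, so the exact track still catches the term for users whose browser announces a language the index does not cover.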

hope it helps.

paul

PS: not that my code is ideal, but here are the code bases I have:
 - i2geo, based on an ontology of concepts in OWL, 
	http://i2geo.net/xwiki/bin/view/About/GeoSkills
   and http://svn.activemath.org/intergeo/Platform/SearchI2G/
 - ActiveMath, fed by XML, http://www.activemath.org/javadoc/org/activemath/omdocjdom/index/package-summary.html
and 


On 11 March 2011 at 16:35, Stephane Fellah wrote:

> Erick,
> 
> I am trying to index multilingual taxonomies such as SKOS, WordNet, and
> EuroWordNet. Taxonomies are composed of concepts which have preferred and
> alternative labels in different languages. Some labels have the same lexical
> form in different languages. I want to be able to index these concepts in
> Lucene in order to search concepts by their label in one or
> several languages. I also want to be able to display the concept definition with
> all the alternative labels in different languages. My question is: could we
> use the payload mechanism to store the language assigned to the word (I read
> somewhere that Google was using payloads to store information such as font, for
> example, so why not language)? Wouldn't that be a better approach than using one
> field per language or one index per language?
> 
> Regards
> Stephane
> 
> On Fri, Mar 11, 2011 at 7:52 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> 
>> It's not so much a matter of problems with indexing/searching
>> as it is with search behavior. The reason these strategies
>> are implemented is that using English stemming, say, on
>> other languages will produce "interesting" results.
>> 
>> There's no a-priori reason you can't index multiple languages
>> in the same field.
>> 
>> So I don't see what you would accomplish by using payloads
>> to indicate which language the term is in. Could you expand
>> a bit on what you're trying to accomplish here? Maybe there
>> are better solutions....
>> 
>> Best
>> Erick
>> 
>> 
>> On Thu, Mar 10, 2011 at 10:29 PM, Stephane Fellah
>> <sfellah@smartrealm.com> wrote:
>>> I am trying to index in Lucene a field that could hold labels of concepts
>>> in different languages. Most of the approaches I have seen so far are:
>>> 
>>> - Use a single index, where each document has one field per language it
>>>   uses, or
>>> - Use M indexes, M being the number of languages in the corpus.
>>> 
>>> Lucene 2.9+ has a feature called Payload that allows attaching attributes
>>> to a term. Is anyone using this mechanism to store language (or other
>>> attributes such as datatypes) information? Does this approach work if
>>> labels are the same in different languages (does it break the inverted
>>> index)? How is performance compared to the two other approaches? Any
>>> pointer to source code showing how it is done would help.
>>> 
>>> Thanks
>>> 
>>> --
>>> Stephane Fellah, M.Sc, B.Sc
>>> Principal Engineer/Product Manager
>>> smartRealm LLC
>>> 201 Loudoun St. SW
>>> Leesburg, VA 20175
>>> Tel: 703 669 5514
>>> Cell: 571 502 8478
>>> Fax: 703 669 5515
>>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>> 
> 
> 



