lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vinaya Kumar Thimmappa <>
Subject Re: Indexing of multilingual labels
Date Mon, 14 Mar 2011 12:06:33 GMT
Hello Stephane,

I think a better way is to have resource file with different language 
and store pointer in the index to get to correct resource file ( 
Something like  I18N and L10N approach). Store the internationalised 
string in index  and all related localised string in resource file .

This way index size will be reduced (adding to payload would have impact 
on performance)
and help performance too.

Now your Total Search Time would be (searchtime+time to retrieve the 
language based data)

Hope this helps.

On Friday 11 March 2011 09:05 PM, Stephane Fellah wrote:
> Erick,
> I am trying to index multilingual taxonomies such as SKOS, Wordnet,
> Eurowordnet. Taxonomies are composed of concepts which have preferred and
> alternative labels in different languages. Some labels are the same lexical
> form in different languages. I want to be able to index these concepts in
> Lucene in order to be able to search concepts by their label in one or
> several languages. I want also be able to display concept definition with
> all the alternative labels in different languages. My question is: could we
> use the payload mechanism to store the language assigned to the word (i read
> somewhere Google was using payload to store information such as font for
> example, so why not language) ? Wouldn't be a better approach then using one
> field per language or one index per language ?
> REgards
> Stephane
> On Fri, Mar 11, 2011 at 7:52 AM, Erick Erickson<>wrote:
>> It's not so much a matter of problems with indexing/searching
>> as it is with search behavior. The reason these strategies
>> are implemented is that using English stemming, say, on
>> other languages will produce "interesting" results.
>> There's no a-priori reason you can't index multiple languages
>> in the same field.
>> So I don't see what you would accomplish by using payloads
>> to indicate which language the term is in. Could you expand
>> a bit on what you're trying to accomplish here? Maybe there
>> are better solutions....
>> Best
>> Erick
>> On Thu, Mar 10, 2011 at 10:29 PM, Stephane Fellah
>> <>  wrote:
>>> I  am trying to index in Lucene a field that could have label of concepts
>> in
>>> different languages. Most of the approaches I have seen so far are:
>>>    -
>>>    Use a single index, where each document has a field per each language
>> it
>>>    uses, or
>>>    -
>>>    Use M indexes, M being the number of languages in the corpus.
>>> Lucene 2.9+ has a feature called Payload that allows to attach attributes
>> to
>>> term. Is anyone use this mechanism to store language (or other attributes
>>> such as datatypes) information ? Does this approach if labels are the
>> same
>>> in different languages (does it break inverted index) ? How is
>> performance
>>> compared to the two other approaches ? Any pointer on source code showing
>>> how it is done would help.
>>> Thanks
>>> --
>>> Stephane Fellah, M.Sc, B.Sc
>>> Principal Engineer/Product Manager
>>> smartRealm LLC
>>> 201 Loudoun St. SW
>>> Leesburg, VA 20175
>>> Tel: 703 669 5514
>>> Cell: 571 502 8478
>>> Fax: 703 669 5515
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message