lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephane Fellah <sfel...@smartrealm.com>
Subject Re: Indexing of multilingual labels
Date Fri, 11 Mar 2011 15:35:41 GMT
Erick,

I am trying to index multilingual taxonomies such as SKOS, Wordnet,
Eurowordnet. Taxonomies are composed of concepts which have preferred and
alternative labels in different languages. Some labels are the same lexical
form in different languages. I want to be able to index these concepts in
Lucene in order to be able to search concepts by their label in one or
several languages. I want also be able to display concept definition with
all the alternative labels in different languages. My question is: could we
use the payload mechanism to store the language assigned to the word (i read
somewhere Google was using payload to store information such as font for
example, so why not language) ? Wouldn't be a better approach then using one
field per language or one index per language ?

REgards
Stephane

On Fri, Mar 11, 2011 at 7:52 AM, Erick Erickson <erickerickson@gmail.com>wrote:

> It's not so much a matter of problems with indexing/searching
> as it is with search behavior. The reason these strategies
> are implemented is that using English stemming, say, on
> other languages will produce "interesting" results.
>
> There's no a-priori reason you can't index multiple languages
> in the same field.
>
> So I don't see what you would accomplish by using payloads
> to indicate which language the term is in. Could you expand
> a bit on what you're trying to accomplish here? Maybe there
> are better solutions....
>
> Best
> Erick
>
>
> On Thu, Mar 10, 2011 at 10:29 PM, Stephane Fellah
> <sfellah@smartrealm.com> wrote:
> > I  am trying to index in Lucene a field that could have label of concepts
> in
> > different languages. Most of the approaches I have seen so far are:
> >
> >   -
> >
> >   Use a single index, where each document has a field per each language
> it
> >   uses, or
> >   -
> >
> >   Use M indexes, M being the number of languages in the corpus.
> >
> > Lucene 2.9+ has a feature called Payload that allows to attach attributes
> to
> > term. Is anyone use this mechanism to store language (or other attributes
> > such as datatypes) information ? Does this approach if labels are the
> same
> > in different languages (does it break inverted index) ? How is
> performance
> > compared to the two other approaches ? Any pointer on source code showing
> > how it is done would help.
> >
> > Thanks
> >
> > --
> > Stephane Fellah, M.Sc, B.Sc
> > Principal Engineer/Product Manager
> > smartRealm LLC
> > 201 Loudoun St. SW
> > Leesburg, VA 20175
> > Tel: 703 669 5514
> > Cell: 571 502 8478
> > Fax: 703 669 5515
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Stephane Fellah, M.Sc, B.Sc
Principal Engineer/Product Manager
smartRealm LLC
201 Loudoun St. SW
Leesburg, VA 20175
Tel: 703 669 5514
Cell: 571 502 8478
Fax: 703 669 5515

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message