lucene-java-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Query and language conversion
Date Tue, 01 Sep 2009 17:10:53 GMT
Alex,

That's right, you'll have to roll your own if you want to do cross-language search in Lucene.
But some of the components you need are available.

Two possible cross-language search strategies don't scale well when the document collection
is of non-trivial size: a) translating all documents into all possible query languages; and
b) requiring all documents and queries to use a "controlled vocabulary", with fully mapped
correspondences between languages.

The most commonly chosen strategy is to translate the query into the language(s) of the documents.
Because queries are typically so short, full machine translation is overkill: it depends on
more context for accuracy than a query provides.  So people generally use bi-lingual dictionaries,
sometimes combined with stemming and/or stopword removal, to convert queries into the language(s)
of the documents.
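
As a rough illustration of the dictionary approach, here is a minimal sketch in plain Java.
It is not Lucene code: the class name, the `translate` method and the two dictionary entries
are all made up for this example, and a real dictionary would be loaded from a file and be
far larger.  Each query term is replaced by an OR group of its translations, and terms with
no dictionary entry pass through unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class DictionaryQueryTranslator {
    // Hypothetical German-to-English entries, for illustration only.
    public static final Map<String, List<String>> DE_EN = Map.of(
        "arzt", List.of("doctor", "physician"),
        "mediziner", List.of("physician", "medic"));

    /** Rewrites each source-language term as an OR group of its translations. */
    public static String translate(String query, Map<String, List<String>> dictionary) {
        List<String> clauses = new ArrayList<>();
        for (String term : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue; // empty query
            }
            List<String> translations = dictionary.get(term);
            if (translations == null) {
                clauses.add(term); // no entry: keep the term as typed
            } else if (translations.size() == 1) {
                clauses.add(translations.get(0));
            } else {
                clauses.add("(" + String.join(" OR ", translations) + ")");
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(translate("Arzt", DE_EN)); // (doctor OR physician)
    }
}
```

The resulting string could then be passed, with a field prefix such as `type:`, to whatever
query parser you use.  Multi-word translations would need to be quoted as phrases, which this
sketch does not handle.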

Lucene does have analyzers (some or all of: tokenizers, stemmers and stopword lists) for several
languages, but these are monolingual in nature.

Lucene does not have bi-lingual dictionaries.  You would have to supply these.

You didn't directly say so, but if you're unsure of the language of any of your content, you'll
also need to be able to identify it.  Nutch, a Lucene sub-project, has language identification
code.
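
To give a feel for what language identification involves, here is a deliberately simple
sketch in plain Java that counts stopword hits per language.  This is not how Nutch does it;
the class name, method and tiny stopword samples are assumptions made up for this example.
It only works on document-length text, since a one-word query like "Arzt" contains no
stopwords at all.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class StopwordLanguageGuesser {
    // Tiny illustrative stopword samples, not complete lists.
    public static final Map<String, Set<String>> STOPWORDS = Map.of(
        "en", Set.of("the", "and", "of", "is", "for", "a"),
        "de", Set.of("der", "die", "das", "und", "ist", "ein"));

    /** Returns the language code with the most stopword hits, or "unknown" if there are none. */
    public static String guess(String text) {
        String best = "unknown";
        int bestHits = 0;
        // Split on anything that is not a letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("Der Arzt ist ein Mediziner"));
    }
}
```

For queries, the language of the user's search page is likely a better signal than anything
that can be inferred from one or two words.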

Steve

> -----Original Message-----
> From: Alex [mailto:azlist1@gmail.com]
> Sent: Tuesday, September 01, 2009 12:51 PM
> To: java-user@lucene.apache.org
> Subject: Re: Query and language conversion
> 
> Many thanks Steve for all that information.
> 
> I understand by your answer that cross-lingual search doesn't come
> "out-of-the-box" in Lucene.
> 
> Cheers.
> 
> Alex
> 
> On Tue, Sep 1, 2009 at 6:46 PM, Steven A Rowe <sarowe@syr.edu> wrote:
> 
> > Hi Alex,
> >
> > What you want to do is commonly referred to as "Cross Language Information
> > Retrieval".  Doug Oard at the University of Maryland has a page of CLIR
> > resources here:
> >
> > http://terpconnect.umd.edu/~dlrg/clir/
> >
> > Grant Ingersoll responded to a similar question a couple of years ago on
> > this list:
> >
> > <http://search.lucidimagination.com/search/document/e1398067af353a49/cross_lingual_ir#e1398067af353a49>
> >
> > Here's another recent thread with lots of good info, from the solr-user
> > mailing list, on the same topic:
> >
> > <http://search.lucidimagination.com/search/document/f7c17dc516c89bf6/preparing_the_ground_for_a_real_multilang_index#797001daa3f73e17>
> >
> > Here's a paper written by a group that put together a Greek-English
> > cross-language retrieval system using Lucene:
> >
> > http://www.springerlink.com/content/n172420t1346q683/
> >
> > And here's another paper written by a group that made a Hindi and Telugu to
> > English cross-language retrieval system using Lucene, from the CLEF 2006
> > conference proceedings:
> >
> > http://www.iiit.ac.in/techreports/2008_76.pdf
> >
> > Steve
> >
> > > -----Original Message-----
> > > From: Alex [mailto:azlist1@gmail.com]
> > > Sent: Tuesday, September 01, 2009 10:30 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Query and language conversion
> > >
> > > Hi,
> > >
> > > I am new to Lucene so excuse me if this is a trivial question ..
> > >
> > >
> > > I have data that I index in a given language (English). My users will
> > > come from different countries and my search screen will be
> > > internationalized. My users will then probably query things in their
> > > own language. Is it possible to look up items that were indexed
> > > in a different language?
> > >
> > > To make things a bit more clear.
> > >
> > > My "Business" object has a "type" attribute. In Lucene the "type" field
> > > is created. The Business object for "Doctor Smuck" will be indexed with
> > > the "type" field as "medical doctor" or anything similar. My German
> > > users will query using the German language. Such a user tries to find a
> > > doctor using "Arzt" or maybe "Mediziner" as a query. Is Lucene able to
> > > match the query to the value that was indexed in another language?
> > > Is there an analyser for that?
> > >
> > > By the way: I can provide the probable input language, based on the
> > > client's search page language, as a parameter if that helps (it
> > > probably will).
> > >
> > > Many thanks for your thoughts !
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >

