lucene-java-user mailing list archives

From Paul Libbrecht <p...@hoplahup.net>
Subject Re: What's the best way to translate a query in multiple languages?
Date Wed, 02 Nov 2011 11:31:19 GMT
Raf,

I always handle this with query expansion.

Take the Lucene QueryParser with default field "default" and the whitespace analyzer as its
default analyzer, and feed the query in.
You typically get back a BooleanQuery, which you can then walk to perform the query expansion.
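
The walk over the parsed query can be sketched roughly as below. This is only an illustration in
plain Java: Node, TermNode, Clause, BoolNode and TreeExpander are hypothetical stand-ins for
Lucene's TermQuery/BooleanQuery, not Lucene classes, and the occur flags are modelled as the
"+", "-" and "" prefixes of the query syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java stand-ins for a parsed query tree (hypothetical, not Lucene classes).
abstract class Node {
    abstract String render();
}

final class TermNode extends Node {
    final String field;
    final String text;
    TermNode(String field, String text) { this.field = field; this.text = text; }
    @Override String render() { return field + ":" + text; }
}

final class Clause {
    final String occur; // "+" (must), "-" (must not) or "" (should)
    final Node node;
    Clause(String occur, Node node) { this.occur = occur; this.node = node; }
}

final class BoolNode extends Node {
    final List<Clause> clauses = new ArrayList<Clause>();
    BoolNode add(String occur, Node node) { clauses.add(new Clause(occur, node)); return this; }
    @Override String render() {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < clauses.size(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(clauses.get(i).occur).append(clauses.get(i).node.render());
        }
        return sb.append(')').toString();
    }
}

final class TreeExpander {
    // Rebuild the tree, keeping each clause's occur flag and replacing every
    // term leaf with whatever the supplied expansion returns for it.
    static Node expand(Node node, Function<TermNode, Node> expansion) {
        if (node instanceof TermNode) {
            return expansion.apply((TermNode) node);
        }
        BoolNode in = (BoolNode) node;
        BoolNode out = new BoolNode();
        for (Clause c : in.clauses) {
            out.add(c.occur, expand(c.node, expansion));
        }
        return out;
    }
}
```

The point of the recursion is that "+" and "-" stay on the outer clause while each leaf becomes
a small group of optional clauses. Phrase, wildcard and range queries would need their own cases.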

For example, I replace every TermQuery with a BooleanQuery that contains, first, a TermQuery on
the field of the user's language, followed by TermQueries on the fields of the other languages
(those the browser tells me the user accepts) with lower preference; for each of them the term
is passed through the analyzer of that language (generally a Porter stemmer).
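
As a rough sketch of that per-term expansion, the following plain Java builds the expanded clause
as a query-syntax string. Everything here is assumed for illustration: the class and method names,
the boost values, the "contents_<lang>" field naming (borrowed from Raf's message), and the
per-language "analyzers", which are simple string functions standing in for real Lucene analyzers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

final class LanguageExpansion {
    // Hypothetical boosts: the user's language is preferred over the others.
    static final float USER_LANGUAGE_BOOST = 4.0f;
    static final float OTHER_LANGUAGE_BOOST = 1.0f;

    // Returns "(contents_<user>:t^4.0 contents_<other>:t^1.0 ...)", user's language first.
    static String expandTerm(String term, String userLanguage, List<String> otherLanguages,
                             Map<String, UnaryOperator<String>> analyzers) {
        List<String> clauses = new ArrayList<String>();
        clauses.add(clause(term, userLanguage, USER_LANGUAGE_BOOST, analyzers));
        for (String language : otherLanguages) {
            if (!language.equals(userLanguage)) {
                clauses.add(clause(term, language, OTHER_LANGUAGE_BOOST, analyzers));
            }
        }
        return "(" + String.join(" ", clauses) + ")";
    }

    // Analyze the term for one language; languages without an analyzer keep the raw term.
    private static String clause(String term, String language, float boost,
                                 Map<String, UnaryOperator<String>> analyzers) {
        UnaryOperator<String> analyzer = analyzers.get(language);
        String analyzed = (analyzer == null) ? term : analyzer.apply(term);
        return "contents_" + language + ":" + analyzed + "^" + boost;
    }
}
```

The sketch does not escape query-syntax special characters. With Lucene itself you would more
likely build the query objects directly and set their boosts than concatenate strings.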

At the head of that expansion I also add a TermQuery on the field "exact", which is
language-independent and indexed with the whitespace analyzer. This should be the best match.
At the tail, you can also add fuzzy queries in the same language, or queries against a phonetic
indexing field.
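
The ordering described here (exact match first, language clauses in the middle, fuzzy match last)
could look roughly like this, again as query-syntax strings in plain Java. The class name, the
boost on "exact" and the choice of field for the fuzzy clause are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class ExactAndFuzzyWrapper {
    // Hypothetical boost that makes the untouched, whitespace-analyzed term win.
    static final float EXACT_BOOST = 10.0f;

    // Head: exact field. Middle: per-language clauses, already built. Tail: fuzzy clause.
    static String wrap(String rawTerm, List<String> languageClauses, String fuzzyField) {
        List<String> clauses = new ArrayList<String>();
        clauses.add("exact:" + rawTerm + "^" + EXACT_BOOST);
        clauses.addAll(languageClauses);
        clauses.add(fuzzyField + ":" + rawTerm + "~");
        return "(" + String.join(" ", clauses) + ")";
    }
}
```

All clauses are optional, so a document matching only the fuzzy tail is still returned; it simply
scores lower than one matching the exact field.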

In my experience, users accept this well.
The trouble with query expansion is that the output of explain() becomes quite monstrous to read.
I note that Solr's DisMax query parser does something similar (but it does not support Lucene
syntax to indicate, e.g., fields, wildcards, ranges, or fuzzy queries).

Hope it helps.

Paul


On 1 Nov 2011, at 18:07, Raf wrote:

> Hi,
> I have a Lucene index containing documents written in different languages.
> 
> Each document is written only in one language and I have a *language* field
> containing the corresponding language identifier (it, en, fr, ...).
> The *content* is saved in a different field for each language (e.g.
> contents_it, contents_en, ...) and I use a specific language analyzer for
> each of these fields.
> 
> When the user enters a query, they also select the language they are
> writing it in, so I can create a *QueryParser* with the right
> *defaultField* and *analyzer*.
> 
> This works fine, but, using this approach, users can find only documents
> written in the same language used to write the query.
> 
> Now, I would like to *translate* the user's query in order to also find
> documents written in other languages (that match the same query).
> 
> For example:
> * *user_query* = cane      *query_language* = it
> * At the moment, using the standard *QueryParser*, I obtain this query  -->
>   contents_it:cane
> * In the new scenario, I would like to have this query  -->
>   (contents_it:cane contents_en:dog contents_fr:chien)
> 
> but also
> 
> * *user_query* = +"operating system" -linux      *query_language* = en
> * I would like to have this query  -->
>   +(contents_en:"operating system" contents_it:"sistema operativo") -(contents_en:linux contents_it:linux)
> 
> Suppose that:
> * for each index/application I have a fixed number of available languages,
> each with its *defaultField* and specific *analyzer.*
> * I already have a service that is able to translate words and/or small
> phrases between languages I am interested in.
> 
> 
> I was thinking about extending *QueryParser* overriding some methods to add
> my custom behaviour.
> 
> This looks quite easy for TermQuery, for example doing something like this:
> 
> protected Query newTermQuery(Term term) {
>     // The term in the query language is always an optional clause.
>     BooleanQuery bq = new BooleanQuery();
>     bq.add(new BooleanClause(new TermQuery(term), BooleanClause.Occur.SHOULD));
> 
>     // "languages" and "queryLanguage" are assumed to be fields of the subclass.
>     for (String language : languages) {
>         if (language.equals(queryLanguage)) {
>             continue;
>         }
>         TermQuery translatedTQ = translateTerm(term, queryLanguage, language);
>         bq.add(new BooleanClause(translatedTQ, BooleanClause.Occur.SHOULD));
>     }
> 
>     return bq;
> }
> 
> But it looks considerably harder for other query types (unless I rewrite
> *QueryParser* instead of extending it).
> Am I missing something? Is there a better approach to achieve the same goal?
> 
> I am using *Lucene 3.0.3* and, for now, I cannot upgrade to a more recent
> version.
> 
> Thanks in advance,
> Bye.
> 
> *Raf*

