lucene-java-user mailing list archives

From Ian Lea <ian....@gmail.com>
Subject Re: What's the best way to translate a query in multiple languages?
Date Wed, 02 Nov 2011 09:36:10 GMT
How many of the different query types would you need to handle?
TermQuery certainly, as you show, and presumably PhraseQuery for
"operating system" in your example, but prefix, wildcard, etc. are
probably not relevant.

An alternative might be to tackle it one level up, along the lines of

String querystr = ...;
String deflang = ...;
BooleanQuery bq = new BooleanQuery();
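// Parse the original query against the default-language field.
QueryParser qp = new QueryParser(..., "contents_"+deflang, ...);
bq.add(qp.parse(querystr), ...);
// Parse a translated copy of the query against each other language field.
for (String lang : allLangs) {
  if (!lang.equals(deflang)) {
    QueryParser qp2 = new QueryParser(..., "contents_"+lang, ...);
    bq.add(qp2.parse(translate(querystr, deflang, lang)), ...);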
  }
}

I think I'd try that approach first.  If appropriate to your
application you could boost the native language query too.
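
A rough, dependency-free sketch of the idea, in case it helps. It does not
call Lucene at all: it only plans which query text goes to which field and
with what boost. The Translator interface is a hypothetical stand-in for
your translation service, and Clause, plan, translateQuery and nativeBoost
are names I made up for illustration. It assumes the query only uses bare
terms, quoted phrases and the + and - operators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Dependency-free sketch: plans one sub-query per language field. */
class MultiLangQueryPlan {

    /** One planned clause: field to search, query text to parse, and a boost. */
    static final class Clause {
        final String field;
        final String queryText;
        final float boost;

        Clause(String field, String queryText, float boost) {
            this.field = field;
            this.queryText = queryText;
            this.boost = boost;
        }
    }

    /** Hypothetical stand-in for the translation service (words or short phrases). */
    interface Translator {
        String translate(String text, String fromLang, String toLang);
    }

    // An optional + or - operator followed by a quoted phrase or a bare term.
    private static final Pattern TOKEN =
            Pattern.compile("([+-]?)(\"([^\"]*)\"|(\\S+))");

    /** Translates terms and phrases one by one, leaving +, - and quotes in place. */
    static String translateQuery(String query, String from, String to, Translator t) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(m.group(1)); // operator, possibly empty
            if (m.group(3) != null) {
                // Quoted phrase: translate the inside, keep the quotes.
                out.append('"').append(t.translate(m.group(3), from, to)).append('"');
            } else {
                out.append(t.translate(m.group(4), from, to));
            }
        }
        return out.toString();
    }

    /** Native-language clause first (boosted), then one translated clause per other language. */
    static List<Clause> plan(String query, String defLang, List<String> allLangs,
                             Translator t, float nativeBoost) {
        List<Clause> clauses = new ArrayList<Clause>();
        clauses.add(new Clause("contents_" + defLang, query, nativeBoost));
        for (String lang : allLangs) {
            if (!lang.equals(defLang)) {
                String translated = translateQuery(query, defLang, lang, t);
                clauses.add(new Clause("contents_" + lang, translated, 1.0f));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Toy dictionary; text it does not know (e.g. "linux") passes through unchanged.
        final Map<String, String> enToIt = new HashMap<String, String>();
        enToIt.put("operating system", "sistema operativo");
        Translator toy = (text, from, to) -> enToIt.getOrDefault(text, text);

        List<Clause> clauses = plan("+\"operating system\" -linux", "en",
                Arrays.asList("en", "it"), toy, 2.0f);
        for (Clause c : clauses) {
            System.out.println(c.field + " (boost " + c.boost + "): " + c.queryText);
        }
    }
}
```

In a real application each Clause would be parsed by a QueryParser created
for that field and its language analyzer, boosted with Query.setBoost, and
added to the outer BooleanQuery with BooleanClause.Occur.SHOULD. Since each
document has content in only one language field, OR-ing the per-language
queries should give the same matches as the per-term expansion in your
example. Keywords such as AND/OR/NOT, field prefixes and wildcards are not
handled by this sketch.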


--
Ian.

On Tue, Nov 1, 2011 at 5:07 PM, Raf <r.ventaglio@gmail.com> wrote:
> Hi,
> I have a Lucene index containing documents written in different languages.
>
> Each document is written only in one language and I have a *language* field
> containing the corresponding language identifier (it, en, fr, ...).
> The *content* is saved in different fields for each language (e.g.
> contents_it, contents_en, ...) and I use a specific language analyzer for
> each of these fields.
>
> When the user enters a query, they also select the language they are using to
> write it, so I can create a *QueryParser* with the right *defaultField* and
> *analyzer*.
>
> This works fine, but, using this approach, users can find only documents
> written in the same language used to write the query.
>
> Now, I would like to *translate* the user's query in order to also find
> documents written in other languages (that match the same query).
>
> For example:
> * user_query = cane         query_language = it
> * At the moment, using the standard *QueryParser*, I obtain this query   -->
>   contents_it:cane
> * In the new scenario, I would like to have this query   -->
>   (contents_it:cane contents_en:dog contents_fr:chien)
>
> but also
>
> * user_query = +"operating system" -linux      query_language = en
> * I would like to have this query   -->
>   +(contents_en:"operating system" contents_it:"sistema operativo") -(contents_en:linux contents_it:linux)
>
> Suppose that:
> * for each index/application I have a fixed number of available languages,
> each with its *defaultField* and specific *analyzer.*
> * I already have a service that is able to translate words and/or small
> phrases between languages I am interested in.
>
>
> I was thinking about extending *QueryParser* overriding some methods to add
> my custom behaviour.
>
> This looks quite easy for TermQuery, for example doing something like this:
>
> protected Query newTermQuery(Term term) {
>    // The original term, searched in the query-language field.
>    BooleanQuery bq = new BooleanQuery();
>    bq.add(new BooleanClause(new TermQuery(term), BooleanClause.Occur.SHOULD));
>
>    // "languages" is assumed to hold the fixed set of available languages.
>    for (String language : languages) {
>         if (language.equals(queryLanguage)) continue;
>         TermQuery translatedTQ = translateTerm(term, queryLanguage, language);
>         bq.add(new BooleanClause(translatedTQ, BooleanClause.Occur.SHOULD));
>    }
>
>    return bq;
> }
>
> But it looks considerably more difficult for other query types (without
> *rewriting QueryParser* instead of extending it).
> Am I missing something? Is there a better approach to achieve the same goal?
>
> I am using *lucene 3.0.3* and, for now, I cannot upgrade to more recent
> versions.
>
> Thanks in advance,
> Bye.
>
> *Raf*
>


