lucene-solr-user mailing list archives

From Jan Høydahl / Cominvent <jan....@cominvent.com>
Subject Re: Multilingual - Search against the appropriate field
Date Thu, 01 Jul 2010 11:26:06 GMT
Hi,

I have chosen the same approach as you, indexing content into text_<language> fields
with custom analysis, and it works great. Solr does not have any overhead with this even if
there are hundreds of languages, due to the schema-less nature of Lucene.

And if you know which language is being searched, you can query only the fields for that language,
and you'd still be as fast as in the monolingual case. But you'd only get documents in that
language returned.

Say you want to match across languages; for example, you search for "obama", which is
written the same in all languages. How do you achieve this? I see two approaches:
a) Search across all languages with proper analysis, as you suggest: qf=text_fr text_en^10 (you
can even boost the preferred languages).
b) Index all content in a "text_all" field with no stemming involved and search qf=text_all
(you will match "obama" in all languages but lose stemming).
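
As a rough sketch of how the query parameters for a) and b) could be assembled, here is some
Python. The helper names (cross_language_qf, dismax_params) are made up for illustration, and
defType=dismax assumes you select the dismax parser through request parameters; adapt it to
whatever client you use.

```python
from urllib.parse import urlencode


def cross_language_qf(languages, preferred=None, boost=10):
    """Approach a): one text_<lang> field per language, boosting the preferred one."""
    parts = []
    for lang in languages:
        field = f"text_{lang}"
        if lang == preferred:
            field += f"^{boost}"
        parts.append(field)
    return " ".join(parts)


def dismax_params(query, qf):
    """Bundle the query and the qf field list into a parameter dict."""
    return {"defType": "dismax", "q": query, "qf": qf}


# a) search every language-specific field, preferring English
params_a = dismax_params("obama", cross_language_qf(["fr", "en"], preferred="en"))
# b) search the single unstemmed catch-all field
params_b = dismax_params("obama", "text_all")

# URL-encoded form, ready to append to a select URL
print(urlencode(params_a))
print(urlencode(params_b))
```

The only difference between the two is the qf value, so switching strategy per request is cheap.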

My feeling is that a) would work if you have a limited set of languages, but b) might be necessary
if you have dozens of languages to search across, due to reduced query performance with such
a large disMax query.

Of course, with a) there may be ambiguities where an English word gets stemmed and hits the
same stem as a totally different French word. I don't have any hands-on examples, but I'm
sure the issue exists. In that case it is probably better to search the other languages unstemmed,
as a hybrid approach:

c) Search the query language stemmed and all others unstemmed (qf=text_en^10 text_all, giving
increased recall)
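
A minimal sketch of c) in Python, assuming the query language is already known (for example
from the UI locale). The function name hybrid_params and its defaults are hypothetical.

```python
def hybrid_params(query, query_lang, boost=10, catch_all_field="text_all"):
    """Approach c): boosted stemmed field for the query language plus the unstemmed catch-all."""
    qf = f"text_{query_lang}^{boost} {catch_all_field}"
    return {"defType": "dismax", "q": query, "qf": qf}


params = hybrid_params("obama", "en")
print(params["qf"])  # text_en^10 text_all
```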

The downside of a text_all field is that, in the worst case, you almost double the size of your index.
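
To make the cost concrete, here is a sketch of what a document could look like at indexing time
when the text goes into both fields. The "id" and "language" field names and the build_solr_doc
helper are assumptions for illustration; the analysis chains themselves live in schema.xml, not
in this code.

```python
import json


def build_solr_doc(doc_id, language, text):
    """Put the text in its language-specific field and copy it into text_all."""
    return {
        "id": doc_id,
        "language": language,
        f"text_{language}": text,  # analysed with the language-specific chain
        "text_all": text,          # same content again, unstemmed: the extra index cost
    }


doc = build_solr_doc("doc-1", "fr", "Obama rencontre les ministres")
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Instead of copying in client code, a copyField rule in schema.xml could probably do the same job.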

Then you have the issue of displaying the results in the frontend.
Which title do you pick? title_en or title_fr? Here, too, I see two solutions, and I have tried
both:
1) Add a title_display field that is stored, while the title_<language> fields are only
indexed, not stored. Use title_display in the frontend.
2) Make a wrapper around the QueryResult class so that when the frontend asks for "title", you
pull out title_XY, where XY is taken from the document's "language" metadata.
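
Solution 2) could look roughly like the following Python sketch. It wraps a plain dict rather
than any real client's result class, and the LocalizedResult name and its fallback behaviour
are my own assumptions.

```python
class LocalizedResult:
    """Wraps one result document (a plain dict here) and resolves language-specific fields."""

    def __init__(self, doc, language_field="language"):
        self._doc = doc
        self._language = doc.get(language_field)

    def get(self, name, default=None):
        # Try e.g. title_fr first, then fall back to a plain "title" field.
        if self._language:
            localized = f"{name}_{self._language}"
            if localized in self._doc:
                return self._doc[localized]
        return self._doc.get(name, default)


doc = {"id": "42", "language": "fr", "title_fr": "Bonjour le monde"}
result = LocalizedResult(doc)
print(result.get("title"))  # Bonjour le monde
```

Note that this only works if the title_<language> fields are stored, otherwise there is nothing
to pull out of the result.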

Which one you choose is a matter of taste; each has its pros and cons.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 1 July 2010, at 12:26, Saïd Radhouani wrote:

> Hi,
> 
> I know this topic has been treated many times in the (distant) past, but I wonder whether
> there are new, better practices/tendencies.
>
> In my application, I'm dealing with documents in different languages. Each document is
> monolingual; it has some fields containing free text and a set of fields that do not require
> any text analysis. For the free text, we need to apply a specific analysis based on the language
> of the document.
>
> I'm for the use of a single index for all the documents instead of one index per language
> (any objection?). Thus, in schema.xml, I need to declare a separate field for each language
> (text_fr, text_en, etc.), each with its own appropriate analysis. Then, during indexing,
> I need to assign the free text content of each document to the appropriate field. Thus, for
> each document, only one of the free-text fields would be populated.
>
> My question is: at search time, what is the best solution to search against the appropriate
> field?
>
> I know that using dismax, we can define in "qf" the set of fields we want to search
> against, e.g., <str name="qf">text_fr text_en</str>
>
> With this solution, does Solr choose the appropriate analysis for the query? I.e., if
> a query is compared to a document having English free text (text_en is populated), does Solr
> analyze the query as if it were in English?
>
> One problem with this approach is that each query will be compared to all the available
> documents, i.e., a query in English would be compared to a document in French. As far as I know,
> if we know the query language, this problem can be avoided, either by searching against the
> appropriate field (e.g., text_fr:query), or by using a filter to select only those documents
> having English text. Am I correct? Or is there a better solution?
> 
> Thanks,
> -Saïd
> 


