lucene-solr-user mailing list archives

From Saïd Radhouani <r.steve....@gmail.com>
Subject Re: Multilingual - Search against the appropriate field
Date Thu, 01 Jul 2010 16:50:12 GMT
Hi Jan,

I totally agree with what you said.

In a), you talked about boosting. I guess you meant to boost at the client side, right?

I still have a question: 

>> does Solr choose the appropriate analysis for the query? i.e., if a query is compared to a document having English free text (text_en is populated), does Solr analyze it as if it were in English?


Thanks,
-Saïd

On Jul 1, 2010, at 1:26 PM, Jan Høydahl / Cominvent wrote:

> Hi,
> 
> I have chosen the same approach as you, indexing content into text_<language> fields with custom analysis, and it works great. Solr does not have any overhead with this even if there are hundreds of languages, due to the schema-less nature of Lucene.
> 
> And if you know which language is being searched, you can select only those fields in question, and you'd still be as fast as the mono language case. But you'd only get documents in that language returned.
> 
> Say you want to match across languages, for example a search for "obama", which would be written the same in all languages. How to achieve this? I see two approaches:
> a) Search across all languages with proper analysis, as you suggest: qf=text_fr text_en^10 (you can even boost the preferred languages).
> b) Index all content in a "text_all" field with no stemming involved and search qf=text_all (you will match "obama" in all languages but lose stemming)
> 
> My feeling is that a) would work if you have a limited set of languages, but b) might be necessary if you have dozens of languages to search across, due to reduced query performance with such a large disMax query.
> 
> Of course with a) there may be ambiguities where an English word gets stemmed and hits the same stem as a totally different French word. I don't have any hands-on examples, but I'm sure the issue exists. Then it is probably better to search the other languages unstemmed, as a hybrid approach:
> 
> c) Search the query language stemmed and all others unstemmed (qf=text_en^10 text_all), giving increased recall
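
The three variants a), b) and c) differ only in the qf value passed to the dismax parser. A minimal Python sketch of building that value, assuming the field names used in this thread (text_<language>, text_all); the helper names build_qf and dismax_params and the default boost of 10 are illustrative only, not part of Solr:

```python
from urllib.parse import urlencode


def build_qf(strategy, query_lang, languages, boost=10):
    """Return a dismax qf string for approach a), b) or c)."""
    if strategy == "a":
        # Every per-language field; only the query language is boosted.
        parts = []
        for lang in languages:
            field = f"text_{lang}"
            parts.append(f"{field}^{boost}" if lang == query_lang else field)
        return " ".join(parts)
    if strategy == "b":
        # Single unstemmed catch-all field.
        return "text_all"
    if strategy == "c":
        # Stemmed field for the query language plus the unstemmed catch-all.
        return f"text_{query_lang}^{boost} text_all"
    raise ValueError(f"unknown strategy: {strategy!r}")


def dismax_params(user_query, qf):
    """URL-encode the request parameters for a dismax search."""
    return urlencode({"defType": "dismax", "q": user_query, "qf": qf})
```

For example, build_qf("a", "en", ["fr", "en"]) gives the qf from a) above, and the result of dismax_params can be appended to the select URL of whatever Solr instance is in use.
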
> 
> The downside of a text_all field is you almost double the size of your index worst-case.
> 
> Then you have the issue of displaying the results in the front end.
> Which title do you pick? title_en or title_fr? Here, I also see two solutions and I have tried both:
> 1) Add a stored title_display field, while the title_<language> fields are only indexed, not stored. Use title_display in the front end.
> 2) Make a wrapper around the QueryResult class so that when the front end asks for "title", you intelligently pull out title_XY, where XY is taken from the document's "language" metadata.
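
Option 2) can be sketched in a few lines of Python, assuming each result document arrives as a dict (for instance parsed from a JSON response) and that the title_<language> fields are stored. The function name display_title and the fallback to title_display are assumptions for illustration, not an existing Solr or client API:

```python
def display_title(doc, fallback_field="title_display"):
    """Pick the title matching the document's own language, if present."""
    lang = doc.get("language")
    if lang:
        # e.g. language "fr" -> look up "title_fr"
        title = doc.get(f"title_{lang}")
        if title:
            return title
    # No language metadata or no matching title field: use the fallback.
    return doc.get(fallback_field)
```

A front end would call display_title(doc) for each hit instead of reading a fixed field name.
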
> 
> I think which you choose depends on taste, each has its + and -
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
> 
> On 1. juli 2010, at 12.26, Saïd Radhouani wrote:
> 
>> Hi,
>> 
>> I know this topic has been treated many times in the (distant) past, but I wonder whether there are new better practices/tendencies.
>> 
>> In my application, I'm dealing with documents in different languages. Each document is monolingual; it has some fields containing free text and a set of fields that do not require any text analysis. For the free text, we need to apply a specific analysis based on the language of the document.
>> 
>> I'm for the use of a single index for all the documents instead of one index per language (any objection?). Thus, in schema.xml, I need to declare a separate field for each language (text_fr, text_en, etc.), each with its own appropriate analysis. Then, during the indexing, I need to assign the free text content of each document to the appropriate field. Thus, for each document, only one of the freetext fields would be populated.
>> 
>> My question is, at search time, what is the best solution to search against the appropriate field?
>> 
>> I know that using dismax, we can define in "qf" the set of fields we want to search against, e.g., <str name="qf"> text_fr text_en</str>
>> 
>> With this solution, does Solr choose the appropriate analysis for the query? i.e., if a query is compared to a document having English free text (text_en is populated), does Solr analyze the query as if it were in English?
>> 
>> One problem with this approach is that each query will be compared to all the available documents, i.e., a query in English would be compared to a document in French. As far as I know, if we know the query language, this problem can be avoided, either by searching against the appropriate field (e.g., text_fr:query), or by using a filter to select only those documents having English text. Am I correct? Or is there a better solution?
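
The two ways of restricting a search to a known query language can be written as request parameters. A minimal Python sketch, assuming a "language" field exists on each document (it is mentioned elsewhere in this thread as document metadata, but its name and values are an assumption here); restricted_params is a hypothetical helper, while qf and fq are the standard Solr parameters:

```python
def restricted_params(user_query, lang, use_filter=False):
    """Build dismax parameters that only match documents in one language."""
    params = {"defType": "dismax", "q": user_query}
    if use_filter:
        # Search all language fields, then filter on the language metadata.
        params["qf"] = "text_fr text_en"
        params["fq"] = f"language:{lang}"
    else:
        # Search only the field that holds text in the query language.
        params["qf"] = f"text_{lang}"
    return params
```

Since only one text_<language> field is populated per document, both variants should return the same documents; the filter variant additionally relies on the language field being indexed.
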
>> 
>> Thanks,
>> -Saïd
>> 
> 
> 

