lucene-solr-user mailing list archives

From Walter Underwood <wunderw...@netflix.com>
Subject Re: Preparing the ground for a real multilang index
Date Thu, 02 Jul 2009 20:38:52 GMT
Not to mention Americans who call themselves "wunder". Or brand names, like
LaserJet, which are the same in all languages. Queries are far too short for
effective language id.

You can get language preferences from the HTTP request headers, then allow
people to override them. I think the header is Accept-Language, but it has
been a long time since I did that.
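
A minimal sketch of that idea in Python, assuming the header follows the
usual comma-separated "tag;q=weight" form. The function names and the
"override" parameter (standing in for a user setting or pulldown choice)
are made up for illustration, not taken from Solr or any framework:

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language value, best first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag or tag == "*":
            continue  # skip empty entries and the wildcard
        quality = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not wanted"
        if quality > 0:
            # sort by descending quality, then by original position
            ranked.append((-quality, position, tag))
    ranked.sort()
    return [tag for _, _, tag in ranked]


def choose_language(header, override=None, default="en"):
    """Prefer an explicit user override, then the header, then a default."""
    if override:
        return override.lower()
    tags = parse_accept_language(header or "")
    # keep only the primary subtag, e.g. "de" from "de-de"
    return tags[0].split("-")[0] if tags else default
```

The override is checked first so that a user who picks a language by hand
is never second-guessed by the browser settings.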

I recommend using ISO language codes, en, de, es, fr, and so on, instead of
making up your own, like eng and ger. Don't confuse them with ISO country
codes: uk, us, etc. Korean and Japanese are easy to mix up with the country
codes: the language codes are ko and ja, the country codes are kr and jp.
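
Once the query language is known as an ISO code, the query can go to one
per-language field instead of the big OR. A rough sketch, assuming
hypothetical field names like title_en and title_de (the original post
used TITLE_ENG and TITLE_GER) and a hypothetical unstemmed catch-all
field, title_general, for languages without their own analyzer:

```python
KNOWN_LANGUAGES = ("en", "de", "es", "fr")  # assumed set of indexed languages
FALLBACK_FIELD = "title_general"  # hypothetical catch-all field, no stemmer


def title_field(lang):
    """Map an ISO language code to its per-language title field."""
    return f"title_{lang}" if lang in KNOWN_LANGUAGES else FALLBACK_FIELD


def build_title_query(term, lang=None):
    """Build a query string for one language, or the full OR if unknown."""
    # quote the term and escape backslashes and quotes inside it
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    phrase = f'"{escaped}"'
    if lang is not None:
        return f"{title_field(lang)}:{phrase}"
    # language unknown: fall back to the OR across every field
    fields = [f"title_{code}" for code in KNOWN_LANGUAGES] + [FALLBACK_FIELD]
    return " OR ".join(f"{field}:{phrase}" for field in fields)
```

With a known language, build_title_query("die", "de") searches only the
German field, so the English reading of "die" never enters the query.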

wunder

On 7/2/09 1:15 PM, "Otis Gospodnetic" <otis_gospodnetic@yahoo.com> wrote:

> 
> Michael,
> 
> I think you really ought to know the language of the query (from a pulldown,
> from the browser, from user settings, somewhere) and pass that to the
> backend.... unless your queries are sufficiently long that their language can
> be identified.
> 
> Here is a handy tool for playing with language identification:
> 
>   http://www.sematext.com/demo/lid/
> 
> You'll see how hard it is to guess the language of very short texts. :)
> You really want to avoid that huge OR.  Often it makes no sense to OR in a
> multilingual context.  Think about the word "die" (English and German, as you
> know) and what happens when you include that in an OR.  And does it make sense
> to include a "very language specific word", say "wunderbar", in an OR that
> goes across multiple/all languages?  Funny, they have it listed at
> http://www.merriam-webster.com/dictionary/wunderbar
> 
> 
> Otis--
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
>> From: Michael Lackhoff <michael@lackhoff.de>
>> To: solr-user@lucene.apache.org
>> Sent: Thursday, July 2, 2009 2:58:41 PM
>> Subject: Preparing the ground for a real multilang index
>> 
>> As pointed out in the recent thread about stemmers and other language
>> specifics I should handle them all in their own right. But how?
>> 
>> The first problem is how to know the language. Sometimes I have a
>> language identifier within the record, sometimes I have more than one,
>> sometimes I have none. How should I handle the non-obvious cases?
>> 
>> Given I somehow know record1 is English and record2 is German. Then I
>> need all my (relevant) fields for every language, e.g. I will have
>> TITLE_ENG and TITLE_GER and both will have their respective stemmer. But
>> what about exotic languages? Use a catch-all "language" without a stemmer?
>> 
>> Now a user searches for TITLE:term and I don't know beforehand the
>> language of "term". Do I have to expand the query to something like
>> "TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ..." or is there
>> some sort of copyfield for analyzed fields? Then I could just copy all
>> the TITLE_* fields to TITLE and not bother with the language of the query.
>> 
>> Are there any solutions that prevent an index with thousands of fields
>> and dozens of ORed query terms?
>> 
>> I know I will have to implement some better multilanguage support but
>> would also like to keep it as simple as possible.
>> 
>> -Michael
> 

