lucene-solr-user mailing list archives

From Walter Underwood <wunderw...@netflix.com>
Subject Re: multilanguage + howto search in all languages?
Date Thu, 29 Jan 2009 01:24:07 GMT
Duh. Four cases. For extra credit, what language is "wunder" in?

wunder

On 1/28/09 5:12 PM, "Walter Underwood" <wunderwood@netflix.com> wrote:

> I've done this. There are five cases for the tokens in the search
> index:
> 
> 1. Tokens that are unique after stemming (this is good).
> 2. Tokens that are common after stemming (usually trademarks,
>    like LaserJet).
> 3. Tokens with collisions after stemming:
>    German "mit", "MIT" the university
>    German "Boot" (boat), English "boot" (a heavy shoe)
> 4. Tokens with collisions in the surface form:
>    Dutch "mobile" (plural of furniture), English "mobile"
>    German "die" (stemmed to "das"), English "die"
> 
> You cannot fix every spurious match, but you can do OK with
> stemmed fields for each language and a raw (unstemmed surface
> token) field.
> 
> I won't recommend weights, but you could have fields for
> text_en, text_de, and text_raw, for example.
> 
> You really cannot automatically determine the language of a
> query, mostly because of proper nouns, especially trademarks.
> Identify the language of these queries:
> 
> * Google
> * LaserJet
> * Obama
> * Las Vegas
> * Paris
> 
> HTTP supports an Accept-Language header, but I have no idea
> how often that is sent. We honored that in Ultraseek, mostly
> because it was standard.
> 
> Finally, if you are working with localization, please take the
> time to understand the difference between ISO language codes
> and ISO country codes.
> 
> wunder
> 
> On 1/28/09 4:47 PM, "Erick Erickson" <erickerickson@gmail.com> wrote:
> 
>> I'm not entirely sure about the fine points, but consider the
>> filters that are available that fold all the diacritics into their
>> low-ascii equivalents. Perhaps using that filter at *both* index
>> and search time on the English index would do the trick.
>> 
>> In your example, both would be 'munchen'. Straight English
>> would be unaffected by the filter, but any German words with
>> diacritics that crept in would be folded into their low-ascii
>> "equivalents". This would also work at index time, just in case
>> you indexed English text that had some German words.
>> 
>> NOTE: My experience is more on the Lucene side than the SOLR
>> side, but I'm sure the filters are available.
>> 
>> Best
>> Erick
>> 
>> On Wed, Jan 28, 2009 at 5:21 PM, Julian Davchev <jmut@drun.net> wrote:
>> 
>>> Hi,
>>> I currently have two indexes with Solr, one for the English version and
>>> one for the German version. They use the English and German2 snowball
>>> factories respectively.
>>> Right now, depending on which language the website is currently in, I
>>> query the corresponding index.
>>> There is a requirement, though, that content is found regardless of
>>> which language it is in.
>>> So for example if searching for muenchen (will be caught correctly by
>>> german snowball factory as münchen) in english index it should be found.
>>> Right now
>>> it is not as I suppose english factory doesn't really care about umlauts.
>>> 
>>> Any pointers are more than welcome. I am considering synonyms, but this
>>> will be kinda too heavy to follow/create.
>>> Cheers,
>>> JD
>>> 
> 
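One way to picture the "stemmed fields for each language plus a raw field" suggestion quoted above is a single query that searches text_en, text_de, and text_raw together. The sketch below only builds the request URL. The endpoint address is hypothetical, and it assumes the DisMax query parser and its qf parameter are available in your Solr version. No boosts are set, since the message deliberately recommends no weights.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust to your own installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Field names taken from the message above; no boosts are applied because
# the thread deliberately recommends no weights.
SEARCH_FIELDS = ["text_en", "text_de", "text_raw"]


def build_query_url(user_query, fields=SEARCH_FIELDS):
    """Return a select URL that searches every listed field via DisMax."""
    params = {
        "q": user_query,
        "defType": "dismax",     # assumes the DisMax query parser is available
        "qf": " ".join(fields),  # query fields, space-separated
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_query_url("münchen boot"))
```

Each field would still need its own analyzer chain in the schema (English stemming, German stemming, and no stemming respectively); that part is not shown here.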

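The Accept-Language remark and the language-code versus country-code warning quoted above can be illustrated together. This is a minimal sketch, not a full parser: the function name is made up for illustration, and it assumes a two-letter second subtag is a country code, which ignores script subtags and other variants. In a tag such as "de-CH", "de" is the language (German) and "CH" is the country (Switzerland); the two code sets are separate standards, which is why Swedish is "sv" while Sweden is "SE".

```python
def parse_accept_language(header):
    """Return (language, country, q) tuples sorted by descending q-value."""
    entries = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        tag = tag.strip()
        q = 1.0  # a missing q parameter means full preference
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # unparseable weight: treat as not acceptable
        subtags = tag.split("-")
        language = subtags[0].lower()
        # Simplification: treat a two-letter second subtag as the country code.
        country = None
        if len(subtags) > 1 and len(subtags[1]) == 2 and subtags[1].isalpha():
            country = subtags[1].upper()
        entries.append((language, country, q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(entries, key=lambda e: e[2], reverse=True)


print(parse_accept_language("de-CH, de;q=0.9, en;q=0.8"))
```

The language part could then pick which stemmed field to favour, while the country part is only relevant for things like currency or regional content.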

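The diacritic-folding suggestion quoted above (apply the same folding at both index and search time) can be sketched with the Python standard library. This stands in for whatever folding filter the analyzer chain provides and is not that filter's actual implementation; the function name is made up for illustration. Note that in this sketch the transliterated spelling "muenchen" passes through unchanged, so matching it against "münchen" would still need a separate rule for "ue".

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining marks after canonical decomposition, then lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Apply the same function to indexed tokens and to query tokens.
for word in ["München", "münchen", "muenchen"]:
    print(word, "->", fold_diacritics(word))
```

Characters without a canonical decomposition, such as "ß", are left as they are by this approach; a real folding filter typically handles those with explicit mappings.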