lucene-solr-user mailing list archives

From Jan Høydahl <jan....@cominvent.com>
Subject Re: Language Identification and Stemming
Date Sat, 02 Mar 2013 22:06:44 GMT
In addition to the text_<lang> fields you can of course have a text_general
field which is unstemmed, where you put documents whose language you don't
yet have specific handling for.
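
As a rough sketch of that fallback at index time (the language list and the
helper names below are made up for illustration, they are not part of Solr):

```python
# Hypothetical set of languages that have a dedicated text_<lang> field.
SUPPORTED_LANGS = {"en", "fr", "ru"}


def target_field(lang_code):
    """Return the field a document body should be indexed into."""
    if lang_code in SUPPORTED_LANGS:
        return "text_" + lang_code
    # No language specific handling yet: use the unstemmed catch-all field.
    return "text_general"


def build_doc(doc_id, body, lang_code):
    """Build a document dict with the body placed in the chosen field."""
    return {"id": doc_id, target_field(lang_code): body}
```

A Swedish document would then end up in text_general until a text_sv field
is defined.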

One potential issue with multi-language search is detecting the language of the query itself.
Sometimes your search page knows in advance which language will be entered, and then you can
target the search at text_<lang> only. Other times you won't know which language
it is, and then you have a few choices:

a) Try to detect the language
b) Search across all languages (text_en OR text_fr OR ...)
c) Skip stemming and use only text_general
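
The three choices could be sketched roughly like this. Note the assumptions:
instead of writing out an explicit OR query, the sketch lists the fields in
the qf parameter of the edismax query parser, and the function name and the
language list are invented for the example.

```python
def build_params(query, lang=None, all_langs=("en", "fr", "ru"), stemmed=True):
    """Build Solr request parameters for one of the three strategies."""
    if not stemmed:
        # c) skip stemming and use only the unstemmed field
        fields = ["text_general"]
    elif lang is not None:
        # a) language known in advance or detected: one field only
        fields = ["text_" + lang]
    else:
        # b) language unknown: search across all language fields
        fields = ["text_" + code for code in all_langs]
    return {"defType": "edismax", "q": query, "qf": " ".join(fields)}
```

The returned dict can be sent as request parameters to the select handler
with whatever HTTP client you use.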

Detecting the language of a short query of one or two words is hard. You will be able
to distinguish Chinese from Japanese from Western languages based on unique characters,
but it is much harder to tell Western languages apart.
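
A minimal sketch of such a character based check, using only Python's
unicodedata module. It rests on two assumptions of mine: kana is treated as
a sure sign of Japanese, and a query with Han characters but no kana is
guessed to be Chinese, which will misclassify Japanese queries written
entirely in kanji. Anything else is reported as unknown.

```python
import unicodedata


def guess_language_by_script(query):
    """Return 'ja', 'zh' or None based on the scripts found in the query."""
    has_kana = False
    has_han = False
    for ch in query:
        # name() returns e.g. 'HIRAGANA LETTER A' or 'CJK UNIFIED IDEOGRAPH-4E2D'
        name = unicodedata.name(ch, "")
        if name.startswith(("HIRAGANA", "KATAKANA")):
            has_kana = True
        elif name.startswith("CJK UNIFIED IDEOGRAPH"):
            has_han = True
    if has_kana:
        return "ja"
    if has_han:
        return "zh"  # assumption: Han characters without kana means Chinese
    return None  # Western or other script: cannot tell from characters alone
```

For a Western query this returns None, and you are back to the other choices
in the list above.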

Searching across all languages works great, but you may get some false positives,
e.g. from stemming, when a word exists with different meanings in several languages.
Besides, if you have 200 languages in your index it is impractical to search across
200 fields.

If you skip stemming you will in many cases still be able to build a great search,
but you may be better off trying to guess the input language by means of IP geolocation,
browser headers, statistical analysis or simply asking the user.
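
As an illustration of the browser header idea, here is a simplified sketch
that picks a search field from an Accept-Language header. The function name
and the supported language list are hypothetical, and the parsing only
covers the common "tag;q=value" form rather than the full header grammar.

```python
def field_from_accept_language(header, supported=("en", "fr", "ru")):
    """Pick text_<lang> for the most preferred supported language."""
    candidates = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without an explicit q value counts as 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # unparsable weight: treat as not acceptable
        # Reduce e.g. 'fr-CH' to its primary language subtag 'fr'.
        primary = tag.strip().lower().split("-")[0]
        candidates.append((quality, primary))
    # Stable sort, highest weight first; ties keep the header order.
    for quality, primary in sorted(candidates, key=lambda c: -c[0]):
        if quality > 0 and primary in supported:
            return "text_" + primary
    # Nothing usable in the header: fall back to the unstemmed field.
    return "text_general"
```

The header only tells you what the browser is configured for, not what the
user actually typed, so treat the result as a guess.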

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 1 March 2013 at 23:47, vybe3142 <vybe3142@gmail.com> wrote:

> From your response, I gather that there's no way to maintain a single set of
> fields for multiple languages i.e. I can't use a field "text" for the body
> text. Instead, I would have to define text_en, text_fr, text_ru etc each
> mapped to their specific languages.
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Language-Identification-and-Stemming-tp4044116p4044132.html
> Sent from the Solr - User mailing list archive at Nabble.com.

