lucene-solr-user mailing list archives

From Jan Høydahl
Subject Re: Preparing the ground for a real multilang index
Date Tue, 07 Jul 2009 22:50:18 GMT
Stemming requires knowing the language of the query, because the query
terms must be stemmed with the same rules as the indexed text.
For your project, you may want to switch to a lemmatizer instead. I
believe Lucid can provide integration with a commercial lemmatizer.
With a lemmatizer you can expand the document field itself at index
time, so you do not need to know the query language. You may then want
to add a copyField from all your text_<lang> fields to a single text
field for convenient one-field-to-rule-them-all search.
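
To illustrate why the single field is convenient, here is a minimal
Python sketch that builds the query string both ways. The field names
text_en and text_de and the helper functions are assumptions made up for
this example (following the text_<lang> pattern above), and the terms are
assumed to be plain words that need no query-syntax escaping.

```python
from typing import Iterable, List


def or_group(terms: Iterable[str]) -> str:
    """Join plain terms into a parenthesised OR group."""
    return "(" + " OR ".join(terms) + ")"


def per_language_query(terms: List[str], langs: List[str]) -> str:
    """Query every hypothetical text_<lang> field explicitly."""
    group = or_group(terms)
    return " OR ".join(f"text_{lang}:{group}" for lang in langs)


def catch_all_query(terms: List[str], field: str = "text") -> str:
    """Query only the single catch-all field (assumed to be named 'text')."""
    return f"{field}:{or_group(terms)}"


if __name__ == "__main__":
    terms = ["cancer", "krebs"]
    print(per_language_query(terms, ["en", "de"]))
    print(catch_all_query(terms))
```

The per-language variant grows with every language you add to the index,
while the catch-all variant stays the same.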

Jan Høydahl
Founder & senior architect
Cominvent AS, Stabekk, Norway
+20 100930908

On 3 July 2009, at 08:43, Michael Lackhoff wrote:

> On 03.07.2009 00:49 Paul Libbrecht wrote:
> [I'll try to address the other responses as well]
>> I believe the proper way is for the server to compute a list of
>> accepted languages in order of preferences.
>> The web-platform language (e.g. the user-setting), and the values in
>> the Accept-Language http header (which are from the browser or
>> platform).
> All this is not going to help much because the main application is a
> scientific search portal for books and articles with many users
> searching cross-language. The most typical use case is a German user
> searching across several languages, so a single query may itself be
> multilingual, e.g. TITLE:cancer OR TITLE:krebs. There is no way here to
> rely on Accept-Language headers or a language select field (it would be
> left on "any" in most cases). Another popular use case is citations (in
> whatever language) cut and pasted into the search field.
>> Then you expand your query for surfing waves (say) to:
>> - phrase query: surfing waves exactly (^2.0)
>> - two terms, no stemming: surfing waves (^1.5)
>> - iterate through the languages and query for stemmed variants:
>>   - english: surf wav ^1.0
>>   - german: surfing wave ^0.9
>>   - ....
>> - then maybe even try the phonetic analyzer (matched in a separate
>> field probably)
> This is an even more sophisticated variant of the multiple "OR" I came
> up with. Oh well...
>> I think this is a common pattern on the web where the users, browsers,
>> and servers are all somewhat multilingual.
> Indeed, and often users are not even aware of it. Especially in a
> scientific context they use their native tongue and English almost
> interchangeably -- and they expect the search engine to cope with it.
> I think the best approach would be to process the data according to its
> language but not make any assumptions about the query language, and I am
> totally lost as to how to get a clever schema.xml out of all this.
> Thanks everyone for listening and I am still open for good suggestions
> to deal with this problem!
> -Michael
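
The query expansion Paul outlines in the quoted message (exact phrase,
unstemmed terms, then stemmed variants per language with decreasing
boosts) could be sketched roughly as follows. This is only an
illustration: the field names text_exact and text_<lang> are assumptions,
and the two "stemmers" are toy suffix strippers written for this example,
not real analyzers. In an actual Solr setup the analyzer configured on
each field would normally do the stemming at query time.

```python
from typing import Callable, Dict, Tuple


def toy_english_stem(word: str) -> str:
    """Toy stand-in for an English stemmer: strip one common suffix."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_german_stem(word: str) -> str:
    """Toy stand-in for a German stemmer: strip one common suffix."""
    for suffix in ("en", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(
    text: str,
    stemmers: Dict[str, Tuple[Callable[[str], str], float]],
    exact_field: str = "text_exact",  # hypothetical unstemmed field
) -> str:
    """Build one boosted OR query from the user's raw input."""
    terms = text.split()
    joined = " ".join(terms)
    clauses = [
        f'{exact_field}:"{joined}"^2.0',  # exact phrase, highest boost
        f"{exact_field}:({joined})^1.5",  # individual terms, no stemming
    ]
    # One clause per language, using that language's stemmed variants.
    for lang, (stem, boost) in stemmers.items():
        stemmed = " ".join(stem(term) for term in terms)
        clauses.append(f"text_{lang}:({stemmed})^{boost}")
    return " OR ".join(clauses)


if __name__ == "__main__":
    stemmers = {
        "en": (toy_english_stem, 1.0),
        "de": (toy_german_stem, 0.9),
    }
    print(expand_query("surfing waves", stemmers))
```

A phonetic clause against a separate field, as Paul suggests, could be
appended to the clause list in the same way.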
