lucene-solr-user mailing list archives

From cnyee <yeec...@gmail.com>
Subject Multiplexing TokenFilter for multi-language?
Date Mon, 08 Aug 2011 11:57:00 GMT
Sorry if this has already been discussed, but I have spent a couple of days
googling in vain.

The problem:
- documents in multiple languages (en, de, fr, es).
- language is known (a team of editors determines the language manually, and
users are asked to specify a language option when searching).

My intended approach:
- one index.
- a multiplexing token filter, a MultilingualSnowballFilterFactory that
instantiates a Snowball Stemmer for the appropriate language.
- language is a facet, to get rid of cross-language ambiguities with
multiple languages mixed in the same field.

The problem is how to communicate the language to the
MultilingualSnowballFilterFactory. Once the language is known, instantiating
the Snowball Stemmer for the right language is easy. I have a working version
attached below.

My solution:
- prepend the language as the first token for the FilterFactory to pick up.
E.g. "es This is a Spanish document....".
- this would mean I need to duplicate the fields: an original version for
storing, and a version with the language marker prepended for indexing. E.g.
description (indexed=false, stored=true), description_i (indexed=true,
stored=false).
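
To make the marker-token idea concrete, here is a minimal sketch in plain
Java with no Lucene classes. It is not the attached factory: the class name
LanguageMarkerDemo, the method names and the toy suffix-stripping "stemmers"
are all assumptions made for illustration, standing in for the real Snowball
stemmers. It only shows the control flow: consume the first token as the
language marker, then run every remaining token through the stemmer chosen
for that language.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustrative sketch only. The "stemmers" below are toy suffix strippers
// (an assumption), not the Snowball stemmers the real filter would use.
class LanguageMarkerDemo {

    // Hypothetical stand-in stemmers, keyed by the language marker token.
    static final Map<String, UnaryOperator<String>> STEMMERS = Map.of(
        "en", w -> stripSuffix(w, "ing"),
        "de", w -> stripSuffix(w, "en"),
        "fr", w -> stripSuffix(w, "s"),
        "es", w -> stripSuffix(w, "s"));

    // Removes the suffix only if at least three characters remain.
    static String stripSuffix(String word, String suffix) {
        boolean longEnough = word.length() > suffix.length() + 2;
        return word.endsWith(suffix) && longEnough
            ? word.substring(0, word.length() - suffix.length())
            : word;
    }

    // Treats the first token as the language marker and drops it from the
    // output; all following tokens are stemmed for that language.
    static List<String> stemWithMarker(List<String> tokens) {
        List<String> out = new ArrayList<>();
        if (tokens.isEmpty()) {
            return out;
        }
        String marker = tokens.get(0);
        UnaryOperator<String> stemmer = STEMMERS.get(marker);
        if (stemmer == null) {
            throw new IllegalArgumentException("Unknown language marker: " + marker);
        }
        for (String token : tokens.subList(1, tokens.size())) {
            out.add(stemmer.apply(token));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(stemWithMarker(List.of("es", "gatos", "negros")));
    }
}
```

In this sketch an unknown marker is rejected outright. A real filter might
instead pass the tokens through unstemmed, which is a design choice this
example does not try to settle.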

Is there a better way?

Many thanks in advance.

Yee

http://lucene.472066.n3.nabble.com/file/n3235341/MultilingualSnowballFilterFactory.java
MultilingualSnowballFilterFactory.java 


