lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bill Janssen <jans...@parc.com>
Subject Re: AW: Best practices for multiple languages?
Date Wed, 19 Jan 2011 10:21:51 GMT
Clemens Wyss <clemensdev@mysign.ch> wrote:

> > 1) Docs in different languages -- every document is one language
> > 2) Each document has fields in different languages
> We mainly have 1)-models

I've recently done this for UpLib.  I run a language-guesser over the
document to identify the primary language when the document comes into
my repository, and save that language as part of the metadata for my
document.  When UpLib indexes the document into Lucene, it uses that
language as a key into a table of available Analyzers, and uses the
selected Analyzer for the document's text.  (I'm actually doing this on
a per-paragraph level now, but the principle is the same.)

The tricky part is the query parser.  My extended query parser allows
a pseudo-field "_query_language" to specify that the query itself is in
a particular language, in which case the appropriate Analyzer is used
for the query.

You can read the code for all this at
<http://uplib.parc.com/hg/uplib/file/e29e36f751f7/python/uplib/indexing.py>.

Bill

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message