lucene-java-user mailing list archives

From Paul Libbrecht <>
Subject Re: AW: Best practices for multiple languages?
Date Wed, 19 Jan 2011 18:36:08 GMT
So you are only indexing "analyzed" and querying "analyzed". Is that correct?
Wouldn't it be better to prefer precise matches (e.g. a field analyzed with StandardAnalyzer)
while also allowing matches that are stemmed?
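The preference Paul suggests, ranking exact hits above stemmed ones while still matching both, can be sketched outside Lucene as a two-field score combination. The weights, the toy suffix stripper (standing in for a real stemmer such as Snowball), and the scoring function are illustrative assumptions, not anyone's actual implementation:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: score a document against a query term using two "fields":
// an exact (StandardAnalyzer-like) field and a stemmed field.
// An exact hit is weighted higher, so precise matches rank first,
// while stemmed hits still contribute. Weights are illustrative.
public class TwoFieldScore {
    // Crude suffix stripper standing in for a real stemmer.
    static String stem(String t) {
        if (t.endsWith("ing")) return t.substring(0, t.length() - 3);
        if (t.endsWith("s"))   return t.substring(0, t.length() - 1);
        return t;
    }

    static double score(List<String> docTokens, String queryTerm) {
        double s = 0.0;
        for (String tok : docTokens) {
            if (tok.equals(queryTerm)) {
                s += 2.0; // exact-field hit: weighted higher
            } else if (stem(tok).equals(stem(queryTerm))) {
                s += 1.0; // stemmed-field hit: still matches
            }
        }
        return s;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("jumping", "dogs", "jump");
        System.out.println(score(doc, "jump"));    // exact "jump" + stemmed "jumping"
        System.out.println(score(doc, "walking")); // no match
    }
}
```

In Lucene terms this corresponds to indexing the same text into two fields with different analyzers and boosting the precisely-analyzed field at query time.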


On 19 Jan 2011, at 19:21, Bill Janssen wrote:

> Clemens Wyss <> wrote:
>>> 1) Docs in different languages -- every document is one language
>>> 2) Each document has fields in different languages
>> We mainly have 1)-models
> I've recently done this for UpLib.  I run a language-guesser over the
> document to identify the primary language when the document comes into
> my repository, and save that language as part of the metadata for my
> document.  When UpLib indexes the document into Lucene, it uses that
> language as a key into a table of available Analyzers, and uses the
> selected Analyzer for the document's text.  (I'm actually doing this on
> a per-paragraph level now, but the principle is the same.)
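[The per-language Analyzer table Bill describes can be sketched as a map lookup with a fallback. The language codes, the lower-casing functions standing in for real Lucene Analyzers, and the fallback behavior are assumptions for illustration, not UpLib's code:]

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Sketch: choose a per-language "analyzer" from a table keyed by the
// detected document language, falling back to a default when the
// language is unknown. The string functions here stand in for real
// Lucene Analyzers (e.g. a French or German analyzer).
public class AnalyzerTable {
    static final Function<String, String> DEFAULT =
        text -> text.toLowerCase(Locale.ROOT);

    static final Map<String, Function<String, String>> ANALYZERS = new HashMap<>();
    static {
        // Tag the output so tests can see which "analyzer" ran.
        ANALYZERS.put("en", text -> "en:" + text.toLowerCase(Locale.ROOT));
        ANALYZERS.put("fr", text -> "fr:" + text.toLowerCase(Locale.ROOT));
    }

    // Look up the analyzer for a guessed language, defaulting if absent.
    static Function<String, String> forLanguage(String lang) {
        return ANALYZERS.getOrDefault(lang, DEFAULT);
    }

    public static void main(String[] args) {
        System.out.println(forLanguage("fr").apply("Bonjour")); // fr:bonjour
        System.out.println(forLanguage("xx").apply("Hello"));   // hello
    }
}
```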
> The tricky part is the query parser.  My extended query parser allows
> a pseudo-field "_query_language" to specify that the query itself is in
> a particular language, in which case the appropriate Analyzer is used
> for the query.
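[The "_query_language" pseudo-field name comes from Bill's post; one minimal way such a pseudo-field could be stripped out before the rest of the query is parsed is sketched below. The parsing details are assumptions, not UpLib's actual query parser:]

```java
// Sketch: extract a "_query_language" pseudo-field from a query string,
// returning the declared language (or null) and the remaining query text.
public class QueryLanguage {
    // Returns { language-or-null, remaining query }.
    static String[] split(String query) {
        String lang = null;
        StringBuilder rest = new StringBuilder();
        for (String part : query.trim().split("\\s+")) {
            if (part.startsWith("_query_language:")) {
                lang = part.substring("_query_language:".length());
            } else {
                if (rest.length() > 0) rest.append(' ');
                rest.append(part);
            }
        }
        return new String[] { lang, rest.toString() };
    }

    public static void main(String[] args) {
        String[] r = split("_query_language:fr chats noirs");
        System.out.println(r[0] + " / " + r[1]); // fr / chats noirs
    }
}
```

The extracted language would then select the Analyzer used to analyze the remaining query terms, mirroring the index-time choice.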
> You can read the code for all this at
> <>.
> Bill
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

