lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bill Janssen <jans...@parc.com>
Subject Re: AW: Best practices for multiple languages?
Date Wed, 19 Jan 2011 11:56:34 GMT
Paul Libbrecht <paul@hoplahup.net> wrote:

> So you are only indexing "analyzed" and querying "analyzed". Is that correct?

Yes, that's correct.  I fall back to StandardAnalyzer if no
language-specific analyzer is available.

> Wouldn't it be better to prefer precise matches (a field that is
> analyzed with StandardAnalyzer for example) but also allow matches are
> stemmed.

StandardAnalyzer isn't quite precise, is it?  StandardFilter does some
kind of English-centric alterations to things.

Perhaps the approach you suggest would be slightly better, but I'd have
to see numbers on that from some reasonable corpus to be convinced it
would be worth it.

Bill

> 
> paul
> 
> 
> Le 19 janv. 2011 à 19:21, Bill Janssen a écrit :
> 
> > Clemens Wyss <clemensdev@mysign.ch> wrote:
> > 
> >>> 1) Docs in different languages -- every document is one language
> >>> 2) Each document has fields in different languages
> >> We mainly have 1)-models
> > 
> > I've recently done this for UpLib.  I run a language-guesser over the
> > document to identify the primary language when the document comes into
> > my repository, and save that language as part of the metadata for my
> > document.  When UpLib indexes the document into Lucene, it uses that
> > language as a key into a table of available Analyzers, and uses the
> > selected Analyzer for the document's text.  (I'm actually doing this on
> > a per-paragraph level now, but the principle is the same.)
> > 
> > The tricky part is the query parser.  My extended query parser allows
> > a pseudo-field "_query_language" to specify that the query itself is in
> > a particular language, in which case the appropriate Analyzer is used
> > for the query.
> > 
> > You can read the code for all this at
> > <http://uplib.parc.com/hg/uplib/file/e29e36f751f7/python/uplib/indexing.py>.
> > 
> > Bill
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message