lucene-java-user mailing list archives

From Bill Janssen
Subject finding the analyzer for a language...
Date Fri, 24 Sep 2010 18:58:07 GMT
I thought that since I'm updating UpLib's Lucene code, I should tackle
the issue of document languages, as well.  Right now I'm using an
off-the-shelf language identifier, textcat, to figure out which language
a Web page or PDF is (mainly) written in.  I then want to analyze that
document with the correct Lucene analyzer for that language, falling
back to StandardAnalyzer if the installed Lucene library doesn't have
an analyzer for that language.

It would be *very* handy if Analyzer had a static method

  static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);

Right now I'm consulting a hand-compiled mapping of
langtag-to-Lucene-classname to figure out which Analyzer to use.
Wearisome, and it will be out-of-date for future releases of Lucene,
which will presumably support more languages.
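
As a rough illustration of the hand-compiled mapping and fallback
described above, here is a minimal sketch.  The class AnalyzerLookup and
its methods are hypothetical, not part of Lucene; the analyzer class
names in main are assumed to match the installed Lucene release.  The
sketch only resolves class names, so it compiles without Lucene on the
classpath.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a language tag to an analyzer class name. */
final class AnalyzerLookup {
    // Fallback named in the message: StandardAnalyzer.
    static final String FALLBACK =
        "org.apache.lucene.analysis.standard.StandardAnalyzer";

    private final Map<String, String> byTag = new HashMap<>();

    /** Registers one row of the hand-compiled table. */
    void register(String langTag, String className) {
        byTag.put(normalize(langTag), className);
    }

    /** Returns the mapped class name if it is loadable, else the fallback. */
    String resolve(String langTag) {
        String tag = normalize(langTag);
        String name = byTag.get(tag);
        int dash = tag.indexOf('-');
        if (name == null && dash > 0) {
            // No exact match: retry with the primary subtag ("de-at" -> "de").
            name = byTag.get(tag.substring(0, dash));
        }
        return (name != null && isLoadable(name)) ? name : FALLBACK;
    }

    /** True if the class exists on the classpath (it is not initialized). */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false,
                          AnalyzerLookup.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    private static String normalize(String tag) {
        return tag.trim().replace('_', '-').toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        AnalyzerLookup lookup = new AnalyzerLookup();
        // Assumed class names; adjust to the installed Lucene version.
        lookup.register("de", "org.apache.lucene.analysis.de.GermanAnalyzer");
        lookup.register("fr", "org.apache.lucene.analysis.fr.FrenchAnalyzer");
        System.out.println(lookup.resolve("de-AT"));
        System.out.println(lookup.resolve("xx"));
    }
}
```

The loadability check is what lets the same table be used against
Lucene installations that lack some of the listed analyzers; it does
not remove the need to keep the table itself up to date.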

Secondly, if I've got an instance of a SnowballAnalyzer, there's no way
to look "inside" it, and see what language it's for.  That's a problem
on the search side.  My QueryParser is a subclass of
MultiFieldQueryParser, and it looks for a "special" FieldQuery on the
field "_query_language", i.e., "_query_language:de" to tell the query
parser to use a German analyzer on this query.  What I'd like to be able
to do is interrogate the current analyzer attached to the query parser
instance, and throw an exception if it's not for the specified language.
I can do this for non-Snowball analyzers, because of the brittle
hand-compiled mapping mentioned above.  But if it's a SnowballAnalyzer,
there's no way to tell what the language inside it is.  So it would be
nice if SnowballAnalyzer grew a method

  String getLanguageName();

Even better would be

  String getLanguageTag();
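
Until something like that exists, one possible workaround is to record
the language at construction time in a small wrapper.  The sketch below
is only that: TaggedAnalyzer, its methods, and the name-to-tag table are
hypothetical, and the type parameter A stands in for whatever analyzer
class is actually used, so no Lucene API is assumed.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical wrapper that remembers an analyzer's language. */
final class TaggedAnalyzer<A> {
    // Assumed mapping from Snowball-style language names to language
    // tags; only a few rows are shown.
    private static final Map<String, String> TAGS = new HashMap<>();
    static {
        TAGS.put("English", "en");
        TAGS.put("German", "de");
        TAGS.put("French", "fr");
        TAGS.put("Spanish", "es");
    }

    private final A analyzer;
    private final String languageName;

    TaggedAnalyzer(A analyzer, String languageName) {
        this.analyzer = analyzer;
        this.languageName = languageName;
    }

    A analyzer() { return analyzer; }

    String getLanguageName() { return languageName; }

    /** Returns the tag for the recorded name, or "und" if unknown. */
    String getLanguageTag() {
        String tag = TAGS.get(languageName);
        return tag != null ? tag : "und";
    }

    /** Compares on the primary subtag only, so "de-AT" matches "de". */
    boolean isFor(String requestedTag) {
        String primary = requestedTag.trim()
            .toLowerCase(Locale.ROOT).split("-", 2)[0];
        return primary.equals(getLanguageTag());
    }

    public static void main(String[] args) {
        // new Object() is a placeholder for a real analyzer instance.
        TaggedAnalyzer<Object> german =
            new TaggedAnalyzer<>(new Object(), "German");
        System.out.println(german.getLanguageTag());
        System.out.println(german.isFor("de-AT"));
        System.out.println(german.isFor("fr"));
    }
}
```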

And, it would be nice if QueryParser grew a method

  void setAnalyzer(Analyzer a);

which would allow me to simply replace the current analyzer for the
parsing of the rest of the query, instead of going through the rigmarole
of throwing an exception, catching it, recreating the QueryParser with a
different analyzer, and trying again.  What would break if you changed
the analyzer in midstream?  Wouldn't it simply be used for analyzing
remaining terms in the query?
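
Absent such a method, one way to sidestep the exception-and-retry
rigmarole, at least when the marker applies to the whole query, is to
scan the raw query string for "_query_language:xx" before any parser is
constructed.  The following is a minimal sketch under that assumption;
QueryLanguageScan is a hypothetical class, and it does not address
whether switching analyzers midstream would be safe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-scan for the "_query_language" pseudo-field. */
final class QueryLanguageScan {
    // Matches a whitespace-delimited "_query_language:<tag>" token plus
    // any whitespace that follows it.  Note that a marker inside a
    // quoted phrase would be matched as well.
    private static final Pattern MARKER = Pattern.compile(
        "(?<!\\S)_query_language:"
        + "([A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8})*)(?!\\S)\\s*");

    final String language;   // null when no marker was found
    final String remainder;  // the query with the marker removed

    private QueryLanguageScan(String language, String remainder) {
        this.language = language;
        this.remainder = remainder;
    }

    /** Extracts the first marker, if any, and strips it from the query. */
    static QueryLanguageScan scan(String query) {
        Matcher m = MARKER.matcher(query);
        if (!m.find()) {
            return new QueryLanguageScan(null, query.trim());
        }
        String rest = query.substring(0, m.start())
                    + query.substring(m.end());
        return new QueryLanguageScan(m.group(1), rest.trim());
    }

    public static void main(String[] args) {
        QueryLanguageScan s =
            scan("title:haus _query_language:de garten");
        System.out.println(s.language);
        System.out.println(s.remainder);
    }
}
```

The caller can then pick the analyzer for the reported language and
construct the query parser once, passing it the remainder.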

I see that Robert Muir has been doing a lot of good work on the Snowball
code.  I'd really like to see the stopword work finished, so that a
SnowballAnalyzer for a particular language has a decent set of

