lucene-dev mailing list archives

From Otis Gospodnetic
Subject Re: language identifier, stemmers and analyzers
Date Mon, 18 Nov 2002 03:22:36 GMT

1. Ideally, yes, if you ask me.  You get email in at least 2 languages
- wouldn't it make sense to have it all indexed in a single email

2. I think it would be nice to have an Analyzer that can pick the
correct Analyzer based on the language, but since the language
identifier can also be called from Brad's code directly, one will
always be able to opt for custom logic in their application instead of
using your language-aware Analyzer.
So my opinion is that a specialized Analyzer that can pick the right
Analyzer implementation based on the language of the input would be
good, as it does not prevent developers from using Brad's code
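
For illustration only, here is a rough sketch of that selection step.
It is not Lucene's Analyzer API and it is not Brad's code: the
LanguageIdentifier interface, the class name and the use of plain
String functions as stand-ins for analyzers are all assumptions made to
keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: picks a per-language analysis function by identified language. */
final class LanguageAwareAnalysis {

    /** Stand-in for a language identifier; returns a code such as "en" or "nl". */
    interface LanguageIdentifier {
        String identify(String text);
    }

    private final LanguageIdentifier identifier;
    private final Function<String, String> fallback;
    private final Map<String, Function<String, String>> analyzers = new HashMap<>();

    LanguageAwareAnalysis(LanguageIdentifier identifier, Function<String, String> fallback) {
        this.identifier = identifier;
        this.fallback = fallback;
    }

    /** Registers the analysis function to use for one language code. */
    void register(String language, Function<String, String> analyzer) {
        analyzers.put(language, analyzer);
    }

    /** Identifies the language, then delegates to the matching function or the fallback. */
    String analyze(String text) {
        String language = identifier.identify(text);
        return analyzers.getOrDefault(language, fallback).apply(text);
    }
}
```

An application that prefers its own logic can skip this wrapper and
call the identifier itself, which is the point made above.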

Is this something that can be included in Lucene core/sandbox?


--- maurits van wijland wrote:
> Dear all,
> Brad Wellington has created a language identifier which can be used
> in combination with the snowball stemmers donated to Lucene by Alex
> Murzaku. I have now built a solid language model for use with the
> language identifier for the languages: Danish, Dutch, English,
> Finnish, French, German, Italian, Norwegian, Portuguese, Spanish and
> Swedish.
> The language identifier is based on a Naive Bayes classifier. Now,
> this is all nice, but I have some integration questions, and I hope
> you can help out.
> Basically, the process of indexing is:
> Create an analyzer
> Open an IndexWriter
> Pass it the analyzer
> Process a document
> Add document to index
> Optimize writer
> Close writer
> Now, the language identifier can help automatically identify what
> language a document is written in. Based on the suggestion of the
> identifier, an appropriate analyzer can be selected.
> This is all great, but...
> 1. Do we index all the terms from various documents in various
> languages into 1 index?
> 2. Do I build a specialised Analyzer that selects the stemmer based
> on the Language Identifier or leave that up to the custom indexing
> application?
> Your thoughts please...
> regards,
> Maurits
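
The quoted message says the identifier is based on a Naive Bayes
classifier. As a rough idea of what that can look like, here is a toy
sketch. It is not Brad's implementation: the character-trigram
features, the add-one smoothing, the uniform prior and all names are
assumptions chosen for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy Naive Bayes language guesser over character trigrams (illustrative only). */
final class NaiveBayesLanguageGuesser {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    /** Adds the trigram counts of one training sample to the given language. */
    void train(String language, String sample) {
        Map<String, Integer> langCounts = counts.computeIfAbsent(language, k -> new HashMap<>());
        for (String gram : trigrams(sample)) {
            langCounts.merge(gram, 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
            vocabulary.add(gram);
        }
    }

    /** Returns the language with the highest log-likelihood, or null if untrained. */
    String guess(String text) {
        List<String> grams = trigrams(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            String language = entry.getKey();
            // Add-one (Laplace) smoothing so unseen trigrams do not zero out a language.
            double denominator = totals.getOrDefault(language, 0) + vocabulary.size();
            double score = 0.0; // uniform prior over languages (assumption)
            for (String gram : grams) {
                int count = entry.getValue().getOrDefault(gram, 0);
                score += Math.log((count + 1) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = language;
            }
        }
        return best;
    }

    /** Splits lower-cased, space-padded text into overlapping character trigrams. */
    private static List<String> trigrams(String text) {
        String padded = " " + text.toLowerCase(Locale.ROOT) + " ";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }
}
```

A real language model would be trained on far more text per language
than a sentence or two; the sketch only shows the shape of the scoring.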
