lucene-java-user mailing list archives

From maurits van wijland <>
Subject Language identifier, stemmers and analyzers
Date Fri, 15 Nov 2002 23:42:54 GMT
Hi there,
this is a cross-post. I first sent this to the developers list, but somehow
there has been no response yet. Maybe someone here can help me!

I am hoping to improve Lucene by adding a strategy for multilingual
support. We already have stemmers for almost all European languages,
so I think this is the next step.

Any thoughts, please??


> Dear all,
> Brad Wellington has created a language identifier which can be used in
> combination with the Snowball stemmers donated to Lucene by Alex Murzaku.
> I have built a solid language model for use with the language identifier,
> covering the languages Danish, Dutch, English, Finnish, French, German,
> Italian, Norwegian, Portuguese, Spanish and Swedish.
> The language identifier is based on a Naive Bayes classifier. Now, this is
> all nice, but I have some integration questions, and I hope you can help
> out.
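To make the Naive Bayes idea concrete, here is a minimal sketch of a language guesser over character trigrams. It is not the identifier mentioned above: the class name, the trigram features, the uniform prior, and the add-one smoothing are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the identifier discussed in this thread.
class NaiveBayesLanguageGuesser {
    private static final int N = 3; // character n-gram length (an assumption)

    // language -> (trigram -> count seen in training text)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // language -> total number of trigrams seen in training text
    private final Map<String, Integer> totals = new HashMap<>();
    // distinct trigrams over all languages, used for add-one smoothing
    private final Set<String> vocabulary = new HashSet<>();

    void train(String language, String sampleText) {
        Map<String, Integer> langCounts =
                counts.computeIfAbsent(language, k -> new HashMap<>());
        String text = normalize(sampleText);
        for (int i = 0; i + N <= text.length(); i++) {
            String gram = text.substring(i, i + N);
            langCounts.merge(gram, 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
            vocabulary.add(gram);
        }
    }

    // Returns the language with the highest log-likelihood, or null if untrained.
    String guess(String input) {
        String text = normalize(input);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            int total = totals.getOrDefault(e.getKey(), 0);
            double score = 0.0; // uniform prior: every language starts equal
            for (int i = 0; i + N <= text.length(); i++) {
                int c = e.getValue().getOrDefault(text.substring(i, i + N), 0);
                // add-one smoothing keeps unseen trigrams from zeroing the score
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    // Lower-case, collapse whitespace, and pad so word boundaries form trigrams.
    private static String normalize(String s) {
        return " " + s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
    }

    public static void main(String[] args) {
        NaiveBayesLanguageGuesser g = new NaiveBayesLanguageGuesser();
        g.train("en", "the quick brown fox jumps over the lazy dog and the cat");
        g.train("nl", "de snelle bruine vos springt over de luie hond en de kat");
        System.out.println(g.guess("the dog and the fox"));
    }
}
```

A real language model would be trained on far more text per language than the one-sentence samples used here.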
> Basically, the process of indexing is:
> 1. Create an analyzer
> 2. Open an IndexWriter
> 3. Pass it the analyzer
> 4. Process a document
> 5. Add the document to the index
> 6. Optimize the writer
> 7. Close the writer
> Now, the language identifier can automatically identify which language a
> document is written in. Based on the identifier's suggestion, an
> appropriate analyzer can be selected.
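The selection step could be sketched roughly as follows. This is plain Java with no Lucene classes: `LanguageAwareTokenizer`, `register`, `analyze`, and the toy `stripPluralS` stemmer are hypothetical names, and the identifier is modelled as a simple function from text to a language code. A real version would plug in the Snowball stemmers and the actual language identifier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical sketch: picks a stemmer per document from a guessed language.
class LanguageAwareTokenizer {
    private final Function<String, String> identifier; // text -> language code
    private final Map<String, UnaryOperator<String>> stemmers = new HashMap<>();
    // Used when no stemmer is registered for the guessed language.
    private final UnaryOperator<String> fallback = UnaryOperator.identity();

    LanguageAwareTokenizer(Function<String, String> identifier) {
        this.identifier = identifier;
    }

    void register(String language, UnaryOperator<String> stemmer) {
        stemmers.put(language, stemmer);
    }

    // Guess the language once per document, then stem every token with it.
    List<String> analyze(String text) {
        String language = identifier.apply(text);
        UnaryOperator<String> stemmer = stemmers.getOrDefault(language, fallback);
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                terms.add(stemmer.apply(token));
            }
        }
        return terms;
    }

    // Toy stand-in for a real stemmer: drops a trailing "s" from longer words.
    static String stripPluralS(String word) {
        return word.length() > 3 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
    }

    public static void main(String[] args) {
        // The identifier here always answers "en"; a real one would inspect the text.
        LanguageAwareTokenizer tokenizer = new LanguageAwareTokenizer(text -> "en");
        tokenizer.register("en", LanguageAwareTokenizer::stripPluralS);
        System.out.println(tokenizer.analyze("Cats chase dogs"));
    }
}
```

Whichever component makes the choice, query text would need to pass through the same stemmer as the indexed text, so the guessed language probably has to be recorded somewhere at indexing time.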
> This is all great, but...
> 1. Do we index all the terms from documents in various languages into a
> single index?
> 2. Do I build a specialised Analyzer that selects the stemmer based on the
> Language Identifier or leave that up to the custom indexing application?
> Your thoughts please...
> regards,
> Maurits
