lucene-dev mailing list archives

From maurits van wijland <m.vanwijl...@quicknet.nl>
Subject language identifier, stemmers and analyzers
Date Wed, 13 Nov 2002 14:33:18 GMT
Dear all,

Brad Wellington has created a language identifier that can be used in
combination with the Snowball stemmers donated to Lucene by Alex Murzaku.
I have now built a solid language model for the language identifier,
covering the following languages: Danish, Dutch, English, Finnish, French,
German, Italian, Norwegian, Portuguese, Spanish and Swedish.

The language identifier is based on a Naive Bayes classifier. Now, this is
all nice, but I have some integration questions, and I hope you can help
out.
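
For readers unfamiliar with the approach, here is a minimal sketch of a
Naive Bayes language guesser in plain Java. It is not Brad's code: the
class name TrigramLanguageGuesser, the choice of character trigrams as
features, the add-one smoothing and the uniform prior are all assumptions
made for illustration only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy Naive Bayes language guesser over character trigrams (illustrative only). */
class TrigramLanguageGuesser {
    // Per-language trigram counts. Trigram features are an assumption of this sketch.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // Total number of trigrams seen per language.
    private final Map<String, Integer> totals = new HashMap<>();
    // All distinct trigrams seen in any language, used for smoothing.
    private final Set<String> vocabulary = new HashSet<>();

    /** Adds the trigrams of one training text to the model of the given language. */
    void train(String language, String text) {
        Map<String, Integer> langCounts =
                counts.computeIfAbsent(language, k -> new HashMap<>());
        String s = normalize(text);
        for (int i = 0; i + 3 <= s.length(); i++) {
            String gram = s.substring(i, i + 3);
            langCounts.merge(gram, 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
            vocabulary.add(gram);
        }
    }

    /** Returns the language with the highest log-likelihood, or null if untrained. */
    String guess(String text) {
        String s = normalize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String language : counts.keySet()) {
            Map<String, Integer> langCounts = counts.get(language);
            double denom = totals.getOrDefault(language, 0) + vocabulary.size();
            double score = 0.0; // uniform prior over languages, so it is left out
            for (int i = 0; i + 3 <= s.length(); i++) {
                int c = langCounts.getOrDefault(s.substring(i, i + 3), 0);
                score += Math.log((c + 1) / denom); // add-one smoothing
            }
            if (score > bestScore) {
                bestScore = score;
                best = language;
            }
        }
        return best;
    }

    /** Lower-cases, collapses whitespace and pads with spaces to mark word edges. */
    private static String normalize(String text) {
        return " " + text.toLowerCase().replaceAll("\\s+", " ").trim() + " ";
    }
}
```

Typical use would be to call train() once per language with a reasonably
large sample text and then call guess() on each document before indexing.
A real model needs far more training data than a sentence or two.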

Basically, the process of indexing is:
Create an analyzer
Open an IndexWriter
Pass it the analyzer
Process a document
Add document to Index
Optimize writer
Close writer

Now, the language identifier can automatically identify what language a
document is written in. Based on the suggestion of the identifier, an
appropriate analyzer can be selected.
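
The selection step could be sketched roughly as follows. AnalyzerSelector
is a hypothetical helper, not part of Lucene or of the language
identifier. It is generic over the analyzer type so that the sketch stays
free of any library dependency, and it assumes the identifier reports the
language as a plain string such as "Dutch".

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper: maps an identified language to a factory for an analyzer of type A. */
class AnalyzerSelector<A> {
    private final Map<String, Supplier<A>> byLanguage = new HashMap<>();
    private final Supplier<A> fallback;

    /** The fallback is used when the language is unknown or was not identified. */
    AnalyzerSelector(Supplier<A> fallback) {
        this.fallback = fallback;
    }

    /** Registers a factory for one language; names are matched case-insensitively. */
    AnalyzerSelector<A> register(String language, Supplier<A> factory) {
        byLanguage.put(language.toLowerCase(Locale.ROOT), factory);
        return this;
    }

    /** Returns a new analyzer for the identified language, or the fallback analyzer. */
    A select(String identifiedLanguage) {
        if (identifiedLanguage == null) {
            return fallback.get();
        }
        String key = identifiedLanguage.toLowerCase(Locale.ROOT);
        return byLanguage.getOrDefault(key, fallback).get();
    }
}
```

In an indexing application, A would be the analyzer class, and each
registered factory would build the stemming analyzer for one language.
Keeping the lookup outside the analyzer leaves the choice between a
specialised Analyzer and application-level selection open.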

This is all great, but...

1. Do we index all the terms from various documents in various languages
into 1 index?
2. Do I build a specialised Analyzer that selects the stemmer based on the
Language Identifier or leave that up to the custom indexing application?

Your thoughts please...

regards,

Maurits



