lucene-dev mailing list archives

From "Robert Engels" <reng...@ix.netcom.com>
Subject RE: N-gram layer
Date Mon, 02 Feb 2004 04:15:26 GMT
Actually, you do not always need to store it in a field.

See the Phonetic Query patch I posted (which does Soundex, Metaphone, and
can actually do any 'secondary' info query).

Robert Engels

-----Original Message-----
From: karl wettin [mailto:kalle@snigel.dnsalias.net]
Sent: Sunday, February 01, 2004 3:07 PM
To: lucene-dev@jakarta.apache.org
Subject: N-gram layer



Hello list,

I'm Karl, and I just started testing Lucene the other day. It's a great
core engine, but I feel there are some things missing that I'd be happy
to contribute.

I started by writing a simple N-gram classifier to detect the language of
a text in order to automatically cluster documents by language. The
algorithm is very similar to the "TextCat" C library.
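
For readers unfamiliar with this style of classifier, here is a minimal sketch of the rank-based ("out-of-place") N-gram comparison commonly associated with TextCat. It is not the code described in this message. The function names, the `SAMPLES` training strings, and the `max_n` and `top_k` defaults are all assumptions made for illustration, and real profiles would be built from much larger corpora.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams (1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def detect_language(text, lang_profiles, max_n=3, top_k=300):
    """Return the key of the language profile closest to the text's profile."""
    doc = ngram_profile(text, max_n, top_k)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Hypothetical, deliberately tiny training samples (assumptions for this sketch).
SAMPLES = {
    "en": "this is a short english text about dogs and cats and other things",
    "sv": "det är en liten text på svenska om hundar och katter och andra saker",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(detect_language("the dogs and the cats", PROFILES))
```

The same ranked profiles could in principle be reused for clustering, since the distance is defined between any two profiles and not only between a document and a language.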

And then I thought, maybe it would be possible to use the same N-gram
classifier to make an automatic stemmer that works on all languages.
Hopefully I'll have something up and running for tests by next weekend.

The same classifier could be used for a simple metaphone index.

However, I need some help understanding the Analyzer. Where can I
find some tutorials on how to write my own? I didn't check with Google;
maybe I should have before posting here. Since the stemmer (and metaphone)
data would have to be indexed in their own field(?), querying the stemmed
field would require one to stem the query too. Can I create a subclass of
Query (or so), or do I need to create my own Query class that handles
the stemming all the way for the user? The last option is my current
approach, so I would appreciate some hints and pointers here.
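
The point about stemming the query can be illustrated with a toy in-memory index. This is only a sketch and does not use Lucene. `toy_stem`, `ToyIndex`, and the field names `contents` and `stemmed` are hypothetical names invented here, and the suffix stripping stands in for a real stemmer. The idea it shows is that whatever transformation is applied to tokens at index time must also be applied to the query term, which in Lucene is usually achieved by using the same Analyzer on both sides.

```python
from collections import defaultdict


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


class ToyIndex:
    def __init__(self):
        # Maps (field, term) to the set of document ids containing that term.
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in text.lower().split():
            # The raw token and its stem are indexed in separate fields.
            self.postings[("contents", token)].add(doc_id)
            self.postings[("stemmed", toy_stem(token))].add(doc_id)

    def search(self, field, term):
        """Exact term lookup in one field; the term is used as given."""
        return set(self.postings.get((field, term), set()))

    def search_stemmed(self, user_term):
        """Apply the index-time transformation to the query term before lookup."""
        return self.search("stemmed", toy_stem(user_term.lower()))


if __name__ == "__main__":
    index = ToyIndex()
    index.add(1, "indexing documents")
    index.add(2, "the index grows")
    print(index.search("stemmed", "indexed"))  # raw term, not stemmed first
    print(index.search_stemmed("indexed"))     # stemmed the same way as the index
```

Wrapping the transformation inside the search call, as `search_stemmed` does, corresponds to the "handle the stemming all the way for the user" option described above.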


Great project!


karl



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

