lucene-dev mailing list archives

From "Nestel, Frank IZ/HZA-IOL" <>
Subject RE: N-gram layer
Date Mon, 09 Feb 2004 15:24:48 GMT

Sorry for the late reply; I don't follow this list all the time. I once
started writing an n-gram version similar to TextCat for Java. I soon
got distracted, so it is still somewhat raw, but it has long been
published under

As far as I remember, it was able to parse TextCat resources, use
variable-length n-grams, and generate n-gram sets from samples. It is
intended to look at files as a byte stream.
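
For readers unfamiliar with the approach, here is a minimal sketch of building a TextCat-style ranked n-gram profile from a byte stream. It is not the code referred to above; the class name ByteNgramProfile, the method names, the n-gram length range, and the tie-breaking rule are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration; not the published code mentioned in this message.
class ByteNgramProfile {

    // Count every n-gram of length minN..maxN in the byte sequence.
    static Map<String, Integer> count(byte[] data, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= data.length; i++) {
                // ISO-8859-1 maps each byte to exactly one char, so the
                // String key is a lossless stand-in for the raw bytes.
                String gram = new String(data, i, n, StandardCharsets.ISO_8859_1);
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Rank n-grams by descending count and keep the top `size` entries.
    // Ties are broken by string order (an assumption, to keep output deterministic).
    static List<String> rank(Map<String, Integer> counts, int size) {
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(size, grams.size())));
    }

    public static void main(String[] args) {
        byte[] sample = "the cat sat on the mat".getBytes(StandardCharsets.ISO_8859_1);
        // Print the ten most frequent 1- to 3-byte n-grams of the sample.
        System.out.println(rank(count(sample, 1, 3), 10));
    }
}
```

A language would then be guessed by comparing the ranked profile of an unknown text against the stored profiles of known languages.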

Things that should be easy starting from there:
* Making it pluggable into a general stemming system.
* Looking at character streams instead of byte streams (i.e. encoding stuff
handled by Java).
* Cosine-valued reporting and orthogonalization of n-gram space variables.
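
As a rough sketch of the second and third points, the following counts n-grams from a character stream and reports the cosine between two count vectors. The class name CharNgramCosine, its method names, and the fixed n-gram length are assumptions for illustration, not existing code; orthogonalization is not covered.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration only.
class CharNgramCosine {

    // Count character n-grams of fixed length n from a Reader. Decoding is
    // left to Java, e.g. by wrapping a file in an InputStreamReader with an
    // explicit charset. Note: read() yields UTF-16 code units, so characters
    // outside the Basic Multilingual Plane are split into two units here.
    static Map<String, Integer> count(Reader in, int n) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        StringBuilder window = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            window.append((char) c);
            if (window.length() > n) {
                window.deleteCharAt(0); // slide the window forward by one char
            }
            if (window.length() == n) {
                counts.merge(window.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine of the angle between two n-gram count vectors.
    // Returns 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    // Euclidean length of a count vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> a = count(new StringReader("the cat sat on the mat"), 3);
        Map<String, Integer> b = count(new StringReader("the dog sat on the log"), 3);
        System.out.println(cosine(a, b));
    }
}
```

A cosine near 1 means the two texts have very similar n-gram distributions; a value of 0 means they share no n-grams at all.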

I could spend some time on this, but I'd need help, because it is not my
only pastime.


> -----Original Message-----
> From: karl wettin [] 
> Sent: Sunday, February 01, 2004 10:07 PM
> To:
> Subject: N-gram layer
> Hello list,
> I'm Karl, and I just started testing Lucene the other day.
> It's a great core engine, but I feel there are some things
> missing that I'd be happy to contribute.
> I started by writing a simple N-gram classifier to detect
> the language of a text in order to automatically cluster
> documents by language. The
> algorithm is very similar to the "TextCat" C library.
> And then I thought, maybe it would be possible to use the same N-gram
> classifier to make an automatic stemmer that works on all languages.
> Hopefully I'll have something up and running for tests by
> next weekend.
> The same classifier could be used for a simple metaphone index.
> However, I need some help on understanding the Analyzer. 
> Where can I find some tutorials on how to write my own? I 
> didn't check with Google, maybe I should before posting here. 
> Since the stemmer (and metaphone) data would have to be
> indexed in their own field(?), querying the stemmed field would
> require one to stem the query too. Can I create a subclass of
> Query (or so), or do I need to create my own Query class that
> handles the stemming all the way for the user? The last
> option is my current approach, so I would appreciate some
> hints and pointers here.
> Great project! 
> karl
