lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wettin <>
Subject Re: Using Lucene for a classification problem
Date Tue, 19 May 2009 12:38:51 GMT
Hi Jeetu,

wether or not it makes sense to use Lucene as your data matrix depends  
a bit on your requirements. There is a Bayesian classifier available  
in the issue tracker < 
LUCENE-1039> that might be helpful, although it does need a little bit  
of refactoring in order to handle more than one field as the class  

The biggest problem with naive classifiers (according to me) is the  
speed on a large data set. If this is a problem for you and your data  
set is not way to large then InstantiatedIndex might be a good fit.  
And if that is not enough I would take a look at libSVM. You could  
also take a look at Weka that contains quite a few compilable  
classifiers available. The problem with Weka is that your data set is  
rather limited to amount of RAM in your computer, while using a naive  
classifier on top of a Lucene index allows for very large data set.  
You could of course also use Weka in order to do some feature  
selection and then only use the output when using your naive  
classifier that access Lucene. It would speed things up and you can  
recalculate the feature selection at any time if your data set changes.

You should also check out Apache Mahout, < 

I hope this helps.


19 maj 2009 kl. 02.55 skrev Jeetendra Mirchandani:

> Hi Lucene users,
> This might seem a little vague to people just using lucene. I am  
> trying to
> see if I can use lucene for my specific problem
> I am trying to build a classification solution, where in I need to  
> index
> each *structured* document into its category in training phase, and  
> lookup a
> suitable category for a document on runtime.
> I have a naive algorithm ready, that generates TFIDF vectors from the
> document, with custom boost values for each field in the document, and
> computes cosine similarity on the fly for the document to be  
> classified.
> My problem:
> - Do this classification in 5 different languages
> - The target categories are not large, so I dont necessarily need an
> inverted index, but it does not gurt
> Where does Lucene fit in?
> - Lucene gives me standard interface to process various languages
> (Tokenizers/Analyzers under org.apache.lucene.analysis)
> - Lucene gives me persistence of my index over the corpus
> I want to decide in betwen following two approaches -
> 1. Use lucene directly, and build my algorithm over it
> 2. Just use the language specific classes from lucene , and continue  
> to
> build on my algorithm
> Am sure many of you might have hit this scenario. What do you guys
> recommend?
> Regards,
> Jeetu
> ps: I am not on the list, so please cc me on the replies

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message