Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Cc: Jeetendra Mirchandani <jeetum@gmail.com>
Message-Id: <C38B2083-BFDC-4929-96CE-B1B2538684D6@gmail.com>
From: Karl Wettin <karl.wettin@gmail.com>
To: java-user@lucene.apache.org
In-Reply-To: <ac9da6b30905181755i40836149s6a05e70b97e6cfd7@mail.gmail.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v930.3)
Subject: Re: Using Lucene for a classification problem
Date: Tue, 19 May 2009 14:38:51 +0200
References: <ac9da6b30905181755i40836149s6a05e70b97e6cfd7@mail.gmail.com>

Hi Jeetu,

Whether or not it makes sense to use Lucene as your data matrix depends
a bit on your requirements. There is a Bayesian classifier available
in the issue tracker <http://issues.apache.org/jira/browse/LUCENE-1039>
that might be helpful, although it needs a little refactoring in order
to handle more than one field as the class value.
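
To make concrete what a naive Bayesian classifier computes, here is a
minimal sketch in plain Java. It is not the code attached to
LUCENE-1039 and it does not use any Lucene API: the class name, the
method names and the in-memory maps are all made up for illustration,
and the maps merely stand in for term statistics that could be read
from an index. It assumes the documents have already been tokenized
(for example by a Lucene Analyzer).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a multinomial naive Bayes classifier.
// The maps below are stand-ins for counts an index could provide.
class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> totalTerms = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Adds one tokenized training document to the given category.
    void train(String category, List<String> tokens) {
        Map<String, Integer> counts =
                termCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            totalTerms.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
    }

    // Returns the category with the highest log-probability,
    // or null if nothing has been trained yet.
    String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            // Log prior: share of training documents in this category.
            double score = Math.log(docCounts.get(category) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(category);
            int total = totalTerms.getOrDefault(category, 0);
            // Add-one smoothing, with one extra slot for unseen terms.
            double denominator = total + vocabulary.size() + 1;
            for (String token : tokens) {
                int count = counts.getOrDefault(token, 0);
                score += Math.log((count + 1.0) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("sports", Arrays.asList("goal", "match", "team", "goal"));
        nb.train("politics", Arrays.asList("vote", "election", "party", "vote"));
        System.out.println(nb.classify(Arrays.asList("goal", "team")));
    }
}
```

Working in log space avoids underflow when a document has many
terms. Per-field boosts are left out of this sketch.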

The biggest problem with naive classifiers (in my opinion) is their
speed on a large data set. If this is a problem for you and your data
set is not way too large, then InstantiatedIndex might be a good fit.
If that is not enough, I would take a look at libSVM. You could
also take a look at Weka, which contains quite a few classifiers.
The problem with Weka is that the size of your data set is limited by
the amount of RAM in your computer, while a naive classifier on top of
a Lucene index allows for very large data sets.
You could of course also use Weka to do some feature selection and
then use only its output (the selected features) in the naive
classifier that accesses Lucene. That would speed things up, and you
can recalculate the feature selection at any time if your data set
changes.
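
As a rough illustration of the feature selection step, here is a
sketch in plain Java. It does not use Weka's API: it swaps in a
simple chi-square ranking of terms against one category, which is
only one of several possible selection criteria. The class and
method names are hypothetical, and each document is assumed to be
available as a set of tokens.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: rank terms by chi-square against one category.
class FeatureSelectionSketch {

    // Chi-square statistic for a 2x2 term/category table.
    // a: docs in the category that contain the term
    // b: docs outside the category that contain the term
    // c: docs in the category without the term
    // d: docs outside the category without the term
    static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denom = (double) (a + b) * (c + d) * (a + c) * (b + d);
        if (denom == 0) {
            return 0.0; // term occurs in all docs or none: no information
        }
        double diff = (double) a * d - (double) b * c;
        return n * diff * diff / denom;
    }

    // docs and labels are parallel lists; returns the k best-scoring terms.
    static List<String> topTerms(List<Set<String>> docs, List<String> labels,
                                 String category, int k) {
        // term -> {document frequency inside category, outside category}
        Map<String, long[]> df = new HashMap<>();
        long inCategory = 0;
        for (int i = 0; i < docs.size(); i++) {
            boolean in = labels.get(i).equals(category);
            if (in) {
                inCategory++;
            }
            for (String term : docs.get(i)) {
                df.computeIfAbsent(term, t -> new long[2])[in ? 0 : 1]++;
            }
        }
        long outCategory = docs.size() - inCategory;

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, long[]> e : df.entrySet()) {
            long a = e.getValue()[0];
            long b = e.getValue()[1];
            scores.put(e.getKey(), chiSquare(a, b, inCategory - a, outCategory - b));
        }

        // Highest score first; ties are broken alphabetically.
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort(Comparator.comparingDouble((String t) -> -scores.get(t))
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

Note that chi-square is symmetric: a term that strongly indicates a
document is *not* in the category scores as high as one that
indicates it is. Terms that occur in every document score zero and
drop out, which is where the speed-up comes from.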

You should also check out Apache Mahout, <http://lucene.apache.org/mahout>.

I hope this helps.


       karl

On 19 May 2009, at 02:55, Jeetendra Mirchandani wrote:

> Hi Lucene users,
>
> This might seem a little vague to people just using Lucene. I am trying to
> see if I can use Lucene for my specific problem.
>
> I am trying to build a classification solution, wherein I need to index
> each *structured* document into its category in the training phase, and
> look up a suitable category for a document at runtime.
>
> I have a naive algorithm ready, that generates TFIDF vectors from the
> document, with custom boost values for each field in the document, and
> computes cosine similarity on the fly for the document to be classified.
>
> My problem:
> - Do this classification in 5 different languages
> - The target categories are not large, so I don't necessarily need an
> inverted index, but it does not hurt
>
> Where does Lucene fit in?
>
> - Lucene gives me standard interface to process various languages
> (Tokenizers/Analyzers under org.apache.lucene.analysis)
> - Lucene gives me persistence of my index over the corpus
>
> I want to decide between the following two approaches:
> 1. Use Lucene directly, and build my algorithm over it
> 2. Just use the language-specific classes from Lucene, and continue to
> build on my algorithm
>
> Am sure many of you might have hit this scenario. What do you guys
> recommend?
>
> Regards,
> Jeetu
>
> ps: I am not on the list, so please cc me on the replies

