From: Karl Wettin
To: java-user@lucene.apache.org
Cc: Jeetendra Mirchandani
Subject: Re: Using Lucene for a classification problem
Date: Tue, 19 May 2009 14:38:51 +0200

Hi Jeetu,

whether or not it makes sense to use Lucene as your data matrix depends
a bit on your requirements.

There is a Bayesian classifier available in the issue tracker that might
be helpful, although it does need a little bit of refactoring in order
to handle more than one field as the class value.

The biggest problem with naive classifiers (in my opinion) is the speed
on a large data set. If this is a problem for you and your data set is
not way too large, then InstantiatedIndex might be a good fit. And if
that is not enough, I would take a look at libSVM.

You could also take a look at Weka, which contains quite a few
compilable classifiers. The problem with Weka is that the size of your
data set is limited by the amount of RAM in your computer, while using a
naive classifier on top of a Lucene index allows for very large data
sets.
You could of course also use Weka to do some feature selection and then
use only its output in your naive classifier that accesses Lucene. It
would speed things up, and you can recalculate the feature selection at
any time if your data set changes.

You should also check out Apache Mahout.

I hope this helps.


      karl

On 19 May 2009 at 02:55, Jeetendra Mirchandani wrote:

> Hi Lucene users,
>
> This might seem a little vague to people just using Lucene. I am trying
> to see if I can use Lucene for my specific problem.
>
> I am trying to build a classification solution, wherein I need to index
> each *structured* document into its category in the training phase, and
> look up a suitable category for a document at runtime.
>
> I have a naive algorithm ready that generates TF-IDF vectors from the
> document, with custom boost values for each field in the document, and
> computes cosine similarity on the fly for the document to be classified.
>
> My problem:
> - Do this classification in 5 different languages
> - The target categories are not large, so I don't necessarily need an
>   inverted index, but it does not hurt
>
> Where does Lucene fit in?
>
> - Lucene gives me a standard interface to process various languages
>   (Tokenizers/Analyzers under org.apache.lucene.analysis)
> - Lucene gives me persistence of my index over the corpus
>
> I want to decide between the following two approaches:
> 1. Use Lucene directly, and build my algorithm over it
> 2. Just use the language-specific classes from Lucene, and continue to
>    build on my algorithm
>
> I am sure many of you might have hit this scenario. What do you guys
> recommend?
>
> Regards,
> Jeetu
>
> ps: I am not on the list, so please cc me on the replies
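
For readers of the archive, the approach described in the quoted question (TF-IDF vectors with per-field boosts, compared by cosine similarity) can be sketched roughly as below. This is only an illustration under stated assumptions, not Jeetu's actual algorithm and not Lucene code: the field names, boost values, training data, the whitespace tokenizer (a stand-in for a language-specific Analyzer), the smoothed IDF formula, and the choice to keep one summed vector per category are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical per-field boosts; the original message does not give values.
FIELD_BOOSTS = {"title": 2.0, "body": 1.0}


def tokenize(text):
    # Stand-in for a language-specific analyzer: lowercase, split on whitespace.
    return text.lower().split()


def boosted_term_counts(doc):
    """Sum term frequencies over all fields, weighting each field by its boost."""
    counts = Counter()
    for field, text in doc.items():
        boost = FIELD_BOOSTS.get(field, 1.0)
        for term in tokenize(text):
            counts[term] += boost
    return counts


def train(labelled_docs):
    """Build one TF-IDF vector per category from (category, document) pairs."""
    category_counts = defaultdict(Counter)
    doc_freq = Counter()
    for category, doc in labelled_docs:
        counts = boosted_term_counts(doc)
        category_counts[category].update(counts)  # adds the boosted counts
        doc_freq.update(counts.keys())            # one increment per document
    n_docs = len(labelled_docs)
    # Smoothed IDF (an assumption) so terms found in every document stay positive.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {c: {t: tf * idf[t] for t, tf in counts.items()}
               for c, counts in category_counts.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc, vectors, idf):
    """Return the category whose vector is most similar to the document."""
    # Terms never seen in training have no IDF weight and are ignored.
    query = {t: tf * idf[t] for t, tf in boosted_term_counts(doc).items() if t in idf}
    return max(vectors, key=lambda c: cosine(query, vectors[c]))


# Made-up training data, purely for illustration.
TRAINING = [
    ("sports", {"title": "football match", "body": "the team won the match"}),
    ("sports", {"title": "tennis final", "body": "the player won the final set"}),
    ("tech", {"title": "search engine", "body": "the index stores terms and documents"}),
    ("tech", {"title": "text analysis", "body": "the analyzer splits text into terms"}),
]

vectors, idf = train(TRAINING)
print(classify({"title": "football final", "body": "the team won"}, vectors, idf))
```

With only a handful of categories, as in the question, everything fits in a few in-memory dictionaries. An inverted index such as Lucene's becomes more attractive once the training corpus no longer fits comfortably in RAM, which is the trade-off discussed in the reply above.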