Date: Tue, 4 Sep 2012 11:01:08 +1100 (NCT)
From: "Lance Norskog (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Commented] (LUCENE-4345) Create a Classification module

    [ https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447429#comment-13447429 ]

Lance Norskog commented on LUCENE-4345:
---------------------------------------

bq. would make training slower but it could be useful to improve accuracy

If you use index data which is already analyzed with the same analyzer as your test (unseen) documents, you can use a lot more documents as input. More is better. As the training data increases, signal drives out noise. Once you add the ability to store & load models, training speed becomes less important.

Look at the Mahout project for ideas about text classifiers. The ConfusionMatrix class and the html page it prints are really handy for summarizing and probing the classifier's performance.

> Create a Classification module
> ------------------------------
>
>                 Key: LUCENE-4345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4345
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in fields so that these can be used as training examples (w/ features) in order to very quickly create classifiers algorithms to use on new documents and / or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a ClassificationComponent that will use already seen data (the indexed documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes classifier but more implementations should be added in the future.
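
To illustrate the confusion-matrix suggestion in the comment above, here is a minimal sketch in plain Java. It is not Mahout's ConfusionMatrix class and does not use any Lucene API; the class name SimpleConfusionMatrix, its methods, and the example labels are all hypothetical. It only counts (actual label, predicted label) pairs and reports overall accuracy, which is the core of what such a summary shows.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical minimal confusion matrix: rows are actual labels, columns are predicted labels. */
class SimpleConfusionMatrix {
    // Maps each label to its row/column position, in insertion order.
    private final Map<String, Integer> index = new LinkedHashMap<>();
    private final int[][] counts;

    SimpleConfusionMatrix(List<String> labels) {
        for (String label : labels) {
            index.putIfAbsent(label, index.size());
        }
        counts = new int[index.size()][index.size()];
    }

    /** Records one classified document. */
    void add(String actual, String predicted) {
        counts[indexOf(actual)][indexOf(predicted)]++;
    }

    /** Number of documents with the given actual label that received the given predicted label. */
    int count(String actual, String predicted) {
        return counts[indexOf(actual)][indexOf(predicted)];
    }

    /** Fraction of recorded documents whose predicted label equals the actual label. */
    double accuracy() {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < counts.length; i++) {
            for (int j = 0; j < counts.length; j++) {
                total += counts[i][j];
                if (i == j) {
                    correct += counts[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    private int indexOf(String label) {
        Integer i = index.get(label);
        if (i == null) {
            throw new IllegalArgumentException("Unknown label: " + label);
        }
        return i;
    }

    /** Tab-separated table: header row of predicted labels, then one row per actual label. */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("actual\\predicted");
        for (String label : index.keySet()) {
            sb.append('\t').append(label);
        }
        for (Map.Entry<String, Integer> row : index.entrySet()) {
            sb.append('\n').append(row.getKey());
            for (int j = 0; j < counts.length; j++) {
                sb.append('\t').append(counts[row.getValue()][j]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up labels and predictions, for illustration only.
        SimpleConfusionMatrix m = new SimpleConfusionMatrix(List.of("sports", "politics"));
        m.add("sports", "sports");
        m.add("sports", "politics");
        m.add("politics", "politics");
        m.add("politics", "politics");
        System.out.println(m);
        System.out.println("accuracy = " + m.accuracy());
    }
}
```

The off-diagonal cells are the interesting part when probing a classifier: they show which classes get confused with which, something a single accuracy number hides.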