lucene-dev mailing list archives

From "Tommaso Teofili (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4345) Create a Classification module
Date Mon, 03 Sep 2012 12:13:07 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447243#comment-13447243 ]

Tommaso Teofili commented on LUCENE-4345:
-----------------------------------------

bq. So we could consider performance-driven heuristics/approximations like MoreLikeThis does
based on things like local term frequency within the document/term length, whatever to save
on docFreq() calls, if it makes sense (i have to look at the formula in more detail here).

The generic formula is _C = argmax( P(doc|class) * P(class) )_. I agree it makes sense to
incrementally look for good heuristics / approximations that lower the computational
cost of this calculation.
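
To make the formula concrete, here is a minimal sketch of the argmax in plain Java. It is not the code from the patch: the class and method names are hypothetical, the in-memory maps stand in for statistics that would be read from the index, and both the log-space sum and the add-one smoothing are assumptions added for numerical safety. Under the usual Naive Bayes independence assumption, P(doc|class) is the product of P(term|class) over the document's terms.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: in-memory maps stand in for index statistics.
class NaiveBayesSketch {

    // classDocCounts: class -> number of training documents in that class
    // termCounts:     class -> (term -> occurrences of the term in that class)
    // vocabularySize: number of distinct terms, used for add-one smoothing (assumption)
    static String classify(List<String> tokens,
                           Map<String, Integer> classDocCounts,
                           Map<String, Map<String, Integer>> termCounts,
                           int vocabularySize) {
        int totalDocs = 0;
        for (int c : classDocCounts.values()) {
            totalDocs += c;
        }

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> e : classDocCounts.entrySet()) {
            String cls = e.getKey();
            Map<String, Integer> counts =
                termCounts.getOrDefault(cls, Collections.<String, Integer>emptyMap());
            long totalTerms = 0;
            for (int n : counts.values()) {
                totalTerms += n;
            }

            // log P(class)
            double score = Math.log((double) e.getValue() / totalDocs);
            // + sum of log P(term|class), with add-one smoothing
            for (String t : tokens) {
                int n = counts.getOrDefault(t, 0);
                score += Math.log((n + 1.0) / (totalTerms + vocabularySize));
            }

            if (score > bestScore) {
                bestScore = score;
                best = cls;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> docs = new HashMap<>();
        docs.put("sports", 2);
        docs.put("tech", 2);

        Map<String, Integer> sports = new HashMap<>();
        sports.put("ball", 3);
        sports.put("goal", 2);
        Map<String, Integer> tech = new HashMap<>();
        tech.put("code", 4);
        tech.put("index", 1);

        Map<String, Map<String, Integer>> terms = new HashMap<>();
        terms.put("sports", sports);
        terms.put("tech", tech);

        System.out.println(classify(Arrays.asList("ball", "goal"), docs, terms, 4));
    }
}
```

The inner loop over tokens is where the per-term cost accumulates, so that is the part any heuristic or approximation would target.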

bq. the current code, given a word that appears many times in the document, will do many computations
when instead we could really just work across the unique terms within the document.

Another good point where we can improve, thanks :)
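
The unique-terms idea can be sketched as follows, again in plain Java rather than the patch's code. The names are hypothetical, and `logTermProb` is an assumed stand-in for whatever per-term lookup is expensive (for example one backed by docFreq()). Because the log-likelihood is a sum over tokens, a term that occurs n times contributes n times the same value, so it only has to be looked up once.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch: score a document by iterating its unique terms only.
class UniqueTermScoring {

    // Collapse the token stream into term -> frequency within the document.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // logTermProb stands in for the expensive per-term computation (assumption).
    // It is invoked once per unique term; the result is weighted by the term's frequency.
    static double logLikelihood(List<String> tokens, ToDoubleFunction<String> logTermProb) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : termFrequencies(tokens).entrySet()) {
            sum += e.getValue() * logTermProb.applyAsDouble(e.getKey());
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        List<String> tokens = Arrays.asList("a", "b", "a", "a");
        double score = logLikelihood(tokens, term -> {
            calls[0]++;
            return term.equals("a") ? -1.0 : -2.0;
        });
        System.out.println("score=" + score + ", lookups=" + calls[0]);
    }
}
```

With four tokens and two distinct terms, the lookup runs twice instead of four times; the saving grows with how repetitive the document is.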

I managed to remove all the Maps from the code; I'll attach the patch shortly. I'll then work
on removing the tokenizeDoc() loop.
                
> Create a Classification module
> ------------------------------
>
>                 Key: LUCENE-4345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4345
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345.patch, SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in fields,
> so these can be used as training examples (w/ features) to very quickly create
> classifier algorithms to use on new documents and / or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a ClassificationComponent
> that will use already seen data (the indexed documents / fields) to classify new documents
> / text fragments.
> The first version will contain a (simplistic) Lucene-based Naive Bayes classifier, but
> more implementations should be added in the future.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

