lucene-dev mailing list archives

From "Tommaso Teofili (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4345) Create a Classification module
Date Fri, 31 Aug 2012 12:25:07 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445867#comment-13445867 ]

Tommaso Teofili commented on LUCENE-4345:
-----------------------------------------

bq. docsWithClassSize should ideally be terms.getDocCount() for the field as well rather than
maxDoc.

Yep, the early assumption here was that every doc has a value for the class field, but
your suggestion is good.

bq. docCount() should not do a search, instead I think it should just return IR.docFreq(term)
?

correct
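
To make the two points above concrete, here is a minimal sketch of the prior computation, assuming plain maps stand in for the index statistics. The class and parameter names (PriorSketch, docFreqByClass, docsWithClassField) are hypothetical and not taken from the patch; in Lucene the numerator would roughly correspond to the docFreq of the class term and the denominator to getDocCount() on the class field's Terms.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: plain maps stand in for index statistics. */
final class PriorSketch {

    /**
     * Prior of a class = (docs carrying that class value)
     *                  / (docs having any value in the class field).
     * Assumption: docFreqByClass mirrors what docFreq would report per class
     * term, and docsWithClassField mirrors the field-level doc count.
     */
    static double prior(Map<String, Integer> docFreqByClass,
                        int docsWithClassField,
                        String className) {
        if (docsWithClassField <= 0) {
            return 0.0; // no labelled docs, so no evidence for any class
        }
        int docFreq = docFreqByClass.getOrDefault(className, 0);
        return (double) docFreq / docsWithClassField;
    }
}
```

For example, with 30 docs labelled "sports" out of 40 docs that have a class value, the prior is 30/40 = 0.75. If the index held 50 docs in total and maxDoc were used as the denominator, the same class would get 30/50 = 0.6, which is the skew the suggestion avoids.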

bq. it seems we dont need this classCount map at all, nor the priors map?

Yes and no: having the priors map slows down the training phase (each time it needs to recompute
the priors for all the classes), but it speeds up the classification of unseen text (it's
a cache in the end). As for the classCount, I agree with you that it could easily be replaced (with
TermsEnum).
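
One possible middle ground, sketched here under assumptions and not what the patch does: fill the priors cache lazily at classification time and clear it when the index changes, so training does not pay for recomputing every prior. LazyPriorCache and its computePrior callback are hypothetical names; the callback stands in for whatever index lookup produces a prior.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Illustrative only: a lazily filled priors cache. */
final class LazyPriorCache {
    private final Map<String, Double> cache = new HashMap<>();
    // Hypothetical callback standing in for the index lookup of a prior.
    private final ToDoubleFunction<String> computePrior;
    // Counts real computations; exposed only to make cache hits visible.
    int computations = 0;

    LazyPriorCache(ToDoubleFunction<String> computePrior) {
        this.computePrior = computePrior;
    }

    /** Returns the cached prior, computing it on first use only. */
    double prior(String className) {
        Double cached = cache.get(className);
        if (cached == null) {
            computations++;
            cached = computePrior.applyAsDouble(className);
            cache.put(className, cached);
        }
        return cached;
    }

    /** Call after new training docs are added so stale priors are dropped. */
    void invalidate() {
        cache.clear();
    }
}
```

The trade-off moves rather than disappears: the first classification after each invalidate() pays the lookup cost for every class it touches.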

bq. Instead we would just tokenize each doc a single time, and compute the prior of the terms
we find on the fly (it seems to just be

You mean because the likelihood calculation tokenizes the same doc multiple times (|terms
in the class field| times), right? That would surely be good; I'll work on improving that.
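
A rough sketch of the single-tokenization idea, assuming whitespace splitting stands in for the Analyzer and in-memory maps stand in for the per-class term statistics. SinglePassScoring, termCountsByClass and vocabularySize are hypothetical names, and the add-one smoothing is an assumption of this sketch, not something stated in the thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: tokenize once, then score every class from the tokens. */
final class SinglePassScoring {
    // Counts tokenizer invocations to show the text is analyzed only once.
    int tokenizeCalls = 0;

    /** Stand-in for an Analyzer: lowercases and splits on whitespace. */
    List<String> tokenize(String text) {
        tokenizeCalls++;
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns log P(c) + sum over tokens of log P(w|c), add-one smoothed. */
    Map<String, Double> score(String text,
                              Map<String, Double> priors,
                              Map<String, Map<String, Integer>> termCountsByClass,
                              int vocabularySize) {
        List<String> tokens = tokenize(text); // once, outside the class loop
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> entry : priors.entrySet()) {
            Map<String, Integer> counts = termCountsByClass.getOrDefault(
                    entry.getKey(), Collections.<String, Integer>emptyMap());
            long totalTerms = 0;
            for (int n : counts.values()) {
                totalTerms += n;
            }
            double logScore = Math.log(entry.getValue());
            for (String token : tokens) {
                int n = counts.getOrDefault(token, 0);
                logScore += Math.log((n + 1.0) / (totalTerms + vocabularySize));
            }
            scores.put(entry.getKey(), logScore);
        }
        return scores;
    }
}
```

The tokenizer runs once per input text regardless of how many classes exist; only the cheap per-class lookups are repeated.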

Thanks Robert :)

                
> Create a Classification module
> ------------------------------
>
>                 Key: LUCENE-4345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4345
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345.patch, SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in fields
> so that these can be used as training examples (w/ features) in order to very quickly create
> classifier algorithms to use on new documents and / or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a ClassificationComponent
> that will use already seen data (the indexed documents / fields) to classify new documents
> / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes classifier but
> more implementations should be added in the future.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

