lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4345) Create a Classification module
Date Fri, 31 Aug 2012 10:44:07 GMT


Robert Muir commented on LUCENE-4345:

docsWithClassSize should ideally be terms.getDocCount() for the field as well
rather than maxDoc.

docCount() should not do a search; instead I think it should just return IR.docFreq(term).

One more piece: if classCount is just a Map<UniqueValues,DocFreq>,
it would be a lot better to just compute this with a TermsEnum,
just iterating over the terms for the field.

It seems the "value" part is not used, so for now it could be
just a hashset as well?

This would remove the stored fields loop (replacing it with a termsenum
loop), but I think we can probably remove the loop entirely too as
a second step.

I don't like that assignClass has a loop over all possible terms in the
field, re-tokenizing the doc for each one! 

It seems we don't need this classCount map at all, nor the priors map?

Instead we would just tokenize each doc a single time, and compute the prior of the terms
we find on the fly (it seems to just be IDF anyway really).

And we wouldn't need any maps for that.
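
To make the single-pass idea concrete, here is a rough sketch in plain Java.
It is not the patch's code and does not use Lucene's API: DOC_FREQ and
DOC_COUNT are hypothetical stand-ins for IR.docFreq(term) and
terms.getDocCount(), the tokenizer is a naive regex split, and
log(docCount / docFreq) is only an assumed IDF-like weight. The point is that
the text is tokenized once and each term's statistic is looked up as it is
seen, with no precomputed classCount or priors map.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class SinglePassSketch {

    /** Hypothetical stand-in for index statistics: term -> docs containing it. */
    static final Map<String, Integer> DOC_FREQ = new HashMap<>();

    /** Hypothetical stand-in for the field's document count. */
    static final int DOC_COUNT = 8;

    static {
        DOC_FREQ.put("lucene", 4);
        DOC_FREQ.put("search", 8);
        DOC_FREQ.put("bayes", 1);
    }

    /** Tokenizes the text once and weights each known term as it is seen. */
    static Map<String, Double> weighTerms(String text) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            Integer df = DOC_FREQ.get(token);
            if (df == null) {
                continue; // term not in the (stand-in) index, nothing to weigh
            }
            // Assumed IDF-like weight, computed on the fly from docFreq.
            weights.put(token, Math.log((double) DOC_COUNT / df));
        }
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(weighTerms("Lucene search with naive Bayes"));
    }
}
```

In this sketch a term that appears in every document ("search") gets weight
zero, and rarer terms get larger weights. In a real implementation the lookup
would go against the index rather than an in-memory map.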

> Create a Classification module
> ------------------------------
>                 Key: LUCENE-4345
>                 URL:
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345.patch, SOLR-3700_2.patch, SOLR-3700.patch
> Lucene/Solr can host huge sets of documents containing lots of information in fields
> so that these can be used as training examples (w/ features) in order to very quickly create
> classifiers algorithms to use on new documents and / or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a ClassificationComponent
> that will use already seen data (the indexed documents / fields) to classify new documents
> / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes classifier but
> more implementations should be added in the future.
