Thanks Robert, 

2012/8/10 Robert Muir <>
I got the patch before JIRA was down, and just saw another thing:

+  private double countInClassC(String c) throws IOException {
+    TopDocs topDocs = TermQuery(new
Term(classFieldName, c)), Integer.MAX_VALUE);
+    int res = 0;
+    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+      Fields termVectors =
+      if (termVectors != null) {
+        res += termVectors.terms(textFieldName).size();
+      } else {
+        // TODO : warn about not existing term vectors for field
+      }
+    }
+    return res;
+  }

For this part, I am unsure what statistic you are aiming for:

It seems that it currently takes all documents that have term c in
field classFieldName, and sums the number of unique terms each doc has
in field classFieldName?

yes, it is.

If this is really what you want and you need 100% exact numbers, then,
just like the other computation, I would not do a search with a PQ of
Integer.MAX_VALUE, but instead just iterate over a DocsEnum for that

I noticed that just after I submitted the patch but then Jira was down again :-)

But if a good approximation is ok, I would do this, which is instant
and needs no term vectors:

    Terms terms = MultiFields.getTerms(reader, classFieldName);
    long numPostings = terms.getSumDocFreq(); // number of term/doc pairs
    // avg # of unique terms per doc
    double avgNumberOfUniqueTerms = numPostings / (double) terms.getDocCount();
    // avg # of unique terms per doc * # docs with c
    return avgNumberOfUniqueTerms * reader.docFreq(new Term(classFieldName, c));
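
To make the difference between the exact sum and the approximation concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class UniqueTermEstimate, the Doc type and the sample documents are hypothetical stand-ins for an index; the three counters only mimic what getSumDocFreq(), getDocCount() and docFreq() would report for the field being measured.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueTermEstimate {

  /** Hypothetical stand-in for an indexed doc: a class label plus its unique terms. */
  static final class Doc {
    final String label;
    final Set<String> uniqueTerms;

    Doc(String label, String... terms) {
      this.label = label;
      this.uniqueTerms = new HashSet<>(Arrays.asList(terms));
    }
  }

  /** Exact statistic: sum of unique-term counts over the docs labelled c. */
  static double exact(List<Doc> docs, String c) {
    long sum = 0;
    for (Doc d : docs) {
      if (d.label.equals(c)) {
        sum += d.uniqueTerms.size();
      }
    }
    return sum;
  }

  /** Approximation: (term/doc pairs / docs having the field) * # docs labelled c. */
  static double approx(List<Doc> docs, String c) {
    long numPostings = 0; // plays the role of getSumDocFreq()
    long docCount = 0;    // plays the role of getDocCount()
    long docsWithC = 0;   // plays the role of docFreq() for the class term
    for (Doc d : docs) {
      numPostings += d.uniqueTerms.size();
      if (!d.uniqueTerms.isEmpty()) {
        docCount++;
      }
      if (d.label.equals(c)) {
        docsWithC++;
      }
    }
    if (docCount == 0) {
      return 0.0; // no doc has the field, so there is nothing to average
    }
    return (numPostings / (double) docCount) * docsWithC;
  }

  public static void main(String[] args) {
    List<Doc> docs = Arrays.asList(
        new Doc("sports", "goal", "match", "team"),
        new Doc("sports", "team", "win"),
        new Doc("politics", "vote", "law", "party", "state"));
    System.out.println(exact(docs, "sports"));  // 5.0
    System.out.println(approx(docs, "sports")); // 6.0
  }
}
```

In this toy data the approximation overshoots (6.0 against 5.0) because the average is taken over all documents, not only those in class c; the two agree only when documents in c are about as long as the rest.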

This may be good (and much more performant); I'll give it a try, thanks :-)
The NB classifier there is very simplistic and could be much improved (or at least given parameters / options).
Apart from that, a kNN / MoreLikeThis-based classifier should be fairly easy to add.
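
As a rough illustration of the kNN idea only (not of the patch or of MoreLikeThis itself), the sketch below classifies a bag of terms by majority vote among its k most similar training examples. Everything here is an assumption for illustration: the KnnSketch and Example names, and the use of Jaccard overlap as a simple stand-in for the similarity a MoreLikeThis query would compute against the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class KnnSketch {

  /** Hypothetical training example: a class label plus the terms of its text. */
  static final class Example {
    final String label;
    final Set<String> terms;

    Example(String label, String... terms) {
      this.label = label;
      this.terms = new HashSet<>(Arrays.asList(terms));
    }
  }

  /** Jaccard overlap, used here as a stand-in for a real similarity score. */
  static double similarity(Set<String> a, Set<String> b) {
    Set<String> union = new HashSet<>(a);
    union.addAll(b);
    if (union.isEmpty()) {
      return 0.0;
    }
    Set<String> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    return intersection.size() / (double) union.size();
  }

  /** Returns the majority label among the k nearest examples, or null if there are none. */
  static String classify(List<Example> training, Set<String> query, int k) {
    List<Example> sorted = new ArrayList<>(training);
    // Most similar first; the sort is stable, so ties keep their training order.
    sorted.sort((x, y) ->
        Double.compare(similarity(y.terms, query), similarity(x.terms, query)));

    Map<String, Integer> votes = new HashMap<>();
    String best = null;
    int bestVotes = 0;
    int limit = Math.max(0, Math.min(k, sorted.size()));
    for (Example e : sorted.subList(0, limit)) {
      int v = votes.merge(e.label, 1, Integer::sum);
      if (v > bestVotes) { // on equal votes, the label seen first (nearer) is kept
        bestVotes = v;
        best = e.label;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<Example> training = Arrays.asList(
        new Example("sports", "goal", "match", "team"),
        new Example("sports", "team", "win"),
        new Example("politics", "vote", "law", "party"));
    Set<String> query = new HashSet<>(Arrays.asList("team", "goal"));
    System.out.println(classify(training, query, 2)); // sports
  }
}
```

In a Lucene-backed version the sort would presumably be replaced by a top-k search, so only the voting step would carry over unchanged.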


On Fri, Aug 10, 2012 at 8:36 AM, Tommaso Teofili (JIRA) <> wrote:
> Tommaso Teofili updated SOLR-3700:
> ----------------------------------
>     Attachment: SOLR-3700_2.patch
> new patch incorporating Robert's suggestions (plus added a couple more TODOs)
>> Create a Classification component
>> ---------------------------------
>>                 Key: SOLR-3700
>>                 URL:
>>             Project: Solr
>>          Issue Type: New Feature
>>            Reporter: Tommaso Teofili
>>            Priority: Minor
>>         Attachments: SOLR-3700.patch, SOLR-3700_2.patch
>> Lucene/Solr can host huge sets of documents containing lots of information in fields, so these can be used as training examples (w/ features) to very quickly create classifier algorithms to use on new documents and / or to provide an additional service.
>> So the idea is to create a contrib module (called 'classification') to host a ClassificationComponent that will use already seen data (the indexed documents / fields) to classify new documents / text fragments.
>> The first version will contain a (simplistic) Lucene based Naive Bayes classifier but more implementations should be added in the future.