lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4345) Create a Classification module
Date Fri, 31 Aug 2012 12:55:08 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445884#comment-13445884 ]

Robert Muir commented on LUCENE-4345:
-------------------------------------

{quote}
Yes and no: having the priors map slows the training phase (each time it needs to recompute
the priors for all the classes), but speeds up classification of unseen text (it's a cache
in the end). Regarding the classCount, I agree with you that it could easily be replaced
(with TermsEnum).
{quote}

My concern here is that if the number of terms is large, it's a lot of RAM too. We can see,
though: I think tokenizing the doc so many times today is actually the slowest part. But we
can move to TermsEnum as a step, just an iteration :)

{quote}
You mean because the likelihood calculation tokenizes the same doc multiple times (|terms
in the class field|), right? That would surely be good; I'll work on improving that.
{quote}

Exactly. Basically I was thinking that in the short term, let's remove the extra loop, as an
iteration.

Long term, I think we would not need the maps and could just call docFreq on the terms from
the term dictionary on the fly here. While this sounds like a lot of docFreq calls, I am not
so sure: it seems the larger formula is looking for a max() here?

So we could consider performance-driven heuristics/approximations, as MoreLikeThis does,
based on things like local term frequency within the document, term length, or whatever, to
save on docFreq() calls, if it makes sense (I have to look at the formula in more detail
here).
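
As a rough illustration of that kind of heuristic, here is a minimal sketch in plain Java.
It assumes the document's terms and their local frequencies are already available as a map.
The class name, method name, and thresholds are hypothetical and only loosely modeled on
MoreLikeThis's minimum term frequency and minimum word length settings; nothing here is
taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: decides which document terms are worth a docFreq() lookup.
final class DocFreqCandidates {

    // Keeps only terms that occur at least minTermFreq times in the document
    // and are at least minWordLen characters long. Both thresholds are
    // illustrative assumptions, not values from the patch.
    static List<String> select(Map<String, Integer> localFreqs,
                               int minTermFreq, int minWordLen) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> e : localFreqs.entrySet()) {
            boolean frequentEnough = e.getValue() >= minTermFreq;
            boolean longEnough = e.getKey().length() >= minWordLen;
            if (frequentEnough && longEnough) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Integer> localFreqs = new LinkedHashMap<>();
        localFreqs.put("lucene", 3);
        localFreqs.put("a", 5);
        localFreqs.put("classifier", 1);
        // Only the surviving terms would be passed on to docFreq().
        System.out.println(select(localFreqs, 2, 3));
    }
}
```

Whether dropping short or rare terms is acceptable depends on the formula; the sketch only
shows where the saving in docFreq() calls would come from.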

In that case, instead of consuming the tokenStream as an array, it probably makes more sense
to consume it into a Map<string,freq> so we have a little 'inverted index' for the doc. The
current code, given a word that appears many times in the document, will do many
computations, when instead we could really just work across the unique terms within the
document.
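
A minimal sketch of the idea, in plain Java rather than against the Lucene API: the tokens
are assumed to have been produced already (a whitespace split stands in for the analyzer's
TokenStream here), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds a per-document term -> frequency map.
final class DocTermFreqs {

    // Counts how often each token occurs, so later per-term work
    // (e.g. a docFreq lookup) runs once per unique term, not once per occurrence.
    static Map<String, Integer> termFrequencies(Iterable<String> tokens) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : tokens) {
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        // Stand-in for a real analysis chain: split on single spaces.
        List<String> tokens =
            Arrays.asList("lucene indexes text and lucene searches text".split(" "));
        Map<String, Integer> freqs = termFrequencies(tokens);
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            // Any expensive per-term computation would go here, weighted by e.getValue().
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

With seven tokens but only five unique terms in the example input, the loop body runs five
times; the saving grows with how repetitive the document is.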

                
> Create a Classification module
> ------------------------------
>
>                 Key: LUCENE-4345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4345
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345.patch, SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in fields,
> so that these can be used as training examples (w/ features) in order to very quickly
> create classifier algorithms to use on new documents and/or to provide an additional
> service.
> So the idea is to create a contrib module (called 'classification') to host a
> ClassificationComponent that will use already seen data (the indexed documents / fields)
> to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene-based Naive Bayes classifier, but
> more implementations should be added in the future.


