lucene-dev mailing list archives

From "Joel Bernstein (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-9252) Feature selection and logistic regression on text
Date Wed, 03 Aug 2016 15:42:20 GMT

    [ https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406088#comment-15406088 ]

Joel Bernstein edited comment on SOLR-9252 at 8/3/16 3:41 PM:
--------------------------------------------------------------

This looks great, thanks for adding this.

I've got a commit ready to push out that doesn't include this patch, but we can work it into
a follow-up commit.


was (Author: joel.bernstein):
This looks great, thanks for adding this.

I've got a  commit ready to push out that doesn't include this patch, but we can work it in
a follow-up commit.

> Feature selection and logistic regression on text
> -------------------------------------------------
>
>                 Key: SOLR-9252
>                 URL: https://issues.apache.org/jira/browse/SOLR-9252
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>            Assignee: Joel Bernstein
>         Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>
> SOLR-9186 raised the challenge that on each iteration we have to rebuild the tf-idf vector for every document. This is computationally costly when a document is represented by many terms. Feature selection can help reduce that computation.
> Due to its computational efficiency and simple interpretation, information gain is one of the most popular feature selection methods. It measures the dependence between features and labels by calculating the information gain between the i-th feature and the class labels (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
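To illustrate the idea (this is a toy sketch of information gain for a binary term-presence feature, not the patch's actual implementation; the function names are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """IG(feature) = H(labels) - H(labels | feature), treating the feature
    as binary: the term is either present in a document or not."""
    n = len(labels)
    with_f = [y for f, y in zip(feature_present, labels) if f]
    without_f = [y for f, y in zip(feature_present, labels) if not f]
    # Weighted conditional entropy; skip empty partitions to avoid 0/0.
    cond = sum((len(part) / n) * entropy(part)
               for part in (with_f, without_f) if part)
    return entropy(labels) - cond

# Toy example: a term that appears only in positive documents is maximally
# informative, so its information gain equals the label entropy (1 bit here).
print(information_gain([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

Ranking all terms by this score and keeping the top N is what lets the tf-idf vectors stay small.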
> I confirmed this by running logistic regression on the Enron email dataset (in which each email is represented by the 100 terms with the highest information gain) and got 92% accuracy and 82% precision.
> This ticket will create two new streaming expressions. Both of them use the same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*", field="tv_text", outcome="out_i", positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms with the highest information gain scores. It can be combined with the new tlogit stream.
> {code}
> tlogit(collection1, q="*:*",
>        featuresSelection(collection1,
>                          q="*:*",
>                          field="tv_text",
>                          outcome="out_i",
>                          positiveLabel=1,
>                          numTerms=100),
>        field="tv_text",
>        outcome="out_i",
>        maxIterations=100)
> {code}
> In iteration n, the text logistic regression will emit the nth model and compute the error of the (n-1)th model, because the error would be wrong if we computed it dynamically within the same iteration.
> In each iteration tlogit will adjust the learning rate based on the error of the previous iteration: it will increase the learning rate by 5% if the error is going down, and decrease it by 50% if the error is going up.
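The learning-rate schedule described above can be sketched as follows (a toy illustration of the +5%/-50% rule, not the actual tlogit code; the function name and error trace are hypothetical):

```python
def adapt_learning_rate(rate, prev_error, curr_error):
    """Grow the rate by 5% while the error is decreasing; cut it in
    half when the error increases; leave it unchanged otherwise."""
    if curr_error < prev_error:
        return rate * 1.05
    elif curr_error > prev_error:
        return rate * 0.5
    return rate

rate = 0.01
errors = [0.9, 0.7, 0.5, 0.6]  # toy per-iteration error trace
for prev, curr in zip(errors, errors[1:]):
    rate = adapt_learning_rate(rate, prev, curr)
# 0.01 * 1.05 * 1.05 * 0.5 = 0.0055125
print(rate)
```

The asymmetry (small increases, large decreases) lets the optimizer accelerate cautiously while backing off quickly when an iteration overshoots.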
> This will support use cases such as building models for spam detection, sentiment analysis, and threat detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

