lucene-dev mailing list archives

From "Jeroen Steggink (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-9252) Feature selection and logistic regression on text
Date Wed, 30 Nov 2016 15:11:59 GMT

    [ https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15708840#comment-15708840
] 

Jeroen Steggink edited comment on SOLR-9252 at 11/30/16 3:11 PM:
-----------------------------------------------------------------

This would be great, as the regularization makes the training way more useful.


was (Author: jeroens):
This would be great, as the regularization makes this the training way more useful.

> Feature selection and logistic regression on text
> -------------------------------------------------
>
>                 Key: SOLR-9252
>                 URL: https://issues.apache.org/jira/browse/SOLR-9252
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: search, SolrCloud, SolrJ
>            Reporter: Cao Manh Dat
>            Assignee: Joel Bernstein
>              Labels: Streaming
>             Fix For: 6.2
>
>         Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch,
SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch,
SOLR-9299-1.patch
>
>
> This ticket adds two new streaming expressions: *features* and *train*.
> These two functions work together to train a logistic regression model on text, from
a training set stored in a SolrCloud collection.
> The syntax is as follows:
> {code}
> train(collection1, q="*:*",
>       features(collection1, 
>                q="*:*",  
>                field="body", 
>                outcome="out_i", 
>                positiveLabel=1, 
>                numTerms=100),
>       field="body",
>       outcome="out_i",
>       maxIterations=100)
> {code}
> The *features* function extracts the feature terms from a training set using *information
gain* to score the terms. See http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf
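As a rough illustration of scoring a term by information gain, here is a minimal Python sketch. It assumes a binary outcome and a binary "term present / term absent" split per document; the function names are hypothetical and this is not taken from the Solr patch.

```python
import math


def entropy(pos, neg):
    """Binary entropy, in bits, of a set with `pos` positive and `neg` negative docs."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(pos_with, neg_with, pos_without, neg_without):
    """Reduction in outcome entropy from knowing whether the term is present.

    pos_with / neg_with: positive / negative docs that contain the term.
    pos_without / neg_without: positive / negative docs that do not.
    """
    n_with = pos_with + neg_with
    n_without = pos_without + neg_without
    n = n_with + n_without
    before = entropy(pos_with + pos_without, neg_with + neg_without)
    after = (n_with / n) * entropy(pos_with, neg_with) \
        + (n_without / n) * entropy(pos_without, neg_without)
    return before - after
```

A term that appears only in positive documents of a balanced set scores 1 bit, while a term spread evenly across both classes scores 0; keeping the highest-scoring terms corresponds to the role of the numTerms parameter.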
> The *train* function uses the extracted features to train a logistic regression model
on a text field in the training set.
> For both *features* and *train* the training set is defined by a query. The doc vectors
in the *train* function use tf-idf to represent the terms in the document. The idf is calculated
for the specific training set, allowing multiple training sets to be stored in the same collection
without polluting the idf. 
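The idea of computing idf over the training set only can be sketched as follows. This is an illustration under stated assumptions: the smoothed formula log(N / (df + 1)) and the use of raw term counts as tf are choices made here for the example, not confirmed details of the patch, and the helper names are hypothetical.

```python
import math
from collections import Counter


def training_set_idfs(docs, terms):
    """Compute idf per feature term using only the docs matched by the training query.

    docs: list of token lists, one per training document.
    terms: the feature terms selected earlier.
    The smoothing (df + 1) is an assumption made for this sketch.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        present = set(tokens)
        for term in terms:
            if term in present:
                df[term] += 1
    return {term: math.log(n / (df[term] + 1)) for term in terms}


def doc_vector(tokens, terms, idfs):
    """Represent one document as tf * idf over the feature terms (raw-count tf assumed)."""
    counts = Counter(tokens)
    return [counts[term] * idfs[term] for term in terms]
```

Because `docs` holds only the documents selected by the training query, other training sets stored in the same collection do not influence the document frequencies.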
> In the *train* function a batch gradient descent approach is used to iteratively train
the model.
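For readers unfamiliar with batch gradient descent, a minimal single-process Python sketch follows. The bias term, the fixed learning rate `alpha`, and the absence of regularization are simplifying assumptions for the example; it does not reproduce how the patch distributes the work across shards.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(vectors, outcomes, alpha=0.01, max_iterations=100):
    """Batch gradient descent: one weight update per full pass over the training set.

    vectors: list of feature vectors (e.g. tf-idf values per feature term).
    outcomes: list of 0/1 labels.
    Returns the weights, with index 0 holding the bias (an assumption of this sketch).
    """
    weights = [0.0] * (len(vectors[0]) + 1)
    for _ in range(max_iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(vectors, outcomes):
            z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            err = sigmoid(z) - y
            gradient[0] += err
            for j, v in enumerate(x):
                gradient[j + 1] += err * v
        # The update happens once per iteration, after all documents are seen.
        weights = [w - alpha * g for w, g in zip(weights, gradient)]
    return weights


def predict(weights, x):
    """Probability that the document with feature vector x is in the positive class."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
```

Since the gradient is a sum over documents, each iteration only needs the current weights going out and the accumulated gradient coming back, which fits the description of the model being the only thing sent over the network.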
> Both the *features* and *train* functions are embedded in Solr using the AnalyticsQuery
framework, so only the model is transported across the network with each iteration.
> Both the features and the models can be stored in a SolrCloud collection. Using this
approach, Solr can hold millions of models, which can be selectively deployed. For example, a
model could be trained for each user to personalize ranking and recommendations.
> Below is the final iteration of a model trained on the Enron Ham/Spam dataset. The model
includes the terms and their idfs and weights as well as a classification evaluation describing
the accuracy of the model on the training set. 
> {code}
> {
> 			"idfs_ds": [1.2627703388716238, 1.2043595767152093, 1.3886172425360304, 1.5488587854881268,
1.6127302558747882, 2.1359177807201526, 1.514866246141212, 1.7375701403808523, 1.6166175299631897,
1.756428159015249, 1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337,
1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 1.2225773869965861,
2.208893321170597, 1.878981794430681, 2.043737027506736, 2.2819184561854864, 2.3264563106163885,
1.9336117619172708, 2.0467265663551024, 1.7386696457142692, 2.468795829515302, 2.069437610615317,
2.6294363202479327, 3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736,
1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 3.955894889757158,
2.587435396273302, 3.945844553903657, 1.003513057076781, 3.0416264032637708, 2.248395764146843,
4.018415246738492, 2.2876164773001246, 3.3636289340509933, 1.2438124251270097, 2.733903579928544,
3.439026951535205, 0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343,
2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 4.1083634834014315, 4.120197941048435,
2.471081544796158, 2.4241455557775633, 2.923393626201111, 2.9269972337044097, 3.2987413118451183,
2.383498249003407, 4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316,
2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 3.8688835127675283,
3.021585652380325, 3.0295538220295017, 1.9620882623582288, 3.469610374907285, 3.945844553903657,
3.4821105376715167, 4.3169082352944885, 2.520329479630485, 3.609372317282444, 3.070375816549757,
4.220281399605417, 3.9866665484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076,
4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 2.360699334038601,
2.7368842080666815, 3.730733174286711, 3.1991566064156816, 4.4238803548466565, 2.4665153268165767,
3.175736332207583, 3.2378087608499606, 4.376627469996111, 3.3525177086259226, 3.28315658082842,
4.156565585219309, 1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158,
2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 1.6244377066690232,
2.948895919012047, 3.175736332207583, 2.850571166501062, 4.073677925413541, 2.725014632511298,
3.1573871935393867, 4.562030693327474, 3.5403794457954922, 4.580722826339627, 4.580722826339627,
3.189722574182323, 3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234,
2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 2.6902285164258823,
3.668212817305377, 4.543681554659277, 2.559550192783766, 1.5452257206382456, 2.2631561797299637,
4.659194441781121, 3.2678110111537597, 3.878185905429842, 3.3525177086259226, 3.374865007317919,
3.780330115426083, 4.376627469996111, 3.433020927474993, 3.6758174166905966, 4.288334862850433,
3.2378087608499606, 4.490571729345329, 2.9269972337044097, 4.029226162842708, 3.0538465145985465,
4.440140875718437, 3.533734903076824, 4.659194441781121, 4.659194441781121, 4.525663049156599,
3.706827653433157, 3.1172927363375087, 4.490571729345329, 2.552078177945065, 2.087985282971078,
4.83744267318744, 4.562030693327474, 4.09666744363824, 4.659194441781121, 1.802255192400069,
4.599771021310321, 3.788840805093992, 4.8621352857778115, 4.6798137289838575, 4.376627469996111,
3.272900080661231, 3.8970543897342247, 4.638991734463602, 4.638991734463602, 4.813345121608379,
4.813345121608379, 4.8621352857778115, 4.83744267318744, 3.588170109631841, 4.13217413209515,
4.599771021310321, 4.331507034715641, 3.134914337687328, 4.525663049156599, 4.722373343402653,
3.955894889757158, 4.967495801435638, 4.580722826339627, 4.967495801435638, 4.9134285801653625,
4.887453093762102, 4.407880013500216, 4.246949646687578, 2.198385343572182, 1.5963758750107606,
4.007719957621744],
> 			"alpha_d": 7.150861416624748E-4,
> 			"terms_ss": ["enron", "2000", "cc", "hpl", "daren", "http", "gas", "forwarded", "pm",
"ect", "hou", "thanks", "meter", "2001", "attached", "deal", "am", "farmer", "your", "nom",
"corp", "more", "mmbtu", "xls", "here", "j", "let", "volumes", "questions", "www", "2004",
"sitara", "no", "money", "01", "volume", "know", "best", "meds", "bob", "prescription", "please",
"online", "file", "viagra", "02", "stop", "me", "nomination", "v", "on", "i", "click", "texas",
"03", "prices", "for", "paliourg", "php", "09", "contract", "fyi", "actuals", "u", "04", "pain",
"713", "drugs", "microsoft", "email", "robert", "cialis", "melissa", "investment", "teco",
"pat", "11", "save", "professional", "world", "biz", "flow", "dollars", "noms", "2005", "act",
"remove", "results", "soft", "xp", "mary", "80", "spam", "following", "06", "software", "n",
"dealer", "08", "ena", "offer", "sex", "products", "special", "compliance", "see", "free",
"cheap", "html", "07", "gary", "000", "low", "our", "houston", "many", "april", "size", "r",
"tap", "lots", "product", "pills", "xanax", "vance", "ami", "chokshi", "12", "clynes", "ticket",
"counterparty", "super", "thousand", "daily", "offers", "weight", "05", "all", "call", "photoshop",
"julie", "stock", "lisa", "steve", "million", "health", "site", "quality", "stocks", "link",
"featured", "net", "international", "most", "investing", "works", "readers", "uncertainties",
"differ", "news", "david", "seek", "31", "only", "1933", "creative", "windows", "subscribers",
"should", "adobe", "security", "1934", "valium", "brand", "visit", "action", "canon", "pharmacy",
"sexual", "inherent", "construed", "assumptions", "internet", "mobile", "risks", "wide", "smith",
"ex", "pill", "states", "projections", "medications", "predictions", "anticipates", "deciding",
"events", "advice", "now", "com", "browser"],
> 			"iteration_i": 100,
> 			"weights_ds": [0.9524452699893067, -2.9257423290160225, -2.122240862520573, -0.40259380863176036,
-1.242508927269482, -2.1933952666745924, 0.9119553386109202, -1.3359582128074137, -1.1717690853817335,
-0.9029380383621088, -1.970576222154978, -0.9180539343040344, -2.031736167842155, -1.382820037232718,
-1.4296530557007743, -1.5015080966872794, -0.852373483913152, -0.2883706803921614, -0.2366741375717678,
0.2966401203916763, -0.6792566685980972, -0.18912751254722837, 0.10265566994945839, -1.0065678789783332,
-0.8967357570889625, 0.041722607774742765, -0.2832721589409925, -0.400560390908784, -0.6945385025086017,
-0.8488391208665993, -0.31851465800191403, 1.570768257518063, -1.5144615060332418, 0.9411280928801138,
0.738478999511349, -0.6875177906594712, -0.47841730767672286, -0.20502227184813, 0.4858041557455349,
1.389551367014946, -0.8886199496843126, 0.8029699876855549, -0.7760217032166719, 0.40175437931353053,
-0.6231018791954438, 1.0261571991645586, -0.44254206613371744, 0.31955072203529183, -0.24171600421157927,
-0.632533557090375, 0.774533771979748, -1.1164595912116915, -0.2954704188664946, 0.27653823698423186,
-1.157867306631878, -5.49332153268076E-5, 0.6916900118076985, -1.305726586870522, 1.370623007467874,
1.1100575515185573, 0.40953153124448194, -0.4273267120664356, -0.5536271317082946, -0.03575915648164506,
0.20475308352558616, -0.2919021960690356, 1.1094392826383312, -1.24904822249928, 1.038764158800864,
0.10525284214114823, 0.1973739189626828, -0.33283870614700184, 1.0555375704790861, 0.25856879498650104,
0.921918816504445, -0.15711181528461088, -0.3594966291171786, -0.6659758614594922, -0.3342439009175488,
0.3592708173532555, 0.12872616265365205, 1.362140022970902, -0.2699930594417464, 0.7449118829650243,
-0.12665949567352622, 1.1289376146405283, 0.1653713075673579, 0.7008424353370497, 0.47095485852014707,
1.021689093687625, 1.0049928692400525, -0.18114402652386635, 0.4403400905532737, 1.0570966104647033,
-1.167541821576636, -0.4428853975686944, 0.20694894484760668, 0.15472835818468766, 1.0009582999260647,
0.013730849275970687, -0.3882888402977611, 0.14102499499877702, 1.1560852477692065, -0.822855520787489,
-0.1468595831916683, 0.9069870716505091, -0.18884872126960675, -0.19213990843838719, -0.0032534107278622496,
0.2715800337813452, 0.0888346122807297, -0.37031213468904256, -0.07224227291981163, 0.08850381657180348,
0.20501283264716516, -0.5852130122059844, 0.11807896760332989, -1.3196626232666966, 0.5324969558412787,
0.7667504164777665, 0.11805357030082002, 1.0020954114301253, -0.10885082229805468, 1.003094962524753,
1.0000914796917044, 0.0094959191513861, -0.5127276009526891, 0.059129413669497796, -0.49311249434449955,
0.34652229330274653, -0.7618731785587705, -0.3514318991274448, 0.7742232232987654, 0.7575763908124484,
-0.25192129997930635, -0.24220187762559128, 1.0014232005812307, -0.3453736248293833, -0.1121687186012911,
-0.15547543099631278, 1.0840890597241875, -0.2879034857435273, -0.227656977034567, -0.3716602841157388,
0.18007113168986144, 0.8297688092273079, 1.405797209837956, 0.3921445898278919, 1.079363745455813,
-0.6253022693091732, 0.33155358331572704, 0.9644709831096733, -0.19686285814583682, 1.1069098903214452,
-0.19597970694899214, -0.29329229099344734, -0.037185151648282316, 1.0010206696926418, 1.0096586146138415,
0.9523090849946898, 0.34253175617551923, -0.41826608329006, 0.7213729935258942, -0.47416007242000024,
0.3210039942978008, 1.0, 0.9772041721907345, 0.2533596337281238, 0.9839657417973666, -0.7583308570783015,
0.9476391050914625, 0.2534925274818649, 1.0, 1.0001125385832383, 0.37796474985487505, 0.3839828352290301,
0.44224405246124543, 1.046072941713049, 1.1205405856642119, 0.9165436674154628, 0.9586701268580604,
1.0000000000000968, 0.9860828147022696, -0.32499900116244823, 1.1624049652694368, 0.4966278258894532,
-0.14840111822378488, 0.15131204240736265, 1.114787005544689, 1.1782663102351227, 0.21291210471466848,
1.0000000000385034, 0.9564718923455356, 1.0110628413440756, 1.000156375636503, 0.9763045864950046,
0.2630059727829917, 0.24199402427272665, 0.2736018381908099, -0.7673296746900424, -0.1899398724099395],
> 			"field_s": "body",
> 			"trueNegative_i": 3570,
> 			"falseNegative_i": 35,
> 			"falsePositive_i": 75,
> 			"error_d": 176.8112932306374,
> 			"truePositive_i": 1381,
> 			"id": "model_100"
> 		}
> {code}
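The four confusion-matrix counts in the model document can be turned into the usual summary metrics. The short Python sketch below uses the counts from the output above; the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts copied from truePositive_i, trueNegative_i, falsePositive_i, falseNegative_i.
metrics = classification_metrics(tp=1381, tn=3570, fp=75, fn=35)
```

With these counts the model classifies 4951 of 5061 training documents correctly, roughly 97.8% accuracy, with precision near 94.8% and recall near 97.5%. These figures describe fit to the training set, not performance on held-out data.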



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

