mahout-user mailing list archives

From Chris Schilling <ch...@cellixis.com>
Subject Strange results running SGD TrainNewsGroups example
Date Fri, 17 Dec 2010 00:24:40 GMT
Hi,

I am able to run the o.a.m.classifier.sgd.TrainNewsGroups example.  However, I am getting
strange results for the top-weighted features from the dissector.  Here are some snippets
from the output.

Training and evaluation:
...
0.00	0.00	0.00	0.00	0.0000000	0.0000000	700	0.000	0.00	none
0.00	0.00	0.00	0.00	0.0000000	0.0000000	800	0.000	0.00	none
0.23	186868.00	52255.00	1265.34	1.3754325e-07	1.0028590e-08	1000	-2.608	34.71	none
0.23	186868.00	52255.00	1265.34	1.3754325e-07	1.0028590e-08	1200	-2.608	34.71	none
0.23	186868.00	52255.00	1265.34	1.3754325e-07	1.0028590e-08	1400	-2.608	34.71	none
0.23	186868.00	52255.00	1265.34	1.3754325e-07	1.0028590e-08	1500	-2.608	34.71	none
1.04	189460.00	55622.00	4651.08	2.5962146e-08	1.0060092e-08	2000	-1.768	56.43	none
1.09	189837.00	60531.00	5314.93	3.1550927e-08	1.0048498e-08	2500	-1.191	71.41	none
...
1.12	189992.00	68364.00	6384.13	2.4446595e-07	1.0034217e-08	6000	-0.880	80.90	none
1.14	189991.00	68775.00	6439.89	3.0565171e-07	1.0033774e-08	7000	-0.849	82.27	none
1.16	189995.00	69360.00	6491.92	3.0565171e-07	1.0000002e-08	8000	-0.860	81.02	none
1.16	189999.00	69919.00	6527.12	3.0566116e-07	1.0000000e-08	10000	-0.851	82.40	none

So, I am running over the files in /20news-bydate/20news-bydate-train.  I think the above
looks reasonable.  At least, I like the ~80% accuracy of the classifier.  However, when I look
at the top results from the dissector, the features do not match the categories they are
listed with (at least compared to similar results given in Listing 15-9 of MIA).  For example,
body=god shows up under sci.space and body=atheism under misc.forsale, which makes no sense
to me.

First few results of dissect():
body=god	-0.1	sci.space	4.0	-0.1394994576714021	5.0	-0.10322063352194852
body=atheists	-0.1	comp.windows.x	5.0	-0.07383748917466922	1.0	-0.037205929610919175
body=christian	-0.1	talk.politics.mideast	2.0	-0.029106552130967654	4.0	-0.0033808015660384875
body=he	0.1	talk.politics.mideast	18.0	0.07845100216340763	5.0	-0.011218075788326903
body=martin	-0.1	talk.politics.mideast	7.0	-0.019407188307985972	10.0	0.00782255718617942
body=say	-0.1	comp.sys.ibm.pc.hardware	4.0	-0.0480512351042981	17.0	0.0037854045183534166
body=windows	0.1	comp.windows.x	18.0	-0.06722265016470273	5.0	-0.009627757932247396
body=file	-0.1	sci.med	7.0	-0.05790809278204335	5.0	-0.050492324263356765
body=government	0.1	talk.religion.misc	3.0	-0.06076111927305433	2.0	-0.052663471587524276
body=sale	-0.1	talk.religion.misc	15.0	-0.03535708180324768	12.0	-0.03532746353789419
body=atheism	-0.1	misc.forsale	8.0	-0.05941771751639946	1.0	-0.0500729187538798
body=program	-0.1	sci.med	16.0	-0.03820018259936702	7.0	-2.9675316187177843E-4
body=193	0.1	talk.politics.mideast	5.0	0.05061582599095028	17.0	-0.032606809778589076
body=his	-0.1	talk.politics.misc	12.0	0.05030942352260737	5.0	-0.04490996261214399
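
To make rows like these easier to inspect, here is a minimal sketch that splits one line into named fields. It assumes the columns are tab-separated and that they hold a feature name, a summary weight, a category label, and then (category index, weight) pairs; that reading of the columns is an assumption based only on the shape of the output above, and parse_dissect_row is a hypothetical helper, not part of Mahout.

```python
from typing import List, NamedTuple, Tuple


class DissectRow(NamedTuple):
    feature: str                    # e.g. "body=god"
    weight: float                   # second column (assumed: summary weight)
    category: str                   # third column (assumed: category label)
    top: List[Tuple[int, float]]    # assumed: (category index, weight) pairs


def parse_dissect_row(line: str) -> DissectRow:
    """Split one tab-separated dissect() output line into named fields."""
    fields = line.rstrip("\n").split("\t")
    feature, weight, category = fields[0], float(fields[1]), fields[2]
    rest = fields[3:]
    # Remaining columns come in pairs; the index is printed as a float ("4.0").
    pairs = [
        (int(float(rest[i])), float(rest[i + 1]))
        for i in range(0, len(rest) - 1, 2)
    ]
    return DissectRow(feature, weight, category, pairs)


row = parse_dissect_row(
    "body=god\t-0.1\tsci.space\t4.0\t-0.1394994576714021\t5.0\t-0.10322063352194852"
)
print(row.feature, row.category, row.top)
```

With the rows parsed this way, the category label in the third column can be compared against the category indices in the later columns to see whether they refer to the same newsgroup.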

I am not adding any leaks (leakType = 0).  

Any ideas here?  
Thanks
Chris S.