mahout-user mailing list archives

From 万代豊 <20525entrad...@gmail.com>
Subject Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?
Date Mon, 18 Mar 2013 05:56:35 GMT
Jake,
Hi.
I have been busy with other matters and have not yet built Mahout 0.7
from the trunk code. Before doing so, I tried Mahout 0.6 so that I could
run LDA directly.

I successfully ran LDA with a TF vector file as input: 68 iterations
across 43 documents, with 12 topics specified.

$MAHOUT_HOME/bin/mahout lda --input
JAText-Mahout-0.6-LDA/JAText-luceneTFvectors01/part-out.vec --output
JAText-Mahout-0.6-LDA/output --numTopics 12
$HADOOP_HOME/bin/hadoop dfs -ls JAText-Mahout-0.6-LDA/output/
Found 70 items
drwxr-xr-x   - hadoop supergroup          0 2013-03-18 13:48 /user/hadoop/JAText-Mahout-0.6-LDA/output/docTopics
drwxr-xr-x   - hadoop supergroup          0 2013-03-18 13:03 /user/hadoop/JAText-Mahout-0.6-LDA/output/state-0
drwxr-xr-x   - hadoop supergroup          0 2013-03-18 13:04 /user/hadoop/JAText-Mahout-0.6-LDA/output/state-1
      .....
      .....
drwxr-xr-x   - hadoop supergroup          0 2013-03-18 13:47 /user/hadoop/JAText-Mahout-0.6-LDA/output/state-67
drwxr-xr-x   - hadoop supergroup          0 2013-03-18 13:48 /user/hadoop/JAText-Mahout-0.6-LDA/output/state-68

I can see a part-m-00000 sequence file for each iteration stage.

My question is about the $MAHOUT_HOME/bin/mahout ldatopics utility for Mahout
0.6 (https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html):
it fails with a NullPointerException.

I could only confirm the docTopics result using seqdumper; I was not able
to see any results from the state-* sequence files above.
Here is what happens with the ldatopics command.
$MAHOUT_HOME/bin/mahout ldatopics -i JAText-Mahout-0.6-LDA/output/state-68
-d JAText-TFDictionary.txt
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
MAHOUT-JOB: /usr/local/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Exception in thread "main" java.lang.NullPointerException
 at org.apache.mahout.common.Pair.compareTo(Pair.java:90)
 at org.apache.mahout.common.Pair.compareTo(Pair.java:23)
 at java.util.PriorityQueue.siftUpComparable(PriorityQueue.java:582)
 at java.util.PriorityQueue.siftUp(PriorityQueue.java:574)
 at java.util.PriorityQueue.offer(PriorityQueue.java:274)
 at java.util.PriorityQueue.add(PriorityQueue.java:251)
 at org.apache.mahout.clustering.lda.LDAPrintTopics.maybeEnqueue(LDAPrintTopics.java:150)
 at org.apache.mahout.clustering.lda.LDAPrintTopics.topWordsForTopics(LDAPrintTopics.java:216)
 at org.apache.mahout.clustering.lda.LDAPrintTopics.main(LDAPrintTopics.java:128)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Apart from this, using seqdumper (but not "ldatopics") I can confirm 12
topic weights for each of the 43 documents, as follows.

$MAHOUT_HOME/bin/mahout seqdumper -s
JAText-Mahout-0.6-LDA/output/docTopics/part-m-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
MAHOUT-JOB: /usr/local/mahout-distribution-0.6/mahout-examples-0.6-job.jar
13/03/18 14:12:04 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=JAText-Mahout-0.6-LDA/output/docTopics/part-m-00000, --startPhase=0, --tempDir=temp}
Input Path: JAText-Mahout-0.6-LDA/output/docTopics/part-m-00000
Key class: class org.apache.hadoop.io.LongWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value:
{0:0.0718128116030847,1:0.07204818495147658,2:0.07165839473775905,3:0.07471123413425951,4:0.07228942239756206,5:0.07223674970698116,6:0.08965049111711978,7:0.07114235379664942,8:0.18392117686641946,9:0.07555513290760585,10:0.0725383327578603,11:0.07243571502322221}
Key: 1: Value:
{0:0.07340159249672981,1:0.07673280973643179,2:0.07227506725925102,3:0.17698846760344888,4:0.07957759924990469,5:0.07593691263843196,6:0.07237777139656294,7:0.07195475314903217,8:0.07480823084457076,9:0.07539197289261017,10:0.07323355269079787,11:0.07732127004222797}
Key: 2: Value:
{0:0.07537526514889709,1:0.0740932684425483,2:0.07401704886882894,3:0.15947138780941505,4:0.07626805801786213,5:0.07594534542558737,6:0.0774595088530317,7:0.07360052680038194,8:0.08527758756831044,9:0.07528779975758186,10:0.07796910710111517,11:0.07523509620643995}
Key: 3: Value:
{0:0.07750073784804058,1:0.07356280672637124,2:0.07221662738530421,3:0.08091893349893002,4:0.0754552021503801,5:0.07417277880780081,6:0.07361890672000053,7:0.071745551181584,8:0.07641778805410677,9:0.07534234418941486,10:0.17176841942790796,11:0.07727990401015893}
Key: 4: Value:
{0:0.07145940471259196,1:0.07184955998834013,2:0.2058935801891931,3:0.07292422029832372,4:0.07153173687477948,5:0.07537967128784669,6:0.06978672313809625,7:0.07538935753144121,8:0.07114074879665583,9:0.07002063039135159,10:0.0725472962330038,11:0.0720770705583762}
Key: 5: Value:
{0:0.07057892607036045,1:0.07093406005560303,2:0.08568518347169529,3:0.07391938432755896,4:0.0759092820317773,5:0.07467329995382449,6:0.07068594441945125,7:0.1875358999487775,8:0.07173558822628456,9:0.07494521218787105,10:0.07199846644784835,11:0.07139875285894795}
Key: 6: Value:
{0:0.07794348574663487,1:0.07824368289133427,2:0.07165595068330537,3:0.07774954595685425,4:0.0766120838468042,5:0.0747036076270153,6:0.07454553055882684,7:0.07060172890982834,8:0.07844336583237742,9:0.16374785108989987,10:0.07569450143990512,11:0.08005866541721406}
Key: 7: Value:
{0:0.07112572397111286,1:0.07300363380720896,2:0.07232747075644068,3:0.07518314545335636,4:0.07513477084824424,5:0.07297914727519446,6:0.18805919233900711,7:0.07148839019410413,8:0.08024531932230204,9:0.07541179727952517,10:0.07177860391564407,11:0.07326280483785985}
Key: 8: Value:
{0:0.07554100328008571,1:0.07304022699955763,2:0.08056213189891553,3:0.07686453603967164,4:0.07608453289792913,5:0.0808118995073971,6:0.07391884403051643,7:0.15412003436750293,8:0.07603694729795683,9:0.0778324702234325,10:0.07587701532696045,11:0.07931035813007417}
Key: 9: Value:
{0:0.0766887909399859,1:0.16442715116981202,2:0.07453094096788057,3:0.08083048768359999,4:0.07496728961744424,5:0.07571845025936895,6:0.07456869584503094,7:0.07451361974201046,8:0.07533011667739489,9:0.07504805610051823,10:0.07533897793717347,11:0.07803742305978043}
Key: 10: Value:
{0:0.07644186790750442,1:0.07188354701541526,2:0.08362649017738835,3:0.07494440006591431,4:0.07042864597575031,5:0.08043564052162107,6:0.07319758531674558,7:0.0919172838161376,8:0.07389616525845054,9:0.07439065779785406,10:0.07382932903997881,11:0.15500838710723974}
Key: 11: Value:
{0:0.07513407214936361,1:0.07403840577927905,2:0.08218396128030249,3:0.07475162143684473,4:0.0753271291729383,5:0.0817824361595834,6:0.07401634646350805,7:0.15391594459572414,8:0.07599876967046014,9:0.07493677823965805,10:0.07464925643938625,11:0.08326527861295184}
Key: 12: Value:
{0:0.07224524537312543,1:0.07243726695303788,2:0.06882751916311756,3:0.07338082060232978,4:0.0789656395130017,5:0.07129041659924686,6:0.07112717550480875,7:0.06914855037083716,8:0.0739188820784451,9:0.20207104638363793,10:0.07262957199026002,11:0.07395786546815176}
Key: 13: Value:
{0:0.07788076680635052,1:0.0755008470585809,2:0.13726003168639164,3:0.07605002630571478,4:0.07286650305830464,5:0.08165367846340763,6:0.07498018069088153,7:0.09097790918226874,8:0.07657447805252125,9:0.07485142034070366,10:0.0796177815538129,11:0.08178637680106185}
Key: 14: Value:
{0:0.07295702097991869,1:0.07692830323645224,2:0.07114936656535441,3:0.07813243236981465,4:0.07458741344071758,5:0.07428206772212516,6:0.07306149947659399,7:0.07119208068308296,8:0.07661679604020216,9:0.1808171438863629,10:0.07362165596251889,11:0.07665421963685642}
Key: 15: Value:
{0:0.07438303415521565,1:0.08936472527092182,2:0.0721407480734369,3:0.08024517373634514,4:0.08495112352899714,5:0.12798950687476365,6:0.0738652708344951,7:0.07102391500075698,8:0.07642018373472435,9:0.09212303419756263,10:0.07471433966355276,11:0.0827789449292279}
Key: 16: Value:
{0:0.07405933140932965,1:0.07237135168339745,2:0.07710396626904192,3:0.07537188986011914,4:0.07492427432053816,5:0.1848448388705187,6:0.07160376504616668,7:0.07275258542679869,8:0.07447012105434946,9:0.07322622120669901,10:0.07491117175584533,11:0.07436048309719598}
Key: 17: Value:
{0:0.07270150174820346,1:0.07336075496870366,2:0.07206132040435817,3:0.07505175890951873,4:0.17078284419746403,5:0.07234844853114338,6:0.07349550481242655,7:0.07140894496080638,8:0.09526112605801787,9:0.07827594135881187,10:0.07256797723381891,11:0.07268387681672689}
Key: 18: Value:
{0:0.06822147036587765,1:0.06931705341385691,2:0.06755017517379881,3:0.07084620473648064,4:0.07209772088260505,5:0.06914901640907882,6:0.06833666233433819,7:0.0677132907737504,8:0.2380642077190045,9:0.07110086555849322,10:0.06789165498406487,11:0.06971167764865105}
Key: 19: Value:
{0:0.1521172159468707,1:0.07458905157538626,2:0.0812452433567503,3:0.0737934714691652,4:0.07557798634927729,5:0.0914283581095297,6:0.07331776882423938,7:0.07983474181400593,8:0.07424156632991395,9:0.07399015637787039,10:0.07553157126005484,11:0.07433286858693594}
Key: 20: Value:
{0:0.07037088504213293,1:0.07753173778259684,2:0.06963656610525527,3:0.07652331293278451,4:0.07577186332142459,5:0.07107649480246177,6:0.069634707285086,7:0.06796954595633013,8:0.07111380206716118,9:0.07329473161326701,10:0.07085393447941721,11:0.20622241861208232}
Key: 21: Value:
{0:0.07267052371362376,1:0.10838457382222143,2:0.07288626490852075,3:0.07665501391683498,4:0.08122877546341466,5:0.07648757678490509,6:0.07309754896275286,7:0.07196924485513742,8:0.07510273320968298,9:0.08027888654475641,10:0.07434582247515231,11:0.13689303534299732}
Key: 22: Value:
{0:0.07590299260794119,1:0.07182671472972624,2:0.07866170391847252,3:0.07196318892167715,4:0.07391467745715426,5:0.08199253610117335,6:0.07112057388229735,7:0.17995567881742183,8:0.07313986248980091,9:0.0719897705532455,10:0.07707137069166255,11:0.07246092982942719}
Key: 23: Value:
{0:0.07510400993439616,1:0.07492238051881782,2:0.07249891469501611,3:0.07874118062111353,4:0.07740863036871856,5:0.07495020931710737,6:0.07487026612707089,7:0.07228496799337089,8:0.07554589769543246,9:0.07754569165808944,10:0.17058694106855654,11:0.07554091000231035}
Key: 24: Value:
{0:0.07321784157147374,1:0.07353176432663289,2:0.07461758340564321,3:0.07478301843275811,4:0.07224848003389454,5:0.19336440372330538,6:0.07161969153588899,7:0.07227009244323178,8:0.07326037281412974,9:0.07238700047752525,10:0.07413516314997139,11:0.07456458808554488}
Key: 25: Value:
{0:0.07284344738966637,1:0.08048892935573505,2:0.07348587608823655,3:0.15224626772000824,4:0.08057971643942442,5:0.07524631733926435,6:0.07367650429704985,7:0.07298624476286161,8:0.07815841965985619,9:0.08327719963091014,10:0.07388054717836262,11:0.08313053013862455}
Key: 26: Value:
{0:0.2088641213128677,1:0.07217357239954633,2:0.06977268024837441,3:0.0723781559516539,4:0.0733843609346381,5:0.0725289468047051,6:0.070112386821386,7:0.06847188560423208,8:0.07342905740312945,9:0.073599059451924,10:0.07197003494288245,11:0.07331573812466023}
Key: 27: Value:
{0:0.07289600892713649,1:0.07348311866263774,2:0.18778903162140495,3:0.07205340827187023,4:0.07273935194516655,5:0.07893144141414665,6:0.07082017634447321,7:0.07673952529223389,8:0.072997596716132,9:0.07235187747014486,10:0.07347067257673276,11:0.07572779075792047}
Key: 28: Value:
{0:0.07381955730631025,1:0.07480361210390583,2:0.07327103357211588,3:0.14958183573796774,4:0.07688133966924651,5:0.07487721773410637,6:0.07612971131475965,7:0.07259117525483041,8:0.0912047102466518,9:0.07874554927502336,10:0.08172598454324863,11:0.07636827324183355}
Key: 29: Value:
{0:0.1692644666391619,1:0.07390683762411573,2:0.07342660554852809,3:0.07696353108313878,4:0.07347832825418264,5:0.07694955797130756,6:0.07407584772559297,7:0.07304732544104044,8:0.07641684226027642,9:0.079829317439002,10:0.07558535930959318,11:0.07705598070406015}
Key: 30: Value:
{0:0.07468947736146606,1:0.07786760785575264,2:0.07668849483827693,3:0.07516000273386261,4:0.07543363636150069,5:0.09036250176865988,6:0.07429970446478658,7:0.0746898616436536,8:0.07542013353273386,9:0.07554322999724328,10:0.14955906751038184,11:0.08028628193168191}
Key: 31: Value:
{0:0.1867332766391863,1:0.07235638381666432,2:0.0721089231183253,3:0.0757112543523175,4:0.07145873422020084,5:0.07560655750201013,6:0.073010166169799,7:0.07066438533719611,8:0.0762163155789337,9:0.07595318328439196,10:0.07414114215270613,11:0.07603967782826869}
Key: 32: Value:
{0:0.07143510793815012,1:0.17614219360492722,2:0.07252143229313225,3:0.07688914097701999,4:0.08026672117717741,5:0.0758298558372681,6:0.07247829561880598,7:0.07067774151598975,8:0.0752054952096508,9:0.07724201853484558,10:0.07231627604723512,11:0.07899572124579768}
Key: 33: Value:
{0:0.07368475771373456,1:0.07517339691091597,2:0.166888256975227,3:0.07570453838729124,4:0.07668800276241097,5:0.08187813165924461,6:0.07168006348981189,7:0.07687788615746113,8:0.07416523041349614,9:0.07479041817884187,10:0.07410767337558434,11:0.07836164397598028}
Key: 34: Value:
{0:0.07522712005110395,1:0.07327306939184465,2:0.07911963461113429,3:0.07656086465591931,4:0.07583432111852756,5:0.08188051955733927,6:0.07225128645628258,7:0.15999385884776768,8:0.07571652495939142,9:0.07509326402680659,10:0.07548674460502434,11:0.07956279171885833}
Key: 35: Value:
{0:0.07267242275737872,1:0.07878949220377902,2:0.07617507452134238,3:0.07582983686365709,4:0.13527442765196027,5:0.09540599826115263,6:0.07277660873228448,7:0.07713264456951018,8:0.07503816659261497,9:0.0770062890646006,10:0.07959344680297649,11:0.08430559197874316}
Key: 36: Value:
{0:0.07272621795829833,1:0.16990320956026245,2:0.071331582511733,3:0.07741849862522854,4:0.08302168024422839,5:0.07354592757236778,6:0.07279496297014562,7:0.070798928709588,8:0.07414387540194455,9:0.07769684807216959,10:0.07231738568725134,11:0.08430088268678237}
Key: 37: Value:
{0:0.07012467167032206,1:0.07103553606627135,2:0.0698525275633188,3:0.0712926267418177,4:0.22431986557710554,5:0.06974652518500885,6:0.06982647001354533,7:0.06987254073204445,8:0.06983962747680701,9:0.07160929845412142,10:0.07086036087361314,11:0.07161994964602439}
Key: 38: Value:
{0:0.07786050954613562,1:0.1408430559342099,2:0.08833468037109693,3:0.07375969748045628,4:0.07472552006208838,5:0.08045807947499396,6:0.07476725512435577,7:0.08639595288916795,8:0.07608567357407875,9:0.07407901796443528,10:0.07732204782836997,11:0.07536850975061114}
Key: 39: Value:
{0:0.07282613739549393,1:0.07282699349864358,2:0.08011836101605058,3:0.07244839722230774,4:0.07339239380356403,5:0.08010831474869388,6:0.07235509848723476,7:0.0756913055286426,8:0.07273110519910651,9:0.07180132292872553,10:0.18272192763698494,11:0.07297864253455194}
Key: 40: Value:
{0:0.0720367504527421,1:0.07359541056796724,2:0.07241019929875296,3:0.07428091543798446,4:0.07796541190573303,5:0.07434079197594493,6:0.17141569830676887,7:0.07225079621214267,8:0.08594739058245894,9:0.077968805627261,10:0.0719804556201378,11:0.0758073740121059}
Key: 41: Value:
{0:0.07105868013287045,1:0.07188669942062725,2:0.07000889617200293,3:0.07758422486163055,4:0.07367082545375013,5:0.07274706379711163,6:0.19093192769727652,7:0.0714291088454229,8:0.07867747306834934,9:0.07695822357867978,10:0.07099234851453332,11:0.07405452845774509}
Key: 42: Value:
{0:0.07012456232391874,1:0.07202756188360443,2:0.07104733404285919,3:0.07529434754067578,4:0.07144748290615627,5:0.07073096608087212,6:0.19684570628989637,7:0.07070807572900607,8:0.0796551738020907,9:0.07771780444970888,10:0.07103534947742626,11:0.07336563547378515}
Count: 43
13/03/18 14:12:05 INFO driver.MahoutDriver: Program took 780 ms (Minutes: 0.013)

I believe the seqdumper output of docTopics shows the weight that each
topic contributes to each document.
In addition to this, I expect to be able to see the list of keywords for
each topic.
My impression is that LDA gives back results similar in nature to those
of matrix factorization algorithms such as NMF (Non-negative Matrix
Factorization).
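
To make the reading of docTopics concrete, here is a minimal Python sketch that parses one seqdumper value line and picks the topic with the largest weight. It assumes the value text has exactly the {topic:weight,...} form shown above; the function names parse_topic_vector and dominant_topic are made up for illustration and are not part of Mahout.

```python
import re

def parse_topic_vector(value):
    """Parse a seqdumper vector string like '{0:0.07,1:0.18}' into {topic: weight}."""
    pairs = re.findall(r"(\d+):([0-9.eE+-]+)", value)
    return {int(topic): float(weight) for topic, weight in pairs}

def dominant_topic(value):
    """Return (topic, weight) for the topic with the largest weight."""
    weights = parse_topic_vector(value)
    topic = max(weights, key=weights.get)
    return topic, weights[topic]

# Value printed for "Key: 0" in the seqdumper output above.
doc0 = ("{0:0.0718128116030847,1:0.07204818495147658,2:0.07165839473775905,"
        "3:0.07471123413425951,4:0.07228942239756206,5:0.07223674970698116,"
        "6:0.08965049111711978,7:0.07114235379664942,8:0.18392117686641946,"
        "9:0.07555513290760585,10:0.0725383327578603,11:0.07243571502322221}")

print(dominant_topic(doc0))  # (8, 0.18392117686641946)
```

For document 0, topic 8 stands out at about 0.18 while the other eleven topics sit near 0.07, which is the pattern visible in most of the 43 rows.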

Please advise.
Regards,,,
Y.Mandai

2013/2/23 Yutaka Mandai <20525entradero@gmail.com>

> Jake
> Now this is very clear and I will work on this build from the latest
> source.
> Thank you.
> Regards,,,
> Y.Mandai
>
>
> Sent from my iPhone
>
> On 2013/02/23, at 3:14, Jake Mannix <jake.mannix@gmail.com> wrote:
>
> > On Fri, Feb 22, 2013 at 2:26 AM, 万代豊 <20525entradero@gmail.com> wrote:
> >
> >> Thanks Jake for your attention on this.
> >> I believe I have the trunk code from the official download site.
> >> My Mahout version is 0.7, which I downloaded from a local mirror site,
> >> http://ftp.jaist.ac.jp/pub/apache/mahout/0.7/ , and I confirmed that the
> >> timestamp on the mirror site is 12-Jun-2012 and that the timestamps of
> >> my installed files are all identical to it.
> >> Note that I'm using the precompiled JAR files only and have not built
> >> from source code locally.
> >> I believe this should not cause any problem.
> >>
> >> Mahout 0.7 is the first and only version I have used. I have never tried
> >> older ones, nor the newer 0.8 snapshot...
> >>
> >> Can you think of any other possible workaround?
> >>
> >
> > You should try to build from trunk source, this bug is fixed in trunk,
> > that's the correct workaround.  That, or wait for our next officially
> > released version (0.8).
> >
> >
> >>
> >> Also, am I setting the heap size correctly for both Hadoop and Mahout in
> >> this case?
> >> I could confirm the heap assignment for the Hadoop jobs since they are
> >> resident processes, but the Mahout RunJob dies before the VisualVM
> >> utility can recognize it, so I'm not sure whether RunJob really got
> >> the heap it asked for...
> >>
> >
> > Heap is not going to help you here, you're dealing with a bug.  The
> > correct code doesn't need really very much memory at all (less than
> > 100MB to do the job you're talking about).
> >
> >
> >>
> >> Regards,,,
> >> Y.Mandai
> >>
> >>
> >>
> >> 2013/2/22 Jake Mannix <jake.mannix@gmail.com>
> >>
> >>> This looks like you've got an old version of Mahout - are you running on
> >>> trunk?  This has been fixed on trunk, there was a bug in the 0.6 (roughly)
> >>> timeframe in which vectors for vectordump --sort were assumed incorrectly
> >>> to be of size MAX_INT, which led to heap problems no matter how much heap
> >>> you gave it.   Well, maybe you could have worked around it with 2^32 *
> >>> (4 + 8) bytes ~ 48GB, but really the solution is to upgrade to run off of
> >>> trunk.
> >>>
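
For reference, the 48GB figure quoted above can be checked with a little arithmetic. This is only a sketch: reading the 4 and the 8 as the byte sizes of one int index and one double value per entry is an assumption, not something the message states.

```python
# Estimate quoted in the message: 2^32 entries of (4 + 8) bytes each.
entries = 2 ** 32
bytes_per_entry = 4 + 8  # assumed: 4-byte int index + 8-byte double value
total_bytes = entries * bytes_per_entry

print(total_bytes)            # 51539607552
print(total_bytes / 2 ** 30)  # 48.0 (GiB)
```

Either way, the result is far beyond the 4GB heap tried in this thread, which is why raising the heap size does not help.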
> >>>
> >>> On Wed, Feb 20, 2013 at 8:47 PM, 万代豊 <20525entradero@gmail.com> wrote:
> >>>
> >>>> Here is what I tried. However, it still doesn't get through...
> >>>>
> >>>> I increased MAHOUT_HEAPSIZE as below and also removed the comment mark
> >>>> in the mahout shell script so that I can check it's actually taking
> >>>> effect.
> >>>> I added JAVA_HEAP_MAX=-Xmx4g (the default was 3GB).
> >>>>
> >>>> ~bin/mahout~
> >>>> JAVA=$JAVA_HOME/bin/java
> >>>> JAVA_HEAP_MAX=-Xmx4g      # <- Increased from the original 3g to 4g
> >>>> # check envvars which might override default args
> >>>> if [ "$MAHOUT_HEAPSIZE" != "" ]; then
> >>>>  echo "run with heapsize $MAHOUT_HEAPSIZE"
> >>>>  JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
> >>>>  echo $JAVA_HEAP_MAX
> >>>> fi
> >>>>
> >>>> Also added the same heap size as 4G in hadoop-env.sh as
> >>>>
> >>>> ~hadoop-env.sh~
> >>>> # The maximum amount of heap to use, in MB. Default is 1000.
> >>>> export HADOOP_HEAPSIZE=4000
> >>>>
> >>>> [hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
> >>>> [hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
> >>>> NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
> >>>> --vectorSize 5 --printKey TRUE --sortVectors TRUE
> >>>> run with heapsize 4000    <- Looks like RunJar is taking 4G heap?
> >>>> -Xmx4000m                 <- Right?
> >>>> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> >>>> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> >>>> 13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
> >>>> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> >>>> --dictionaryType=[sequencefile], --endPhase=[2147483647],
> >>>> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> >>>> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> >>>> 13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
> >>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >>>> at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
> >>>> at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
> >>>> at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
> >>>> at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
> >>>> at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
> >>>> at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
> >>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>> at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>>> [hadoop@localhost NHTSA]$
> >>>> I've also confirmed through the VisualVM utility that at least all the
> >>>> Hadoop tasks are taking 4GB of heap.
> >>>>
> >>>> I have run clusterdump to extract the top 10 terms from the K-Means
> >>>> result, as below, using exactly the same input data set; however,
> >>>> this task requires no extra heap beyond the default.
> >>>>
> >>>> $ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d
> >>>> NHTSA-vectors01/dictionary.file-* -i
> >>>> NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01
> >>>> -b 30 -n 10
> >>>>
> >>>> I believe the vectordump utility and clusterdump come from different
> >>>> roots in terms of their heap requirements.
> >>>>
> >>>> Still waiting for some advice from you all.
> >>>> Regards,,,
> >>>> Y.Mandai
> >>>> 2013/2/19 万代豊 <20525entradero@gmail.com>
> >>>>
> >>>>>
> >>>>> Well, the --sortVectors option of the vectordump utility, which I used to
> >>>>> evaluate the result of CVB clustering, unfortunately gave me an
> >>>>> OutOfMemory issue...
> >>>>>
> >>>>> Here is a case that seems to go well, without the --sortVectors option.
> >>>>> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> >>>>> NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> >>>>> --printKey TRUE
> >>>>> ...
> >>>>> WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,WHILE
> >>>>> FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,WHILE
> >>>>> GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
> >>>>> HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
> >>>>> I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
> >>>>> IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
> >>>>> IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,WHILE
> >>>>> IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,WHILE
> >>>>> INSPECTING:3.854370531928256E-
> >>>>> ...
> >>>>>
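
As a stopgap while --sortVectors is unusable, a dump in the term:weight format shown above could be sorted outside Mahout. The Python below is a rough sketch under stated assumptions: each entry is "term:weight" separated by commas, terms may themselves contain commas (such as "WHILE FUELING,") but never a colon followed by a number, and the name top_terms is hypothetical rather than any Mahout utility.

```python
import heapq
import re

# A term (lazy match), a colon, a number, then a comma or the end of the text.
ENTRY = re.compile(r"(.+?):(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)(?:,|$)")

def top_terms(dump_text, n):
    """Return the n (term, weight) pairs with the largest weights, heaviest first."""
    pairs = [(term, float(weight)) for term, weight in ENTRY.findall(dump_text)]
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])

# A few entries copied from the unsorted vectordump output above.
sample = ("WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,"
          "WHILE FUELING:1.9818992669733008E-11,"
          "WHILE FUELING,:1.0646022811429909E-11,"
          "WHILE GETTING:5.89954370861319E-6")

for term, weight in top_terms(sample, 2):
    print(term, weight)
```

On these five entries the two heaviest terms are "WHILE GETTING" and "WHILE FOR".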
> >>>>> Once I gave --sortVectors TRUE as below, I ran into an OutOfMemory
> >>>>> exception.
> >>>>> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> >>>>> NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> >>>>> --printKey TRUE --sortVectors TRUE
> >>>>> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> >>>>> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> >>>>> 13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
> >>>>> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> >>>>> --dictionaryType=[sequencefile], --endPhase=[2147483647],
> >>>>> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> >>>>> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> >>>>> 13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
> >>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >>>>> at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
> >>>>> at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
> >>>>> at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
> >>>>> at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
> >>>>> at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
> >>>>> at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
> >>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>> at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> >>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> >>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>>>> I see that there are several parameters that affect the heap given to
> >>>>> a Mahout job, some shared between Hadoop and Mahout and some
> >>>>> independent, such as MAHOUT_HEAPSIZE, JAVA_HEAP_MAX, HADOOP_OPTS, etc.
> >>>>>
> >>>>> Can anyone advise me which configuration files, shell scripts, or XML
> >>>>> files I should change to give additional heap, and also the proper
> >>>>> way to monitor the actual heap usage here?
> >>>>>
> >>>>> I'm running Mahout-distribution-0.7 on Hadoop-0.20.203.0 with
> >>>>> pseudo-distributed configuration on a VMWare Player partition running
> >>>>> CentOS6.3 64Bit.
> >>>>>
> >>>>> Regards,,,
> >>>>> Y.Mandai
> >>>>> 2013/2/1 Jake Mannix <jake.mannix@gmail.com>
> >>>>>
> >>>>>> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <20525entradero@gmail.com> wrote:
> >>>>>>
> >>>>>>> Thanks, Jake, for your guidance.
> >>>>>>> Good to know that I wasn't entirely wrong but was just not familiar
> >>>>>>> enough with vectordump usage.
> >>>>>>> I'll try this out as soon as I can.
> >>>>>>> I hope that --sort doesn't eat up too much heap.
> >>>>>>>
> >>>>>>
> >>>>>> If you're using code on master, --sort should only be using an additional
> >>>>>> K objects of memory (where K is the value you passed to --vectorSize), as
> >>>>>> it's just using an auxiliary heap to grab the top k items of the vector.
> >>>>>> It was a bug previously that it tried to instantiate a vector.size()
> >>>>>> [which in some cases was Integer.MAX_VALUE] sized list somewhere.
> >>>>>>
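
The top-k idea described above can be sketched in a few lines of Python. This only illustrates the general technique (a bounded min-heap that never holds more than k items); it is not Mahout's actual code, and the function name top_k_entries and the sample data are hypothetical.

```python
import heapq

def top_k_entries(entries, k):
    """Return the k (index, weight) pairs with the largest weights, heaviest first.

    Extra memory is proportional to k, not to the size of the vector.
    """
    if k <= 0:
        return []
    heap = []  # min-heap of (weight, index); the smallest kept weight is heap[0]
    for index, weight in entries:
        if len(heap) < k:
            heapq.heappush(heap, (weight, index))
        elif weight > heap[0][0]:
            # New entry beats the smallest kept entry, so swap it in.
            heapq.heapreplace(heap, (weight, index))
    return [(index, weight) for weight, index in sorted(heap, reverse=True)]

# Hypothetical (term index, weight) entries of one topic vector.
vector = [(0, 0.10), (1, 0.50), (2, 0.30), (3, 0.05)]
print(top_k_entries(vector, 2))  # [(1, 0.5), (2, 0.3)]
```

With k equal to the --vectorSize value (5 in the commands in this thread), the heap never holds more than five entries regardless of how large the vector is.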
> >>>>>>
> >>>>>>>
> >>>>>>> Regards,,,
> >>>>>>> Yutaka
> >>>>>>>
> >>>>>>> Sent from my iPhone
> >>>>>>>
> >>>>>>> On 2013/01/31, at 23:33, Jake Mannix <jake.mannix@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Yutaka,
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20525entradero@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi
> >>>>>>>>> Here is a question about how to evaluate the results of Mahout 0.7 CVB
> >>>>>>>>> (Collapsed Variational Bayes), which used to be LDA
> >>>>>>>>> (Latent Dirichlet Allocation) in Mahout versions under 0.5.
> >>>>>>>>> I believe I have no problem running CVB itself; this is purely a
> >>>>>>>>> question about an efficient way to visualize or evaluate the results.
> >>>>>>>>>
> >>>>>>>>> It looks like result evaluation in Mahout 0.5, at least, could be done
> >>>>>>>>> using the utility called "LDAPrintTopics"; however, this has been
> >>>>>>>>> obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on LDA.)
> >>>>>>>>>
> >>>>>>>>> As I said, I'm using Mahout 0.7. I believe I ran CVB
> >>>>>>>>> successfully and obtained results in two separate directories:
> >>>>>>>>> /user/hadoop/temp/topicModelState/model-1 through model-20, matching
> >>>>>>>>> the number of iterations I specified, and
> >>>>>>>>> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009, matching
> >>>>>>>>> the number of topics that I wanted to extract/decompose.
> >>>>>>>>>
> >>>>>>>>> Neither of the files contained in these directories can be dumped using
> >>>>>>>>> Mahout vectordump; however, the output format is very different
> >>>>>>>>> from what you would have gotten using LDAPrintTopics in versions below
> >>>>>>>>> 0.5, which gives you back the topic ID and its
> >>>>>>>>> associated top terms in a very direct format. (See "Mahout in Action"
> >>>>>>>>> p.181 again.)
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Vectordump should be exactly what you want, actually.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Here is what I've done.
> >>>>>>>>> 1. Say I have already generated document vectors and use tf-vectors to
> >>>>>>>>> generate a document/term matrix with
> >>>>>>>>>
> >>>>>>>>> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> >>>>>>>>> NHTSA-matrix03
> >>>>>>>>>
> >>>>>>>>> 2. Then I moved the matrix docIndex aside, as it would otherwise get in
> >>>>>>>>> my way (as I was advised somewhere…)
> >>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> >>>>>>>>> NHTSA-matrix03-docIndex
> >>>>>>>>>
> >>>>>>>>> 3. I confirmed that I have only what I need here with
> >>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> >>>>>>>>> Found 1 items
> >>>>>>>>> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
> >>>>>>>>> /user/hadoop/NHTSA-matrix03/matrix
> >>>>>>>>>
> >>>>>>>>> 4. Then I kicked off CVB with
> >>>>>>>>> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse -dict
> >>>>>>>>> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> >>>>>>>>> …
> >>>>>>>>> ….
> >>>>>>>>> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
> >>>>>>>>> (Minutes: 733.1281333333334)
> >>>>>>>>> (It took over 12 hours to process 100k documents on my laptop with
> >>>>>>>>> pseudo-distributed Hadoop 0.20.203)
> >>>>>>>>>
> >>>>>>>>> 5. Take a look at what I've got.
> >>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> >>>>>>>>> Found 12 items
>