mahout-user mailing list archives

From Katherine Huang <khu...@shopzilla.com>
Subject seq2sparse generated dictionary is missing words
Date Thu, 26 Jan 2012 02:52:39 GMT
I am doing a trial run, starting with a sequence file that contains the following (this is
seqdumper output; I just made each key the same as its value):

Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
Key: first book nature specialword boxes: Value: first book nature specialword boxes
Key: fourth fake example with fake: Value: fourth fake example with fake
Key: second book fun: Value: second book fun
Key: third unique document item: Value: third unique document item
Key: fifth bag of words: Value: fifth bag of words
Count: 5


When I run
mahout seq2sparse -i /user/trial_01252012/processed_doc_trial/ -o /khuang/trial_01252012/keyword_Vectors_461_named
-ow -md 1 -a org.apache.lucene.analysis.WhitespaceAnalyzer -wt tf -seq -nv

And then dump the tokenized vectors:
mahout seqdumper -s /user/trial_01252012/vec_named/tf-vectors/part-r-00000

I see only three of my original documents:

Input Path: /user/khuang/trial_01252012/vec_named/tf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: first book nature specialword boxes: Value: org.apache.mahout.math.VectorWritable@e5d391d
Key: fourth fake example with fake: Value: org.apache.mahout.math.VectorWritable@e5d391d
Key: second book fun: Value: org.apache.mahout.math.VectorWritable@e5d391d
Count: 3


In addition, the dictionary is missing words. Is there a reason for this?
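
One way to check this by hand is to count the whitespace-separated tokens across the five input documents and see which ones would survive a corpus-wide minimum count. The sketch below is plain Python, not Mahout code. The threshold `min_support = 2` is an assumption for illustration only: seq2sparse does have a `--minSupport` (`-s`) option, but whether it is in effect with that value for the run above is not confirmed here.

```python
from collections import Counter

# The five input documents, copied from the seqdumper output above.
docs = [
    "first book nature specialword boxes",
    "fourth fake example with fake",
    "second book fun",
    "third unique document item",
    "fifth bag of words",
]

# Hypothetical threshold, assumed for illustration; not taken from the command above.
min_support = 2

# Corpus-wide count of every whitespace-separated token.
term_counts = Counter(tok for doc in docs for tok in doc.split())

# Tokens that occur at least min_support times in the whole corpus.
kept_terms = {t for t, c in term_counts.items() if c >= min_support}

# Documents that still contain at least one kept token.
surviving_docs = [d for d in docs if any(t in kept_terms for t in d.split())]

print(sorted(kept_terms))
print(surviving_docs)
```

Under that assumption only `book` and `fake` are kept, and the documents that contain them are the same three that show up in the tf-vectors dump. That is consistent with a minimum-count filter dropping the other words, but it does not prove that this is the cause.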



