mahout-user mailing list archives

From: Liz Merkhofer <lmerkho...@bericotechnologies.com>
Subject: Re: NaN in cvb topic models after lucene2seq
Date: Thu, 25 Jul 2013 16:07:37 GMT
Thanks so much for your response, Suneel.

Unfortunately, the Solr index is not mine to post. But short of that, are
there any useful answers I can provide? At the time I ran this, it
contained 70,000 documents... I'm adding several times that today, though.

I tried lucene2seq again.

Running with the MapReduce default, the directory it creates contains
_SUCCESS      part-m-00003  part-m-00007  part-m-00011
part-m-00000  part-m-00004  part-m-00008  part-m-00012
part-m-00001  part-m-00005  part-m-00009  part-m-00013
part-m-00002  part-m-00006  part-m-00010  part-m-00014

With -xm sequential, however, it creates only "index."

Looking at part-m-00014 or index, I see about the same thing: a header like

SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^@^@^@^@^@^@ua<80>yäØQõ-ãe<93>n5<9d>¡^@^@^C)^@^@^@^A^@<8e>^C%(

And then the concatenated text of (all?) my documents.

When I run "rowid," I get

13/07/25 09:45:19 INFO vectors.RowIdJob: Wrote out matrix with 1 rows and
465540 columns to /tmp/cvb/rowidout/matrix
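
For what it's worth, here's the rough sort of check I have in mind to see
whether everything really landed in a single record (plain Hadoop
SequenceFile.Reader, untested; the header dump above suggests both key and
value classes are Text, and the path is just one of my part files):

// Rough sketch: count the records in one lucene2seq part file, since a
// single giant <Text, Text> record would explain the 1-row matrix from rowid.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CountLucene2SeqRecords {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path("/tmp/cvb/lucene2seqout/part-m-00014");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Text key = new Text();
    Text value = new Text();
    int records = 0;
    while (reader.next(key, value)) {
      records++;
      System.out.println("key: " + key + ", value length: " + value.getLength());
    }
    reader.close();
    System.out.println("total records: " + records);  // hoping for one per document
  }
}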


In comparison, I'm working off the closest example I could find, from the
book Hadoop MapReduce Cookbook (page in Safari Books Online:
http://goo.gl/n3YVCz). Running seqdirectory on their sample, a directory
containing data from 20 newsgroups, my output is called part-m-00000 and
looks like

SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^A^@*org.apache.hadoop.io.compress.DefaultCodec^@^@^@^@<8a>FA4ëÇ"Fª>þ^H^_-¯^@^@^WÇ^@^@^@^S^R/alt.atheism/49960x<9c><8d>Z]W"K²}¯_<91><87>

etc. When that gets to the point of running rowid, I get

13/07/25 10:44:45 INFO vectors.RowIdJob: Wrote out matrix with 19997 rows
and 193659 columns to tmp/20news/int/matrix

where those approximately 20,000 rows are plausibly each a document in the
20news dataset.

It seems to me, then, that lucene2seq is the culprit. Maybe the best
solution will be falling back on lucene.vector:

./mahout lucene.vector --dir <path to solr data>/index --output
/tmp/lv-cvb/luceneout --field textbody_en --dictOut /tmp/lv-cvb/lucenedict
--idField docid --norm 2 --weight TF --seqDictOut /tmp/lv-cvb/seqDictOut -x 70

The output did look appropriately garbled.

However, rowid doesn't like the output from lucene.vector
("java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
org.apache.hadoop.io.IntWritable"), and crossing my fingers and skipping
rowid altogether runs into a similar problem with a LongWritable key
("java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
cast to org.apache.hadoop.io.IntWritable").

My commands:
./mahout rowid -i /tmp/lv-cvb/luceneout  -o /tmp/lv-cvb/matrix

./mahout cvb -i /tmp/lv-cvb/luceneout -o /tmp/lv-cvb/out -k 20 -x 10 -dict
/tmp/lv-cvb/seqDictOut -dt /tmp/lv-cvb/topics -mt /tmp/lv-cvb/model
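
In case it helps to show what I mean, here is the rough, untested re-keying
I'm picturing as a workaround (my assumption, going by those exceptions, is
that cvb wants sequential IntWritable row keys and that the lucene.vector
output holds VectorWritable values; the paths are the ones from my commands
above):

// Rough sketch, not a tested fix: copy the vectors into a new sequence file
// keyed by sequential IntWritable row ids, ignoring the original key type.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.VectorWritable;

public class RekeyVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path("/tmp/lv-cvb/luceneout"), conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/lv-cvb/intkeyed"),
        IntWritable.class, VectorWritable.class);
    // Whatever the original key type is (Text? LongWritable?), drop it and
    // replace it with a sequential int row id.
    Writable oldKey =
        (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    VectorWritable vector = new VectorWritable();
    IntWritable row = new IntWritable();
    int i = 0;
    while (reader.next(oldKey, vector)) {
      row.set(i++);
      writer.append(row, vector);
    }
    reader.close();
    writer.close();
  }
}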

Is there something I'm missing?

Thank you,
Liz


On Thu, Jul 25, 2013 at 12:20 AM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:

> Liz,
>
> lucene2seq was a recent addition to Mahout 0.8 and it's good that you are
> taking this for a test drive and reporting issues.
> In order to troubleshoot this:
>
> a) Could you try running lucene2seq with the '-xm sequential' option and
> verify the output? The default option now is MapReduce and I am trying to
> determine if the issue is with the MapReduce version or if it's something
> more basic.
> b) Is it possible for you to post your Solr index to these forums? I can
> take a stab at it and see what's wrong.
>
> Suneel
>
>
>
>
> ________________________________
>  From: Liz Merkhofer <lmerkhofer@bericotechnologies.com>
> To: user@mahout.apache.org
> Sent: Wednesday, July 24, 2013 5:07 PM
> Subject: NaN in cvb topic models after lucene2seq
>
>
> Hello list,
>
> I'm having some problems using cvb (now that lda is deprecated) on my
> Lucene (or Solr, if you will) index. I am using Mahout 0.8.
>
> My workflow is lucene2seq -> seq2sparse -> rowid -> cvb. Everything seems
> to be working, until all my topics come out (via seqdumper) as NaN, like:
>
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.math.VectorWritable
> Key: 0: Value:
>
> {0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN,20:NaN,21:NaN,22:NaN,
>
> ... etc. I suspect my problem is in the output of lucene2seq, which is a
> folder of 14 files called part-m-000xx that look very much like the
> text in my Lucene index and nothing like the unreadable jumble I would get
> from 'seqdirectory' on an actual directory of text files.
>
> If it helps, here's how I'm doing this:
>
> ./mahout lucene2seq -o /tmp/cvb/lucene2seqout -i <path to my solr
> data>index -id docId -f textbody_en
>
> ./mahout seq2sparse -i /tmp/cvb/lucene2seqout -o /tmp/cvb/seq2sparseout
> --namedVector --maxDFPercent 70 --weight TF -n 2 -a
> org.apache.lucene.analysis.core.WhitespaceAnalyzer
>
> ./mahout rowid -i /tmp/cvb/seq2sparseout/tf-vectors -o /tmp/cvb/rowidout
>
> ./mahout cvb -i /tmp/cvb/rowidout/matrix -o /tmp/cvb/out -k 200 -x 30 -dict
> /tmp/cvb/seq2sparseout/dictionary.file-0 -dt /tmp/cvb/topics -mt
> /tmp/cvb/model
>
> Any thoughts?
>
> Thank you,
> Liz
>
