mahout-user mailing list archives

From Timothy Potter <thelabd...@gmail.com>
Subject Re: Question about storage in Pig-vector (Pig + Mahout)
Date Fri, 11 May 2012 19:40:39 GMT
Thanks for the help, Jake. Interfacing with other Mahout classes makes
sense. What is confusing is that the PigModelStorage class also seems to
produce a SequenceFile, i.e.:

public OutputFormat getOutputFormat() throws IOException {
        return new SequenceFileOutputFormat();
}

Maven couldn't resolve elephant-bird at the time I tried to build
pig-vector ... just tried again and am getting:

[ERROR] Failed to execute goal on project pig-vector: Could not resolve
dependencies for project pig-vector:pig-vector:jar:1.0: Could not find
artifact com.twitter:elephant-bird:jar:2.1.2 in central (
http://repo1.maven.org/maven2) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal on project pig-vector: Could not resolve dependencies for project
pig-vector:pig-vector:jar:1.0: Could not find artifact
com.twitter:elephant-bird:jar:2.1.2 in central (
http://repo1.maven.org/maven2)

So looking at the elephant-bird README, I see a mention of a Maven repo:
https://raw.github.com/kevinweil/elephant-bird/master/repo

That didn't work either :-(

[ERROR] Failed to execute goal on project pig-vector: Could not resolve
dependencies for project pig-vector:pig-vector:jar:1.0: Could not find
artifact com.twitter:elephant-bird:jar:2.1.2 in elephant-bird (
https://raw.github.com/kevinweil/elephant-bird/master/repo) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal on project pig-vector: Could not resolve dependencies for project
pig-vector:pig-vector:jar:1.0: Could not find artifact
com.twitter:elephant-bird:jar:2.1.2 in elephant-bird (
https://raw.github.com/kevinweil/elephant-bird/master/repo)
 at
org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies(LifecycleDependency

In any case, I got a 2.1.2 version to build, but when running the training
example, storing the model using elephant-bird fails. It looks to me
like the serialized model is corrupted somehow. I added some debug
statements to Classifier to see what it thinks it's serializing and
de-serializing in write and readFields respectively. There's clearly a
mismatch (see below). I've tried this with elephant-bird 2.1.2 and the
latest 2.2.3-SNAPSHOT from GitHub.

Here is the debug output from the code I added to Classifier:

In write(DataOutput dataOutput):
>> wrote size: 20
>> wrote category: alt.atheism
>> wrote category: comp.sys.mac.hardware
>> wrote category: rec.motorcycles
>> wrote category: sci.electronics
>> wrote category: talk.politics.guns
>> wrote category: comp.graphics
>> wrote category: comp.windows.x
>> wrote category: rec.sport.baseball
>> wrote category: sci.med
>> wrote category: talk.politics.mideast
>> wrote category: comp.os.ms-windows.misc
>> wrote category: misc.forsale
>> wrote category: rec.sport.hockey
>> wrote category: sci.space
>> wrote category: talk.politics.misc
>> wrote category: comp.sys.ibm.pc.hardware
>> wrote category: rec.autos
>> wrote category: sci.crypt
>> wrote category: soc.religion.christian
>> wrote category: talk.religion.misc
>> wrote model

so far so good ...

In readFields(DataInput dataInput):
>> read size: 2125682
>> read category: .apache.mahout.pig.Classifier

alt.atheismcomp.sys.mac.hardwarerec.motorcyclessci.electronicstal
comp.graphicscomp.windows.xrec.sport.baseballsci.medtalk.politics.mideastcomp.os.ms-win
>> read category: ows.misc
                          misc.forsalerec.sport.hockey
sci.spacetalk.politics.misccomp.sys.ibm.pc.hardware    rec.aut
>> read category: s    sci.cryptsoc.religion.christianta
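
One observation on the numbers above: 2125682 is 0x00206F72, i.e. the bytes 00 20 'o' 'r'. That is exactly what readInt() returns if the stream begins with the string "org.apache.mahout.pig.Classifier" written via writeUTF (a 2-byte length of 32, then the characters). So the reader may be starting a few dozen bytes too early, on a class-name header it does not expect. What writes that header is not established in this thread; the layout below is an assumption, and HeaderOffsetDemo and its methods are hypothetical names, not part of pig-vector or elephant-bird. A minimal sketch using only java.io:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of pig-vector or elephant-bird.
final class HeaderOffsetDemo {

    // ASSUMED wire layout: a class-name header (writeUTF), followed by the
    // payload that the debug output shows write() emitting (size, categories).
    static byte[] writeWithClassNameHeader(String className, String[] categories) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(className);          // 2-byte length + characters
            out.writeInt(categories.length);  // the "size" field
            for (String category : categories) {
                out.writeUTF(category);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the size from byte 0, without consuming the header first.
    static int readSizeIgnoringHeader(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Consumes the header, then reads the size from the correct offset.
    static int readSizeAfterHeader(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readUTF();
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = writeWithClassNameHeader(
                "org.apache.mahout.pig.Classifier",
                new String[] {"alt.atheism", "comp.graphics"});
        System.out.println("misaligned size: " + readSizeIgnoringHeader(data)); // 2125682
        System.out.println("aligned size:    " + readSizeAfterHeader(data));    // 2
    }
}
```

If this is what is happening, the fix would be on whichever side disagrees about the header: either the writer should not emit it, or the reader should consume it before readFields runs.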


Cheers,
Tim


On Fri, May 11, 2012 at 1:09 PM, Jake Mannix <jake.mannix@gmail.com> wrote:

> On Fri, May 11, 2012 at 11:38 AM, Timothy Potter <thelabdude@gmail.com>
> wrote:
>
> > I'm trying to run the simple 20-newsgroups example to train a Mahout
> > classifier using Pig and am unsure about the elephant-bird stuff.
> >
> > First, after battling with getting a build of elephant-bird,
>
>
> Why did you have to build it?  Aren't the jars available via maven?
>
>
> > the store to
> > SequenceFile didn't work for me. Then I saw the PigModelStorage and just
> > used that and it works just fine. Here is my script (with comments removed
> > for brevity):
> >
> > -- Train:
> >
> > register '.../target/pig-vector-1.0-jar-with-dependencies.jar';
> >
> > define train org.apache.mahout.pig.LogisticRegression('iterations=5,
> > inMemory=true, features=100000, categories=alt.atheism
> > comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns
> > comp.graphics comp.windows.x rec.sport.baseball sci.med
> > talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey
> > sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt
> > soc.religion.christian talk.religion.misc');
> >
> > docs = load '20news-bydate-train/*/*' using
> > org.apache.mahout.pig.MessageLoader()
> >    as (newsgroup, id:int, subject, body);
> >
> > define encodeVector org.apache.mahout.pig.encoders.EncodeVector('100000',
> > 'subject+body', 'group:word, article:numeric, subject:text, body:text');
> > vectors = foreach docs generate newsgroup, encodeVector(*) as v;
> >
> > grouped = group vectors all;
> >
> > model = foreach grouped generate 1 as key, train(vectors) as model;
> >
> > store model into 'pv-tmp/news_model2' using
> > org.apache.mahout.pig.PigModelStorage();
> >
> >
> > -- Eval:
> >
> > define evaluate
> > org.apache.mahout.pig.LogisticRegressionEval('sequence=pv-tmp/news_model2/part-r-00000,
> > key=1');
> > test = load '20news-bydate-test/*/*' using
> > org.apache.mahout.pig.MessageLoader()
> >    as (newsgroup, id:int, subject, body);
> > testvecs = foreach test generate newsgroup, encodeVector(*) as v;
> > describe testvecs;
> > evalvecs = foreach testvecs generate evaluate(v);
> >
> > dump evalvecs;
> >
> > ----
> >
> > So my main question is what does the elephant-bird model storage stuff do
> > that PigModelStorage doesn't?
> >
>
> SequenceFileStorage produces data in a format that many of the other Mahout
> utilities can read (they typically assume SequenceFiles of Text,
> IntWritable, and/or VectorWritable).
>
>
> >
> > Cheers,
> > Tim
> >
>
>
>
> --
>
>  -jake
>
