mahout-user mailing list archives

From Matt Spitz <msp...@meebo-inc.com>
Subject Re: Getting mahout to run on the DFS
Date Fri, 29 Oct 2010 16:41:40 GMT
OK!  Further investigation!

Running with --method sequential works just fine, so my guess is that the
problem lies in how the clustersIn parameter is set differently in
clusterDataMR() and clusterDataSeq().

According to this line:
10/10/29 09:26:20 INFO kmeans.KMeansDriver: Input:
examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors Clusters In:
examples/bin/work/clusters/part-randomSeed Out:
examples/bin/work/reuters-kmeans Distance:
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

CLUSTERS_IN_OPTION is 'examples/bin/work/clusters/part-randomSeed'
Both start with:
Path clustersIn = new
Path(getOption(DefaultOptionsCreator.CLUSTERS_IN_OPTION));

And they proceed as follows:
*clusterDataSeq()*
KMeansUtil.configureWithClusterInfo(clustersIn, clusters);
... check to see if clusters is empty ...
... proceed with clustering ...

*clusterDataMR()*
conf.set(KMeansConfigKeys.CLUSTER_PATH_KEY, clustersIn.toString());
-- KMeansMapper.setup()
String clusterPath = conf.get(KMeansConfigKeys.CLUSTER_PATH_KEY);
KMeansUtil.configureWithClusterInfo(new Path(clusterPath), clusters);

I wonder if there might be something fishy with converting path to and from
a string?
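
To make that question concrete, here is a minimal sketch. It uses
java.nio.file.Path as a stand-in for Hadoop's Path (an assumption; the two
classes are different and may behave differently), and the two base
directories are hypothetical. It only illustrates that a relative path
survives the string round trip unchanged while still resolving to different
locations depending on the directory it is resolved against. It does not
show that this is the cause of the failure.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathRoundTrip {
    // Convert a path to a string and back, mimicking what happens when a
    // path is stored in a job configuration and re-read by a mapper.
    public static Path roundTrip(Path p) {
        return Paths.get(p.toString());
    }

    // Resolve a (possibly relative) path against a given base directory.
    public static Path resolveAgainst(String baseDir, Path p) {
        return Paths.get(baseDir).resolve(p).normalize();
    }

    public static void main(String[] args) {
        Path clustersIn = Paths.get("examples/bin/work/clusters/part-randomSeed");
        Path restored = roundTrip(clustersIn);

        // The string round trip itself preserves the relative path.
        System.out.println("round trip equal: " + clustersIn.equals(restored));

        // Hypothetical base directories: the same relative path points to
        // two different absolute locations.
        System.out.println("base A: " + resolveAgainst("/user/mspitz", restored));
        System.out.println("base B: " + resolveAgainst("/tmp/task-attempt", restored));
    }
}
```

If something like this were happening, printing the fully qualified path on
both the driver side and in KMeansMapper.setup() would show whether they
agree.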

Can you modify bin/mahout to print the classpath and the command it runs?

echo "Classpath: $HADOOP_CLASSPATH"
echo "Command:   $HADOOP_HOME/bin/hadoop jar $MAHOUT_JOB $CLASS $@"
exec "$HADOOP_HOME/bin/hadoop" jar $MAHOUT_JOB $CLASS "$@"

Yields:
Classpath: /home/mspitz/mahoutplayground/mahout-distribution-0.4/conf:
Command:   /usr/lib/hadoop-0.20/bin/hadoop jar
/home/mspitz/mahoutplayground/mahout-distribution-0.4/mahout-examples-0.4-job.jar
org.apache.mahout.driver.MahoutDriver clusterdump -s
examples/bin/work/reuters-kmeans/clusters-10 -d
examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 -dt
sequencefile -b 100 -n 20

Thanks,
Matt

On Thu, Oct 28, 2010 at 5:04 PM, Jeff Eastman <jeastman@narus.com> wrote:

> Have you checked the Hadoop logs? The only way I know to get more output
> would be to add some printouts to the code.
>
> -----Original Message-----
> From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> Sent: Thursday, October 28, 2010 11:00 AM
> To: user@mahout.apache.org
> Subject: Re: Getting mahout to run on the DFS
>
> Is there a way to get more/nicer output so we can track this down?
>
> On Thu, Oct 28, 2010 at 1:57 PM, Jeff Eastman <jeastman@narus.com> wrote:
>
> > I'm puzzled too. I just unzipped the same distribution and it ran without
> > issues on my CDH3 unicluster. I'm running as my own userId, not
> > hadoop-anything, on RHE Linux 5.3 64-bit VM on VMware.
> >
> > -----Original Message-----
> > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > Sent: Thursday, October 28, 2010 10:12 AM
> > To: user@mahout.apache.org
> > Subject: Re: Getting mahout to run on the DFS
> >
> > OK, using
> >
> > https://repository.apache.org/content/repositories/orgapachemahout-004/org/apache/mahout/mahout-distribution/0.4/mahout-distribution-0.4.zip
> >
> > Same error on a clean unzip.  Running 64-bit Linux CentOS 5.3.
> >
> > Running the lda command over hadoop yields results as expected.
> >  Puzzling...
> >
> > -Matt
> >
> > On Thu, Oct 28, 2010 at 12:52 PM, Jeff Eastman <jeastman@narus.com> wrote:
> >
> > > Don't recognize that zip. Can you try with the latest 0.4 RC at
> > > https://repository.apache.org/content/repositories/orgapachemahout-004/?
> > > I just ran that successfully on my CDH3 unicluster. What OS are you running?
> > > I'm on 64-bit Linux EL-5.
> > >
> > > -----Original Message-----
> > > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > > Sent: Thursday, October 28, 2010 9:45 AM
> > > To: user@mahout.apache.org
> > > Subject: Re: Getting mahout to run on the DFS
> > >
> > > I'm running mahout-distribution-0.4-20101027.194721-1.zip on CDH3.  Running
> > > examples/bin/build-reuters.sh from a clean unzip results in the same error.
> > > I definitely have read/write access to the DFS, as reuters-seqdir and
> > > reuters-seqdir-sparse have been created correctly.
> > >
> > > Running locally with a clean unzip is fine.  It's just the
> > > running-on-the-DFS part that breaks when we try to cluster.
> > >
> > > Thanks,
> > > Matt
> > >
> > > On Thu, Oct 28, 2010 at 12:23 PM, Jeff Eastman <jeastman@narus.com> wrote:
> > >
> > > > Maybe I missed it, but are you running on trunk? Can you run
> > > > examples/bin/build-reuters.sh out of the box? I'm running that successfully
> > > > on a CDH3 cluster logged-in as myself.
> > > >
> > > > Jeff
> > > >
> > > > -----Original Message-----
> > > > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > > > Sent: Thursday, October 28, 2010 9:01 AM
> > > > To: user@mahout.apache.org
> > > > Subject: Re: Getting mahout to run on the DFS
> > > >
> > > > Hm.  So, I'm running the cloudera hadoop distribution, and I'm running as a
> > > > hadoop user.
> > > >
> > > > Here's my script (with MAHOUT_HOME, HADOOP_HOME, and HADOOP_CONF_DIR
> > > > specified):
> > > > ./bin/mahout seqdirectory -i ../reuters-out -o reuters-seqdir -c UTF-8 -chunk 5
> > > >
> > > > ./bin/mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-sparse
> > > > ./bin/mahout kmeans -i reuters-seqdir-sparse/tf-vectors/ -o reuters-kmeans -k 20 --clustering --maxIter 10 --clusters reuters-clusters
> > > >
> > > > ../reuters-out is a local directory (see
> > > > https://issues.apache.org/jira/browse/MAHOUT-535)
> > > >
> > > > After running the third command, I see a non-empty reuters-clusters
> > > > directory on the DFS, so presumably the initial clusters are getting
> > > > created.
> > > >
> > > > These commands run fine in local mode, but no dice running on the DFS.  I
> > > > even copied the reuters-clusters directory from the DFS to my local machine,
> > > > hoping that mahout was looking there, but I still got the same error(s):
> > > > 10/10/28 08:57:55 INFO mapred.JobClient: Task Id : attempt_201008241139_108731_m_000000_2, Status : FAILED
> > > > java.lang.IllegalStateException: No clusters found. Check your -c path.
> > > >        at org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> > > >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> > > >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > > >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > > >        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > >
> > > > 10/10/28 08:58:05 INFO mapred.JobClient: Job complete:
> > > > job_201008241139_108731
> > > > 10/10/28 08:58:05 INFO mapred.JobClient: Counters: 4
> > > > 10/10/28 08:58:05 INFO mapred.JobClient:   Job Counters
> > > > 10/10/28 08:58:05 INFO mapred.JobClient:     Rack-local map tasks=2
> > > > 10/10/28 08:58:05 INFO mapred.JobClient:     Launched map tasks=4
> > > > 10/10/28 08:58:05 INFO mapred.JobClient:     Data-local map tasks=2
> > > > 10/10/28 08:58:05 INFO mapred.JobClient:     Failed map tasks=1
> > > > Exception in thread "main" java.lang.InterruptedException: K-Means Iteration failed processing reuters-clusters/part-randomSeed
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > > >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > >        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> > > >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > > >        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > > >        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > > >        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > > >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > > >        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > > >
> > > > As a sanity check:
> > > > [mspitz@wowzers mahout-distribution-0.4-SNAPSHOT]$ hadoop dfs -du reuters-clusters
> > > > Found 1 items
> > > > 8246        hdfs://192.168.1.100:54310/user/mspitz/reuters-clusters/part-randomSeed
> > > >
> > > > By the way, I really really appreciate the help.  Thank you so much.
> > > >
> > > > -Matt
> > > >
> > > > On Thu, Oct 28, 2010 at 10:01 AM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
> > > >
> > > > > With k-means, the initial clusters directory can either 1) contain some
> > > > > initial clusters you produced somehow (a common method is via Canopy) or 2)
> > > > > be empty. In the empty case, however, you also need to specify the number of
> > > > > initial clusters (-k) so that your input data can be sampled and the initial
> > > > > clusters put into the empty directory. Note that if you do 1) and also
> > > > > specify -k, your initial clusters will be overwritten by k sampled
> > > > > values from your input data.
> > > > >
> > > > >
> > > > > On 10/28/10 3:35 AM, pragnesh radadia wrote:
> > > > >
> > > > > >> are you using cloudera hadoop distribution ?
> > > > > >> if yes then run kmean using hadoop or hdfs user to solve your problem
> > > > > >>
> > > > > >>
> > > > > >> On Thu, Oct 28, 2010 at 3:45 AM, Matt Spitz<mspitz@meebo-inc.com> wrote:
> > > > >>
> > > > >>> Bug report created!  Thanks!
> > > > >>>
> > > > > >>> One more random question: when running kmeans, there's a required -c
> > > > > >>> (initial clusters) argument.  All the examples I've seen using kmeans
> > > > > >>> (namely https://issues.apache.org/jira/browse/MAHOUT-390) specify a
> > > > > >>> non-existent directory (presumably the algorithm would select some initial
> > > > > >>> random clusters).
> > > > > >>>
> > > > > >>> But, when specifying some initial, nonexistent clusters directory, I get a
> > > > > >>> bunch of:
> > > > > >>> 10/10/27 15:14:57 INFO mapred.JobClient: Task Id : attempt_201008241139_107461_m_000002_2, Status : FAILED
> > > > > >>> java.lang.IllegalStateException: No clusters found. Check your -c path.
> > > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:61)
> > > > > >>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> > > > > >>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > > > > >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > > > > >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > > >>>
> > > > >>> And the job eventually fails with:
> > > > > >>> Exception in thread "main" java.lang.InterruptedException: K-Means Iteration failed processing reuters-clusters/part-randomSeed
> > > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:342)
> > > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:289)
> > > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:214)
> > > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
> > > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > > > > >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > > > >>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> > > > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > > > >>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > > > > >>>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > > > > >>>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > > > > >>>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > > > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > >>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > > > >>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > > >>>        at java.lang.reflect.Method.invoke(Method.java:597)
> > > > > >>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > > > >>>
> > > > >>> Any thoughts on this one?
> > > > >>>
> > > > >>> Cheers,
> > > > >>> Matt
> > > > >>>
> > > > > >>> On Wed, Oct 27, 2010 at 5:39 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > > > > >>>
> > > > > >>>> On Wed, Oct 27, 2010 at 2:04 PM, Matt Spitz <mspitz@meebo-inc.com> wrote:
> > > > > >>>>
> > > > > >>>>> Ted- Awesome!  Thanks!  Running with mahout-0.4 sorted things out rather
> > > > > >>>>> nicely.
> > > > > >>>>
> > > > > >>>> There were lots of improvements there.
> > > > > >>>>
> > > > > >>>>> One thing that I find really weird is that 'mahout seqdirectory' always hits
> > > > > >>>>> the local filesystem for input, even when running in Hadoop mode.  So, if I
> > > > > >>>>> have 'myurls' on the DFS, but that path doesn't exist locally, seqdirectory
> > > > > >>>>> creates an empty sequence file (with no error).  Is this expected?
> > > > > >>>>
> > > > > >>>> No.  That sounds like a bug.  Can you file a report here:
> > > > > >>>> https://issues.apache.org/jira/browse/MAHOUT ?
> > > > > >>>>
> > > > > >>>>> Is there a nice way to create sequence files that isn't seqdirectory?  I'd
> > > > > >>>>> like to do a little processing on the documents as they get sent to the
> > > > > >>>>> sequence file without having to generate a second copy on the DFS.
> > > > > >>>>
> > > > > >>>> Sure.  Just snarf the code from the program in question and massage it as
> > > > > >>>> you like.  The command line versions are handy, but it is very common to need
> > > > > >>>> to customize.  At that point, the command line programs serve as example
> > > > > >>>> code.  You don't have to use them and they have no magic.
> > > > > >>>>
> > > > > >>>> If you think you have some improvements in generality, we can push them back
> > > > > >>>> into the Mahout versions.
> > > > >>>>
> > > > >
> > > >
> > >
> >
>
