mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Using several Mahout JarSteps in a JobFlow
Date Tue, 08 Feb 2011 16:46:04 GMT
Hi Thomas,

you can also use the parameter --tempDir to explicitly point a job to a
temp directory.
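
For example, each HadoopJarStep in your JobFlow could pass its own temp directory via --tempDir in its Args. This is only a sketch following your existing naming; the s3n paths are placeholders, not real locations:

```json
"Args": [
   "--input",     "s3n://recommendertest/data/<jobid>/aggregateWatched/",
   "--output",    "s3n://recommendertest/data/<jobid>/userRecommendations/",
   "--tempDir",   "s3n://recommendertest/data/<jobid>/tempRecommender/",
   "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
   "--numRecommendations",     "100"
]
```

Giving each step a different --tempDir keeps the two jobs from both writing into the default temp/ prefix (e.g. temp/itemIDIndex), which is what causes the FileAlreadyExistsException.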

By the way, I realize that our users shouldn't need to execute both
jobs like you do, because the similar-items computation is already
contained in RecommenderJob. We should add an option that makes it write
out the similar items in a usable form, so that running both jobs can be
avoided.

I'm gonna create a ticket for this.

--sebastian


On 08.02.2011 17:37, Sean Owen wrote:
> I would not run them in the same root directory / key prefix. Put them
> both under different namespaces.
> 
> On Tue, Feb 8, 2011 at 4:34 PM, Thomas Söhngen <thomas@beluto.com> wrote:
>> Hi fellow data crunchers,
>>
>> I am running a JobFlow with a step using
>> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" and a
>> following step using
>> "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step
>> works without problems, but the second one is throwing an Exception:
>>
>> Exception in thread "main"
>>  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
>> temp/itemIDIndex already exists and is not empty
>>        at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
>>        at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
>>        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>        at
>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at
>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>>
>>
>> It looks like the second job is using the same temporary output directories
>> as the first job. How can I avoid this? Or even better: If some of the tasks
>> are already done and cached in the first step, how could I use them so that
>> they don't have to be recomputed in the second step?
>>
>> Best regards,
>> Thomas
>>
>> PS: This is the actual JobFlow definition in JSON:
>>
>> [
>>   [......],
>>  {
>>    "Name": "MR Step 2: Find similar items",
>>    "HadoopJarStep": {
>>      "MainClass":
>> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
>>      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>      "Args": [
>>         "--input",
>> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>         "--output",    "s3n://recommendertest/data/<jobid>/similiarItems/",
>>         "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>>         "--maxSimilaritiesPerItem",    "100"
>>      ]
>>    }
>>  },
>>  {
>>    "Name": "MR Step 3: Find items for user",
>>    "HadoopJarStep": {
>>      "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
>>      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>      "Args": [
>>         "--input",
>> "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>         "--output",
>>  "s3n://recommendertest/data/<jobid>/userRecommendations/",
>>         "--similarityClassname",    "SIMILARITY_PEARSON_CORRELATION",
>>         "--numRecommendations",    "100"
>>      ]
>>    }
>>  }
>> ]
>>
>>
>>

