mahout-user mailing list archives

From Thomas Söhngen <tho...@beluto.com>
Subject Re: Using several Mahout JarSteps in a JobFlow
Date Tue, 08 Feb 2011 17:36:21 GMT
Hi Sebastian,

thank you very much, using the tempDir parameter fixed the problem.
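
For the archive, here is a minimal sketch of how the two steps might look with separate temp directories. It assumes --tempDir is passed like any other entry in "Args"; the directory names temp/itemSimilarity and temp/recommender are placeholders chosen for illustration, not the values actually used, and the earlier steps of the JobFlow are left out:

```json
[
  {
    "Name": "MR Step 2: Find similiar items",
    "HadoopJarStep": {
      "MainClass": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
      "Args": [
        "--input", "s3n://recommendertest/data/<jobid>/aggregateWatched/",
        "--output", "s3n://recommendertest/data/<jobid>/similiarItems/",
        "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
        "--maxSimilaritiesPerItem", "100",
        "--tempDir", "temp/itemSimilarity"
      ]
    }
  },
  {
    "Name": "MR Step 3: Find items for user",
    "HadoopJarStep": {
      "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
      "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
      "Args": [
        "--input", "s3n://recommendertest/data/<jobid>/aggregateWatched/",
        "--output", "s3n://recommendertest/data/<jobid>/userRecommendations/",
        "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
        "--numRecommendations", "100",
        "--tempDir", "temp/recommender"
      ]
    }
  }
]
```

With distinct temp directories, the second step should no longer find temp/itemIDIndex left behind by the first one.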

As you mentioned, it would be really nice if there were a single step
that outputs item recommendations for users as well as user-user and
item-item similarities. An alternative would be to split the
RecommenderJob class into separate jobs that rely on each other's
output. This would be even better for my case: I am using AWS
EMR and would have to copy the data out of HDFS manually if it
is not in the main output of the step, which would be much harder to
script.

Best regards,
Thomas

On 08.02.2011 17:46, Sebastian Schelter wrote:
> Hi Thomas,
>
> you can also use the parameter --tempDir to explicitly point a job to a
> temp directory.
>
> By the way, I recognize that our users shouldn't need to execute both
> jobs like you do, because the similar-items computation is already
> contained in RecommenderJob. We should add an option that makes it write
> out the similar items in a nice form, so we can avoid having to run both
> jobs.
>
> I'm gonna create a ticket for this.
>
> --sebastian
>
>
> On 08.02.2011 17:37, Sean Owen wrote:
>> I would not run them in the same root directory / key prefix. Put them
>> both under different namespaces.
>>
>> On Tue, Feb 8, 2011 at 4:34 PM, Thomas Söhngen <thomas@beluto.com> wrote:
>>> Hi fellow data crunchers,
>>>
>>> I am running a JobFlow with a step using
>>> "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" and a
>>> following step using
>>> "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step
>>> works without problems, but the second one is throwing an Exception:
>>>
>>> Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory temp/itemIDIndex already exists and is not empty
>>>         at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
>>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
>>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>>         at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>> It looks like the second job is using the same temporary output directories
>>> as the first job. How can I avoid this? Or even better: If some of the tasks
>>> are already done and cached in the first step, how could I use them so that
>>> they don't have to be recomputed in the second step?
>>>
>>> Best regards,
>>> Thomas
>>>
>>> PS: This is the actual JobFlow definition in JSON:
>>>
>>> [
>>>   [......],
>>>   {
>>>     "Name": "MR Step 2: Find similiar items",
>>>     "HadoopJarStep": {
>>>       "MainClass": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
>>>       "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>>       "Args": [
>>>         "--input", "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>>         "--output", "s3n://recommendertest/data/<jobid>/similiarItems/",
>>>         "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
>>>         "--maxSimilaritiesPerItem", "100"
>>>       ]
>>>     }
>>>   },
>>>   {
>>>     "Name": "MR Step 3: Find items for user",
>>>     "HadoopJarStep": {
>>>       "MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
>>>       "Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
>>>       "Args": [
>>>         "--input", "s3n://recommendertest/data/<jobid>/aggregateWatched/",
>>>         "--output", "s3n://recommendertest/data/<jobid>/userRecommendations/",
>>>         "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
>>>         "--numRecommendations", "100"
>>>       ]
>>>     }
>>>   }
>>> ]
>>>
