mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: RowSimilarity error
Date Thu, 12 Jul 2012 15:04:55 GMT
Sorry, I overlooked that it's more than one machine. Could you provide
the values of the counters from RowSimilarityJob (ROWS,
COOCCURRENCES, PRUNED_COOCCURRENCES)?
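
For reference, those counters are printed in the JobClient console output at
the end of the run. If that output is gone, something along these lines should
pull them back out; the job id below is a placeholder, and the counter group
name assumes the 0.8 package layout of RowSimilarityJob:

    for c in ROWS COOCCURRENCES PRUNED_COOCCURRENCES; do
      hadoop job -counter job_201207120001_0001 \
        'org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters' "$c"
    done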

Best,
Sebastian

2012/7/12 Pat Ferrel <pat@occamsmachete.com>:
> Thanks, actually there are two machines. I am testing before spending on
> AWS. It's failing the test in this case.
>
> BTW I ran the same setup with 150,000 docs and 250,000 terms and a much
> lower timeout (30000000), and it all worked fine. I was using 0.6 at the time,
> and I'm not sure 0.8 has ever completed a rowsimilarity of any size for me.
> Small runs work fine on my laptop.
>
> I suspect something other than simple performance is going on. In any case,
> in a perfect world isn't the code supposed to report progress often enough
> that the cluster config doesn't need to be tweaked for a specific job?
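
If the timeout really does need to be different for one job, it can normally
be set per invocation rather than in the cluster config, assuming the Mahout
driver passes Hadoop's generic -D options through. A sketch, with placeholder
paths and the value in milliseconds:

    mahout rowsimilarity \
      -Dmapred.task.timeout=6000000 \
      -i b2/vectors/tfidf-vectors \
      -o b2/rowsimilarity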
>
> It may be some problem of mine, of course. I see no obvious hadoop or mahout
> errors but there are many places to look.
>
> With a 100 minute timeout I am currently at the pause between map and
> reduce. If it fails would you like any specific logs?
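
If it does fail again, the per-attempt task logs are usually the most useful
thing to collect. A rough sketch of where to look on a Hadoop 1.x cluster; the
job id and the output path are placeholders:

    # job summary, including failed/killed task attempts
    hadoop job -status job_201207120001_0001
    hadoop job -history /path/to/job/output/dir

    # per-attempt stdout/stderr/syslog live on each tasktracker under
    # ${HADOOP_LOG_DIR}/userlogs/<job-id>/<attempt-id>/ and can also be
    # reached through the JobTracker web UI (port 50030 by default).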
>
>
> On 7/11/12 4:00 PM, Sebastian Schelter wrote:
>>
>> To be honest, I don't think it makes a lot of sense to test a Hadoop
>> job on a single machine. It's pretty obvious that you will get
>> terrible performance.
>>
>> 2012/7/12 Pat Ferrel <pat@occamsmachete.com>:
>>>
>>> BTW the timeout is 1800, but the task runs for over 9 hours in total before
>>> each failure. This causes the job to take 27 hrs (after three tries) to
>>> fail completely. Oh, bother...
>>>
>>> The timeout seems to hit during the last map, when the mappers have reached
>>> 100% but are still running. Maybe some kind of cleanup is happening?
>>> The first reducer is still "pending"; it never gets a chance to
>>> start.
>>>
>>> 12/07/11 11:09:45 INFO mapred.JobClient:  map 92% reduce 0%
>>> 12/07/11 11:11:06 INFO mapred.JobClient:  map 93% reduce 0%
>>> 12/07/11 11:12:51 INFO mapred.JobClient:  map 94% reduce 0%
>>> 12/07/11 11:15:22 INFO mapred.JobClient:  map 95% reduce 0%
>>> 12/07/11 11:18:43 INFO mapred.JobClient:  map 96% reduce 0%
>>> 12/07/11 11:24:32 INFO mapred.JobClient:  map 97% reduce 0%
>>> 12/07/11 11:27:40 INFO mapred.JobClient:  map 98% reduce 0%
>>> 12/07/11 11:30:53 INFO mapred.JobClient:  map 99% reduce 0%
>>> 12/07/11 11:36:35 INFO mapred.JobClient:  map 100% reduce 0%
>>> ---after a very long wait (9hrs or so) insert fail here--->
>>>
>>> An 8-core, 2-machine cluster with 8G RAM per machine; 32,000 docs, 76,000 terms.
>>>
>>> Any other info you need please ask.
>>>
>>> I'm about to try cranking the timeout up to a couple of hours, but I
>>> suspect there is something else going on here.
>>>
>>>
>>> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>>>
>>>> I have a custom Lucene stemming analyzer that filters out stop words,
>>>> and I use the following seq2sparse. The -x 40 is the only other thing
>>>> that affects tossing frequent terms; as I understand it, it tosses any
>>>> term that appears in over 40% of the docs.
>>>>
>>>> mahout seq2sparse \
>>>>      -i b2/seqfiles/ \
>>>>      -o b2/vectors/ \
>>>>      -ow \
>>>>      -chunk 2000 \
>>>>      -x 40 \
>>>>      -seq \
>>>>      -n 2 \
>>>>      -nv \
>>>>      -a com.finderbots.analyzers.LuceneStemmingAnalyzer
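
For reference, a rough gloss of those flags, based on the stock seq2sparse
options and worth double-checking against the Mahout build in use:

    # -i  b2/seqfiles/  : input SequenceFiles of <doc id, text>
    # -o  b2/vectors/   : output dir (tf-vectors/, tfidf-vectors/, dictionary chunks)
    # -ow               : overwrite any existing output
    # -chunk 2000       : dictionary chunk size in MB
    # -x  40            : maxDFPercent, drop terms appearing in more than 40% of docs
    # -seq              : write sequential-access sparse vectors
    # -n  2             : normalize each vector with the L2 norm
    # -nv               : emit named vectors
    # -a  <class>       : analyzer class used for tokenizing and stemming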
>>>>
>>>>
>>>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>>>
>>>>> Hi Pat,
>>>>>
>>>>> have you removed the highly frequent terms before launching the
>>>>> rowsimilarity job?
>>>>>
>>>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>>>
>>>>>> I've been trying to get a rowsimilarity job to complete. It continues
>>>>>> to time out on a RowSimilarityJob-CooccurrencesMapper-Reducer task, so
>>>>>> I've upped the timeout to 30 minutes now. There are no errors in the
>>>>>> logs that I can see, and no other task I've tried is acting like this.
>>>>>> Is this expected? Shouldn't the task check in more often?
>>>>>>
>>>>>> It's doing 34,000 docs with 40 sim docs each on 8 cores, so it is a bit
>>>>>> slow anyway. Still, I shouldn't have to turn up the timeout so high,
>>>>>> should I?
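
For what it's worth, an invocation matching that description ("40 sim docs
each") might look roughly like the sketch below. The paths, the column count,
and the choice of cosine similarity are illustrative assumptions, not taken
from the thread:

    # --numberOfColumns should match the term count of the corpus
    mahout rowsimilarity \
      -i b2/vectors/tfidf-vectors \
      -o b2/rowsimilarity \
      --numberOfColumns 76000 \
      --similarityClassname SIMILARITY_COSINE \
      --maxSimilaritiesPerRow 40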
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
