##### Site index · List index
Message view
Top
From Ted Dunning <tdunn...@veoh.com>
Subject Re: Speed up a job thats been running for 60+ hours (long)
Date Tue, 27 May 2008 17:44:14 GMT
```
Uhh... Emitting all combinations is kinda exponential in the number of
groups.  Even worse, if some user is a member of more than 30 groups, you
will have billions (literally) of records for that user.

Try emitting all *pairs* of groups for each user.  Then filter the pairs to
only retain "interesting" pairs.  Once you have a (restricted) list of
pairs, do the same sort of thing for interesting triples.

If you are fairly restrictive at each step, the output will stay very sparse
and you won't have to consider all of the long combinations that are doomed
to be boring because of the fact that sub-combinations are already known to
be boring.  This should be much, much faster if you are fierce about
restricting how combinations progress to the next level.

For interestingness, you can use simple thresholding on number of
occurrences (say, only promote if you find >=10 occurrences).  Slightly more
clever, but probably no faster would be to use a better statistical estimate
such as the log-likelihood ratio test that I am always touting and then
limit the number of sequences that are allowed to be promoted.

On 5/26/08 3:39 AM, "Yuri Kudryavcev" <mail@ykud.com> wrote:

> Hi.
>
> I really would like some input on this case, since I'm trying to scale up a
> similar algorithm.
>
> I can be totally wrong, please correct )
> So you're emitting C^2_n group pairs from every user record by going for
> group pairs?
> For a n = 100 groups for an average user -- that's an 4950 output records
> for every user. Do you see similar numbers in logs?
> I think increasing the intermediate bunch of records in this proportion
>
> - Yuri.
>
> On 5/26/08, jkupferman <jkupferman@umail.ucsb.edu> wrote:
>>
>>
>> Hi everyone,
>> I am using hadoop (17) to try and do some large scale user comparisons and
>> although the programs are all written, its taking incredibly long to run
>> and
>> it seems like it should be going faster. I would really like some insight
>> as
>> to what I could do to speed this up aside from just "add more computers". I
>> would really appreciate some help from all of the sagacious hadoop
>> core-users.
>>
>> The basic idea is that there are a bunch of users, each of which is some
>> groups. I would like to know how many users each combination of groups has
>> in common. I laid out the data using sequence files which seems to be
>> working well and quickly, each sequence file entry has a text user name and
>> a map writable which contains all of the groups they are in. The map
>> function takes in each user and the outputs all of the combinations of the
>> groups for which it is a part of and a 1 which is the instance counter(like
>> in wordcount). So user x which is a member of groups 1,2,3,4 will output
>> 1-2,1-3,1-4,2-3,2-4,3-4 as keys. Given that there are a lot of users, I
>> a collector which reduces the number of records about 10x. Reducer is
>> really
>> simple just sums up the total for each combination and then outputs it to a
>> file. Just as an aside, I make sure to use intwritables just about
>> everywhere which I hoped would help since there are inevitable tons of
>> comparisons going on.
>>
>> This is being done on about 4gb of user data on an 20 Large instance
>> cluster
>> on Amazons EC2. With that much data, there are about 240 map tasks and I
>> have it set to run 10 map tasks per task tracker. With those settings, the
>> slaves are running about 100% CPU and memory is just about capacity but is
>> almost no paging. Although the tasks seem to be progressing, some of the
>> tasks that have just completed have run for 30+ hours. Some of the tasks
>> have failed with a "Lost task tracker:" which I intend on fixing with
>> HADOOP-3403_0_20080516.patch, whenever this job finishes.
>>
>> It seemed to me that the problem might have been calling the collector so
>> many times since users can be in 1000's of groups and it does about n^2
>> comparisons. I tried another version which outputs only n times by having
>> each entry output a map, but this did not prove much better on the test
>> trials I ran, and the extra work in the reducer is really killer.
>>
>> It is not clear to me what is dragging down this job, or what I can do to
>> increase the rate at which it is computing. Although there is quite a bit
>> of
>> data, it doesnt seem like it should be taking this long on 20 nodes. Any
>> help/questions/comments would be greatly appreciated. Thanks for all of
>> your
>> help.
>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Speed-up-a-job-thats-been-running-for-60%2B-hours-%28lo
>> ng%29-tp17465721p17465721.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>

```
Mime
View raw message