From: Marcus Herou <marcus.he...@tailsweep.com>
Subject: Re: Parallell maps
Date: Fri, 03 Jul 2009 14:58:31 GMT
Hi.

Comments inline

Cheers
//Marcus
On Fri, Jul 3, 2009 at 4:48 PM, Steve Loughran <stevel@apache.org> wrote:

> Marcus Herou wrote:
>
>> Hi.
>>
>> This is my company so I reveal what I like, even though the board would
>> shoot me. But hey, do you think they are scanning this mailing list? :)
>>
>> The PR algo is very simple (but clever) and can be found on wikipedia:
>> http://en.wikipedia.org/wiki/PageRank
>> What is painful is calculating it in a distributed architecture. You will
>> never achieve your goal by using a DB to store the score per node and the
>> links from/to it (at least we did not).
>> We use plain Lucene indexes and 10 memcached servers to store the
>> intermediate scoring, and we run enough iterations for the scoring to
>> almost converge (it never converges completely).
>>
>>
> memcached? why not store the intermediate values in the MR FS?

Why do you think I chose memcached? It was not for the nice API... Doh!
Performance, of course. It beats the hell out of HDFS for small, temporarily
stored data (that's why it's called a cache and not a FS). By using memcached
we again get something which is not shared-nothing but is nicely distributed
and performs great for this workload. HBase cannot compete in the <1 ms
range either.
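
To give you an idea, here is a stripped-down sketch of what the
intermediate-score round trip looks like with the spymemcached client. The
class and the key scheme are made up for this mail, not our actual code:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.util.List;

    import net.spy.memcached.AddrUtil;
    import net.spy.memcached.MemcachedClient;

    // Hypothetical helper: intermediate PageRank scores keyed by page id.
    public class ScoreCache {

        private final MemcachedClient client;

        public ScoreCache(String servers) throws IOException {
            // e.g. "mc1:11211 mc2:11211 ... mc10:11211"
            List<InetSocketAddress> addrs = AddrUtil.getAddresses(servers);
            this.client = new MemcachedClient(addrs);
        }

        public void putScore(long pageId, double score) {
            // No expiry (0); each iteration simply overwrites the old value.
            client.set("pr:" + pageId, 0, Double.toString(score));
        }

        public double getScore(long pageId) {
            Object value = client.get("pr:" + pageId);
            return value == null ? 0.0 : Double.parseDouble((String) value);
        }

        public void close() {
            client.shutdown();
        }
    }

Sub-millisecond gets and sets, and losing a value just means recomputing it
on the next iteration, which is acceptable for intermediate data.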

>
>
> Here's Paolo Castagna's MR-based implementation (some logging omitted):
>
>  - first clean the data up
>  - count the files and give them all an initial score
>  - update the ranks, dealing with dangling pointers
>  - repeat until you are happy
>  - sort and display the output
>
> No DB, just HDFS
>
>        exec("Data cleanup", new CheckingData(), inpath, output);
>        //count the data
>        String countFile = outpath + File.separator + COUNT;
>        exec("Page count", new CountPages(), output, countFile);
>        String count = read(fs, countFile);
>        exec("InitializeRanks", new InitializeRanks(), output, input,
> count);
>        int iterations = 0;
>        String sortedRanksDir = outpath + File.separator +
>                SORTED_RANKS;
>        while (iterations < iterationLimit) {
>            String danglingFile = outpath + File.separator + DANGLING;
>            exec("DanglingPages", new DanglingPages(), input, danglingFile);
>            String dangling = read(fs, danglingFile);
>            exec("UpdateRanks", new UpdateRanks(), input, output, count,
> dangling);
>            overwrite(fs, new Path(output), new Path(input));
>            if ((iterations > CHECK_CONVERGENCE_FREQUENCY)
>                    && (iterations % CHECK_CONVERGENCE_FREQUENCY == 0)) {
>                String convergenceFile = outpath + File.separator +
> CONVERGENCE;
>                exec("CheckConvergence", new CheckConvergence(),
>                        input,
>                        convergenceFile);
>                double tolerance = Double.parseDouble(read(fs,
> convergenceFile));
>
>                if (tolerance <= toleranceArg) {
>                    //exit the loop when we are happy
>                    break;
>                }
>            }
>
>            iterations++;
>        }
>
>        exec("SortRanks", new SortRanks(), input, sortedRanksDir);


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
