hadoop-common-user mailing list archives

From Mike Kendall <mkend...@justin.tv>
Subject Re: Identifying lines in map()
Date Mon, 30 Nov 2009 17:47:26 GMT
this really seems like the kind of query that doesn't lend itself to
mapreduce very well...  i'd probably do some kind of mapreduce
abuse... (using mapreduce for distributed computation but for nothing
else)

map:
tokens = set of tokens in first line
for each line:
    make set of tokens in this line
    tokens = intersection(tokens, tokens_this_line)
print tokens

combiner: same as map

reduce: cat (identity, pass each line through unchanged)

this way each mapper reduces all of its input to one line holding the
tokens found in every line it saw.  then you can re-run the job on that
output until the set is small enough that you can run the last job on
one box.
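
a rough python sketch of the above, in case it helps. this only simulates
the rounds locally, it is not a real hadoop job: common_tokens and
run_round are made-up names, tokens are assumed to be whitespace-separated,
and each inner list stands in for one mapper's input split.

```python
def common_tokens(lines):
    """Return the set of tokens present in every line, or None if there are no lines."""
    tokens = None
    for line in lines:
        line_tokens = set(line.split())  # assumption: whitespace-separated tokens
        # The first line seeds the set; every later line can only shrink it.
        tokens = line_tokens if tokens is None else tokens & line_tokens
    return tokens


def run_round(splits):
    """Simulate one job: one mapper per split, identity ("cat") reducer."""
    output = []
    for split in splits:
        tokens = common_tokens(split)
        if tokens is not None:  # a mapper with no input emits nothing
            # An empty intersection still emits an empty line, so the next
            # round sees it and ends up with an empty result too.
            output.append(" ".join(sorted(tokens)))
    return output


if __name__ == "__main__":
    splits = [["a b c", "b c d"], ["c b e", "b c"]]
    round1 = run_round(splits)      # one line of surviving tokens per mapper
    print(common_tokens(round1))    # last pass, small enough for one box
```

since intersection is associative, it doesn't matter how the lines get
split across mappers or how many rounds you run.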

-mike

On Sun, Nov 29, 2009 at 8:37 PM, Owen O'Malley <omalley@apache.org> wrote:
>
> On Nov 29, 2009, at 5:00 PM, James R. Leek wrote:
>
>> I want to use hadoop to discover if there is any token that appears in every
>> line of a file.
>
> What I would do:
>
> map:
>   generate sorted list of tokens (dropping duplicates) for current line
>   if this is the first record:
>      previous = current token list
>   else:
>      iterate through both lists deleting any tokens from previous that aren't in the current
>
> in map close:
>  for each token:
>     emit token, 1
>
> in reduce:
>   if there are M, where M is the number of maps, values for the key:
>     emit token, null
>
> memory in map is limited to roughly double the size of each line, which in
> most non-insane data sets is totally fine. Processing for each line is N lg N
> in the number of tokens in that line. Everything else is linear in the size of
> the answer.
>
> -- Owen
>
>
>
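
for comparison, owen's quoted outline could be sketched locally like this.
again this is only a simulation, not hadoop api code: intersect_sorted,
map_task and reduce_all are made-up names, tokens are assumed to be
whitespace-separated, and num_maps stands for his M.

```python
from collections import defaultdict


def intersect_sorted(previous, current):
    """Walk two sorted, duplicate-free lists and keep tokens found in both."""
    out, i, j = [], 0, 0
    while i < len(previous) and j < len(current):
        if previous[i] == current[j]:
            out.append(previous[i])
            i += 1
            j += 1
        elif previous[i] < current[j]:
            i += 1  # token missing from current line: drop it
        else:
            j += 1  # token only in current line: skip it
    return out


def map_task(lines):
    """One map task: keep a running intersection, emit (token, 1) at close."""
    previous = None
    for line in lines:
        current = sorted(set(line.split()))  # sorted, duplicates dropped
        previous = current if previous is None else intersect_sorted(previous, current)
    return [(token, 1) for token in (previous or [])]


def reduce_all(map_outputs, num_maps):
    """Keep tokens that were emitted by every one of the num_maps map tasks."""
    counts = defaultdict(int)
    for output in map_outputs:
        for token, one in output:
            counts[token] += one
    # A map task with no input emits nothing, so no token can reach num_maps.
    return sorted(token for token, count in counts.items() if count == num_maps)


if __name__ == "__main__":
    splits = [["a b c", "b c d"], ["c b e", "b c"]]
    outputs = [map_task(split) for split in splits]
    print(reduce_all(outputs, num_maps=len(splits)))
```

this gets the answer in a single job, at the cost of the reducer needing to
know how many maps ran.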
