hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: Identifying lines in map()
Date Mon, 30 Nov 2009 04:31:09 GMT
Hey Todd, James,

If you want to tweak this to reduce memory usage (please excuse my pseudo-python):

mapper:
  common_tokens = set()
  processed_first_line = False
  line_count = 0
  for each line:
    line_count += 1
    if processed_first_line == False:
      for each word in line:
        common_tokens.add( word )
      processed_first_line = True
    else:
      temp = set()
      for each word in line:
        temp.add( word )
      common_tokens = common_tokens.intersection( temp )
close:
  emit (null, line_count)
  for each word in common_tokens:
    emit (word, line_count)
combiner and reducer:
  Mostly as below.  Couldn't you avoid the post-processing step by selecting a "null" key
that is sorted as the first element and using only one reducer?  This way, you know the true
# of lines prior to processing "real" keys.
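Spelling the sketch out as runnable Python (the function names and the "\0" sort-first null key are my own choices; a real Hadoop Streaming job would read stdin and write tab-separated pairs, but plain functions make the logic easier to check):

```python
def mapper(lines):
    """Per input split: intersect the token sets of all lines, then emit
    each surviving token (plus a null marker) with the split's line count."""
    common_tokens = None
    line_count = 0
    for line in lines:
        line_count += 1
        tokens = set(line.split())
        if common_tokens is None:
            common_tokens = tokens       # first line seeds the set
        else:
            common_tokens &= tokens      # intersect with every later line
    out = [("\0", line_count)]           # "\0" sorts before any real token
    for word in sorted(common_tokens or ()):
        out.append((word, line_count))
    return out

def reducer(pairs):
    """Single reducer: the null key's total gives the true line count;
    emit only tokens whose summed count equals it."""
    totals = {}
    for key, count in pairs:
        totals[key] = totals.get(key, 0) + count
    total_lines = totals.pop("\0")
    return [w for w, c in sorted(totals.items()) if c == total_lines]
```

Feeding the merged output of several mapper calls (one per split) into a single reducer call yields exactly the tokens present on every line of the whole file.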

(I might have to look up in my trusty ol' data structure book to see if the computational
complexity of the map might be improved).

Basically, each mapper keeps track of only the tokens that appear on every line of its
given input split and emits just those, as opposed to emitting at least one record for
each unique token on each line.  That saves you some RAM and reduces the size of the
intermediate files, which can be a real killer for a CPU-light problem such as this one.
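To put rough numbers on the savings, here is a toy comparison using Jim's three-line example as a single split (the record counts are mine, not from the original post):

```python
split = ["Amy Sue Fred John", "Jack Joe Sue John", "Alice Bob Fred Sue John"]

# One record per unique token per line, plus one null record per line:
per_line = sum(len(set(l.split())) + 1 for l in split)

# One record per token common to every line of the split, plus one null:
common = set(split[0].split())
for l in split[1:]:
    common &= set(l.split())
per_split = len(common) + 1

print(per_line, per_split)   # 16 vs 3 records for this tiny split
```

The gap widens with longer splits, since the intersection can only shrink as more lines are processed while the per-line scheme grows with every line.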

Hope this helps,

Brian
On Nov 29, 2009, at 7:23 PM, Todd Lipcon wrote:

> Ah, I misunderstood.
> How about this?
> mapper:
>  for each word in line:
>   add word to a set()
>  for each word in set:
>    emit (word, 1)
>  emit (null, 1)
> combiner:
>  sum up input values (just like word count)
> reducer:
>  same as combiner, except you can avoid emitting any words which have a
> count less than any other count you previously emitted
> then post process - you have the "null" key which is the total line count.
> Any keys which have that same value appeared on every line.
> If you want to do away with the post process step, you might be able to do
> something clever involving using a counter in the map, and then reading that
> counter's value from the Reduce task. I've never tried it, but I *think* you
> should be able to get at your job's map counter values from the reduce
> execution (though not the combiner, necessarily). If you can do this, you'd
> know the total line count in the reducer and could avoid emitting any keys
> that aren't in every line.
> -Todd
> On Sun, Nov 29, 2009 at 5:18 PM, James R. Leek <leek2@llnl.gov> wrote:
>> Thanks Todd, but your code seems to check if a given token exists on every
>> line.  (Like the regex example.)  I want to find any tokens that exist on
>> every line.  So, give the input:
>> Amy Sue Fred John
>> Jack Joe Sue John
>> Alice Bob Fred Sue John
>> The output should be:
>> Sue
>> John
>> because Sue and John appear on every line.  I don't know Sue and John in
>> advance.
>> Thanks,
>> Jim
>> Todd Lipcon wrote:
>>> Hi James,
>>> Something like the following pseudocode:
>>> Mapper:
>>> configure:
>>>   set instance variable "seenOnlyMatches" = true
>>> map:
>>>   save OutputCollector
>>>   if current line doesn't match, set seenOnlyMatches to false
>>> close:
>>>   output a single record containing the value of seenOnlyMatches (and a
>>> null key)
>>>   super.close()
>>> Reducer:
>>> if any input records are false, output false. otherwise output true
>>> Make sense?
>>> -Todd
>>> On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <leek2@llnl.gov> wrote:
>>>> I want to use hadoop to discover if there is any token that appears in
>>>> every line of a file.  I thought that this should be pretty
>>>> straightforward,
>>>> but I'm having a heck of a time with it.  (I'm pretty new to hadoop.
>>>> I've
>>>> been using it for about two weeks.)
>>>> My original idea was to have the mapper produce every token as the key,
>>>> with the line number as the value.  But I couldn't find any InputFormat
>>>> that
>>>> would give me line numbers.
>>>> However, it seemed that FileInputFormat would give me the position in the
>>>> file as the key, and the line as the value.  I assume that the key would
>>>> be
>>>> the position in the file of the beginning of the line.  With that I could
>>>> have the token be the key, and the line position as the value, and use a
>>>> hash table in the reducer to determine if the token appeared in every
>>>> line.
>>>> However, I found that it actually seems to give the position of the
>>>> input
>>>> split.  I figured this out because, rather than getting 50,000 unique
>>>> keys
>>>> to the mapper (the number of lines in the file), I was getting 220 unique
>>>> keys.  (The number of mappers/input splits.)
>>>> So, what should I do?
>>>> Thanks,
>>>> Jim
