hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aaron Kimball" <aa...@cloudera.com>
Subject Re: Suggestions of proper usage of "key" parameter ?
Date Mon, 15 Dec 2008 12:12:41 GMT
To expand a bit on Owen's remarks:

It should be pointed out that in the case of a single MapReduce pass from an
input dataset to an output dataset, the keys into the mapper and out of the
reducer may not be particularly interesting to you. However, more
complicated algorithms often involve multiple MapReduce jobs, where an input
dataset has an MR pass over it, yielding an intermediate dataset which
undergoes yet another MR pass, yielding a final dataset.

In such cases, this provides continuity of interface, where every stage has
both "key" and "value" components. Very often the dataset gets partitioned
or organized in such a way that it makes sense to stamp a key on to values
early on in the process, and continue to allow the keys to flow through the
system between passes. This interface makes that much more convenient. For
example, you may have an input record which is joined against multiple other
datasets. Each other data set join may involve a separate mapreduce pass,
but the primary key will be the same the whole time.

As for determining the input key types: The default TextInputFormat gives
you a line offset and a line of text. This offset may not be particularly
useful to your application.  On the other hand, the KeyValueTextInputFormat
will read each line of text from the input file, and split this into a key
Text object and a value Text object based on the first tab char in the line.
This matches the formatting of output files done by the default
TextOutputFormat.  Chained MRs should set the input format to this one, as
Hadoop won't "know" that this is your intended use case.

If your intermediate reducer outputs more complicated datatypes, you may
want to use SequenceFileOutputFormat, which marshals your data types into a
binary file format. The SequenceFileInputFormat will then read in the data,
and demarshal it into the same data types that you had already encoded.
(Your final pass may want to use TextOutputFormat to get everything back to
a human-readable form)

- Aaron

On Sun, Dec 14, 2008 at 11:39 PM, Owen O'Malley <omalley@apache.org> wrote:

> On Dec 14, 2008, at 4:47 PM, Ricky Ho wrote:
>  Yes, I am referring to the "key" INPUT INTO the map() function and the
>> "key" EMITTED FROM the reduce() function.  Can someone explain why do we
>> need a "key" in these cases and what is the proper use of it ?
> It was a design choice and could have been done as:
> R1 -> map -> K,V -> reduce -> R2
> instead of
> K1,V1 -> map -> K2,V2 -> reduce -> K3,V3
> but since the input of the reduce is sorted on K2, the output of the reduce
> is also typically sorted and therefore keyed. Since jobs are often chained
> together, it makes sense to make the reduce input match the map input. Of
> course everything you could do with the first option is possible with the
> second using either K1 = R1 or V1 = R1. Having the keys is often
> convenient...
>  Who determines what the "key" should be ?  (by the corresponding
>> "InputFormat" implementation class) ?
> The InputFormat makes the choice.
>  In this case, what is the key in the map() call ?  (name of the input
>> file) ?
> TextInputFormat uses the byte offset as the key and the line as the value.
>  What if the reduce() function emits multiple <key, value> entries or not
>> emitting any entry at all ?  Is this considered OK ?
> Yes.
>  What if the reduce() function emits a <key, value> entry whose key is not
>> the same as the input key parameter to the reduce() function ?  Is this OK ?
> Yes, although the reduce output is not re-sorted, so the results won't be
> sorted unless K3 is a subset of K2.
>  If there is a two Map/Reduce cycle chained together.  Is the "key" input
>> into the 2nd round map() function determined by the "key" emitted from the
>> 1st round reduce() function ?
> Yes.
> -- Owen

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message