hadoop-mapreduce-user mailing list archives

From Pedro Magalhaes <pedror...@gmail.com>
Subject Re: CompositeInputFormat
Date Sat, 09 Aug 2014 19:10:01 GMT
And I got this from StackOverflow:
http://stackoverflow.com/questions/9467189/your-advice-on-a-hadoop-mapreduce-job


   1. Your third job is an interesting idea, but you have to be careful
   with CompositeInputFormat. There are two conditions that must be met for
   you to be able to use this input format. The first is that there has to be
   the same number of files in both input directories. This can be achieved by
   setting the same number of reducers for Job1 and Job2. The second
   condition is that the input files CANNOT be splittable. This can be
   achieved by using a non-splittable compression such as bzip.
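
The first condition can be illustrated with a small standalone sketch. It
assumes both jobs use default hash partitioning (the formula below mirrors
Hadoop's default HashPartitioner); the class name, method names and sample
keys are hypothetical and only serve the illustration. It checks whether
every key would land in the same partition index in both outputs.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: simulates which reducer output
// file (partition) a key ends up in under default hash partitioning.
class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // True if every key lands in the same partition index for both jobs.
    static boolean aligned(List<String> keys, int reducersJob1, int reducersJob2) {
        for (String key : keys) {
            if (partitionFor(key, reducersJob1) != partitionFor(key, reducersJob2)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e");
        // Same reducer count in both jobs: partitions line up.
        System.out.println(aligned(keys, 4, 4));
        // Different reducer counts: at least one key moves to another partition.
        System.out.println(aligned(keys, 4, 3));
    }
}
```

With equal reducer counts the partitions always line up; with different
counts a key such as "c" falls into different partition indexes, so the
join would pair the wrong files.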



On Sat, Aug 9, 2014 at 4:08 PM, Pedro Magalhaes <pedrorjbr@gmail.com> wrote:

> I forgot to mention that the quote is from Hadoop: The Definitive Guide.
>
>
> On Thu, Aug 7, 2014 at 6:04 PM, Pedro Magalhaes <pedrorjbr@gmail.com>
> wrote:
>
>> Thanks for the reply.
>>
>> What I am really trying to do is implement a map-side join. As I
>> understand it, the files must not be splittable, so that each map
>> processes partitions with the same keys.
>>
>> I saw in Hadoop: The Definitive Guide that I can force files not to be
>> split by setting the minimum split size to the maximum integer value.
>> Another option is to override the isSplitable method.
>>
>> Does that make sense?
>>
>> Sorry for the spelling mistakes. I am writing from my iPhone.
>>
>>
>>
>> On Thursday, August 7, 2014, Chris Douglas <cdouglas@apache.org>
>> wrote:
>>
>>> Is that quote from product documentation?
>>>
>>> Whether the output files are splittable is a practical consideration
>>> when setting up the join; the quote is identifying a common case that
>>> satisfies the constraints. The size of each partition is irrelevant,
>>> provided that the splits are generated consistently across all
>>> InputFormats involved in the expression (i.e., given datasets A,B in a
>>> join expression and a key K in A, K is in partition N iff K is in
>>> partition N for InputFormat B OR K is not in B). -C
>>>
>>> On Mon, Aug 4, 2014 at 1:36 PM, Pedro Magalhaes <pedrorjbr@gmail.com>
>>> wrote:
>>> > I saw that one of the requirements to use CompositeInputFormat is:
>>> > "A map-side join can be used to join the outputs of several jobs that
>>> > had the same number of reducers, the same keys, and output files that
>>> > are not splittable (by being smaller than an HDFS block, or by virtue
>>> > of being gzip compressed, for example)"
>>> >
>>> > So must my partition size be equal to or smaller than the HDFS block?
>>> >
>>> > If I have a 1 GB file (1024 MB), will I have 16 partitions of 64 MB?
>>> >
>>> > How can I control the size of the partition?
>>> >
>>> >
>>>
>>
>
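
The split arithmetic asked about in the original question, and the
minimum-split-size suggestion, can be sketched as follows. This is a
simplified standalone model, not Hadoop code: the split size rule
max(minSize, min(maxSize, blockSize)) mirrors the one used by
FileInputFormat, but the count below ignores the small slop Hadoop allows
on the last split. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of Hadoop: approximates how many input
// splits a single file would produce for given split-size settings.
class SplitSketch {

    // Split size rule as used by FileInputFormat.
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate split count: ceiling of fileSize / splitSize,
    // written without addition to avoid overflow for huge split sizes.
    static long splitCount(long fileSize, long splitSize) {
        return fileSize / splitSize + (fileSize % splitSize == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 1024 * mb;   // the 1 GB file from the question
        long blockSize = 64 * mb;    // 64 MB HDFS block

        // Default-like settings: one split per block.
        long normal = splitSize(1L, Long.MAX_VALUE, blockSize);
        System.out.println(splitCount(fileSize, normal));

        // Minimum split size raised to the maximum value: one split per file.
        long forced = splitSize(Long.MAX_VALUE, Long.MAX_VALUE, blockSize);
        System.out.println(splitCount(fileSize, forced));
    }
}
```

Under these assumptions the 1 GB file yields 16 splits with a 64 MB block,
and a single split once the minimum split size exceeds the file size, which
is the effect the non-splittable requirement is after.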
