hadoop-mapreduce-user mailing list archives

From Mohammad Tariq <donta...@gmail.com>
Subject Re: WholeFileInputFormat format
Date Tue, 10 Jul 2012 20:27:18 GMT
Hello Harsh,

          I am sorry to pester you with questions, but I am rather
stuck. I have to write my MapReduce job such that the comparisons
between the outputs of the two mappers are made in order. I mean, I
have to read one line from the file and extract the desired fields
from that line in one mapper, and in the second mapper I have to read
the values from the HBase table and compare those values with the
fields read in the first mapper. I am wondering how to achieve that,
since the reducer phase will not start until all the mappers are done.
          Maybe a bit of elaboration on my use case will help in
understanding the problem better. I have a file that contains several
fields, and I have created columns for these fields in my HBase
table. I extract the value of each field from the file and store it in
the corresponding HBase column. Now, I have a 'support file' for the
same file whose values are already stored in HBase, but in a totally
different format. However, the order of fields in the original file
and the order of lines (containing the corresponding fields) in the
support file is exactly the same. So I am trying to read one line from
the support file and extract the field of interest in one mapper, read
the same field from the HBase table in the second mapper, and send
both values to the reducer, where the comparison will be made to
conclude the test.
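One way to preserve the ordering is to make both mappers emit the record's position as the join key, along the lines of the reducer-side join Harsh points to below. Here is a minimal sketch in plain Python that simulates the map, shuffle, and reduce phases (the field values and file contents are made up for illustration; this is not Hadoop API code):

```python
from collections import defaultdict

# Mapper 1: one (position, tagged value) pair per support-file line.
# Using the line position as the key is what keeps corresponding
# records together, even though the shuffle itself is unordered.
def file_mapper(lines):
    for line_no, line in enumerate(lines):
        yield line_no, ("file", line.strip())

# Mapper 2: one (position, tagged value) pair per HBase cell, assuming
# the rows can be enumerated in the same order as the support file.
def hbase_mapper(cells):
    for row_no, value in enumerate(cells):
        yield row_no, ("hbase", value)

# Stand-in for the shuffle phase: group all map outputs by key.
def shuffle(*map_outputs):
    groups = defaultdict(list)
    for output in map_outputs:
        for key, tagged in output:
            groups[key].append(tagged)
    return groups

# Reducer: each key now carries one value from each source; compare them.
def reduce_compare(groups):
    result = {}
    for key, tagged in groups.items():
        by_source = dict(tagged)  # e.g. {"file": ..., "hbase": ...}
        result[key] = by_source.get("file") == by_source.get("hbase")
    return result

support_file = ["alpha\n", "bravo\n"]   # made-up sample data
hbase_cells = ["alpha", "charlie"]      # made-up sample data
groups = shuffle(file_mapper(support_file), hbase_mapper(hbase_cells))
print(reduce_compare(groups))           # → {0: True, 1: False}
```

In real Hadoop code the same shape would presumably fall out of MultipleInputs with two mapper classes that both emit a record-position key and a source tag alongside each value, and a reducer that compares the tagged values per key; the Python above only illustrates the data flow.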
         Please help me out with your guidance; being a novice, I am
not able to tackle the situation on my own.

Many thanks.

    Mohammad Tariq

On Tue, Jul 10, 2012 at 8:34 PM, Harsh J <harsh@cloudera.com> wrote:
> I don't see why you'd have to use the WholeFileInputFormat for such a
> task. Your task is very similar to joins, and you can see the section
> "General reducer-side join" for what your overall logic should look
> like, under Ricky's
> http://horicky.blogspot.in/2010/08/designing-algorithmis-for-map-reduce.html
> article.
> On Tue, Jul 10, 2012 at 7:46 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>> Hello Harsh,
>>          Thank you so much for the quick response. Actually I have a
>> use case wherein I have to compare values coming from two mappers in
>> one reducer. For that I am planning to use the MultipleInputs
>> class. In one mapper I have a text file (these files may contain
>> 100,000 to 200,000 lines), and I have to extract bytes 2-13,
>> 20-25, 32-38, and so on from each line of this file. In the second
>> mapper I have to read values from an HBase table. The columns of this
>> table correspond to the fields I am reading from the text file
>> in the first mapper.
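The fixed-width extraction described above can be sketched in plain Python. Whether the byte offsets are 1-based and inclusive is an assumption on my part, since the message does not say how they are counted:

```python
# Byte ranges from the message (2-13, 20-25, 32-38), treated here as
# 1-based, inclusive positions -- an assumption, since the original
# message does not specify how the offsets are counted.
FIELD_RANGES = [(2, 13), (20, 25), (32, 38)]

def extract_fields(line, ranges=FIELD_RANGES):
    # Convert each 1-based inclusive range to a 0-based half-open slice.
    return [line[start - 1:end] for start, end in ranges]

sample = "0123456789" * 4  # a 40-character stand-in for one record
print(extract_fields(sample))  # → ['123456789012', '901234', '1234567']
```

The same slicing would run per line inside the text-file mapper, with each extracted field emitted against its position key.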
>>         In the reducer I have to compare the results coming from both
>> the mappers and generate the final result. I need your guidance. Many
>> thanks.
>> Regards,
>>     Mohammad Tariq
>> On Tue, Jul 10, 2012 at 6:55 PM, Harsh J <harsh@cloudera.com> wrote:
>>> It depends on what you need. If your file is not splittable, or if you
>>> need to read the whole file from a single mapper itself (i.e. you do
>>> not _want_ it to be split), then use a WholeFileInputFormat. Otherwise,
>>> you get more parallelism with regular splitting.
>>> On Tue, Jul 10, 2012 at 6:31 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>> Hello list,
>>>>        What could be the approximate maximum size of the files that
>>>> can be handled using WholeFileInputFormat? I mean, if the file
>>>> is very big, is it feasible to use WholeFileInputFormat, as the
>>>> entire load will go to one mapper? Many thanks.
>>>> Regards,
>>>>     Mohammad Tariq
>>> --
>>> Harsh J
> --
> Harsh J
