hadoop-mapreduce-user mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: Best practice for batch file conversions
Date Thu, 10 Feb 2011 05:42:51 GMT
I think you should be able to use the old API's JobConf class and set the
output format there.
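
For example, a rough old-API driver (untested sketch; ConversionDriver,
MyMapper, and MyMultipleFileOutputFormat are placeholders for your own
classes):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ConversionDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ConversionDriver.class);
    conf.setJobName("batch-file-conversion");
    conf.setMapperClass(MyMapper.class);   // your conversion mapper
    conf.setNumReduceTasks(0);             // map-only job
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    // Old-API hook: MultipleOutputFormat subclasses plug in here,
    // not into the new-API Job.setOutputFormatClass().
    conf.setOutputFormat(MyMultipleFileOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}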

Thanks and Regards,
Sonal
Connect Hadoop with databases, Salesforce, FTP servers and others
<https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>

On Thu, Feb 10, 2011 at 8:06 AM, felix gao <gre1600@gmail.com> wrote:

> I have another question regarding MultipleOutputFormat: it seems to exist
> only in the old API. When I call
> job.setOutputFormatClass(MyMultipleFileOutputFormat.class), it won't
> compile, with "The method setOutputFormatClass(Class<? extends
> OutputFormat>) in the type Job is not applicable for the arguments
> (Class<MyMultipleFileOutputFormat>)". I looked under
> mapred/org/apache/hadoop/mapreduce/lib/output and there is no new-API
> (0.20.2) version of MultipleOutputFormat. Should I just create my own
> MultipleOutputFormat by copying the functions from the old one and wiring
> them to use the FileOutputFormat under
> mapred/org/apache/hadoop/mapreduce/lib/output, or is there more magic
> under the hood than that?
>
> Felix
>
>
> On Wed, Feb 9, 2011 at 4:26 PM, felix gao <gre1600@gmail.com> wrote:
>
>> Sonal,
>>
>> can you tell me how to use MultipleOutputFormat in my Mapper? I want to
>> read a line of text, convert it to some other format, and then write it
>> back to HDFS using MultipleOutputFormat.
>>
>> Thanks,
>>
>> Felix
>>
>>
>> On Tue, Feb 8, 2011 at 6:05 PM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
>>
>>> You can check out MultipleOutputFormat for this.
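>>> For example, something along these lines (just an untested sketch, using
>>> the class name from your other mail): subclass MultipleTextOutputFormat
>>> and have it route each record to a file named by the key the mapper
>>> emits:
>>>
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
>>>
>>> // Writes each record to the file named by its key, so a mapper that
>>> // emits key = "dir1/file1.done" recreates the input tree under the
>>> // job's output directory.
>>> public class MyMultipleFileOutputFormat
>>>     extends MultipleTextOutputFormat<Text, Text> {
>>>   @Override
>>>   protected String generateFileNameForKeyValue(Text key, Text value,
>>>       String name) {
>>>     return key.toString(); // e.g. "dir1/file1.done"
>>>   }
>>> }
>>>
>>> In the old API your mapper can read the current input file from
>>> conf.get("map.input.file") in configure() and build such a key from it.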
>>> Thanks and Regards,
>>> Sonal
>>> Connect Hadoop with databases, Salesforce, FTP servers and others
>>> <https://github.com/sonalgoyal/hiho>
>>> Nube Technologies <http://www.nubetech.co>
>>>
>>> <http://in.linkedin.com/in/sonalgoyal>
>>>
>>> On Wed, Feb 9, 2011 at 5:22 AM, felix gao <gre1600@gmail.com> wrote:
>>>
>>>> I am stuck again. The binary files are stored in hdfs under some
>>>> pre-defined structure like
>>>> root/
>>>> |-- dir1
>>>> |   |-- file1
>>>> |   |-- file2
>>>> |   `-- file3
>>>> |-- dir2
>>>> |   |-- file1
>>>> |   `-- file3
>>>> `-- dir3
>>>>     |-- file2
>>>>     `-- file3
>>>>
>>>> After I process them in my mapper using a non-splittable InputFormat, I
>>>> would like to store the files back into HDFS like
>>>> processed/
>>>> |-- dir1
>>>> |   |-- file1.done
>>>> |   |-- file2.done
>>>> |   `-- file3.done
>>>> |-- dir2
>>>> |   |-- file1.done
>>>> |   `-- file3.done
>>>> `-- dir3
>>>>     |-- file2.done
>>>>     `-- file3.done
>>>>
>>>> can someone please show me how to do this?
>>>>
>>>> thanks,
>>>>
>>>> Felix
>>>>
>>>> On Tue, Feb 8, 2011 at 9:43 AM, felix gao <gre1600@gmail.com> wrote:
>>>>
>>>>> thanks a lot for the pointer. I will play around with it.
>>>>>
>>>>>
>>>>> On Mon, Feb 7, 2011 at 10:55 PM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> You can use FileStreamInputFormat, which returns the file stream as
>>>>>> the value.
>>>>>>
>>>>>>
>>>>>> https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input
>>>>>>
>>>>>> You need to remember that you lose data locality by trying to
>>>>>> manipulate the file as a whole, but in your case, the requirement
>>>>>> probably demands it.
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Sonal
>>>>>> Connect Hadoop with databases, Salesforce, FTP servers and others
>>>>>> <https://github.com/sonalgoyal/hiho>
>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>
>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>
>>>>>> On Tue, Feb 8, 2011 at 8:59 AM, Harsh J <qwertymaniac@gmail.com> wrote:
>>>>>>
>>>>>>> Extend FileInputFormat, and write your own binary-format based
>>>>>>> implementation of it, and make it non-splittable (isSplitable should
>>>>>>> return false). This way, a Mapper would get a whole file, and you
>>>>>>> shouldn't have block-splitting issues.
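>>>>>>>
>>>>>>> A rough sketch of the idea (old API, untested; adjust the key/value
>>>>>>> types and class names to your format):
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import org.apache.hadoop.fs.FSDataInputStream;
>>>>>>> import org.apache.hadoop.fs.FileSystem;
>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>> import org.apache.hadoop.io.BytesWritable;
>>>>>>> import org.apache.hadoop.io.IOUtils;
>>>>>>> import org.apache.hadoop.io.Text;
>>>>>>> import org.apache.hadoop.mapred.FileInputFormat;
>>>>>>> import org.apache.hadoop.mapred.FileSplit;
>>>>>>> import org.apache.hadoop.mapred.InputSplit;
>>>>>>> import org.apache.hadoop.mapred.JobConf;
>>>>>>> import org.apache.hadoop.mapred.RecordReader;
>>>>>>> import org.apache.hadoop.mapred.Reporter;
>>>>>>>
>>>>>>> // Non-splittable input format: each map task gets one whole file
>>>>>>> // as a single (path, bytes) record.
>>>>>>> public class WholeFileInputFormat
>>>>>>>     extends FileInputFormat<Text, BytesWritable> {
>>>>>>>   @Override
>>>>>>>   protected boolean isSplitable(FileSystem fs, Path file) {
>>>>>>>     return false; // never split, so one mapper per file
>>>>>>>   }
>>>>>>>
>>>>>>>   @Override
>>>>>>>   public RecordReader<Text, BytesWritable> getRecordReader(
>>>>>>>       InputSplit split, JobConf job, Reporter reporter)
>>>>>>>       throws IOException {
>>>>>>>     return new WholeFileRecordReader((FileSplit) split, job);
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> // Reads the entire file into memory and hands it out as one record.
>>>>>>> class WholeFileRecordReader
>>>>>>>     implements RecordReader<Text, BytesWritable> {
>>>>>>>   private final FileSplit split;
>>>>>>>   private final JobConf job;
>>>>>>>   private boolean processed = false;
>>>>>>>
>>>>>>>   WholeFileRecordReader(FileSplit split, JobConf job) {
>>>>>>>     this.split = split;
>>>>>>>     this.job = job;
>>>>>>>   }
>>>>>>>
>>>>>>>   public boolean next(Text key, BytesWritable value)
>>>>>>>       throws IOException {
>>>>>>>     if (processed) return false;
>>>>>>>     byte[] contents = new byte[(int) split.getLength()];
>>>>>>>     Path file = split.getPath();
>>>>>>>     FileSystem fs = file.getFileSystem(job);
>>>>>>>     FSDataInputStream in = null;
>>>>>>>     try {
>>>>>>>       in = fs.open(file);
>>>>>>>       IOUtils.readFully(in, contents, 0, contents.length);
>>>>>>>     } finally {
>>>>>>>       IOUtils.closeStream(in);
>>>>>>>     }
>>>>>>>     key.set(file.toString());           // full input path as the key
>>>>>>>     value.set(contents, 0, contents.length);
>>>>>>>     processed = true;
>>>>>>>     return true;
>>>>>>>   }
>>>>>>>
>>>>>>>   public Text createKey() { return new Text(); }
>>>>>>>   public BytesWritable createValue() { return new BytesWritable(); }
>>>>>>>   public long getPos() { return processed ? split.getLength() : 0; }
>>>>>>>   public float getProgress() { return processed ? 1.0f : 0.0f; }
>>>>>>>   public void close() throws IOException { }
>>>>>>> }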
>>>>>>>
>>>>>>> On Tue, Feb 8, 2011 at 6:37 AM, felix gao <gre1600@gmail.com> wrote:
>>>>>>> > Hello users of hadoop,
>>>>>>> > I have a task to convert large binary files from one format to
>>>>>>> > another. I am wondering what is the best practice to do this.
>>>>>>> > Basically, I am trying to get one mapper to work on each binary
>>>>>>> > file, and I am not sure how to do that properly in hadoop.
>>>>>>> > thanks,
>>>>>>> > Felix
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Harsh J
>>>>>>> www.harshj.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
