hadoop-mapreduce-user mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: Best practice for batch file conversions
Date Wed, 09 Feb 2011 02:05:29 GMT
You can check out MultipleOutputFormat for this.
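One hedged sketch of that suggestion, against the old org.apache.hadoop.mapred API that MultipleOutputFormat belongs to: subclass a concrete variant and override generateFileNameForKeyValue to pick the output file name per record. The Text/BytesWritable types, and the convention that the mapper emits each source file's relative path (e.g. "dir1/file1") as the key, are assumptions for illustration, not from the thread.

```java
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleSequenceFileOutputFormat;

// Routes each record to its own output file based on the key.
// Assumes the mapper emitted the source file's relative path as the key,
// so "dir1/file1" ends up in <job output dir>/dir1/file1.done.
public class DoneFileOutputFormat
        extends MultipleSequenceFileOutputFormat<Text, BytesWritable> {

    @Override
    protected String generateFileNameForKeyValue(Text key,
                                                 BytesWritable value,
                                                 String name) {
        // name is the default part file name (e.g. "part-00000"); we
        // replace it with a path derived from the key instead.
        return key.toString() + ".done";
    }
}
```

You would register it with conf.setOutputFormat(DoneFileOutputFormat.class). Note this variant writes SequenceFiles; emitting raw converted bytes would need a custom record writer as well.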
Thanks and Regards,
Sonal
<https://github.com/sonalgoyal/hiho>Connect Hadoop with databases,
Salesforce, FTP servers and others <https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>

On Wed, Feb 9, 2011 at 5:22 AM, felix gao <gre1600@gmail.com> wrote:

> I am stuck again. The binary files are stored in HDFS under some
> pre-defined structure like
> root/
> |-- dir1
> |   |-- file1
> |   |-- file2
> |   `-- file3
> |-- dir2
> |   |-- file1
> |   `-- file3
> `-- dir3
>     |-- file2
>     `-- file3
>
> after I process them in my mapper using a non-splittable InputFormat, I
> would like to store the files back into HDFS like
> processed/
> |-- dir1
> |   |-- file1.done
> |   |-- file2.done
> |   `-- file3.done
> |-- dir2
> |   |-- file1.done
> |   `-- file3.done
> `-- dir3
>     |-- file2.done
>     `-- file3.done
>
> can someone please show me how to do this?
>
> thanks,
>
> Felix
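Whatever output mechanism is used, the root/... to processed/....done renaming itself is just a path rewrite. A minimal illustrative helper (the root/ and processed/ names and the .done suffix come from the trees above; the helper itself is not from the thread):

```java
// Illustrative helper: mirror an input path under one root into the
// corresponding output path under another root, adding a ".done" suffix.
// e.g. "root/dir1/file1" -> "processed/dir1/file1.done"
public class DonePaths {

    static String toDonePath(String input, String inRoot, String outRoot) {
        if (!input.startsWith(inRoot + "/")) {
            throw new IllegalArgumentException(input + " is not under " + inRoot);
        }
        // Keep the relative part ("/dir1/file1"), re-root it, append suffix.
        return outRoot + input.substring(inRoot.length()) + ".done";
    }

    public static void main(String[] args) {
        System.out.println(toDonePath("root/dir1/file1", "root", "processed"));
        // prints processed/dir1/file1.done
    }
}
```

In a mapper you would typically recover the input path from the task's FileSplit and feed its root-relative part through a rewrite like this when choosing the output location.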
>
> On Tue, Feb 8, 2011 at 9:43 AM, felix gao <gre1600@gmail.com> wrote:
>
>> thanks a lot for the pointer. I will play around with it.
>>
>>
>> On Mon, Feb 7, 2011 at 10:55 PM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> You can use FileStreamInputFormat which returns the file stream as the
>>> value.
>>>
>>>
>>> https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input
>>>
>>> You need to remember that you lose data locality by trying to manipulate
>>> the file as a whole, but in your case, the requirement probably demands it.
>>>
>>> Thanks and Regards,
>>> Sonal
>>>
>>> On Tue, Feb 8, 2011 at 8:59 AM, Harsh J <qwertymaniac@gmail.com> wrote:
>>>
>>>> Extend FileInputFormat, and write your own binary-format based
>>>> implementation of it, and make it non-splittable (isSplitable should
>>>> return false). This way, a Mapper would get a whole file, and you
>>>> shouldn't have block-splitting issues.
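Sketching that advice against the new org.apache.hadoop.mapreduce API (WholeFileInputFormat and WholeFileRecordReader are hypothetical names; the record reader that slurps the entire file into one record is something you would implement yourself):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Non-splittable input format: each mapper receives one whole file,
// so a binary file is never cut at HDFS block boundaries.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one split (and hence one mapper) per file
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Hypothetical reader that returns the file path as the key and
        // the file's full contents as a single value.
        return new WholeFileRecordReader();
    }
}
```

As Sonal notes below, reading a whole file in one mapper gives up intra-file data locality, which is usually an acceptable trade for whole-file conversions.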
>>>>
>>>> On Tue, Feb 8, 2011 at 6:37 AM, felix gao <gre1600@gmail.com> wrote:
>>>> > Hello users of hadoop,
>>>> > I have a task to convert large binary files from one format to
>>>> > another. I am wondering what is the best practice to do this.
>>>> > Basically, I am trying to get one mapper to work on each binary file
>>>> > and I am not sure how to do that in hadoop properly.
>>>> > thanks,
>>>> > Felix
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>> www.harshj.com
>>>>
>>>
>>>
>>
>
