hadoop-mapreduce-user mailing list archives

From felix gao <gre1...@gmail.com>
Subject Re: Best practice for batch file conversions
Date Tue, 08 Feb 2011 23:52:58 GMT
I am stuck again. The binary files are stored in HDFS under a pre-defined
structure like
root/
|-- dir1
|   |-- file1
|   |-- file2
|   `-- file3
|-- dir2
|   |-- file1
|   `-- file3
`-- dir3
    |-- file2
    `-- file3

After I process them in my mapper using a non-splittable InputFormat, I
would like to store the files back into HDFS like
processed/
|-- dir1
|   |-- file1.done
|   |-- file2.done
|   `-- file3.done
|-- dir2
|   |-- file1.done
|   `-- file3.done
`-- dir3
    |-- file2.done
    `-- file3.done

Can someone please show me how to do this?

thanks,

Felix
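
One way to produce the processed/ tree above, assuming a non-splittable
whole-file input format along the lines Harsh describes below (sketched
after the thread): recover each input file's path from the FileSplit
inside the mapper, rebuild the same relative path under processed/, and
write the converted bytes straight back to HDFS with the FileSystem API
instead of going through the job's OutputFormat. A minimal sketch; the
ConvertMapper name and the convert() helper are hypothetical
placeholders, not code from the thread.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Converts one whole file per map task and mirrors root/ under processed/. */
public class ConvertMapper
    extends Mapper<NullWritable, BytesWritable, NullWritable, NullWritable> {

  @Override
  protected void map(NullWritable key, BytesWritable value, Context context)
      throws IOException, InterruptedException {
    // The file this map task is processing, e.g. .../root/dir1/file1.
    Path in = ((FileSplit) context.getInputSplit()).getPath();

    // Mirror the one-level dir/file layout from the example under
    // processed/, adding the .done suffix: processed/dir1/file1.done.
    Path out = new Path("processed/" + in.getParent().getName(),
                        in.getName() + ".done");

    // Side-effect write that bypasses the job's OutputFormat; HDFS
    // creates the parent directories automatically.
    FileSystem fs = out.getFileSystem(context.getConfiguration());
    FSDataOutputStream os = fs.create(out, true);
    try {
      os.write(convert(value.getBytes(), value.getLength()));
    } finally {
      os.close();
    }
  }

  /** Hypothetical format conversion; the identity copy is a placeholder. */
  private byte[] convert(byte[] data, int length) {
    byte[] copy = new byte[length];
    System.arraycopy(data, 0, copy, 0, length);
    return copy;
  }
}

One caveat with side-effect files like this: they bypass the job's
OutputCommitter, so a failed or speculatively executed task attempt can
leave behind duplicate or partial .done files. Disabling speculative
execution for the job is the usual workaround.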

On Tue, Feb 8, 2011 at 9:43 AM, felix gao <gre1600@gmail.com> wrote:

> Thanks a lot for the pointer. I will play around with it.
>
>
> On Mon, Feb 7, 2011 at 10:55 PM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
>
>> Hi,
>>
>> You can use FileStreamInputFormat, which returns the file stream as the
>> value.
>>
>>
>> https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input
>>
>> You need to remember that you lose data locality by trying to manipulate
>> the file as a whole, but in your case, the requirement probably demands it.
>>
>> Thanks and Regards,
>> Sonal
>> Connect Hadoop with databases, Salesforce, FTP servers and others:
>> https://github.com/sonalgoyal/hiho
>> Nube Technologies <http://www.nubetech.co>
>> <http://in.linkedin.com/in/sonalgoyal>
>>
>> On Tue, Feb 8, 2011 at 8:59 AM, Harsh J <qwertymaniac@gmail.com> wrote:
>>
>>> Extend FileInputFormat and write your own binary-format-based
>>> implementation of it, making it non-splittable (isSplitable should
>>> return false). This way, a mapper gets a whole file, and you
>>> shouldn't have block-splitting issues.
>>>
>>> On Tue, Feb 8, 2011 at 6:37 AM, felix gao <gre1600@gmail.com> wrote:
>>> > Hello users of Hadoop,
>>> > I have a task to convert large binary files from one format to
>>> > another. I am wondering what the best practice is for doing this.
>>> > Basically, I am trying to get one mapper to work on each binary
>>> > file, and I am not sure how to do that in Hadoop properly.
>>> > thanks,
>>> > Felix
>>>
>>>
>>>
>>> --
>>> Harsh J
>>> www.harshj.com
>>>
>>
>>
>
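
Harsh's suggestion amounts to an input format like the minimal sketch
below (the WholeFileInputFormat and WholeFileRecordReader names are
illustrative, not from the thread). Returning false from isSplitable is
the key part: it makes each file exactly one split, and therefore one
map task.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Non-splittable input format: each mapper receives one whole file. */
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never split, so one map task per file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }
}

/** Reads the entire content of a FileSplit as a single record. */
class WholeFileRecordReader
    extends RecordReader<NullWritable, BytesWritable> {

  private FileSplit split;
  private TaskAttemptContext context;
  private final BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    this.split = (FileSplit) split;
    this.context = context;
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (processed) {
      return false;
    }
    // Read the whole file into one BytesWritable value.
    byte[] contents = new byte[(int) split.getLength()];
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(context.getConfiguration());
    FSDataInputStream in = null;
    try {
      in = fs.open(file);
      IOUtils.readFully(in, contents, 0, contents.length);
      value.set(contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    processed = true;
    return true;
  }

  @Override
  public NullWritable getCurrentKey() { return NullWritable.get(); }

  @Override
  public BytesWritable getCurrentValue() { return value; }

  @Override
  public float getProgress() { return processed ? 1.0f : 0.0f; }

  @Override
  public void close() { }
}

Wiring it up is then job.setInputFormatClass(WholeFileInputFormat.class)
plus the usual FileInputFormat.setInputPaths call; the mapper sketched
earlier consumes the resulting (NullWritable, BytesWritable) pairs. Note
that loading a whole file into a BytesWritable only works for files that
fit comfortably in a task's heap; hiho's FileStreamInputFormat, mentioned
above, hands the mapper a stream instead and avoids that limit.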
