hadoop-common-user mailing list archives

From Mohammad Tariq <donta...@gmail.com>
Subject Re: Loading file to HDFS with custom chunk structure
Date Wed, 16 Jan 2013 15:49:33 GMT
Since SEGY files are flat binary files, you might have a tough
time dealing with them, as there is no native InputFormat for
them. You can strip off the EBCDIC+binary header (the initial
3600 bytes) and store the SEGY data as a SequenceFile, where each
trace (trace header + trace data) would be the value and the
trace no. could be the key.
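For illustration, here is a rough, untested sketch of such a conversion.
It assumes a fixed number of samples per trace (passed on the command
line) and 4-byte samples, so every trace is 240 + ns*4 bytes; real SEGY
files can vary per trace, in which case you would read the sample count
from each 240-byte trace header instead. Class and argument names here
are just placeholders.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class SegyToSequenceFile {
    private static final int TEXT_HEADER  = 3200; // EBCDIC textual header
    private static final int BIN_HEADER   = 400;  // binary header
    private static final int TRACE_HEADER = 240;  // per-trace header

    public static void main(String[] args) throws Exception {
        // args: <local SEGY file> <HDFS output path> <samples per trace>
        int ns = Integer.parseInt(args[2]);
        int traceLen = TRACE_HEADER + ns * 4;      // assumes 4-byte samples

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]), LongWritable.class, BytesWritable.class);

        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            in.skipBytes(TEXT_HEADER + BIN_HEADER); // strip the 3600-byte file header
            byte[] trace = new byte[traceLen];
            long traceNo = 0;
            while (true) {
                try {
                    in.readFully(trace);            // one trace header + trace data
                } catch (EOFException eof) {
                    break;                          // end of the SEGY file
                }
                // key = trace number, value = whole trace (header + samples)
                writer.append(new LongWritable(traceNo++), new BytesWritable(trace));
            }
        } finally {
            IOUtils.closeStream(writer);
            IOUtils.closeStream(in);
        }
    }
}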

Otherwise you would have to write a custom InputFormat to deal
with them. The SequenceFile approach should enhance performance as
well, since SequenceFiles are already in key-value form.
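To show why the key-value form helps, here is a minimal, untested
sketch of a MapReduce job (Hadoop 2.x mapreduce API) consuming such a
SequenceFile: each trace arrives at the mapper as a ready-made key-value
pair, with no parsing of the SEGY layout inside the job itself. The
class names TraceJob and TraceMapper are only placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TraceJob {

    // Each call to map() receives one complete trace: the trace number as
    // the key and the raw trace bytes (trace header + trace data) as the value.
    public static class TraceMapper
            extends Mapper<LongWritable, BytesWritable, LongWritable, NullWritable> {
        @Override
        protected void map(LongWritable traceNo, BytesWritable trace, Context ctx)
                throws IOException, InterruptedException {
            // trace.getBytes() holds the 240-byte trace header followed by the
            // samples; real processing of the trace would go here
            ctx.write(traceNo, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "segy trace job");
        job.setJarByClass(TraceJob.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(TraceMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(0);                 // map-only job for this sketch
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}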

Warm Regards,

On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:

> Look at the block size concept in Hadoop and see if that is what you are
> looking for.
> Sent from my iPhone
> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <kaliyugantagonist@gmail.com> wrote:
> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto HDFS
> of a 3-node Apache Hadoop cluster.
> To summarize, the SegY file consists of:
>    1. 3200 bytes *textual header*
>    2. 400 bytes *binary header*
>    3. Variable bytes *data*
> Nearly all (99.99%) of the file's size is due to the variable bytes data,
> which is a collection of thousands of contiguous traces. For any SegY file
> to make sense, it must have the textual header + binary header + at least
> one trace of data. What I want to achieve is to split a large SegY file
> across the Hadoop cluster so that a smaller SegY file is available on each
> node for local processing.
> The scenario is as follows:
>    1. The SegY file is large (above 10 GB) and resides on the local file
>    system of the NameNode machine
>    2. The file is to be split across the nodes in such a way that each node
>    has a small SegY file with a strict structure - 3200 bytes *textual
>    header* + 400 bytes *binary header* + variable bytes *data*. Obviously,
>    I can't blindly use FSDataOutputStream or hadoop fs -copyFromLocal, as
>    these may not ensure the format in which the chunks of the larger file
>    are required
> Please guide me as to how I must proceed.
> Thanks and regards!
