hadoop-user mailing list archives

From Sugandha Naolekar <sugandha....@gmail.com>
Subject Re: Can the file storage in HDFS be customized?
Date Wed, 26 Feb 2014 04:34:15 GMT
Yes. Got it. Thanks

Thanks & Regards,
Sugandha Naolekar

On Tue, Feb 25, 2014 at 10:14 PM, java8964 <java8964@hotmail.com> wrote:

> Hi, Naolekar:
> The blocks in HDFS just store bytes. HDFS has no idea, nor does it care, what
> kind of data is in a block or how many polygons it holds. It just stores 128 MB
> of bytes (if your default block size is set to 128 MB).
> It is the job of your InputFormat/RecordReader to read these bytes in and
> deserialize them into <K,V> pairs.
> The default TextInputFormat reads one LINE of text per record.
> Of course, a block boundary will most likely fall in the middle of a
> line, so it is TextInputFormat's responsibility to read the whole line of
> the last record of a block, and to find the correct starting point of the
> first line of the current block, as you can imagine. You can read the source
> code of TextInputFormat to see how it implements this.
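
The boundary handling described above can be sketched outside Hadoop. What follows is a plain-Python simulation, not the Hadoop source: the names make_splits and read_split are hypothetical, and the ownership rule used here (a reader that does not start at offset 0 discards its first line, and every reader keeps going until it has finished the line that begins at or before its end offset) is an assumption modelled on how Hadoop's line reader is commonly described.

```python
def make_splits(length: int, size: int) -> list[tuple[int, int]]:
    """Cut [0, length) into byte ranges of at most `size` bytes, ignoring content."""
    return [(s, min(s + size, length)) for s in range(0, length, size)]


def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the whole lines owned by the byte range [start, end)."""
    pos = start
    if start != 0:
        # Discard everything up to the first newline: the previous
        # split's reader is responsible for that line.
        nl = data.find(b"\n", pos)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    # A line belongs to this split if it begins at or before `end`,
    # even when its bytes run past `end` into the next block.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:])  # last line without trailing newline
            break
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines


data = b"alpha\nbravo\ncharlie\ndelta\n"
for start, end in make_splits(len(data), 8):
    print((start, end), read_split(data, start, end))
```

Whatever split size is chosen, each line comes out exactly once, even though the splits are cut by byte count alone. A split that falls entirely inside one long line simply yields nothing.
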
> After each line of text is read, it is the RecordReader class's
> responsibility to translate that line of text into a <K,V> pair.
> Is the above logic good for your data? Maybe not; then it is time to write
> your own InputFormat/RecordReader class that understands your data.
> For the InputFormat, read one record out of the block's byte array, and
> take special care with the block boundary cases, at both the start and the
> end of a block, as TextInputFormat does.
> For the RecordReader, translate that record into <K,V> for your mapper.
> Yong
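
The last step above, turning one record into a <K,V> pair for the mapper, might look like the sketch below. It is plain Python rather than Hadoop code, and the record layout is purely an assumption for illustration: one polygon per line, written as "id;x y,x y,...". A real shapefile is binary and would need its own parser. The function name to_key_value is hypothetical.

```python
def to_key_value(line: str) -> tuple[int, list[tuple[float, float]]]:
    """Translate one text record into (polygon id, list of (x, y) vertices).

    Assumed (hypothetical) record layout: "<id>;<x> <y>,<x> <y>,..."
    """
    poly_id, coords = line.split(";", 1)
    points = []
    for pair in coords.split(","):
        x, y = pair.split()
        points.append((float(x), float(y)))
    return int(poly_id), points


key, value = to_key_value("7;0 0,1 0,1 1")
print(key, value)
```

With one polygon per record, each call to map() receives exactly one geometry, regardless of how many polygons happen to fall in a given block.
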
> ------------------------------
> From: sugandha.n87@gmail.com
> Date: Tue, 25 Feb 2014 15:59:33 +0530
> Subject: Can the file storage in HDFS be customized?
> To: user@hadoop.apache.org
> Hello,
> I have a huge shapefile which has some 500 polygon geometries. Is there a
> way to store this shapefile in HDFS in such a format that each block will
> have 100 polygon geometries, with each block corresponding to a quad-core
> machine?
> Thus, 5 machines, with 5 blocks, which have in total 500 polygon
> geometries.
> Internally, I would like to read each HDFS block in such a way
> that each polygon geometry is fed to a map() task. Thus, 100 map()
> tasks per block per machine.
> --
> Thanks & Regards,
> Sugandha Naolekar
