hadoop-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: Can the file storage in HDFS be customized?
Date Tue, 25 Feb 2014 16:44:00 GMT
Hi, Naolekar:
Blocks in HDFS just store bytes. HDFS neither knows nor cares what kind of data is in a block, or how many polygons it contains; it just stores 128M of bytes (if your block size is set to 128M).
It is your InputFormat/RecordReader that reads these bytes in and deserializes them into <K,V> pairs.
The default TextInputFormat reads one LINE of text per record. Of course, the block boundary will most likely fall in the middle of a line, so it is TextInputFormat's responsibility to read the complete last line of one block, and to find the correct starting point of the first line of the current block, as you can imagine. You can read the source code of TextInputFormat to see how it implements this.
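As a rough illustration of that boundary rule, here is a plain-Java sketch (not the real Hadoop class; the class name, file contents, and split offsets are all made up): a split that does not start at byte 0 skips its partial first line, and a line whose start falls inside the split is read to completion even past the split's end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitLines {

    // Return the whole lines that "belong" to the split
    // [splitStart, splitEnd) of the full file bytes.
    public static List<String> readSplit(byte[] file, int splitStart, int splitEnd) {
        int pos = splitStart;
        // Rule 1: unless we start at offset 0, the first (possibly partial)
        // line belongs to the previous split -- skip to just past the next '\n'.
        if (splitStart != 0) {
            while (pos < file.length && file[pos] != '\n') pos++;
            pos++;
        }
        List<String> lines = new ArrayList<>();
        // Rule 2: keep emitting lines while the line START is inside the
        // split; the last line may run past splitEnd into the next block.
        while (pos < file.length && pos < splitEnd) {
            int start = pos;
            while (pos < file.length && file[pos] != '\n') pos++;
            lines.add(new String(file, start, pos - start, StandardCharsets.UTF_8));
            pos++; // step past the '\n'
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // Split the 20-byte file at offset 8 (middle of "bravo").
        System.out.println(readSplit(data, 0, 8));   // [alpha, bravo]
        System.out.println(readSplit(data, 8, 20));  // [charlie]
    }
}
```

Note that "bravo" starts at offset 6, inside the first split, so the first split reads it to the end even though the split boundary (offset 8) cuts it in half; the second split skips those leftover bytes.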
After each line of text is read, it is the RecordReader's responsibility to translate that line of text into a <K,V> pair.
Is the above logic right for your data? Maybe not; then it is time to write your own InputFormat/RecordReader classes that understand your own data.
For the InputFormat, read one record out of the block's byte array, taking special care with the block boundary cases at both the start and end of a block, as TextInputFormat does.
For the RecordReader, translate that record into a <K,V> pair for your mapper.
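A minimal sketch of that translation step, assuming a made-up one-polygon-per-line text layout ("id|x1 y1;x2 y2;..."). The class and method names are hypothetical, and in a real Hadoop RecordReader the key and value would be Writable types (e.g. LongWritable and a geometry Writable) rather than plain Java objects:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class PolygonRecord {

    // Key = polygon id, Value = vertex count (stand-ins for the Writable
    // key/value a real RecordReader would hand to the mapper).
    public static Map.Entry<Long, Integer> toKeyValue(String line) {
        // Split "42|0 0;4 0;4 3;0 3" into id and vertex list.
        String[] parts = line.split("\\|", 2);
        long id = Long.parseLong(parts[0].trim());
        int vertices = parts[1].isEmpty() ? 0 : parts[1].split(";").length;
        return new SimpleEntry<>(id, vertices);
    }

    public static void main(String[] args) {
        Map.Entry<Long, Integer> kv = toKeyValue("42|0 0;4 0;4 3;0 3");
        System.out.println(kv.getKey() + " -> " + kv.getValue()); // 42 -> 4
    }
}
```

The point is only the shape of the job: the RecordReader does nothing but turn one record's bytes into one <K,V> pair; everything about split boundaries lives in the InputFormat side.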
Yong

From: sugandha.n87@gmail.com
Date: Tue, 25 Feb 2014 15:59:33 +0530
Subject: Can the file storage in HDFS be customized?
To: user@hadoop.apache.org

Hello,

I have a huge shapefile which has some 500 polygon geometries. Is there a way to store this shapefile in HDFS in such a format that each block will have 100 polygon geometries, with each block residing on a quad-core machine?

Thus, 5 machines with 5 blocks, holding 500 polygon geometries in total.

Internally, I would like to read each block of HDFS in such a way that each polygon geometry is fed to a map() task. Thus, 100 map() tasks per block per machine.

--Thanks & Regards,
Sugandha Naolekar