hadoop-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: Reading a file in a customized way
Date Tue, 25 Feb 2014 16:55:53 GMT
See my reply to another email today on a similar question: "RE: Can the file storage in HDFS be customized?"

Thanks,
Yong
From: sugandha.n87@gmail.com
Date: Tue, 25 Feb 2014 11:40:13 +0530
Subject: Reading a file in a customized way
To: user@hadoop.apache.org

Hello,

Irrespective of how the file blocks are placed in HDFS, I want my map() to be invoked in a customized manner. For e.g., I want to process a huge JSON file (a single file). This file is definitely less than the default block size (128 MB), so ideally only one mapper will be called, i.e. the map task will run only once, right? But I want my map function to process every feature of this JSON file, so that each feature is handled as its own map task. To read this JSON, will I have to get the input splits and use a custom record reader? Please find a sample of the JSON file below:

{
"type": "FeatureCollection",


"features": [
{
 "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000, 
"CLAZZ": 31.000000, "ROAD_TYPE": 3.000000, "END_ID": 22278.000000, 
"OSM_META": "", "REVERSE_LE": 128.579933, "X1": 77.542660, "OSM_SOURCE":
 1524649946.000000, "COST": 0.003129, "OSM_TARGET": 529780893.000000, 
"X2": 77.542832, "Y2": 12.990992, "CONGESTED_": 138.579933, "Y1": 
12.989879, "REVERSE_CO": 0.003129, "CONGESTION": 10.000000, "OSM_ID": 
38033028.000000, "START_ID": 34570.000000, "KM": 0.000000, "LENGTH": 
128.579933, "REVERSE__1": 138.579933, "SPEED_IN_K": 40.000000, 
"ROW_FLAG": "F" }, "geometry": { "type": "LineString", "coordinates": [ [
 8632009.414824, 1458576.029252 ], [ 8632012.876860, 1458598.957830 ], [
 8632028.595172, 1458703.170565 ] ] } }
,
{ "type": "Feature", 
"properties": { "OSM_NAME": "", "FLAGS": 3.000000, "CLAZZ": 42.000000, 
"ROAD_TYPE": 3.000000, "END_ID": 33451.000000, "OSM_META": "", 
"REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE": 
1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, 
"X2": 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 
12.993107, "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID": 
138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH": 
217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, 
"ROW_FLAG": "F" }, "geometry": { "type": "LineString", "coordinates": [ [
 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] }
 }
]

}
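[To process a FeatureCollection feature-by-feature, one common approach is a custom InputFormat whose RecordReader emits one feature object per map() call. The brace-matching at the heart of such a reader can be sketched in plain Java, independent of Hadoop; FeatureSplitter and splitFeatures are hypothetical names for illustration, not Hadoop API:]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the core of what a custom RecordReader would do --
// carve each JSON "feature" object out of a FeatureCollection. Plain Java
// so the boundary logic can be tested without a Hadoop cluster.
public class FeatureSplitter {

    // Returns each element of the top-level "features" array as its own string.
    public static List<String> splitFeatures(String json) {
        List<String> features = new ArrayList<>();
        int arrayStart = json.indexOf('[', json.indexOf("\"features\""));
        if (arrayStart < 0) {
            return features;
        }
        int depth = 0;
        int featureStart = -1;
        boolean inString = false;
        for (int i = arrayStart + 1; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                if (c == '\\') { i++; }            // skip escaped character
                else if (c == '"') { inString = false; }
                continue;
            }
            if (c == '"') { inString = true; }
            else if (c == '{') {
                if (depth == 0) { featureStart = i; }
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    features.add(json.substring(featureStart, i + 1));
                }
            } else if (c == ']' && depth == 0) {
                break;                             // end of the features array
            }
        }
        return features;
    }
}
```

[In a real RecordReader, nextKeyValue() would advance this scan incrementally over the split's input stream instead of holding the whole file in a string.]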


Also, in what manner does Hadoop generally split text files and place them in blocks? Line by line? Can this be customized? If not, can we read a record from 2 blocks? E.g., each feature in the JSON above is a combination of multiple lines. Could it happen that one line of a feature is placed in one block on one machine and the rest of its lines in another machine's block?
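[On the block question: HDFS splits files by byte offset, not by line, so a record can certainly straddle two blocks on different machines. TextInputFormat's LineRecordReader handles this with a convention: each split skips the partial record it starts in, and is allowed to read past its own end to finish the last record it owns. A rough sketch of that convention in plain Java, with SplitReader as a hypothetical name and an in-memory string standing in for an HDFS stream:]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of LineRecordReader-style split handling: a record
// belongs to the split in which it STARTS, and the reader may read past the
// split's end to finish it. Applying this to every split of a file yields
// each record exactly once, even when records cross block boundaries.
public class SplitReader {

    // Return every '\n'-terminated record that starts inside [start, end).
    public static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // A reader not at offset 0 skips the (possibly partial) record it
        // lands in; the previous split's reader is responsible for it.
        if (start > 0) {
            int nl = data.indexOf('\n', start);
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int stop = (nl < 0) ? data.length() : nl;
            records.add(data.substring(pos, stop)); // may extend past 'end'
            pos = stop + 1;
        }
        return records;
    }
}
```

[For a multi-line JSON feature, a plain line-oriented reader is not enough; either make the custom RecordReader scan forward to a feature boundary the same way, or override isSplitable() to return false so one mapper reads the whole file.]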



--
Thanks & Regards,
Sugandha Naolekar