hadoop-common-user mailing list archives

From Sugandha Naolekar <sugandha....@gmail.com>
Subject Reading a file in a customized way
Date Tue, 25 Feb 2014 06:10:13 GMT
Hello,

Irrespective of how the file's blocks are placed in HDFS, I want my map() to be
invoked in a customized manner. For example, I want to process a single, huge
JSON file. This file is definitely smaller than the default block size (128 MB),
so ideally only one mapper will be launched; that is, the map task will be
called only once, right? However, I want my map function to process every
feature of this JSON file, so that each feature becomes one map record. To read
the JSON this way, will I have to create custom InputSplits and use a custom
RecordReader? Please find a sample of the JSON file below:

{
"type": "FeatureCollection",
"features": [
{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
"CLAZZ": 31.000000, "ROAD_TYPE": 3.000000, "END_ID": 22278.000000,
"OSM_META": "", "REVERSE_LE": 128.579933, "X1": 77.542660, "OSM_SOURCE":
1524649946.000000, "COST": 0.003129, "OSM_TARGET": 529780893.000000, "X2":
77.542832, "Y2": 12.990992, "CONGESTED_": 138.579933, "Y1": 12.989879,
"REVERSE_CO": 0.003129, "CONGESTION": 10.000000, "OSM_ID": 38033028.000000,
"START_ID": 34570.000000, "KM": 0.000000, "LENGTH": 128.579933,
"REVERSE__1": 138.579933, "SPEED_IN_K": 40.000000, "ROW_FLAG": "F" },
"geometry": { "type": "LineString", "coordinates": [ [ 8632009.414824,
1458576.029252 ], [ 8632012.876860, 1458598.957830 ], [ 8632028.595172,
1458703.170565 ] ] } }
,
{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
"CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
"OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
"REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
"F" }, "geometry": { "type": "LineString", "coordinates": [ [
8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }

]

}
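To clarify what I mean, here is a minimal standalone sketch (plain Python, not Hadoop code — the function names `feature_records` and `map_fn` are just illustrative) of the record splitting I imagine a custom RecordReader would do: each feature in the FeatureCollection becomes one record handed to map().

```python
import json

def feature_records(geojson_text):
    """Yield one record per feature from a GeoJSON FeatureCollection,
    the way a custom RecordReader would hand records to map()."""
    collection = json.loads(geojson_text)
    for feature in collection["features"]:
        yield feature

def map_fn(feature):
    """Hypothetical per-feature map(): extract a couple of properties."""
    props = feature["properties"]
    return (props["OSM_ID"], props["LENGTH"])

# A trimmed-down version of the sample file above (two features).
sample = """{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"OSM_ID": 38033028.0, "LENGTH": 128.579933},
   "geometry": {"type": "LineString", "coordinates": [[8632009.4, 1458576.0]]}},
  {"type": "Feature", "properties": {"OSM_ID": 138697535.0, "LENGTH": 217.541279},
   "geometry": {"type": "LineString", "coordinates": [[8633115.4, 1458944.8]]}}
]}"""

# map() runs once per feature: two features give two records.
records = [map_fn(f) for f in feature_records(sample)]
```

So the question is whether Hadoop forces me to write a custom InputFormat/RecordReader pair to get this per-feature behaviour, or whether there is a simpler way.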

Also, in what manner does Hadoop generally split text files into blocks? Line
by line? Can this be customized? If not, can we read a record that spans 2
blocks? E.g., each feature in the JSON is a combination of multiple lines.
Could it happen that one line of a feature is placed in a block on one machine
and the rest of its lines in a block on another machine?
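To make the boundary concern concrete, here is a standalone Python sketch of how I assume a line-oriented reader deals with a record that crosses a split boundary (this is my understanding of the usual convention, not actual Hadoop code): a reader whose split does not start at byte 0 skips its partial first line, and any reader may read past its split's end to finish the last line it owns.

```python
def read_split(data: bytes, start: int, end: int):
    """Return the complete lines 'owned' by the byte range [start, end).
    A line belongs to the split in which it starts; the reader may read
    past `end` to complete a line that spans the boundary."""
    pos = start
    if start != 0:
        # We may be mid-line; the previous split's reader finishes that
        # line, so skip ahead to just after the next newline.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:].decode())
            pos = len(data)
        else:
            # nl may lie beyond `end`: the line crossing the boundary is
            # still read in full by this split's reader.
            lines.append(data[pos:nl].decode())
            pos = nl + 1
    return lines

data = b"alpha\nbravo\ncharlie\n"
# Pretend the 20-byte file is stored as two 10-byte "blocks".
first = read_split(data, 0, 10)    # "bravo" starts at byte 6 < 10, so it
                                   # belongs here even though it ends at 11
second = read_split(data, 10, 20)  # skips the tail of "bravo", owns "charlie"
```

If this is roughly what happens for plain text, my question is whether the same trick can be made to work when a "record" is a multi-line JSON feature rather than a single line.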

--
Thanks & Regards,
Sugandha Naolekar
