hadoop-mapreduce-user mailing list archives

From sudhakara st <sudhakara...@gmail.com>
Subject Re: Reading a file in a customized way
Date Tue, 25 Feb 2014 07:44:31 GMT
Use WholeFileInputFormat/WholeFileRecordReader (The Hadoop Definitive
Guide by Tom White, page 240) to pass the file name as the key and the
contents of the file as the value to the mapper.
Before getting into this, it is better to read up on the HDFS architecture
and the MapReduce flow:
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
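For reference, here is a minimal sketch of that pattern against the new
(org.apache.hadoop.mapreduce) API. It is adapted from the book's example so
that the key carries the file path, as described above; the class and field
names are illustrative, not verbatim from the book.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Emits one record per file: key = file path, value = raw file bytes.
    public class WholeFileInputFormat
        extends FileInputFormat<Text, BytesWritable> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, so a single mapper sees the whole file
      }

      @Override
      public RecordReader<Text, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
      }

      public static class WholeFileRecordReader
          extends RecordReader<Text, BytesWritable> {

        private FileSplit fileSplit;
        private Configuration conf;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
          this.fileSplit = (FileSplit) split;
          this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
          if (processed) {
            return false; // only one record per file
          }
          Path file = fileSplit.getPath();
          byte[] contents = new byte[(int) fileSplit.getLength()];
          FileSystem fs = file.getFileSystem(conf);
          FSDataInputStream in = null;
          try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
          } finally {
            IOUtils.closeStream(in);
          }
          key.set(file.toString());
          value.set(contents, 0, contents.length);
          processed = true;
          return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
      }
    }

Set it on the job with job.setInputFormatClass(WholeFileInputFormat.class).
The mapper then receives the entire JSON document as a single value and can
parse it and emit one output record per feature. Note this reads the whole
file into memory, which is fine for a file smaller than one block.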


On Tue, Feb 25, 2014 at 11:40 AM, Sugandha Naolekar
<sugandha.n87@gmail.com> wrote:

> Hello,
>
> Irrespective of how the file blocks are placed in HDFS, I want my map() to
> be invoked in a customized manner. For example, I want to process a large
> JSON file (a single file). This file is definitely smaller than the default
> block size (128 MB), so ideally only one mapper will be called, i.e. the
> map task will run only once, right? But I want my map function to process
> every feature of this JSON file, so each feature would become one map task.
> To read this JSON, will I have to build the input splits myself and use a
> custom record reader? Please find a sample of the JSON file below:
>
> {
> "type": "FeatureCollection",
> "features": [
> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
> "CLAZZ": 31.000000, "ROAD_TYPE": 3.000000, "END_ID": 22278.000000,
> "OSM_META": "", "REVERSE_LE": 128.579933, "X1": 77.542660, "OSM_SOURCE":
> 1524649946.000000, "COST": 0.003129, "OSM_TARGET": 529780893.000000, "X2":
> 77.542832, "Y2": 12.990992, "CONGESTED_": 138.579933, "Y1": 12.989879,
> "REVERSE_CO": 0.003129, "CONGESTION": 10.000000, "OSM_ID": 38033028.000000,
> "START_ID": 34570.000000, "KM": 0.000000, "LENGTH": 128.579933,
> "REVERSE__1": 138.579933, "SPEED_IN_K": 40.000000, "ROW_FLAG": "F" },
> "geometry": { "type": "LineString", "coordinates": [ [ 8632009.414824,
> 1458576.029252 ], [ 8632012.876860, 1458598.957830 ], [ 8632028.595172,
> 1458703.170565 ] ] } }
> ,
> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
>
> ]
>
> }
>
> Also, in what manner does Hadoop generally split text files and place them
> in blocks? Line by line? Can this be customized? If not, can a record be
> read across 2 blocks? E.g., each feature in the JSON spans multiple lines.
> Could it happen that one line of a feature ends up in a block on one
> machine and the remaining lines in a block on another machine?
>
> --
> Thanks & Regards,
> Sugandha Naolekar


-- 

Regards,
...sudhakara
