hadoop-hdfs-user mailing list archives

From Shumin Guo <gsmst...@gmail.com>
Subject RE: Reading a file in a customized way
Date Wed, 26 Feb 2014 02:30:32 GMT
You can extend FileInputFormat and override isSplitable() to return false. More
info is in the Javadoc:
https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
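A minimal sketch of that idea (the class name and key/value types here are illustrative, not from the original message, and this uses the newer org.apache.hadoop.mapreduce API rather than the mapred package the link points to):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Each input file becomes exactly one split, so a single mapper
// sees the whole file regardless of how HDFS placed its blocks.
public class NonSplittableTextInputFormat
        extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split this file across mappers
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Still reads line by line, but within the single whole-file split.
        return new LineRecordReader();
    }
}
```

This needs the Hadoop jars on the classpath to compile; it changes only how splits are formed, not what a record is.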

Shumin
On Feb 25, 2014 10:56 AM, "java8964" <java8964@hotmail.com> wrote:

> See my reply for another email today for similar question.
>
> "
>
>    - RE: Can the file storage in HDFS be customized?"
>
> Thanks
>
> Yong
>
> ------------------------------
> From: sugandha.n87@gmail.com
> Date: Tue, 25 Feb 2014 11:40:13 +0530
> Subject: Reading a file in a customized way
> To: user@hadoop.apache.org
>
> Hello,
>
> Irrespective of how the file blocks are placed in HDFS, I want my map() to
> be invoked in a customized manner. For example, I want to process a huge
> JSON file (a single file). This file is definitely less than the default
> block size (128 MB), so ideally only one mapper will be called, i.e. the
> map task will run only once, right? But I want my map function to process
> every feature of this JSON file, so that each feature becomes one map
> task's input. To read this JSON, will I have to create custom input splits
> and use a custom record reader? Please find a sample of the JSON file below:
> {
> "type": "FeatureCollection",
> "features": [
> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
> "CLAZZ": 31.000000, "ROAD_TYPE": 3.000000, "END_ID": 22278.000000,
> "OSM_META": "", "REVERSE_LE": 128.579933, "X1": 77.542660, "OSM_SOURCE":
> 1524649946.000000, "COST": 0.003129, "OSM_TARGET": 529780893.000000, "X2":
> 77.542832, "Y2": 12.990992, "CONGESTED_": 138.579933, "Y1": 12.989879,
> "REVERSE_CO": 0.003129, "CONGESTION": 10.000000, "OSM_ID": 38033028.000000,
> "START_ID": 34570.000000, "KM": 0.000000, "LENGTH": 128.579933,
> "REVERSE__1": 138.579933, "SPEED_IN_K": 40.000000, "ROW_FLAG": "F" },
> "geometry": { "type": "LineString", "coordinates": [ [ 8632009.414824,
> 1458576.029252 ], [ 8632012.876860, 1458598.957830 ], [ 8632028.595172,
> 1458703.170565 ] ] } }
> ,
> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
> ]
>
> }
>
>
> Also, in what manner does Hadoop generally split text files into blocks?
> Line by line? Can this be customized? If not, can we read a record that
> spans two blocks? E.g., each feature in the JSON above is a combination of
> multiple lines. Could it happen that one line of a feature tag is placed
> in a block on one machine and the rest of its lines end up in another
> machine's block?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
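A custom record reader for the per-feature reading asked about above would essentially need to carve the FeatureCollection into one record per feature. A rough sketch of just that carving step in plain Java (the class and method names are hypothetical; inside Hadoop this logic would live in a RecordReader that has already read the whole non-split file into a string), using brace-depth tracking, which is enough for well-formed GeoJSON:

```java
import java.util.ArrayList;
import java.util.List;

public class FeatureSplitter {

    // Splits the "features" array of a GeoJSON FeatureCollection into one
    // string per feature by tracking brace depth, ignoring braces that
    // appear inside quoted strings.
    public static List<String> splitFeatures(String json) {
        List<String> features = new ArrayList<>();
        int start = json.indexOf('[');   // start of the features array
        int end = json.lastIndexOf(']'); // end of the features array
        if (start < 0 || end < 0) {
            return features;
        }
        int depth = 0;
        boolean inString = false;
        int featureStart = -1;
        for (int i = start + 1; i < end; i++) {
            char c = json.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;                 // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == '"') {
                inString = true;
            } else if (c == '{') {
                if (depth == 0) {
                    featureStart = i;    // a new top-level feature begins
                }
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    features.add(json.substring(featureStart, i + 1));
                }
            }
        }
        return features;
    }
}
```

Each returned string could then be emitted as one map input value, giving one map() call per feature even though the file lives in a single split.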
