hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"
Date Tue, 13 May 2008 19:53:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596517#action_12596517 ]

Chris Douglas commented on HADOOP-3221:
---------------------------------------

bq. We don't invent brand new input formats. We reuse what exists and the amount of new code
is minimal

Which is why this would reuse LineRecordReader to handle compression for the split generation,
etc.

bq. We are better at handling the cases of large files. Granted that with 1 line per map,
we might have the same problem with FileSplit. But we could work around that by having a larger
N.

That's why this was requested. Our model handles large files, but users want to create maps
initialized with a handful of parameters defined in a text file and executed at arbitrary
points on the cluster. I'm skeptical of this model, but it's an idiom used often enough to
justify a new InputFormat. It only makes sense when N is small (in practice, N=1 most of the
time) and specified by the user and when the file is small. The existing code covers the other
cases.
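
To make the small-N case concrete, here is a minimal sketch of grouping the lines of a
small control file into splits of N lines each, with the line number of the first line
kept as the key. It is plain Java with no Hadoop dependencies; LineChunks, Chunk and
chunk() are hypothetical names for illustration and are not the code in the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the InputFormat from the patch.
final class LineChunks {
    /** One split: the 0-based number of its first line and the lines it carries. */
    static final class Chunk {
        final long firstLine;
        final List<String> lines;

        Chunk(long firstLine, List<String> lines) {
            this.firstLine = firstLine;
            this.lines = lines;
        }
    }

    /** Groups consecutive lines into chunks of at most linesPerSplit lines each. */
    static List<Chunk> chunk(List<String> lines, int linesPerSplit) {
        if (linesPerSplit < 1) {
            throw new IllegalArgumentException("linesPerSplit must be >= 1");
        }
        List<Chunk> chunks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSplit) {
            int end = Math.min(i + linesPerSplit, lines.size());
            // Copy the sublist so each chunk owns its lines.
            chunks.add(new Chunk(i, new ArrayList<>(lines.subList(i, end))));
        }
        return chunks;
    }
}
```

With linesPerSplit = 1 every line becomes its own split; a larger value gives each map a
contiguous run of lines, and the last chunk may be shorter than the rest.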

bq. The only issue is that we might end up in a situation where a couple of datanodes in the
cluster becomes a bottleneck for the split serving

That's not likely to be a bottleneck for these jobs. The optimization isn't just about split
serving; it may also reduce the size of the split. Doing this with FileSplits sans locations
will probably end up with an average of 70-120 bytes per split, right? If the lines are shorter,
then embedding them in the split is a win. If a line is within 10-20% of that size, it's probably
still worth doing. It becomes less attractive as it converges to the cases we already cover.
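
A rough way to check that trade-off for a given control file is to compare the serialized
size of a (path, start, length) reference against the size of the line itself. The sketch
below uses java.io.DataOutputStream as a stand-in; it is an assumption for illustration and
not Hadoop's actual split serialization, and SplitSizeSketch and its methods are
hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; byte counts are for this stand-in encoding.
final class SplitSizeSketch {
    /** Bytes needed to write a (path, start, length) reference to a line. */
    static int referenceSize(String path, long start, long length) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(path);    // 2-byte length prefix + modified UTF-8 bytes
            out.writeLong(start);  // 8 bytes
            out.writeLong(length); // 8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    /** Bytes needed to write the line itself into the split. */
    static int embeddedSize(String line) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(line);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    /** True when embedding the line is no larger than referring to it. */
    static boolean embeddingWins(String path, long start, String line) {
        return embeddedSize(line) <= referenceSize(path, start, line.length());
    }
}
```

Under this encoding a short parameter line is always smaller than a reference carrying a
full path, and the advantage disappears as lines grow toward the size of the reference.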

bq. We don't make assumptions about the line lengths, etc. Just make one pass over the files
and arrive at the splits.

Both require a pass for the line numbers, if that's a requirement.
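
As a sketch of what that pass looks like for the byte-range approach: a single scan can
record, for every group of N lines, the first line number, the starting byte offset and the
length. This is plain Java under stated assumptions (newline-terminated lines, unbuffered
reads for brevity); LineOffsetPass, Range and scan() are hypothetical names, not code from
either patch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; real code would read through a buffer.
final class LineOffsetPass {
    /** A run of lines: first line number (0-based), start offset and byte length. */
    static final class Range {
        final long firstLine;
        final long start;
        final long length;

        Range(long firstLine, long start, long length) {
            this.firstLine = firstLine;
            this.start = start;
            this.length = length;
        }
    }

    /** Scans the stream once and emits one Range per linesPerSplit lines. */
    static List<Range> scan(InputStream in, int linesPerSplit) {
        if (linesPerSplit < 1) {
            throw new IllegalArgumentException("linesPerSplit must be >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        long pos = 0;         // bytes consumed so far
        long rangeStart = 0;  // offset where the current range begins
        long lineNo = 0;      // complete lines seen so far
        long firstLine = 0;   // line number at rangeStart
        int linesInRange = 0;
        try {
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    lineNo++;
                    if (++linesInRange == linesPerSplit) {
                        ranges.add(new Range(firstLine, rangeStart, pos - rangeStart));
                        rangeStart = pos;
                        firstLine = lineNo;
                        linesInRange = 0;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (pos > rangeStart) {
            // Leftover lines, including a final line without a trailing newline.
            ranges.add(new Range(firstLine, rangeStart, pos - rangeStart));
        }
        return ranges;
    }
}
```

The same scan yields the line numbers needed as keys if lines are embedded in the split
instead; only what gets stored per range differs.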

A lot seems to hinge on this. If it is a requirement that the path be included, then there's
no longer any real advantage to embedding the line with the split. If users don't need that
context, then there are some potential advantages to the core approach in the current patch.

> Need a "LineBasedTextInputFormat"
> ---------------------------------
>
>                 Key: HADOOP-3221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3221
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.16.2
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.18.0
>
>         Attachments: patch-3221-1.txt, patch-3221.txt
>
>
> In many "pleasantly" parallel applications, each process/mapper processes the same input
> file(s), but with computations controlled by different parameters.
> (Referred to as "parameter sweeps").
> One way to achieve this is to specify a set of parameters (one set per line) as input
> in a control file (which is the input path to the map-reduce application, whereas the input
> dataset is specified via a config variable in JobConf).
> It would be great to have an InputFormat that splits the input file such that, by default,
> one line is fed as a value to one map task, and the key could be the line number, i.e. (k,v)
> is (LongWritable, Text).
> If the user specifies the number of maps explicitly, each mapper should get a contiguous
> chunk of lines (so as to load balance between the mappers).
> The location hints for the splits should not be derived from the input file, but rather
> should span the whole mapred cluster.
> (Is there a way to do this without having to return an array of nSplits*nTaskTrackers?)
> Increasing the replication of the "real" input dataset (since it will be fetched by all
> the nodes) is orthogonal, and one can use DistributedCache for that.
> (P.S. Please choose a better name for this InputFormat. I am not in love with the
> "LineBasedText" name.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

