hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"
Date Sun, 11 May 2008 03:41:55 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3221:
----------------------------------

    Status: Open  (was: Patch Available)

This implements something slightly different from the requirements as stated, i.e. it takes
input file(s) and encodes each line (or a subset of lines) as a split, rather than specifying
a partition of a resource with one split per line. This has some clear advantages for the
issue at hand, i.e. one map per line of text, where a vanilla FileSplit is likely as large
(path + offsets + locations) as the relevant line of text, and placement hints avoid being
misleading.

That said, slurping all the input files and writing their contents into the splits may not
be the best approach. The result is likely to be close to guessing even offsets into each
input (without reading each file), and while there's a possible space savings if both the
line length and N are small, it's close enough that the value added may not distinguish it
from an InputFormat returning closely cropped FileSplits, stripped of locations. The use and
purpose of this new InputFormat might be clearer (though not what this patch implements) if
one set a property that governs how many lines are in each split (defaulting to 1).\* Since
the JobTracker has to read in all the splits (and hold them in memory for the duration of
the job), limiting the size of the file the user points this at would be a good idea (via a
property that, if said user felt daring or malicious, he could cast off). If you felt daring,
you could even mix stripped-down FileSplits with LineSplits based on the length of each section,
since the classname of each split is encoded into job.splits.
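The grouping suggested above (a configurable N lines per split, recording only offset and
length, like a FileSplit stripped of locations) can be sketched without any Hadoop types.
The class and method names below are illustrative, not part of any Hadoop API; a file whose
line count isn't a multiple of N simply yields a short final split:

```java
import java.util.ArrayList;
import java.util.List;

public class LineSplitSketch {

    /** A stripped-down stand-in for FileSplit: a byte range, no location hints. */
    public static final class Range {
        public final long start;
        public final long length;
        Range(long start, long length) { this.start = start; this.length = length; }
    }

    /**
     * Group the lines of {@code data} into splits of at most {@code n} lines
     * each, recording only (offset, length) per split. Trailing lines that
     * don't fill a group of n become a smaller final split.
     */
    public static List<Range> splitByLines(byte[] data, int n) {
        List<Range> splits = new ArrayList<>();
        long splitStart = 0;
        int linesInSplit = 0;
        for (int pos = 0; pos < data.length; pos++) {
            if (data[pos] == '\n') {
                linesInSplit++;
                if (linesInSplit == n) {
                    splits.add(new Range(splitStart, pos + 1 - splitStart));
                    splitStart = pos + 1;
                    linesInSplit = 0;
                }
            }
        }
        // Remainder: lines left over when the line count isn't a multiple of n,
        // including a last line with no terminating newline.
        if (splitStart < data.length) {
            splits.add(new Range(splitStart, data.length - splitStart));
        }
        return splits;
    }
}
```

Each Range here costs two longs, which is the space comparison made above against encoding
the line text itself into the split.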

A few nits:
* This should be in o.a.h.mapred.lib, not o.a.h.mapred
* Since the map expects Text, LineSplit might as well keep Text[] rather than String[]
* It might be worthwhile to use LineRecordReader instead of InputStreamReader
* I'm fairly certain that "line number" should not be local to the split, but either the line
number in the original input file or an offset into that file.
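On the last nit, keying each line by its position in the whole input rather than within the
split is what a LineRecordReader-style reader already does: its key is the byte offset of
the line in the file. A minimal plain-Java illustration (no Hadoop types; the names are
hypothetical), where offsets stay global even when reading starts partway through the file:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineOffsets {

    /**
     * Map each line of {@code data} to its byte offset from the start of the
     * whole input, mirroring the (offset, line) pairs a LineRecordReader-style
     * reader would emit. {@code splitStart} is the byte position in the file
     * where this buffer begins, so keys are file-global, not split-local.
     */
    public static Map<Long, String> offsetsOf(byte[] data, long splitStart) {
        Map<Long, String> out = new LinkedHashMap<>();
        int lineStart = 0;
        for (int pos = 0; pos < data.length; pos++) {
            if (data[pos] == '\n') {
                out.put(splitStart + lineStart,
                        new String(data, lineStart, pos - lineStart));
                lineStart = pos + 1;
            }
        }
        // Final line without a terminating newline.
        if (lineStart < data.length) {
            out.put(splitStart + lineStart,
                    new String(data, lineStart, data.length - lineStart));
        }
        return out;
    }
}
```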

\* Semantically, it's not clear how to regard files with a number of lines not evenly divisible
by N; the current patch would group lines from different files into the same split, which
might not be what users would expect, but the particular choice is not critical as long as
it's documented.

> Need a "LineBasedTextInputFormat"
> ---------------------------------
>
>                 Key: HADOOP-3221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3221
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.16.2
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.18.0
>
>         Attachments: patch-3221-1.txt, patch-3221.txt
>
>
> In many "pleasantly" parallel applications, each process/mapper processes the same input
file(s), but the computations are controlled by different parameters.
> (Referred to as "parameter sweeps".)
> One way to achieve this is to specify a set of parameters (one set per line) as input
in a control file (which is the input path to the map-reduce application, whereas the input
dataset is specified via a config variable in JobConf).
> It would be great to have an InputFormat that splits the input file such that, by default,
one line is fed as a value to one map task, and the key could be the line number, i.e. (k,v)
is (LongWritable, Text).
> If the user specifies the number of maps explicitly, each mapper should get a contiguous
chunk of lines (so as to load-balance between the mappers).
> The location hints for the splits should not be derived from the input file, but rather,
should span the whole mapred cluster.
> (Is there a way to do this without having to return an array of nSplits*nTaskTrackers
?)
> Increasing the replication of the "real" input dataset (since it will be fetched by all
the nodes) is orthogonal, and one can use DistributedCache for that.
> (P.S. Please choose a better name for this InputFormat. I am not in love with the "LineBasedText"
name.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

