hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"
Date Mon, 12 May 2008 08:31:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596007#action_12596007 ]

Devaraj Das commented on HADOOP-3221:
-------------------------------------

I agree with Chris that the JobTracker shouldn't load the lines into memory. I think we should
make this work with FileSplit (minus the locations info). A pass over the input files containing
the lines will tell us how many lines there are. The number of maps that the user desires
will give us the number of lines per map (goalsize). The offsets in the input files can then
be derived in a second pass over the input files (with the pass breaking at file boundaries
just like the FileSplit case). Would this satisfy the requirements?

> Need a "LineBasedTextInputFormat"
> ---------------------------------
>
>                 Key: HADOOP-3221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3221
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.16.2
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.18.0
>
>         Attachments: patch-3221-1.txt, patch-3221.txt
>
>
> In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters.
> (Referred to as "parameter sweeps").
> One way to achieve this is to specify a set of parameters (one set per line) as input in a control file (which is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in JobConf).
> It would be great to have an InputFormat that splits the input file such that, by default, one line is fed as a value to one map task, and the key could be the line number, i.e. (k,v) is (LongWritable, Text).
> If the user specifies the number of maps explicitly, each mapper should get a contiguous chunk of lines (so as to load balance between the mappers).
> The location hints for the splits should not be derived from the input file, but rather should span the whole mapred cluster.
> (Is there a way to do this without having to return an array of nSplits*nTaskTrackers?)
> Increasing the replication of the "real" input dataset (since it will be fetched by all the nodes) is orthogonal, and one can use DistributedCache for that.
> (P.S. Please choose a better name for this InputFormat. I am not in love with the "LineBasedText" name.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

