hadoop-mapreduce-issues mailing list archives

From "Dmitry Sivachenko (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-6085) Facilitate processing of text files without key/value split
Date Fri, 12 Sep 2014 22:36:34 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Sivachenko updated MAPREDUCE-6085:
    Assignee:     (was: Dmitry Sivachenko)

> Facilitate processing of text files without key/value split
> -----------------------------------------------------------
>                 Key: MAPREDUCE-6085
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6085
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.4.1
>            Reporter: Dmitry Sivachenko
>         Attachments: IdentifierResolver1.java.patch
> A rather common type of task is processing text files line by line in streaming mode, without splitting each line into a key/value pair (UNIX commands such as grep and awk, or any filter script).
> By default, the Hadoop streaming interface uses TextInputFormat, which suits this task well: it passes the input line itself to the streaming job's stdin.
> The TextOutputReader class, which receives the streaming job's output, splits each line into a key/value pair, and TextOutputFormat then tries to join the pair back together with a separator.
> In some cases this results in an extra separator appearing in the output.
> KeyOnlyTextOutputReader solves this problem: it passes the whole line as the key with a null value, and TextOutputFormat then writes it correctly, without inserting any separator.
> I propose to add another IdentifierResolver, "keyonlytextoutput", which uses the standard TextInputWriter but replaces TextOutputReader with KeyOnlyTextOutputReader.
> As a result, lines of text are never split into a key/value pair and never joined back, so they appear in the output unmodified.
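
The behavior described in the issue can be illustrated with a minimal, self-contained Java sketch. It is a simplified model, not the Hadoop code or the attached patch: the class KeyOnlyOutputSketch and all of its methods are hypothetical stand-ins for TextOutputReader, KeyOnlyTextOutputReader, TextOutputFormat and the identifier lookup. It also assumes that the "some cases" mentioned above are lines that contain no separator, where the split yields an empty (but non-null) value and the writer therefore appends a trailing tab.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified model of the streaming output path; not the real Hadoop classes.
class KeyOnlyOutputSketch {
    static final String SEP = "\t"; // assumed key/value separator

    /** A key/value pair; a null value models "no value at all". */
    static final class Pair {
        final String key;
        final String value;
        Pair(String key, String value) { this.key = key; this.value = value; }
    }

    // Stand-in for TextOutputReader: split at the first separator.
    // Assumption: with no separator, the whole line becomes the key
    // and the value is empty, not null.
    static Pair splitReader(String line) {
        int pos = line.indexOf(SEP);
        if (pos < 0) {
            return new Pair(line, "");
        }
        return new Pair(line.substring(0, pos), line.substring(pos + SEP.length()));
    }

    // Stand-in for KeyOnlyTextOutputReader: whole line as key, null value.
    static Pair keyOnlyReader(String line) {
        return new Pair(line, null);
    }

    // Stand-in for TextOutputFormat: the separator is written only
    // when the value is non-null.
    static String format(Pair p) {
        return p.value == null ? p.key : p.key + SEP + p.value;
    }

    // Hypothetical identifier lookup, loosely mirroring the proposal:
    // "keyonlytextoutput" selects the key-only reader, anything else
    // falls back to the splitting reader.
    static Function<String, Pair> resolveOutputReader(String identifier) {
        Map<String, Function<String, Pair>> readers = new HashMap<>();
        readers.put("text", KeyOnlyOutputSketch::splitReader);
        readers.put("keyonlytextoutput", KeyOnlyOutputSketch::keyOnlyReader);
        return readers.getOrDefault(identifier.toLowerCase(),
                KeyOnlyOutputSketch::splitReader);
    }

    // Read one line of job output and write it back out.
    static String roundTrip(String identifier, String line) {
        return format(resolveOutputReader(identifier).apply(line));
    }

    public static void main(String[] args) {
        String line = "a line without a tab";
        // Brackets make a trailing separator visible.
        System.out.println("[" + roundTrip("text", line) + "]");
        System.out.println("[" + roundTrip("keyonlytextoutput", line) + "]");
    }
}
```

In this model the "text" path appends a tab to a line that had none, while the "keyonlytextoutput" path returns the line unchanged. A line that already contains a tab comes out the same on both paths.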

This message was sent by Atlassian JIRA
