hadoop-common-dev mailing list archives

From "Sanjay Dahiya (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-788) Streaming should use a subclass of TextInputFormat for reading text inputs.
Date Wed, 31 Jan 2007 09:53:06 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sanjay Dahiya updated HADOOP-788:
---------------------------------

    Status: Patch Available  (was: Open)

> Streaming should use a subclass of TextInputFormat for reading text inputs.
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-788
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Owen O'Malley
>         Assigned To: Sanjay Dahiya
>         Attachments: Hadoop-788.patch
>
>
> Currently streaming uses a lot of custom code for processing text inputs. 
> I propose:
>  1. Move class LineRecordReader out of TextInputFormat.
>  2. Make class StreamLineRecordReader extend LineRecordReader.
>  3. StreamLineRecordReader uses LineRecordReader.next to read the lines and splits them
>     on tab to generate a Text/Text key/value pair.
> This will remove a lot of code from streaming and give it automatic support for the compression
> codecs that the "base" part of Hadoop enjoys. In particular, if the native zlib code is used,
> it will remove the 2 GB limit on compressed files.
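
For illustration only, step 3 of the proposal (splitting each line on tab into a key/value pair) might look roughly like the sketch below. It is plain Java with no Hadoop dependency and is not taken from Hadoop-788.patch. The class name TabSplitSketch and the method splitOnTab are hypothetical. Splitting at the first tab only, and treating a line without a tab as a key with an empty value, are both assumptions rather than behavior stated in the issue.

```java
// Hypothetical helper, not part of Hadoop and not taken from the attached patch.
// It illustrates only the "split on tab" part of step 3; in the proposal the
// line itself would come from LineRecordReader.next.
final class TabSplitSketch {

    /** Splits a line at its first tab and returns {key, value}. */
    static String[] splitOnTab(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // Assumption: a line without a tab becomes the key, with an empty value.
            return new String[] { line, "" };
        }
        // Everything before the first tab is the key; the remainder,
        // including any later tabs, is the value.
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitOnTab("user42\tclicked\thome");
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

In the design proposed above, the two strings would presumably be wrapped in Text objects, and decompression would be handled by the underlying reader rather than by streaming-specific code.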

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

