hadoop-common-dev mailing list archives

From Arkady Borkovsky <ark...@yahoo-inc.com>
Subject Re: [jira] Created: (HADOOP-788) Streaming should use a subclass of TextInputFormat for reading text inputs.
Date Thu, 07 Dec 2006 20:02:06 GMT
Please make sure that this fix does not bring back the UTF-8 problems:
streaming expects to see exactly the same bytes that are in the input files.
No encoding conversion should happen unless it is explicitly requested
(perhaps by a command line option that does not exist yet).
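
To illustrate the concern, here is a minimal sketch (not Hadoop's actual
code) of splitting a line into a key and a value at the first tab, as
HADOOP-788 proposes, while working purely on bytes so that nothing is
decoded or re-encoded. The class name TabSplitSketch, the method
splitOnFirstTab, and the choice to treat a line without a tab as a key with
an empty value are all assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical illustration only; these names are not part of Hadoop.
class TabSplitSketch {

    /**
     * Splits a raw line at the first tab byte (0x09) and returns {key, value}.
     * The input is never converted to a String, so every byte is preserved,
     * including bytes that are not valid UTF-8.
     * Assumption: a line with no tab becomes the key, with an empty value.
     */
    static byte[][] splitOnFirstTab(byte[] line) {
        for (int i = 0; i < line.length; i++) {
            if (line[i] == '\t') {
                return new byte[][] {
                    Arrays.copyOfRange(line, 0, i),
                    Arrays.copyOfRange(line, i + 1, line.length)
                };
            }
        }
        return new byte[][] { line.clone(), new byte[0] };
    }

    public static void main(String[] args) {
        // Key is the two UTF-8 bytes of U+00E9; value holds 0xFF, which is not valid UTF-8.
        byte[] line = {(byte) 0xC3, (byte) 0xA9, '\t', (byte) 0xFF, 'x'};
        byte[][] kv = splitOnFirstTab(line);
        // prints "2 key bytes, 2 value bytes"
        System.out.println(kv[0].length + " key bytes, " + kv[1].length + " value bytes");
    }
}
```

Scanning for the byte 0x09 is safe for UTF-8 input because that byte never
occurs inside a multi-byte UTF-8 sequence. The 0xFF byte in the value would
be altered by a decode and re-encode round trip through a String, which is
the kind of silent conversion that should be avoided.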


On Dec 6, 2006, at 1:14 PM, Owen O'Malley (JIRA) wrote:

> Streaming should use a subclass of TextInputFormat for reading text inputs.
> ---------------------------------------------------------------------------
>
>                  Key: HADOOP-788
>                  URL: http://issues.apache.org/jira/browse/HADOOP-788
>              Project: Hadoop
>           Issue Type: Improvement
>           Components: contrib/streaming
>             Reporter: Owen O'Malley
>          Assigned To: Sanjay Dahiya
>
>
> Currently streaming uses a lot of custom code for processing text  
> inputs.
>
> I propose:
>
>  1. Move class LineRecordReader  out of TextInputFormat.
>  2. Make class StreamLineRecordReader extend LineRecordReader.
>  3. StreamLineRecordReader uses LineRecordReader.next to read the  
> lines and splits them on tab to generate a Text/Text key/value pair.
>
> This will remove a lot of code from streaming and give it automatic  
> support for the compression codecs that the "base" part of Hadoop  
> enjoys. In particular, if the native zlib code is used, it will remove  
> the 2 GB limit on compressed files.
>

