hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1214) the first step for streaming clean up
Date Thu, 12 Apr 2007 00:10:32 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488228 ]

Runping Qi commented on HADOOP-1214:
------------------------------------


This patch includes the following changes:

1. Introduce the FileInputFormat class. It is simply a rename of the InputFormatBase class;
InputFormatBase now just extends FileInputFormat and is deprecated. (Owen's idea.)

2. The TextInputFormat and SequenceFileInputFormat classes now extend FileInputFormat instead
of InputFormatBase.

3. Add the following classes:

        KeyValueTextInputFormat, KeyValueLineRecordReader: these are similar to TextInputFormat
        and LineRecordReader, except that instead of setting the key to a LongWritable holding
        the current position and the value to the whole line, KeyValueLineRecordReader splits
        each line into key and value parts at the first tab character in the line.

        SequenceFileToLineInputFormat, SequenceFileLineRecordReader: these are similar to
        SequenceFileInputFormat and SequenceFileRecordReader, except that
        SequenceFileLineRecordReader converts the keys and values to their string
        representations by calling their toString() methods.

        These classes are mainly for Hadoop streaming use, though they can be used by anybody.
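
The key/value split described for KeyValueLineRecordReader can be illustrated with a minimal standalone sketch. This is not the code from the patch: the class and method names below are hypothetical, and the handling of a line with no tab (whole line as key, empty value) is an assumption, since the comment does not say what happens in that case.

```java
public class KeyValueSplitSketch {

    /**
     * Splits a line into {key, value} at the first tab character.
     * Hypothetical helper for illustration only.
     */
    static String[] splitAtFirstTab(String line) {
        int pos = line.indexOf('\t');
        if (pos < 0) {
            // Assumption: with no tab, the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        // Everything before the first tab is the key; everything after it,
        // including any later tabs, is the value.
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitAtFirstTab("user42\tclicked\thome");
        System.out.println("key=[" + kv[0] + "] value=[" + kv[1] + "]");
    }
}
```

Only the first tab acts as the separator, so in the example the key is user42 and the value keeps its own embedded tab.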

4. Modify the Hadoop streaming command line to take the following options:

     -partitioner JavaClass
     -outputformat JavaClass
     -inputformat JavaClass
     -additionalconfspec configFile.xml

    The first three options allow the user to specify the input format, output format, and
    partitioner classes just as in the non-streaming case. The -additionalconfspec configFile.xml
    option allows the user to specify a set of attr/value pairs in a single XML file, rather
    than having to specify them through multiple -jobconf options.
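
To make the attr/value file idea concrete, here is a rough sketch that reads name/value pairs from an XML document. It assumes the file uses the usual Hadoop configuration layout (property elements with name and value children); the comment does not state the format, so treat that as an assumption. The class name, method name, and the my.app.threshold property are hypothetical, and the sketch uses only the JDK's XML parser rather than Hadoop's own configuration classes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ConfSpecSketch {

    /** Reads every property's name/value pair, in document order. */
    static Map<String, String> readPairs(String xml) {
        Map<String, String> pairs = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                pairs.put(name, value);
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration XML", e);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Assumed file layout; my.app.threshold is a made-up property.
        String xml =
              "<configuration>"
            + "<property><name>mapred.reduce.tasks</name><value>4</value></property>"
            + "<property><name>my.app.threshold</name><value>0.75</value></property>"
            + "</configuration>";
        readPairs(xml).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Each pair in such a file would otherwise need its own -jobconf option on the command line.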

5. Add the following junit tests:

    TestSequenceFileToLineInputFormat
    TestKeyValueTextInputFormat

6. Update the hadoop streaming unit tests to use the new features.

All the unit tests passed on my local dev machines.
I've also tested the new version by running a few applications with it.
All applications behaved correctly and produced the expected results.



 

> the first step for streaming clean up
> -------------------------------------
>
>                 Key: HADOOP-1214
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1214
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>         Attachments: patch-1214.txt
>
>
> This is the first step for streaming clean up.
> This step will mainly replace various streaming classes related to input/output formats, record readers, etc. with Hadoop's counterparts.
> This step will maintain backward compatibility.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

