hadoop-mapreduce-dev mailing list archives

From "jay vyas (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-5511) Multifilewc and the mapred.* API: Is the use of getPos() valid?
Date Mon, 16 Sep 2013 15:35:54 GMT
jay vyas created MAPREDUCE-5511:
-----------------------------------

             Summary: Multifilewc and the mapred.* API:  Is the use of getPos() valid?
                 Key: MAPREDUCE-5511
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5511
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: examples
            Reporter: jay vyas
            Priority: Minor


The MultiFileWordCount class in the hadoop examples library uses a record reader which switches
between files.  This behaviour can cause the RawLocalFileSystem to break in a concurrent environment
because of the way buffering works: in RawLocalFileSystem, switching between streams leaves the
inner stream temporarily null, and the getPos() implementation in the custom RecordReader for
MultiFileWordCount calls that inner stream.
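
The failure mode described above can be illustrated with a toy model. This is a
sketch only, not Hadoop code: ToyPositionedStream and ToyMultiFileReader are
hypothetical names invented for the illustration, and the model assumes nothing
about RawLocalFileSystem beyond the "inner stream is temporarily null" behaviour
reported here.

```java
/** Toy stand-in for a stream that knows its own position (hypothetical, not a Hadoop class). */
final class ToyPositionedStream {
    private final byte[] data;
    private int pos = 0;

    ToyPositionedStream(byte[] data) { this.data = data; }

    /** Returns the next byte as an unsigned value, or -1 at end of data. */
    int read() { return pos < data.length ? (data[pos++] & 0xff) : -1; }

    long getPos() { return pos; }
}

/** Toy reader that switches between files and delegates getPos() to the current stream. */
final class ToyMultiFileReader {
    private final byte[][] files;
    private int nextFile = 0;
    private ToyPositionedStream current; // null before the first file and between files

    ToyMultiFileReader(byte[]... files) { this.files = files; }

    /** Drops the current stream and opens the next file; returns false when none remain. */
    boolean openNextFile() {
        current = null;
        if (nextFile >= files.length) {
            return false;
        }
        current = new ToyPositionedStream(files[nextFile++]);
        return true;
    }

    /** Leaves the reader in the in-between state where no inner stream exists. */
    void closeCurrentFile() { current = null; }

    int read() { return current == null ? -1 : current.read(); }

    /** Naive delegation: throws NullPointerException while 'current' is null. */
    long getPos() { return current.getPos(); }
}
```

Calling getPos() after closeCurrentFile() and before the next openNextFile() hits
the null inner stream, which is the window the report is concerned with.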

There are basically 2 ways to handle this:

1) Wrap the getPos() implementation in the object returned by open() in RawLocalFileSystem
so that it caches the position every time it is called; calls to getPos() can then return
a valid long even if the underlying stream is null. OR

2) Update the RecordReader in multifilewc so that it does not rely on the inner input stream:
cache the position, or return 0 if the stream cannot return a valid value.
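
Both options can be sketched in plain Java. This is a minimal illustration under
stated assumptions, not a patch: CachingPosition and SelfTrackingReader are
hypothetical names, a LongSupplier stands in for the real stream's getPos(), and
the second class counts consumed bytes across all files, which is only one
possible reading of "cache the position".

```java
import java.util.function.LongSupplier;

/** Option 1 sketch: wrap a position source and remember the last value it returned. */
final class CachingPosition {
    private final LongSupplier delegate; // assumption: stands in for the stream's getPos()
    private long lastKnown = 0L;         // 0 until the delegate has answered at least once

    CachingPosition(LongSupplier delegate) { this.delegate = delegate; }

    long getPos() {
        try {
            lastKnown = delegate.getAsLong();
        } catch (RuntimeException e) {
            // Stream unavailable (for example null or closed): fall back to the cached value.
        }
        return lastKnown;
    }
}

/** Option 2 sketch: the reader counts consumed bytes itself and never asks the stream. */
final class SelfTrackingReader {
    private final byte[][] files;
    private int fileIndex = 0;
    private int offsetInFile = 0;
    private long consumed = 0L;

    SelfTrackingReader(byte[]... files) { this.files = files; }

    /** Returns the next byte across all files, or -1 once every file is exhausted. */
    int read() {
        while (fileIndex < files.length) {
            if (offsetInFile < files[fileIndex].length) {
                consumed++;
                return files[fileIndex][offsetInFile++] & 0xff;
            }
            fileIndex++;      // current file exhausted: move on to the next one
            offsetInFile = 0;
        }
        return -1;
    }

    /** Total bytes consumed so far; valid even when no stream is open. */
    long getPos() { return consumed; }
}
```

Option 1 keeps the fix inside the file system wrapper, so every caller benefits,
but it silently hides a missing stream. Option 2 keeps the change local to the
example's reader, at the cost of changing what the returned position means.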

The final question here is:  Is the RecordReader for MultiFileWordCount doing the right thing,
or is it breaking the contract of getPos()?  And what SHOULD getPos() return if the
underlying stream has already been consumed?



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
