hadoop-mapreduce-user mailing list archives

From Subroto <ssan...@datameer.com>
Subject Calculation of BYTES_READ counter in TrackedRecordReader can give incorrect result
Date Thu, 21 Jun 2012 22:53:09 GMT

I have a RecordReader implementation which reads records asynchronously and caches them
in memory (in a BlockingQueue).
When TrackingRecordReader asks for the next record, the internal implementation of the
RecordReader takes it from the queue and supplies it to the MapTask.
The TrackingRecordReader increments the BYTES_READ counter by:
bytesInCurr - bytesInPrev
where bytesInCurr is the FSStatistics bytes-read value after the call to next and
bytesInPrev is the same value before the call to next.
Because the records have usually been read before next is called, bytesInCurr - bytesInPrev
is zero most of the time, or some other value if the asynchronous thread happens to be
running during the call.

Earlier, the BYTES_READ counter was derived from the getPos() method, which my RecordReader
handled properly.
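
To make the effect concrete, here is a minimal sketch in plain Java, not actual Hadoop code. The class AsyncCounterDemo, the fsBytesRead counter (a stand-in for the FSStatistics bytes-read value) and the sample records are all hypothetical names chosen for illustration. It assumes the background thread has finished reading before the first call to next, which is the extreme case of the situation described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

class AsyncCounterDemo {
    // Hypothetical stand-in for the file system's bytes-read statistic.
    static final AtomicLong fsBytesRead = new AtomicLong();

    /** Returns {bytes counted by the before/after delta, bytes actually read}. */
    public static long[] run() {
        fsBytesRead.set(0);
        String[] records = {"alpha", "beta", "gamma"};
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Background reader: "reads" every record and updates the statistic.
        Thread prefetcher = new Thread(() -> {
            for (String r : records) {
                fsBytesRead.addAndGet(r.getBytes(StandardCharsets.UTF_8).length);
                queue.add(r);
            }
        });

        long counted = 0;
        try {
            prefetcher.start();
            prefetcher.join(); // assumption: everything is cached before next() is called

            for (int i = 0; i < records.length; i++) {
                long bytesInPrev = fsBytesRead.get(); // statistic before next()
                queue.take();                         // next(): served from the cache
                long bytesInCurr = fsBytesRead.get(); // statistic after next()
                counted += bytesInCurr - bytesInPrev; // delta is 0 for cached records
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] {counted, fsBytesRead.get()};
    }

    public static void main(String[] args) {
        long[] r = run();
        System.out.println("counted=" + r[0] + " actual=" + r[1]);
    }
}
```

In this sketch the delta-based count stays at 0 while 14 bytes were actually read. If the prefetcher were still running during the loop, some of its reads could land between the two samples, so the counted value would depend on thread timing.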

I would like to get opinions on whether the current way of calculating BYTES_READ in
TrackingRecordReader is correct, as it compels the user to read records synchronously.

Please let me know if there is any workaround for getting the correct statistics from the
MR job.

Subroto Sanyal