hadoop-common-dev mailing list archives

From "Johannes Herr (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-10921) MapFile.fix fails silently when file is block compressed
Date Fri, 01 Aug 2014 15:06:38 GMT
Johannes Herr created HADOOP-10921:

             Summary: MapFile.fix fails silently when file is block compressed
                 Key: HADOOP-10921
                 URL: https://issues.apache.org/jira/browse/HADOOP-10921
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 0.20.2
            Reporter: Johannes Herr

MapFile provides a method 'fix' to reconstruct missing 'index' files. If the 'data' file is
block compressed, the method computes offsets that are too large, which leads to keys
not being found in the MapFile. (See the attached test case.)
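The failure mode can be sketched roughly as follows. This is a hedged illustration, not the attached test case; the path and key/value choices are placeholders, and the API calls follow the old (0.20-era) signatures:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapFileFixSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    String dir = "/tmp/mapfile-10921";  // hypothetical path

    // Build a BLOCK-compressed MapFile with enough entries for
    // several index intervals and several compressed blocks.
    MapFile.Writer writer = new MapFile.Writer(conf, fs, dir,
        Text.class, Text.class, SequenceFile.CompressionType.BLOCK);
    for (int i = 0; i < 10000; i++) {
      writer.append(new Text(String.format("k%08d", i)), new Text("v" + i));
    }
    writer.close();

    // Simulate a lost index, then rebuild it with MapFile.fix.
    fs.delete(new Path(dir, "index"), false);
    MapFile.fix(fs, new Path(dir), Text.class, Text.class, false, conf);

    // With the bug, get() can return null even though the key is
    // present, because the rebuilt index offsets overshoot the
    // compressed block that actually holds the entry.
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    Text val = new Text();
    Writable found = reader.get(new Text(String.format("k%08d", 9999)), val);
    System.out.println(found == null ? "key not found (bug)" : "key found");
    reader.close();
  }
}
```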

Tested against 0.20.2, but the trunk version appears to have the same problem.

The cause of the problem is that 'dataReader.getPosition()' is used to find the offset to write
for the next entry that should be indexed. When the file is block compressed, however, 'dataReader.getPosition()'
seems to return the position of the next compressed block, not of the block that contains the
last entry. This position will thus be too large in most cases, and a seek operation with this
offset will incorrectly report the key as not present.
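A simplified sketch of the indexing loop (not the actual Hadoop source; 'dataReader', 'indexWriter', and 'indexInterval' are approximations of the internals) shows where the offset overshoots:

```java
// For a block-compressed 'data' file, getPosition() reports the start
// of the NEXT compressed block rather than the block holding the entry
// just read, so the offset recorded for an indexed key is too large.
long pos = dataReader.getPosition();   // offset captured before reading
long count = 0;
while (dataReader.next(key, value)) {
  if (count % indexInterval == 0) {
    // With block compression this offset points past the entry's block,
    // so a later seek to it skips the key entirely.
    indexWriter.append(key, new LongWritable(pos));
  }
  pos = dataReader.getPosition();      // next block's start, not this entry's block
  count++;
}
```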

I think it's not obvious how to fix this, since the SequenceFile reader does not expose the
offset of the currently buffered entries. I've experimented with watching the offset change,
and that mostly seems to work, but it is quite ugly and not exact in edge cases.

The method should probably throw an exception when the 'data' file is block compressed instead
of silently creating invalid files. A workaround for block-compressed files is to read the
sequence file, write the entries to a new MapFile, and then replace the old file. This also
avoids the problems mentioned below.
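The workaround can be sketched as follows, assuming old-style (0.20-era) constructors; the paths and Text key/value types are placeholders to adapt to the actual data:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RebuildMapFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path brokenData = new Path("/maps/broken/data");  // hypothetical path
    String rebuilt = "/maps/rebuilt";                 // hypothetical path

    // Read the block-compressed 'data' file directly as a SequenceFile
    // and append each entry to a fresh MapFile. The Writer records
    // correct index offsets as it goes, sidestepping MapFile.fix.
    SequenceFile.Reader in = new SequenceFile.Reader(fs, brokenData, conf);
    MapFile.Writer out = new MapFile.Writer(conf, fs, rebuilt,
        Text.class, Text.class,
        SequenceFile.CompressionType.BLOCK);  // keep block compression

    Text key = new Text();
    Text value = new Text();
    while (in.next(key, value)) {
      out.append(key, value);
    }
    in.close();
    out.close();
    // Finally, replace the old MapFile directory with the rebuilt one.
  }
}
```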

A few side notes: 

1. The 'index' files created by the fix method are not block compressed (whereas the 'index'
files created by MapFile.Writer always are, since the 'index' file is read completely anyway).

2. The fix method does not index the first entry; the Writer does.

3. The header offset is not used.

This message was sent by Atlassian JIRA