pig-dev mailing list archives

From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-3655) BinStorage and InterStorage approach to record markers is broken
Date Thu, 20 Jul 2017 19:27:00 GMT

    [ https://issues.apache.org/jira/browse/PIG-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095229#comment-16095229 ]

Rohini Palaniswamy commented on PIG-3655:

Some comments on the latest patch:
1) It would be good to get rid of the i > 0 check within the loop. The b == -1 check can also
be removed when throwing the exception:
      if ((byte) b != syncMarker[0]) {
          throw new IOException("Corrupt data file, expected sync marker at position " + in.getPosition());
      }
      int i = 1;
      while (i < syncMarker.length) {
          b = in.read();
          if ((byte) b != syncMarker[i]) {
              throw new IOException("Corrupt data file, expected sync marker at position " + in.getPosition());
          }
          i++;
      }

2) The split illustration is very useful and makes the test easier to understand. Thanks for
adding that. Still, it would be good to have a one-line explanation of the test at the beginning
of the javadoc as well, for readability.

3) Please shorten the test name testSyncMarkerOverlappingMarkerWithSplitsWithoutMarkerBeforeAndAfter,
or move that case into the testSyncMarkerOverlappingMarker() test case itself.

4) testSyncMarkerOneMarkerAtBeginningOnly - The test has testInterStorageSyncMarker(1024, 10,
2000L), which means there will be only one split. Can you lower the split size so that there
are at least 3 splits?

5) The representation of [ 22 ] has spaces while the other record sizes don't.

> BinStorage and InterStorage approach to record markers is broken
> ----------------------------------------------------------------
>                 Key: PIG-3655
>                 URL: https://issues.apache.org/jira/browse/PIG-3655
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.8.1, 0.9.0, 0.9.1,
0.9.2, 0.10.0, 0.11, 0.10.1, 0.12.0, 0.11.1
>            Reporter: Jeff Plaisance
>            Assignee: Adam Szita
>         Attachments: PIG-3655.0.patch, PIG-3655.1.patch, PIG-3655.2.patch, PIG-3655.3.patch,
> The way that the record readers for these storage formats seek to the first record in
an input split is to find the byte sequence 1 2 3 110 for BinStorage or 1 2 3 19-21|28-30|36-45
for InterStorage. If this sequence occurs in the data for any reason other than to mark the
start of a tuple (for example, the integer 16909166 stored big-endian encodes to the BinStorage
byte sequence), it can cause mysterious failures in pig jobs because the record reader will
try to decode garbage and fail.
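As a side note (not from any patch), the collision example in the description is easy to verify: writing the int 16909166 big-endian yields exactly BinStorage's marker bytes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sanity check of the collision example: 16909166 in big-endian bytes
// is 0x01 0x02 0x03 0x6E, i.e. exactly 1 2 3 110 -- BinStorage's marker.
public class MarkerCollision {
    public static void main(String[] args) {
        byte[] bytes = ByteBuffer.allocate(4).putInt(16909166).array();
        System.out.println(Arrays.toString(bytes)); // [1, 2, 3, 110]
    }
}
```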
> For this approach of using an unlikely sequence to mark record boundaries, it is important
to reduce the probability of the sequence occurring naturally in the data by ensuring that
your record marker is sufficiently long. Hadoop SequenceFile uses 128 bits for this and randomly
generates the sequence for each file (selecting a fixed, predetermined value opens up the
possibility of a mean person intentionally sending you that value). This makes it extremely
unlikely that collisions will occur. In the long run I think that pig should also be doing
this.
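To make the SequenceFile comparison concrete, a minimal sketch of per-file random marker generation (an assumption about how pig could do it, not code from any patch; the class and method names are hypothetical):

```java
import java.security.SecureRandom;

// Sketch: a randomly generated 128-bit (16-byte) sync marker per file,
// mirroring what Hadoop SequenceFile does instead of a fixed sequence.
public class SyncMarkerSketch {
    public static byte[] newSyncMarker() {
        byte[] marker = new byte[16];
        new SecureRandom().nextBytes(marker); // unpredictable per-file value
        return marker;
    }

    public static void main(String[] args) {
        System.out.println(newSyncMarker().length); // 16
    }
}
```

The marker would be written to the file header so readers can load it before scanning for record boundaries.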
> As a quick fix it might be good to save the current position in the file before entering
readDatum, and if an exception is thrown seek back to the saved position and resume trying
to find the next record marker.
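The quick fix described above can be sketched against a toy stream; the stream, the marker value, and readDatum here are all stand-ins, not Pig's actual APIs:

```java
import java.io.IOException;

// Hypothetical sketch of the quick fix: save the position before decoding,
// and on a decode failure seek back and scan forward for the next marker
// instead of failing the whole job. Here 42 plays the role of a valid
// record tag and 99 is garbage that would make readDatum throw.
public class RecoverySketch {
    static byte[] data = {42, 99, 42}; // toy stream: valid, garbage, valid
    static int pos = 0;

    static int readDatum() throws IOException {
        int b = data[pos++];
        if (b != 42) throw new IOException("corrupt record");
        return b;
    }

    static Integer readWithRecovery() {
        int saved = pos;                 // save position before entering readDatum
        try {
            return readDatum();
        } catch (IOException e) {
            pos = saved;                 // seek back to the saved position
            while (pos < data.length && data[pos] != 42) {
                pos++;                   // resume scanning for the next marker
            }
            return null;                 // this record is skipped as corrupt
        }
    }

    public static void main(String[] args) {
        int good = 0;
        while (pos < data.length) {
            if (readWithRecovery() != null) good++;
        }
        System.out.println(good); // 2
    }
}
```

The point of the sketch is only the control flow: one bad byte costs one skipped record rather than a failed job.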
