hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-814) Make Binstorage more robust when data contains record markers
Date Fri, 22 May 2009 18:30:45 GMT

    [ https://issues.apache.org/jira/browse/PIG-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-814:
-------------------------------

    Status: Patch Available  (was: Open)

The basic issue is that we use the ctrl-A, ctrl-B, ctrl-C sequence to identify the
beginning of a record in the BinStorage format. We keep parsing the input stream
until we see this sequence, then pass the input stream to another function that
reads the tuple representing the record. The tuple itself is stored in BinStorage
as a byte representing the tuple type (the tuple marker), followed by the tuple
size stored as an integer (in Java serialization format), and then the actual
tuple fields, each stored in Java serialization format with a type marker prefix.
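
For illustration, the record layout described above can be sketched as follows.
This is a minimal sketch, not the BinStorage code itself: the class name, the
numeric values of TUPLE_MARKER and INT_MARKER, and the reading of "tuple size"
as the number of fields are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RecordLayoutSketch {
    // Record-begin sequence: ctrl-A, ctrl-B, ctrl-C.
    static final byte[] RECORD_MARKER = {0x01, 0x02, 0x03};
    // Hypothetical type-marker values, chosen only for this illustration.
    static final byte TUPLE_MARKER = 110;
    static final byte INT_MARKER = 10;

    // Writes one record whose fields are all ints.
    static byte[] writeRecord(int[] fields) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.write(RECORD_MARKER);      // record-begin sequence
            out.writeByte(TUPLE_MARKER);   // tuple type byte
            out.writeInt(fields.length);   // tuple size (assumed: field count)
            for (int f : fields) {
                out.writeByte(INT_MARKER); // per-field type marker prefix
                out.writeInt(f);           // field value, 4 bytes big-endian
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // The second field serializes to 00 01 02 03, so the record-begin
        // sequence also appears inside the data of this record.
        byte[] rec = writeRecord(new int[] {7, 0x00010203});
        System.out.println(rec.length); // prints 18
    }
}
```

Note that an ordinary field value is enough to embed ctrl-A, ctrl-B, ctrl-C in
the payload, which is the situation described next.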

An exception is thrown when the data itself contains ctrl-A, ctrl-B, ctrl-C (for
example, in the serialized form of a field in some tuple). This can happen when
the RandomSampleLoader (used in order by) tries to uniformly sample 100 tuples
and lands in a part of the data that has this sequence but is not a record-begin
sequence written by BinStorage.

The fix is to look for ctrl-A, ctrl-B, ctrl-C and additionally TUPLEMARKER before
trying to read the tuple. This decreases the probability of finding all four
markers in the data (and it also fixes the error for this particular query).
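
A rough sketch of the idea, assuming a simple byte-at-a-time scan: it is not
the code in PIG-814.patch, and the class name, method name, and TUPLE_MARKER
value are hypothetical. The flag only exists to contrast the old three-byte
check with the four-byte check on the same input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class MarkerScanSketch {
    // Hypothetical tuple type byte, chosen only for this illustration.
    static final int TUPLE_MARKER = 110;

    // Scans forward until the marker prefix is seen. Returns the number of
    // bytes consumed (the offset just past the match), or -1 at end of stream.
    static long skipToRecord(InputStream in, boolean requireTupleMarker) {
        int[] pattern = requireTupleMarker
                ? new int[] {0x01, 0x02, 0x03, TUPLE_MARKER}
                : new int[] {0x01, 0x02, 0x03};
        int matched = 0;
        long pos = 0;
        try {
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == pattern[matched]) {
                    matched++;
                } else {
                    // All pattern bytes are distinct, so after a mismatch the
                    // only possible partial match is a fresh first byte.
                    matched = (b == pattern[0]) ? 1 : 0;
                }
                if (matched == pattern.length) {
                    return pos;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Offsets 1..3 hold a stray 01 02 03 inside field data; the real
        // record (01 02 03 followed by the tuple marker) starts at offset 5.
        byte[] data = {0x00, 0x01, 0x02, 0x03, 0x2A,
                       0x01, 0x02, 0x03, 110, 0, 0, 0, 0};
        System.out.println(skipToRecord(new ByteArrayInputStream(data), false)); // prints 4
        System.out.println(skipToRecord(new ByteArrayInputStream(data), true));  // prints 9
    }
}
```

With only three bytes the scan stops inside the field data; with the tuple
marker included it skips the stray sequence and stops at the real record. A
payload that happens to contain all four bytes in order would still be matched,
which is why the fix lowers the probability rather than removing it.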


> Make Binstorage more robust when data contains record markers
> -------------------------------------------------------------
>
>                 Key: PIG-814
>                 URL: https://issues.apache.org/jira/browse/PIG-814
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.1
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: 0.3.0
>
>         Attachments: PIG-814.patch
>
>
> When the input stream for BinStorage is at a position where the data has the record
> marker sequence, the code incorrectly assumes that it is at the beginning of a record
> (tuple) and calls DataReaderWriter.readDatum() to try to read the tuple. The problem is
> more likely when RandomSampleLoader (used in the order by implementation) skips ahead in
> the input stream for sampling and calls BinStorage.getNext(). The code should be more
> robust in such cases.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

