pig-dev mailing list archives

From "Cheolsoo Park (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-3059) Global configurable minimum 'bad record' thresholds
Date Fri, 28 Dec 2012 23:18:12 GMT

    [ https://issues.apache.org/jira/browse/PIG-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540637#comment-13540637 ]

Cheolsoo Park commented on PIG-3059:

I think it depends on the file format. For Avro, one case that we should handle is when a
sync() call throws an exception. In that case, we can't really find the next position where
we can resume the read. Given that we're implementing this logic in PigRecordReader (a wrapper
class for underlying record readers), I don't think that skipping individual records, rather
than whole splits, is always possible. Please correct me if I am wrong.
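
To make the distinction concrete, here is a minimal, language-neutral sketch in Python of a wrapper that skips a single bad record when the underlying reader can continue, but gives up on the rest of the split when it cannot resynchronize. All names here (BadRecordError, ResyncError, ListReader, read_split, next_record) are hypothetical and chosen for illustration; this is not the PigRecordReader or Avro API, only an assumption about how such a wrapper might behave.

```python
class BadRecordError(Exception):
    """Hypothetical: one record failed to parse, but reading can continue."""


class ResyncError(Exception):
    """Hypothetical: the reader could not find the next position to resume from."""


class ListReader:
    """Hypothetical stand-in for an underlying record reader.

    Each event is either a record to return or an exception instance to raise.
    """

    def __init__(self, events):
        self._events = list(events)

    def next_record(self):
        if not self._events:
            return None  # end of split
        event = self._events.pop(0)
        if isinstance(event, Exception):
            raise event
        return event


def read_split(reader):
    """Read one split, skipping bad records where that is possible.

    Returns (good_records, bad_record_count, split_abandoned).
    """
    good, bad, split_abandoned = [], 0, False
    while True:
        try:
            record = reader.next_record()
        except BadRecordError:
            bad += 1          # recoverable: skip this record only
            continue
        except ResyncError:
            split_abandoned = True  # unrecoverable: the rest of the split is lost
            break
        if record is None:
            break
        good.append(record)
    return good, bad, split_abandoned
```

In this sketch, a failed resync means the remaining records in the split cannot be counted individually, which is why a per-record threshold alone may not be enough for a wrapper of this kind.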

> Global configurable minimum 'bad record' thresholds
> ---------------------------------------------------
>                 Key: PIG-3059
>                 URL: https://issues.apache.org/jira/browse/PIG-3059
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.11
>            Reporter: Russell Jurney
>            Assignee: Cheolsoo Park
>             Fix For: 0.12
>         Attachments: PIG-3059.patch, test_avro_files.tar.gz
> See PIG-2614. 
> Pig dies when one record in a LOAD of a billion records fails to parse. This is almost
> certainly not the desired behavior. elephant-bird and some other storage UDFs have minimum
> thresholds in terms of percent and count that must be exceeded before a job will fail outright.
> We need these limits to be configurable for Pig, globally. I've come to realize what
> a major problem Pig's crashing on bad records is for new Pig users. I believe this feature
> can greatly improve Pig.
> An example of a config would look like:
> pig.storage.bad.record.threshold=0.01
> pig.storage.bad.record.min=100
> A thorough discussion of this issue is available here: http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss
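
As a rough illustration of how the two example settings above might combine, the following Python sketch fails a job only when both the bad-record count and the bad-record fraction exceed their limits. The function name should_fail and its parameter names are hypothetical, and the "both must be exceeded" reading is an assumption based on the issue description, not confirmed Pig behavior.

```python
def should_fail(bad_records, total_records, threshold=0.01, min_bad=100):
    """Decide whether a job should fail, given bad-record statistics.

    threshold mirrors pig.storage.bad.record.threshold (a fraction of records).
    min_bad mirrors pig.storage.bad.record.min (an absolute count).
    Assumption: the job fails only when BOTH limits are exceeded.
    """
    if total_records <= 0:
        return False  # nothing has been read yet, so nothing to judge
    if bad_records <= min_bad:
        return False  # too few bad records to fail, whatever the ratio
    return bad_records / total_records > threshold
```

Under these assumptions, 150 bad records out of 1,000 would fail the job, while 150 out of 1,000,000 or 50 out of 1,000 would not.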

