From: "Adam Szita (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: dev@pig.apache.org
Date: Sat, 22 Jul 2017 11:41:02 +0000 (UTC)
Subject: [jira] [Updated] (PIG-3655) BinStorage and InterStorage approach to record markers is broken

     [ https://issues.apache.org/jira/browse/PIG-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Szita updated PIG-3655:
----------------------------
       Resolution: Fixed
    Fix Version/s: 0.18.0
           Status: Resolved  (was: Patch Available)

> BinStorage and InterStorage approach to record markers is broken
> ----------------------------------------------------------------
>
>                 Key: PIG-3655
>                 URL: https://issues.apache.org/jira/browse/PIG-3655
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.9.2, 0.10.0, 0.11, 0.10.1, 0.12.0, 0.11.1
>            Reporter: Jeff Plaisance
>            Assignee: Adam Szita
>             Fix For: 0.18.0
>
>         Attachments: PIG-3655.0.patch, PIG-3655.1.patch, PIG-3655.2.patch, PIG-3655.3.patch, PIG-3655.4.patch, PIG-3655.5.patch
>
>
> The record readers for these storage formats seek to the first record in an input split by searching for a marker byte sequence: 1 2 3 110 for BinStorage, or 1 2 3 followed by a byte in 19-21, 28-30, or 36-45 for InterStorage. If this sequence occurs in the data for any reason other than to mark the start of a tuple (for example, the integer 16909166 stored big-endian encodes to exactly the BinStorage sequence), it can cause mysterious failures in Pig jobs, because the record reader will try to decode garbage and fail.
> When an unlikely sequence is used to mark record boundaries, it is important to reduce the probability of that sequence occurring naturally in the data by making the record marker sufficiently long. Hadoop SequenceFile uses 128 bits for this and randomly generates the sequence for each file (a fixed, predetermined value opens up the possibility of someone intentionally sending you that value). This makes collisions extremely unlikely. In the long run I think that Pig should do the same.
> As a quick fix, it might be good to save the current position in the file before entering readDatum and, if an exception is thrown, seek back to the saved position and resume searching for the next record marker.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
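
The collision described in the issue, and the quick fix it proposes, can be sketched in Java as below. This is only an illustration under stated assumptions, not Pig's code and not the committed patch: the class RecordMarkerDemo, the one-byte TAG, the toy record layout (marker, tag, 4-byte int) and the in-memory readDatum are all hypothetical. Only the marker bytes 1 2 3 110 and the integer 16909166 come from the issue text.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class RecordMarkerDemo {
    // BinStorage marker bytes as quoted in the issue: 1 2 3 110.
    static final byte[] MARKER = {1, 2, 3, 110};
    // Hypothetical type tag for the toy record layout: MARKER, TAG, 4-byte int.
    static final byte TAG = 0x7F;

    /** Big-endian bytes of an int (ByteBuffer is big-endian by default). */
    static byte[] bigEndian(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /** First index >= from where marker occurs in data, or -1 if absent. */
    static int indexOf(byte[] data, byte[] marker, int from) {
        for (int i = Math.max(from, 0); i + marker.length <= data.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + marker.length), marker)) {
                return i;
            }
        }
        return -1;
    }

    /** Toy decoder: throws when the bytes at pos are not a valid record body. */
    static int readDatum(byte[] data, int pos) {
        if (pos + 5 > data.length || data[pos] != TAG) {
            throw new IllegalStateException("garbage at offset " + pos);
        }
        return ByteBuffer.wrap(data, pos + 1, 4).getInt();
    }

    /** The proposed quick fix: remember the position, retry the search on failure. */
    static int firstRecordFrom(byte[] data, int splitStart) {
        int searchFrom = splitStart;
        while (true) {
            int hit = indexOf(data, MARKER, searchFrom);
            if (hit < 0) {
                throw new IllegalStateException("no record found in split");
            }
            int saved = hit + MARKER.length; // position saved before readDatum
            try {
                return readDatum(data, saved);
            } catch (IllegalStateException e) {
                searchFrom = saved; // "seek back" and look for the next marker
            }
        }
    }

    /** Two records; the first payload happens to equal the marker bytes. */
    static byte[] sampleData() {
        return ByteBuffer.allocate(18)
                .put(MARKER).put(TAG).putInt(16909166)
                .put(MARKER).put(TAG).putInt(42)
                .array();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(bigEndian(16909166))); // [1, 2, 3, 110]
        // A split that starts at offset 5 lands inside the first payload, which
        // looks like a marker; the retry skips it and decodes the second record.
        System.out.println(firstRecordFrom(sampleData(), 5)); // 42
    }
}
```

Retrying only makes a false match recoverable when the decoder happens to reject the garbage; a longer, per-file random marker of the kind the issue attributes to SequenceFile reduces the chance of a false match in the first place.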