Return-Path: X-Original-To: apmail-pig-dev-archive@www.apache.org Delivered-To: apmail-pig-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 88AB811789 for ; Mon, 19 May 2014 23:59:38 +0000 (UTC) Received: (qmail 56887 invoked by uid 500); 19 May 2014 23:59:38 -0000 Delivered-To: apmail-pig-dev-archive@pig.apache.org Received: (qmail 56837 invoked by uid 500); 19 May 2014 23:59:38 -0000 Mailing-List: contact dev-help@pig.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@pig.apache.org Delivered-To: mailing list dev@pig.apache.org Received: (qmail 56828 invoked by uid 500); 19 May 2014 23:59:38 -0000 Delivered-To: apmail-hadoop-pig-dev@hadoop.apache.org Received: (qmail 56825 invoked by uid 99); 19 May 2014 23:59:38 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 19 May 2014 23:59:38 +0000 Date: Mon, 19 May 2014 23:59:38 +0000 (UTC) From: "Philip (flip) Kromer (JIRA)" To: pig-dev@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Created] (PIG-3952) PigStorage accepts '-tagSplit' to return full split information MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 Philip (flip) Kromer created PIG-3952: ----------------------------------------- Summary: PigStorage accepts '-tagSplit' to return full split information Key: PIG-3952 URL: https://issues.apache.org/jira/browse/PIG-3952 Project: Pig Issue Type: Improvement Components: internal-udfs Reporter: Philip (flip) Kromer If -tagSplit is specified, PigStorage will prepend the full file split information, separated by hash ('#') marks, to each Tuple/row: * File path (directory and filename) of the split * Split index (zero-based) * Split offset: starting position of the split in bytes. * Split length: length of the split in bytes. Motivating examples are * attach a unique ordered ID without a reduce: with the split index and sequentially numbered files (or some other strict ordering of the filenames) you can build a serial ID from say SPRINTF("%05d-%3d-%8d", (int)file_idx, (int)split_idx, rank_vals). * do a consistent shuffle; the lines will be well-mixed but remain identically shuffled from run to run. With a seedable hash function you can stabilize the mixing for testing purposes. {code} -- shuffle according to a hash with good mixing properties. MD5 is OK but murmur3 from DATAFU-47 even better. DEFINE Hasher datafu.pig.hash.MD5('hex'); vals = LOAD '/data/geo/us_city_pops.tsv' USING PigStorage('\t', '-tagSplit') AS (split_info:chararray, city:chararray, state:chararray, pop:int); vals_rked = RANK vals; vals_ided = FOREACH vals_rked { line_info = CONCAT((chararray)split_info, '#', (chararray)rank_vals); GENERATE Hasher((chararray)line_info) AS rand_id, *; }; vals_shuffled = FOREACH (ORDER vals_ided BY rand_id) GENERATE *; STORE vals_shuffled INTO '/data/out/vals_shuffled'; -- 7b86bdc8f28edb05025a556c06805753 37 file:/data/geo/census/us_city_pops.tsv#0#0#1541 Kansas City Missouri 463202 -- 7c9c93658545d5aba2825699866cb024 13 file:/data/geo/census/us_city_pops.tsv#0#0#1541 Austin Texas 820611 -- 8532dc8d18992a688621950c50b6ca80 14 file:/data/geo/us_city_pops.tsv#0#0#1541 San Francisco California 812826 -- 8d8632fecb4895288f1684c451a01a0b 29 file:/data/geo/us_city_pops.tsv#0#0#1541 Portland Oregon 593820 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)