Date: Sat, 11 Oct 2014 05:42:33 +0000 (UTC)
From: "Mithun Radhakrishnan (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Updated] (HIVE-8394) HIVE-7803 doesn't handle Pig MultiQuery, can cause data-loss.

     [ https://issues.apache.org/jira/browse/HIVE-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mithun Radhakrishnan updated HIVE-8394:
---------------------------------------
    Attachment: HIVE-8394.1.patch

Tentative fix using a singleton to track OutputCommitter calls for TaskAttempt commits/aborts.

I tried solving this using the configuration, and by storing state in the OutputFormat. It can't be done. OutputFormats are constructed afresh using Reflection, and hence have no state.
Similarly, Pig clones the configurations (and TaskAttemptContext instances) when calling RecordWriter functions, to prevent pollution. I even tried storing state in the UDFContext, by modifying {{HCatStorer::setLocation()}}, and then transferring the state into the JobConf. No dice.

> HIVE-7803 doesn't handle Pig MultiQuery, can cause data-loss.
> -------------------------------------------------------------
>
>                 Key: HIVE-8394
>                 URL: https://issues.apache.org/jira/browse/HIVE-8394
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>    Affects Versions: 0.12.0, 0.14.0, 0.13.1
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>            Priority: Critical
>         Attachments: HIVE-8394.1.patch
>
>
> We've found situations in production where Pig queries using {{HCatStorer}}, dynamic partitioning and {{opt.multiquery=true}} produce partitions in the output table, but the corresponding directories have no data files (in spite of Pig reporting non-zero records written to HDFS). I don't yet have a distilled test-case for this.
> Here's the code from FileOutputCommitterContainer after HIVE-7803:
> {code:java|title=FileOutputCommitterContainer.java|borderStyle=dashed|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
>   @Override
>   public void commitTask(TaskAttemptContext context) throws IOException {
>     String jobInfoStr = context.getConfiguration().get(FileRecordWriterContainer.DYN_JOBINFO);
>     if (!dynamicPartitioningUsed) {
>       //See HCATALOG-499
>       FileOutputFormatContainer.setWorkOutputPath(context);
>       getBaseOutputCommitter().commitTask(HCatMapRedUtil.createTaskAttemptContext(context));
>     } else if (jobInfoStr != null) {
>       ArrayList<String> jobInfoList = (ArrayList<String>)HCatUtil.deserialize(jobInfoStr);
>       org.apache.hadoop.mapred.TaskAttemptContext currTaskContext = HCatMapRedUtil.createTaskAttemptContext(context);
>       for (String jobStr : jobInfoList) {
>         OutputJobInfo localJobInfo = (OutputJobInfo)HCatUtil.deserialize(jobStr);
>         FileOutputCommitter committer = new FileOutputCommitter(new Path(localJobInfo.getLocation()), currTaskContext);
>         committer.commitTask(currTaskContext);
>       }
>     }
>   }
> {code}
> The serialized jobInfoList can't be retrieved, and hence the commit never completes. This is because Pig's MapReducePOStoreImpl deliberately clones both the TaskAttemptContext and the contained Configuration instance, thus separating the Configuration instances passed to {{FileOutputCommitterContainer::commitTask()}} and {{FileRecordWriterContainer::close()}}. Anything set by the RecordWriter is unavailable to the Committer.
> One approach would have been to store state in the FileOutputFormatContainer. But that won't work since this is constructed via reflection in HCatOutputFormat (itself constructed via reflection by PigOutputFormat via HCatStorer). There's no guarantee that the instance is preserved.
> My only recourse seems to be to use a Singleton to store shared state. I'm loath to indulge in this brand of shenanigans. (Statics and container-reuse in Tez might not play well together, for instance.)
> It might work if we're careful about tearing down the singleton.
> Any other ideas?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
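
The failure mode and the singleton workaround described in this message can be illustrated with a small, self-contained sketch. This is not the code in HIVE-8394.1.patch. The class names `TaskCommitRegistry` and `CloneDemo`, the key `dyn.jobinfo`, the attempt ID and the path are all hypothetical, and a plain `Map` stands in for Hadoop's `Configuration`. It also assumes the writer and the committer run in the same JVM. The writer mutates a cloned configuration (so the caller never sees the value) and also registers the location in a process-wide registry. The committer drains the registry entry, which removes it, as one possible way to be careful about teardown.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical process-wide registry; not an HCatalog class. */
final class TaskCommitRegistry {
    private static final TaskCommitRegistry INSTANCE = new TaskCommitRegistry();

    // Task-attempt ID -> output locations written by that attempt.
    private final Map<String, List<String>> locations = new ConcurrentHashMap<>();

    private TaskCommitRegistry() {}

    static TaskCommitRegistry get() {
        return INSTANCE;
    }

    /** Called from the writer side when a partition location is finalized. */
    void register(String attemptId, String location) {
        locations.computeIfAbsent(attemptId, k -> new CopyOnWriteArrayList<>()).add(location);
    }

    /**
     * Called from the committer side (on commit or abort). Removes the entry,
     * so nothing lingers if the JVM or container is reused by a later task.
     */
    List<String> drain(String attemptId) {
        List<String> found = locations.remove(attemptId);
        if (found == null) {
            return Collections.<String>emptyList();
        }
        return found;
    }
}

final class CloneDemo {
    /** Mimics a writer that is handed a cloned configuration. */
    static void writerClose(Map<String, String> clonedConf, String attemptId, String location) {
        clonedConf.put("dyn.jobinfo", location);                 // lost: only the clone changes
        TaskCommitRegistry.get().register(attemptId, location);  // visible process-wide
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        String attempt = "attempt_0001_m_000000_0"; // hypothetical attempt ID

        // The framework passes a copy of the configuration to the writer.
        writerClose(new HashMap<>(conf), attempt, "/tmp/out/part=1");

        // The committer sees the original configuration, which was never updated.
        System.out.println("via conf: " + conf.get("dyn.jobinfo"));
        System.out.println("via registry: " + TaskCommitRegistry.get().drain(attempt));
        System.out.println("after drain: " + TaskCommitRegistry.get().drain(attempt));
    }
}
```

Running `CloneDemo.main` prints `via conf: null`, then `via registry: [/tmp/out/part=1]`, then `after drain: []`. Keying by task-attempt ID and removing the entry on both commit and abort is one way to limit the risk raised above about statics and container reuse; whether it is sufficient under Tez is not something this sketch establishes.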