Delivered-To: user@hive.apache.org
Date: Tue, 8 Mar 2016 09:29:28 +0000
Subject: Re: Hive alter table concatenate loses data - can parquet help?
From: Mich Talebzadeh
To: user@hive.apache.org

Hi,

Can you please provide the DDL for this table ("show create table <TABLE>")?

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 7 March 2016 at 23:25, Marcin Tustin wrote:

> Hi All,
>
> Following on from our Parquet vs ORC discussion, today I observed Hive's
> alter table ... concatenate command remove rows from an ORC-formatted table.
>
> 1. Has anyone else observed this (fuller description below)?
> 2. How do Parquet users handle the file fragmentation issue?
>
> Description of the problem:
>
> Today I ran a query to count rows by date. Relevant days below:
>
> 2016-02-28 16866
> 2016-03-06 219
> 2016-03-07 2863
>
> I then ran concatenation on that table. Rerunning the same query resulted in:
>
> 2016-02-28 16866
> 2016-03-06 219
> 2016-03-07 1158
>
> Note the reduced count for 2016-03-07.
>
> I then ran concatenation a second time, and the query a third time:
>
> 2016-02-28 16344
> 2016-03-06 219
> 2016-03-07 1158
>
> Now the count for 2016-02-28 is reduced.
>
> This doesn't look like an elimination of duplicates occurring by design -
> these didn't all happen on the first run of concatenation. It looks like
> concatenation just loses data.
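For readers following along, a minimal HiveQL sketch of the sequence described in the thread (the table name `events` and date column `dt` are hypothetical; the original DDL was requested but not shown):

```sql
-- Hypothetical table; use SHOW CREATE TABLE to share the actual DDL.
SHOW CREATE TABLE events;

-- Count rows per day before concatenation.
SELECT dt, COUNT(*) FROM events GROUP BY dt;

-- Merge small ORC files; this is the step reported to drop rows.
ALTER TABLE events CONCATENATE;
-- For a partitioned table, concatenate one partition at a time:
-- ALTER TABLE events PARTITION (dt='2016-03-07') CONCATENATE;

-- Re-run the count and compare with the earlier result.
SELECT dt, COUNT(*) FROM events GROUP BY dt;
```

CONCATENATE only applies to tables stored as ORC; comparing per-partition counts before and after is the simplest way to detect the loss described above.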
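The before/after comparison Marcin describes can be sketched as a small check. A minimal Python sketch, using the counts reported in the thread (the helper name `lost_rows` is my own):

```python
def lost_rows(before, after):
    """Return {day: (old_count, new_count)} for days whose row count dropped."""
    return {day: (before[day], after[day])
            for day in before
            if after.get(day, 0) < before[day]}

# Per-day counts as reported in the thread.
before_concat = {"2016-02-28": 16866, "2016-03-06": 219, "2016-03-07": 2863}
after_first   = {"2016-02-28": 16866, "2016-03-06": 219, "2016-03-07": 1158}
after_second  = {"2016-02-28": 16344, "2016-03-06": 219, "2016-03-07": 1158}

# First concatenation lost rows for 2016-03-07; the second for 2016-02-28.
print(lost_rows(before_concat, after_first))
print(lost_rows(after_first, after_second))
```

The fact that a different day loses rows on each run is what rules out deliberate de-duplication as an explanation.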