From: Mich Talebzadeh
Date: Mon, 26 Sep 2016 09:01:53 +0100
Subject: Re: populating Hive table periodically from files on HDFS
To: user@hive.apache.org

Thanks Eugene,

My table in Hive happens to be ORC, so making it bucketed and transactional is trivial.

However, there is an underlying concern of mine.

A Hive transactional table will generate a lot of delta files if it is appended to periodically. At least, that is my understanding of what is going to happen.

Once delta files are created, Spark is unable to read the data. This is unfortunate but true. My main analysis relies on Spark reading the Hive table, and in that case the queries won't work.

So it is a neat suggestion, but my concern is that we will be left with delta files, and until compaction happens nothing can be done to make Spark read the data.

If my assumptions are incorrect, I stand corrected.

Regards

Dr Mich Talebzadeh

LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On 26 September 2016 at 04:50, Eugene Koifman <ekoifman@hortonworks.com> wrote:

> Have you considered Hive Streaming?
> (https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest)
> It's built for exactly this use case.
> Both Flume and Storm are integrated with it and write directly to your
> target table.
>
> Eugene
>
> From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
> Date: Sunday, September 25, 2016 at 8:47 AM
> To: user <user@hive.apache.org>
> Subject: Re: populating Hive table periodically from files on HDFS
>
> Thanks
>
> I agree. I think using INSERT OVERWRITE to repopulate the data in the
> partition is bullet proof, with nothing left behind. Performance looks
> good as well.
>
> When creating partitions by date, it seems to be more effective to
> partition by a single string of 'YYYY-MM-DD' rather than to use a
> multi-depth partition of Year, Month, Day etc.
>
> I thought about bucketing the partitions, but one needs to balance the
> housekeeping against the number of buckets within each partition, so I
> did not bother.
>
> Cheers
>
> Dr Mich Talebzadeh
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
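As a rough HiveQL sketch of the bucketed, transactional ORC table discussed above (the table name, columns and bucket count here are hypothetical, and the usual ACID settings plus a running compactor are assumed), together with a manually requested major compaction for the delta-file concern:

    -- Sketch only: a bucketed ORC table with ACID enabled; column names
    -- (ticker, price, price_time) and the bucket count are hypothetical.
    CREATE TABLE marketData_acid (
      ticker     STRING,
      price      DOUBLE,
      price_time TIMESTAMP
    )
    PARTITIONED BY (DateStamp STRING)
    CLUSTERED BY (ticker) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    -- Delta files accumulate per partition until compaction runs; a major
    -- compaction can be requested explicitly rather than waiting for the
    -- automatic compactor to pick the partition up.
    ALTER TABLE marketData_acid PARTITION (DateStamp = '2016-09-26') COMPACT 'major';

A major compaction rewrites the base plus deltas into a new base, which is what readers that cannot merge delta files would need.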
>
> On 25 September 2016 at 12:19, Jörn Franke <jornfranke@gmail.com> wrote:
>
>> I think what you propose makes sense. If you did a delta load you would
>> not gain much performance benefit (most likely you would lose
>> performance, because you need to figure out what has changed, and you
>> have the typical issues of distributed systems: some changes may arrive
>> late, error handling, etc.), especially given the volumes.
>> You may partition smaller, but given the volumes that is not really needed.
>>
>> On 25 Sep 2016, at 12:32, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> I have trade data delivered through Kafka and Flume as CSV files to HDFS.
>> There are 100 prices every 2 seconds, so in a minute there are 3,000 new
>> rows, 180K rows an hour and 4,320,000 new rows a day.
>>
>> Flume creates a new sub-directory partition every day in the format
>> YYYY-MM-DD, like prices/2015-09-25, on HDFS.
>>
>> There is an external Hive table pointing to the new directory; this is
>> done by simply altering the external table location:
>>
>> ALTER TABLE ${DATABASE}.externalMarketData set location
>> 'hdfs://rhes564:9000/data/prices/${TODAY}';
>>
>> This means that the external Hive table only points to the current
>> directory.
>>
>> The target internal table in Hive is partitioned by DateStamp = "YYYY-MM-DD":
>>
>> PARTITIONED BY (DateStamp string)
>>
>> To populate the Hive table, a cron job runs every 15 minutes and simply does:
>>
>> INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
>> SELECT
>> '''''''''''''''''''''''''
>> )
>> FROM ${DATABASE}.externalMarketData
>>
>> So effectively every 15 minutes *today's partition* is overwritten by
>> new data from the external table.
>>
>> This seems to be OK.
>>
>> The other option is to add only the new rows since the last run, i.e.
>> INSERT INTO where the rows do not already exist in the target table.
>>
>> Any other suggestions?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising from
>> such loss, damage or destruction.
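For the "only add new rows since the last run" alternative mentioned in the original post, one way to express it in HiveQL is a left outer join used as an anti-join against today's partition. A minimal sketch, assuming (hypothetically) that a ticker plus a price_time timestamp uniquely identifies a row; the column names are illustrative only, since the real column list is not shown in the thread:

    -- Sketch only: append rows from the external table that are not already
    -- present in today's partition. ticker, price and price_time are
    -- hypothetical column names.
    INSERT INTO TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
    SELECT e.ticker, e.price, e.price_time
    FROM ${DATABASE}.externalMarketData e
    LEFT OUTER JOIN (
      SELECT ticker, price_time
      FROM ${DATABASE}.marketData
      WHERE DateStamp = "${TODAY}"
    ) t
      ON  t.ticker     = e.ticker
      AND t.price_time = e.price_time
    WHERE t.ticker IS NULL;

Unlike INSERT OVERWRITE, which replaces the partition contents on every run, this only appends files for the genuinely new rows, at the cost of an extra join against the data already loaded.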