From: Peyman Mohajerian <mohajeri@gmail.com>
To: user@hive.apache.org
Date: Tue, 7 Jan 2014 13:05:47 -0800
Subject: Re: Help on loading data stream to hive table.

You may find Summingbird relevant; I'm still investigating it:
https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird

On Tue, Jan 7, 2014 at 11:39 AM, Alan Gates <gates@hortonworks.com> wrote:
> I am not wise enough in the ways of Storm to tell you how you should
> partition data across bolts. However, there is no need in Hive for all
> data for a partition to be in the same file, only in the same directory.
> So if each bolt creates a file for each partition and then all those
> files are placed in one directory and loaded into Hive it will work.

> Alan.

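For the load step Alan describes, the key is that Hive just needs the
files in one directory; a minimal sketch over the HiveServer2 JDBC
driver (the connection URL, table name "events", partition column
"epoch", and staging path are all hypothetical, not from this thread):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AddPartition {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; endpoint and credentials are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        // Point a new partition at the directory the bolts wrote into.
        // Hive only requires that the files share this directory, not
        // that they be a single file.
        stmt.execute("ALTER TABLE events ADD PARTITION (epoch='2014010700') "
            + "LOCATION '/data/staging/2014010700'");
        stmt.close();
        con.close();
      }
    }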
> On Jan 6, 2014, at 6:26 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:

> > Alan,
> > the problem is that the data is partitioned by epoch, ten-hourly, and
> > I want all data belonging to that partition to be written into one
> > file named with that partition. How can I share the file writer across
> > different bolts? Should I direct data within the same partition to the
> > same bolt?
> > Thanks,
> > Chen
> >
> >
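On routing all records for one partition to the same bolt: Storm's
fields grouping does exactly that. A minimal topology sketch, assuming
a SocketSpout that emits an "epoch" field and the FileWriterBolt
sketched further down (both class names are hypothetical):

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class PartitionedTopology {
      public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("socket-spout", new SocketSpout(), 1);
        // fieldsGrouping hashes on the declared field, so every tuple
        // with the same "epoch" value lands on the same bolt instance.
        // Each instance can then own one writer per epoch it sees.
        builder.setBolt("file-writer", new FileWriterBolt(), 4)
               .fieldsGrouping("socket-spout", new Fields("epoch"));
      }
    }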
> > On Fri, Jan 3, 2014 at 3:27 PM, Alan Gates <gates@hortonworks.com> wrote:
> > You shouldn't need to write each record to a separate file. Each Storm
> > bolt should be able to write to its own file, appending records as it
> > goes. As long as you only have one writer per file this should be
> > fine. You can then close the files every 15 minutes (or whatever works
> > for you) and have a separate job that creates a new partition in your
> > Hive table with the files created by your bolts.
> >
> > Alan.
> >
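A rough sketch of the bolt Alan describes, writing to HDFS and rotating
its file every 15 minutes (class and path names are hypothetical, and
error handling is kept minimal):

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    public class FileWriterBolt extends BaseRichBolt {
      private transient FileSystem fs;
      private transient FSDataOutputStream out;
      private OutputCollector collector;
      private int taskId;
      private long openedAt;

      public void prepare(Map conf, TopologyContext ctx,
                          OutputCollector collector) {
        this.collector = collector;
        this.taskId = ctx.getThisTaskId();
        try {
          fs = FileSystem.get(new Configuration());
          openFile();
        } catch (Exception e) { throw new RuntimeException(e); }
      }

      // One file per bolt instance per rotation window; a separate job
      // later moves closed files into the directory Hive will read.
      private void openFile() throws Exception {
        openedAt = System.currentTimeMillis();
        out = fs.create(new Path(
            "/data/staging/incoming-" + taskId + "-" + openedAt));
      }

      public void execute(Tuple tuple) {
        try {
          out.writeBytes(tuple.getString(0) + "\n");
          // Rotate roughly every 15 minutes, as suggested above.
          if (System.currentTimeMillis() - openedAt > 15 * 60 * 1000L) {
            out.close();
            openFile();
          }
          collector.ack(tuple);
        } catch (Exception e) { collector.fail(tuple); }
      }

      public void declareOutputFields(OutputFieldsDeclarer declarer) {}
    }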
> > On Jan 2, 2014, at 11:58 AM, Chen Wang <chen.apache.solr@gmail.com> wrote:
> >
> >> Guys,
> >> I am using Storm to read a data stream from our socket server, entry
> >> by entry, and then write the entries to files: one entry per file. At
> >> some point, I need to import the data into my Hive table. There are
> >> several approaches I could think of:
> >> 1. Directly write to the Hive HDFS file whenever I get an entry (from
> >> our socket server). The problem is that this could be very
> >> inefficient, since we have a huge amount of streaming data, and I
> >> would not want to write to Hive HDFS one entry at a time.
> >> Or
> >> 2. I can write the entries to files (normal files or HDFS files) on
> >> disk, and then have a separate job to merge those small files into
> >> big ones, and then load them into the Hive table.
> >> The problem with this is: a) how can I merge small files into big
> >> files for Hive? b) what is the best file size to upload to Hive?
> >>
> >> I am seeking advice on both approaches, and appreciate your insight.
> >> Thanks,
> >> Chen
> >>
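On question (a), one way to compact is to let Hive rewrite the data
itself; a sketch over JDBC, assuming a staging external table
"events_staging" whose columns match the target "events" table (all
names and the partition value are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CompactSmallFiles {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        // Ask Hive to merge the small output files of this job.
        stmt.execute("SET hive.merge.mapfiles=true");
        stmt.execute("SET hive.merge.mapredfiles=true");
        // Rewriting the staging data into the target compacts many
        // small input files into a few large ones.
        stmt.execute(
            "INSERT OVERWRITE TABLE events PARTITION (epoch='2014010700') "
            + "SELECT * FROM events_staging");
        stmt.close();
        con.close();
      }
    }

On question (b), a common rule of thumb is to aim for files at least as
large as an HDFS block (typically 64-128 MB at the time), so the
NameNode is not tracking huge numbers of tiny files.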