Subject: Re: Help on loading data stream to hive table.
From: Chen Wang <chen.apache.solr@gmail.com>
To: user@hive.apache.org
Date: Mon, 6 Jan 2014 18:26:00 -0800

Alan,
The problem is that the data is partitioned by epoch into ten-hour buckets, and I want all the data belonging to a partition to be written into one file named after that partition. How can I share the file writer across different bolts? Should I route data within the same partition to the same bolt?
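One common pattern is a fields grouping on the partition key: Storm hashes on that field, so every tuple for a given partition is delivered to the same bolt task, and each task can keep a single open, appending writer per partition it owns. A minimal sketch, assuming Storm 0.9.x (backtype.storm); SocketSpout, PartitionFileBolt, and the "partition" field name are hypothetical stand-ins for your own classes:

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class PartitionedWriterTopology {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // Hypothetical spout that emits tuples declared as
            // ("partition", "record"), where "partition" is the
            // ten-hour epoch bucket the record belongs to.
            builder.setSpout("socket-spout", new SocketSpout());

            // fieldsGrouping hashes on "partition": all tuples with the
            // same partition value go to the same PartitionFileBolt task,
            // so each task can hold one open writer per partition and
            // append records to it as they arrive.
            builder.setBolt("partition-writer", new PartitionFileBolt(), 4)
                   .fieldsGrouping("socket-spout", new Fields("partition"));

            // Submit with StormSubmitter (cluster) or LocalCluster (test).
        }
    }

With this routing in place there is no writer to share across bolts: ownership of each partition's file is pinned to exactly one task.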
Thanks,
Chen

On Fri, Jan 3, 2014 at 3:27 PM, Alan Gates <gates@hortonworks.com> wrote:

> You shouldn't need to write each record to a separate file. Each Storm
> bolt should be able to write to its own file, appending records as it
> goes. As long as you only have one writer per file this should be fine.
> You can then close the files every 15 minutes (or whatever works for you)
> and have a separate job that creates a new partition in your Hive table
> with the files created by your bolts.
>
> Alan.
>
> On Jan 2, 2014, at 11:58 AM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>
> > Guys,
> > I am using Storm to read a data stream from our socket server, entry by
> > entry, and then write the entries to files: one entry per file. At some
> > point I need to import the data into my Hive table. There are several
> > approaches I could think of:
> >
> > 1. Directly write to the Hive HDFS file whenever I get an entry (from
> > our socket server). The problem is that this could be very inefficient,
> > since we have a huge amount of streaming data, and I would not want to
> > write to Hive's HDFS one entry at a time.
> >
> > Or
> >
> > 2. I can write the entries to files (normal files or HDFS files) on
> > disk, and then have a separate job merge those small files into big
> > ones and load them into the Hive table.
> >
> > The problem with this is: a) how can I merge small files into big files
> > for Hive? b) What is the best file size to upload to Hive?
> >
> > I am seeking advice on both approaches, and appreciate your insight.
> >
> > Thanks,
> > Chen
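Following up on Alan's point above: the separate job that registers each batch of closed files can be a single HiveQL statement run against HiveServer2. A sketch over Hive JDBC; the table name, partition column, host, and directory layout are hypothetical, and the table is assumed to be partitioned so that each ten-hour bucket maps to one directory:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AddPartitionJob {
        public static void main(String[] args) throws Exception {
            // Hive JDBC driver for HiveServer2.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "storm", "");
            try {
                Statement stmt = conn.createStatement();
                // Point the new partition at the directory the bolts wrote
                // into; IF NOT EXISTS makes the job safe to re-run.
                stmt.execute("ALTER TABLE events "
                        + "ADD IF NOT EXISTS PARTITION (epoch_bucket='2014010600') "
                        + "LOCATION '/user/storm/events/2014010600'");
                stmt.close();
            } finally {
                conn.close();
            }
        }
    }

No data is moved or rewritten: the statement only tells the metastore where the already-written files live, which also sidesteps the small-file merge question as long as each bolt task writes one reasonably large file per bucket.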