Subject: Re: Help on loading data stream to hive table.
From: Chen Wang <chen.apache.solr@gmail.com>
To: user@hive.apache.org
Date: Mon, 6 Jan 2014 18:26:00 -0800

Alan,
The problem is that the data is partitioned by epoch into ten-hour buckets, and I want all the data belonging to a partition to be written into one file named after that partition. How can I share the file writer across different bolts? Should I route data within the same partition to the same bolt?
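One common pattern is a fields grouping on the partition key: Storm hashes on that field, so every tuple for a given partition is delivered to the same bolt task, and each task can keep a single open, appending writer per partition it owns. A minimal sketch, assuming Storm 0.9.x (backtype.storm); SocketSpout, PartitionFileBolt, and the "partition" field name are hypothetical stand-ins for your own classes:

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class PartitionedWriterTopology {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // Hypothetical spout that emits tuples declared as
            // ("partition", "record"), where "partition" is the
            // ten-hour epoch bucket the record belongs to.
            builder.setSpout("socket-spout", new SocketSpout());

            // fieldsGrouping hashes on "partition": all tuples with the
            // same partition value go to the same PartitionFileBolt task,
            // so each task can hold one open writer per partition and
            // append records to it as they arrive.
            builder.setBolt("partition-writer", new PartitionFileBolt(), 4)
                   .fieldsGrouping("socket-spout", new Fields("partition"));

            // Submit with StormSubmitter (cluster) or LocalCluster (test).
        }
    }

With this routing in place there is no writer to share across bolts: ownership of each partition's file is pinned to exactly one task.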
Thanks,
Chen

On Fri, Jan 3, 2014 at 3:27 PM, Alan Gates <gates@hortonworks.com> wrote:

> You shouldn't need to write each record to a separate file. Each Storm
> bolt should be able to write to its own file, appending records as it
> goes. As long as you only have one writer per file this should be fine.
> You can then close the files every 15 minutes (or whatever works for you)
> and have a separate job that creates a new partition in your Hive table
> with the files created by your bolts.
>
> Alan.
>
> On Jan 2, 2014, at 11:58 AM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>
> > Guys,
> > I am using Storm to read a data stream from our socket server, entry by
> > entry, and then write the entries to files: one entry per file. At some
> > point I need to import the data into my Hive table. There are several
> > approaches I could think of:
> >
> > 1. Directly write to the Hive HDFS file whenever I get an entry (from
> > our socket server). The problem is that this could be very inefficient,
> > since we have a huge amount of streaming data, and I would not want to
> > write to Hive's HDFS one entry at a time.
> >
> > Or
> >
> > 2. I can write the entries to files (normal files or HDFS files) on
> > disk, and then have a separate job merge those small files into big
> > ones and load them into the Hive table.
> >
> > The problem with this is: a) how can I merge small files into big files
> > for Hive? b) What is the best file size to upload to Hive?
> >
> > I am seeking advice on both approaches, and appreciate your insight.
> >
> > Thanks,
> > Chen
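Following up on Alan's point above: the separate job that registers each batch of closed files can be a single HiveQL statement run against HiveServer2. A sketch over Hive JDBC; the table name, partition column, host, and directory layout are hypothetical, and the table is assumed to be partitioned so that each ten-hour bucket maps to one directory:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AddPartitionJob {
        public static void main(String[] args) throws Exception {
            // Hive JDBC driver for HiveServer2.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "storm", "");
            try {
                Statement stmt = conn.createStatement();
                // Point the new partition at the directory the bolts wrote
                // into; IF NOT EXISTS makes the job safe to re-run.
                stmt.execute("ALTER TABLE events "
                        + "ADD IF NOT EXISTS PARTITION (epoch_bucket='2014010600') "
                        + "LOCATION '/user/storm/events/2014010600'");
                stmt.close();
            } finally {
                conn.close();
            }
        }
    }

No data is moved or rewritten: the statement only tells the metastore where the already-written files live, which also sidesteps the small-file merge question as long as each bolt task writes one reasonably large file per bucket.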