hive-user mailing list archives

From Peyman Mohajerian <mohaj...@gmail.com>
Subject Re: Help on loading data stream to hive table.
Date Tue, 07 Jan 2014 21:05:47 GMT
You may find Summingbird relevant; I'm still investigating it:
https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird


On Tue, Jan 7, 2014 at 11:39 AM, Alan Gates <gates@hortonworks.com> wrote:

> I am not wise enough in the ways of Storm to tell you how you should
> partition data across bolts.  However, there is no need in Hive for all
> data for a partition to be in the same file, only in the same directory.
> So if each bolt creates a file for each partition, and all of a partition's
> files are then placed in one directory and loaded into Hive, it will work.
>
> Alan.
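
A minimal sketch of the layout described above: one directory per partition,
with each writer keeping its own file inside it. It assumes "ten hourly" means
fixed 10-hour windows on the epoch timestamp. The names `window_start`,
`writer_path` and the `part-<id>.txt` file pattern are hypothetical, chosen
only for illustration.

```python
import posixpath

WINDOW_SECONDS = 10 * 60 * 60  # assumption: "ten hourly" means 10-hour windows


def partition_start(epoch_seconds: int) -> int:
    """Round an epoch timestamp down to the start of its window."""
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)


def writer_path(base_dir: str, epoch_seconds: int, writer_id: str) -> str:
    """Return a per-writer file path inside the record's partition directory."""
    # Hypothetical naming: directory keyed by window start, file keyed by writer.
    part_dir = posixpath.join(base_dir, f"window_start={partition_start(epoch_seconds)}")
    return posixpath.join(part_dir, f"part-{writer_id}.txt")
```

Two bolts handling records from the same window get different files in the
same directory, so no file writer has to be shared between them.
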
>
> On Jan 6, 2014, at 6:26 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>
> > Alan,
> > The problem is that the data is partitioned by epoch, ten-hourly, and I
> > want all data belonging to a partition to be written into one file named
> > after that partition. How can I share the file writer across different
> > bolts? Should I route data within the same partition to the same bolt?
> > Thanks,
> > Chen
> >
> >
> > On Fri, Jan 3, 2014 at 3:27 PM, Alan Gates <gates@hortonworks.com> wrote:
> > You shouldn’t need to write each record to a separate file.  Each Storm
> > bolt should be able to write to its own file, appending records as it
> > goes.  As long as you only have one writer per file this should be fine.
> > You can then close the files every 15 minutes (or whatever works for you)
> > and have a separate job that creates a new partition in your Hive table
> > with the files created by your bolts.
> >
> > Alan.
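
The close-and-register cycle described above might look roughly like this. It
is a sketch under assumptions, not a tested pipeline: the helper names, table
name and paths are hypothetical, and the statement is only built as a string
here; running it against Hive is left to whatever client you use.

```python
ROLL_SECONDS = 15 * 60  # the 15-minute interval from the message; tune as needed


def should_roll(opened_at: float, now: float, interval: float = ROLL_SECONDS) -> bool:
    """True once a file has been open for at least `interval` seconds."""
    return now - opened_at >= interval


def add_partition_statement(table: str, column: str, value: str, location: str) -> str:
    """Build a HiveQL statement that registers a directory as a partition.

    No quoting or escaping is done here, so pass trusted values only.
    """
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({column}='{value}') LOCATION '{location}'"
    )
```

Each bolt would check `should_roll` before appending and close its file when
it returns true; the separate job then issues the generated statement once the
closed files are in place.
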
> >
> > On Jan 2, 2014, at 11:58 AM, Chen Wang <chen.apache.solr@gmail.com> wrote:
> >
> >> Guys,
> >> I am using Storm to read a data stream from our socket server, entry by
> >> entry, and then write the entries to files: one entry per file.  At some
> >> point, I need to import the data into my Hive table. There are several
> >> approaches I could think of:
> >> 1. Write directly to the Hive HDFS file whenever I get an entry (from our
> >> socket server). The problem is that this could be very inefficient, since
> >> we have a huge data stream, and I would not want to write to Hive's
> >> HDFS files one entry at a time.
> >> Or
> >> 2. Write the entries to files (normal files or HDFS files) on disk, and
> >> then have a separate job merge those small files into big ones and load
> >> them into the Hive table.
> >> The problems with this are: a) how can I merge small files into big files
> >> for Hive? b) what is the best file size to upload to Hive?
> >>
> >> I am seeking advice on both approaches, and appreciate your insight.
> >> Thanks,
> >> Chen
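
For question (a) above, a rough sketch of a merge step. It assumes the small
files are line-oriented text on a local filesystem and that each one ends with
a newline; `merge_files` is a hypothetical helper, and files on HDFS would
need an HDFS client instead of plain `open`.

```python
import shutil


def merge_files(small_files, merged_path):
    """Concatenate small files, in sorted path order, into one larger file."""
    with open(merged_path, "wb") as out:
        for path in sorted(small_files):
            with open(path, "rb") as src:
                # Stream the bytes across so large inputs are not held in memory.
                shutil.copyfileobj(src, out)
```

The merged file should not be one of the inputs, and the inputs should be
closed by their writers before the merge runs.
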
> >>
> >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender immediately
> > and delete it from your system. Thank You.
> >
>
