hive-user mailing list archives

From Dean Wampler <dean.wamp...@thinkbiganalytics.com>
Subject Re: Best practice for automating jobs
Date Thu, 10 Jan 2013 22:30:48 GMT
If you know make and bash, have a look at Stampede for scheduling work:

https://github.com/ThinkBigAnalytics/stampede

(Full disclosure: I wrote it)


On Thu, Jan 10, 2013 at 4:11 PM, Sean McNamara
<Sean.McNamara@webtrends.com> wrote:

> > I want to know if there are any accepted patterns or best practices for
> >this?
>
> http://oozie.apache.org/
>
>
>
With both Stampede and Oozie, you can tell them to watch for certain data
to show up, e.g., a _SUCCESS marker file in a directory that receives new
data files, and then start a Hive query, etc. You can also add your
partition creation commands to the workflow, e.g., as soon as the data is
present (or even before; Hive won't care if the data doesn't exist yet).
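
As a rough illustration of the "wait for the marker, then run the query"
idea, here is a minimal Python sketch. It is not how Stampede or Oozie
implement it. The function names are hypothetical, and the marker check
assumes a locally mounted directory; on HDFS you would test for the file
with something like `hadoop fs -test -e` instead. The only Hive feature
it relies on is the `hive -e` command-line option.

```python
import os


def success_marker_present(data_dir):
    """Return True once the producing job has written its _SUCCESS marker.

    Assumption: data_dir is a local (or locally mounted) path.
    """
    return os.path.isfile(os.path.join(data_dir, "_SUCCESS"))


def hive_command(query):
    """Build the argument list for a one-off `hive -e` invocation."""
    return ["hive", "-e", query]


def plan_run(data_dir, query):
    """Return the command to run if the data is ready, otherwise None."""
    if success_marker_present(data_dir):
        return hive_command(query)
    return None
```

A scheduler loop could call plan_run() periodically and hand any non-None
result to subprocess.call(), which starts a fresh Hive CLI process per job.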


> > New partitions will be added regularly
>
When you add a partition, that metadata goes into the metastore, so every
Hive instance sharing that metastore will see it. Of course, you should
avoid scenarios where multiple processes attempt to create the same
partition, although if they are using exactly the same command, then adding
an IF NOT EXISTS clause will avoid error messages. Still, I wouldn't want
to torture test the metastore...
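
To make the IF NOT EXISTS suggestion concrete, here is a small Python
sketch that builds the statement so every job emits exactly the same
command for a given partition. The helper name, the table name, the
partition column, and the path in the usage note are all hypothetical,
and the sketch does no escaping, so it assumes trusted values.

```python
def add_partition_sql(table, spec, location=None):
    """Build an idempotent ALTER TABLE ... ADD PARTITION statement.

    spec is a list of (column, value) pairs, kept in the given order.
    Assumption: values are trusted strings; no quoting or escaping is done.
    """
    parts = ", ".join("{}='{}'".format(col, val) for col, val in spec)
    sql = "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({})".format(table, parts)
    if location is not None:
        # External tables usually point each partition at its own directory.
        sql += " LOCATION '{}'".format(location)
    return sql + ";"
```

For example, add_partition_sql("logs", [("dt", "2013-01-10")],
"/data/logs/2013-01-10") produces a statement that can be rerun safely,
because a second run finds the partition already present and does nothing.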


> What type of partitions are you adding? Why frequently?
>
>
>
>
> Sean
>
>
> On 1/10/13 3:03 PM, "Tom Brown" <tombrown52@gmail.com> wrote:
>
> >All,
> >
> >I want to automate jobs against Hive (using an external table with
> >ever growing partitions), and I'm running into a few challenges:
> >
> >Concurrency - If I run Hive as a thrift server, I can only safely run
> >one job at a time. As such, it seems like my best bet will be to run
> >it from the command line and set up a brand new instance for each job.
> >That's quite a bit of a hassle to solve a seemingly common problem, so
> >I want to know if there are any accepted patterns or best practices
> >for this?
> >
> >Partition management - New partitions will be added regularly. If I
> >have to set up multiple instances of Hive for each (potentially)
> >overlapping job, it will be difficult to keep track of the partitions
> >that have been added. In the context of the preceding question, what
> >is the best way to add metadata about new partitions?
> >
> >Thanks in advance!
> >
> >--Tom
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330
