hive-user mailing list archives

From Edward Capriolo <>
Subject Re: Classpath question with testcase and UDF
Date Tue, 14 Jul 2009 22:18:58 GMT
On Tue, Jul 14, 2009 at 5:40 PM, Frederick Oko<> wrote:
> Is this just a workaround for not being able to as easily extend the grammar
> with something like 'INSERT TABLE JDBC(...' which would process over the
> final result set in bulk instead of a per record evaluate()?
> On Jul 14, 2009 8:57 AM, "Edward Capriolo" <> wrote:
> Thanks.
> I was using add file. That did the trick.
> 2009/7/14 He Yongqiang <>:
>> Try "add jar your-db-connector-jar" first.
>>
>> Yongqiang
>>
>> On 09-7-14 10:57 PM, "Edward Capriolo"...


A bulk operation would perform better. I was going to build a more
complete version and open a JIRA for it; I guess I let the cat out of
the bag on this one.
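The per-record vs. bulk distinction above can be sketched in a few lines. This is a minimal illustration, not the actual UDF: it uses Python's sqlite3 as a stand-in for the JDBC target the UDF would write to, and the `summary` table name is hypothetical.

```python
import sqlite3

# sqlite3 stands in for the JDBC-reachable database; the `summary`
# table and its columns are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summary (http_status INTEGER, cnt INTEGER)")

rows = [(200, 65300), (304, 4000), (404, 1003)]

# Per-record style: one statement per evaluate() call, one round trip each.
for r in rows:
    conn.execute("INSERT INTO summary VALUES (?, ?)", r)

# Bulk style: one batched call over the final result set.
conn.executemany("INSERT INTO summary VALUES (?, ?)", rows)
conn.commit()
```

For a handful of summary rows the per-record cost is negligible, which is why a UDF is tolerable for small summaries; the batched form is what an `INSERT TABLE JDBC(...)`-style grammar extension would buy for larger ones.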

My Notes:

My dataflow looks like this:

RawWebLogs (5-minute interval) -> Hive raw_web_log, partitioned by date and hour

Every hour I want to summarize the previous hour for fast access.
These summaries might be used directly by a web application. For
example, I want a summary of http_status and count, which would have
5-10 rows:

200  65300
304   4000
404   1003

A medium summary, like a page count, may look like this and have <50,000 rows:
/page1  9000
/page2  5454
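The small http_status summary above is just a group-and-count over the raw hour. A sketch of that aggregation (the `http_status` field name is from the example above; the raw records here are made up):

```python
from collections import Counter

# Hypothetical raw web-log records for one hour; only the status
# field matters for this summary.
raw_hour = [
    {"http_status": 200}, {"http_status": 200}, {"http_status": 200},
    {"http_status": 304},
    {"http_status": 404},
]

# Equivalent of GROUP BY http_status with COUNT(*).
status_summary = Counter(r["http_status"] for r in raw_hour)
```

The result has one row per distinct status, which is why this summary stays in the 5-10 row range regardless of raw log volume.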

   -> Large Summaries   > 100,000 rows
raw-> Medium Summaries  < 100,000 rows
   -> Small Summaries   < 1,000 rows

Small Summaries:
		Might be too small for a partition, but could be justified.
		Loading into a relational database should be fast.
		Operations like sorting could be handled in a client browser/AJAX.

Medium Summaries:
		Might be too small for a partition, but could be justified.
		Loading into a relational database should be fast.
		Operations like sorting and column pruning should be handled server side.

Large Summaries:
		Should be in a partition.
		Loading into a relational database should be throttled.
		Operations like sorting or column pruning might need to be done by
		another summary, or should be pre-sorted.
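The tiering above is just a bucketing by row count at the 1,000 and 100,000 thresholds. A trivial sketch (the function name and return labels are mine, not from any Hive API):

```python
def summary_tier(row_count):
    """Bucket a summary by row count, using the thresholds above:
    small < 1,000 rows, medium < 100,000 rows, large otherwise."""
    if row_count < 1_000:
        return "small"
    if row_count < 100_000:
        return "medium"
    return "large"
```

The tier then decides the handling policy: small and medium go straight to the relational database, large stays in a Hive partition and gets throttled if exported.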

Currently I am dropping these summaries into a partition of another
table; some are very, very small.

raw_web_log_http_status partition date, partition hour

The UDF would be targeting small and medium summaries. Performance
should not be an issue since the number of inserts would be small.
Also, if performance does become an issue, the JDBC connections could
come from a connection pool.

Large inserts would likely have issues that neither a custom grammar
nor a UDF could address, like throttling, replication, and sharding.
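Throttling a large export is the simplest of those concerns to sketch: feed the rows to the database in fixed-size chunks with a pause between them. This is an illustrative Python sketch, not Hive or JDBC code; `insert_batch` stands in for whatever call actually writes a batch.

```python
import time

def throttled_insert(rows, insert_batch, batch_size=1000, pause_s=0.0):
    """Send rows to insert_batch in fixed-size chunks, sleeping
    between chunks so a large summary load does not overwhelm
    the target database."""
    for i in range(0, len(rows), batch_size):
        insert_batch(rows[i:i + batch_size])
        time.sleep(pause_s)
```

Replication and sharding have no such one-liner, which is why large summaries probably belong in a Hive partition rather than behind a UDF.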

My goal is to be able to dump my summaries into a Hive partition
and/or a MySQL table. That way my front-end web server does not need
exotic Hadoop/Hive access to show people stats. Do you think I am on
the right track?

