hive-dev mailing list archives

From "Ning Zhang (JIRA)" <>
Subject [jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
Date Wed, 18 May 2011 19:38:47 GMT


Ning Zhang commented on HIVE-2144:

Great! I like the idea. 

One comment about the primary key constraint: I'm not sure UNIQUE is the standard way to
specify a primary key constraint. People use Oracle, MS SQL Server, and Postgres as the metastore,
so we should use a standard form. I think 'id varchar(255) PRIMARY KEY' is more widely supported.
Can you double-check with MySQL and Derby?
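
To make the suggested column definition concrete, here is a minimal sketch that declares the key with a column-level PRIMARY KEY and checks that a duplicate ID is rejected. It runs against SQLite only because that engine ships with Python; it says nothing about MySQL or Derby, which still need to be checked as asked above. The table name stats_pk_demo and the row_count column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column-level PRIMARY KEY, as suggested in the comment.
# Table and column names are hypothetical, not Hive's actual schema.
conn.execute(
    "CREATE TABLE stats_pk_demo (id VARCHAR(255) PRIMARY KEY, row_count BIGINT)"
)
conn.execute("INSERT INTO stats_pk_demo VALUES (?, ?)", ("task_0", 100))

# A second row with the same ID should violate the key constraint.
try:
    conn.execute("INSERT INTO stats_pk_demo VALUES (?, ?)", ("task_0", 100))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print("duplicate rejected:", duplicate_rejected)
```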

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>                 Key: HIVE-2144
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted
by another task (most likely a speculative or previously failed task). Depending on whether the
ID is there, an INSERT or UPDATE query is issued. So there are basically two queries per
row inserted into the intermediate stats table. This workload could be halved if we
insert the row anyway (it is very rare that IDs are duplicated) and use a different SQL query in
the aggregation phase to dedup the IDs (e.g., using group-by and max()). The benefit is that
even though the aggregation query is more expensive, it is only run once per query.
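
The insert-always idea described in the issue can be sketched roughly as follows. This is an illustration under stated assumptions, not the JDBCStatsPublisher code: the table stats_tmp, the row_count column, the ID format, and the helper names publish and aggregate are all hypothetical, the table deliberately has no key constraint so that duplicate IDs are accepted, and SQLite stands in for the real stats database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical intermediate stats table with no key constraint,
# so duplicate IDs from retried or speculative tasks are accepted.
conn.execute("CREATE TABLE stats_tmp (id VARCHAR(255), row_count BIGINT)")


def publish(conn, stat_id, row_count):
    """Insert unconditionally: one statement per row, no SELECT beforehand."""
    conn.execute(
        "INSERT INTO stats_tmp (id, row_count) VALUES (?, ?)",
        (stat_id, row_count),
    )


def aggregate(conn, prefix):
    """Collapse duplicate IDs with GROUP BY and MAX(), then sum the result."""
    sql = """
        SELECT COALESCE(SUM(c), 0) FROM (
            SELECT MAX(row_count) AS c
            FROM stats_tmp
            WHERE id LIKE ?
            GROUP BY id
        ) AS deduped
    """
    return conn.execute(sql, (prefix + "%",)).fetchone()[0]


publish(conn, "part=1/task_0", 100)
publish(conn, "part=1/task_0", 100)  # same ID again, e.g. a speculative attempt
publish(conn, "part=1/task_1", 50)

total = aggregate(conn, "part=1/")
print(total)
```

The duplicate row is counted once because MAX() picks a single value per ID before the outer SUM(). The extra grouping cost is paid only in the one aggregation query, while each published row costs a single INSERT.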

This message is automatically generated by JIRA.