hive-dev mailing list archives

From "Ning Zhang" <nzh...@fb.com>
Subject Re: Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher
Date Mon, 23 May 2011 21:16:24 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review709
-----------------------------------------------------------



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1417>

    Can you add a comment explaining in what situation this exception will be thrown? Just for the sake
of readers who didn't notice that there is a primary key constraint in the DDL.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1418>

    remove this?


- Ning


On 2011-05-21 01:49:07, Tomasz Nykiel wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/765/
> -----------------------------------------------------------
> 
> (Updated 2011-05-21 01:49:07)
> 
> 
> Review request for hive.
> 
> 
> Summary
> -------
> 
> Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics: the first query checks whether the ID was already inserted by another task, and the second query either inserts a new row or updates the existing one.
> The update case occurs very rarely, since duplicates most likely originate from speculative failed tasks.
> 
> Currently the schema of the stat table is the following:
> 
> PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.
> 
> We amend it to:
> 
> PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).
> 
> HIVE-2144 improves performance by greedily performing the insertion statement.
> Instead of executing two queries per inserted row, we can then execute a single INSERT query.
> In the case of a primary key constraint violation, we perform a single UPDATE query.
> The UPDATE query needs to check whether the stats currently being inserted are "newer" than the ones already in the table.
> 
> 
> This addresses bug HIVE-2144.
>     https://issues.apache.org/jira/browse/HIVE-2144
> 
> 
> Diffs
> -----
> 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125468
>   trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION
> 
> Diff: https://reviews.apache.org/r/765/diff
> 
> 
> Testing
> -------
> 
> TestStatsPublisher JUnit test:
> - basic behaviour
> - multiple updates
> - cleanup of the statistics table after aggregation
> 
> Standalone testing on the cluster.
> - insert/analyze queries over non-partitioned/partitioned tables
> 
> NOTE: For correct behaviour, either the primary_key index needs to be created, or the PARTITION_STAT_TABLE table needs to be dropped, which triggers creation of the table with the constraint declared.
> 
> 
> Thanks,
> 
> Tomasz
> 
>
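
The insert-first strategy described in the quoted summary can be sketched as follows. This is only an illustration, not the code from the patch: it uses Python's built-in sqlite3 module as a stand-in for the JDBC-backed stats database, the function names create_stats_table and publish_stat are hypothetical, and "newer" is assumed here to mean a larger ROW_COUNT, which the summary does not actually specify.

```python
import sqlite3


def create_stats_table(conn):
    # Schema from the review summary, with the PRIMARY KEY constraint declared.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS PARTITION_STAT_TABLE ("
        " ID VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT)"
    )
    conn.commit()


def publish_stat(conn, stat_id, row_count):
    """Greedily INSERT; fall back to one conditional UPDATE on a duplicate ID.

    Returns "inserted", "updated", or "kept" (hypothetical labels).
    """
    try:
        conn.execute(
            "INSERT INTO PARTITION_STAT_TABLE (ID, ROW_COUNT) VALUES (?, ?)",
            (stat_id, row_count),
        )
        conn.commit()
        return "inserted"
    except sqlite3.IntegrityError:
        # Raised when a row with this ID already exists, i.e. the primary key
        # constraint was violated (for example by a speculative task).
        # Assumption: "newer" stats are those with a larger ROW_COUNT.
        cur = conn.execute(
            "UPDATE PARTITION_STAT_TABLE SET ROW_COUNT = ?"
            " WHERE ID = ? AND ROW_COUNT < ?",
            (row_count, stat_id, row_count),
        )
        conn.commit()
        return "updated" if cur.rowcount == 1 else "kept"
```

In the common case this issues one INSERT per row; the UPDATE runs only when the key already exists. Without the PRIMARY KEY constraint the INSERT would never fail and duplicates would accumulate, which is why the note in the quoted message asks for the index to be created or the table to be dropped and re-created.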

