hive-dev mailing list archives

From "jiraposter@reviews.apache.org (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
Date Sat, 21 May 2011 01:51:47 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037221#comment-13037221
] 

jiraposter@reviews.apache.org commented on HIVE-2144:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/
-----------------------------------------------------------

(Updated 2011-05-21 01:49:07.819494)


Review request for hive.


Changes
-------

- Amended the test cases to accommodate prefix aggregation
- Fixed unnecessary conf settings
- Fixed exception handling in JDBCStatsPublisher.publishStats: SQLRecoverableException
  is now handled when executing the update statement.


Summary
-------

Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics: the
first checks whether the ID was already inserted by another task, and the second either inserts
a new row or updates the existing one.
The update case occurs very rarely, since duplicates most likely originate from speculative
or failed tasks.

Currently, the schema of the stat table is:

PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT )

It does not have any integrity constraints declared.

We amend it to:

PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).

HIVE-2144 improves performance by greedily performing the INSERT statement first.
Instead of executing two queries per inserted row, we execute a single INSERT query in the
common case.
In the case of a primary key constraint violation, we perform one additional UPDATE query.
The UPDATE query needs to check whether the stats currently being inserted are "newer"
than the ones already in the table.
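
The insert-first flow described above can be sketched roughly as follows. This is not the code from the patch: it models PARTITION_STAT_TABLE as an in-memory map so that it runs without a database. The names StatTable, DuplicateKeyException, publish and updateIfNewer are hypothetical. The sketch also assumes that "newer" means "larger ROW_COUNT", which the summary does not specify. With a real JDBC driver, the duplicate-ID case would surface as an SQLException (for some drivers a java.sql.SQLIntegrityConstraintViolationException) rather than the custom exception used here.

```java
import java.util.HashMap;
import java.util.Map;

class InsertFirstSketch {

    /** Hypothetical stand-in for a primary key violation reported by the database. */
    static class DuplicateKeyException extends Exception {
        DuplicateKeyException(String id) {
            super("duplicate ID: " + id);
        }
    }

    /** In-memory stand-in for PARTITION_STAT_TABLE ( ID PRIMARY KEY, ROW_COUNT ). */
    static class StatTable {
        private final Map<String, Long> rows = new HashMap<>();
        int statementsExecuted = 0;

        // Models an INSERT of (ID, ROW_COUNT); fails if the ID already exists.
        void insert(String id, long rowCount) throws DuplicateKeyException {
            statementsExecuted++;
            if (rows.containsKey(id)) {
                throw new DuplicateKeyException(id);
            }
            rows.put(id, rowCount);
        }

        // Models an UPDATE guarded by a "newer" check.
        // Assumption: "newer" is approximated as "larger row count".
        void updateIfNewer(String id, long rowCount) {
            statementsExecuted++;
            Long current = rows.get(id);
            if (current != null && current < rowCount) {
                rows.put(id, rowCount);
            }
        }

        Long rowCount(String id) {
            return rows.get(id);
        }
    }

    /** Insert first; fall back to a conditional update only on a duplicate ID. */
    static void publish(StatTable table, String id, long rowCount) {
        try {
            table.insert(id, rowCount);
        } catch (DuplicateKeyException e) {
            table.updateIfNewer(id, rowCount);
        }
    }

    public static void main(String[] args) {
        StatTable table = new StatTable();
        publish(table, "part=1", 100L); // common case: a single statement
        publish(table, "part=1", 250L); // duplicate ID: insert fails, then update
        publish(table, "part=1", 50L);  // older stats: update leaves the row unchanged
        System.out.println(table.rowCount("part=1") + " rows, "
                + table.statementsExecuted + " statements");
    }
}
```

In the common case (no duplicate ID) only one statement is executed per row. The second statement is paid only when the primary key constraint is violated.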


This addresses bug HIVE-2144.
    https://issues.apache.org/jira/browse/HIVE-2144


Diffs (updated)
-----

  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125468 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION 

Diff: https://reviews.apache.org/r/765/diff


Testing
-------

TestStatsPublisher JUnit test:
- basic behaviour
- multiple updates
- cleanup of the statistics table after aggregation

Standalone testing on the cluster.
- insert/analyze queries over non-partitioned/partitioned tables

NOTE: For correct behaviour, either the primary_key index needs to be created on the existing
PARTITION_STAT_TABLE table, or the table needs to be dropped, which triggers re-creation of
the table with the constraint declared.


Thanks,

Tomasz



> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted
> by another task (most likely a speculative or previously failed task). Depending on whether
> the ID is there, an INSERT or UPDATE query is issued. So there are basically two queries per
> row inserted into the intermediate stats table. This workload could be cut in half if we
> insert the row anyway (it is very rare that IDs are duplicated) and use a different SQL query
> in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefit is
> that even though the aggregation query is more expensive, it is only run once per query.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
