hive-dev mailing list archives

From "Jerome Boulon (JIRA)" <>
Subject [jira] Created: (HIVE-834) Hive merge & queries concurrency issue
Date Tue, 15 Sep 2009 20:41:57 GMT
Hive merge & queries concurrency issue

                 Key: HIVE-834
             Project: Hadoop Hive
          Issue Type: Improvement
          Components: Metastore, Query Processor, Server Infrastructure
            Reporter: Jerome Boulon

Today we load our Hive table every XX minutes, so by the end of the day (or sooner) we
have to run a Hive merge in order to 1) reduce the number of files on HDFS and 2) improve
Hive performance.

During that merge, a query run against the table may fail with a FileNotFound exception
because of the merge.
The idea is to use some kind of versioning so that queries can keep running while Hive is
doing a merge.

At a high level, the merge will:
1) Create a new version V2, so new writers will write to the new version and readers will
have to read from both
2.0) Put a merge flag on the V1 directory with a UUID/timeout/etc. to prevent any other merge
while that one is running
2.1) New select queries will read from V1 and V2
2.2) New write queries will write to V2
3) Run the merge
4) Publish the new folder V3
5) Readers will now read from V2 and V3
6) Older versions can be removed in the background so running queries will not fail
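
The steps above can be sketched as a small in-memory model. This is only an illustration
of the proposed flow, not Hive code: the class VersionedTable and all of its method and
attribute names are hypothetical, versions are plain integers, and a thread lock stands in
for whatever transaction mechanism is eventually chosen.

```python
import threading
import uuid


class VersionedTable:
    """Hypothetical in-memory model of a table stored in versioned directories."""

    def __init__(self):
        self._lock = threading.Lock()
        self.files = {1: []}     # version -> file names in that version
        self.readable = [1]      # versions a new reader has to scan
        self.write_version = 1   # version new writers append to
        self.merge_flag = None   # (source version, token) while a merge runs
        self.retired = []        # merged versions waiting for background removal
        self._next_version = 2

    def write(self, name):
        """New writes always go to the current write version (step 2.2)."""
        with self._lock:
            self.files[self.write_version].append(name)

    def snapshot(self):
        """Versions a query starting now would read (steps 2.1 and 5)."""
        with self._lock:
            return list(self.readable)

    def begin_merge(self):
        """Steps 1 and 2.0: open a new write version and flag the old one."""
        with self._lock:
            if self.merge_flag is not None:
                raise RuntimeError("another merge is already running")
            source = self.write_version
            new_version = self._next_version
            self._next_version += 1
            self.files[new_version] = []
            self.readable.append(new_version)
            self.write_version = new_version
            token = uuid.uuid4().hex
            self.merge_flag = (source, token)
            return source, token

    def publish_merge(self, source, token, merged_files):
        """Steps 4 to 6: publish the merged version and retire the source."""
        with self._lock:
            if self.merge_flag != (source, token):
                raise RuntimeError("merge flag does not match")
            target = self._next_version
            self._next_version += 1
            self.files[target] = list(merged_files)
            self.readable = [v for v in self.readable if v != source] + [target]
            # The source files are kept until a background job removes them,
            # so queries that took their snapshot earlier can still finish.
            self.retired.append(source)
            self.merge_flag = None
            return target
```

A query would call snapshot() once when it starts and read only those versions, which is
why removing a retired version has to wait until no running query still holds it. The
sketch leaves that bookkeeping, and the timeout on the merge flag, out.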

In practice it's a little more complicated because we need to do that in a transaction, but
it sounds feasible
and would involve something like either Zookeeper or database transactions.
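
As a rough illustration of the database-transaction option, the sketch below switches
readers from the merged source version to the published version in a single transaction.
Everything in it is an assumption made for the example: the versions table, its columns,
the state values 'active' and 'retired', and the function names. It uses Python's built-in
sqlite3 module only so that it is self-contained.

```python
import sqlite3


def open_registry():
    """Create a hypothetical version registry in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE versions ("
        " table_name TEXT, version INTEGER, state TEXT,"
        " PRIMARY KEY (table_name, version))"
    )
    return conn


def publish(conn, table_name, merged_from, merged_to):
    """Retire the merged source version and activate the new one atomically."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE versions SET state = 'retired'"
            " WHERE table_name = ? AND version = ? AND state = 'active'",
            (table_name, merged_from),
        )
        if cur.rowcount != 1:
            raise RuntimeError("source version is not active")
        conn.execute(
            "INSERT INTO versions VALUES (?, ?, 'active')",
            (table_name, merged_to),
        )


def active_versions(conn, table_name):
    """Versions a new reader should scan."""
    rows = conn.execute(
        "SELECT version FROM versions"
        " WHERE table_name = ? AND state = 'active' ORDER BY version",
        (table_name,),
    )
    return [row[0] for row in rows]
```

Because both statements run in one transaction, a reader sees either the old set of active
versions or the new one, never a mix of the two.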

Also, it would be nice to be able to trigger a two-level merge:
- quick merge during the day: files smaller than XX MB (while your partition is still active, hot)
- full merge at the end of the day (cold data)
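
One way to picture the two levels is as a file-selection rule. The function below is a
hypothetical sketch: the name select_for_merge, the level names "quick" and "full", and the
small_file_limit parameter (standing in for the XX MB threshold, which is left open here)
are all assumptions made for the example.

```python
def select_for_merge(file_sizes, level, small_file_limit):
    """Return the names of the files a merge at the given level would combine.

    file_sizes maps a file name to its size in bytes.
    level is "quick" (hot partition) or "full" (cold data).
    """
    if level == "full":
        # End-of-day merge: every file in the partition is a candidate.
        selected = sorted(file_sizes)
    elif level == "quick":
        # Daytime merge: only files below the size threshold.
        selected = sorted(
            name for name, size in file_sizes.items() if size < small_file_limit
        )
    else:
        raise ValueError("unknown merge level: %r" % (level,))
    # Merging fewer than two files would not reduce the file count.
    return selected if len(selected) >= 2 else []
```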

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
