hive-dev mailing list archives

From "Brock Noland (JIRA)" <>
Subject [jira] [Commented] (HIVE-860) Persistent distributed cache
Date Tue, 18 Feb 2014 21:39:22 GMT


Brock Noland commented on HIVE-860:

bq. Does this work for jars on HDFS that have been added using the ADD JAR functionality?

Yes, jars added via that mechanism are also cached.

bq. So when a non-local jar is added by a session, it gets copied locally to the session resource
directory. But if the local copy of the jar has the same file name/md5 hash/mtime as what
is already saved in the user's distributed cache, then this should work right?

This patch uses sha1 + file size to ensure the files are the same. In practice the file size
check only confirms that the jar is complete, since sha1 should be unique enough for our purposes.
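
For illustration only, here is a minimal sketch of a sha1 + file size comparison. It is not taken from the patch: the class name JarFingerprint and the methods sha1Hex and sameContent are hypothetical, and it reads local files through java.nio rather than going through HDFS.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Hive or of the HIVE-860 patch.
public class JarFingerprint {

    /** Returns the lowercase hex SHA-1 digest of the file's contents. */
    static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            // Stream the file so large jars are never held fully in memory.
            while ((n = in.read(buf)) != -1) {
                digest.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Treats two files as the same when both size and SHA-1 match.
     * The size comparison runs first: it is cheap and rejects a
     * partially copied jar before any hashing is done.
     */
    static boolean sameContent(Path a, Path b) throws IOException, NoSuchAlgorithmException {
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        return sha1Hex(a).equals(sha1Hex(b));
    }
}
```

With a check like this, a jar that is already cached would only be copied again when sameContent returns false for the cached copy and the newly added file.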

> Persistent distributed cache
> ----------------------------
>                 Key: HIVE-860
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>            Reporter: Zheng Shao
>            Assignee: Brock Noland
>             Fix For: 0.13.0
>         Attachments: HIVE-860.patch, HIVE-860.patch, HIVE-860.patch, HIVE-860.patch,
> DistributedCache is shared across multiple jobs if the HDFS file name is the same.
> We need to make sure Hive puts the same file into the same location every time and does
> not overwrite it if the file content is the same.
> We can achieve 2 different results:
> A1. Files added with the same name, timestamp, and md5 in the same session will have
> a single copy in the distributed cache.
> A2. Files added with the same name, timestamp, and md5 will have a single copy in the
> distributed cache.
> A2 has a bigger benefit in sharing but may raise a question on when Hive should clean
> it up in HDFS.

This message was sent by Atlassian JIRA