hive-dev mailing list archives

From "Brock Noland (JIRA)" <>
Subject [jira] [Commented] (HIVE-860) Persistent distributed cache
Date Tue, 18 Feb 2014 23:07:23 GMT


Brock Noland commented on HIVE-860:

bq. Are you proposing to change the contents of the hive-exec.jar as distributed with Hive
or just as pushed to Hadoop for running a job?


bq.  If it's the former won't it mean that any project that includes hive-exec.jar in its
pom.xml will have to change its pom to explicitly include all of the extra jars now in the
fat jar?

Nope. The previously shaded jars are listed as dependencies in the source pom file, and thus
they will be pulled in transitively by depending on hive-exec. I have verified this locally.
That is, after a mvn install before the patch, all the currently shaded jars are removed from
the published pom file, so they are not pulled in transitively. After the patch, the only jar
which is shaded is kryo, and it is the only one which is removed from the published pom. That
is to say, the other dependencies remain in the pom for clients. This is in line with my expectations.

> Persistent distributed cache
> ----------------------------
>                 Key: HIVE-860
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>            Reporter: Zheng Shao
>            Assignee: Brock Noland
>             Fix For: 0.13.0
>         Attachments: HIVE-860.patch, HIVE-860.patch, HIVE-860.patch, HIVE-860.patch, HIVE-860.patch, HIVE-860.patch
> DistributedCache is shared across multiple jobs if the HDFS file name is the same.
> We need to make sure Hive puts the same file into the same location every time and does
> not overwrite it if the file content is the same.
> We can achieve 2 different results:
> A1. Files added with the same name, timestamp, and md5 in the same session will have
> a single copy in the distributed cache.
> A2. Files added with the same name, timestamp, and md5 will have a single copy in the
> distributed cache.
> A2 has a bigger benefit in sharing but may raise a question on when Hive should clean
> it up in HDFS.
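
One way to picture the difference between A1 and A2 is as a choice of cache path. The sketch below is only an illustration under stated assumptions, not what the attached patch does: the names cache_path and add_file, the /tmp/hive-cache root, and the directory layout are all hypothetical, and a plain dict stands in for HDFS. Passing a session id scopes the path to one session (A1); omitting it makes the path shared across sessions (A2).

```python
import hashlib
import posixpath
from typing import Dict, Optional


def cache_path(root: str, name: str, timestamp: int, content: bytes,
               session_id: Optional[str] = None) -> str:
    """Build a deterministic path from name, timestamp, and md5 (hypothetical layout).

    With a session_id the path is unique per session (A1);
    without one it is shared by every session (A2).
    """
    digest = hashlib.md5(content).hexdigest()
    parts = [root]
    if session_id is not None:
        parts.append(session_id)
    parts.append(f"{digest}-{timestamp}")
    parts.append(name)
    return posixpath.join(*parts)


def add_file(store: Dict[str, bytes], path: str, content: bytes) -> bool:
    """Copy the file only if nothing is stored at the path yet.

    Returns True when a copy was made and False when the existing copy is reused.
    The dict is a stand-in for the file system, used only to keep the sketch runnable.
    """
    if path in store:
        return False
    store[path] = content
    return True


if __name__ == "__main__":
    store: Dict[str, bytes] = {}
    jar = b"fake jar bytes"
    shared = cache_path("/tmp/hive-cache", "udf.jar", 1392764843, jar)
    print(add_file(store, shared, jar))  # first add copies the file
    print(add_file(store, shared, jar))  # same name, timestamp, md5: reused
```

Because the A2 path does not depend on the session, nothing in this sketch records who still needs the file, which is the cleanup question the description raises.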

This message was sent by Atlassian JIRA
