hive-dev mailing list archives

From "Venki Korukanti (JIRA)" <>
Subject [jira] [Commented] (HIVE-7492) Enhance SparkCollector
Date Thu, 07 Aug 2014 18:08:16 GMT


Venki Korukanti commented on HIVE-7492:

Hi [~brocknoland], 

I was about to create a JIRA for the same, but I have the following questions:
* How does cleanup work if the task exits abnormally?
* Where should these temp files be created on DFS?

Currently RowContainer is used in the join operator (in mainline Hive, not just the Spark branch), so
it can create temp files as part of the reduce task when the output exceeds the in-memory block size.
In the case of MapReduce tasks, the MR framework overrides the default tmp dir location with a location
under the JVM working directory (see [here|]) using a JVM arg, and the framework deletes the working
directory of the JVM whenever the JVM exits or the job is killed. As RowContainer temp files are also
created under this temp dir using {{File.createTempFile}}, they will also get cleaned up.
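To illustrate the mechanism described above, here is a minimal sketch (not Hive code; the directory path is illustrative): {{File.createTempFile(prefix, suffix)}} places files under the directory named by the {{java.io.tmpdir}} system property, so a framework that points {{java.io.tmpdir}} at its own working directory implicitly controls where such temp files land, and deleting that working directory removes them.

```java
import java.io.File;
import java.io.IOException;

// Sketch: temp files created without an explicit directory land under
// java.io.tmpdir, so redirecting that property (as the MR child JVM does
// with a JVM arg) redirects RowContainer-style spill files as well.
public class TmpDirDemo {
    public static void main(String[] args) throws IOException {
        // Point java.io.tmpdir at a job-local working directory
        // (illustrative path; set before the first createTempFile call).
        File workDir = new File("build/mr-work-dir/tmp");
        workDir.mkdirs();
        System.setProperty("java.io.tmpdir", workDir.getAbsolutePath());

        // This file is now created under workDir, not the system /tmp.
        File spill = File.createTempFile("rowcontainer", ".tmp");
        System.out.println(spill.getParent().equals(workDir.getAbsolutePath()));

        spill.delete();
    }
}
```

Deleting the working directory on JVM exit then cleans up everything created this way in one sweep.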

I was looking at the Spark code. Spark provides an API, org.apache.spark.util.Utils.createTempDir(),
which also adds a shutdown hook to delete the tmp dir when the JVM exits. Should we use the same
API and provide it to RowContainer? It would still be on the local FS.
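For reference, the pattern behind such an API can be sketched as follows. This is not Spark's actual implementation, and the names ({{createTempDirWithCleanup}}, {{deleteRecursively}}) are illustrative: create a temp directory and register a JVM shutdown hook that deletes it recursively, so spill files are removed on normal exit and on abnormal task exits that still run shutdown hooks.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Sketch of the temp-dir-with-shutdown-hook pattern (illustrative names,
// not Spark's Utils.createTempDir source).
public class ShutdownCleanupDemo {

    // Create a temp directory and arrange for it to be deleted on JVM exit.
    static File createTempDirWithCleanup(String prefix) throws IOException {
        File dir = Files.createTempDirectory(prefix).toFile();
        Runtime.getRuntime().addShutdownHook(new Thread(() -> deleteRecursively(dir)));
        return dir;
    }

    // Delete a file, or a directory and all of its contents.
    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }

    public static void main(String[] args) throws IOException {
        File spillDir = createTempDirWithCleanup("rowcontainer-spill");
        // Spill files created inside spillDir are reclaimed by the hook.
        File spill = File.createTempFile("block", ".tmp", spillDir);
        System.out.println(spill.exists());
    }
}
```

Note the caveat this implies for the questions above: a shutdown hook covers JVM exit and job kills that terminate the JVM, but not a hard kill (e.g. SIGKILL), where hooks never run.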

> Enhance SparkCollector
> ----------------------
>                 Key: HIVE-7492
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Venki Korukanti
>             Fix For: spark-branch
>         Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch
> SparkCollector is used to collect the rows generated by HiveMapFunction or HiveReduceFunction.
> It is currently backed by an ArrayList, and thus has unbounded memory usage. Ideally, the collector
> should have a bounded memory usage, and be able to spill to disk when its quota is reached.

This message was sent by Atlassian JIRA
