hive-dev mailing list archives

From "Rui Li (JIRA)" <>
Subject [jira] [Commented] (HIVE-8043) Support merging small files [Spark Branch]
Date Wed, 17 Sep 2014 15:03:34 GMT


Rui Li commented on HIVE-8043:

Hi [~xuefuz],

I looked into the patch in HIVE-7704. My understanding is that the newly added operator, mapper,
etc. are only for (fast) merging of RC and ORC files. Other file formats are still merged
by the {{TS -> FS}} work. For RC and ORC files this work is a {{MergeFileWork}}; for other formats
it is a {{MapWork}}. Depending on the execution engine, the work is then wrapped
in a {{MapredWork}}, {{TezWork}} or {{SparkWork}}.

For RC and ORC files, {{MergeFileMapper}} is used instead of {{ExecMapper}}. The main difference
between the two mappers is that {{MergeFileMapper}} wraps and uses {{AbstractFileMergeOperator}}
(which has two implementations, one for RC and one for ORC files) as the top operator, while
{{ExecMapper}} uses {{MapOperator}}.
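
To make the mapping above easier to follow, here is a minimal sketch of how I read it. It is not
code from the HIVE-7704 patch: the class {{MergePlanSketch}}, the {{Engine}} enum and all method
names are hypothetical, and the Hive class names appear only as strings taken from the description
above. The mapper and top-operator mapping reflects the MR path only.

```java
import java.util.Locale;

public class MergePlanSketch {
    // Hypothetical stand-in for the configured execution engine.
    enum Engine { MR, TEZ, SPARK }

    // RC and ORC files get the fast-merge work; every other format
    // is merged by an ordinary TS -> FS MapWork.
    static String workFor(String fileFormat) {
        String f = fileFormat.toUpperCase(Locale.ROOT);
        return (f.equals("RC") || f.equals("ORC")) ? "MergeFileWork" : "MapWork";
    }

    // The merge work is wrapped according to the execution engine.
    static String wrapperFor(Engine engine) {
        switch (engine) {
            case MR:  return "MapredWork";
            case TEZ: return "TezWork";
            default:  return "SparkWork";
        }
    }

    // On MR, the kind of work decides which mapper runs it.
    static String mapperFor(String work) {
        return work.equals("MergeFileWork") ? "MergeFileMapper" : "ExecMapper";
    }

    // Each mapper drives a different top operator.
    static String topOperatorFor(String mapper) {
        return mapper.equals("MergeFileMapper") ? "AbstractFileMergeOperator" : "MapOperator";
    }

    public static void main(String[] args) {
        for (String format : new String[] {"ORC", "RC", "TEXT"}) {
            String work = workFor(format);
            String mapper = mapperFor(work);
            System.out.println(format + ": " + wrapperFor(Engine.MR) + " wraps " + work
                + ", run by " + mapper + " with top operator " + topOperatorFor(mapper));
        }
    }
}
```

The point of the sketch is that the file format only decides the kind of work, while the engine
only decides the wrapper, so the two choices are independent of each other.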

I think the following needs to be considered on the Spark side:
* For non-RC/ORC files, I think it should work out of the box, at least for simple cases. We may
need to take extra care with dynamically partitioned tables, multi-insert queries, union queries,
etc. I tested some simple insert queries where I increased {{mapreduce.job.reduces}} to generate
many small files. With {{hive.merge.sparkfiles=false}}, the destination table consists of
all these small files; with it set to true, the small files get merged. I noticed the merging
feature caused some issues in HIVE-7810. I'll verify whether that is still a problem now that we
have union-remove disabled for Spark.
* For RC and ORC files, we need to handle {{MergeFileWork}}. Since {{SparkMapRecordHandler}}
is our counterpart of {{ExecMapper}}, we'll need another record handler as the counterpart of
{{MergeFileMapper}}, and maybe another Hive function as well. I'm working on implementing this
so that I can run some tests.
* MR distinguishes between map-only and map-reduce jobs for merging. I'm not sure whether we
should do something similar for Spark.
* Besides, there seem to be two scenarios where merging is needed: at the end of a job (map-only
or map-reduce), and in a DDL task. I'll investigate this further.
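
For the record handler point above, a rough sketch of the dispatch I have in mind on the Spark
side. This is only an illustration under assumptions: the {{MapWork}} and {{MergeFileWork}} classes
below are empty stand-ins rather than the real Hive classes, modelling {{MergeFileWork}} as a
subclass of {{MapWork}} is an assumption of the sketch, and {{MergeFileRecordHandler}} is a
placeholder name for a handler that does not exist yet.

```java
public class SparkHandlerSketch {
    // Empty stand-ins for the Hive work classes (not the real ones).
    static class MapWork {}
    // Assumption of this sketch: the merge work is a specialised map work.
    static class MergeFileWork extends MapWork {}

    // Pick the record handler for a map-side work on Spark.
    // The more specific type is checked first; otherwise a MergeFileWork
    // would match the MapWork branch and get the wrong handler.
    static String handlerFor(MapWork work) {
        if (work instanceof MergeFileWork) {
            // Placeholder name: counterpart of MergeFileMapper, still to be written.
            return "MergeFileRecordHandler";
        }
        // Existing counterpart of ExecMapper.
        return "SparkMapRecordHandler";
    }

    public static void main(String[] args) {
        System.out.println(handlerFor(new MapWork()));
        System.out.println(handlerFor(new MergeFileWork()));
    }
}
```

If the two work types turn out to be unrelated classes, the order of the checks no longer
matters, but the dispatch itself stays the same.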

Any ideas or suggestions are appreciated. Thanks.

> Support merging small files [Spark Branch]
> ------------------------------------------
>                 Key: HIVE-8043
>                 URL:
>             Project: Hive
>          Issue Type: Task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>              Labels: Spark-M1
> Hive currently supports merging small files with MR as the execution engine. There are
> options available for this, such as
> {code}
> hive.merge.mapfiles
> hive.merge.mapredfiles
> {code}
> {{hive.merge.sparkfiles}} was already introduced in HIVE-7810. To make it work, we might need
> a little more research and design on this.

This message was sent by Atlassian JIRA