hive-dev mailing list archives

From "Xuefu Zhang (JIRA)" <>
Subject [jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
Date Thu, 18 Dec 2014 15:04:13 GMT


Xuefu Zhang commented on HIVE-9153:

Thanks for the findings, [~lirui]. I heard that the Spark snapshot we are using is 2X slower
than the previous version. This might explain the slowness. Also, I think both the number of
mappers and locality matter for speed, but the two may conflict with each other. For instance,
if we have more executors than mappers, it's desirable to have more map tasks. However, doing so
might hurt locality because some mappers might read remotely. On the other hand, if there
are more mappers than executors, then fewer mappers will help the speed.
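
To make the trade-off above concrete, here is a rough sketch of the arithmetic. It assumes every map task takes about the same time and that each executor slot runs one task at a time; the function name and its parameters are hypothetical, and locality is deliberately not modeled.

```python
import math


def map_stage_shape(num_map_tasks: int, num_executor_slots: int) -> dict:
    """Estimate scheduling waves and idle slots for one map stage.

    Hypothetical helper for illustration only. It assumes uniform task
    duration and ignores data locality entirely.
    """
    if num_map_tasks < 0 or num_executor_slots <= 0:
        raise ValueError("need num_map_tasks >= 0 and num_executor_slots > 0")
    # Tasks run in rounds ("waves") of at most num_executor_slots at a time.
    waves = math.ceil(num_map_tasks / num_executor_slots)
    # Slots that never receive a task when there are fewer tasks than slots.
    idle_slots = max(0, num_executor_slots - num_map_tasks)
    return {"waves": waves, "idle_slots": idle_slots}


# Fewer mappers than executors: slots sit idle, so more map tasks could help.
few_mappers = map_stage_shape(num_map_tasks=4, num_executor_slots=10)

# More mappers than executors: tasks queue up and run in several waves.
many_mappers = map_stage_shape(num_map_tasks=25, num_executor_slots=10)
```

With idle slots, splitting the input further adds parallelism at the possible cost of remote reads; with several waves, combining splits reduces per-task overhead.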

Anyway, it would be good to find out how Tez generates splits using HiveInputFormat. Also,
we should fix HIVE-8722. Is there a way to disable Spark's delay scheduling so we can try it out?

> Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
> ---------------------------------------------------------------------
>                 Key: HIVE-9153
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: spark-branch
>            Reporter: Brock Noland
>            Assignee: Rui Li
>         Attachments: screenshot.PNG
> The default InputFormat is {{CombineHiveInputFormat}} and thus HOS uses this. However,
> Tez uses {{HiveInputFormat}}. Since tasks are relatively cheap in Spark, it might make sense
> for us to use {{HiveInputFormat}} as well. We should evaluate this on a query which has many
> input splits such as {{select count(\*) from store_sales where something is not null}}.

This message was sent by Atlassian JIRA
