hive-dev mailing list archives

From "Prasanth J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-7052) Optimize split calculation time
Date Fri, 30 May 2014 06:36:01 GMT

    [ https://issues.apache.org/jira/browse/HIVE-7052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013349#comment-14013349 ]

Prasanth J commented on HIVE-7052:
----------------------------------

+1

> Optimize split calculation time
> -------------------------------
>
>                 Key: HIVE-7052
>                 URL: https://issues.apache.org/jira/browse/HIVE-7052
>             Project: Hive
>          Issue Type: Bug
>         Environment: hive + tez
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: performance
>         Attachments: HIVE-7052-profiler-1.png, HIVE-7052-profiler-2.png, HIVE-7052-v3.patch, HIVE-7052-v7.patch
>
>
> When running a TPC-DS query (query_27), a significant amount of time was spent in split computation on a 200 GB dataset (ORC format).
> Profiling revealed that:
> 1. A lot of time was spent in Config's substitutevar (regex) in the HiveInputFormat.getSplits() method.
> 2. A FileSystem was created repeatedly in OrcInputFormat.generateSplitsInfo().
> I will attach the profiler snapshots soon.
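
The two hot spots described above can be illustrated with a minimal, self-contained sketch. It is not the attached patch and it does not use the Hive or Hadoop APIs: FsHandle, fsFor, substitute, and planSplits are hypothetical names invented for this example. The assumption is only that the general remedy is to resolve regex-based variable substitution once per planning call rather than once per file, and to reuse one file system handle per scheme and authority.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitCacheSketch {
    // Hypothetical stand-in for a file system object that is expensive to create.
    static final class FsHandle {
        final String key;
        FsHandle(String key) { this.key = key; }
    }

    // Counters that make the amount of repeated work visible.
    static final AtomicInteger fsCreations = new AtomicInteger();
    static final AtomicInteger substitutions = new AtomicInteger();

    private static final Map<String, FsHandle> FS_CACHE = new ConcurrentHashMap<>();
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Returns one shared handle per scheme://authority instead of one per file.
    static FsHandle fsFor(URI path) {
        String key = path.getScheme() + "://" + path.getAuthority();
        return FS_CACHE.computeIfAbsent(key, k -> {
            fsCreations.incrementAndGet();
            return new FsHandle(k);
        });
    }

    // Regex-based ${name} expansion; unknown variables are left untouched.
    static String substitute(String raw, Map<String, String> vars) {
        substitutions.incrementAndGet();
        Matcher m = VAR.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Resolves the setting once, then reuses cached handles inside the loop.
    static int planSplits(List<URI> files, String rawSetting, Map<String, String> vars) {
        String resolved = substitute(rawSetting, vars); // hoisted out of the loop
        int planned = 0;
        for (URI file : files) {
            FsHandle fs = fsFor(file);
            if (fs != null && !resolved.isEmpty()) {
                planned++;
            }
        }
        return planned;
    }

    public static void main(String[] args) {
        List<URI> files = List.of(
            URI.create("hdfs://nn1:8020/warehouse/t/part-0.orc"),
            URI.create("hdfs://nn1:8020/warehouse/t/part-1.orc"),
            URI.create("hdfs://nn1:8020/warehouse/t/part-2.orc"));
        int planned = planSplits(files, "${user}/scratch", Map.of("user", "hive"));
        System.out.println(planned + " files planned, "
            + fsCreations.get() + " handle(s) created, "
            + substitutions.get() + " substitution(s)");
    }
}
```

In this sketch the cost of handle creation and substitution no longer grows with the number of files, which is the property the profiling results suggest was missing.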



--
This message was sent by Atlassian JIRA
(v6.2#6252)
