drill-issues mailing list archives

From "Padma Penumarthy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4706) Fragment planning causes Drillbits to read remote chunks when local copies are available
Date Tue, 01 Nov 2016 16:37:58 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15625907#comment-15625907 ]

Padma Penumarthy commented on DRILL-4706:

For the data mentioned in the description of the problem, the files are not distributed
equally among the nodes: 4 nodes have 16 files each, 3 nodes have 17 files each, and the
other 3 nodes have 15 files each. With the soft affinity parallelizer, we allocate 16
fragments on each node, so each node that has only 15 parquet files locally does a remote
read for one of its fragments. These 3 remote reads, one rowGroup each (3 * 512 MB ~ 1.5 GB),
explain the 2% (of 70 GB) remote reads. With the local affinity parallelizer, we schedule
16 fragments on each of the 4 nodes, 17 on each of the 3 nodes, and 15 on each of the other
3 nodes. There were no remote reads in this case.
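
The arithmetic above can be checked with a short sketch. This is not Drill's parallelizer
code; it is a simplified model that only counts fragments and local files per node. The
names (local_files, remote_reads, ROW_GROUP_MB) are made up for illustration, and the model
assumes every fragment beyond a node's local file count results in exactly one remote read
of one 512 MB rowGroup.

```python
# Simplified model of the numbers in the comment above (not Drill code).
# Local parquet file counts per node: 4 nodes with 16, 3 with 17, 3 with 15.
local_files = [16] * 4 + [17] * 3 + [15] * 3

total_files = sum(local_files)   # 160 files in total
num_nodes = len(local_files)     # 10 nodes

# Soft affinity, as described in the comment: the same number of
# fragments on every node.
fragments_per_node = total_files // num_nodes
soft_assignment = [fragments_per_node] * num_nodes

# Local affinity, as described in the comment: one fragment per local file.
local_assignment = list(local_files)


def remote_reads(assigned, local):
    """Count fragments a node must read remotely (assumed model:
    any fragment beyond the node's local file count is remote)."""
    return sum(max(0, a - l) for a, l in zip(assigned, local))


soft_remote = remote_reads(soft_assignment, local_files)
local_remote = remote_reads(local_assignment, local_files)

ROW_GROUP_MB = 512   # rowGroup size used in the comment
TABLE_GB = 70        # table size from the issue description

remote_gb = soft_remote * ROW_GROUP_MB / 1024
remote_pct = 100 * remote_gb / TABLE_GB

print(f"soft affinity:  {soft_remote} remote reads, "
      f"{remote_gb} GB, {remote_pct:.1f}% of {TABLE_GB} GB")
print(f"local affinity: {local_remote} remote reads")
```

Under these assumptions, the soft affinity assignment gives 3 remote reads (about 1.5 GB,
roughly 2% of 70 GB) and the local affinity assignment gives none, which matches the
figures reported above.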

> Fragment planning causes Drillbits to read remote chunks when local copies are available
> ----------------------------------------------------------------------------------------
>                 Key: DRILL-4706
>                 URL: https://issues.apache.org/jira/browse/DRILL-4706
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.6.0
>         Environment: CentOS, RHEL
>            Reporter: Kunal Khatua
>            Assignee: Sorabh Hamirwasia
>              Labels: performance, planning
> When a table (datasize=70GB) of 160 parquet files (each having a single rowgroup and
> fitting within one chunk) is available on a 10-node setup with replication=3, a pure data
> scan query causes about 2% of the data to be read remotely.
> Even with the creation of a metadata cache, the planner selects a sub-optimal plan
> for executing the SCAN fragments, such that some of the data is served from a remote server.

This message was sent by Atlassian JIRA
