phoenix-issues mailing list archives

From "Geoffrey Jacoby (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-5313) All mappers grab all RegionLocations from .META
Date Fri, 31 May 2019 23:35:00 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853479#comment-16853479 ]

Geoffrey Jacoby commented on PHOENIX-5313:
------------------------------------------

[~tdsilva] - fyi. I heard you and Arun were also looking into this issue?

> All mappers grab all RegionLocations from .META
> -----------------------------------------------
>
>                 Key: PHOENIX-5313
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5313
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Geoffrey Jacoby
>            Priority: Major
>
> Phoenix's MapReduce integration lives in PhoenixInputFormat. It implements getSplits
> by calculating a QueryPlan for the provided SELECT query, and each split gets a mapper. As
> part of this QueryPlan generation, we grab all RegionLocations from .META.
> In PhoenixInputFormat:getQueryPlan: 
> {code:java}
>  // Initialize the query plan so it sets up the parallel scans
>  queryPlan.iterator(MapReduceParallelScanGrouper.getInstance());
> {code}
> In MapReduceParallelScanGrouper.getRegionBoundaries():
> {code:java}
> return context.getConnection().getQueryServices().getAllTableRegions(tableName);
> {code}
> This is fine.
> Unfortunately, each mapper Task spawned by the job will go through this _same_ exercise
> when trying to create the RecordReader. Because HBase 1.x and later removed .META prefetching
> and caching from the HBase client, not only does each _Job_ make potentially thousands of
> calls to .META, but potentially thousands of _Tasks_ do the same.
> createRecordReader should get a QueryPlan without having to read all RegionLocations,
> either by using its internal knowledge of its split key range, or by serializing the query
> plan from the client and sending it to the mapper tasks for use there.
> Note that MapReduce tasks over snapshots are not affected by this, because region locations
> are stored in the snapshot manifest.
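
As a rough illustration of the first option in the description (a task using its own split key range), here is a minimal sketch built on a toy in-memory model. It does not use Phoenix or HBase APIs: the class SplitScopedLookup, the META map, and both methods are hypothetical stand-ins, and the region names and keys are made up for the example. It only shows that a task which knows its split boundaries needs the regions overlapping that range, not every region of the table.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeMap;

public class SplitScopedLookup {
    // Hypothetical stand-in for the meta table: region start key -> region name, sorted by key.
    static final TreeMap<String, String> META = new TreeMap<>();
    static {
        META.put("", "region-0");   // the first region starts at the empty key
        META.put("g", "region-1");
        META.put("n", "region-2");
        META.put("t", "region-3");
    }

    // Models the behaviour described in the issue: every task reads every region.
    static List<String> allRegions() {
        return new ArrayList<>(META.values());
    }

    // Returns only the regions overlapping [startKey, endKey).
    // An empty endKey means "to the end of the table".
    static List<String> regionsForSplit(String startKey, String endKey) {
        String first = META.floorKey(startKey); // start key of the region containing startKey
        Collection<String> overlapping = endKey.isEmpty()
                ? META.tailMap(first, true).values()
                : META.subMap(first, true, endKey, false).values();
        return new ArrayList<>(overlapping);
    }

    public static void main(String[] args) {
        System.out.println("all regions:   " + allRegions());
        System.out.println("split [g, n):  " + regionsForSplit("g", "n"));
        System.out.println("split [h, p):  " + regionsForSplit("h", "p"));
    }
}
```

In this model a split aligned to one region resolves to a single entry, while a split that straddles a boundary resolves to the two regions it touches. Whether the real createRecordReader path can be restricted this way depends on how the QueryPlan is built, which this sketch does not attempt to reproduce.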



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
