phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rushabh Shah (Jira)" <j...@apache.org>
Subject [jira] [Created] (PHOENIX-5774) Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.
Date Thu, 12 Mar 2020 22:37:00 GMT
Rushabh Shah created PHOENIX-5774:
-------------------------------------

             Summary: Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.
                 Key: PHOENIX-5774
                 URL: https://issues.apache.org/jira/browse/PHOENIX-5774
             Project: Phoenix
          Issue Type: Bug
    Affects Versions: 4.13.1
            Reporter: Rushabh Shah


Internally we have tenant estimation framework which calculates the number of rows each tenant
occupy in the cluster. Basically what the framework does is it launch MapReduce(MR) job per
table and run the following query : "Select tenant_id from <table-name>" and we do count
over this tenant_id in reducer phase.
Earlier we use to run this query against live table but we found meta table was getting hammered
over the time this job was running so we thought to run the MR job on hbase snapshots instead
of live table. Take advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744

When we were querying live table, the MR job for one of the biggest table in sandbox cluster
took around 2.5 hours.
After we started using hbase snapshots, the MR job for the same table took 135 hours.  We
have maximum concurrent running mapper limit to 15 to avoid hammering meta table when we were
querying live tables. We didn't remove that restriction after we moved to hbase snapshots.So
ideally it shouldn't take 135 hours to complete if we don't have that restriction.

Some statistics about that table:
Size: 578 GB, Num Regions in that table: 161

The average map time took 3 mins 11 seconds when querying live table.
The average map time took 5 hours 33 minutes when querying hbase snapshots.

The issue is we don't consider snapshots while generating splits. So during map phase, each
map task has to go through all regions in snapshots to determine which region has the start
and end key assigned to that task. After determining all regions, it has to open each region
to scan all hfiles in that region. In one such map task, the start and end key from split
was distributed among 289 regions. Reading from each region took an average of 90 seconds,
so for 289 regions it took approximately 7 hours.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message