phoenix-dev mailing list archives

From "James Taylor (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-3744) Support snapshot scanners for MR-based queries
Date Fri, 24 Mar 2017 19:01:41 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940951#comment-15940951 ]

James Taylor commented on PHOENIX-3744:
---------------------------------------

Here's an idea on how this can be implemented:
- At the beginning of PhoenixInputFormat.getQueryPlan(), take a snapshot so that we get the
now-unchanging region boundaries
- Later in PhoenixInputFormat.getQueryPlan(), when we call statement.optimizeQuery(), provide
an overloaded version that passes through an interface from which we can get the region boundaries.
Have two implementations of this interface: one that does what we do today in BaseResultIterators.getParallelScans():
{code}
        List<HRegionLocation> regionLocations = context.getConnection().getQueryServices()
                .getAllTableRegions(physicalTableName);
{code}
The other implementation would use the snapshot to get the region boundaries instead. This
prevents a race condition in which a split occurs after we've already fetched the region
boundaries but before the scans run (or in which the region boundaries are stale because
we get them from the cache on the HConnection). You'd use a new job configuration parameter
to determine which implementation to use, based on whether or not a snapshot read is being
done.
- As a side note, we might want to leverage the ParallelScanGrouper interface that's already
in place to get the region boundaries as it'll be somewhat tricky to thread a new interface
to the BaseResultIterators class and we already do this with an alternate ParallelScanGrouper
implementation for the MR jobs.
- In PhoenixRecordReader.initialize(), when doing a snapshot read, instead of instantiating
a TableResultIterator (which is the thing that does an htable.getScanner()), instantiate a
new TableSnapshotResultIterator which uses the snapshot scanner instead. The ResultIterator
interface is very simple - you just need to implement two methods (and the explain method
can be a noop):
{code}
public interface ResultIterator extends SQLCloseable {
    /**
     * Grab the next row's worth of values. The iterator will return a Tuple.
     * @return Tuple object if there is another row, null if the scanner is
     * exhausted.
     * @throws SQLException e
     */
    public Tuple next() throws SQLException;
    
    public void explain(List<String> planSteps);
}
{code}
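
To make the two pieces above a bit more concrete, here's a rough sketch of how they might fit together. Everything in it is an assumption made for illustration: RegionBoundarySource, LiveRegionBoundarySource, SnapshotRegionBoundarySource, SnapshotReadConfig and the "example.mapreduce.snapshot.name" key are hypothetical names, not existing Phoenix APIs. Tuple and ResultIterator are redeclared as minimal stand-ins so the block compiles without Phoenix or HBase, a plain Map stands in for the job configuration, and an injected Iterator stands in for the snapshot scanner.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Stand-ins for the Phoenix types so this sketch compiles without Phoenix or HBase.
interface Tuple {}

interface ResultIterator {
    Tuple next() throws java.sql.SQLException;
    void explain(List<String> planSteps);
    void close() throws java.sql.SQLException; // comes from SQLCloseable in Phoenix
}

// Hypothetical interface: the source of region start keys used to build the parallel scans.
interface RegionBoundarySource {
    List<byte[]> getRegionStartKeys(String physicalTableName);
}

// Wraps today's lookup. The lookup is injected as a function, so the result can
// change between calls if a split happens.
final class LiveRegionBoundarySource implements RegionBoundarySource {
    private final Function<String, List<byte[]>> lookup;

    LiveRegionBoundarySource(Function<String, List<byte[]>> lookup) {
        this.lookup = lookup;
    }

    @Override
    public List<byte[]> getRegionStartKeys(String physicalTableName) {
        return lookup.apply(physicalTableName);
    }
}

// Returns the boundaries captured when the snapshot was taken; later splits are not visible.
final class SnapshotRegionBoundarySource implements RegionBoundarySource {
    private final List<byte[]> frozenStartKeys;

    SnapshotRegionBoundarySource(List<byte[]> startKeysAtSnapshotTime) {
        // Defensive copy so later changes to the caller's list do not leak in.
        this.frozenStartKeys =
                Collections.unmodifiableList(new ArrayList<>(startKeysAtSnapshotTime));
    }

    @Override
    public List<byte[]> getRegionStartKeys(String physicalTableName) {
        return frozenStartKeys;
    }
}

final class SnapshotReadConfig {
    // Hypothetical job configuration key, not an existing Phoenix property.
    static final String SNAPSHOT_NAME_KEY = "example.mapreduce.snapshot.name";

    // A plain Map stands in for the Hadoop job Configuration.
    static boolean isSnapshotRead(Map<String, String> jobConf) {
        String name = jobConf.get(SNAPSHOT_NAME_KEY);
        return name != null && !name.isEmpty();
    }

    static RegionBoundarySource choose(Map<String, String> jobConf,
            RegionBoundarySource live, RegionBoundarySource snapshot) {
        return isSnapshotRead(jobConf) ? snapshot : live;
    }
}

// Sketch of the snapshot-backed iterator. Rows come from an injected Iterator that
// stands in for the HBase snapshot scanner.
final class TableSnapshotResultIterator implements ResultIterator {
    private final Iterator<Tuple> rows;

    TableSnapshotResultIterator(Iterator<Tuple> rows) {
        this.rows = rows;
    }

    @Override
    public Tuple next() {
        return rows.hasNext() ? rows.next() : null; // null signals exhaustion
    }

    @Override
    public void explain(List<String> planSteps) {
        // no-op
    }

    @Override
    public void close() {
        // a real implementation would close the snapshot scanner here
    }
}
```

In this sketch, PhoenixInputFormat.getQueryPlan() would call something like SnapshotReadConfig.choose(...) once and hand the result down, and PhoenixRecordReader.initialize() would check the same key to decide between TableResultIterator and the snapshot-backed iterator. Whether the boundary lookup ends up behind its own interface or behind ParallelScanGrouper is left open here.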

FYI, [~akshita.malhotra], [~churromorales], [~samarthjain]

> Support snapshot scanners for MR-based queries
> ----------------------------------------------
>
>                 Key: PHOENIX-3744
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3744
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>            Assignee: Akshita Malhotra
>
> HBase supports scanning over snapshots, with a SnapshotScanner that accesses the region
> directly in HDFS. We should make sure that Phoenix can support that.
> Not sure how we'd want to decide when to run a query over a snapshot. Some ideas:
> - if there's an SCN set (i.e. the query is running at a point in time in the past)
> - if the memstore is empty
> - if the query is being run at a timestamp earlier than any memstore data
> - as a config option on the table
> - as a query hint
> - based on some kind of optimizer rule (i.e. based on estimated # of bytes that will
> be scanned)
> Phoenix typically runs a query at the timestamp at which it was compiled. Any data committed
> after this time should not be seen while a query is running.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
