phoenix-dev mailing list archives

From "James Taylor (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PHOENIX-1609) MR job to populate index tables
Date Mon, 16 Feb 2015 21:35:12 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323330#comment-14323330 ]

James Taylor edited comment on PHOENIX-1609 at 2/16/15 9:34 PM:
----------------------------------------------------------------

Thanks for the patch, [~maghamravikiran@gmail.com]. Here's some feedback:
- I think we should aim to build this more directly on top of the MR support you already built,
in particular on the ability to run a SELECT query through PhoenixInputFormat. The main reason
is that with functional indexes (see http://phoenix.apache.org/secondary_indexing.html#Functional_Indexes),
arbitrary expressions may be used to define the index, which would fit in nicely with the mechanism
you've already built. Probably the approach that'll give you the most bang for the buck would
be to expand your MR integration first to support *writing* the results from the SELECT to
create an HFile (much like the CSV loader does).
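To make that concrete, here's roughly the wiring I have in mind. This is just a sketch: the driver,
mapper, and writable classes, the table names, and the query are hypothetical placeholders, and the
HFile/bulk-load half would follow whatever the CSV bulk loader does today:
{code}
    // Sketch only -- IndexImportTool, IndexImportMapper, IndexRecordWritable and the table
    // names below are hypothetical placeholders, not existing classes.
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "phoenix-index-build");
    job.setJarByClass(IndexImportTool.class);

    // Input side: reuse the existing PhoenixInputFormat integration to run the SELECT
    // (for a functional index this SELECT can contain arbitrary expressions).
    PhoenixMapReduceUtil.setInput(job, IndexRecordWritable.class, "MY_DATA_TABLE",
        "SELECT ID, UPPER(NAME) FROM MY_DATA_TABLE");

    // Output side: the new piece -- have the mapper emit KeyValues for the index table and
    // write them out as HFiles to bulk load, much like the CSV loader does.
    job.setMapperClass(IndexImportMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    HTable indexHTable = new HTable(conf, "MY_INDEX_TABLE");
    HFileOutputFormat2.configureIncrementalLoad(job, indexHTable);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/index_hfiles"));

    job.waitForCompletion(true);
{code}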
- Once you can write to a table through our MR support, take a look at the UPSERT SELECT statement
created by PostIndexDDLCompiler to populate an index. The SELECT part of this is what you'd
want to build as your select statement, while the UPSERT part defines the columns to which
you're writing. It's possible that the building of this statement could be exposed through
a shared utility (or that you could just use PostIndexDDLCompiler for this work too). If you
get the QueryPlan for this SELECT statement, you should, in theory, be able to run it through
your existing MR support (which gets you most of the way there).
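As a sketch of that last step (getSelectQuery() here is a hypothetical accessor we'd need to expose
on PostIndexDDLCompiler, or on whatever shared utility we factor out):
{code}
    // Sketch -- getSelectQuery() is a hypothetical accessor that hands back the SELECT half
    // of the UPSERT SELECT that PostIndexDDLCompiler generates.
    PhoenixConnection pconn = connection.unwrap(PhoenixConnection.class);
    PostIndexDDLCompiler compiler = new PostIndexDDLCompiler(pconn, dataTableRef);
    MutationPlan plan = compiler.compile(indexTable);  // builds the UPSERT SELECT for the index
    String selectQuery = compiler.getSelectQuery();    // hypothetical accessor

    // The QueryPlan for that SELECT is what your existing MR support would consume,
    // along the same lines as PhoenixInputFormat compiling its select statement.
    QueryPlan queryPlan = pconn.createStatement().unwrap(PhoenixStatement.class)
        .compileQuery(selectQuery);
{code}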
- I think we should strive to hide the MR job behind our existing CREATE INDEX statement.
I think you can decide in PostIndexDDLCompiler.compile() whether to run the index creation
through MR or through our existing mechanism, based on the table stats you can retrieve
from the data table. In fact, at that point you'll already have the SELECT statement and UPSERT
statement built, so it's just a matter of how they'll be run. Something like this:
{code}
    PTableStats stats = dataTableRef.getTable().getTableStats();
    Collection<GuidePostsInfo> guidePostsCollection = stats.getGuidePosts().values();
    long totalByteSize = 0;
    for (GuidePostsInfo info : guidePostsCollection) {
        totalByteSize += info.getByteCount();
    }
    // MAP_REDUCE_INDEX_BUILD_THRESHOLD_ATTRIB and its default would be new config properties
    long byteThreshold = connection.unwrap(PhoenixConnection.class).getQueryServices().getProps()
        .getLong(QueryServices.MAP_REDUCE_INDEX_BUILD_THRESHOLD_ATTRIB,
            QueryServicesOptions.DEFAULT_MAP_REDUCE_INDEX_BUILD_THRESHOLD);
    if (totalByteSize >= byteThreshold) {
        // Return a new MutationPlan whose execute() method kicks off the map/reduce job
    } else {
        // Return MutationPlan as it is created today
    }
{code}
- As far as setting the index state appropriately, you shouldn't need to do anything to initialize
it: the CREATE INDEX call already sets the index state to PIndexState.BUILDING in
createTableInternal. Then, on successful completion of your MR job, you'd set the index state
to PIndexState.ACTIVE. It's likely we'll want to move the code that currently does this in
MetaDataClient.buildIndex() to the end of each MutationPlan generated there (instead of assuming
that the index build always happens synchronously).
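In that world, the tail end of the MR-backed MutationPlan.execute() would look something like this
(sketch: updateIndexState() is a hypothetical stand-in for whatever state-transition code we lift
out of MetaDataClient.buildIndex()):
{code}
    // Sketch -- updateIndexState() is a hypothetical stand-in for the state transition
    // MetaDataClient.buildIndex() performs today once the build completes.
    boolean succeeded = job.waitForCompletion(true); // the MR index build
    if (succeeded) {
        // The index starts out in PIndexState.BUILDING (set by createTableInternal);
        // only flip it to ACTIVE once the MR population has finished successfully.
        updateIndexState(connection, dataTableName, indexTableName, PIndexState.ACTIVE);
    } else {
        // Leave the index in BUILDING (or mark it disabled) so queries won't use a
        // half-built index, and surface the failure to the caller.
        throw new SQLException("Index build MR job failed for " + indexTableName);
    }
{code}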
- Minor, but when validating that a data/index table exists, go through our metadata operations
using connection.getMetaData() and the corresponding JDBC DatabaseMetaData APIs, instead of
dipping down to our internal PTable APIs as you've done here:
{code}
+    private boolean isValidIndexTable(final Connection connection, final String masterTable,
+            final String indexTable) throws SQLException {
+        final PTable table = PhoenixRuntime.getTable(connection, masterTable);
+        for (PTable indxTable : table.getIndexes()) {
+            if (indxTable.getTableName().getString().equalsIgnoreCase(indexTable)) {
+                return true;
+            }
+        }
+        return false;
+    }
{code}
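i.e. something along these lines instead (sketch using the standard JDBC
DatabaseMetaData.getIndexInfo() call, with SchemaUtil splitting the full table name; exact
column handling may need tweaking):
{code}
    // Sketch of the JDBC metadata route instead of the internal PTable APIs.
    private boolean isValidIndexTable(final Connection connection, final String masterTable,
            final String indexTable) throws SQLException {
        final DatabaseMetaData dbMetaData = connection.getMetaData();
        final String schemaName = SchemaUtil.getSchemaNameFromFullName(masterTable);
        final String tableName = SchemaUtil.getTableNameFromFullName(masterTable);
        final ResultSet rs = dbMetaData.getIndexInfo(null, schemaName, tableName, false, false);
        try {
            while (rs.next()) {
                final String indexName = rs.getString("INDEX_NAME");
                if (indexTable.equalsIgnoreCase(indexName)) {
                    return true;
                }
            }
        } finally {
            rs.close();
        }
        return false;
    }
{code}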


> MR job to populate index tables 
> --------------------------------
>
>                 Key: PHOENIX-1609
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1609
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: maghamravikiran
>            Assignee: maghamravikiran
>         Attachments: 0001-PHOENIX_1609.patch
>
>
> Often, we need to create new indexes on master tables long after data already exists in the
master tables. It would be good to have a simple MR job, provided by the Phoenix code, that
users can run to bring indexes in sync with the master table.
> Users can invoke the MR job using the following command:
> hadoop jar org.apache.phoenix.mapreduce.Index -st MASTER_TABLE -tt INDEX_TABLE -columns a,b,c
> Is this ideal?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
