hbase-user mailing list archives

From Michel Segel <michael_se...@hotmail.com>
Subject Re: EndPoint Coprocessor could be dealocked?
Date Tue, 15 May 2012 12:23:14 GMT
Sorry for the delay... Had a full day yesterday...

In a nutshell... tough nut to crack. I can give you a solution which you can probably enhance...

To start, ignore coprocessors for now...

So what you end up doing is the following.

General solution... N indexes..
Create a temp table in HBase. (1 column foo)

Assume that you have a simple K,V index, so you just need to do a simple get() against the
index to get the list of rows...

For each index, fetch the rows.
For each row, write the rowid to the temp table and increment a counter in column foo.

Then scan the table where foo's counter >= N. Note that it should == N, but just in case...

Now you have found the rows that appear in all N indexes.

Having said that... 
Again assuming your indexes are a simple K,V pair where V is a set of row ids...

Create a hash map of <rowid, count>
For each index:
    Get() the row based on the key
    For each rowid in the row:
        If map.get(rowid) is null then add (rowid, 1)
        Else increment the count for that rowid
For each (rowid, count) in the map:
    If count == number of indexes N
    Then add rowid to the result set.

Now just return the rows whose rowid is in the result set.
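
The hash-map counting steps above could be sketched in plain Java roughly as follows. This is only an illustration, not code from the thread: the class name IndexIntersection and the method intersect are hypothetical, and each index lookup is assumed to have already been fetched into a Set of rowid strings (so a rowid appears at most once per index).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: counts how many indexes returned each rowid and
// keeps only the rowids that were returned by every index.
final class IndexIntersection {

    // indexResults: one Set of rowids per index (assumed already fetched,
    // e.g. by a get() against each index table).
    static Set<String> intersect(List<? extends Set<String>> indexResults) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> rowIds : indexResults) {
            for (String rowId : rowIds) {
                // Add (rowid, 1) if absent, otherwise increment the count.
                counts.merge(rowId, 1, Integer::sum);
            }
        }

        int n = indexResults.size();
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // A rowid qualifies only if all N indexes contained it.
            if (entry.getValue() == n) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Note that the map holds one entry per distinct rowid across all indexes, which is where the memory concern comes from.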

That you can do in a coprocessor, but you may run into a memory issue, depending on the number of rowids in your index.

Does that help?

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 14, 2012, at 8:20 AM, fding hbase <fding.hbase@gmail.com> wrote:

> Hi Michel,
> I indexed each column within a column family of a table, so we can query a
> row with specific column value.
> By multi-index I mean using multiple indexes at the same time on a single
> query. That looks like a SQL select
> with two *where* clauses of two indexed columns.
> The row key of index table is made up of column value and row key of
> indexed table. For set intersection
> I used the utility class from the Apache Commons Collections package,
> CollectionUtils.intersection(). There's no
> assumption on sort order on indices. A scan with column value as startKey
> and column value+1 as endKey
> applied to index table will return all rows in indexed table with that
> column value.
> For multi-index queries, previously I tried to use a scan for each index
> column and intersect those
> result sets to get the rows that I want. But the query time is too long. So
> I decided to move the computation of
> intersection to server side and reduce the amount of data transferred.
> Do you have any better idea?
> On Mon, May 14, 2012 at 8:17 PM, Michel Segel <michael_segel@hotmail.com> wrote:
>> Need a little clarification...
>> You said that you need to do multi-index queries.
>> Did you mean to say multiple people running queries at the same time, or
>> did you mean you wanted to do multi-key indexes where the key is a
>> multi-key part.
>> Or did you mean that you really wanted to use multiple indexes at the same
>> time on a single query?
>> If it's the latter, not really a good idea...
>> How do you handle the intersection of the two sets? (3 sets or more?)
>> Can you assume that the indexes are in sort order?
>> What happens when the results from the indexes exceed the amount of
>> allocated memory?
>> What I am suggesting you do is to set aside the underpinnings of HBase
>> and look at the problem you are trying to solve in general terms.  Not an
>> easy one...
>> Sent from a remote device. Please excuse any typos...
>> Mike Segel
>> On May 14, 2012, at 4:35 AM, fding hbase <fding.hbase@gmail.com> wrote:
>>> Hi all,
>>> Is it possible to use table scanner (different from the host table region)
>>> or execute coprocessor of another table, in the endpoint coprocessor?
>>> It looks like chaining coprocessors. But I found a possible deadlock!
>>> Can anyone help me with this?
>>> In my testing environment I deployed the 0.92.0 version from CDH.
>>> I wrote an Endpoint coprocessor to do composite secondary index queries.
>>> The index is stored in another table and the index update is maintained
>>> by the client through an extended HTable. While a single index query
>>> works fine through Scanners of index table, soon after we realized
>>> we need to do multi-index queries at the same time.
>>> At first we tried to pull all row keys queried from a single index table
>>> and do the merge (just set intersection) on the client,
>>> but that overruns the network bandwidth. So I proposed to try
>>> the endpoint coprocessor. The idea is to use coprocessors, one
>>> in master table (the indexed table) and the other for each index table
>>> regions.
>>> Each master table region coprocessor instance invokes the index table
>>> coprocessor instances with its regioninfo (the startKey and endKey) and the scan,
>>> the index table region coprocessor instance scans and returns the row keys
>>> within the range of startKey and endKey passed in.
>>> The cluster sometimes blocks when invoking the index table coprocessor. I traced
>>> into the code and found that when HConnection locates regions it will rpc
>>> to the same regionserver.
>>> (After a while I found the index table coprocessor is equivalent to
>>> just a plain scan with filter, so I switched to scanners with filter, but
>>> the problem remains.)
> -- 
> Best Regards!
> Fei Ding
> fding.church@gmail.com
