hbase-user mailing list archives

From Wilm Schumacher <wilm.schumac...@gmail.com>
Subject Re: Status of Huawei's 2' Indexing?
Date Mon, 16 Mar 2015 18:37:14 GMT
Hi,

A cross-post from the dev list; perhaps more people here have valuable
hints or ideas.

Am 16.03.2015 um 18:46 schrieb Rose, Joseph:
> Alright, let’s see if I can get this discussion back on track.
>
> I have a sensibly defined table for patient data; its rowkey is simply
> lastname:firstname, since it’s convenient for the bulk of my lookups.
> Unfortunately I also need to efficiently find patients using an ID string,
> whose literal value is buried in a value field. I’m sure this situation is
> not foreign to the people on this list.
>
> It’s been suggested that I implement 2’ indexes myself — fine. All the
> research I’ve done seems to end with that suggestion, with the exception
> of Phoenix (I don’t want the RDBMS layer) and Huawei’s stuff (which seems
> to incite some discussion here). I’m happy to put this together but I’d
> rather go with something that has been vetted and has a larger developer
> community than one (i.e., ME). Besides, I have a full enough plate at the
> moment that I’d rather not have to do this, too.
>
> Are there constructive suggestions regarding how I can proceed with HBase?
> Right now even a well-vetted local index would be a godsend.

Well, first I have a question: is "lastname:firstname" a good idea for a
row key? Is a name that specific? I think your row key should be the
ID rather than the names, as it can be made unique (a UUID or whatever).
However, the problem still stands, as just the roles are switched: you
need an index for either the IDs or the names.

The following assumes the ID as row key and lastname:firstname as
index data.

I can imagine 3 solutions:

* First ... MacGyver your own index.

That's not as complicated as it sounds. A very easy approach would be to
update the index within the CRUD operations on your data. Along with a

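Put put = new Put( Bytes.toBytes( id ) );
// "d" stands in for whatever column family your table uses
put.add( Bytes.toBytes( "d" ), Bytes.toBytes( "firstname" ), Bytes.toBytes( firstname ) );
put.add( Bytes.toBytes( "d" ), Bytes.toBytes( "lastname" ), Bytes.toBytes( lastname ) );

make an additional

Put indexPut = new Put( Bytes.toBytes( lastname + ":" + firstname ) );
// the ID is the column qualifier; the value stays empty
indexPut.add( Bytes.toBytes( "d" ), Bytes.toBytes( id ), new byte[0] );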

...
<put to tables>

Deleting is practically the same: fetch the row by ID, read the
lastname/firstname combination and kick that ID out of the index.

This way you can just fetch the row "lastname:firstname" and get all
possible IDs as column qualifiers. And that's it ... almost. The
"risk" here is that your HBase table throws some error and the data is
written but the index is not updated. Thus you have to write a little
more code to catch the "ID not existing" errors.
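
To make the bookkeeping concrete, here is a rough sketch of the idea with
plain Java collections standing in for the two tables. It does not use the
HBase client API at all; the class NameIndexSketch and its method names are
made up for illustration, and a real version would issue Puts, Deletes and
Gets against two tables instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory stand-in for a data table and an index table (hypothetical names).
class NameIndexSketch {
    // "data table": row key = patient ID, value = {lastname, firstname}
    final TreeMap<String, String[]> data = new TreeMap<>();
    // "index table": row key = lastname:firstname, "qualifiers" = patient IDs
    final TreeMap<String, TreeSet<String>> index = new TreeMap<>();

    static String indexKey(String lastname, String firstname) {
        return lastname + ":" + firstname;
    }

    // Write the data row and the matching index entry together.
    void put(String id, String lastname, String firstname) {
        delete(id); // if the ID existed under another name, drop the old index entry
        data.put(id, new String[] { lastname, firstname });
        index.computeIfAbsent(indexKey(lastname, firstname), k -> new TreeSet<>()).add(id);
    }

    // Fetch the row by ID, read the name, and remove the ID from the index.
    void delete(String id) {
        String[] name = data.remove(id);
        if (name == null) {
            return;
        }
        String key = indexKey(name[0], name[1]);
        TreeSet<String> ids = index.get(key);
        if (ids != null) {
            ids.remove(id);
            if (ids.isEmpty()) {
                index.remove(key);
            }
        }
    }

    // Look up by name: read the index row, then skip IDs missing from the data table.
    List<String> findIds(String lastname, String firstname) {
        List<String> result = new ArrayList<>();
        TreeSet<String> ids = index.get(indexKey(lastname, firstname));
        if (ids == null) {
            return result;
        }
        for (String id : ids) {
            if (data.containsKey(id)) { // guards against a stale index entry
                result.add(id);
            }
        }
        return result;
    }
}
```

The containsKey check in findIds plays the role of catching the "ID not
existing" case: a stale index entry is simply skipped on read.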

Furthermore, you would have to run a small MapReduce job now and then
(perhaps every night or so) which runs through the rows and refreshes the
index, then runs through the index and kicks out entries whose ID is no
longer present, in case you missed something above.
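
The logic of such a repair run could look roughly like the following. This
is only a single-process sketch over in-memory maps, not a MapReduce job and
not HBase API code; IndexRepairSketch and repair are hypothetical names, and
the maps assume ID -> {lastname, firstname} for the data and
lastname:firstname -> set of IDs for the index.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Two-pass consistency check between a data map and an index map (hypothetical names).
class IndexRepairSketch {
    // Returns the number of fixes applied (entries added plus entries removed).
    static int repair(TreeMap<String, String[]> data,
                      TreeMap<String, TreeSet<String>> index) {
        int fixes = 0;

        // Pass 1: every data row must have its index entry.
        for (Map.Entry<String, String[]> row : data.entrySet()) {
            String key = row.getValue()[0] + ":" + row.getValue()[1];
            if (index.computeIfAbsent(key, k -> new TreeSet<>()).add(row.getKey())) {
                fixes++;
            }
        }

        // Pass 2: every index entry must point to a data row carrying that name.
        Iterator<Map.Entry<String, TreeSet<String>>> rows = index.entrySet().iterator();
        while (rows.hasNext()) {
            Map.Entry<String, TreeSet<String>> entry = rows.next();
            Iterator<String> ids = entry.getValue().iterator();
            while (ids.hasNext()) {
                String[] name = data.get(ids.next());
                boolean valid = name != null
                        && (name[0] + ":" + name[1]).equals(entry.getKey());
                if (!valid) {
                    ids.remove();
                    fixes++;
                }
            }
            if (entry.getValue().isEmpty()) {
                rows.remove(); // drop index rows that no longer hold any ID
            }
        }
        return fixes;
    }
}
```

Running it a second time on the same maps should report zero fixes, which
is a cheap sanity check for the job itself.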

If you are new to HBase this perhaps sounds a little complicated, but
actually it's simple. If you are interested I could send you some small
snippets directly.

* Second ... Lucene

Lucene is an index system out of the box. As I wrote some days ago: with
HBase comes all the fancy Apache/Hadoop stuff. With Lucene you can
implement a search method for your data, e.g. on ... the drugs the
patients have already had. A fancy feature for your application. And of
course you could search for "firstname=<firstname> AND lastname=<lastname>",
which would fit your need.

However, by this you introduce a new system which you have to maintain :/.

* Third ... other index systems, e.g. Elasticsearch

Like the second idea, but fancier and more complicated: more points
of failure etc.

If your application does not need a search method, I would go with 1. If
you have to build a search anyway, I would go with 2 or 3, as you can use
the search facility for your indexing problem right away.

Best wishes,

Wilm
