hbase-user mailing list archives

From Jonathan Gray <jl...@streamy.com>
Subject Re: Many 2 one in a row - modeling options
Date Wed, 19 Aug 2009 17:20:25 GMT
Tim,

Very cool wiki page.  Unfortunately I'm a little confused about exactly 
what the requirements are.

Does each species (and the combination of all of its identifications) 
actually have a single, unique ID?

The most important thing when designing your HBase schema is to 
understand how you want to query it.  And I'm not exactly sure I follow 
that part.

I'm going to assume that there is a single, relatively static set of 
attributes for each unique ID (the GUID, Cat#, etc).  Let's put that in 
a family, call it "attributes".  You would use that family as a 
key/value dictionary.  The qualifier would be the attribute name, and 
the value would be the attribute value (e.g. attributes:InstCode with 
value MNHA).

The row key, in this case, would be the GUID or whatever unique ID you 
want to look up by.
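
To make that layout concrete, here is a minimal sketch using plain Java 
collections rather than the HBase client API. The row key "guid-123", the 
"CatNum" qualifier, and the class and method names are all made up for 
illustration; only attributes:InstCode = MNHA comes from the example above.

```java
import java.util.Map;
import java.util.TreeMap;

public class AttributesRowSketch {
    // row key -> ("family:qualifier" -> value); sorted maps mimic HBase's ordering.
    static final Map<String, Map<String, String>> table = new TreeMap<>();

    // Store one cell under the given row, family and qualifier.
    static void put(String row, String family, String qualifier, String value) {
        table.computeIfAbsent(row, k -> new TreeMap<>())
             .put(family + ":" + qualifier, value);
    }

    // Look up one cell; returns null if the row or column is absent.
    static String get(String row, String family, String qualifier) {
        Map<String, String> columns = table.get(row);
        return columns == null ? null : columns.get(family + ":" + qualifier);
    }

    public static void main(String[] args) {
        // Hypothetical row key and catalogue number, for illustration only.
        put("guid-123", "attributes", "InstCode", "MNHA");
        put("guid-123", "attributes", "CatNum", "42");
        System.out.println(get("guid-123", "attributes", "InstCode"));
    }
}
```

The point is only that the family behaves like a dictionary keyed by 
attribute name, and everything for one ID lives in one row.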

Now the other part, storing the identifications.  I would definitely 
vote against multiple rows, multiple tables, and multiple families.  As 
you point out, multiple tables would require joining, multiple families 
do in fact mean 2 separate files, and multiple rows add a great deal 
of complexity (you now need to Scan and cannot rely on a single Get).

So let's say we have a family "identifications" (though you may want to 
shorten these family names as they are actually stored explicitly for 
every single cell... maybe "ids").  For each identification, you would 
have a single column.  The qualifier of that column would be whatever 
the unique identifier is for that identification, or if there isn't one, 
you could just wrap up the entire thing into a serialized type and use 
that as the qualifier.  If you do have an ID, then I would serialize the 
identification into the value.
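
A rough sketch of that idea, again in plain Java collections and not the 
HBase API: one row holds one column per identification under the "ids" 
family. The qualifiers ("1984-A", "2009-B"), the class and method names, 
and the use of a UTF-8 string as a stand-in for the serialized value are 
assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IdentificationColumnsSketch {
    static final String FAMILY_PREFIX = "ids:";

    // Add one identification as its own column: qualifier = its ID,
    // value = its serialized form (a UTF-8 string stands in here).
    static void addIdentification(Map<String, byte[]> row, String id, String serialized) {
        row.put(FAMILY_PREFIX + id, serialized.getBytes(StandardCharsets.UTF_8));
    }

    // Read every identification from the single row, in qualifier order.
    static List<String> identifications(Map<String, byte[]> row) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, byte[]> cell : row.entrySet()) {
            if (cell.getKey().startsWith(FAMILY_PREFIX)) {
                out.add(new String(cell.getValue(), StandardCharsets.UTF_8));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, byte[]> row = new TreeMap<>(); // sorted columns, like HBase
        addIdentification(row, "1984-A", "Felis concolor concolor");
        addIdentification(row, "2009-B", "Puma concolor");
        System.out.println(identifications(row));
    }
}
```

Because every identification sits in the same row, one row read returns 
all of them, however many there are.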

You point out that this would have poor scanning performance because of 
the need for deserialization, but I don't necessarily agree.  That can 
be quite fast, depending on implementation, and there's a great deal of 
serialization/deserialization being done behind the scenes to even get 
the data to you in the first place.

Something like protobufs has very efficient and fast 
serialize/deserialize operations.  Java serialization is inefficient in 
space and can be slow, which is why HBase and Hadoop implement the 
Writable interface and provide a minimal/efficient/binary serialization.
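
For a feel of what a compact binary encoding looks like, here is a small 
sketch using only java.io data streams, in the same write-then-read-back 
style that Writable uses. It is not the Hadoop Writable interface itself, 
and the three fields (scientific name, identifier, year) and the class 
name are hypothetical; the real identification has more attributes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class IdentificationCodecSketch {
    // Encode the fields in a fixed order: two length-prefixed strings, then an int.
    static byte[] serialize(String scientificName, String identifiedBy, int year) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(scientificName);
            out.writeUTF(identifiedBy);
            out.writeInt(year);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decode by reading the fields back in exactly the order they were written.
    static String[] deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String scientificName = in.readUTF();
            String identifiedBy = in.readUTF();
            int year = in.readInt();
            return new String[] { scientificName, identifiedBy, Integer.toString(year) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] value = serialize("Puma concolor", "Scientist B", 2009);
        String[] fields = deserialize(value);
        System.out.println(fields[0] + " / " + fields[1] + " / " + fields[2]);
    }
}
```

There are no field names or class metadata in the bytes, which is where 
most of the space saving over default Java serialization comes from.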

I do think that is by far the best approach here; the 
serialization/deserialization should be orders of magnitude faster than 
round-trip network latency.

I didn't realize at first what your first bullet meant; I thought you 
were talking about serializing the entire thing in one column.  Looking 
again, it seems you're on the right track, and that would be the simplest 
and fastest approach.

Keep us updated!

JG



tim robertson wrote:
> Hi all,
> 
> I have just started a project to research the migration of a
> biodiversity occurrence index (plant / animal specimens collected or
> observed) from mysql to HBase.
> 
> We have source records that inherently have a many 2 one.  Think of
> "Scientist A identified this as a Felis concolor concolor" but 25
> years later "Scientist B identified the same preserved specimen as a
> Puma concolor".  This scientific identification has more attributes
> and there will always be 1 or more (could be 10s of them) for the same
> specimen.
> 
> I am pondering how to model this in HBase seeing a few obvious options:
> - serializing the scientific identification "List" as bytes
> - expanding the record into 2 or more rows indicating the rows were
> derived from the same source
> - expand the identifications into new families
> - expand the identification fields into multiple fields in the same family
> - consider more than 1 table
> 
> All of the above have pros and cons with respect to client code
> complexity and performance.
> 
> I have put up a very simple example record on
> http://code.google.com/p/biodiversity/wiki/HBaseSchema and would
> welcome any comments on this list or on the wiki directly.
> 
> Please note that I have only just started the project so the
> documentation is really just starting up at this point, but this will
> be a case study of a migration from mysql which might be of interest
> to others.
> 
> Thanks,
> 
> Tim
> 
