hbase-dev mailing list archives

From stack <st...@duboce.net>
Subject hypertable
Date Fri, 15 Feb 2008 21:01:40 GMT
A couple of us (JimK, Chad, and myself) went down to see the Hypertable 
fellas, Doug Judd, Luke Lu and Gordon Rios.  The lads were gracious 
hosts; they bought us lunch and fed us good coffee.

What we learned:

+ They have an interface that each FileSystem implements.  It's basic: 
open, close, seek, read, write, flush, pread.  They underlined the 
presence of asynchronous read in the API.

+ To get to a filesystem implementation -- e.g. HDFS -- they go via a 
'broker'.  Broker is a server that implements the FileSystem interface.  
This extra-hop abstraction will allow them to go against stores other 
than HDFS.

+ They have their own file format rather than depend on FileSystem types 
such as SequenceFile as hbase does.  It's made of blocks (64k or 64M, I 
don't remember which).  At the end of the file is a block index.  Blocks 
are compressed.  Keys are 
row/single-byte-column-family/column/single-byte-type/timestamp (IIRC).  
The single-byte-column-family is used to look up the family in their 
chubby (called Hyperspace), where the database schema is stored (schema 
has the column family name, attributes, etc).  The single-byte-type 
indicates the cell type: insert, delete, column-family delete or row 
delete.

+ To read, they open, read a block, and then run the decompress, parse 
keys and values over in C++ land.  Talked up the fact that they can do 
read-ahead; i.e. prefetch the next block so it's to hand when the scanner 
crosses over into it.

+ To write, they just call append.   In the HDFS case, the broker just 
saves up the data and then writes it out when close is called.

+ They haven't played with random reads.  Currently, if a rangeserver 
goes down, the cluster is hosed (this is their highest priority at the 
moment and should be addressed soon).  We'll probably standardize on the 
bigtable Performance Evaluation though it intentionally frustrates 
compression -- it uses random values -- so their compression work won't 
have a chance to shine.
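
The key layout described in the file-format item above can be sketched 
as follows.  This is only an illustration under assumptions, not 
Hypertable's actual encoding: the NUL terminators, the 8-byte big-endian 
timestamp, the numeric type codes, and the names Key, encode_key, 
decode_key and SCHEMA are all made up here (SCHEMA stands in for the 
family-code lookup that Hyperspace provides).

```python
import struct
from typing import NamedTuple

# Hypothetical type codes; the real values were not given in the meeting notes.
INSERT, DELETE_CELL, DELETE_FAMILY, DELETE_ROW = 0, 1, 2, 3

# Hypothetical stand-in for the schema held in Hyperspace:
# a single-byte family code maps to the column family name.
SCHEMA = {1: "info", 2: "anchor"}


class Key(NamedTuple):
    row: bytes
    family_code: int   # single byte, 0..255
    column: bytes
    cell_type: int     # single byte, one of the codes above
    timestamp: int     # unsigned 64-bit


def encode_key(key: Key) -> bytes:
    """Serialize as row/family-code/column/type/timestamp.

    Assumes row and column contain no NUL bytes, since NUL is used
    here as the terminator for the two variable-length parts.
    """
    if b"\x00" in key.row or b"\x00" in key.column:
        raise ValueError("row and column must not contain NUL bytes")
    return (
        key.row + b"\x00"
        + bytes([key.family_code])
        + key.column + b"\x00"
        + bytes([key.cell_type])
        + struct.pack(">Q", key.timestamp)
    )


def decode_key(buf: bytes) -> Key:
    """Inverse of encode_key."""
    row_end = buf.index(b"\x00")
    family_code = buf[row_end + 1]
    column_start = row_end + 2
    column_end = buf.index(b"\x00", column_start)
    cell_type = buf[column_end + 1]
    (timestamp,) = struct.unpack(">Q", buf[column_end + 2:column_end + 10])
    return Key(buf[:row_end], family_code, buf[column_start:column_end],
               cell_type, timestamp)


def family_name(key: Key) -> str:
    """Resolve the single-byte family code against the (hypothetical) schema."""
    return SCHEMA[key.family_code]
```

With a layout like this a delete is just another key with a different 
type byte, e.g. Key(b"row1", 1, b"col", DELETE_CELL, ts), so no 'special' 
value is needed to mark it.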


Thoughts:

+ Their keying is better than hbase's.  We're missing the typing (we use 
'special' values to indicate cell delete).  Using codes to represent 
families we should also do (I've been thinking we need such a thing for 
both tables and columns every time I look at a meta scan in our master 
logs).  We should consider using the code in keys also.

+ At first I was thinking the read-ahead a nice idea but thinking on it 
more, methinks it won't buy us much.  IIRC, DFSClient blocks when you go 
off the end of one block while it closes socket to current datanode and 
puts up socket against the datanode that has the next block.  But hbase 
usually writes out files that are the HDFS 64M block size or less.  This 
means, usually, running a compaction of flushes, we shouldn't be doing 
reads over the top of socket reconnects.  Let's measure.  Regardless, we 
should fix this blocking, if this is indeed the case, either in 
DFSClient or at the application layer at Tom White's block caching level.

+ In their postings on hypertable -- on their website and in responses 
to the slashdotting of hypertable -- there is the implication that HT is 
a more 'true' implementation of bigtable paper.  One area in particular 
that comes up is hbase's lack of support for 'locality groups'.  No one 
has as yet asked for this "store of stores" feature.   I can see that if 
you've botched your schema design up front or your access pattern 
changes over the life of the application and you want to 'join' two 
column families, it'd be useful (no need to change how the client 
accesses the table).  We should probably add this facility, but seems 
low priority to me.

+ Their postings also talk up better compression options and of how they 
include this and that compression algorithm lib natively whereas java 
has to go across JNI chasms and even then, java takes 2 to 3 times the 
memory C++ does and even then, its accesses are slower, etc.   On 
compression, we've done little in hbase.  It's possible to enable it but 
we've not done any profiling using SequenceFile compression options, 
record vs. block, etc.  What with i/o always being orders of magnitude 
slower than any other accesses and what with CPUs getting faster and 
faster, there is a point at which using compression to get more data off 
the disk all in the one go becomes a win.  We should spend some time 
looking at our options here when we go about making HBaseMapFile.

+ If I were to sum up my impression of HT in a single word, it would be 
'performance'.  For hbase, performance is 
important but at the moment, our roadmap has 'robustness' and 
'scalability' as focus.  Should we be spending more time on performance 
issues?
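
On the compression points above, here is a rough sketch of why a 
benchmark that writes random values gives compression nothing to work 
with.  It uses Python's zlib as a stand-in codec; the 64k block size and 
the sample values are assumptions for illustration, not anything 
measured on hbase or Hypertable.

```python
import os
import zlib

BLOCK_SIZE = 64 * 1024  # assumed block size, for illustration only


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# Random bytes, like the values a random-value benchmark would write.
random_values = os.urandom(BLOCK_SIZE)

# Repetitive values, closer to what many real tables hold.
sample = b"http://example.com/page "
repetitive_values = sample * (BLOCK_SIZE // len(sample))

random_ratio = compression_ratio(random_values)
repetitive_ratio = compression_ratio(repetitive_values)
```

Random data should come out at roughly its original size while the 
repetitive block shrinks to a small fraction, which is the gap a 
random-value benchmark would hide.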

Comments?
St.Ack

