hbase-user mailing list archives

From Lin Ma <lin...@gmail.com>
Subject Re: Using HBase serving to replace memcached
Date Tue, 21 Aug 2012 16:30:38 GMT
Thanks Zahoor,

> If there is no bloom... you have to load every block and scan to find if
> the row exists..

I could be wrong, but I think the HFile index block (located at the end of
the HFile) is a binary search tree containing all row-key values of the
HFile. Searching that tree for a specific row-key should tell us right away
whether the row-key exists (some node in the tree has the same row-key
value) or not. Why do we need to load every block to find out whether the
row exists?
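
To make the question concrete, here is a minimal sketch of the lookup as I
picture it. It assumes, as stated above and possibly wrongly, that the index
holds every row-key in sorted order. `row_exists` and the sample keys are
made up for illustration and are not HBase APIs.

```python
from bisect import bisect_left


def row_exists(sorted_keys, row_key):
    """Binary-search an in-memory sorted key list; no data block is read."""
    i = bisect_left(sorted_keys, row_key)
    return i < len(sorted_keys) and sorted_keys[i] == row_key


# Hypothetical index contents (assumption: one entry per row-key).
index_keys = ["row-001", "row-007", "row-042"]

print(row_exists(index_keys, "row-007"))  # True
print(row_exists(index_keys, "row-008"))  # False
```

If the index only held some of the keys, this search could narrow down where
a row would be but could not by itself confirm that the row exists.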


On Tue, Aug 21, 2012 at 11:56 PM, jmozah <jmozah@gmail.com> wrote:

> >
> >
> > 1. After reading the materials you sent to me, I am confused about how
> > a Bloom Filter could save I/O during random reads. Supposing I am not
> > using a Bloom Filter, in order to find whether a row (or row-key)
> > exists, we need to scan the index block at the end of an HFile. The
> > scan is in memory (I think the index block is always in memory, please
> > feel free to correct me if I am wrong) using binary search -- it should
> > be pretty fast. With a Bloom Filter, we could be a bit faster by looking
> > up the Bloom Filter bit vector in memory. Since both the index block
> > binary search and the Bloom Filter bit vector search are done in memory
> > (no I/O is involved), what kind of I/O is saved? :-)
> >
> If bloom says the Row *may* be present.. the block is loaded otherwise
> not...
> If there is no bloom... you have to load every block and scan to find if
> the row exists..
> This may incur more IO
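
A toy sketch of the point above, counting how many blocks get loaded with and
without a Bloom filter. `ToyBloom`, `ToyBlock`, and `lookup` are hypothetical
names, not HBase classes. Giving each block its own filter is an assumption
made only to keep the example small; it is not a claim about how HBase lays
out its filters.

```python
import hashlib


class ToyBloom:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit vector

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToyBlock:
    """Stands in for a data block on disk, with an in-memory filter."""

    def __init__(self, rows):
        self.rows = set(rows)
        self.bloom = ToyBloom()
        for row in rows:
            self.bloom.add(row)


def lookup(blocks, row_key, use_bloom):
    """Return (found, number_of_blocks_loaded)."""
    loads = 0
    for block in blocks:
        if use_bloom and not block.bloom.may_contain(row_key):
            continue  # filter says "definitely absent": skip the load
        loads += 1  # simulated disk read of the block
        if row_key in block.rows:
            return True, loads
    return False, loads


blocks = [ToyBlock(["a", "b"]), ToyBlock(["c", "d"]), ToyBlock(["e", "f"])]
print(lookup(blocks, "zzz", use_bloom=False))  # every block is loaded
print(lookup(blocks, "zzz", use_bloom=True))   # loads only on false positives
```

The filter check itself is an in-memory operation; the saving comes from the
block loads that are skipped when the filter reports the row as absent.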
> > 2.
> >
> > > One Hadoop job doing random reads is perfectly fine, but since you
> > > said "Handling directly user traffic"... I assumed you wanted to
> > > expose HBase independently to every client request, thereby having as
> > > many connections as the number of simultaneous requests.
> >
> > Sorry, I need to confirm this point again. I think you mean that
> > establishing a new connection for each request is not good, and that
> > using a connection pool or asynchronous I/O is preferred?
> >
> Yes.
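
A minimal sketch of the pooling idea confirmed above, assuming a fixed number
of long-lived connections shared across requests. `FakeConnection` and
`ConnectionPool` are hypothetical stand-ins, not an HBase client API; a real
client would open a network connection where `FakeConnection` only bumps a
counter.

```python
import queue
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical connection; counts how many were ever opened."""

    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

    def get(self, row_key):
        return f"value-for-{row_key}"


class ConnectionPool:
    """Hands out a fixed set of connections instead of opening one per request."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(FakeConnection())

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks while every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return the connection for reuse


pool = ConnectionPool(size=2)
for i in range(100):  # 100 simulated requests
    with pool.connection() as conn:
        conn.get(f"row-{i}")

print(FakeConnection.opened)  # 2
```

The number of open connections stays at the pool size no matter how many
requests arrive, which is the property the exchange above is after.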
