Subject: Re: HBase random access in HDFS and block indices
From: Sean Bigdatafun <sean.bigdatafun@gmail.com>
To: user@hbase.apache.org
Date: Fri, 29 Oct 2010 06:41:50 -0700

I have the same doubt here. Let's say I have a totally random read pattern
(uniformly distributed). Now let's assume my total data size stored in HBase
is 100 TB on 10 machines (not a big deal considering today's disks), and the
total size of my RSs' memory is 10 * 6 GB = 60 GB. That translates into a
60 / (100 * 1000) = 0.06% cache hit probability.

Under a random read pattern, each read is bound to experience the
"open -> read index -> ... -> read data block" sequence, which would be
expensive.

Any comment?
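
For what it's worth, here is that back-of-the-envelope calculation as a tiny
Java sketch; the 100 TB and 60 GB figures are just the assumptions above, and
uniformly random access is assumed.

// Back-of-the-envelope cache-hit estimate for uniformly random reads.
// The figures are the assumptions stated above, not measurements.
public class CacheHitEstimate {
    public static void main(String[] args) {
        double totalDataGB  = 100 * 1000;  // ~100 TB stored in HBase
        double totalCacheGB = 10 * 6;      // 10 region servers * 6 GB cache each
        // Under uniform random access, the chance a requested block is already
        // cached is roughly (total cache size) / (total data size).
        double hitProbability = totalCacheGB / totalDataGB;
        System.out.printf("expected cache hit rate: %.2f%%%n", hitProbability * 100);
        // prints: expected cache hit rate: 0.06%
    }
}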
On Mon, Oct 18, 2010 at 9:30 PM, Matt Corgan wrote:

> I was envisioning the HFiles being opened and closed more often, but it
> sounds like they're held open for long periods and that the indexes are
> permanently cached. Is it roughly correct to say that after opening an
> HFile and loading its checksum/metadata/index/etc., each random data
> block access only requires a single pread, where the pread has some
> threading and connection overhead but theoretically requires only one
> disk seek? I'm curious because I'm trying to do a lot of random reads,
> and given enough application parallelism, the disk seeks should become
> the bottleneck much sooner than the network and threading overhead.
>
> Thanks again,
> Matt
>
> On Tue, Oct 19, 2010 at 12:07 AM, Ryan Rawson wrote:
>
> > Hi,
> >
> > Since the file is write-once, with no random writes, putting the index
> > at the end is the only choice. The loading goes like this:
> > - read the fixed-size file trailer at the end of the file (file length
> >   minus trailer length)
> > - read the locations of the additional variable-length sections, e.g.
> >   the block index
> > - read those indexes, including the variable-length 'file-info' section
> >
> > So to give some background: by default an InputStream from DFSClient
> > has a single socket, and positioned reads are fairly fast. The DFS just
> > sees 'read bytes from pos X, length Y' commands on an open socket. That
> > is fast, but only one thread at a time can use this interface. So for
> > 'get' requests we use another interface called pread(), which takes a
> > position + length and returns data. This involves setting up a one-use
> > socket and tearing it down when we are done, so it is slower by 2-3 TCP
> > RTTs, thread instantiation and other misc overhead.
> >
> > Back to the HFile index: it is indeed stored in RAM, not the block
> > cache. Size is generally not an issue; it hasn't been yet. We ship with
> > a default block size of 64k, and I'd recommend sticking with that
> > unless you have evidence that your data set performs better under a
> > different size. Since the index grows linearly, at roughly 1/64k of the
> > bytes of data, it isn't a huge deal. Also, the indexes are spread
> > around the cluster, so you are pushing load to all machines.
> >
> > On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan wrote:
> > > Do you guys ever worry about how big an HFile's index will be? For
> > > example, if you have a 512mb HFile with an 8k block size, you will
> > > have 64,000 blocks. If each index entry is 50 bytes, then you have a
> > > 3.2mb index, which is way out of line with your intention of having a
> > > small block size. I believe that's read all at once, so it will be
> > > slow the first time... is the index cached somewhere (block cache?)
> > > so that index accesses are from memory?
> > >
> > > And somewhat related - since the index is stored at the end of the
> > > HFile, is an additional random access required to find its offset? If
> > > it was considered, why was that chosen over putting it in its own
> > > file that could be accessed directly?
> > >
> > > Thanks for all these explanations,
> > > Matt
> > >
> > > On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson wrote:
> > >
> > >> The primary problem is the namenode memory. It contains entries for
> > >> every file and block, so setting the HDFS block size small limits
> > >> your scalability.
> > >>
> > >> There is nothing inherently wrong with in-file random reads; it's
> > >> just that the HDFS client was written for a single reader to read
> > >> most of a file. Thus to achieve high performance you'd need to do
> > >> tricks, such as pipelining sockets and socket pool reuse. Right now
> > >> for random reads we open a new socket, read the data, then close it.
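
A minimal sketch of the two read paths described above, using the public
Hadoop FSDataInputStream API; the file path, offsets, and buffer size are
made up for illustration and error handling is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadModes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file path, purely for illustration.
        Path path = new Path("/hbase/mytable/region/family/somehfile");
        byte[] buf = new byte[64 * 1024];   // one 64k HBase block

        try (FSDataInputStream in = fs.open(path)) {
            // Streaming style: seek + read on the shared stream position.
            // Fast on an open socket, but only one thread at a time can
            // safely use it.
            in.seek(128L * 1024 * 1024);
            in.readFully(buf, 0, buf.length);

            // Positioned read (pread) style: takes an explicit offset and
            // does not touch the shared stream position, so it is safe for
            // concurrent 'get' handlers -- at the cost of the per-call
            // connection/threading overhead described above.
            in.readFully(512L * 1024 * 1024, buf, 0, buf.length);
        }
    }
}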
> > >> On Oct 18, 2010 8:22 PM, "William Kang" wrote:
> > >> > Hi JG and Ryan,
> > >> > Thanks for the excellent answers.
> > >> >
> > >> > So, I am going to push everything to the extremes without
> > >> > considering the memory first. In theory, if in HBase every cell
> > >> > size equals the HBase block size, then there would be no traversal
> > >> > within a block. And if every HBase block size equals the HDFS block
> > >> > size, there would be no in-file random access necessary. Would this
> > >> > provide the best performance?
> > >> >
> > >> > But the problem is that if the HBase block is too large, memory
> > >> > will run out, since HBase loads blocks into memory; and if the HDFS
> > >> > block is too small, the DN will run out of memory, since each HDFS
> > >> > file takes some memory. So it is a trade-off between memory and
> > >> > performance. Is that right?
> > >> >
> > >> > And would it make any difference to randomly read the same-size
> > >> > portion of a file from a small HDFS block versus from a large HDFS
> > >> > block?
> > >> >
> > >> > Thanks.
> > >> >
> > >> > William
> > >> >
> > >> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson wrote:
> > >> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang
> > >> >> <weliam.cloud@gmail.com> wrote:
> > >> >>> Hi,
> > >> >>> Recently I have spent some effort trying to understand the
> > >> >>> mechanisms of HBase in order to exploit possible performance
> > >> >>> tuning options. Many thanks to the folks who helped with my
> > >> >>> questions in this community; I have sent a report. But there are
> > >> >>> still a few questions left.
> > >> >>>
> > >> >>> 1. If an HFile block contains more than one key-value pair, will
> > >> >>> the block index in the HFile point out the offset of every
> > >> >>> key-value pair in that block? Or will the block index just point
> > >> >>> out the key ranges inside that block, so you have to traverse
> > >> >>> inside the block until you meet the key you are looking for?
> > >> >>
> > >> >> The block index contains the first key of every block. It
> > >> >> therefore defines, in an [a,b) manner, the range of each block.
> > >> >> Once a block has been selected to read from, it is read into
> > >> >> memory then iterated over until the key in question has been found
> > >> >> (or the closest match has been found).
> > >> >>
> > >> >>> 2. When HBase reads a block to fetch data or traverse within it,
> > >> >>> is the block read into memory?
> > >> >>
> > >> >> Yes, the entire block is read in a single read operation.
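
As a rough illustration of the [a,b) block index lookup described above
(simplified, not HBase's actual reader classes): the index keeps only each
block's first key, so finding a block is a binary search over those first
keys, followed by a scan inside the loaded block.

import java.util.Arrays;

// Simplified sketch of an HFile-style block index: block i covers
// [firstKeys[i], firstKeys[i+1]). Illustrative only.
public class BlockIndexSketch {
    private final String[] firstKeys;   // first key of each block, sorted
    private final long[] blockOffsets;  // file offset of each block

    public BlockIndexSketch(String[] firstKeys, long[] blockOffsets) {
        this.firstKeys = firstKeys;
        this.blockOffsets = blockOffsets;
    }

    // Returns the index of the block whose range may contain the key,
    // or -1 if the key sorts before the first block's first key.
    public int findBlock(String key) {
        int pos = Arrays.binarySearch(firstKeys, key);
        if (pos >= 0) {
            return pos;               // key is exactly a block's first key
        }
        int insertion = -pos - 1;     // first block whose firstKey > key
        return insertion - 1;         // block whose [a,b) range covers the key
    }

    public long offsetOf(int blockIndex) {
        return blockOffsets[blockIndex];
    }

    public static void main(String[] args) {
        BlockIndexSketch idx = new BlockIndexSketch(
                new String[] {"apple", "mango", "peach"},
                new long[] {0L, 65536L, 131072L});
        // "banana" falls in ["apple", "mango"), i.e. block 0; that whole 64k
        // block would then be read into memory and scanned for the key.
        System.out.println(idx.findBlock("banana"));   // prints 0
    }
}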
> > >> >>
> > >> >>> 3. HBase blocks (64k, configurable) sit inside HDFS blocks (64m,
> > >> >>> configurable), so to read the HBase blocks we have to randomly
> > >> >>> access the HDFS blocks. Even though HBase can use
> > >> >>> in.read(p, buf, 0, x) to read a small portion of a larger HDFS
> > >> >>> block, it is still a random access. Would this be slow?
> > >> >>
> > >> >> Random access reads are not necessarily slow; they require several
> > >> >> things:
> > >> >> - disk seeks to the data in question
> > >> >> - disk seeks to the checksum files in question
> > >> >> - checksum computation and verification
> > >> >>
> > >> >> While not particularly slow, this could probably be optimized a
> > >> >> bit.
> > >> >>
> > >> >> Most of the issues with random reads in HDFS are about
> > >> >> parallelizing the reads and doing as much IO pushdown/scheduling
> > >> >> as possible without consuming an excess of sockets and threads.
> > >> >> The actual speed can be excellent, or not, depending on how busy
> > >> >> the IO subsystem is.
> > >> >>
> > >> >>> Many thanks. I would be grateful for your answers.
> > >> >>>
> > >> >>> William
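
To make the parallelism point concrete, a minimal sketch that issues many
positioned reads from a bounded thread pool; the path, offsets, and pool size
are arbitrary, and this only illustrates the shape of the approach, not
HBase's internals.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPreads {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/hbase/mytable/region/family/somehfile"); // made up
        int blockSize = 64 * 1024;

        // Bound the pool so we do not burn an unbounded number of sockets and
        // threads; each positioned read below is independent, so the disks
        // see many outstanding seeks at once.
        ExecutorService pool = Executors.newFixedThreadPool(16);
        List<Future<byte[]>> results = new ArrayList<>();

        try (FSDataInputStream in = fs.open(path)) {
            for (int i = 0; i < 1000; i++) {
                final long offset = (long) i * 1024 * 1024;   // arbitrary offsets
                results.add(pool.submit(() -> {
                    byte[] buf = new byte[blockSize];
                    // Positioned read: no shared stream position, so it is
                    // safe to call from many threads concurrently.
                    in.readFully(offset, buf, 0, buf.length);
                    return buf;
                }));
            }
            for (Future<byte[]> f : results) {
                f.get();   // wait for all reads and surface any errors
            }
        } finally {
            pool.shutdown();
        }
    }
}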