hbase-user mailing list archives

From Jonathan Gray <jg...@facebook.com>
Subject RE: HBase random access in HDFS and block indices
Date Tue, 19 Oct 2010 04:00:11 GMT
HFiles are generally 256MB and the default block size is 64KB, so that's about 4,000 blocks (1/16th of
what you said).  That gives a more reasonable block index of around 200KB.
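
A quick, runnable sketch of that arithmetic, using Matt's assumed 50 bytes per
index entry (an estimate, not a measured number):

    public class IndexSizeEstimate {
        public static void main(String[] args) {
            long hfileSize  = 256L * 1024 * 1024;  // HFile size from the example above
            long blockSize  = 64L * 1024;          // default HFile block size
            long entryBytes = 50L;                 // assumed bytes per index entry

            long blocks    = hfileSize / blockSize;  // 4096 blocks
            long indexSize = blocks * entryBytes;    // ~200KB, resident per open HFile

            System.out.println(blocks + " blocks, ~" + (indexSize / 1024) + "KB index");
        }
    }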

But the block index is kept in-memory, so you only read it once, when the file is first opened.
 So even if you do lower the block size and increase the block index, this would mostly slow down
the initial open but should not have a major impact on random access performance once cached.

The offset is found by reading the headers of the HFile.  Again, this only has to be done
on open.  Indexes used to be kept in separate files, but that doubled the number of open files
HBase might need, an issue we are always pushing against.
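
To make the open-time cost concrete, here is an illustrative sketch of the
access pattern.  The trailer size and field layout below are made up, not the
real HFile format; the point is just that locating and loading the index takes
only a couple of reads when the file is opened:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class OpenTimeIndexLoad {
        // Hypothetical fixed-size trailer: two longs (index offset + index length).
        static final int TRAILER_SIZE = 16;

        public static void main(String[] args) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile(args[0], "r")) {
                // Read 1: seek to the fixed-size trailer at the end of the file.
                f.seek(f.length() - TRAILER_SIZE);
                long indexOffset = f.readLong();
                long indexLength = f.readLong();

                // Read 2: seek to the index and pull the whole thing into memory.
                byte[] index = new byte[(int) indexLength];
                f.seek(indexOffset);
                f.readFully(index);

                // The index stays cached for as long as the file is open, so
                // later random reads pay no extra seeks to find their blocks.
            }
        }
    }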

I'm not sure seeking around a single HFile during the initial load of that file is an especially
big issue (when we're talking about just 2-3 areas to read from, not hundreds).

JG

> -----Original Message-----
> From: Matt Corgan [mailto:mcorgan@hotpads.com]
> Sent: Monday, October 18, 2010 8:53 PM
> To: user
> Subject: Re: HBase random access in HDFS and block indices
> 
> Do you guys ever worry about how big an HFile's index will be?  For
> example, if you have a 512MB HFile with an 8KB block size, you will have
> 64,000 blocks.  If each index entry is 50 bytes, then you have a 3.2MB
> index, which is way out of line with your intention of having a small
> block size.  I believe that's read all at once, so it will be slow the
> first time... is the index cached somewhere (block cache?) so that index
> accesses are served from memory?
> 
> And somewhat related - since the index is stored at the end of the HFile,
> is an additional random access required to find its offset?  If it was
> considered, why was that chosen over putting it in its own file that could
> be accessed directly?
> 
> Thanks for all these explanations,
> Matt
> 
> 
> On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <ryanobjc@gmail.com>
> wrote:
> 
> > The primary problem is the namenode memory.  It contains entries for
> > every file and block, so setting the HDFS block size small limits your
> > scalability.
> >
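A rough illustration of the scaling, assuming (purely for illustration) 10TB
of data and ~150 bytes of NameNode heap per block entry; the real per-block
cost depends on the Hadoop version and replication factor:

    public class NameNodeFootprint {
        public static void main(String[] args) {
            long dataBytes    = 10L * 1024 * 1024 * 1024 * 1024; // 10TB of HDFS data
            long perBlockCost = 150L;                            // assumed NameNode heap bytes/block

            for (long blockMB : new long[] {4, 64, 256}) {
                long blocks = dataBytes / (blockMB * 1024 * 1024);
                System.out.printf("%3dMB blocks -> %,d block entries, ~%,dKB of NameNode heap%n",
                        blockMB, blocks, blocks * perBlockCost / 1024);
            }
        }
    }
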
> > There is nothing inherently wrong with in-file random reads; it's just
> > that the HDFS client was written for a single reader reading most of a
> > file.  Thus, to achieve high performance you'd need to do tricks such as
> > pipelining sockets and socket pool reuse.  Right now, for random reads we
> > open a new socket, read the data, then close it.
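
The call in question is the positioned read ("pread") that the HDFS client
exposes on FSDataInputStream.  A minimal sketch, with a made-up file path and
offset:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PreadSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Pull one HBase-sized block out of a much larger HDFS block without
            // moving the stream's shared seek position.
            FSDataInputStream in = fs.open(new Path("/hbase/mytable/region/cf/somehfile"));
            try {
                long blockOffset = 1234567L;        // offset of the HFile block, per the block index
                byte[] block = new byte[64 * 1024]; // one 64KB HFile block
                in.readFully(blockOffset, block);   // each such read currently costs a
                                                    // connect/read/close round trip
            } finally {
                in.close();
            }
        }
    }
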
> > On Oct 18, 2010 8:22 PM, "William Kang" <weliam.cloud@gmail.com>
> > wrote:
> > > Hi JG and Ryan,
> > > Thanks for the excellent answers.
> > >
> > > So, I am going to push everything to the extremes without considering
> > > memory first.  In theory, if in HBase every cell's size equals the
> > > HBase block size, then there would not be any in-block traversal.  And
> > > if every HBase block size equals the HDFS block size, there would not
> > > be any in-file random access necessary.  Would this provide the best
> > > performance?
> > >
> > > But the problem is that if the block in HBase is too large, memory
> > > will run out, since HBase loads blocks into memory; and if the block
> > > in HDFS is too small, the NameNode will run out of memory, since each
> > > HDFS file and block takes some memory.  So it is a trade-off between
> > > memory and performance.  Is that right?
> > >
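For reference, the two block sizes being traded off live in different places.
A sketch using the APIs of that era (the per-family block size on
HColumnDescriptor, and the dfs.block.size property for HDFS); names differ in
later releases:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;

    public class BlockSizeKnobs {
        public static void main(String[] args) {
            // HBase block size: a per-column-family setting, 64KB by default.
            // Smaller blocks mean finer-grained reads but a bigger in-memory index.
            HTableDescriptor table = new HTableDescriptor("mytable");
            HColumnDescriptor family = new HColumnDescriptor("cf");
            family.setBlocksize(8 * 1024);
            table.addFamily(family);

            // HDFS block size: a file-system-level setting, 64MB by default.
            // Smaller values multiply the NameNode's per-block bookkeeping.
            Configuration conf = new Configuration();
            conf.setLong("dfs.block.size", 64L * 1024 * 1024);
        }
    }
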
> > > And would it make any difference whether you randomly read the same
> > > size portion of a file from a small HDFS block or from a large HDFS
> > > block?
> > >
> > > Thanks.
> > >
> > >
> > > William
> > >
> > > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ryanobjc@gmail.com>
> > > wrote:
> > >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <weliam.cloud@gmail.com>
> > >> wrote:
> > >>> Hi,
> > >>> Recently I have spent some effort trying to understand the
> > >>> mechanisms of HBase in order to exploit possible performance tuning
> > >>> options.  Many thanks to the folks who helped with my questions in
> > >>> this community; I have sent a report.  But there are still a few
> > >>> questions left.
> > >>>
> > >>> 1. If an HFile block contains more than one key-value pair, will the
> > >>> block index in the HFile point out the offset for every key-value
> > >>> pair in that block?  Or will the block index just point out the key
> > >>> ranges inside that block, so that you have to traverse inside the
> > >>> block until you meet the key you are looking for?
> > >>
> > >> The block index contains the first key for every block.  It therefore
> > >> defines, in an [a,b) manner, the range of each block.  Once a block
> > >> has been selected to read from, it is read into memory and then
> > >> iterated over until the key in question (or the closest match) has
> > >> been found.
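
A simplified sketch of that two-step lookup, with plain arrays standing in for
the real HFile structures:

    import java.util.Arrays;

    public class BlockIndexLookup {
        // Pick the block whose [firstKey, nextFirstKey) range covers the target.
        static int chooseBlock(String[] firstKeys, String target) {
            int pos = Arrays.binarySearch(firstKeys, target);
            if (pos >= 0) return pos;          // target is exactly some block's first key
            int insertion = -pos - 1;          // otherwise it belongs to the preceding block
            return Math.max(0, insertion - 1);
        }

        // Iterate the in-memory block until the key (or proof of its absence) is found.
        static String scanBlock(String[][] blockKvs, String target) {
            for (String[] kv : blockKvs) {
                if (kv[0].equals(target)) return kv[1];
                if (kv[0].compareTo(target) > 0) break; // keys are sorted; we've passed it
            }
            return null;
        }

        public static void main(String[] args) {
            String[] firstKeys = {"apple", "mango", "tomato"};
            String[][][] blocks = {
                {{"apple", "1"}, {"banana", "2"}},
                {{"mango", "3"}, {"peach", "4"}},
                {{"tomato", "5"}},
            };
            int b = chooseBlock(firstKeys, "peach");           // -> block 1
            System.out.println(scanBlock(blocks[b], "peach")); // prints 4
        }
    }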
> > >>
> > >>> 2. When HBase reads a block to fetch data or traverse within it, is
> > >>> the whole block read into memory?
> > >>
> > >> Yes, the entire block is read in a single read operation.
> > >>
> > >>>
> > >>> 3. HBase blocks (64KB, configurable) sit inside HDFS blocks (64MB,
> > >>> configurable), so to read the HBase blocks we have to randomly
> > >>> access the HDFS blocks.  Even though HBase can use in(p, buf, 0, x)
> > >>> to read a small portion of a larger HDFS block, it is still a random
> > >>> access.  Would this be slow?
> > >>
> > >> Random access reads are not necessarily slow; they require several
> > >> things:
> > >> - disk seeks to the data in question
> > >> - disk seeks to the checksum files in question
> > >> - checksum computation and verification
> > >>
> > >> While not particularly slow, this could probably be optimized a bit.
> > >>
> > >> Most of the issue with random reads in HDFS is parallelizing the
> > >> reads and doing as much IO pushdown/scheduling as possible without
> > >> consuming an excess of sockets and threads.  The actual speed can be
> > >> excellent, or not, depending on how busy the IO subsystem is.
> > >>
> > >>
> > >>>
> > >>> Many thanks. I would be grateful for your answers.
> > >>>
> > >>>
> > >>> William
> > >>>
> > >>
> >
