hbase-user mailing list archives

From "Alvin C.L Huang" <CL.Hu...@itri.org.tw>
Subject Re: HBase random access in HDFS and block indices
Date Fri, 29 Oct 2010 16:38:38 GMT
@Sean

With an expected low hit ratio, the cache will not benefit your random gets.
It would instead cause too many Java major GCs, which pause all threads and
thus hurt performance.

Try no cache at all.
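
For example, a minimal sketch against the 0.20-era client API (the table
name "mytable" and family "cf" are placeholders; adjust to your version),
disabling the block cache per column family:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class NoCacheTable {
      public static void main(String[] args) throws Exception {
        // Create a table whose family bypasses the block cache, so
        // low-hit-ratio random gets do not churn the RS heap and GC.
        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
        HTableDescriptor table = new HTableDescriptor("mytable");
        HColumnDescriptor family = new HColumnDescriptor("cf");
        family.setBlockCacheEnabled(false);  // skip caching data blocks
        table.addFamily(family);
        admin.createTable(table);
      }
    }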

--
Alvin C.-L., Huang
ATC, ICL, ITRI, Taiwan


On 29 October 2010 21:41, Sean Bigdatafun <sean.bigdatafun@gmail.com> wrote:

> I have the same doubt here. Let's say I have a totally random read pattern
> (uniformly distributed).
>
> Now let's assume my total data size stored in HBase is 100TB across 10
> machines (not a big deal considering today's disks), and the total memory
> of my RegionServers is 10 * 6GB = 60GB. That translates into a
> 60 / (100 * 1000) = 0.06% cache hit probability. Under a random read
> pattern, each read is bound to experience the "open -> read index -> ...
> -> read data block" sequence, which would be expensive.
>
> Any comment?
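>
> A quick back-of-envelope in Java, nothing more than the same ratio (the
> numbers are the assumptions above):
>
>     // Uniform random reads: expected cache hit ratio is simply
>     // aggregate cache size over total data size.
>     double cacheBytes = 10 * 6e9;   // 10 RS * 6 GB
>     double dataBytes  = 100e12;     // 100 TB
>     System.out.printf("hit ratio = %.2f%%%n",
>         100 * cacheBytes / dataBytes);   // prints: hit ratio = 0.06%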
>
>
>
> On Mon, Oct 18, 2010 at 9:30 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
>
> > I was envisioning the HFiles being opened and closed more often, but it
> > sounds like they're held open for long periods and that the indexes are
> > permanently cached.  Is it roughly correct to say that after opening an
> > HFile and loading its checksum/metadata/index/etc., each random data
> > block access only requires a single pread, where the pread has some
> > threading and connection overhead but theoretically only requires one
> > disk seek?  I'm curious because I'm trying to do a lot of random reads,
> > and given enough application parallelism, the disk seeks should become
> > the bottleneck much sooner than the network and threading overhead.
> >
> > Thanks again,
> > Matt
> >
> > On Tue, Oct 19, 2010 at 12:07 AM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Since the file is write-once, with no random writes, putting the index
> > > at the end is the only choice.  The loading goes like this:
> > > - read the fixed file trailer, i.e. seek to filelen - <fixed size>
> > > - read the locations of the additional variable-length sections, e.g.
> > >   the block index
> > > - read those indexes, including the variable-length 'file-info' section
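> > >
> > > Roughly, in code, the open path looks like this (a sketch, not the
> > > actual HFile reader; TRAILER_SIZE and the decode* helpers are
> > > hypothetical stand-ins for the real trailer layout):
> > >
> > >     // Open-time load order: trailer first, then the sections it
> > >     // points at. 'in' is the HDFS FSDataInputStream for the HFile.
> > >     static final int TRAILER_SIZE = 60;  // illustrative only
> > >
> > >     void open(FSDataInputStream in, long fileLen) throws IOException {
> > >       byte[] trailer = new byte[TRAILER_SIZE];
> > >       // 1. the fixed-size trailer sits at the very end of the file
> > >       in.readFully(fileLen - TRAILER_SIZE, trailer, 0, TRAILER_SIZE);
> > >       // 2. the trailer records where the variable-length sections live
> > >       long indexOffset = decodeIndexOffset(trailer);       // hypothetical
> > >       long fileInfoOffset = decodeFileInfoOffset(trailer); // hypothetical
> > >       // 3. read the block index and 'file-info' sections into memory
> > >       readBlockIndex(in, indexOffset);
> > >       readFileInfo(in, fileInfoOffset);
> > >     }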
> > >
> > >
> > > So to give some background: by default an InputStream from DFSClient
> > > has a single socket, and positioned reads are fairly fast.  The DFS
> > > just sees 'read bytes from pos X, length Y' commands on an open socket.
> > > That is fast, but only one thread at a time can use this interface.
> > > So for 'get' requests we use another interface called pread(), which
> > > takes a position+length and returns data.  This involves setting up a
> > > one-use socket and tearing it down when we are done, so it is slower by
> > > 2-3 TCP RTTs, thread instantiation, and other misc overhead.
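> > >
> > > In client-API terms, the two styles look like this (a sketch; 'in' is
> > > an FSDataInputStream, 'blockOffset' and 'buf' are assumed):
> > >
> > >     // Stateful streaming interface: moves the shared file position,
> > >     // so only one thread at a time may use the stream.
> > >     in.seek(blockOffset);
> > >     in.readFully(buf, 0, buf.length);
> > >
> > >     // Stateless pread interface: the position travels with the call,
> > >     // so many threads can share the stream, but each call may pay
> > >     // for a one-use connection setup and teardown.
> > >     in.readFully(blockOffset, buf, 0, buf.length);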
> > >
> > >
> > > Back to the HFile index: it is indeed stored in RAM, not in the block
> > > cache.  Size is generally not an issue; it hasn't been yet.  We ship
> > > with a default block size of 64k, and I'd recommend sticking with that
> > > unless you have evidence that your data set performs better under a
> > > different size.  Since the index grows linearly with the data, at a
> > > factor of roughly 1/64k of the data bytes, it isn't a huge deal.  Also
> > > the indexes are spread around the cluster, so you are pushing load to
> > > all machines.
> > >
> > >
> > >
> > >
> > > On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
> > > > Do you guys ever worry about how big an HFile's index will be?  For
> > > > example, if you have a 512MB HFile with an 8KB block size, you will
> > > > have 64,000 blocks.  If each index entry is 50 bytes, then you have
> > > > a 3.2MB index, which is way out of line with your intention of
> > > > having a small block size.  I believe that's read all at once, so it
> > > > will be slow the first time... is the index cached somewhere (block
> > > > cache?) so that index accesses are from memory?
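> > > >
> > > > Working the numbers both ways (plain arithmetic, using my 50-byte
> > > > entry guess from above):
> > > >
> > > >     // index size ~= (file size / block size) * entry size
> > > >     long fileSize = 512L * 1024 * 1024, entry = 50;
> > > >     System.out.println(fileSize / (8 * 1024) * entry);
> > > >     //  -> 3276800 bytes, ~3.2MB index for 8KB blocks
> > > >     System.out.println(fileSize / (64 * 1024) * entry);
> > > >     //  -> 409600 bytes, ~400KB index for 64KB blocks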
> > > >
> > > > And somewhat related: since the index is stored at the end of the
> > > > HFile, is an additional random access required to find its offset?
> > > > If it was considered, why was that chosen over putting it in its own
> > > > file that could be accessed directly?
> > > >
> > > > Thanks for all these explanations,
> > > > Matt
> > > >
> > > >
> > > > On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> > > >
> > > >> The primary problem is the namenode memory.  It contains entries
> > > >> for every file and block, so setting the HDFS block size small
> > > >> limits your scalability.
> > > >>
> > > >> There is nothing inherently wrong with in-file random reads; it's
> > > >> just that the HDFS client was written for a single reader to read
> > > >> most of a file.  Thus to achieve high performance you'd need to do
> > > >> tricks, such as pipelining sockets and socket pool reuse.  Right
> > > >> now for random reads we open a new socket, read the data, then
> > > >> close it.
> > > >> On Oct 18, 2010 8:22 PM, "William Kang" <weliam.cloud@gmail.com> wrote:
> > > >> > Hi JG and Ryan,
> > > >> > Thanks for the excellent answers.
> > > >> >
> > > >> > So, I am going to push everything to the extremes without
> > > >> > considering the memory first. In theory, if in HBase every cell's
> > > >> > size equals the HBase block size, then there would not be any
> > > >> > in-block traversal. And if every HBase block size equals the HDFS
> > > >> > block size, there would not be any in-file random access
> > > >> > necessary. Would this provide the best performance?
> > > >> >
> > > >> > But the problem is that if the HBase block is too large, memory
> > > >> > will run out, since HBase loads whole blocks into memory; if the
> > > >> > HDFS block is too small, the namenode will run out of memory,
> > > >> > since each HDFS file takes some of its memory. So it is a
> > > >> > trade-off between memory and performance. Is that right?
> > > >> >
> > > >> > And would it make any difference to randomly read the same-size
> > > >> > portion of a file from a small HDFS block versus from a large
> > > >> > HDFS block?
> > > >> >
> > > >> > Thanks.
> > > >> >
> > > >> >
> > > >> > William
> > > >> >
> > > >> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> > > >> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <weliam.cloud@gmail.com> wrote:
> > > >> >>> Hi,
> > > >> >>> Recently I have spent some effort trying to understand the
> > > >> >>> mechanisms of HBase to exploit possible performance tuning
> > > >> >>> options. Many thanks to the folks who helped with my questions
> > > >> >>> in this community; I have sent a report. But there are still a
> > > >> >>> few questions left.
> > > >> >>>
> > > >> >>> 1. If an HFile block contains more than one key-value pair,
> > > >> >>> will the block index in the HFile point out the offset of every
> > > >> >>> key-value pair in that block? Or will the block index just
> > > >> >>> point out the key ranges inside that block, so you have to
> > > >> >>> traverse inside the block until you meet the key you are
> > > >> >>> looking for?
> > > >> >>
> > > >> >> The block index contains the first key of every block.  It
> > > >> >> therefore defines the range of each block in an [a,b) manner.
> > > >> >> Once a block has been selected to read from, it is read into
> > > >> >> memory, then iterated over until the key in question has been
> > > >> >> found (or the closest match has been found).
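> > > >> >>
> > > >> >> In pseudo-Java the lookup is roughly this (a sketch, not the
> > > >> >> real HFile scanner; firstKeys[] stands in for the in-memory
> > > >> >> index):
> > > >> >>
> > > >> >>     import org.apache.hadoop.hbase.util.Bytes;
> > > >> >>
> > > >> >>     // Find the block whose [firstKey, nextFirstKey) range could
> > > >> >>     // hold 'target': the last block whose first key <= target.
> > > >> >>     int findBlock(byte[][] firstKeys, byte[] target) {
> > > >> >>       int lo = 0, hi = firstKeys.length - 1, ans = -1;
> > > >> >>       while (lo <= hi) {            // binary search
> > > >> >>         int mid = (lo + hi) >>> 1;
> > > >> >>         if (Bytes.compareTo(firstKeys[mid], target) <= 0) {
> > > >> >>           ans = mid; lo = mid + 1;
> > > >> >>         } else {
> > > >> >>           hi = mid - 1;
> > > >> >>         }
> > > >> >>       }
> > > >> >>       return ans;  // -1: target sorts before the first block
> > > >> >>     }
> > > >> >>     // The chosen block is then read whole (one pread) and
> > > >> >>     // iterated in key order until the key or closest match.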
> > > >> >>
> > > >> >>> 2. When HBase reads a block to fetch data or traverse in it,
> > > >> >>> is the block read into memory?
> > > >> >>
> > > >> >> Yes, the entire block is read into memory in a single read
> > > >> >> operation.
> > > >> >>
> > > >> >>>
> > > >> >>> 3. HBase blocks (64KB, configurable) live inside HDFS blocks
> > > >> >>> (64MB, configurable), so to read the HBase blocks we have to
> > > >> >>> randomly access the HDFS blocks. Even though HBase can use
> > > >> >>> in.read(p, buf, 0, x) to read a small portion of a larger HDFS
> > > >> >>> block, it is still a random access. Would this be slow?
> > > >> >>
> > > >> >> Random-access reads are not necessarily slow; they require
> > > >> >> several things:
> > > >> >> - disk seeks to the data in question
> > > >> >> - disk seeks to the checksum files in question
> > > >> >> - checksum computation and verification
> > > >> >>
> > > >> >> While not particularly slow, this could probably be optimized a
> > > >> >> bit.
> > > >> >>
> > > >> >> Most of the issues with random reads in HDFS are about
> > > >> >> parallelizing the reads and doing as much IO pushdown/scheduling
> > > >> >> as possible without consuming an excess of sockets and threads.
> > > >> >> The actual speed can be excellent, or not, depending on how busy
> > > >> >> the IO subsystem is.
> > > >> >>
> > > >> >>
> > > >> >>>
> > > >> >>> Many thanks. I would be grateful for your answers.
> > > >> >>>
> > > >> >>>
> > > >> >>> William
> > > >> >>>
> > > >> >>
> > > >>
> > > >
> > >
> >
>
