Subject: Re: HBase random access in HDFS and block indices
From: Sean Bigdatafun <sean.bigdatafun@gmail.com>
To: user@hbase.apache.org
Date: Fri, 29 Oct 2010 06:41:50 -0700

I have the same doubt here. Let's say I have a totally random read pattern
(uniformly distributed). Now let's assume my total data size stored in HBase
is 100 TB on 10 machines (not a big deal considering today's disks), and the
total size of my RSs' memory is 10 * 6 GB = 60 GB. That translates into a
60 / (100 * 1000) = 0.06% cache hit probability.

Under a random read pattern, each read is bound to experience the
"open -> read index -> ... -> read data block" sequence, which would be
expensive.

Any comment?
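
For what it's worth, here is that back-of-the-envelope calculation as a tiny
Java sketch; the 100 TB and 60 GB figures are just the assumptions above, and
uniformly random access is assumed.

// Back-of-the-envelope cache-hit estimate for uniformly random reads.
// The figures are the assumptions stated above, not measurements.
public class CacheHitEstimate {
    public static void main(String[] args) {
        double totalDataGB  = 100 * 1000;  // ~100 TB stored in HBase
        double totalCacheGB = 10 * 6;      // 10 region servers * 6 GB cache each
        // Under uniform random access, the chance a requested block is already
        // cached is roughly (total cache size) / (total data size).
        double hitProbability = totalCacheGB / totalDataGB;
        System.out.printf("expected cache hit rate: %.2f%%%n", hitProbability * 100);
        // prints: expected cache hit rate: 0.06%
    }
}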
On Mon, Oct 18, 2010 at 9:30 PM, Matt Corgan wrote:

> I was envisioning the HFiles being opened and closed more often, but it
> sounds like they're held open for long periods and that the indexes are
> permanently cached. Is it roughly correct to say that after opening an
> HFile and loading its checksum/metadata/index/etc., each random data
> block access only requires a single pread, where the pread has some
> threading and connection overhead but theoretically requires only one
> disk seek? I'm curious because I'm trying to do a lot of random reads,
> and given enough application parallelism, the disk seeks should become
> the bottleneck much sooner than the network and threading overhead.
>
> Thanks again,
> Matt
>
> On Tue, Oct 19, 2010 at 12:07 AM, Ryan Rawson wrote:
>
> > Hi,
> >
> > Since the file is write-once, with no random writes, putting the index
> > at the end is the only choice. The loading goes like this:
> > - read the fixed-size file trailer at the end of the file (file length
> >   minus trailer length)
> > - read the locations of the additional variable-length sections, e.g.
> >   the block index
> > - read those indexes, including the variable-length 'file-info' section
> >
> > So to give some background: by default an InputStream from DFSClient
> > has a single socket, and positioned reads are fairly fast. The DFS just
> > sees 'read bytes from pos X, length Y' commands on an open socket. That
> > is fast, but only one thread at a time can use this interface. So for
> > 'get' requests we use another interface called pread(), which takes a
> > position + length and returns data. This involves setting up a one-use
> > socket and tearing it down when we are done, so it is slower by 2-3 TCP
> > RTTs, thread instantiation and other misc overhead.
> >
> > Back to the HFile index: it is indeed stored in RAM, not the block
> > cache. Size is generally not an issue; it hasn't been yet. We ship with
> > a default block size of 64k, and I'd recommend sticking with that
> > unless you have evidence that your data set performs better under a
> > different size. Since the index grows linearly, at roughly 1/64k of the
> > bytes of data, it isn't a huge deal. Also, the indexes are spread
> > around the cluster, so you are pushing load to all machines.
> >
> > On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan wrote:
> > > Do you guys ever worry about how big an HFile's index will be? For
> > > example, if you have a 512mb HFile with an 8k block size, you will
> > > have 64,000 blocks. If each index entry is 50 bytes, then you have a
> > > 3.2mb index, which is way out of line with your intention of having a
> > > small block size. I believe that's read all at once, so it will be
> > > slow the first time... is the index cached somewhere (block cache?)
> > > so that index accesses are from memory?
> > >
> > > And somewhat related - since the index is stored at the end of the
> > > HFile, is an additional random access required to find its offset? If
> > > it was considered, why was that chosen over putting it in its own
> > > file that could be accessed directly?
> > >
> > > Thanks for all these explanations,
> > > Matt
> > >
> > > On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson wrote:
> > >
> > >> The primary problem is the namenode memory. It contains entries for
> > >> every file and block, so setting the HDFS block size small limits
> > >> your scalability.
> > >>
> > >> There is nothing inherently wrong with in-file random reads; it's
> > >> just that the HDFS client was written for a single reader to read
> > >> most of a file. Thus to achieve high performance you'd need to do
> > >> tricks, such as pipelining sockets and socket pool reuse. Right now
> > >> for random reads we open a new socket, read the data, then close it.
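
A minimal sketch of the two read paths described above, using the public
Hadoop FSDataInputStream API; the file path, offsets, and buffer size are
made up for illustration and error handling is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadModes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file path, purely for illustration.
        Path path = new Path("/hbase/mytable/region/family/somehfile");
        byte[] buf = new byte[64 * 1024];   // one 64k HBase block

        try (FSDataInputStream in = fs.open(path)) {
            // Streaming style: seek + read on the shared stream position.
            // Fast on an open socket, but only one thread at a time can
            // safely use it.
            in.seek(128L * 1024 * 1024);
            in.readFully(buf, 0, buf.length);

            // Positioned read (pread) style: takes an explicit offset and
            // does not touch the shared stream position, so it is safe for
            // concurrent 'get' handlers -- at the cost of the per-call
            // connection/threading overhead described above.
            in.readFully(512L * 1024 * 1024, buf, 0, buf.length);
        }
    }
}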
> > >> On Oct 18, 2010 8:22 PM, "William Kang" wrote:
> > >> > Hi JG and Ryan,
> > >> > Thanks for the excellent answers.
> > >> >
> > >> > So, I am going to push everything to the extremes without
> > >> > considering the memory first. In theory, if in HBase every cell
> > >> > size equals the HBase block size, then there would be no traversal
> > >> > within a block. And if every HBase block size equals the HDFS block
> > >> > size, there would be no in-file random access necessary. Would this
> > >> > provide the best performance?
> > >> >
> > >> > But the problem is that if the HBase block is too large, memory
> > >> > will run out, since HBase loads blocks into memory; and if the HDFS
> > >> > block is too small, the DN will run out of memory, since each HDFS
> > >> > file takes some memory. So it is a trade-off between memory and
> > >> > performance. Is that right?
> > >> >
> > >> > And would it make any difference to randomly read the same-size
> > >> > portion of a file from a small HDFS block versus from a large HDFS
> > >> > block?
> > >> >
> > >> > Thanks.
> > >> >
> > >> > William
> > >> >
> > >> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson wrote:
> > >> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang
> > >> >> <weliam.cloud@gmail.com> wrote:
> > >> >>> Hi,
> > >> >>> Recently I have spent some effort trying to understand the
> > >> >>> mechanisms of HBase in order to exploit possible performance
> > >> >>> tuning options. Many thanks to the folks who helped with my
> > >> >>> questions in this community; I have sent a report. But there are
> > >> >>> still a few questions left.
> > >> >>>
> > >> >>> 1. If an HFile block contains more than one key-value pair, will
> > >> >>> the block index in the HFile point out the offset of every
> > >> >>> key-value pair in that block? Or will the block index just point
> > >> >>> out the key ranges inside that block, so you have to traverse
> > >> >>> inside the block until you meet the key you are looking for?
> > >> >>
> > >> >> The block index contains the first key of every block. It
> > >> >> therefore defines, in an [a,b) manner, the range of each block.
> > >> >> Once a block has been selected to read from, it is read into
> > >> >> memory then iterated over until the key in question has been found
> > >> >> (or the closest match has been found).
> > >> >>
> > >> >>> 2. When HBase reads a block to fetch data or traverse within it,
> > >> >>> is the block read into memory?
> > >> >>
> > >> >> Yes, the entire block is read in a single read operation.
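
As a rough illustration of the [a,b) block index lookup described above
(simplified, not HBase's actual reader classes): the index keeps only each
block's first key, so finding a block is a binary search over those first
keys, followed by a scan inside the loaded block.

import java.util.Arrays;

// Simplified sketch of an HFile-style block index: block i covers
// [firstKeys[i], firstKeys[i+1]). Illustrative only.
public class BlockIndexSketch {
    private final String[] firstKeys;   // first key of each block, sorted
    private final long[] blockOffsets;  // file offset of each block

    public BlockIndexSketch(String[] firstKeys, long[] blockOffsets) {
        this.firstKeys = firstKeys;
        this.blockOffsets = blockOffsets;
    }

    // Returns the index of the block whose range may contain the key,
    // or -1 if the key sorts before the first block's first key.
    public int findBlock(String key) {
        int pos = Arrays.binarySearch(firstKeys, key);
        if (pos >= 0) {
            return pos;               // key is exactly a block's first key
        }
        int insertion = -pos - 1;     // first block whose firstKey > key
        return insertion - 1;         // block whose [a,b) range covers the key
    }

    public long offsetOf(int blockIndex) {
        return blockOffsets[blockIndex];
    }

    public static void main(String[] args) {
        BlockIndexSketch idx = new BlockIndexSketch(
                new String[] {"apple", "mango", "peach"},
                new long[] {0L, 65536L, 131072L});
        // "banana" falls in ["apple", "mango"), i.e. block 0; that whole 64k
        // block would then be read into memory and scanned for the key.
        System.out.println(idx.findBlock("banana"));   // prints 0
    }
}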
> > >> >>
> > >> >>> 3. HBase blocks (64k, configurable) sit inside HDFS blocks (64m,
> > >> >>> configurable), so to read the HBase blocks we have to randomly
> > >> >>> access the HDFS blocks. Even though HBase can use
> > >> >>> in.read(p, buf, 0, x) to read a small portion of a larger HDFS
> > >> >>> block, it is still a random access. Would this be slow?
> > >> >>
> > >> >> Random access reads are not necessarily slow; they require several
> > >> >> things:
> > >> >> - disk seeks to the data in question
> > >> >> - disk seeks to the checksum files in question
> > >> >> - checksum computation and verification
> > >> >>
> > >> >> While not particularly slow, this could probably be optimized a
> > >> >> bit.
> > >> >>
> > >> >> Most of the issues with random reads in HDFS are about
> > >> >> parallelizing the reads and doing as much IO pushdown/scheduling
> > >> >> as possible without consuming an excess of sockets and threads.
> > >> >> The actual speed can be excellent, or not, depending on how busy
> > >> >> the IO subsystem is.
> > >> >>
> > >> >>> Many thanks. I would be grateful for your answers.
> > >> >>>
> > >> >>> William
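
To make the parallelism point concrete, a minimal sketch that issues many
positioned reads from a bounded thread pool; the path, offsets, and pool size
are arbitrary, and this only illustrates the shape of the approach, not
HBase's internals.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPreads {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/hbase/mytable/region/family/somehfile"); // made up
        int blockSize = 64 * 1024;

        // Bound the pool so we do not burn an unbounded number of sockets and
        // threads; each positioned read below is independent, so the disks
        // see many outstanding seeks at once.
        ExecutorService pool = Executors.newFixedThreadPool(16);
        List<Future<byte[]>> results = new ArrayList<>();

        try (FSDataInputStream in = fs.open(path)) {
            for (int i = 0; i < 1000; i++) {
                final long offset = (long) i * 1024 * 1024;   // arbitrary offsets
                results.add(pool.submit(() -> {
                    byte[] buf = new byte[blockSize];
                    // Positioned read: no shared stream position, so it is
                    // safe to call from many threads concurrently.
                    in.readFully(offset, buf, 0, buf.length);
                    return buf;
                }));
            }
            for (Future<byte[]> f : results) {
                f.get();   // wait for all reads and surface any errors
            }
        } finally {
            pool.shutdown();
        }
    }
}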