From: "Pamecha, Abhishek"
To: "user@hbase.apache.org"
Subject: RE: HBase Put
Date: Wed, 22 Aug 2012 20:49:49 +0000

Can I enable bloom filters per block at the column qualifier level too? That way, with small block sizes, I can selectively load only a few data blocks into memory. Then I can trade off block size against the bloom filter false positive rate.

I am designing for a wide-table scenario with thousands to millions of columns, so I don't really want to dwell on checks for blocks having more than one row key.

Thanks,
Abhishek

-----Original Message-----
From: Mohit Anchlia [mailto:mohitanchlia@gmail.com]
Sent: Wednesday, August 22, 2012 11:09 AM
To: user@hbase.apache.org
Subject: Re: HBase Put

On Wed, Aug 22, 2012 at 10:20 AM, Pamecha, Abhishek wrote:

> So then a GET query means one needs to look in every HFile where the key
> falls within the min/max range of the file.
>
> From another parallel thread, I gather that an HFile is composed of
> blocks, which, I think, are the atomic unit of persisted data in HDFS
> (please correct me if not).
>
> And each block of an HFile has a range of keys. My key can fall within
> the range of a block and yet not be present. So all the blocks whose
> range matches my key will need to be scanned. There is one block index
> per HFile, which sorts blocks by key range. This index helps reduce the
> number of blocks to scan by extracting only those blocks whose ranges
> contain the key.
>
> In this case, if puts are random with respect to order, each block may
> have a similar range, and it may turn out that HBase needs to scan every
> block of the file. This may not be good for performance.
>
> I just want to validate my understanding.
>

If you have such a use case, I think the best practice is to use bloom filters. In general, I think it's a good idea to at least enable bloom filters at the row level.

> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> Sent: Tuesday, August 21, 2012 5:55 PM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> That is correct.
>
>
>
> ________________________________
> From: "Pamecha, Abhishek"
> To: "user@hbase.apache.org"; lars hofhansl <lhofhansl@yahoo.com>
> Sent: Tuesday, August 21, 2012 4:45 PM
> Subject: RE: HBase Put
>
> Hi Lars,
>
> Thanks for the explanation. I still have a little doubt:
>
> Based on your description, given that gets do a merge sort, the data on
> disk is not kept sorted across files, just sorted within each file.
>
> So, basically, if on two separate days these keys get inserted:
>
> Day1: File1: A B J M
> Day2: File2: C D K P
>
> then each file is sorted within itself, but scanning both files will
> require HBase to use a merge sort to produce a sorted result. Right?
>
> Also, File1 and File2 are immutable, and during compactions File1 and
> File2 are compacted and sorted using merge sort into a bigger File3. Is
> that correct too?
>
> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> Sent: Tuesday, August 21, 2012 4:07 PM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> In a nutshell:
> - Puts are collected in memory (in a sorted data structure)
> - When the collected data reaches a certain size, it is flushed to a
>   new file (which is sorted)
> - Gets do a merge sort between the various files that have been created
> - To contain the number of files, they are periodically compacted into
>   fewer, larger files
>
> So the data files (HFiles) are immutable once written; changes are
> batched in memory first.
>
> -- Lars
>
>
>
> ________________________________
> From: "Pamecha, Abhishek"
> To: "user@hbase.apache.org"
> Sent: Tuesday, August 21, 2012 4:00 PM
> Subject: HBase Put
>
> Hi
>
> I had a question on the HBase Put call. In the scenario where data is
> inserted without any order to column qualifiers, how does HBase
> maintain sortedness with respect to column qualifiers in its store
> files/blocks?
>
> I checked the code base and I can see checks
> <https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319>
> being made for lexicographic insertion of key-value pairs. But I can't
> seem to find out how the key offset is calculated in the first place.
>
> Also, given that HDFS is by nature append-only, how do randomly ordered
> keys make their way into sorted order? Is it only during minor/major
> compactions that this sortedness gets applied, so that there is a
> small window during which the data is not sorted?
>
>
> Thanks,
> Abhishek
>
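
The write and read path summarized in the thread (a sorted in-memory buffer, flushes to immutable sorted files, a merge on read, periodic compaction) can be sketched in a few lines of Java. This is a toy model under stated assumptions, not HBase code: the class LsmSketch and all of its methods are hypothetical names, keys are plain strings, and versions, timestamps, deletes and column qualifiers are ignored. It reuses the File1/File2 keys from the thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names are HBase classes.
class LsmSketch {
    // Sorted in-memory buffer; stands in for the structure that collects puts.
    private final TreeSet<String> buffer = new TreeSet<>();
    // Each flushed "file" is an immutable list that is sorted within itself.
    private final List<List<String>> files = new ArrayList<>();

    // Puts may arrive in any order; the sorted set orders them in memory.
    void put(String key) {
        buffer.add(key);
    }

    // A flush writes the buffer out in sorted order as a new file, then clears it.
    void flush() {
        if (buffer.isEmpty()) return;
        files.add(Collections.unmodifiableList(new ArrayList<>(buffer)));
        buffer.clear();
    }

    // K-way merge of individually sorted lists using a heap of cursors.
    static List<String> merge(List<List<String>> sorted) {
        // Each heap entry is {fileIndex, positionInFile}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] c) -> sorted.get(c[0]).get(c[1])));
        for (int i = 0; i < sorted.size(); i++) {
            if (!sorted.get(i).isEmpty()) heap.add(new int[] {i, 0});
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] c = heap.poll();
            List<String> file = sorted.get(c[0]);
            out.add(file.get(c[1]));
            if (c[1] + 1 < file.size()) heap.add(new int[] {c[0], c[1] + 1});
        }
        return out;
    }

    // A scan merges all files plus the unflushed buffer into one sorted stream.
    List<String> scan() {
        List<List<String>> sources = new ArrayList<>(files);
        sources.add(new ArrayList<>(buffer));
        return merge(sources);
    }

    // Compaction rewrites the existing files into one larger sorted file.
    void compact() {
        List<String> merged = merge(files);
        files.clear();
        files.add(Collections.unmodifiableList(merged));
    }

    int fileCount() {
        return files.size();
    }

    public static void main(String[] args) {
        LsmSketch store = new LsmSketch();
        for (String k : new String[] {"M", "A", "J", "B"}) store.put(k); // day 1, random order
        store.flush();                                                   // File1: A B J M
        for (String k : new String[] {"P", "C", "K", "D"}) store.put(k); // day 2, random order
        store.flush();                                                   // File2: C D K P
        System.out.println(store.scan()); // prints [A, B, C, D, J, K, M, P]
        store.compact();                  // File1 + File2 -> one larger file
        System.out.println(store.fileCount()); // prints 1
    }
}
```

In this model every file is written once, sequentially, from an already-sorted buffer, which is why append-only storage is sufficient: sortedness within a file comes from the in-memory structure, and sortedness across files is produced by the merge at read time and made permanent by compaction.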