From: "Pamecha, Abhishek" <apamecha@x.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Subject: RE: HBase Put
Date: Wed, 22 Aug 2012 20:49:49 +0000

Can I enable bloom filters per block at the column qualifier level too? That
way, with small block sizes, I can selectively load only a few data blocks into
memory. Then I can trade off block size against the bloom filter false
positive rate.

I am designing for a wide-table scenario with thousands to millions of
columns, so I don't really want to focus on checks for blocks that hold more
than one row key.
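
To make the trade-off concrete, here is a rough sketch based on the standard
Bloom filter approximation p ~= (1 - e^(-k/b))^k, where b is the number of
bits per key and k is the number of hash functions. The function names are
made up for illustration; they are not HBase settings, and the sketch does not
claim to reproduce how HBase actually sizes its filters.

```python
import math

def bloom_false_positive_rate(bits_per_key: float, num_hashes: int) -> float:
    """Approximate false positive rate p = (1 - e^(-k/b))^k.

    b = bits_per_key, k = num_hashes. Textbook approximation, not HBase code.
    """
    return (1.0 - math.exp(-num_hashes / bits_per_key)) ** num_hashes

def optimal_num_hashes(bits_per_key: float) -> int:
    """Hash count that minimizes p for a given budget: k = b * ln 2 (rounded)."""
    return max(1, round(bits_per_key * math.log(2)))

# Hypothetical budgets, chosen only to show the trend.
for bits in (4, 8, 10, 16):
    k = optimal_num_hashes(bits)
    p = bloom_false_positive_rate(bits, k)
    print(f"{bits:>2} bits/key, {k:>2} hashes -> false positive rate ~ {p:.4f}")
```

More bits per key lowers the false positive rate but makes the filter larger,
which is the same kind of memory-versus-accuracy trade I have in mind for
block size.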

Thanks,
Abhishek


-----Original Message-----
From: Mohit Anchlia [mailto:mohitanchlia@gmail.com]
Sent: Wednesday, August 22, 2012 11:09 AM
To: user@hbase.apache.org
Subject: Re: HBase Put

On Wed, Aug 22, 2012 at 10:20 AM, Pamecha, Abhishek <apamecha@x.com> wrote:

> So a GET query then has to look in every HFile whose min/max key range
> contains the key.
>
> From another parallel thread, I gather that an HFile is composed of
> blocks, and that a block is, I think, the atomic unit of persisted data
> in HDFS (please correct me if not).
>
> Each block of an HFile covers a range of keys. My key can fall within a
> block's range and yet not be present in it, so every block whose range
> matches my key needs to be scanned. There is one block index per HFile,
> which sorts blocks by key range. This index helps reduce the number of
> blocks to scan by selecting only those blocks whose ranges cover the key.
>
> In this case, if puts arrive in random order, each block may have a
> similar range, and HBase may end up scanning every block in the file.
> That may not be good for performance.
>
> I just want to validate my understanding.
>
>
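
One point that may simplify the picture above: because the cells inside a
single HFile are written in sorted order, its blocks cover disjoint, ascending
key ranges, so a per-file index should narrow a lookup for one key to at most
one candidate block, whatever order the puts arrived in. A minimal sketch of
that lookup, assuming the index is just a sorted list of each block's first
key (`block_first_keys` and `find_block` are hypothetical names, not HBase's
API):

```python
from bisect import bisect_right

def find_block(block_first_keys, key):
    """Return the index of the only block that could hold `key`, or None.

    `block_first_keys` is the sorted list of the first key in each block.
    This is a toy model of a block index, not HBase's implementation.
    """
    i = bisect_right(block_first_keys, key) - 1
    return i if i >= 0 else None

# Toy file with three blocks starting at keys A, F and M (made-up data).
block_first_keys = ["A", "F", "M"]
print(find_block(block_first_keys, "J"))
```

The chosen block still has to be read to learn whether the key is really
there, which is where a bloom filter can save the read.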
If you have such a use case, I think the best practice is to use bloom
filters. In general, I think it's a good idea to at least enable bloom
filters at the row level.
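
As an illustration of why a row-level bloom filter helps a Get, here is a toy
sketch in which each file carries a small filter over its row keys and a
lookup skips any file whose filter rules the row out. `ToyBloomFilter` and
`files_to_read` are hypothetical names, and the hashing scheme is an
assumption for the example, not what HBase does internally.

```python
import hashlib

class ToyBloomFilter:
    """Tiny Bloom filter for illustration only (not HBase's implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

def files_to_read(files, row_key):
    """Names of the files whose filter cannot rule out row_key."""
    return [name for name, bloom in files if bloom.might_contain(row_key)]

# Made-up files and row keys.
files = []
for name, rows in [("file1", ["A", "B", "J", "M"]), ("file2", ["C", "D", "K", "P"])]:
    bloom = ToyBloomFilter()
    for row in rows:
        bloom.add(row)
    files.append((name, bloom))

print(files_to_read(files, "K"))
```

A filter never reports a stored key as absent, so the file that really holds
the row is always read; the saving comes from the other files that can be
skipped.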

> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> Sent: Tuesday, August 21, 2012 5:55 PM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> That is correct.
>
>
>
> ________________________________
> From: "Pamecha, Abhishek" <apamecha@x.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <lhofhansl@yahoo.com>
> Sent: Tuesday, August 21, 2012 4:45 PM
> Subject: RE: HBase Put
>
> Hi Lars,
>
> Thanks for the explanation. I still have a little doubt:
>
> Based on your description, given that gets do a merge sort, the data on
> disk is not kept sorted across files, only within each file.
>
> So, basically if on two separate days, say these keys get inserted:
>
> Day1: File1:   A B J M
> Day2: File2:  C D K P
>
> Then each file is sorted within itself, but scanning both files will
> require HBase to use merge sort to produce a sorted result. Right?
>
> Also, File1 and File2 are immutable, and during compactions File1 and
> File2 are merged (using merge sort) into a bigger, sorted File3. Is
> that correct too?
>
> Thanks,
> Abhishek
>
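
The two-file example above can be sketched with a k-way merge. This is only a
toy model that treats each file as a sorted list of keys; `merged_scan` and
`compact` are hypothetical helper names, not HBase functions.

```python
import heapq

# The two files from the example, each sorted within itself.
file1 = ["A", "B", "J", "M"]  # flushed on day 1
file2 = ["C", "D", "K", "P"]  # flushed on day 2

def merged_scan(*sorted_files):
    """Lazily yield keys from several individually sorted files in global order."""
    return heapq.merge(*sorted_files)

def compact(*sorted_files):
    """Write the merged stream out as one new sorted file; inputs stay untouched."""
    return list(merged_scan(*sorted_files))

file3 = compact(file1, file2)
print(file3)  # ['A', 'B', 'C', 'D', 'J', 'K', 'M', 'P']
```

The merge only ever compares the current head of each file, so no file has to
be re-sorted, and the inputs are never modified.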
>
> -----Original Message-----
> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> Sent: Tuesday, August 21, 2012 4:07 PM
> To: user@hbase.apache.org
> Subject: Re: HBase Put
>
> In a nutshell:
> - Puts are collected in memory (in a sorted data structure)
> - When the collected data reaches a certain size it is flushed to a
> new file (which is sorted)
> - Gets do a merge sort between the various files that have been
> created
> - To contain the number of files, they are periodically compacted into
> fewer, larger files
>
>
> So the data files (HFiles) are immutable once written; changes are
> batched in memory first.
>
> -- Lars
>
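
A minimal sketch of the first two steps described above (collect puts in a
sorted in-memory structure, then flush to a new sorted file). `ToyStore` is a
hypothetical class; a plain sorted list stands in for whatever structure HBase
really uses, and the flush threshold counts keys rather than bytes purely to
keep the example short.

```python
from bisect import insort

class ToyStore:
    """Toy model of in-memory buffering plus flush (not HBase code)."""

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold  # assumption: counted in keys
        self.memstore = []  # kept sorted on every insert
        self.files = []     # immutable sorted tuples, oldest first

    def put(self, key):
        insort(self.memstore, key)  # insert in sorted position
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memstore:
            # The buffer is already sorted, so the new file is sorted too.
            self.files.append(tuple(self.memstore))
            self.memstore = []

store = ToyStore(flush_threshold=4)
for key in ["M", "A", "J", "B", "P", "C", "K", "D"]:  # arbitrary arrival order
    store.put(key)
print(store.files)
```

Keys arrive in arbitrary order, yet every flushed file is sorted because the
sorting happens in memory before anything is written.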
>
>
> ________________________________
> From: "Pamecha, Abhishek" <apamecha@x.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Tuesday, August 21, 2012 4:00 PM
> Subject: HBase Put
>
> Hi
>
> I have a question about the HBase Put call. In the scenario where data is
> inserted without any ordering of column qualifiers, how does HBase
> maintain sortedness with respect to column qualifiers in its store
> files/blocks?
>
> I checked the code base and I can see checks
> <https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319>
> being made for lexicographic insertion of key-value pairs. But I
> can't seem to find out how the key offset is calculated in the first place.
>
> Also, given that HDFS is by nature append-only, how do randomly ordered
> keys make their way into sorted order? Is it only during minor/major
> compactions that this sortedness gets applied, and is there a
> small window during which data is not sorted?
>
>
> Thanks,
> Abhishek
>