hbase-user mailing list archives

From "Pamecha, Abhishek" <apame...@x.com>
Subject RE: HBase Put
Date Wed, 22 Aug 2012 17:20:59 GMT
So then a Get has to look in every HFile whose min/max key range contains
the key.

From another parallel thread, I gather that an HFile consists of blocks and that a block
is, I think, the atomic unit of persisted data in HDFS (please correct me if not).

Each block of an HFile also covers a range of keys. My key can fall within a block's range
and yet not be present, so every block whose range matches my key will need to be
scanned. There is one block index per HFile, which sorts blocks by key range. This index
helps reduce the number of blocks to scan by selecting only those blocks whose ranges
contain the key.

In this case, if puts arrive in random order, each block may cover a similar range, and
HBase may end up scanning every block in the file. That may not be good for performance.
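
For comparison, here is a minimal sketch of a block index lookup under one assumption: that keys are sorted across the whole file (as each flushed file is described to be elsewhere in this thread), so block ranges do not overlap and the index only needs the first key of each block. The class and method names are hypothetical and this is not HBase's implementation.

```java
import java.util.Arrays;

/** Toy block index: one entry per block, holding that block's first key. */
final class BlockIndexSketch {

    /**
     * Returns the index of the only block that could contain the key,
     * or -1 if the key sorts before the first key of the file.
     * Assumes firstKeys is sorted, i.e. the file as a whole is sorted.
     */
    static int candidateBlock(String[] firstKeys, String key) {
        int pos = Arrays.binarySearch(firstKeys, key);
        if (pos >= 0) {
            return pos;               // key is exactly the first key of a block
        }
        int insertionPoint = -pos - 1; // where the key would be inserted
        return insertionPoint - 1;     // the block that starts just before it
    }

    public static void main(String[] args) {
        String[] firstKeys = {"A", "J", "R"}; // three blocks: [A..J), [J..R), [R..]
        System.out.println(candidateBlock(firstKeys, "K")); // 1
    }
}
```

Under that assumption a Get reads at most one block per file, even when the puts themselves arrived in random order; the key may still turn out to be absent from that block.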

I just want to validate my understanding.


-----Original Message-----
From: lars hofhansl [mailto:lhofhansl@yahoo.com] 
Sent: Tuesday, August 21, 2012 5:55 PM
To: user@hbase.apache.org
Subject: Re: HBase Put

That is correct.

From: "Pamecha, Abhishek" <apamecha@x.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <lhofhansl@yahoo.com>
Sent: Tuesday, August 21, 2012 4:45 PM
Subject: RE: HBase Put
Hi Lars,

Thanks for the explanation. I still have a little doubt:

Based on your description, given that gets do a merge sort, the data on disk is not kept
sorted across files, only within each file.

So, basically, say these keys get inserted on two separate days:

Day1: File1:  A B J M
Day2: File2:  C D K P

Then each file is sorted within itself, but scanning both files will require HBase to use
a merge sort to produce a sorted result. Right?

Also, File1 and File2 are immutable, and during compactions File1 and File2 are merge-sorted
into a bigger File3. Is that correct too?
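
To make the example concrete, here is a small sketch of a k-way merge over the two files above. It is only an illustration of the merge step, with in-memory lists standing in for files; the class and method names are hypothetical and none of this is HBase code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Merges several individually sorted "files" into one sorted sequence. */
final class SortedFileMerge {

    /** One cursor per file: its current key plus the remainder of the file. */
    private static final class Cursor {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    static List<String> merge(List<List<String>> sortedFiles) {
        // The heap always exposes the cursor with the smallest current key.
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> file : sortedFiles) {
            if (!file.isEmpty()) {
                heap.add(new Cursor(file.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);
            if (c.rest.hasNext()) {   // advance this file and put it back
                c.head = c.rest.next();
                heap.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> file1 = List.of("A", "B", "J", "M");
        List<String> file2 = List.of("C", "D", "K", "P");
        System.out.println(merge(List.of(file1, file2)));
        // prints [A, B, C, D, J, K, M, P]
    }
}
```

A scan would consume this merged stream directly, while a compaction would write the same stream out as the new File3.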


-----Original Message-----
From: lars hofhansl [mailto:lhofhansl@yahoo.com] 
Sent: Tuesday, August 21, 2012 4:07 PM
To: user@hbase.apache.org
Subject: Re: HBase Put

In a nutshell:
- Puts are collected in memory (in a sorted data structure)
- When the collected data reaches a certain size, it is flushed to a new file (which is sorted)
- Gets do a merge sort between the various files that have been created
- To keep the number of files in check, they are periodically compacted into fewer, larger files

So the data files (HFiles) are immutable once written; changes are batched in memory first.
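
The steps above can be sketched as a toy store. This is a simplified model, not HBase code: `ToyStore` and `flushThreshold` are hypothetical names, the threshold counts entries rather than bytes, sorted in-memory maps stand in for HFiles, and a read is simplified to a newest-first lookup instead of a streaming merge.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write path: sorted buffer, flush, read, compact. */
final class ToyStore {
    private final int flushThreshold;                        // assumption: entry count
    private TreeMap<String, String> memstore = new TreeMap<>(); // sorted in memory
    private final List<TreeMap<String, String>> files = new ArrayList<>(); // oldest first

    ToyStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        memstore.put(key, value);            // random arrival order, sorted storage
        if (memstore.size() >= flushThreshold) {
            flush();
        }
    }

    void flush() {
        if (memstore.isEmpty()) {
            return;
        }
        files.add(memstore);                 // never modified again after this point
        memstore = new TreeMap<>();
    }

    String get(String key) {
        if (memstore.containsKey(key)) {
            return memstore.get(key);
        }
        for (int i = files.size() - 1; i >= 0; i--) { // newest file wins
            String value = files.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : files) {
            merged.putAll(file);             // oldest first, so newer values overwrite
        }
        files.clear();
        if (!merged.isEmpty()) {
            files.add(merged);
        }
    }

    int fileCount() {
        return files.size();
    }
}
```

With a threshold of 2, four puts produce two sorted files, and `compact()` folds them into one without ever rewriting a file in place.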

-- Lars

From: "Pamecha, Abhishek" <apamecha@x.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Sent: Tuesday, August 21, 2012 4:00 PM
Subject: HBase Put


I have a question about the HBase Put call. When data is inserted with column qualifiers
in no particular order, how does HBase keep its store files/blocks sorted by column
qualifier?

I checked the code base and I can see checks<https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319>
being made for lexicographic insertion of key-value pairs. But I can't seem to find out
how the key offset is calculated in the first place.

Also, given that HDFS is by nature append-only, how do randomly ordered keys make their way
into sorted order? Is it only during minor/major compactions that this sorting gets applied,
so that there is a small window during which data is not sorted?

