hbase-user mailing list archives

From Doug Meil <doug.m...@explorysmedical.com>
Subject Re: Hbase performance with HDFS
Date Thu, 07 Jul 2011 19:38:33 GMT
Hi there-

You should read the architecture section...


re:  "blobs"


On 7/7/11 3:30 PM, "Mohit Anchlia" <mohitanchlia@gmail.com> wrote:

>Thanks, that helps! Just a few more questions:
>You mentioned compactions. When do those occur and what triggers
>them? Do they cause additional space usage when they happen? If they
>do, it would mean you always need much more disk than you
>really need.
>Since HDFS is mostly write-once, how are updates/deletes handled?
>Is HBase also suitable for blobs?
>On Thu, Jul 7, 2011 at 12:11 PM, Andrew Purtell <apurtell@apache.org> wrote:
>> Some thoughts off the top of my head. Lars' architecture material
>> might/should cover this too. Pretty sure his book will.
>> Regarding reads:
>> One does not have to read a whole HDFS block. You can request arbitrary
>> byte ranges within the block via positioned reads. (It is also true that
>> HDFS could be improved for better random read performance in ways not
>> yet committed to trunk, or especially to a 0.20.x branch with append
>> support for HBase. See https://issues.apache.org/jira/browse/HDFS-1323)
>> HBase holds indexes to store files in HDFS in memory. We also open all
>> files at the HDFS layer and stash those references. Additionally, users
>> can specify the use of bloom filters to improve query time performance
>> by wholesale skipping of HFile reads if they are known not to contain data
>> that satisfies the query. Bloom filters are held in memory as well.
>> So with indexes resident in memory, when handling Gets we know the byte
>> ranges within HDFS block(s) that contain the data of interest. With
>> positioned reads we retrieve only those bytes from a DataNode. With
>> bloom filters we avoid whole HFiles entirely.
>> Regarding writes:
>> I think you should consult the bigtable paper again if you are still
>> unclear about the write path. The database is log structured. Writes are
>> buffered in memory, and flushed all at once. Later flush files are compacted as
>> needed, because as you point out GFS and HDFS are optimized for
>> sequential reads and writes.
>> Best regards,
>>   - Andy
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>> ________________________________
>> From: Mohit Anchlia <mohitanchlia@gmail.com>
>> To: user@hbase.apache.org; Andrew Purtell <apurtell@apache.org>
>> Sent: Thursday, July 7, 2011 11:53 AM
>> Subject: Re: Hbase performance with HDFS
>> I have looked at bigtable and its SSTables etc. But my question is
>> directly related to how it's used with HDFS. HDFS recommends large
>> files, bigger blocks, write-once and read-many sequential access. But
>> reading and writing small rows is more random and
>> different from the inherent design of HDFS. How do these two go together
>> and still provide good performance?
>> On Thu, Jul 7, 2011 at 11:22 AM, Andrew Purtell <apurtell@apache.org> wrote:
>>> Hi Mohit,
>>> Start here: http://labs.google.com/papers/bigtable.html
>>> Best regards,
>>>     - Andy
>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>> (via Tom White)
>>>>From: Mohit Anchlia <mohitanchlia@gmail.com>
>>>>To: user@hbase.apache.org
>>>>Sent: Thursday, July 7, 2011 11:12 AM
>>>>Subject: Hbase performance with HDFS
>>>>I've been trying to understand how HBase can provide good performance
>>>>using HDFS when HDFS is designed for sequential access with large block
>>>>sizes, which is inherently different from HBase, where access is more
>>>>random and row sizes might be very small.
>>>>I am reading this but it doesn't answer my question. It does say that
>>>>the HFile block size is different, but how it really works with HDFS is
>>>>what I am trying to understand.
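
To make the read and write paths described in the quoted thread a bit more concrete, here is a toy sketch in Python. It is not HBase code and does not use any HBase or HDFS API: ToyStore, StoreFile, TOMBSTONE and flush_threshold are made-up names for illustration, a plain set stands in for a real (probabilistic) bloom filter, and Python lists stand in for files on HDFS. It only shows the general log-structured idea: writes go to memory, are flushed all at once into immutable sorted files, deletes are recorded as markers rather than in-place edits, and compaction rewrites several files into one.

```python
import bisect

# Hypothetical delete marker; deletes are written as new entries, never in-place edits.
TOMBSTONE = object()


class StoreFile:
    """Immutable sorted file: written once at flush time, never modified."""

    def __init__(self, entries):
        self.keys = sorted(entries)
        self.values = [entries[k] for k in self.keys]
        # Stand-in for a bloom filter (a real one can give false positives).
        self.key_filter = set(self.keys)

    def get(self, key):
        if key not in self.key_filter:
            return None  # skip this file without reading it
        # In-memory index lookup gives the exact position to read.
        i = bisect.bisect_left(self.keys, key)
        return self.values[i]


class ToyStore:
    def __init__(self, flush_threshold=3):
        self.memstore = {}   # recent writes, held in memory
        self.files = []      # flushed files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def flush(self):
        # Everything in memory is written out sequentially in one go.
        if self.memstore:
            self.files.append(StoreFile(self.memstore))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            value = self.memstore[key]
            return None if value is TOMBSTONE else value
        for f in reversed(self.files):  # newest file wins
            value = f.get(key)
            if value is not None:
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Merge all files into one new file, dropping deleted and overwritten entries.
        merged = {}
        for f in self.files:  # oldest first, so newer values overwrite older ones
            merged.update(zip(f.keys, f.values))
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.files = [StoreFile(live)] if live else []
```

For example, with flush_threshold=2, putting "a" and "b", then overwriting "a" and deleting "b", leaves two flushed files; after compact() there is a single file holding only the latest "a". Note that in this sketch the merged file is built while the old files still exist, so a compaction needs some extra space until the old files are dropped.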
