hadoop-common-dev mailing list archives

From Stefan Groschupf ...@media-style.com>
Subject Re: HBase Design Ideas, Part I
Date Mon, 15 May 2006 10:09:11 GMT
> Seems like you are exporting a lot of complexity to the clients by  
> having them find the table chunks via DFS read.  Lots of data  
> motion and sync / cache issues there.  Why not just ask the master
> for the block/server of a key?  Or you could distribute this work  
> over your HRegionServers if you don't want to stress the master.   
> All this could be kept fresh in RAM there (segmented if you get  
> huge). [but this adds complexity]

I see the possibility, as described in the talk, of getting the key
regions from the master, then asking the next box for a key region,
and so on. That way the key-region table can be distributed over a
set of boxes.
If the clients cache this, the load on the master shouldn't be that hard.
Alternatively, I see a chance of requesting an update for a given key
and getting a kind of forward message returned, until the client
finds the region server that hosts the data for that row key.
However, since the client needs to start at the master, this would
put a lot of load there. A combination of both could be interesting.
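
To make that concrete, the client-side lookup with caching and forwarding
could look roughly like the sketch below; all class and method names here
are made up for illustration, they are not part of the design.

import java.util.HashMap;
import java.util.Map;

// Sketch only: RegionDirectory / LookupReply stand in for whatever RPC
// interface the master and the region servers would actually expose.
interface RegionDirectory { String lookup(String rowKey); }
interface LookupReply { boolean isAuthoritative(); String forwardTo(); }

public class RegionLookupClient {
  // A real client would cache per key range, not per row; kept simple here.
  private final Map<String, String> cache = new HashMap<String, String>();
  private final RegionDirectory master;

  public RegionLookupClient(RegionDirectory master) { this.master = master; }

  public String findServerForRow(String rowKey) {
    String server = cache.get(rowKey);        // cached answer, may be stale
    if (server == null) {
      server = master.lookup(rowKey);         // start at the master
    }
    for (int hop = 0; hop < 3; hop++) {       // follow a bounded number of forwards
      LookupReply reply = ask(server, rowKey);
      if (reply.isAuthoritative()) {
        cache.put(rowKey, server);            // caching spares the master next time
        return server;
      }
      server = reply.forwardTo();             // "not mine any more, ask this box"
    }
    throw new IllegalStateException("no owner found for " + rowKey);
  }

  // RPC call to a region server, elided in this sketch.
  private LookupReply ask(String server, String rowKey) {
    throw new UnsupportedOperationException();
  }
}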
>
> As I read your design, it sounds like you might be doing a lot of  
> seeks to find a record (do you need to scan all the logs to see if  
> a key is present?).  Best to outline the performance you want and  
> then look at the RAM / disk trade-offs.  I.e., you can store  
> everything in B-Trees, but then you will thrash your disks.  Or you  
> can store everything linearly and store all your unmerged entries  
> in RAM.  This would have different costs/benefits...
I agree there should be only one memory-backed write cache that
needs to be consulted before you search the last checkpoint file.
In any case, we can also write a log to make sure we lose no data
when a box crashes.
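
Something along these lines is what I have in mind; again the names are
only illustrative, not a proposal for the real interfaces.

import java.io.IOException;
import java.util.TreeMap;

// Sketch of the "one write cache in front of the last checkpoint" idea.
// WriteAheadLog and CheckpointFile are placeholders for real components.
class RegionStore {
  private final TreeMap<String, byte[]> writeCache = new TreeMap<String, byte[]>();
  private final WriteAheadLog log;          // on-disk log, replayed after a crash
  private final CheckpointFile checkpoint;  // last merged, immutable store

  RegionStore(WriteAheadLog log, CheckpointFile checkpoint) {
    this.log = log;
    this.checkpoint = checkpoint;
  }

  synchronized void put(String row, byte[] value) throws IOException {
    log.append(row, value);        // durable first, so a box crash loses nothing
    writeCache.put(row, value);    // then visible to readers
  }

  synchronized byte[] get(String row) throws IOException {
    byte[] v = writeCache.get(row);                 // consult the write cache first
    return (v != null) ? v : checkpoint.read(row);  // then fall back to the checkpoint
  }

  interface WriteAheadLog { void append(String row, byte[] value) throws IOException; }
  interface CheckpointFile { byte[] read(String row) throws IOException; }
}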
... just my 2 cents.

Stefan
>
>
> On May 14, 2006, at 3:00 PM, Michael Cafarella wrote:
>
>> Hi everyone,
>>
>> I've written up a design that I've been working on for a little  
>> bit, for
>> a project I'll call "HBase".  The idea is for Hadoop to implement  
>> something
>> similar in spirit to BigTable.  That is, a distributed data store  
>> that
>> places a greater emphasis on scalability than on SQL compatibility
>> or traditional transactional correctness.
>>
>> BigTable is neither completely described anywhere, nor is it
>> necessarily exactly what we want.  So I'm not trying to clone  
>> BigTable,
>> but I am going to draw on it a lot.
>>
>> My personal view is that BigTable is a great "physical layer" but  
>> not yet
>> a great database system.  A major thing it lacks is a good query  
>> language.
>> Another, freely admitted by the Google people, is any kind of  
>> inter-row
>> locking.  I'm not going to try to solve all these problems, but I  
>> would
>> like HBase to be extendible enough that it's easy to add new query
>> languages or primitives.
>>
>> In this mail, I'll describe a system that's pretty similar to  
>> BigTable.
>> I'll send a second one that describes what we might want to change
>> or add.
>>
>> Please let me know what you think!
>>
>> Thanks,
>> --Mike
>>
>> --------------------------------------------------------------------- 
>> -----------
>> I.  Table semantics
>>
>> An HBase consists of one or more HTables.  An HTable is a list of  
>> rows,
>> sorted alphabetically by "row name".  An HTable also has a series of
>> "columns."  A row may or may not contain a value for a column.  The
>> HTable representation is sparse, so if a row does not contain a value
>> for a given column, there is no storage overhead.
>>
>> (Thus, there's not really a "schema" to an HTable.  Every  
>> operation, even
>> adding a column, is considered a row-centric operation.)
>>
>> The "current version" of a row is always available, timestamped  
>> with its
>> last modification date.  The system may also store previous  
>> versions of a row,
>> according to how the HTable is configured.
>>
>> Updates to a single row are always atomic and can affect one or  
>> more columns.
>>
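
As a purely illustrative sketch (none of these names come from the design
above), a sparse, timestamped row could be modelled like this; storage of
older versions is left out for brevity.

import java.util.HashMap;
import java.util.Map;

// Illustration of the sparse row model: a row stores values only for the
// columns it actually has, each value carrying its last-modification time.
class SparseRow {
  private final String rowName;
  private final Map<String, TimestampedValue> cells =
      new HashMap<String, TimestampedValue>();

  SparseRow(String rowName) { this.rowName = rowName; }

  // "Adding a column" is just a row-level put; absent columns cost nothing.
  void put(String column, byte[] value) {
    cells.put(column, new TimestampedValue(value, System.currentTimeMillis()));
  }

  // Returns null if this row has no value for the column.
  TimestampedValue get(String column) { return cells.get(column); }

  String getRowName() { return rowName; }

  static class TimestampedValue {
    final byte[] value;
    final long timestamp;   // last modification date of this cell
    TimestampedValue(byte[] value, long timestamp) {
      this.value = value;
      this.timestamp = timestamp;
    }
  }
}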
>> II.  System layout
>>
>> HTables are partitionable into contiguous row regions called  
>> HRegions.
>> All machines in a pool run an HRegionServer.  A given HRegion is  
>> served
>> to clients by a single HRegionServer.  A single HRegionServer may be
>> responsible for many HRegions.  The HRegions for a single HTable will
>> be scattered across arbitrary HRegionServers.
>>
>> When a client wants to add/delete/update a row value, it must  
>> locate the
>> relevant HRegionServer.  It then contacts the HRegionServer and  
>> communicates
>> the updates.  There may be other steps, mainly lock-oriented  
>> ones.  But locating
>> the relevant HRegionServers is a bare minimum.
>>
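For illustration only, the region-server side of that exchange might boil
down to an interface like the following; the names and the NOT_SERVING
convention are assumptions, not part of the design above.

import java.io.IOException;
import java.util.Map;

// Hypothetical RPC surface an HRegionServer might expose to clients.
// A reply of NOT_SERVING tells the client to re-resolve the row's HRegion.
interface HRegionServerProtocol {

  enum Result { OK, NOT_SERVING }

  // Atomically apply the given column updates to a single row.
  Result updateRow(String tableName, String rowKey,
                   Map<String, byte[]> columnUpdates) throws IOException;

  // Delete a whole row, or a single column of it when column != null.
  Result deleteRow(String tableName, String rowKey, String column) throws IOException;
}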
>> The HBase system can repartition an HTable at any time.  For  
>> example, many
>> repeated inserts at a single location may cause a single HRegion  
>> to grow
>> very large.  The HBase would then try to split that into multiple  
>> HRegions.
>> Those HRegions may be served by the same HRegionServer as the
>> original or may be served by a different one.
>>
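A split of an overgrown HRegion could, at its simplest, just cut the sorted
row range at a middle key, as in this toy sketch (a real implementation
would decide by bytes on disk, not row count):

import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy illustration of splitting an overgrown region's sorted row range in two.
class RegionSplitter {

  static <V> List<TreeMap<String, V>> split(TreeMap<String, V> region) {
    String splitKey = pickMiddleKey(region);
    List<TreeMap<String, V>> halves = new ArrayList<TreeMap<String, V>>();
    halves.add(new TreeMap<String, V>(region.headMap(splitKey)));  // rows < splitKey
    halves.add(new TreeMap<String, V>(region.tailMap(splitKey)));  // rows >= splitKey
    return halves;
  }

  private static <V> String pickMiddleKey(SortedMap<String, V> region) {
    int middle = region.size() / 2;
    int i = 0;
    for (String key : region.keySet()) {
      if (i++ == middle) return key;
    }
    throw new IllegalArgumentException("cannot split an empty region");
  }
}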
>> Each HRegionServer sends a regular heartbeat to an HBaseMaster  
>> machine.
>> If the heartbeat for an HRegionServer fails, then the HBaseMaster  
>> is responsible
>> for reassigning its HRegions to other available HRegionServers.
>>
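The heartbeat bookkeeping on the HBaseMaster could be as simple as the
following sketch; the lease length and method names are invented for
illustration.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the master's heartbeat bookkeeping: remember when each
// HRegionServer last reported in, and reassign its regions once it is overdue.
class HeartbeatMonitor {
  private final long leaseMillis;
  private final Map<String, Long> lastHeartbeat = new HashMap<String, Long>();

  HeartbeatMonitor(long leaseMillis) { this.leaseMillis = leaseMillis; }

  // An HRegionServer's heartbeat RPC lands here.
  synchronized void recordHeartbeat(String serverName) {
    lastHeartbeat.put(serverName, System.currentTimeMillis());
  }

  // Called periodically by the master; returns servers whose regions need reassigning.
  synchronized List<String> expiredServers() {
    long now = System.currentTimeMillis();
    List<String> dead = new ArrayList<String>();
    for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
      if (now - e.getValue() > leaseMillis) {
        dead.add(e.getKey());
      }
    }
    return dead;
  }
}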
>> All HRegions are stored within DFS, so the HRegion is always  
>> available, even
>> in the face of machine failures.  The HRegionServers and DFS  
>> DataNodes run
>> on the same set of machines.  We would like for an HRegionServer  
>> to always
>> serve data stored locally, but that is not guaranteed when using  
>> DFS.  We can
>> encourage it by:
>> 1) In the event of an insert-motivated HRegion move, the new  
>> HRegionServer
>> should always create a new DFS file for the new HRegion.  The DFS  
>> rules of
>> thumb will allocate the chunks locally for the HRegionServer.
>> 2) In the event of a machine failure, we cannot do anything similar  
>> to above.
>> Instead, the HBaseMaster can ask DFS for hints as to where the  
>> relevant
>> file blocks are stored.  If possible, it will allocate the new
>> HRegions to servers
>> that physically contain the HRegion.
>> 3) If necessary, we could add an API to DFS that demands block  
>> replication
>> to a given node.  I'd like to avoid this if possible.
>>
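Point 2 might be realized with something like the sketch below, assuming
DFS can be asked which hosts hold a file's blocks; the BlockHints interface
is just a stand-in for whatever that call ends up being.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of locality-aware reassignment: count how many of the HRegion's
// blocks each candidate host already stores locally, and prefer the best one.
class LocalityAwareAssigner {

  // Stand-in for "ask DFS where the blocks of this file live".
  interface BlockHints { String[][] hostsPerBlock(String dfsPath); }

  private final BlockHints dfs;

  LocalityAwareAssigner(BlockHints dfs) { this.dfs = dfs; }

  String pickServer(String regionFile, Set<String> liveServers) {
    Map<String, Integer> localBlocks = new HashMap<String, Integer>();
    for (String[] hosts : dfs.hostsPerBlock(regionFile)) {
      for (String host : hosts) {
        if (liveServers.contains(host)) {
          Integer n = localBlocks.get(host);
          localBlocks.put(host, n == null ? 1 : n + 1);
        }
      }
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> e : localBlocks.entrySet()) {
      if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
    }
    // Fall back to any live server if no candidate holds a local copy.
    return best != null ? best : liveServers.iterator().next();
  }
}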
>> The mapping from row to HRegion (and hence, to HRegionServer) is  
>> itself
>> stored in a special HTable.  The HBaseMaster is the only client  
>> allowed to
>> write to this HTable.  This special HTable may itself be split  
>> into several
>> HRegions.  However, we only allow a hard-coded number of split- 
>> levels.
>> The top level of this hierarchy must be easily-stored on a single  
>> machine.
>> That top-level table is always served by the HBaseMaster itself.
>>
>> III.  Client behavior
>>
>> Let's think about what happens when a client wants to add a row.
>> 1) The client must compute what HRegion is responsible for the key
>> it wants to insert into the HTable.  It must navigate the
>> row->HRegion
>> mapping, which is stored in an HTable.
>>
>> So the client first contacts the HBaseMaster for the top-level  
>> table contents.
>> It then steps downward through the table set, until it finds the  
>> mapping for
>> the target row.
>>
>> 2) The client contacts the HRegionServer responsible for the  
>> target row,
>> and asks to insert.  If the HRegionServer is no longer responsible  
>> for the
>> relevant HRegion, it returns a failure message and tells the  
>> client to go
>> back to step 1 to find the new correct HRegionServer.
>>
>> If the HRegionServer is the right place to go, it accepts the new  
>> row from
>> the client.  The HRegionServer guarantees that the insert is  
>> atomic; it
>> will not intermingle the insert with a competing insert for the  
>> same row key.
>> When the row is stored, the HRegionServer includes version and  
>> timestamp
>> information.
>>
>> 3) That's it!
>>
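Put together, the client's insert loop for steps 1-3 might look like this
sketch; Locator and RegionServer are made-up stand-ins for the real lookup
and RPC pieces.

import java.io.IOException;
import java.util.Map;

// Sketch of the client-side insert loop: resolve the row's HRegionServer,
// try the insert, and re-resolve if the server no longer owns the region.
class HTableClient {

  interface Locator {
    RegionServer locate(String table, String rowKey) throws IOException;
  }
  interface RegionServer {
    boolean insert(String table, String rowKey, Map<String, byte[]> cols) throws IOException;
  }

  private final Locator locator;

  HTableClient(Locator locator) { this.locator = locator; }

  void insertRow(String table, String rowKey, Map<String, byte[]> cols) throws IOException {
    for (int attempt = 0; attempt < 5; attempt++) {
      RegionServer server = locator.locate(table, rowKey);  // step 1: walk the mapping
      if (server.insert(table, rowKey, cols)) {             // step 2: ask the owner to insert
        return;                                             // step 3: that's it
      }
      // Ownership moved; go back to step 1 and resolve again.
    }
    throw new IOException("could not find a serving HRegionServer for " + rowKey);
  }
}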
>> IV.  The HRegionServer
>>
>> Maintaining the data for a single HRegion is slightly  
>> complicated.  It's
>> especially weird given the write-once semantics of DFS.  There are
>> three important moving parts:
>>
>> 1) HBackedStore is a file-backed store for rows and their values.
>> It is never edited in place.  It has B-Tree-like lookups for finding
>> a row quickly.  HBackedStore is actually a series of on-disk stores,
>> each store being tuned for a certain object size.  Thus, all the  
>> "small"
>> (in bytes) values for a row live within the same file, all the medium
>> ones live in a separate file, etc.  There is only one HBackedStore
>> for any single HRegion.
>>
>> 2) HUpdateLog is a log of updates to the HBackedStore.  It is backed
>> by an on-disk file.  When making reads from the HBackedStore, it may
>> be necessary to consult the HUpdateLog to see if any more-recent
>> updates have been made.  There may be a series of HUpdateLogs
>> for a single HRegion.
>>
>> 3) HUpdateBuf is an in-memory version of HUpdateLog.  It, too, needs
>> to be consulted whenever performing a read.  There is only one
>> HUpdateBuf for a single HRegion.
>>
>> Any incoming edit is first made directly to the HUpdateBuf.  Changes
>> made to the HUpdateBuf are volatile until flushed to an HUpdateLog.
>> The rate of flushes is an admin-configurable parameter.
>>
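One way to picture the HUpdateBuf -> HUpdateLog hand-off is the sketch
below; the flush timer and class names are assumptions for illustration.

import java.io.IOException;
import java.util.TreeMap;

// Sketch of the in-memory buffer: edits land here first and only become
// durable when the buffer is flushed to an on-disk update log.
class UpdateBuffer {

  interface LogWriter { void write(TreeMap<String, byte[]> edits) throws IOException; }

  private TreeMap<String, byte[]> pending = new TreeMap<String, byte[]>();

  synchronized void apply(String rowKey, byte[] value) {
    pending.put(rowKey, value);       // volatile until the next flush
  }

  // Called on a timer; the flush rate is the admin-configurable knob.
  synchronized void flush(LogWriter log) throws IOException {
    if (pending.isEmpty()) return;
    log.write(pending);               // edits now survive a crash of this server
    pending = new TreeMap<String, byte[]>();
  }
}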
>> Periodically, the HBackedStore and the series of current HUpdateLogs
>> are merged to form a new HBackedStore.  At that point, the old  
>> HUpdateLog
>> objects can be destroyed.  During this compaction process, edits are
>> made to the HUpdateBuf.
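
And the read side, across the three structures, would then be roughly as
follows (again a sketch with invented names); compaction is what keeps the
list of logs consulted here short.

import java.io.IOException;
import java.util.List;

// Sketch of the read path: newest data wins, so consult the in-memory buffer
// first, then the on-disk update logs (newest to oldest), then the backing store.
class RegionReader {

  interface Lookup { byte[] get(String rowKey) throws IOException; }

  private final Lookup updateBuf;          // HUpdateBuf
  private final List<Lookup> updateLogs;   // HUpdateLogs, ordered newest first
  private final Lookup backedStore;        // HBackedStore

  RegionReader(Lookup updateBuf, List<Lookup> updateLogs, Lookup backedStore) {
    this.updateBuf = updateBuf;
    this.updateLogs = updateLogs;
    this.backedStore = backedStore;
  }

  byte[] get(String rowKey) throws IOException {
    byte[] v = updateBuf.get(rowKey);
    if (v != null) return v;
    for (Lookup log : updateLogs) {        // may mean several seeks per miss
      v = log.get(rowKey);
      if (v != null) return v;
    }
    return backedStore.get(rowKey);        // compaction keeps the log list short
  }
}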
>
>

