hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/DataModel" by JeanDanielCryans
Date Fri, 11 Jul 2008 16:44:00 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by JeanDanielCryans:
http://wiki.apache.org/hadoop/Hbase/DataModel

------------------------------------------------------------------------------
+ '''This page is a work in progress'''
+  
   * [#intro Introduction]
   * [#overview Overview]
+  * [#row Rows]
+  * [#columns Column Families]
+  * [#ts Timestamps]
+  * [#famatt Family Attributes]
+  * [#example Real Life Example]
+   * [#relational The Source ERD]
+   * [#hbaseschema The HBase Target Schema]
  
  [[Anchor(intro)]]
  = Introduction =
@@ -11, +20 @@

  [[Anchor(overview)]]
  = Overview =
  
- To put it simply, HBase can be reduced to a Map<byte[], Map<byte[], Map<byte[],
Map<long, byte[]>>>>. The first Map maps row keys to their ''column families''.
The second maps column families to their ''column keys''. The third one maps column keys''
to their ''timestamps''. Finally, the last one maps the timestamps to a single value. The
keys are typically strings, the timestamp is a long and the value is an uninterpreted array
of bytes. The 
+ To put it simply, HBase can be reduced to a Map<byte[], Map<byte[], Map<byte[],
Map<long, byte[]>>>>. The first Map maps row keys to their ''column families''.
The second maps column families to their ''column keys''. The third one maps column keys to
their ''timestamps''. Finally, the last one maps the timestamps to a single value. The keys
are typically strings, the timestamp is a long and the value is an uninterpreted array of
bytes. The column key is always preceded by its family and is written as ''family:key''.
Since a family maps to another map, a single column family can in theory contain an unlimited
number of column keys. So, to retrieve a single value, the user has to do a ''get'' using
three keys:
  
  row key+column key+timestamp -> value
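
The nested map described above can be modelled directly in plain Java. This is only a sketch to make the structure concrete: `NestedMapSketch` and its methods are hypothetical names, not HBase classes, and string keys are assumed to be UTF-8 encoded. It assumes Java 10 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the nested map; this is not HBase code.
class NestedMapSketch {
    // byte[] has no natural ordering, so the sorted maps need an explicit comparator.
    static final Comparator<byte[]> BYTES = Arrays::compareUnsigned;

    // row key -> column family -> column key -> timestamp -> value
    final NavigableMap<byte[], NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>>> rows =
            new TreeMap<>(BYTES);

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Splits "family:key" at the first colon; the key part may be empty.
    static String[] split(String column) {
        int i = column.indexOf(':');
        if (i < 0) throw new IllegalArgumentException("expected family:key, got " + column);
        return new String[] { column.substring(0, i), column.substring(i + 1) };
    }

    void put(String row, String column, long timestamp, String value) {
        String[] c = split(column);
        rows.computeIfAbsent(bytes(row), k -> new TreeMap<>(BYTES))
            .computeIfAbsent(bytes(c[0]), k -> new TreeMap<>(BYTES))
            .computeIfAbsent(bytes(c[1]), k -> new TreeMap<>())
            .put(timestamp, bytes(value));
    }

    // row key + column key + timestamp -> value, or null if any level is missing.
    String get(String row, String column, long timestamp) {
        String[] c = split(column);
        var families = rows.get(bytes(row));
        if (families == null) return null;
        var columns = families.get(bytes(c[0]));
        if (columns == null) return null;
        var versions = columns.get(bytes(c[1]));
        if (versions == null) return null;
        byte[] value = versions.get(timestamp);
        return value == null ? null : new String(value, StandardCharsets.UTF_8);
    }
}
```

A lookup only succeeds when all three keys match, which mirrors the row key+column key+timestamp addressing shown above.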
  
+ [[Anchor(row)]]
+ = Rows =
+ 
+ The row key is treated by HBase as an array of bytes but it must have a string representation.
A special property of the row key Map is that it keeps its keys in lexicographical order. For
example, the numbers from 1 to 100 are ordered like this:
+ 1,10,100,11,12,13,14,15,16,17,18,19,2,20,21,...,9,90,91,92,93,94,95,96,97,98,99
+ 
+ To keep the integers' natural ordering, the row keys have to be left-padded with zeros. To
take advantage of this ordering, the row key Map is augmented with a
scanner which takes a ''start row key'' (if not specified, the first one in the table) and
a ''stop row key'' (if not specified, the last one in the table). For example, if the row
keys are dates in the format YYYYMMDD, getting the month of July 2008 is a matter of opening
a scanner from ''20080700'' to ''20080800''. It does not matter whether the specified row keys
exist; the only thing to keep in mind is that the stop row key itself is not returned,
which is why a key that sorts after every July date is given to the scanner.
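
The ordering and the scanner bounds can be tried out with a sorted map from the Java standard library. This is a minimal sketch, not the HBase scanner API: `RowOrderSketch`, `sortedKeys` and `scan` are hypothetical names, and plain `String` keys stand in for byte arrays (for ASCII digits the two orderings agree).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper illustrating lexicographic row ordering; not HBase code.
class RowOrderSketch {
    // Returns the keys for from..to in sorted order, optionally left-padded to 3 digits.
    static List<String> sortedKeys(int from, int to, boolean padded) {
        TreeMap<String, Boolean> rows = new TreeMap<>();
        for (int i = from; i <= to; i++) {
            rows.put(padded ? String.format("%03d", i) : Integer.toString(i), true);
        }
        return new ArrayList<>(rows.keySet());
    }

    // Returns the row keys in [startRow, stopRow): start inclusive, stop exclusive.
    // Neither bound needs to exist in the table.
    static List<String> scan(TreeMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }
}
```

With unpadded keys the list starts 1, 10, 100, 11; with padded keys it starts 001, 002, 003. Scanning a table of YYYYMMDD keys from 20080700 to 20080800 returns only the July rows.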
+ 
+ [[Anchor(columns)]]
+ = Column Families =
+ 
+ A column family groups data of the same nature in HBase and puts no constraint on the type.
The families are part of the table schema and stay the same for each row; what differs from
row to row is the set of column keys, which can be very sparse. For example, row "20080702" may have
in its "info:" family the following column keys:
+ ||info:aaa||
+ ||info:bbb||
+ ||info:ccc||
+ While row "20080703" only has:
+ ||info:12342||
+ Developers have to be very careful when using column keys, since a key with a length of zero
is permitted, which means that in the previous example data can be inserted in column key "info:".
We strongly suggest using empty column keys only when no other keys will be specified. Also,
since the data in a family has the same nature, many attributes can be specified regarding
[#famatt performance and timestamps].
+ 
+ [[Anchor(ts)]]
+ = Timestamps =
+ 
+ The values in HBase may have multiple versions kept according to the family configuration.
By default, HBase sets the timestamp of each new value to the current time in milliseconds and
returns the latest version when a cell is retrieved. Developers can also provide their own
timestamps when inserting data and can specify a particular timestamp when fetching it.
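
The versioning behaviour of a single cell can be sketched as a map sorted by timestamp. `CellVersionsSketch` is a hypothetical class used only for illustration. Fetching by timestamp is modelled here as an exact match, which is a simplifying assumption and not a statement about how HBase resolves a requested timestamp.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of one cell and its versions; not HBase code.
class CellVersionsSketch {
    // timestamp (milliseconds) -> value, sorted by ascending timestamp
    private final TreeMap<Long, String> versions = new TreeMap<>();

    // Default behaviour: stamp the value with the current time in milliseconds.
    void put(String value) {
        put(System.currentTimeMillis(), value);
    }

    // The caller may also supply its own timestamp.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Returns the value with the highest timestamp, or null if the cell is empty.
    String getLatest() {
        Map.Entry<Long, String> latest = versions.lastEntry();
        return latest == null ? null : latest.getValue();
    }

    // Returns the value stored at exactly this timestamp, or null (assumed semantics).
    String get(long timestamp) {
        return versions.get(timestamp);
    }
}
```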
+ 
+ [[Anchor(famatt)]]
+ = Family Attributes =
+ 
+ The following attributes can be specified for each family:
+ 
+ Implemented
+ 
+  * Compression
+   * Record: means that each value found at a rowkey+columnkey+timestamp will be compressed
independently.
+   * Block: means that blocks in HDFS are compressed. A block may contain multiple records
if they are shorter than one HDFS block or may only contain part of a record if the record
is longer than an HDFS block.
+  * Timestamps
+   * Max number: the maximum number of versions kept for a value.
+   * Time to live: versions older than the specified time will be garbage collected.
+ 
+ Still not implemented
+ 
+  * In memory: all values of that family will be kept in memory.
+  * Length: values written will not be longer than the specified number of bytes.
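
The effect of the two timestamp attributes on one cell's versions can be sketched as follows. `VersionPruningSketch.prune` and its parameters are hypothetical, the time to live is assumed to be expressed in milliseconds, and when or how HBase actually performs this garbage collection is not covered by this page.

```java
import java.util.TreeMap;

// Hypothetical illustration of "Max number" and "Time to live"; not HBase code.
class VersionPruningSketch {
    // versions maps timestamp (milliseconds) -> value for a single cell.
    static void prune(TreeMap<Long, String> versions, int maxVersions, long ttlMillis, long now) {
        // Time to live: drop every version strictly older than now - ttlMillis.
        versions.headMap(now - ttlMillis, false).clear();
        // Max number: drop the oldest versions until at most maxVersions remain.
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry();
        }
    }
}
```

For example, with versions at timestamps 100, 200, 300 and 400, a time to live of 250 and the current time at 500, only the versions at 300 and 400 survive.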
+ 
+ [[Anchor(example)]]
+ = Real Life Example =
+ 
+ [[Anchor(relational)]]
+ == The Source ERD ==
+ 
+ [[Anchor(hbaseschema)]]
+ == The HBase Target Schema ==
+ 
