hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/NewFileFormat" by Misty
Date Thu, 22 Oct 2015 20:55:53 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hbase/NewFileFormat" page has been changed by Misty:

- This page is for discussion related to [[https://issues.apache.org/jira/browse/HBASE-61|HBASE-61,
Create an HBase-specific MapFile implementation]].  That issue, and its linked issues, have
a bunch of suggestions for how we might do persistence better.  Most have been replicated
in the ''New Format'' section below.  Other related issues include [[https://issues.apache.org/jira/browse/HADOOP-3315|TFile]]
and [[https://issues.apache.org/jira/browse/HBASE-647|HBASE-647, Remove the HStoreFile 'info'
file (and index and bloomfilter if possible)]], as well as ''SSTable'' from the bigtable paper.
+ The HBase Wiki is in the process of being decommissioned. The info that used to be on this
page has moved to http://hbase.apache.org/book.html#_hfile_format_2. Please update your bookmarks.
- == Current Implementation ==
- Currently -- circa 0.19.0 -- hbase store files are built on ''org.apache.hadoop.io.!MapFile''.
!MapFile is made of two ''org.apache.hadoop.io.SequenceFile''s: a sorted data file of key/values
and an accompanying index file. Once written, neither the data nor the index file changes.
The current index is 'flat', made of keys and their offsets.  An index entry
is made for every Nth entry of the data file, where N is configurable with a default of 32
in hbase (it's 128 for hadoop).
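To make the interval idea concrete, here is a minimal Java sketch of a flat index that records every 32nd key and its offset; the class and method names are illustrative, not !MapFile internals:

```java
import java.util.TreeMap;

// Sketch of the flat-index idea: record every Nth key and its byte offset.
// Names here are illustrative, not the actual MapFile internals.
class FlatIndex {
    static final int INTERVAL = 32;           // hbase default; hadoop uses 128
    final TreeMap<String, Long> index = new TreeMap<>();
    long entriesSeen = 0;

    void onAppend(String key, long offset) {
        if (entriesSeen % INTERVAL == 0) {
            index.put(key, offset);           // one index entry per 32 data entries
        }
        entriesSeen++;
    }

    // Nearest index entry at or before the queried key, or null.
    Long nearestOffset(String key) {
        var e = index.floorEntry(key);
        return e == null ? null : e.getValue();
    }
}
```

A lookup then seeks to `nearestOffset` and scans forward at most INTERVAL entries.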
- !MapFiles can be configured to compress each key/value entry or to compress based on a block
size.  Blocks do not span key/values but break on entry boundaries.
- Hbase keys are made of row/column/timestamp.  Rows and columns are effectively binary. 
Timestamp is a long.  The sort is not a straightforward binary sort; it has its idiosyncrasies,
embodied in the particular Comparator passed when creating the store file: e.g. the timestamps
are in reverse order because we want to find the newest first.
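A toy Java comparator illustrating the idiosyncratic sort just described, with strings standing in for the real binary row and column keys (all names here are illustrative):

```java
import java.util.Comparator;

// Illustrative sketch of an hbase-style key sort: row, then column, then
// timestamp in *descending* order so the newest edit sorts first.
// The real comparator works on binary keys; strings keep the sketch short.
record StoreKey(String row, String column, long timestamp) {}

class StoreKeyComparator implements Comparator<StoreKey> {
    @Override
    public int compare(StoreKey a, StoreKey b) {
        int c = a.row().compareTo(b.row());
        if (c != 0) return c;
        c = a.column().compareTo(b.column());
        if (c != 0) return c;
        // reversed: larger (newer) timestamps come first
        return Long.compare(b.timestamp(), a.timestamp());
    }
}
```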
- Every hbase flush creates a new !MapFile in the file system and an accompanying SequenceFile
of metadata, an 'info' file.  Metadata includes the id of the last edit added to the !MapFile
and, if the store file is a 'reference' file -- more on this later (TODO) -- info
on what is referred to.
- Optionally administrators can enable bloomfilters on hbase stores.  The bloomfilter allows
a fast test of whether or not the store file contains an entry.  The bloomfilter is persisted
into the filesystem in its own file.
- Worst-case, each flush writes '''four''' files to the file system: a mapfile data file,
the mapfile index, an accompanying 'info' file for metadata, and a bloomfilter file.
- Currently, on open of a store file, the index is read into memory and its file then closed.  The
data file is opened and kept open to avoid paying the 'open' cost on every random access.
 This latter practice means hbase soon trips the 'too many open files' exception (see [[http://wiki.apache.org/hadoop/Hbase/FAQ#6|FAQ
#6]]).  The info file is opened, read, and then closed.  If there is a bloomfilter, it is deserialized
out of the bloomfilter file.
- == Common Index-based Accesses ==
- To look up a particular key, a query is first made against the !MapFile index to find the
nearest key using a binary search.  We then go to the data file, seek to the indexed offset,
and iterate until we find the queried key or we've moved past where it should have been in
the file.
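The lookup just described can be sketched in Java; array positions stand in for file offsets, and every name below is illustrative rather than actual !MapFile code:

```java
import java.util.Arrays;

// Sketch of the lookup described above: binary-search the sparse index for the
// nearest key at or before the query, seek there in the (sorted) data, then
// iterate forward until we hit the key or pass where it would have been.
class StoreLookup {
    final String[] indexKeys;   // every Nth key, sorted
    final long[]   indexOffsets;
    final String[] data;        // stands in for the sorted data file

    StoreLookup(String[] indexKeys, long[] indexOffsets, String[] data) {
        this.indexKeys = indexKeys; this.indexOffsets = indexOffsets; this.data = data;
    }

    // Returns the data position of an exact match, or -1 if absent.
    int get(String key) {
        int i = Arrays.binarySearch(indexKeys, key);
        if (i < 0) i = -i - 2;              // floor entry of the index
        if (i < 0) return -1;               // key sorts before everything indexed
        int pos = (int) indexOffsets[i];    // "offset" is an array position here
        while (pos < data.length) {
            int c = data[pos].compareTo(key);
            if (c == 0) return pos;         // found it
            if (c > 0) return -1;           // moved past where it should have been
            pos++;
        }
        return -1;
    }
}
```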
- Another common access pattern has us asking for the row that falls closest to the one we
asked for, both closest-before and closest-after (if not an exact match).  To figure the closest
row, we go to the index first and then iterate forward.
- We also need to be able to figure quickly whether a store file has an entry at all for a particular
key.
- == File Index ==
- We need to fix the case where rows have many entries.  When scanning, we'll pre-populate
the scanner with the scanner start row (using the index to figure the offset).  On a call to
hasNext, we'll then iterate forward until we trip over the next row.  This works fine if there
are fewer than tens of entries per row, but if there are millions the hbase scanner crawls (or
the client lease just times out).  We at least need to be smarter about our use of the flat
index, going back to it to try to figure whether a row has tens or millions of entries -- or
the index could record the start offset of every row.
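The second suggestion -- an index recording the start offset of every row -- might look like this sketch, so a scanner can jump straight to the next row instead of crawling millions of entries (names illustrative, array positions standing in for offsets):

```java
import java.util.TreeMap;

// Sketch of a row-start index: remember where each row begins so a scanner
// can jump to the next row rather than iterate entry by entry.
class RowStartIndex {
    final TreeMap<String, Integer> starts = new TreeMap<>();

    void record(String row, int pos) { starts.putIfAbsent(row, pos); }

    // Where the row after 'row' begins, or -1 if there is none.
    int nextRowStart(String row) {
        var e = starts.higherEntry(row);
        return e == null ? -1 : e.getValue();
    }
}
```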
- The index needs to be small.  There are lots of these store files in hbase.  Currently we open
the index, read it into memory, then close the index file, but keep the data file open for 'fast'
random access. One improvement would be to divide the index into pieces -- file blocks as
in TFile or as in cassandra would make most sense -- and optionally let go of least-recently-used
index blocks under memory pressure.
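Letting go of least-recently-used index blocks can be sketched with the standard LinkedHashMap eviction idiom; the capacity and types here are illustrative, not a proposal for the actual data structures:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of "divide the index into pieces and let go of blocks under memory
// pressure": a fixed-capacity LRU over index blocks that drops the
// least-recently-used block when full.
class IndexBlockCache<K, V> extends LinkedHashMap<K, V> {
    final int capacity;

    IndexBlockCache(int capacity) {
        super(16, 0.75f, true);   // access-order: reads refresh recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least-recently-used index block
    }
}
```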
- If the index included the offset of every key, we would be able to use it to figure whether
the file had an entry for the queried key, and every index lookup would get us an exact offset.
But such an index would be too large to keep in memory (if values are small, a file can have
many entries; files are usually about 64MB but can grow to an upper bound of about 1G, though
this is configurable and nothing stops it being configured higher).
- == New Format ==
-  * [[https://issues.apache.org/jira/browse/HBASE-647|HBASE-647]]: Have data, metadata, indices,
bloomfilters, etc., all rolled up in the one file.  Could do this with [[https://issues.apache.org/jira/browse/HADOOP-3315|TFile]].
 SequenceFile allows addition of metadata but this facility is not exposed in !MapFile.  We could
add it to !MapFile, but SequenceFile metadata is stored in the head of the SequenceFile, and much
metadata is known only after the flush: count-of-entries, bloomfilter, etc.
-  * [[https://issues.apache.org/jira/browse/HBASE-519|Convert HStore to use only new interface
methods]].  If it's an Interface, we can try different implementations.
-  * In-memory: TFile has the user supply the underlying data stream.  We could supply a stream
hosted in memory.
-  * Always-on general bloomfilter. We know how many entries a file will have when we go to
flush it, so we can optimally size a bloomfilter.  The small amount of memory a bloomfilter
occupies will pay for itself many-fold in the seeks saved trying to figure whether a file contains
an asked-for key.
-  * Optimal random-access
-  * Iterate over keys only, rather than always key+values as !MapFile currently does.  This'd
be useful when trying to find the closest row. TFile and SequenceFile can do this (it's not exposed
in !MapFile).
-  * Smart getClosest and getClosestAtOrBefore [[https://issues.apache.org/jira/browse/HBASE-792|hbase-792]]
-  * Get vs. Scan accesses.  Latter has state.
-  * Sharing blocks and indices: we can have multiple Readers on a single file (e.g. many concurrent
Scanners).  If so, rather than read in the index for each instance, share indices if one is already
in memory.  Same for file blocks: only make the trip to the datanode if there is not already an
instance of the (read-only) block in memory.
-  * Version: File format should have a version so we can evolve the format.
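On the always-on bloomfilter item above: once the entry count n is known at flush time, the textbook formulas give the bit count m and hash count k for a target false-positive rate p. This is a sketch of that arithmetic only, not hbase's code:

```java
// Standard bloomfilter sizing: m = ceil(-n ln p / (ln 2)^2) bits and
// k = round(m/n * ln 2) hash functions, for n entries and target
// false-positive rate p.  Textbook formulas, not hbase internals.
class BloomSizing {
    static long bits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    static int hashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }
}
```

For a million entries at a 1% false-positive rate this works out to roughly 9.6 bits per entry, i.e. about 1.2MB -- the "small amount of memory" that pays for itself in saved seeks.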
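On the sharing-blocks-and-indices item above, a minimal sketch of the sharing idea: concurrent Readers consult a shared map and only make the trip to the datanode on a miss. `loadBlock` is a hypothetical stand-in for the real read, and all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of "share indices if one is already in-memory": readers look blocks
// up in a shared map and only fall through to the datanode read on a miss.
class SharedBlockCache {
    final Map<String, byte[]> blocks = new ConcurrentHashMap<>();
    int datanodeReads = 0;               // counts trips to the datanode

    byte[] getBlock(String file, long blockId) {
        return blocks.computeIfAbsent(file + "#" + blockId, k -> loadBlock());
    }

    byte[] loadBlock() {
        datanodeReads++;                 // only on a cache miss
        return new byte[]{1, 2, 3};      // placeholder for the real block bytes
    }
}
```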
- === Index ===
- TODO, but the TFile block-based index rather than the !MapFile interval-based one would seem
better for us; indices are then of predictable size, and a seek to the index position will land
at an amenable spot when blocks are compressed.
- === Nice-to-haves ===
-  * Don't write out the family portion of column when writing keys [[https://issues.apache.org/jira/browse/HBASE-68|HBASE-68]]
-  * Row index at the end of the datablock.  It doesn't have to hold actual rows, just positions.
We can look at the current position and then at the data block index to figure the next row start.
- === Exercise ===
- Make the index interval one, copy it to the local filesystem, and memory-map it. On opening,
flushing, or compacting, we'd make a copy of the index file on the local file system.  On closing
regions, we'd clean up the local copies of the index.  Thanks to Eitan and Nitay for the suggested
exercise.
- == Other File Formats ==
- Cassandra uses a SequenceFile.  It adds key/values in blocks of 128 by default.  On the
128th entry, an index of the block keys is inlined and then a new block begins.  Block offsets
are kept out in an index file as in !MapFile.  Bloomfilters are on by default.
- From the bigtable paper, an SSTable "... contains a sequence of blocks (typically each block
is 64KB in size, but this is configurable).  A block index (stored at the end of the SSTable)
is used to locate blocks; the index is loaded into memory when the SSTable is opened.  A lookup
can be performed with a single disk seek: we first find the appropriate block by performing
a binary search in the in-memory index, and then reading the appropriate block from disk.
 Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups
and scans without touching the disk."
