kudu-commits mailing list archives

From danburk...@apache.org
Subject [2/5] incubator-kudu git commit: Move design docs to the new folder and change the extension
Date Thu, 18 Feb 2016 22:29:53 GMT
http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/docs/design-docs/tablet.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/tablet.md b/docs/design-docs/tablet.md
new file mode 100644
index 0000000..3b792c4
--- /dev/null
+++ b/docs/design-docs/tablet.md
@@ -0,0 +1,760 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+A Tablet is a horizontal partition of a Kudu table, similar to tablets
+in BigTable or regions in HBase. Each tablet hosts a contiguous range
+of rows which does not overlap with any other tablet's range. Together,
+all the tablets in a table comprise the table's entire key space.
+
+Each tablet is further subdivided into a number of sets of rows called
+RowSets. Each RowSet consists of the data for a set of rows. RowSets
+are disjoint, i.e. the sets of rows in different RowSets do not
+intersect, so any given key is present in at most one RowSet. While
+RowSets are disjoint, their key spaces may overlap.
+
+============================================================
+Handling Insertions
+============================================================
+
+One RowSet is held in memory and is referred to as the MemRowSet. All
+inserts go directly into the MemRowSet, which is an in-memory B-Tree sorted
+by the table's primary key. As data is inserted, it is accumulated in the MemRowSet,
+where it is made immediately visible to future readers, subject to MVCC
+(see below).
+
+NOTE: Unlike BigTable, only inserts (and updates of recently-inserted rows) go into the
+MemRowSet -- mutations such as updates and deletions of on-disk rows are handled
+differently, and are discussed in a later section of this document.
+
+Each row exists in exactly one entry in the MemRowSet. The value of this entry consists
+of a special header, followed by the packed format of the row data (more detail below).
+Since the MemRowSet is fully in-memory, it will eventually fill up and "Flush" to disk --
+this process is described in detail later in this document.
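+
+As an illustration only, here is a hypothetical, simplified sketch of such an entry
+(the real header layout lives in the MemRowSet implementation):
+
+```
+#include <cstdint>
+
+// Hypothetical sketch of a MemRowSet entry. Names and layout are
+// illustrative, not the actual Kudu structures.
+struct Mutation;  // node of the singly linked mutation list (see below)
+
+struct MemRowSetEntry {
+  int64_t insertion_timestamp;  // timestamp of the INSERT that created the row
+  Mutation* mutation_head;      // later mutations to this row, or nullptr
+  // the packed row data in the table's row format follows this header
+};
+```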
+
+============================================================
+MVCC Overview
+============================================================
+
+Kudu uses multi-version concurrency control in order to provide a number of useful
+features:
+
+- Snapshot scanners: when a scanner is created, it operates as of a point-in-time
+  snapshot of the tablet. Any further updates to the tablet which occur during
+  the course of the scan are ignored. In addition, this point-in-time can be
+  stored and re-used for additional scans on the same tablet, for example if an application
+  would like to perform analytics requiring multiple passes on a consistent view of the data.
+
+- Time-travel scanners: similar to the above, a user may create a scanner which
+  operates as of some point in time from the past, providing a consistent "time travel read".
+  This can be used to take point-in-time consistent backups.
+
+- Change-history queries: given two MVCC snapshots, the user may be able to query
+  the set of deltas between those two snapshots for any given row. This can be leveraged
+  to take incremental backups, perform cross-cluster synchronization, or for offline audit
+  analysis.
+
+- Multi-row atomic updates within a tablet: a single mutation may apply to multiple
+  rows within a tablet, and it will be made visible in a single atomic action.
+
+In order to provide MVCC, each mutation is tagged with a timestamp. Timestamps are generated by a
+tablet-server-wide Clock instance, and ensured to be unique within a tablet by the tablet's MvccManager. The
+state of the MvccManager determines the set of timestamps which are considered "committed" and thus
+visible to newly generated scanners. Upon creation, a scanner takes a snapshot of the MvccManager
+state, and any data seen by that scanner is then compared against the MvccSnapshot to
+determine which insertions, updates, and deletes should be considered visible.
+
+Timestamps are monotonically increasing per tablet. We use a technique called HybridTime (see
+OSDI'14 submission for details) to create timestamps which correspond to true wall clock
+time but also reflect causality between nodes.
+
+In order to support these snapshot and time-travel reads, multiple versions of any given
+row must be stored in the database. To prevent unbounded space usage, the user may configure
+a retention period beyond which old transaction records may be GCed (thus preventing any snapshot
+reads from earlier than that point in history).
+(NOTE: history GC not currently implemented)
+
+============================================================
+MVCC Mutations in MemRowSet
+============================================================
+
+In order to support MVCC in the MemRowSet, each row is tagged with the timestamp of the
+operation which inserted it. Additionally, the row contains a singly linked list of any further
+mutations that were made to the row after its insertion, each tagged with the mutation's
+timestamp:
+
+```
+               MemRowSet Row
++----------------------------------------------------+
+| insertion timestamp  | mutation head | row data... |
++-------------------------|--------------------------+
+                          |
+                          v          First mutation
+                  +-----------------------------------------------+
+                  | mutation timestamp | next_mut | change record |
+                  +--------------------|--------------------------+
+                            __________/
+                           /
+                           |         Second mutation
+                  +--------v--------------------------------------+
+                  | mutation timestamp | next_mut | change record |
+                  +--------------------|--------------------------+
+                            __________/
+                           /
+                           ...
+```
+
+In traditional database terms, one can think of the mutation list forming a sort of
+"REDO log" containing all changes which affect this row.
+
+Any reader traversing the MemRowSet needs to apply these mutations to read the correct
+snapshot of the row, via the following logic:
+
+- If row.insertion_timestamp is not committed in scanner's MVCC snapshot, skip the row
+  (it was not yet inserted when the scanner's snapshot was made).
+- Otherwise, copy the row data into the output buffer.
+- For each mutation in the list:
+  - if mutation.timestamp is committed in the scanner's MVCC snapshot, apply the change
+    to the in-memory copy of the row. Otherwise, skip this mutation (it had not yet
+    been applied at the time of the snapshot).
+  - if the mutation indicates a DELETE, mark the row as deleted in the output buffer
+    of the scanner by zeroing its bit in the scanner's selection vector.
+
+Note that "mutation" in this case can be one of three types:
+- UPDATE: changes the value of one or more columns
+- DELETE: removes the row from the database
+- REINSERT: reinserts the row with a new set of data (only occurs on a MemRowSet row
+            with a prior DELETE mutation)
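+
+As a sketch of this logic, under hypothetical, simplified types (the real code
+lives in the MemRowSet iterator, and 'is_committed' stands in for the
+MvccSnapshot commit check):
+
+```
+#include <cstdint>
+#include <functional>
+
+enum class MutType { UPDATE, DELETE, REINSERT };
+struct Mutation {
+  int64_t timestamp;  // timestamp of this mutation
+  MutType type;
+  Mutation* next;     // next node of the singly linked list, or nullptr
+  // change record payload omitted for brevity
+};
+
+// Returns whether the row is visible as of the snapshot (i.e. whether its bit
+// in the scanner's selection vector remains set).
+bool RowVisibleInSnapshot(int64_t insertion_timestamp, const Mutation* head,
+                          const std::function<bool(int64_t)>& is_committed) {
+  if (!is_committed(insertion_timestamp)) {
+    return false;  // row was not yet inserted when the snapshot was made
+  }
+  bool deleted = false;
+  for (const Mutation* m = head; m != nullptr; m = m->next) {
+    if (!is_committed(m->timestamp)) continue;  // too new: skip this mutation
+    switch (m->type) {
+      case MutType::UPDATE:   /* apply change to the copied row */     break;
+      case MutType::DELETE:   deleted = true;                          break;
+      case MutType::REINSERT: deleted = false; /* copy the new data */ break;
+    }
+  }
+  return !deleted;
+}
+```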
+
+As a concrete example, consider the following sequence on a table with schema
+(key STRING, val UINT32):
+
+```
+  INSERT INTO t VALUES ("row", 1);         [timestamp 1]
+  UPDATE t SET val = 2 WHERE key = "row";  [timestamp 2]
+  DELETE FROM t WHERE key = "row";         [timestamp 3]
+  INSERT INTO t VALUES ("row", 3);         [timestamp 4]
+```
+
+This would result in the following structure in the MemRowSet:
+```
+  +-----------------------------------+
+  | tx 1 | mutation head | ("row", 1) |
+  +----------|------------------------+
+             |
+             |
+         +---v--------------------------+
+         | tx 2 | next ptr | SET val=2  |
+         +-----------|------------------+
+              ______/
+             |
+         +---v------------------------+
+         | tx 3 | next ptr | DELETE   |
+         +-----------|----------------+
+              ______/
+             |
+         +---v------------------------------------+
+         | tx 4 | next ptr | REINSERT ("row", 3)  |
+         +----------------------------------------+
+```
+
+Note that this has a couple of undesirable properties when update frequency is high:
+- readers must chase pointers through a singly linked list, likely causing many CPU cache
+  misses.
+- updates must append to the end of a singly linked list, which is O(n) where 'n' is the
+  number of times this row has been updated.
+
+However, we consider the above inefficiencies tolerable given the following assumptions:
+- Kudu's target use cases have a relatively low update rate: we assume that a single row
+  won't have a high frequency of updates.
+- Only a very small fraction of the total database will be in the MemRowSet -- once the MemRowSet
+  reaches some target size threshold, it will flush. So, even if scanning MemRowSet is slow
+  due to update handling, it will make up only a small percentage of overall query time.
+
+If it turns out that the above inefficiencies impact real applications, various optimizations
+can be applied in the future to reduce the overhead.
+
+============================================================
+MemRowSet Flushes
+============================================================
+
+When the MemRowSet fills up, a Flush occurs, which persists the data to disk.
+```
++------------+
+| MemRowSet  |
++------------+
+     |
+     | Flush process writes entries in memory to a new DiskRowSet on disk
+     v
++--------------+  +--------------+    +--------------+
+| DiskRowSet 0 |  | DiskRowSet 1 | .. | DiskRowSet N |
++-------------+-  +--------------+    +--------------+
+```
+When the data is flushed, it is stored as a set of CFiles (see src/kudu/cfile/README).
+Each of the rows in the data is addressable by a sequential "rowid", which is
+dense, immutable, and unique within this DiskRowSet. For example, if a given
+DiskRowSet contains 5 rows, then they will be assigned rowid 0 through 4, in
+order of ascending key. Within a different DiskRowSet, there will be different
+rows with the same rowids.
+
+Reads may map between primary keys (user-visible) and rowids (internal) using an index
+structure. If the primary key is a simple (single-column) key, this index is
+embedded within the primary key column's cfile. Otherwise, a separate index cfile
+stores the encoded compound key and provides a similar function.
+
+NOTE: rowids are not explicitly stored with each row; rather, a rowid is an
+implicit identifier derived from the row's ordinal position in the file. Some parts
+of the source code refer to rowids as "row indexes" or "ordinal indexes".
+
+NOTE: other systems such as C-Store call the MemRowSet the
+"write optimized store" (WOS), and the on-disk files the "read-optimized store"
+(ROS).
+
+============================================================
+Historical MVCC in DiskRowSets
+============================================================
+
+In order to continue to provide MVCC for on-disk data, each on-disk RowSet
+consists not only of the current columnar data, but also "UNDO" records which
+provide the ability to roll back a row's data to an earlier version.
+```
++--------------+       +-----------+
+| UNDO records | <---  | base data |
++--------------+       +-----------+
+- time of data progresses to the right --->
+```
+When a user wants to read the most recent version of the data immediately after
+a flush, only the base data is required. Because the base data is stored in a
+columnar format, this common case is very efficient. If instead, the user wants
+to run a time-travel query, the read path consults the UNDO records in order to
+roll back the visible data to the earlier point in time.
+
+When a scanner encounters a row, it processes the MVCC information as follows:
+  - Read the base image of the row.
+  - For each UNDO record:
+    - If the associated timestamp is NOT committed in the scanner's snapshot,
+      execute the rollback change.
+
+For example, recall the series of mutations used in "MVCC Mutations in MemRowSet" above:
+```
+  INSERT INTO t VALUES ("row", 1);         [timestamp 1]
+  UPDATE t SET val = 2 WHERE key = "row";  [timestamp 2]
+  DELETE FROM t WHERE key = "row";         [timestamp 3]
+  INSERT INTO t VALUES ("row", 3);         [timestamp 4]
+```
+When this row is flushed to disk, we store it on disk in the following way:
+```
+    Base data:
+       ("row", 3)
+    UNDO records (roll-back):
+       Before Tx 4: DELETE
+       Before Tx 3: INSERT ("row", 2)
+       Before Tx 2: SET val=1
+       Before Tx 1: DELETE
+```
+Each UNDO record is the inverse of the transaction which triggered it -- for example
+the INSERT at transaction 1 turns into a "DELETE" when it is saved as an UNDO record.
+
+The use of the UNDO record here acts to preserve the insertion timestamp:
+queries whose MVCC snapshot indicates Tx 1 is not yet committed will execute
+the DELETE "UNDO" record, such that the row is made invisible.
+
+For example, consider two different example scanners:
+```
+  Current time scanner (all txns committed)
+  -----------------------------------------
+  - Read base data
+  - Since tx 1-4 are committed, ignore all UNDO records
+  - No REDO records
+  Result: current row ("row", 3)
+
+
+  Scanner as of timestamp 1
+  -------------------------
+  - Read base data. Buffer = ("row", 3)
+  - Rollback Tx 4:  Buffer = <deleted>
+  - Rollback Tx 3:  Buffer = ("row", 2)
+  - Rollback Tx 2:  Buffer = ("row", 1)
+  Result: ("row", 1)
+```
+Each case processes the correct set of UNDO records to yield the state of the row as of
+the desired point of time.
+
+The most common case of queries will run against "current" data. In
+that case, we would like to optimize query execution by avoiding the processing of any
+UNDO records. To do so, we include file-level metadata indicating
+the range of transactions for which UNDO records are present. If the scanner's MVCC
+snapshot indicates that all of these transactions are already committed, then the set
+of deltas may be short circuited, and the query can proceed with no MVCC overhead.
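+
+A sketch of that read path (hypothetical record layout; 'is_committed' again
+stands in for the snapshot check, and 'max_timestamp' for the file-level
+metadata described above):
+
+```
+#include <cstdint>
+#include <functional>
+#include <vector>
+
+struct UndoRecord {
+  int64_t timestamp;   // timestamp of the mutation this record rolls back
+  bool undoes_insert;  // true if the rollback removes the row entirely
+  // old column values for UPDATE rollbacks omitted for brevity
+};
+
+// Returns whether the row is visible as of the snapshot after rollback.
+bool RowVisibleAfterRollback(const std::vector<UndoRecord>& undos,
+                             int64_t max_timestamp,
+                             const std::function<bool(int64_t)>& is_committed) {
+  // Short circuit: if every transaction with an UNDO record present is
+  // committed in the snapshot, the base data is already correct as-is.
+  if (is_committed(max_timestamp)) return true;
+  bool visible = true;
+  for (const UndoRecord& u : undos) {
+    if (is_committed(u.timestamp)) continue;  // change is visible: keep it
+    if (u.undoes_insert) visible = false;     // e.g. "Before Tx 1: DELETE"
+    // else: copy the old column values back into the row buffer
+  }
+  return visible;
+}
+```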
+
+============================================================
+Handling mutations against on-disk files
+============================================================
+
+Updates or deletes of already-flushed rows do not go into the MemRowSet.
+Instead, the updated key is searched for among all RowSets in order to locate
+the unique RowSet which holds this key. This process first uses an interval
+tree to locate a set of candidate rowsets which may contain the key in question.
+Following this, we consult a bloom filter for each of those candidates. For
+rowsets which pass both checks, we seek the primary key index to determine
+the row's rowid within that rowset.
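+
+A sketch of this lookup path, using hypothetical interfaces (the real code
+uses Kudu's interval tree, bloom filter, and key index classes):
+
+```
+#include <cstdint>
+#include <optional>
+#include <string>
+#include <utility>
+#include <vector>
+
+// Hypothetical RowSet interface, reduced to what the lookup path needs.
+class RowSet {
+ public:
+  virtual ~RowSet() = default;
+  virtual bool MayContainKey(const std::string& key) const = 0;  // bloom check
+  virtual std::optional<int64_t> FindRowId(const std::string& key) const = 0;
+};
+
+// 'candidates' plays the role of the interval tree result: the RowSets whose
+// key ranges cover the probe key.
+std::optional<std::pair<RowSet*, int64_t>> LocateRow(
+    const std::vector<RowSet*>& candidates, const std::string& key) {
+  for (RowSet* rs : candidates) {
+    if (!rs->MayContainKey(key)) continue;  // bloom filter rules it out
+    if (auto rowid = rs->FindRowId(key)) {  // primary key index seek
+      return std::make_pair(rs, *rowid);
+    }
+  }
+  return std::nullopt;  // key is not present in any RowSet
+}
+```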
+
+Once the appropriate RowSet has been determined, the mutation is also
+tagged with the key's rowid within the RowSet (known as a result of the same
+key search which verified that the key is present in the RowSet). The
+mutation can then enter an in-memory structure called the DeltaMemStore.
+
+The DeltaMemStore is an in-memory concurrent BTree keyed by a composite key of the
+rowid and the mutating timestamp. At read time, these mutations
+are processed in the same manner as the mutations for newly inserted data.
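+
+A sketch of that composite key (hypothetical layout): ordering by rowid first
+keeps all deltas for a given row adjacent, in timestamp order:
+
+```
+#include <cstdint>
+#include <tuple>
+
+struct DeltaKey {
+  int64_t rowid;      // ordinal of the mutated row within this RowSet
+  int64_t timestamp;  // timestamp of the mutation
+
+  bool operator<(const DeltaKey& other) const {
+    return std::tie(rowid, timestamp) <
+           std::tie(other.rowid, other.timestamp);
+  }
+};
+```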
+
+When the DeltaMemStore grows too large, it performs a flush to an
+on-disk DeltaFile and resets itself to become empty:
+```
++------------+      +---------+     +---------+     +----------------+
+| base data  | <--- | delta 0 | <-- | delta N | <-- | delta memstore |
++------------+      +---------+     +---------+     +----------------+
+```
+The DeltaFiles contain the same type of information as the DeltaMemStore,
+but compacted to a dense on-disk serialized format. Because these delta files
+contain records of transactions that need to be re-applied to the base data
+in order to bring rows up-to-date, they are called "REDO" files, and the
+mutations contained are called "REDO" records. Similar to data resident in the
+MemRowSet, REDO mutations need to be applied to read newer versions of the data.
+
+A given row may have delta information in multiple delta structures. In that
+case, the deltas are applied sequentially, with later modifications winning
+over earlier modifications.
+
+Note that the mutation tracking structure for a given row does not
+necessarily include the entirety of the row. If only a single column of a row
+is updated, then the mutation structure will only include the updated column.
+This allows for fast updates of small columns without the overhead of reading
+or re-writing larger columns (an advantage compared to the MVCC techniques used
+by systems such as C-Store and PostgreSQL).
+
+============================================================
+Summary of delta file processing
+============================================================
+
+In summary, each DiskRowSet consists of three logical components:
+```
++--------------+       +-----------+      +--------------+
+| UNDO records | <---  | base data | ---> | REDO records |
++--------------+       +-----------+      +--------------+
+```
+Base data:    the columnar data for the RowSet, at the time the RowSet was flushed
+
+UNDO records: historical data which needs to be processed to rollback rows to
+              points in time prior to the RowSet flush.
+
+REDO records: data which needs to be processed in order to bring rows up to date
+              with respect to modifications made after the RowSet was flushed.
+
+UNDO records and REDO records are stored in the same file format, called a DeltaFile.
+
+============================================================
+Delta Compactions
+============================================================
+
+Within a RowSet, reads become less efficient as more mutations accumulate
+in the delta tracking structures; in particular, each flushed delta file
+will have to be seeked and merged as the base data is read. Additionally,
+if a record has been updated many times, many REDO records have to be
+applied in order to expose the most current version to a scanner.
+
+In order to mitigate this and improve read performance, Kudu performs background
+processing which transforms a RowSet from inefficient physical layouts to more
+efficient ones, while maintaining the same logical contents. These types
+of transformations are called "delta compactions". Delta compactions serve
+several main goals:
+
+1) Reduce the number of delta files
+
+  The more delta files that have been flushed for a RowSet, the more separate
+  files must be read in order to produce the current version of a row. In
+  workloads that do not fit in RAM, each random read will result in a disk seek
+  for each of the delta files, causing performance to suffer.
+
+2) Migrate REDO records to UNDO records
+
+  As described above, a RowSet consists of base data (stored per-column),
+  a set of "undo" records (to move back in time), and a set of "redo" records
+  (to move forward in time from the base data). Given that most queries will be
+  made against the present version of the database, we would like to minimize
+  the number of REDO records stored.
+
+  At any point, a row's REDO records may be merged into the base data, and
+  replaced by an equivalent set of UNDO records containing the old versions
+  of the cells.
+
+3) Garbage collect old UNDO records.
+
+  UNDO records need to be retained only as far back as a user-configured
+  historical retention period. Beyond this period, we can remove old "undo"
+  records to save disk space.
+
+NOTE: In the BigTable design, timestamps are associated with data, not with changes.
+In the Kudu design, timestamps are associated with changes, not with data. After historical
+UNDO logs have been removed, there is no remaining record of when any row or
+cell was inserted or updated. If users need this functionality, they should
+keep their own "inserted_on" timestamp column, as they would in a traditional RDBMS.
+
+============================================================
+Types of Delta Compaction
+============================================================
+
+A delta compaction may be classified as either 'minor' or 'major':
+
+Minor delta compaction:
+------------------------
+
+A 'minor' compaction is one that does not include the base data. In this
+type of compaction, the resulting file is itself a delta file.
+```
++------------+      +---------+     +---------+     +---------+     +---------+
+| base data  | <--- | delta 0 | <-- | delta 1 | <-- | delta 2 | <-- | delta 3 |
++------------+      +---------+     +---------+     +---------+     +---------+
+                    \_________________________________________/
+                           files selected for compaction
+
+  =====>
+
++------------+      +---------+     +-----------------------+
+| base data  | <--- | delta 0 | <-- | delta 1 (old delta 3) |
++------------+      +---------+     +-----------------------+
+                    \_________/
+                  compaction result
+```
+
+Minor delta compactions serve only goals 1 and 3: because they do not read or re-write
+base data, they cannot transform REDO records into UNDO.
+
+Major delta compaction:
+------------------------
+
+A 'major' compaction is one that includes the base data along with any number
+of delta files.
+```
++------------+      +---------+     +---------+     +---------+     +---------+
+| base data  | <--- | delta 0 | <-- | delta 1 | <-- | delta 2 | <-- | delta 3 |
++------------+      +---------+     +---------+     +---------+     +---------+
+\_____________________________________________/
+      files selected for compaction
+
+  =====>
+
++------------+      +----------------+      +-----------------------+     +-----------------------+
+| new UNDOs  | -->  | new base data  | <--- | delta 0 (old delta 2) | <-- | delta 1 (old delta 3) |
++------------+      +----------------+      +-----------------------+     +-----------------------+
+\____________________________________/
+           compaction result
+```
+Major delta compactions can satisfy all three goals of delta compactions, but cost
+more than minor delta compactions, since they must read and re-write the base data,
+which is typically larger than the delta data.
+
+A major delta compaction may be performed against any subset of the columns
+in a DiskRowSet -- if only a single column has received a significant number of updates,
+then a compaction can be performed which only reads and rewrites that column. It is
+assumed that this is a common workload in many EDW-like applications (e.g. updating
+an `order_status` column in an order table, or a `visit_count` column in a user table).
+
+Note that both types of delta compactions maintain the row ids within the RowSet:
+hence, they can be done entirely in the background with no locking. The resulting
+compaction file can be introduced into the RowSet by atomically swapping it with
+the compaction inputs. After the swap is complete, the pre-compaction files may
+be removed.
+
+============================================================
+Merging compactions
+============================================================
+
+As more data is inserted into a tablet, more and more DiskRowSets will accumulate.
+This can hurt performance for the following cases:
+
+a) Random access (get or update a single row by primary key)
+
+In this case, each RowSet whose key range includes the probe key must be individually consulted to
+locate the specified key.  Bloom filters can mitigate the number of physical seeks, but extra bloom
+filter accesses can impact CPU and also increase memory usage.
+
+b) Scan with specified range (e.g. scan where primary key between 'A' and 'B')
+
+In this case, each RowSet with an overlapping key range must be individually seeked, regardless of
+bloom filters.  Specialized index structures might be able to assist here, but again at
+the cost of memory, etc.
+
+c) Sorted scans
+
+If the user query requires that the scan result be yielded in primary-key-sorted
+order, then the results must be passed through a merge process. Merging is typically
+logarithmic in the number of inputs: as the number of inputs grows, the merge
+becomes more expensive (see the sketch below).
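+
+As a minimal sketch of why, with sorted integer runs standing in for key-sorted
+RowSet iterators: a heap-based k-way merge pays O(log k) comparisons for each
+row it yields.
+
+```
+#include <functional>
+#include <queue>
+#include <utility>
+#include <vector>
+
+std::vector<int> MergeSortedRuns(const std::vector<std::vector<int>>& runs) {
+  using Entry = std::pair<int, size_t>;  // (value, index of source run)
+  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
+  std::vector<size_t> pos(runs.size(), 0);
+  for (size_t i = 0; i < runs.size(); i++) {
+    if (!runs[i].empty()) heap.emplace(runs[i][0], i);
+  }
+  std::vector<int> out;
+  while (!heap.empty()) {
+    auto [value, run] = heap.top();  // O(log k) heap work per row yielded
+    heap.pop();
+    out.push_back(value);
+    if (++pos[run] < runs[run].size()) heap.emplace(runs[run][pos[run]], run);
+  }
+  return out;
+}
+```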
+
+Given the above, it is desirable to merge RowSets together to reduce the number of
+RowSets:
+```
++------------+
+| RowSet 0   |
++------------+
+
++------------+ \
+| RowSet 1   | |
++------------+ |
+               |
++------------+ |                            +--------------+
+| RowSet 2   | |===> RowSet compaction ===> | new RowSet 1 |
++------------+ |                            +--------------+
+               |
++------------+ |
+| RowSet 3   | |
++------------+ /
+```
+
+Unlike Delta Compactions described above, note that row ids are _not_ maintained
+in a Merging Compaction. This makes the handling of concurrent mutations a somewhat
+intricate dance. This process is described in more detail in 'compaction.txt' in this
+directory.
+
+============================================================
+Overall picture
+============================================================
+
+Go go gadget ASCII art!
+```
++-----------+
+| MemRowSet |
++-----------+
+  |
+  | flush: creates a new DiskRowSet 0
+  v
++---------------+
+| DiskRowSet 0  |
++---------------+
+
+DiskRowSet 1:
++---------+     +------------+      +---------+     +---------+     +---------+     +---------+
+| UNDOs 0 | --> | base data  | <--- | REDOs 0 | <-- | REDOs 1 | <-- | REDOs 2 | <-- | REDOs 3 |
++---------+     +------------+      +---------+     +---------+     +---------+     +---------+
+\____________________________________________________________/
+                           | major compaction
+                           v
+
++---------+     +------------+      +---------+     +---------+
+| UNDOs 0'| --> | base data' | <--- | REDOs 2 | <-- | REDOs 3 |
++---------+     +------------+      +---------+     +---------+
+\____________________________/
+      compaction result
+
+
+DiskRowSet 2:
++---------+     +------------+      +---------+     +---------+     +---------+     +---------+
+| UNDOs 0 | --> | base data  | <--- | REDOs 0 | <-- | REDOs 1 | <-- | REDOs 2 | <-- | REDOs 3 |
++---------+     +------------+      +---------+     +---------+     +---------+     +---------+
+                                    \_________________________/
+                                         | minor compaction
+                                         v
++---------+     +------------+      +---------+      +---------+     +---------+
+| UNDOs 0 | --> | base data  | <--- | REDOs 0'|  <-- | REDOs 2 | <-- | REDOs 3 |
++---------+     +------------+      +---------+      +---------+     +---------+
+                                    \_________/
+                                 compaction result
+
++-----------------+ \
+| DiskRowSet 3    | |
++-----------------+ |
+                    |
++-----------------+ |                              +----------------+
+| DiskRowSet 4    | |===> Merging compaction ===>  | new DiskRowSet |
++-----------------+ |                              +----------------+
+                    |
++-----------------+ |
+| DiskRowSet 5    | |
++-----------------+ /
+```
+
+============================================================
+Comparison to BigTable approach
+============================================================
+
+This design differs from the approach used in BigTable in a few key ways:
+
+1) A given key is only present in at most one RowSet in the tablet.
+
+In BigTable, a key may be present in several different SSTables. An entire
+Tablet in BigTable looks more like the RowSet in Kudu -- any read of a key
+must merge together data found in all of the SSTables, just like a single
+row lookup in Kudu must merge together the base data with all of the DeltaFiles.
+
+The advantage of the Kudu approach is that, when reading a row, or servicing a query
+for which sort-order is not important, no merge is required. For example,
+an aggregate over a range of keys can individually scan each RowSet (even
+in parallel) and then sum the results, since the order in which keys are
+presented is not important. Similarly, selects without an explicit
+'ORDER BY primary_key' specification do not need to conduct a merge. Avoiding
+these merges makes scans cheaper, particularly as the number of RowSets grows.
+
+The disadvantage here is that, unlike BigTable, inserts and mutations
+are distinct operations: inserts must go into the MemRowSet, whereas
+mutations (delete/update) must go into the DeltaMemStore in the specific RowSet
+containing that key. This has performance impacts as follows:
+
+  a) Inserts must determine that they are in fact new keys.
+
+  This results in a bloom filter query against all present RowSets. If
+  any RowSet indicates a possible match, then a seek must be performed
+  against the key column(s) to determine whether it is in fact an
+  insert or update.
+
+  It is assumed that, so long as the number of RowSets is small, and the
+  bloom filters accurate enough, the vast majority of inserts will not
+  require any physical disk seeks. Additionally, if the key pattern
+  for inserts is locally sequential (e.g. '<host>_<timestamp>' in a time-series
+  application), then the blocks corresponding to those keys are likely to
+  be kept in the data block cache due to their frequent usage.
+
+  b) Updates must determine which RowSet they correspond to.
+
+  Similar to above, this results in a bloom filter query against
+  all RowSets, as well as a primary key lookup against any matching RowSets.
+
+One advantage to this difference is that the semantics are more familiar to
+users who are accustomed to relational databases, where an INSERT of a duplicate
+primary key gives a Primary Key Violation error rather than replacing the
+existing row. Similarly, an UPDATE of a row which does not exist can give
+a key violation error, indicating that no rows were updated. These semantics
+are not generally provided by BigTable-like systems.
+
+2) Mutations of on-disk data are applied by numeric rowid rather than
+   by arbitrary key.
+
+In order to reconcile a key on disk with its potentially-mutated form,
+BigTable performs a merge based on the row's key. These keys may be arbitrarily
+long strings, so comparison can be expensive. Additionally, even if the
+key column is not needed to service a query (e.g. an aggregate computation),
+the key column must be read off disk and processed, which causes extra IO.
+Given that composite keys are often used in BigTable applications, the key size
+may dwarf the size of the column of interest by an order of magnitude, especially
+if the queried column is stored in a dense encoding.
+
+In contrast, mutations in Kudu are stored by rowid. So, merges can proceed
+much more efficiently by maintaining counters: given the next mutation to apply,
+we can simply subtract to find how many rows of unmutated base data may be passed
+through unmodified. Alternatively, direct addressing can be used to efficiently
+"patch" entire blocks of base data given a set of mutations.
+
+Additionally, if the key is not needed in the query results, the query plan
+need not consult the key except perhaps to determine scan boundaries.
+
+As an example, consider the query:
+```
+  SELECT SUM(cpu_usage) FROM timeseries WHERE machine = 'foo.cloudera.com'
+    AND unix_time BETWEEN 1349658729 AND 1352250720;
+```
+... given a composite primary key (host, unix_time).
+
+This may be evaluated in Kudu with the following pseudo-code:
+```
+  sum = 0
+  foreach RowSet:
+    start_rowid = rowset.lookup_key(1349658729)
+    end_rowid = rowset.lookup_key(1352250720)
+    iter = rowset.new_iterator("cpu_usage")
+    iter.seek(start_rowid)
+    remaining = end_rowid - start_rowid
+    while remaining > 0:
+      block = iter.fetch_upto(remaining)
+      sum += sum(block)
+```
+
+The fetching of blocks can be done very efficiently since the application
+of any potential mutations can simply index into the block and replace
+any mutated values with their new data.
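+
+A sketch of that patching step (hypothetical layout, assuming mutations carry
+the rowid they apply to and a new value for the scanned column):
+
+```
+#include <cstdint>
+#include <vector>
+
+struct CellUpdate {
+  int64_t rowid;  // row the mutation applies to
+  int32_t value;  // new value for the scanned column
+};
+
+// Patch a fetched block of base data in place. Because mutations are keyed by
+// rowid, each one indexes directly into the block: no key comparisons needed.
+void PatchBlock(int64_t block_start_rowid, std::vector<int32_t>* block,
+                const std::vector<CellUpdate>& updates) {
+  for (const CellUpdate& u : updates) {
+    int64_t offset = u.rowid - block_start_rowid;
+    if (offset >= 0 && offset < static_cast<int64_t>(block->size())) {
+      (*block)[offset] = u.value;
+    }
+  }
+}
+```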
+
+3) Timestamps are not part of the data model
+
+In BigTable-like systems, the timestamp of each cell is exposed to the user, and
+essentially forms the last element of a composite row key. This means that it is
+efficient to directly access some particular version of a cell, and store entire
+time series as many different versions of a single cell. This is not efficient
+in Kudu -- timestamps should be considered an implementation detail used for MVCC,
+not another dimension in the row key. Instead, Kudu provides native composite row keys
+which can be useful for time series.
+
+
+============================================================
+Comparing the MVCC implementation to other databases
+============================================================
+
+C-Store/Vertica
+---------------
+C-Store provides MVCC by adding two extra columns to each table: an insertion epoch
+and a deletion epoch. Epochs in Vertica are essentially equivalent to timestamps in
+Kudu. When a row is inserted, the transaction's epoch is written in the row's epoch
+column. The deletion epoch column is initially NULL. When a row is deleted, the epoch
+of the deletion transaction is written into that column. As a scanner iterates over
+the table, it only includes rows where the insertion epoch is committed and the
+deletion epoch is either NULL or uncommitted.
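+
+A sketch of that visibility test (hypothetical representation, with epoch 0
+standing in for a NULL deletion epoch):
+
+```
+#include <cstdint>
+#include <functional>
+
+// A row is visible iff its insertion epoch is committed and its deletion
+// epoch is either absent or not yet committed.
+bool RowVisible(int64_t insertion_epoch, int64_t deletion_epoch,
+                const std::function<bool(int64_t)>& is_committed) {
+  if (!is_committed(insertion_epoch)) return false;
+  return deletion_epoch == 0 || !is_committed(deletion_epoch);
+}
+```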
+
+Updates in Vertica are always implemented as a transactional DELETE followed by a
+re-INSERT. So, the old version of the row has the update's epoch as its deletion epoch,
+and the new version of the row has the update's epoch as its insertion epoch.
+
+This has the downside that even updates of one small column must read all of the columns
+for that row, incurring many seeks and additional IO overhead for logging the re-insertion.
+Additionally, because both versions of the row need to be retained, the space usage of the
+row is doubled. If a row is being frequently updated, then the space usage will
+increase significantly, even if only a single column of the row has been changed.
+
+In contrast, Kudu does not need to read the other columns, and only needs to re-store
+the columns which have changed, which should yield much improved UPDATE throughput
+for online applications.
+
+References:
+ - http://vertica-forums.com/viewtopic.php?f=48&t=345&start=10
+ - http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf
+
+
+PostgreSQL
+----------
+PostgreSQL's MVCC implementation is very similar to Vertica's. Each tuple has an associated
+"xmin" and "xmax" column. "xmin" contains the timestamp when the row was inserted, and "xmax"
+contains the timestamp when the row was deleted or updated.
+
+PostgreSQL has the same downsides as C-Store in that a frequently updated row will end up
+replicated many times in the tablespace, taking up extra storage and IO. The overhead is not
+as bad, though, since Postgres is a row-store, and thus re-reading all of the N columns for an
+update does not incur N separate seeks.
+
+References:
+ - postgres source code
+ - http://www.packtpub.com/article/transaction-model-of-postgresql
+
+Oracle Database
+---------------
+Oracle's MVCC and time-travel implementations are somewhat similar to
+Kudu's. Its MVCC operates on physical blocks rather than records. Whenever a
+block is modified, it is modified in place and a compensating UNDO record is
+written to a Rollback Segment (RBS) in the transaction log.  The block header is
+then modified to point to the Rollback Segment which contains the UNDO record.
+
+When readers read a block, the read path looks at the data block header to
+determine if rollback is required. If so, it reads the associated rollback
+segment to apply UNDO logs.
+
+This has the downside that the rollback segments are allocated based on the
+order of transaction commit, and thus are not likely to be sequentially laid out
+with regard to the order of rows being read. So, scanning through a table in a
+time travel query may require a random access to retrieve associated UNDO logs
+for each block, whereas in Kudu, the undo logs have been sorted and organized by
+row-id.
+
+NOTE: the above is very simplified, but the overall idea is correct.
+
+References:
+ - http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:275215756923

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/src/kudu/cfile/README
----------------------------------------------------------------------
diff --git a/src/kudu/cfile/README b/src/kudu/cfile/README
deleted file mode 100644
index cda5989..0000000
--- a/src/kudu/cfile/README
+++ /dev/null
@@ -1,186 +0,0 @@
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-
-CFile is a simple columnar format which stores multiple related B-Trees.
-
-
-File format
------------------
-
-<header>
-<blocks>
-<btree root blocks>
-<footer>
-EOF
-
-
-Header
-------
-
-<magic>: the string 'kuducfil'
-<header length>: 32-bit unsigned integer length delimiter
-<header>: CFileHeaderPB protobuf
-
-
-Footer
-------
-
-<footer>: CFileFooterPB protobuf
-<magic>: the string 'kuducfil'
-<footer length> (length of protobuf)
-
-
-==============================
-
-Data blocks:
-
-Data blocks are stored with various types of encodings.
-
-* Prefix Encoding
-
-Currently used for STRING blocks. This is based on the encoding used
-by LevelDB for its data blocks, more or less.
-
-Starts with a header of four uint32s, group-varint coded:
-  <num elements>       \
-  <ordinal position>   |
-  <restart interval>   |  group varint 32
-  <unused>             /
-
-Followed by prefix-compressed values. Each value is stored relative
-to the value preceding it using the following format:
-
-  shared_bytes: varint32
-  unshared_bytes: varint32
-  delta: char[unshared_bytes]
-
-Periodically there will be a "restart point" which is necessary for
-faster binary searching. At a "restart point", shared_bytes is
-0 but otherwise the encoding is the same.
-
-At the end of the block is a trailer with the offsets of the
-restart points:
-
-  restart_points[num_restarts]:  uint32
-  num_restarts: uint32
-
-The restart points are offsets relative to the start of the block,
-including the header.
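-
-As a sketch (a hypothetical helper, not the actual reader code), a value at a
-non-restart position is reconstructed from the previously decoded value:
-
-  #include <cstdint>
-  #include <string>
-
-  // Rebuild 'out' from the previous value plus one prefix-compressed entry.
-  void DecodeEntry(const std::string& prev, uint32_t shared_bytes,
-                   const char* delta, uint32_t unshared_bytes,
-                   std::string* out) {
-    out->assign(prev, 0, shared_bytes);  // reuse the shared prefix
-    out->append(delta, unshared_bytes);  // append the unshared suffix
-  }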
-
-
-* Group Varint Frame-Of-Reference Encoding
-
-Used for uint32 blocks.
-
-Starts with a header:
-
-<num elements>     \
-<min element>      |
-<ordinal position> | group varint 32
-<unused>           /
-
-The ordinal position is the ordinal position of the first item in the
-file. For example, the first data block in the file has ordinal position
-0. If that block had 400 data entries, then the second data block would
-have ordinal position 400.
-
-Followed by the actual data, each set of 4 integers using group-varint.
-The last group is padded with 0s.
-Each integer is relative to the min element in the header.
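-
-As a sketch (hypothetical; decode_group_varint32 stands in for a real
-group-varint decoder of four uint32s), reconstructing one group looks like:
-
-  #include <cstdint>
-
-  const uint8_t* decode_group_varint32(const uint8_t* p, uint32_t out[4]);
-
-  // Decode one group of four deltas and rebase them on the block's minimum.
-  const uint8_t* DecodeGroup(const uint8_t* p, uint32_t min_element,
-                             uint32_t out[4]) {
-    p = decode_group_varint32(p, out);
-    for (int i = 0; i < 4; i++) out[i] += min_element;
-    return p;
-  }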
-
-==============================
-
-Nullable Columns
-
-If a column is marked as nullable in the schema, a bitmap is used to keep track
-of the null and not null rows.
-
-The bitmap is added at the beginning of the data block, and it is RLE-encoded.
-
-  <num elements in the block>   : vint
-  <null bitmap size>            : vint
-  <null bitmap>                 : RLE encoding
-  <encoded non-null values>     : encoded data
-
-Data Block Example - 4 items, the first and last are nulls.
-  4        Num Elements in the block
-  1        Null Bitmap Size
-  0110     Null Bitmap
-  v2       Value of row 2
-  v3       Value of row 3
-
-==============================
-
-Index blocks:
-
-The index blocks are organized in a B-Tree. As data blocks are written,
-they are appended to the end of a leaf index block. When a leaf index
-block reaches the configured block size, it is added to another index
-block higher up the tree, and a new leaf is started. If the intermediate
-index block fills, it will start a new intermediate block and spill into
-an even higher-layer internal block.
-
-For example:
-
-                      [Int 0]
-           ------------------------------
-           |                            |
-        [Int 1]                       [Int 2]
-    -----------------            --------------
-    |       |       |            |             |
-[Leaf 0]     ...   [Leaf N]   [Leaf N+1]    [Leaf N+2]
-
-
-In this case, we wrote N leaf blocks, which filled up the node labeled
-Int 1. At this point, the writer would create Int 0 with one entry pointing
-to Int 1. Further leaf blocks (N+1 and N+2) would be written to a new
-internal node (Int 2). When the file is completed, Int 2 will spill,
-adding its entry into Int 0 as well.
-
-Note that this strategy doesn't result in a fully balanced b-tree, but instead
-results in a 100% "fill factor" on all nodes in each level except for the last
-one written.
-
-There are two types of indexes:
-
-- Positional indexes: map ordinal position -> data block offset
-
-These are used to satisfy queries like: "seek to the Nth entry in this file"
-
-- Value-based indexes: responsible for mapping value -> data block offset
-
-These are only present in files which contain data stored in sorted order
-(e.g. key columns). They can satisfy seeks by value.
-
-
-An index block is encoded similarly for both types of indexes:
-
-<key> <block offset> <block size>
-<key> <block offset> <block size>
-...
-   key: vint64 for positional, otherwise varint-length-prefixed string
-   offset: vint64
-   block size: vint32
-
-<offset to first key>   (fixed32)
-<offset to second key>  (fixed32)
-...
-   These offsets are relative to the start of the block.
-
-<trailer>
-   An IndexBlockTrailerPB protobuf
-<trailer length>
-
-The trailer protobuf includes a field which designates whether the block
-is a leaf node or internal node of the B-Tree, allowing a reader to know
-whether the pointer is to another index block or to a data block.

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/src/kudu/client/README
----------------------------------------------------------------------
diff --git a/src/kudu/client/README b/src/kudu/client/README
deleted file mode 100644
index 33bcf50..0000000
--- a/src/kudu/client/README
+++ /dev/null
@@ -1,132 +0,0 @@
-// -*- mode: c++ -*-
-//
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements.  See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership.  The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License.  You may obtain a copy of the License at
-//
-//   http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.  See the License for the
-// specific language governing permissions and limitations
-// under the License.
-/*
-
-This file contains some example code for the C++ client. It will
-probably be eventually removed in favor of actual runnable examples,
-but serves as a guide/docs for the client API design for now.
-
-See class docs for KuduClient, KuduSession, KuduTable for proper docs.
-*/
-
-// This is an example of explicit batching done by the client.
-// This would be used in contexts like interactive webapps, where
-// you are likely going to set a short timeout.
-void ExplicitBatchingExample() {
-  // Get a reference to the tablet we want to insert into.
-  // Note that this may be done without a session, either before or
-  // after creating a session, since a session isn't tied to any
-  // particular table or set of tables.
-  scoped_refptr<KuduTable> t;
-  CHECK_OK(client_->OpenTable("my_table", &t));
-
-  // Create a new session. All data-access operations must happen through
-  // a session.
-  shared_ptr<KuduSession> session(client_->NewSession());
-
-  // Setting flush mode to MANUAL_FLUSH makes the session accumulate
-  // all operations until the next Flush() call. This is sort of like
-  // TCP_CORK.
-  CHECK_OK(session->SetFlushMode(KuduSession::MANUAL_FLUSH));
-
-  // Insert 100 rows.
-  for (int i = 0; i < 100; i++) {
-    gscoped_ptr<Insert> ins = t->NewInsert();
-    ins->mutable_row()->SetInt64("key", i);
-    ins->mutable_row()->SetInt64("val", i * 2);
-    // The insert should return immediately after moving the insert
-    // into the appropriate buffers. This always returns OK unless the
-    // Insert itself is invalid (e.g. missing a key column).
-    CHECK_OK(session->Apply(ins.Pass()));
-  }
-
-  // Update a row.
-  gscoped_ptr<Update> upd = t->NewUpdate();
-  upd->mutable_row()->SetInt64("key", 1);
-  upd->mutable_row()->SetInt64("val", 1 * 2 + 1);
-
-  // Delete a row.
-  gscoped_ptr<Delete> del = t->NewDelete();
-  del->mutable_row()->SetInt64("key", 2); // only specify key.
-
-  // Setting a timeout on the session applies to the next Flush call.
-  session->SetTimeoutMillis(300);
-
-  // After accumulating all of the stuff in the batch, call Flush()
-  // to send the updates in one go. This may be done either sync or async.
-  // Sync API example:
-  {
-    // Returns an Error if any insert in the batch had an issue.
-    CHECK_OK(session->Flush());
-    // Call session->GetPendingErrors() to get errors.
-  }
-
-  // Async API example:
-  {
-    // Returns immediately, calls Callback when either success or failure.
-    CHECK_OK(session->FlushAsync(MyCallback));
-    // TBD: should you be able to use the same session before the Callback has
-    // been called? Or require that you do nothing with this session while
-    // in-flight (which is more like what JDBC does I think)
-  }
-}
-
-// This is an example of how a "bulk ingest" program might work -- one in
-// which the client just wants to shove a bunch of data in, and perhaps
-// fail if it ever gets an error.
-void BulkIngestExample() {
-  scoped_refptr<KuduTable> t;
-  CHECK_OK(client_->OpenTable("my_table", &t));
-  shared_ptr<KuduSession> session(client_->NewSession());
-
-  // If the amount of buffered data in RAM is larger than this amount,
-  // blocks the writer from performing more inserts until memory has
-  // been freed (either by inserts succeeding or timing out).
-  session->SetBufferSpace(32 * 1024 * 1024);
-
-  // Set a long timeout for this kind of usecase. This determines how long
-  // Flush() may block for, as well as how long Apply() may block due to
-  // the buffer being full.
-  session->SetTimeoutMillis(60 * 1000);
-
-  // In AUTO_FLUSH_BACKGROUND mode, the session will try to accumulate batches
-  // for optimal efficiency, rather than flushing each operation.
-  CHECK_OK(session->SetFlushMode(KuduSession::AUTO_FLUSH_BACKGROUND));
-
-  for (int i = 0; i < 10000; i++) {
-    gscoped_ptr<Insert> ins = t->NewInsert();
-    ins->mutable_row()->SetInt64("key", i);
-    ins->mutable_row()->SetInt64("val", i * 2);
-    // This will start getting written in the background.
-    // If there are any pending errors, it will return a bad Status,
-    // and the user should call GetPendingErrors().
-    // This may block if the buffer is full.
-    CHECK_OK(session->Apply(ins.Pass()));
-    if (session->HasErrors()) {
-      LOG(FATAL) << "Failed to insert some rows: " << DumpErrors(session);
-    }
-  }
-  // Blocks until remaining buffered operations have been flushed.
-  // May also use the async API per above.
-  Status s = session->Flush();
-  if (!s.ok()) {
-    LOG(FATAL) << "Failed to insert some rows: " << DumpErrors(session);
-  }
-}
-

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/src/kudu/codegen/README
----------------------------------------------------------------------
diff --git a/src/kudu/codegen/README b/src/kudu/codegen/README
deleted file mode 100644
index fb0a1e1..0000000
--- a/src/kudu/codegen/README
+++ /dev/null
@@ -1,247 +0,0 @@
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-
-===============================================================================
-Code Generation Interface
-===============================================================================
-
-The codegen directory houses code which is compiled with LLVM code generation
-utilities. The point of code generation is to have code that is generated at
-run time which is optimized to run on data specific to usage that can only be
-described at run time. For instance, code which projects rows during a scan
-relies on the types of the data stored in each of the columns, but these are
-only determined by the run-time schema. To address this, a row projector
-can be compiled with schema-specific machine code to run on the current rows.
-
-Note the following classes, whose headers are LLVM-independent and thus intended
-to be used by the rest of the project without introducing additional dependencies:
-
-CompilationManager (compilation_manager.h)
-RowProjector (row_projector.h)
-
-(Other classes also avoid LLVM headers, but they have little external use).
-
-CompilationManager
-------------------
-
-The compilation manager takes care of asynchronous compilation tasks. It
-accepts requests to compile new objects. If the requested object is already
-cached, then the compiled object is returned. Otherwise, the compilation request
-is enqueued and eventually carried out.
-
-The manager can be accessed (and thus compiled code requests can be made)
-by using the GetSingleton() method. Yes - there's a universal singleton for
-compilation management. See the header for details.
-
-The manager allows for waiting for all current compilations to finish, and can
-register its metrics (which include code cache performance) upon request.
-
-No cleanup is necessary for the CompilationManager. It registers a shutdown method
-with the exit handler.
-
-Generated objects
------------------
-
-* codegen::RowProjector - A row projector has the same interface as a
-common::RowProjector, but supports a narrower scope of row types and arenas.
-It does not allow its schema to be reset (indeed, that's the point of compiling
-to a specific schema). The row projector's behavior is fully determined by
-the base and projection schemas. As such, the compilation manager expects those
-two items when retrieving a row projector.
-
-================================================================================
-Code Generation Implementation Details
-================================================================================
-
-Code generation works by creating what is essentially an assembly language
-file for the desired object, then handing off that assembly to the LLVM
-MCJIT compiler. The LLVM backend handles generating target-dependent machine
-code. After code generation, the machine code, which is represented as a
-shared object in memory, is dynamically linked to the invoking application
-(i.e., this one), and the newly generated code becomes available.
-
-Overview of LLVM-interfacing classes
-------------------------------------
-
-Most of the interfacing with LLVM is handled by the CodeGenerator
-(code_generator.h) and ModuleBuilder (module_builder.h) classes. The CodeGenerator
-takes care of setting up static initializations that LLVM is dependent on and
-provides an interface which wraps around various calls to LLVM compilation
-functions.
-
-The ModuleBuilder takes care of the one-time construction of a module, which is
-LLVM's unit of code. A module is its own namespace containing functions that
-are compiled together. Currently, LLVM does not support having multiple
-modules per execution engine so the code is coupled with an ExecutionEngine
-instance which owns the generated code behind the scenes (the ExecutionEngine is
-the LLVM class responsible for actual compilation and running of the dynamically
-linked code). Note throughout the directory the execution engine is referred to
-(actually typedef-ed as) a JITCodeOwner, because to every single class except
-the ModuleBuilder that is all the execution engine is good for. Once the
-destructor to a JITCodeOwner object is called, the associated data is deleted.
-
-In turn, the ModuleBuilder provides a minimal interface to code-generating
-classes (classes that accept data specific to a certain request and create the
-LLVM IR - the assembly that was mentioned earlier - that is appropriate for
-the specific data). The classes fill up the module with the desired assembly.
-
-Sequence of operation
----------------------
-
-The parts come together as follows (in the case that the code cache is empty).
-
-1. External component requests some compiled object for certain runtime-
-dependent data (e.g. a row projector for a base and projection schemas).
-2. The CompilationManager accepts the request, but finds no such object
-is cached.
-3. The CompilationManager enqueues a request to compile said object to its
-own threadpool, and responds with failure to the external component.
-4. Eventually, a thread becomes available to take on the compilation task. The
-task is dequeued and the CodeGenerator's compilation method for the request is
-called.
-5. The code generator checks that code generation is enabled, and makes a call
-to the appropriate code-generating classes.
-6. The classes rely on the ModuleBuilder to compile their code, after which
-they return pointers to the requested functions.
-
-Code-generating classes
------------------------
-
-As mentioned in steps (5) and (6), the code-generating classes are responsible
-for generating the LLVM IR which is compiled at run time for whatever specific
-requests the external components have.
-
-The "code-generating classes" implement the JITWrapper (jit_wrapper.h) interface.
-The base class requires an owning reference to a JITCodeOwner, intended to be the
-owner of the JIT-compiled code that the JITWrapper derived class refers to.
-
-On top of containing the JITCodeOwner and pointers to JIT-compiled functions,
-the JITWrapper also provides methods which enable code caching. Caching compiled
-code is essential because compilation times are prohibitively slow, so satisfying
-any single request with freshly compiled code is not an option. As such, each
-piece of compiled code should be associated with some run time determined data.
-
-In the case of a row projector, this data is a pair of schemas, for the base
-and the projection. In order to work for arbitrary types (so we do not need
-multiple code caches for each different compiled object), the JITWrapper
-implementation must be able to provide a byte string key encoding of its
-associated data. This provides the key for the aforementioned cache. Similarly,
-there should be a static method which allows encoding such a key without
-generating a new instance (every time there is a request made to the manager,
-the manager needs to generate the byte string key to look it up in the cache).
-
-For instance, the JITWrapper for RowProjector code, RowProjectorFunctions, has
-the following method:
-
-static Status EncodeKey(const Schema& base, const Schema& proj,
-                        faststring* out);
-
-For any given input (pair of schemas), the JITWrapper generates a unique key
-so that the cache can be looked up for the generated row projector in later
-requests (the manager handles the cache lookups).
-
-In order to keep one homogeneous cache of all the generated code, the keys
-need to be unique across classes, which is difficult to maintain because the
-encodings could conflict by accident. For this reason, a type identifier should
-be prefixed to the beginning of every key. This identifier is an enum, with
-values for each JITWrapper derived type, thus guaranteeing uniqueness between
-classes.
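-
-For illustration, a sketch of the tagging scheme (the enum value and the
-helper are hypothetical; only the JITWrapperType name comes from the code):
-
-  enum JITWrapperType {
-    ROW_PROJECTOR = 0,
-    // each new JITWrapper subclass adds its own value here
-  };
-
-  // Prefixing a one-byte type tag guarantees keys from different
-  // JITWrapper classes can never collide in the shared cache.
-  std::string MakeCacheKey(JITWrapperType type, const std::string& encoded) {
-    std::string key(1, static_cast<char>(type));
-    key.append(encoded);
-    return key;
-  }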
-
-Guide to creating new codegenned classes
-----------------------------------------
-
-To add new classes with code generation, one needs to generate the appropriate
-JITWrapper and update the higher-level classes.
-
-First, the inputs to code generation need to be established (henceforth referred
-to as just "inputs").
-
-1. Making a new JITWrapper
-
-A new JITWrapper should derive from the JITWrapper class and expose a static
-key-generation method which returns a key given the inputs for the class. To
-satisfy the prefix condition, a new enum value must be added in
-JITWrapper::JITWrapperType.
-
-The JITWrapper derived class should have a creation method that returns
-a shared reference to an instance of itself. JITWrappers should only
-be handled through shared references because this ensures that the code owner
-within the class is kept alive exactly as long as references to the code it
-owns exist (the derived class is the only class that should contain members
-which are pointers to the desired compiled functions for the given inputs).
-
-The actual creation of the compiled code is perhaps the hardest part. See the
-section below.
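-
-A skeletal derived class might look as follows (hypothetical names; only
-the general shape follows RowProjectorFunctions):
-
-  class FooFunctions : public JITWrapper {
-   public:
-    // The factory is the only way to obtain an instance, so the
-    // JITCodeOwner in the base class lives exactly as long as references
-    // to this wrapper (and thus to the compiled functions) do.
-    static Status Create(const Inputs& inputs,
-                         scoped_refptr<FooFunctions>* out);
-
-    // Static, so the manager can compute cache keys without compiling.
-    static Status EncodeKey(const Inputs& inputs, faststring* out);
-
-   private:
-    // Raw pointer into JIT-compiled code, owned by the JITCodeOwner.
-    void (*foo_fn_)(const uint8_t* src, uint8_t* dst);
-  };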
-
-2. Updating top-level classes
-
-On top of adding the new enum value in the JITWrapper enumeration, several
-other top-level classes should provide the interfaces necessary to use the
-new codegen class (the layer of interface classes enables separate components
-of Kudu to be independent of LLVM headers).
-
-In the CodeGenerator, there should be a Compile...(inputs) function which
-creates a scoped_refptr to the derived JITWrapper class by invoking the
-class' creation method. Note that the CodeGenerator should also print
-the appropriate LLVM disassembly if the corresponding flag is enabled.
-
-The compilation manager should likewise offer a Request...(inputs) function
-that returns the requested compiled functions by generating a key with the
-static encoding method mentioned above and looking it up in the cache. If the
-cache lookup fails, the manager should submit a new compilation request. The
-cache hit metrics should be updated appropriately.
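-
-Schematically, the two entry points for a new class "Foo" would be
-(hypothetical signatures, mirroring the row projector ones):
-
-  // code_generator.h
-  Status CompileFoo(const Inputs& inputs, scoped_refptr<FooFunctions>* out);
-
-  // compilation_manager.h: returns false on a cache miss, after
-  // enqueueing an asynchronous compilation task.
-  bool RequestFoo(const Inputs& inputs, scoped_refptr<FooFunctions>* out);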
-
-Guide to code generation
-------------------------
-
-The resources at the bottom of this document provide a good reference for
-LLVM IR. However, there should be little need to use much LLVM IR because the
-majority of the LLVM code can be precompiled.
-
-If you wish to execute one of several functions A, B, or C based on input
-data which takes on the values 1, 2, or 3, then do the following:
-
-1. Write A, B, and C in an extern "C" block (to avoid name mangling) in
-codegen/precompiled.cc.
-2. When creating your derived JITWrapper class, create a ModuleBuilder. The
-builder should load your functions A, B, and C automatically.
-3. Create an LLVM IR function dependent on the inputs. That is, if the input
-for code generation is 1, then the desired function would be A. In that case,
-ask the module builder for the function named "A". Once compiled, the builder
-will offer a pointer to the compiled function.
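-
-Concretely, step (1) might look like the following in
-codegen/precompiled.cc (the signatures and bodies are hypothetical
-placeholders):
-
-  extern "C" {
-  void A(const uint8_t* row, uint8_t* out) { /* handle input 1 */ }
-  void B(const uint8_t* row, uint8_t* out) { /* handle input 2 */ }
-  void C(const uint8_t* row, uint8_t* out) { /* handle input 3 */ }
-  }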
-
-Note in the above example the only utility of code generation is avoiding
-a couple of branches which decide on A, B, or C based on input data 1, 2, or 3.
-
-Code generation gets much more mileage from constant propagation. To utilize
-this, one needs to generate a new function in LLVM IR at run time which
-passes arguments to the precompiled functions, ideally baking in relevant
-constants derived from the input data. When LLVM compiles the module, it
-propagates those constants, producing more efficient machine code.
-
-To create a function in a module at run time, you need to use a
-ModuleBuilder::LLVMBuilder. The builder emits LLVM IR dynamically. It is an
-alias for the llvm::IRBuilder<> class, whose API is available in the links at
-the bottom of this document. A worked example is available in row_projector.cc.
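-
-A hedged sketch of emitting such a wrapper with the builder (LLVM 3.x-era
-API; "ProjectRow" and 'num_cols' are hypothetical stand-ins for a
-precompiled helper and a runtime-known constant):
-
-  llvm::LLVMContext& ctx = module->getContext();
-  llvm::IRBuilder<> builder(ctx);
-
-  // Declare: void wrapper(const uint8_t* row, uint8_t* out)
-  llvm::FunctionType* fty = llvm::FunctionType::get(
-      builder.getVoidTy(),
-      { builder.getInt8PtrTy(), builder.getInt8PtrTy() }, false);
-  llvm::Function* wrapper = llvm::Function::Create(
-      fty, llvm::Function::ExternalLinkage, "wrapper", module);
-  builder.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", wrapper));
-
-  // Call the precompiled helper with the column count baked in as a
-  // constant; LLVM propagates it when the module is compiled.
-  llvm::Function* helper = module->getFunction("ProjectRow");
-  auto arg = wrapper->arg_begin();
-  llvm::Value* row = &*arg++;
-  llvm::Value* out = &*arg;
-  builder.CreateCall(helper, { builder.getInt32(num_cols), row, out });
-  builder.CreateRetVoid();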
-
-Useful resources
-----------------
-http://llvm.org/docs/doxygen/html/index.html
-http://llvm.org/docs/tutorial/
-http://llvm.org/docs/LangRef.html
-
-Debugging
----------
-
-Debug info is available by printing the generated code. See the flags declared
-in code_generator.cc for further details.

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/src/kudu/common/README
----------------------------------------------------------------------
diff --git a/src/kudu/common/README b/src/kudu/common/README
deleted file mode 100644
index d2a294c..0000000
--- a/src/kudu/common/README
+++ /dev/null
@@ -1,16 +0,0 @@
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-
-This module contains utilities and protobuf message definitions
-related to the Kudu data model and the Kudu wire protocol that are to
-be shared between client, tserver, and master.

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/src/kudu/consensus/README
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/README b/src/kudu/consensus/README
deleted file mode 100644
index b301a72..0000000
--- a/src/kudu/consensus/README
+++ /dev/null
@@ -1,280 +0,0 @@
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-
-This document introduces how Kudu will handle log replication and consistency
-using an algorithm known as Viewstamped Replication (VR) and a series of
-practical algorithms/techniques for recovery, reconfiguration, compactions,
-etc. This document introduces the concepts directly related to Kudu; for any
-missing information please refer to the original papers [1,3,4].
-
-Quorums, in Kudu, are sets of collaborating processes that serve the purpose
-of keeping a consistent, replicated log of operations on a given data set,
-e.g. a tablet. This replicated, consistent log also plays the role of the
-Write Ahead Log (WAL) for the tablet. Throughout this document we use
-"config participant" and "process" interchangeably; these do not represent
-machines or OS processes, as machines and/or application daemons will
-participate in multiple configs.
-
-============================================================
-The write ahead log (WAL)
-============================================================
-
-The WAL provides strict ordering and durability guarantees:
-
-1) If calls to Reserve() are externally synchronized, the order in
-which entries were reserved will be the order in which they are
-committed to disk.
-
-2) If fsync is enabled (via the 'log_force_fsync_all' flag -- see
-log_util.cc; note: this is _DISABLED_ by default), then every single
-transaction is guaranteed to be synchronized to disk before its
-execution is deemed successful.
-
-The log uses group commit to increase performance, primarily by allowing
-throughput to scale with the number of writer threads while maintaining
-close to constant latency.
-
-============================================================
-Basic WAL usage
-============================================================
-
-To add operations to the log, the caller must obtain the lock and
-call Reserve() with a collection of operations and a pointer to the
-reserved entry (the latter being an out parameter). Then, the caller
-may release the lock and call the AsyncAppend() method with the
-reserved entry and a callback that will be invoked upon completion of
-the append. The AsyncAppend() method performs serialization and copying
-outside of the lock.
-
-For sample usage see local_consensus.cc and mt-log-test.cc.
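-
-A minimal usage sketch based on the description above (types, signatures,
-and the CHECK_OK handling are approximations, not the exact log.h API):
-
-  LogEntry* reserved_entry;
-  {
-    std::lock_guard<std::mutex> l(log_lock);  // external synchronization
-    CHECK_OK(log->Reserve(ops, &reserved_entry));  // commit order fixed
-  }
-  // Outside the lock: AsyncAppend() serializes and copies the entry and
-  // invokes 'callback' once the append completes.
-  CHECK_OK(log->AsyncAppend(reserved_entry, callback));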
-
-=============================================================
-Group commit implementation details
-=============================================================
-
-Currently, the group commit implementation uses a blocking queue (see
-Log::entry_queue_ in log.h) and a separate long-running thread (see
-Log::AppendThread in log.cc). Since access to the queue is
-synchronized via a lock and only a single thread removes entries from
-the queue, the order in which the elements are added to the queue will
-be the same as the order in which they are removed.
-
-The size of the queue is currently based on the number of entries, but
-this will eventually be changed to be based on the total size of all
-queued entries in bytes.
-
-=============================================================
-Reserving a slot for the entry
-=============================================================
-
-Currently Reserve() allocates memory for a new entry on the heap each
-time, marks the entry internally as "reserved" via a state enum, and
-adds it to the above-mentioned queue. In the future, a ring-buffer or
-another similar data structure could be used that would take the place
-of the queue and make allocation unnecessary.
-
-============================================================
-Copying the entry contents to the reserved slot
-============================================================
-
-AsyncAppend() serializes the contents of the entry to a buffer field
-in the entry object (currently the buffer is allocated at the same
-time as the entry itself); this avoids the contention that would occur
-if a shared buffer were used.
-
-============================================================
-Synchronizing the entry contents to disk
-============================================================
-
-A separate appender thread waits until entries are added to the
-queue. Once the queue is no longer empty, the thread grabs all
-elements on the queue. Then, for each dequeued entry, the appender
-waits until the entry is marked ready (see "Copying the entry contents
-to the reserved slot" above) and then appends the entry to the current
-log segment without synchronizing the underlying file with the
-filesystem (env::WritableFile::Append()).
-
-Note: this could be further optimized by calling AppendVector() with a
-vector of buffers from all of the consumed entries.
-
-Once all entries are successfully appended, the appender thread syncs
-the file to disk (env::WritableFile::Sync()) and (again) waits until
-more entries are added to the queue, or until the queue or the
-appender thread are shut down.
-
-============================================================
-Log segment files and asynchronous preallocation
-============================================================
-
-The log uses PosixWritableFile() for underlying storage. If preallocation
-is enabled ('--log_preallocate_segments' flag, defined in log_util.cc,
-true by default), then whenever a new segment is created, the
-underlying file is preallocated to a certain size in megabytes
-('--log_segment_size_mb', defined in log_util.cc, default 64). While
-the offset in the segment file is below the preallocated length,
-the cheaper fdatasync() operation is used instead of fsync().
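-
-The sync decision can be pictured as follows (illustrative only, not the
-actual log code):
-
-  // Within the preallocated region the file length does not change, so
-  // syncing the data alone (fdatasync) suffices; past it, the inode's
-  // metadata may change as well, requiring a full fsync.
-  if (current_offset <= preallocated_bytes) {
-    PCHECK(fdatasync(fd) == 0);
-  } else {
-    PCHECK(fsync(fd) == 0);
-  }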
-
-When the size of the current segment exceeds the preallocated size, a
-task is launched in a separate thread that begins preallocating the
-underlying file for the new log segment; meanwhile, until the task
-finishes, appends still go to the existing file.  Once the new file is
-preallocated, it is renamed to the correct name for the next segment
-and is swapped in place of the current segment.
-
-When the current segment is closed without reaching the preallocated
-size, the underlying file is truncated to the last written offset
-(i.e., the actual size).
-
-============================================================
-Quorums and roles within configs
-============================================================
-
-A config in Kudu is a fault-tolerant, consistent unit that serves requests for
-a single tablet. As long as 2f+1 participants are available in a config,
-where f is the number of possibly faulty participants, the config will keep
-serving requests for its tablet, and clients are guaranteed to perceive a
-fully consistent, linearizable view of both the data and the operations on
-that data. The f parameter, defined table-wide through configuration,
-implicitly defines the size of the config: f=0 indicates a single-node
-config, f=1 a 3-node config, f=2 a 5-node config, etc. Quorums may overlap
-in the sense that each physical machine may participate in multiple configs,
-usually one for each tablet that it serves.
-
-Within a single config, in steady state, i.e. when no peer is faulty, there
-are two main types of peers: the leader peer and the follower peers.
-The leader peer dictates the serialization of operations throughout the
-config; its version of the sequence of data-altering requests is the "truth",
-and any data-altering request is only considered final (i.e. can be
-acknowledged to the client as successful) when a majority of the config
-acknowledges that it "agrees" with the leader's view of the event order.
-In practice this means that all write requests are sent directly to the
-leader, which then replicates them to a majority of the followers before
-sending an ACK to the client. Follower peers are completely passive in
-steady state, only receiving data from the leader and acknowledging back.
-Follower peers only become active when the leader process stops and one
-of the followers (if there are any) must be elected leader.
-
-Participants in a config may be assigned the following roles:
-
-LEADER - The current leader of the config, receives requests from clients
-and serializes them to other nodes.
-
-FOLLOWER - Active participants in the config, whose votes count towards
-majority, replication count etc.
-
-LEARNER - Passive participants in the config, whose votes do not count
-towards majority or replication count. New nodes joining the config
-will have this role until they catch up and can be promoted to FOLLOWER.
-
-NON_PARTICIPANT - A peer that does not participate in a particular
-config. Mostly used to mark prior participants that stopped being so
-on a configuration change.
-
-The following diagram illustrates the possible state changes:
-
-                 +------------+
-                 |  NON_PART  +---+
-                 +-----+------+   |
-    Exist. RaftConfig? |          |
-                 +-----v------+   |
-                 |  LEARNER   +   | New RaftConfig?
-                 +-----+------+   |
-                       |          |
-                 +-----v------+   |
-             +-->+  FOLLOW.   +<--+
-             |   +-----+------+
-             |         |
-             |   +-----v------+
-  Step Down  +<--+ CANDIDATE  |
-             ^   +-----+------+
-             |         |
-             |   +-----v------+
-             +<--+   LEADER   |
-                 +------------+
-
-Additionally, all states can transition to NON_PARTICIPANT on configuration
-changes and/or peer timeout/death.
-
-============================================================
-Assembling/Rebooting a RaftConfig and RaftConfig States
-============================================================
-
-Prior to starting/rebooting a peer, the state in the WAL must have been
-replayed in a bootstrap phase. This process yields an up-to-date Log and
-Tablet. The new/rebooting peer is then Init()'ed with this Log. The Log is
-queried for the last committed configuration entry (a Raft configuration
-consists of a set of peers (uuid and last known address) and hinted* roles).
-If there is none, it means this is a new config.
-
-After the peer has been Init()'ed, Start(Configuration) is called. The provided
-configuration is a hint which is only taken into account if there was no previous
-configuration*.
-
-Independently of whether the configuration is a new one (new config)
-or an old one (rebooting config), the config cannot start until a
-leader has been elected and replicates the configuration through
-consensus. This ensures that a majority of nodes agree that this is
-the most recent configuration.
-
-The provided configuration will always specify a leader -- in the case
-of a new config, it is chosen by the master, and in the case of a
-rebooted one, it is the configuration that was active before the node
-crashed. In either case, replicating this initial configuration
-entry happens in exactly the same way as any other config entry,
-i.e. the LEADER will try to replicate it to the FOLLOWERS. As usual, if
-the LEADER fails, leader election is triggered and the new LEADER will
-try to replicate a new configuration.
-
-Only after the config has successfully replicated the initial configuration
-entry is the config ready to accept writes.
-
-
-Peers in the config can therefore be in the following states:
-
-BOOTSTRAPPING - The phase prior to initialization where the Log is being
-replayed. If a majority of peers are still BOOTSTRAPPING, the config doesn't
-exist yet.
-
-CONFIGURING - Until the current configuration is pushed through consensus.
-This is true for both new configs and rebooting configs. The peers do not
-accept client requests in this state. In this state, the leader tries to
-replicate the configuration. Followers run failure detection and trigger
-leader election if the hinted leader doesn't successfully replicate within
-the configured timeout period.
-
-RUNNING - The LEADER peer accepts writes and replicates them through
-consensus. FOLLOWER replicas accept writes from the leader and ACK them.
-
-* The configuration provided on Start() can only be taken into account if there
-is an appropriate leader election algorithm. This can be added later but is not
-present in the initial implementation. Roles are hinted in the sense that the
-config initiator (usually the master) might hint what the roles for the peers
-in the config should be, but the config is the ultimate decider on whether that
-is possible or not.
-
-============================================================
-References
-============================================================
-[1] http://ramcloud.stanford.edu/raft.pdf
-
-[2] http://www.cs.berkeley.edu/~brewer/cs262/Aries.pdf
-
-[3] Viewstamped Replication: A New Primary Copy Method to Support
-Highly-Available Distributed Systems. B. Oki, B. Liskov
-http://www.pmg.csail.mit.edu/papers/vr.pdf
-
-[4] Viewstamped Replication Revisited. B. Liskov and J. Cowling 
-http://pmg.csail.mit.edu/papers/vr-revisited.pdf
-
-[5] Aether: A Scalable Approach to logging
-http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/src/kudu/master/README
----------------------------------------------------------------------
diff --git a/src/kudu/master/README b/src/kudu/master/README
deleted file mode 100644
index 26023be..0000000
--- a/src/kudu/master/README
+++ /dev/null
@@ -1,238 +0,0 @@
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-
-============================================================
-The Catalog Manager and System Tables
-============================================================
-
-The Catalog Manager keeps track of the Kudu tables and tablets defined by the
-user in the cluster.
-
-All the table and tablet information is stored in-memory in copy-on-write
-TableInfo / TabletInfo objects, as well as on-disk, in the "sys.catalog"
-Kudu system table hosted only on the Masters.  This system table is loaded
-into memory on Master startup.  At the time of this writing, the "sys.catalog"
-table consists of only a single tablet in order to provide strong consistency
-for the metadata under RAFT replication (as currently, each tablet has its own
-log).
-
-To add or modify a table or tablet, the Master first prepares the changes
-in memory without committing them, then writes and flushes the system table
-to disk, and finally makes the changes visible in memory (commits them) once
-the disk write (and, in a distributed master setup, config-based replication)
-is successful. This allows readers to access the in-memory state in a
-consistent way, even while a write is in progress.
-
-This design prevents having to go through the whole scan path to service tablet
-location calls, which would be more expensive, and allows for easily keeping
-"soft" state in the Master for every Table and Tablet.
-
-The catalog manager maintains 3 hash-maps for looking up info in the sys table:
-- [Table Id] -> TableInfo
-- [Table Name] -> TableInfo
-- [Tablet Id] -> TabletInfo
-
-The TableInfo has a map [tablet-start-key] -> TabletInfo used to provide
-the tablet locations to the user based on a key-range request.
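-
-In outline, the lookup structures might be pictured as below (illustrative
-types; the real code wraps these in locks and copy-on-write holders):
-
-  std::unordered_map<std::string, scoped_refptr<TableInfo>>  table_ids_map_;
-  std::unordered_map<std::string, scoped_refptr<TableInfo>>  table_names_map_;
-  std::unordered_map<std::string, scoped_refptr<TabletInfo>> tablet_map_;
-
-  // Inside TableInfo: start-key -> tablet, ordered so a key-range
-  // request can be answered with a range scan over this map.
-  std::map<std::string, TabletInfo*> tablets_by_start_key_;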
-
-
-Table Creation
---------------
-
-The below corresponds to the code in CatalogManager::CreateTable().
-
-1. Client -> Master request: Create "table X" with N tablets and schema S.
-2. Master: CatalogManager::CreateTable():
-   a. Validate user request (e.g. ensure a valid schema).
-   b. Verify that the table name is not already taken.
-      TODO: What about old, deleted tables?
-   c. Add (in-memory) the new TableInfo (in "preparing" state).
-   d. Add (in-memory) the TabletInfo based on the user-provided pre-split-keys
-      field (in "preparing" state).
-   e. Write the tablets info to "sys.catalog"
-      (The Master process is killed if the write fails).
-      - Master begins writing to disk.
-      - Note: If the Master crashes or restarts here or at any time previous to
-        this point, the table will not exist when the Master comes back online.
-   f. Write the table info to "sys.catalog" with the "running" state
-      (The Master process is killed if the write fails).
-      - Master completes writing to disk.
-      - After this point, the table will exist and be re-created as necessary
-        at startup time after a crash or process restart.
-   g. Commit the "running" state to memory, which allows clients to see the table.
-3. Master -> Client response: The table has been created with some ID, e.g. "xyz"
-   (or, in case something went wrong, an error message).
-
-After this point in time, the table is reported as created, which means that if
-the cluster is shut down, when it starts back up the table will still exist.
-However, the tablets are not yet created (see Table Assignment, below).
-
-
-Table Deletion
---------------
-
-When the user sends a DeleteTable request for table T, table T is marked as
-deleted by writing a "deleted" flag in the state field in T's record in the
-"sys.catalog" table, table T is removed from the in-memory "table names"
-map on the Master, and the table is marked as being "deleted" in the
-in-memory TableInfo / TabletInfo "state" field on the Master.
-TODO: Could this race with table deletion / creation??
-
-At this point, the table is no longer externally visible to clients via Master
-RPC calls, but the tablet configs that make up the table may still be up and
-running. New clients trying to open the table will get a NotFound error, while
-clients that already have the tablet locations cached may still be able to
-read and write to the tablet configs, as long as the corresponding tablet
-servers are online and their respective tablets have not yet been deleted.
-In some ways, this is similar to the design of FS unlink.
-
-The Master will asynchronously send a DeleteTablet RPC request to each tablet
-(one RPC request per tablet server in the config, for each tablet), and the
-tablets will therefore be deleted in parallel in some unspecified order. If the
-Master or tablet server goes offline before a particular DeleteTablet operation
-successfully completes, the Master will send a new DeleteTablet request at the
-time that the next heartbeat is received from the tablet that is to be deleted.
-
-A "Cleaner" process will be reponsible for removing the data from deleted tables
-and tablets in the future, both on-disk and cached in memory (TODO).
-
-
-Table Assignment (Tablet Creation)
-----------------------------------
-
-Once a table is created, the tablets must be created on a set of replicas. In
-order to do that, the master has to select the replicas and associate them to
-the tablet.
-
-For each tablet not yet created, we select a set of replicas and a leader
-and send the "create tablet" request. On the next TS heartbeat from the
-leader we can mark the tablet as "running", if it is reported. If we don't
-receive a "tablet created" report after ASSIGNMENT-TIMEOUT-MSEC, we replace
-the tablet with a new one, following these same steps for the new tablet.
-
-The assignment is processed by the "CatalogManagerBgTasks" thread. This
-thread waits for an event, which can be one of:
-
-- Create Table (need to process the new tablet for assignment)
-- Assignment Timeout (some tablet request's timeout expired; replace it)
-
-This is the current control flow:
-
-- CatalogManagerBgTasks thread:
-  1. Process Pending Assignments:
-     - For each tablet pending assignment:
-       - If tablet creation was already requested:
-          - If we did not receive a response yet, and the configurable
-            assignment timeout period has passed, mark the tablet as "replaced":
-            1. Delete the tablet if it ever reports in.
-            2. Create a new tablet in its place, add that tablet to the
-               "create table" list.
-       - Else, if the tablet is new (just created by CreateTable in "preparing" state):
-         - Add it to the "create tablet" list.
-     - Now, for each tablet in the "create tablet" list:
-       - Select a set of tablet servers to host the tablet config.
-       - Select a tablet server to be the initial config leader.
-       [BEGIN-WRITE-TO-DISK]
-       - Flush the "to create" to sys.catalog with state "creating"
-       [If something fails here, the "Process Pending Assignments" will
-        reprocess these tablets. As nothing was done, running tables will be replaced]
-       [END-WRITE-TO-DISK]
-       - For each tablet server in the config:
-         - Send an async CreateTablet() RPC request to the TS.
-           On TS-heartbeat, the Master will receive the notification of "tablet creation".
-     - Commit any changes in state to memory.
-       At this point the tablets marked as "running" are visible to the user.
-
-  2. Cleanup deleted tables & tablets (FIXME: is this implemented?):
-     - Remove the tables/tablets with "deleted" state from "sys.catalog"
-     - Remove the tablets with "deleted" state from the in-memory map
-     - Remove the tables with "deleted" state from the in-memory map
-
-When the TS receives a CreateTablet() RPC, it will attempt to create the tablet
-replica locally. Once it is successful, it will be added to the next tablet
-report. When the tablet is reported, the master-side ProcessTabletReport()
-function is called.
-
-If we find at this point that the reported tablet is in "creating" state, and
-the TS reporting the tablet is the leader selected during the assignment
-process (see CatalogManagerBgTasksThread above), the tablet will be marked as
-running and committed to disk, completing the assignment process.
-
-
-Alter Table
------------
-
-When the user sends an alter request, which may contain changes to the schema,
-table name or attributes, the Master will send a set of AlterTable() RPCs to
-each TS handling the set of tablets currently running. The Master will keep
-retrying in case of error.
-
-If a TS is down or goes down during an AlterTable request, on restart it will
-report the schema version that it is using, and if it is out of date, the Master
-will send an AlterTable request to that TS at that time.
-
-When the Master first comes online after being restarted, a full tablet report
-will be requested from each TS, and the tablet schema version sent on the next
-heartbeat will be used to determine if a given TS needs an AlterTable() call.
-
-============================================================
-Heartbeats and TSManager
-============================================================
-
-Heartbeats are sent by the TS to the master. Per master.proto, a
-heartbeat contains:
-
-1. Node instance information: permanent uuid, node sequence number
-(which is incremented each time the node is started).
-
-2. (Optional) registration. Sent either at TS startup or if the master
-responded to a previous heartbeat with "needs register" (see
-'Handling heartbeats' below for an explanation of when this response
-will be sent).
-
-3. (Optional) tablet report. Sent either when tablet information has
-changed, or if the master responded to a previous heartbeat with
-"needs a full tablet report" (see "Handling heartbeats" below for an
-explanation of when this response will be sent).
-
-Handling heartbeats
--------------------
-
-Upon receiving a heartbeat from a TS, the master will:
-
-1) Check if the heartbeat has registration info. If so, register
-the TS instance with TSManager (see "TSManager" below for more
-details).
-
-2) Retrieve a TSDescriptor from TSManager. If the TSDescriptor
-is not found, reply to the TS with "need re-register" field set to
-true, and return early.
-
-3) Update the heartbeat time (see "TSManager" below) in the
-registration object.
-
-4) If the heartbeat contains a tablet report, the Catalog Manager will
-process the report and update its cache as well as the system tables
-(see "Catalog Manager" above). Otherwise, the master will respond to
-the TS requesting a full tablet report.
-
-5) Send a success response to the TS.
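-
-In pseudocode form (hedged; the proto field and method names below are
-approximations of master.proto and the master code, not exact):
-
-  Status ProcessHeartbeat(const TSHeartbeatRequestPB& req,
-                          TSHeartbeatResponsePB* resp) {
-    if (req.has_registration()) {
-      ts_manager_->RegisterTS(req.instance(), req.registration());  // (1)
-    }
-    std::shared_ptr<TSDescriptor> desc;
-    if (!ts_manager_->LookupTS(req.instance(), &desc).ok()) {
-      resp->set_needs_reregister(true);                             // (2)
-      return Status::OK();
-    }
-    desc->UpdateHeartbeatTime();                                    // (3)
-    if (req.has_tablet_report()) {
-      catalog_manager_->ProcessTabletReport(req.tablet_report());   // (4)
-    } else {
-      resp->set_needs_full_tablet_report(true);
-    }
-    return Status::OK();                                            // (5)
-  }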
-
-TSManager
----------
-
-TSManager provides in-memory storage for information sent by the
-tablet server to the master (tablet servers that have been heard from,
-heartbeats, tablet reports, etc.). The information is stored in a
-map, where the key is the permanent uuid of a tablet server and the
-value is (a pointer to) a TSDescriptor.


