accumulo-commits mailing list archives

From mwa...@apache.org
Subject [2/7] accumulo git commit: ACCUMULO-4630 Move user manual to Accumulo website
Date Mon, 22 May 2017 17:43:21 GMT
http://git-wip-us.apache.org/repos/asf/accumulo/blob/e99ec9f0/docs/src/main/asciidoc/chapters/ssl.txt
----------------------------------------------------------------------
diff --git a/docs/src/main/asciidoc/chapters/ssl.txt b/docs/src/main/asciidoc/chapters/ssl.txt
deleted file mode 100644
index d315c6e..0000000
--- a/docs/src/main/asciidoc/chapters/ssl.txt
+++ /dev/null
@@ -1,134 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements. See the NOTICE file distributed with
-// this work for additional information regarding copyright ownership.
-// The ASF licenses this file to You under the Apache License, Version 2.0
-// (the "License"); you may not use this file except in compliance with
-// the License. You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-== SSL
-Accumulo, through Thrift's TSSLTransport, provides the ability to encrypt
-wire communication between Accumulo servers and clients using secure
-sockets layer (SSL). SSL certificates signed by the same certificate authority
-control the "circle of trust" in which a secure connection can be established.
-Typically, each host running Accumulo processes would be given a certificate
-which identifies itself.
-
-Clients can optionally also be given a certificate, when client-auth is enabled,
-which prevents unwanted clients from accessing the system. The SSL integration
-presently provides no authentication support within Accumulo (an Accumulo username
-and password are still required) and is only used to establish a means for
-secure communication.
-
-=== Server configuration
-
-As previously mentioned, the circle of trust is established by the certificate
-authority which created the certificates in use. Because of the tight coupling
-of certificate generation with an organization's policies, Accumulo does not
-provide a method in which to automatically create the necessary SSL components.
-
-Administrators without existing infrastructure built on SSL are encouraged to
-use OpenSSL and the +keytool+ command. Examples of these commands are
-included in a section below. Accumulo servers require a certificate and keystore,
-in the form of Java KeyStores, to enable SSL. The following configuration assumes
-these files already exist.
-
-In +accumulo-site.xml+, the following properties are required:
-
-* *rpc.javax.net.ssl.keyStore*=_The path on the local filesystem to the keystore containing the server's certificate_
-* *rpc.javax.net.ssl.keyStorePassword*=_The password for the keystore containing the server's certificate_
-* *rpc.javax.net.ssl.trustStore*=_The path on the local filesystem to the keystore containing the certificate authority's public key_
-* *rpc.javax.net.ssl.trustStorePassword*=_The password for the keystore containing the certificate authority's public key_
-* *instance.rpc.ssl.enabled*=_true_
-
-Optionally, SSL client-authentication (two-way SSL) can also be enabled by setting
-+instance.rpc.ssl.clientAuth=true+ in +accumulo-site.xml+.
-This requires that each client has access to a valid certificate to set up a secure connection
-to the servers. By default, Accumulo uses one-way SSL which does not require clients to have
-their own certificate.
-
-=== Client configuration
-
-To establish a connection to Accumulo servers, each client must also have
-special configuration. This is typically accomplished through the use of
-the client configuration file whose default location is +~/.accumulo/config+.
-
-The following properties must be set to connect to an Accumulo instance using SSL:
-
-* *rpc.javax.net.ssl.trustStore*=_The path on the local filesystem to the keystore containing the certificate authority's public key_
-* *rpc.javax.net.ssl.trustStorePassword*=_The password for the keystore containing the certificate authority's public key_
-* *instance.rpc.ssl.enabled*=_true_
-
-If two-way SSL is enabled (+instance.rpc.ssl.clientAuth=true+) for the instance, the client must also define
-its own certificate and enable client authentication as well.
-
-* *rpc.javax.net.ssl.keyStore*=_The path on the local filesystem to the keystore containing the client's certificate_
-* *rpc.javax.net.ssl.keyStorePassword*=_The password for the keystore containing the client's certificate_
-* *instance.rpc.ssl.clientAuth*=_true_
-
-=== Generating SSL material using OpenSSL
-
-The following is included as an example for generating your own SSL material (certificate authority and server/client
-certificates) using OpenSSL and Java's KeyTool command.
-
-==== Generate a certificate authority
-
-----
-# Create a private key
-openssl genrsa -des3 -out root.key 4096
-
-# Create a self-signed root certificate using the private key
-openssl req -x509 -new -key root.key -days 365 -out root.pem
-
-# Generate a DER-encoded (binary) version of the PEM certificate just created
-openssl x509 -outform der -in root.pem -out root.der
-
-# Import the certificate into a Java KeyStore
-keytool -import -alias root-key -keystore truststore.jks -file root.der
-
-# Remove the DER-formatted certificate file (as we don't need it anymore)
-rm root.der
-----
-
-The +truststore.jks+ file is the Java keystore which contains the certificate authority's public key.
-
-==== Generate a certificate/keystore per host
-
-It's common that each host in the instance is issued its own certificate (notably to ensure that revocation procedures
-can be easily followed). The following steps can be taken for each host.
-
-----
-# Create the private key for our server
-openssl genrsa -out server.key 4096
-
-# Generate a certificate signing request (CSR) with our private key
-openssl req -new -key server.key -out server.csr
-
-# Use the CSR and the CA to create a certificate for the server (a reply to the CSR)
-openssl x509 -req -in server.csr -CA root.pem -CAkey root.key -CAcreateserial \
-    -out server.crt -days 365
-
-# Use the certificate and the private key for our server to create PKCS12 file
-openssl pkcs12 -export -in server.crt -inkey server.key -certfile server.crt \
-    -name 'server-key' -out server.p12
-
-# Create a Java KeyStore for the server using the PKCS12 file (private key)
-keytool -importkeystore -srckeystore server.p12 -srcstoretype pkcs12 -destkeystore \
-    server.jks -deststoretype JKS
-
-# Remove the PKCS12 file as we don't need it
-rm server.p12
-
-# Import the CA-signed certificate to the keystore
-keytool -import -trustcacerts -alias server-crt -file server.crt -keystore server.jks
-----
-
-The +server.jks+ file is the Java keystore containing the certificate for a given host. The above
-methods are equivalent whether the certificate is generated for an Accumulo server or a client.
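
As a rough illustration of the client properties listed in the removed SSL chapter above, the following Java sketch assembles them for one-way and two-way SSL. It assumes the client configuration file accepts plain key=value lines; the class name, method names, paths, and password are hypothetical placeholders, not part of Accumulo.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SslClientConfigSketch {

    // Builds the client-side SSL properties named in the chapter.
    // Passing a null keyStore means one-way SSL (no client certificate).
    static Map<String, String> clientSslProperties(String trustStore, String trustStorePassword,
                                                   String keyStore, String keyStorePassword) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("instance.rpc.ssl.enabled", "true");
        props.put("rpc.javax.net.ssl.trustStore", trustStore);
        props.put("rpc.javax.net.ssl.trustStorePassword", trustStorePassword);
        if (keyStore != null) {
            // Two-way SSL: the client presents its own certificate.
            props.put("instance.rpc.ssl.clientAuth", "true");
            props.put("rpc.javax.net.ssl.keyStore", keyStore);
            props.put("rpc.javax.net.ssl.keyStorePassword", keyStorePassword);
        }
        return props;
    }

    // Renders the properties as key=value lines (assumed file format).
    static String render(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical path and password, for illustration only.
        System.out.print(render(clientSslProperties("/path/to/truststore.jks", "changeit", null, null)));
    }
}
```

Passing a non-null keystore path switches the sketch to the two-way case, which adds the three extra properties the chapter lists for client authentication.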

http://git-wip-us.apache.org/repos/asf/accumulo/blob/e99ec9f0/docs/src/main/asciidoc/chapters/summaries.txt
----------------------------------------------------------------------
diff --git a/docs/src/main/asciidoc/chapters/summaries.txt b/docs/src/main/asciidoc/chapters/summaries.txt
deleted file mode 100644
index 08d8011..0000000
--- a/docs/src/main/asciidoc/chapters/summaries.txt
+++ /dev/null
@@ -1,232 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements.  See the NOTICE file distributed with
-// this work for additional information regarding copyright ownership.
-// The ASF licenses this file to You under the Apache License, Version 2.0
-// (the "License"); you may not use this file except in compliance with
-// the License.  You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-== Summary Statistics
-
-=== Overview
-
-Accumulo has the ability to generate summary statistics about data in a table
-using user defined functions.  Currently these statistics are only generated for
-data written to files.  Data recently written to Accumulo that is still in
-memory will not contribute to summary statistics.
-
-This feature can be used to inform a user about what data is in their table.
-Summary statistics can also be used by compaction strategies to make decisions
-about which files to compact.  
-
-Summary data is stored in each file Accumulo produces.  Accumulo can gather
-summary information from across a cluster merging it along the way.  In order
-for this to be fast, the summary information should fit in cache.  There is a
-dedicated cache for summary data on each tserver with a configurable size.  In
-order for summary data to fit in cache, it should probably be small.
-
-For information on writing a custom summarizer see the javadoc for
-+org.apache.accumulo.core.client.summary.Summarizer+.  The package
-+org.apache.accumulo.core.client.summary.summarizers+ contains summarizer
-implementations that ship with Accumulo and can be configured for use.
-
-=== Inaccuracies
-
-Summary data can be inaccurate when files are missing summary data or when
-files have extra summary data. Files can contain data outside of a tablet's
-boundaries. This can happen as a result of bulk imported files and tablet splits.
-When this happens, those files could contain extra summary information.
-Accumulo partially offsets this by storing summary information for multiple row
-ranges per file.  However, the ranges are not granular enough to completely
-offset extra data.
-
-Any source of inaccuracies is reported when summary information is requested.
-In the shell examples below this can be seen on the +File Statistics+ line.
-For files missing summary information, the compact command in the shell has a
-+--sf-no-summary+ option.  This option compacts files that do not have the
-summary information configured for the table.  The compact command also has the
-+--sf-extra-summary+ option which will compact files with extra summary
-information.
-
-=== Configuring
-
-The following tablet server and table properties configure summarization.
-
-* <<appendices/config.txt#_tserver_cache_summary_size>>
-* <<appendices/config.txt#_tserver_summary_partition_threads>>
-* <<appendices/config.txt#_tserver_summary_remote_threads>>
-* <<appendices/config.txt#_tserver_summary_retrieval_threads>>
-* <<appendices/config.txt#TABLE_SUMMARIZER_PREFIX>>
-* <<appendices/config.txt#_table_file_summary_maxsize>>
-
-=== Permissions
-
-Because summary data may be derived from sensitive data, requesting summary data
-requires a special permission.  Users must have the table permission
-+GET_SUMMARIES+ in order to retrieve summary data.
-
-
-=== Bulk import
-
-When generating rfiles to bulk import into Accumulo, those rfiles can contain
-summary data.  To use this feature, look at the javadoc on the
-+AccumuloFileOutputFormat.setSummarizers(...)+ method.  Also,
-+org.apache.accumulo.core.client.rfile.RFile+ has options for creating RFiles
-with embedded summary data.
-
-=== Examples
-
-This example walks through using summarizers in the Accumulo shell.  Below a
-table is created and some data is inserted to summarize.
-
- root@uno> createtable summary_test
- root@uno summary_test> setauths -u root -s PI,GEO,TIME
- root@uno summary_test> insert 3b503bd name last Doe
- root@uno summary_test> insert 3b503bd name first John
- root@uno summary_test> insert 3b503bd contact address "123 Park Ave, NY, NY" -l PI&GEO
- root@uno summary_test> insert 3b503bd date birth "1/11/1942" -l PI&TIME
- root@uno summary_test> insert 3b503bd date married "5/11/1962" -l PI&TIME
- root@uno summary_test> insert 3b503bd contact home_phone 1-123-456-7890 -l PI
- root@uno summary_test> insert d5d18dd contact address "50 Lake Shore Dr, Chicago, IL" -l PI&GEO
- root@uno summary_test> insert d5d18dd name first Jane
- root@uno summary_test> insert d5d18dd name last Doe
- root@uno summary_test> insert d5d18dd date birth 8/15/1969 -l PI&TIME
- root@uno summary_test> scan -s PI,GEO,TIME
- 3b503bd contact:address [PI&GEO]    123 Park Ave, NY, NY
- 3b503bd contact:home_phone [PI]    1-123-456-7890
- 3b503bd date:birth [PI&TIME]    1/11/1942
- 3b503bd date:married [PI&TIME]    5/11/1962
- 3b503bd name:first []    John
- 3b503bd name:last []    Doe
- d5d18dd contact:address [PI&GEO]    50 Lake Shore Dr, Chicago, IL
- d5d18dd date:birth [PI&TIME]    8/15/1969
- d5d18dd name:first []    Jane
- d5d18dd name:last []    Doe
-
-After inserting the data, summaries are requested below.  No summaries are returned.
-
- root@uno summary_test> summaries
-
-The visibility summarizer is configured below and the table is flushed.
-Flushing the table creates a file, generating summary data in the process. The
-summary data returned counts how many times each column visibility occurred.
-The statistics with a +c:+ prefix are visibilities.  The others are generic
-statistics created by the CountingSummarizer that VisibilitySummarizer extends. 
-
- root@uno summary_test> config -t summary_test -s table.summarizer.vis=org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer
- root@uno summary_test> summaries
- root@uno summary_test> flush -w
- 2017-02-24 19:54:46,090 [shell.Shell] INFO : Flush of table summary_test completed.
- root@uno summary_test> summaries
-  Summarizer         : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {}
-  File Statistics    : [total:1, missing:0, extra:0, large:0]
-  Summary Statistics : 
-     c:                                                           = 4
-     c:PI                                                         = 1
-     c:PI&GEO                                                     = 2
-     c:PI&TIME                                                    = 3
-     emitted                                                      = 10
-     seen                                                         = 10
-     tooLong                                                      = 0
-     tooMany                                                      = 0
-
-VisibilitySummarizer has an option +maxCounters+ that determines the max number
-of column visibilities it will track.  Below this option is set and compaction
-is forced to regenerate summary data.  The new summary data only has three
-visibilities and now the +tooMany+ statistic is 4.  This is the number of
-visibilities that were not counted.
-
- root@uno summary_test> config -t summary_test -s table.summarizer.vis.opt.maxCounters=3
- root@uno summary_test> compact -w
- 2017-02-24 19:54:46,267 [shell.Shell] INFO : Compacting table ...
- 2017-02-24 19:54:47,127 [shell.Shell] INFO : Compaction of table summary_test completed for given range
- root@uno summary_test> summaries
-  Summarizer         : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3}
-  File Statistics    : [total:1, missing:0, extra:0, large:0]
-  Summary Statistics : 
-     c:PI                                                         = 1
-     c:PI&GEO                                                     = 2
-     c:PI&TIME                                                    = 3
-     emitted                                                      = 10
-     seen                                                         = 10
-     tooLong                                                      = 0
-     tooMany                                                      = 4
-
-Another summarizer is configured below that tracks the number of deletes.  Also
-a compaction strategy that uses this summary data is configured.  The
-+TooManyDeletesCompactionStrategy+ will force a compaction of the tablet when
-the ratio of deletes to non-deletes is over 25%.  This threshold is
-configurable.  Below a delete is added and it is reflected in the statistics.  In
-this case there is 1 delete and 10 non-deletes, not enough to force a
-compaction of the tablet.
-
-....
-root@uno summary_test> config -t summary_test -s table.summarizer.del=org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer
-root@uno summary_test> compact -w
-2017-02-24 19:54:47,282 [shell.Shell] INFO : Compacting table ...
-2017-02-24 19:54:49,236 [shell.Shell] INFO : Compaction of table summary_test completed for given range
-root@uno summary_test> config -t summary_test -s table.compaction.major.ratio=10
-root@uno summary_test> config -t summary_test -s table.majc.compaction.strategy=org.apache.accumulo.tserver.compaction.strategies.TooManyDeletesCompactionStrategy
-root@uno summary_test> deletemany -r d5d18dd -c date -f
-[DELETED] d5d18dd date:birth [PI&TIME]
-root@uno summary_test> flush -w
-2017-02-24 19:54:49,686 [shell.Shell] INFO : Flush of table summary_test completed.
-root@uno summary_test> summaries
- Summarizer         : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3}
- File Statistics    : [total:2, missing:0, extra:0, large:0]
- Summary Statistics : 
-    c:PI                                                         = 1
-    c:PI&GEO                                                     = 2
-    c:PI&TIME                                                    = 4
-    emitted                                                      = 11
-    seen                                                         = 11
-    tooLong                                                      = 0
-    tooMany                                                      = 4
-
- Summarizer         : org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer del {}
- File Statistics    : [total:2, missing:0, extra:0, large:0]
- Summary Statistics : 
-    deletes                                                      = 1
-    total                                                        = 11
-....
-
-Some more deletes are added and the table is flushed below.  This results in 4
-deletes and 10 non-deletes, which triggers a full compaction.  A full
-compaction of all files is the only time when delete markers are dropped.  The
-compaction ratio was set to 10 above to show that the number of files did not
-trigger the compaction.   After the compaction there are no deletes and 6 non-deletes.
-
-....
-root@uno summary_test> deletemany -r d5d18dd -f
-[DELETED] d5d18dd contact:address [PI&GEO]
-[DELETED] d5d18dd name:first []
-[DELETED] d5d18dd name:last []
-root@uno summary_test> flush -w
-2017-02-24 19:54:52,800 [shell.Shell] INFO : Flush of table summary_test completed.
-root@uno summary_test> summaries
- Summarizer         : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3}
- File Statistics    : [total:1, missing:0, extra:0, large:0]
- Summary Statistics : 
-    c:PI                                                         = 1
-    c:PI&GEO                                                     = 1
-    c:PI&TIME                                                    = 2
-    emitted                                                      = 6
-    seen                                                         = 6
-    tooLong                                                      = 0
-    tooMany                                                      = 2
-
- Summarizer         : org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer del {}
- File Statistics    : [total:1, missing:0, extra:0, large:0]
- Summary Statistics : 
-    deletes                                                      = 0
-    total                                                        = 6
-root@uno summary_test>   
-....
-
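
The delete-ratio reasoning in the removed summaries chapter above can be checked with a small Java sketch. It mirrors only the arithmetic described in the text (deletes divided by non-deletes, compared against 25%) and is not the actual TooManyDeletesCompactionStrategy implementation. The class and method names are hypothetical, the handling of a file with no non-deletes is an assumption, and the total of 14 in the second call is derived from the "4 deletes and 10 non-deletes" stated in the text rather than taken from shell output.

```java
public class DeleteRatioSketch {

    // Returns true when deletes / non-deletes is strictly above the threshold.
    // 'total' counts deletes and non-deletes together, as in the DeletesSummarizer output.
    static boolean exceedsThreshold(long deletes, long total, double threshold) {
        long nonDeletes = total - deletes;
        if (nonDeletes <= 0) {
            // Assumption: with no live entries, any delete counts as over the threshold.
            return deletes > 0;
        }
        return (double) deletes / nonDeletes > threshold;
    }

    public static void main(String[] args) {
        double threshold = 0.25; // the 25% figure quoted in the text

        // First example: deletes = 1, total = 11, so 1 / 10 = 10%.
        System.out.println(exceedsThreshold(1, 11, threshold)); // prints false

        // Second example: 4 deletes and 10 non-deletes, so 4 / 10 = 40%.
        System.out.println(exceedsThreshold(4, 14, threshold)); // prints true
    }
}
```

This matches the narrative: one delete among ten live entries stays under the threshold, while four deletes push the ratio high enough to trigger the full compaction.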

http://git-wip-us.apache.org/repos/asf/accumulo/blob/e99ec9f0/docs/src/main/asciidoc/chapters/table_configuration.txt
----------------------------------------------------------------------
diff --git a/docs/src/main/asciidoc/chapters/table_configuration.txt b/docs/src/main/asciidoc/chapters/table_configuration.txt
deleted file mode 100644
index e78d7bd..0000000
--- a/docs/src/main/asciidoc/chapters/table_configuration.txt
+++ /dev/null
@@ -1,670 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements.  See the NOTICE file distributed with
-// this work for additional information regarding copyright ownership.
-// The ASF licenses this file to You under the Apache License, Version 2.0
-// (the "License"); you may not use this file except in compliance with
-// the License.  You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-== Table Configuration
-
-Accumulo tables have a few options that can be configured to alter the default
-behavior of Accumulo as well as improve performance based on the data stored.
-These include locality groups, constraints, bloom filters, iterators, and block
-cache.  For a complete list of available configuration options, see <<configuration>>.
-
-=== Locality Groups
-
-Accumulo supports storing sets of column families separately on disk to allow
-clients to efficiently scan over columns that are frequently used together and to avoid
-scanning over column families that are not requested. After a locality group is set,
-Scanner and BatchScanner operations will automatically take advantage of it
-whenever the fetchColumnFamilies() method is used.
-
-By default, tables place all column families into the same ``default'' locality group.
-Additional locality groups can be configured at any time via the shell or
-programmatically as follows:
-
-==== Managing Locality Groups via the Shell
-
-  usage: setgroups <group>=<col fam>{,<col fam>}{ <group>=<col fam>{,<col fam>}}
-      [-?] -t <table>
-
-  user@myinstance mytable> setgroups group_one=colf1,colf2 -t mytable
-
-  user@myinstance mytable> getgroups -t mytable
-
-==== Managing Locality Groups via the Client API
-
-[source,java]
-----
-Connector conn;
-
-HashMap<String,Set<Text>> localityGroups = new HashMap<String, Set<Text>>();
-
-HashSet<Text> metadataColumns = new HashSet<Text>();
-metadataColumns.add(new Text("domain"));
-metadataColumns.add(new Text("link"));
-
-HashSet<Text> contentColumns = new HashSet<Text>();
-contentColumns.add(new Text("body"));
-contentColumns.add(new Text("images"));
-
-localityGroups.put("metadata", metadataColumns);
-localityGroups.put("content", contentColumns);
-
-conn.tableOperations().setLocalityGroups("mytable", localityGroups);
-
-// existing locality groups can be obtained as follows
-Map<String, Set<Text>> groups =
-    conn.tableOperations().getLocalityGroups("mytable");
-----
-
-The assignment of Column Families to Locality Groups can be changed at any time. The
-physical movement of column families into their new locality groups takes place via
-the periodic Major Compaction process that takes place continuously in the
-background. Major Compaction can also be scheduled to take place immediately
-through the shell:
-
-  user@myinstance mytable> compact -t mytable
-
-=== Constraints
-
-Accumulo supports constraints applied on mutations at insert time. This can be
-used to disallow certain inserts according to a user defined policy. Any mutation
-that fails to meet the requirements of the constraint is rejected and sent back to the
-client.
-
-Constraints can be enabled by setting a table property as follows:
-
-----
-user@myinstance mytable> constraint -t mytable -a com.test.ExampleConstraint com.test.AnotherConstraint
-
-user@myinstance mytable> constraint -l
-com.test.ExampleConstraint=1
-com.test.AnotherConstraint=2
-----
-
-Currently there are no general-purpose constraints provided with the Accumulo
-distribution. New constraints can be created by writing a Java class that implements
-the following interface:
-
-  org.apache.accumulo.core.constraints.Constraint
-
-To deploy a new constraint, create a jar file containing the class implementing the
-new constraint and place it in the lib directory of the Accumulo installation. New
-constraint jars can be added to Accumulo and enabled without restarting but any
-change to an existing constraint class requires Accumulo to be restarted.
-
-See the https://github.com/apache/accumulo-examples/blob/master/docs/contraints.md[constraints example]
-for example code.
-
-=== Bloom Filters
-
-As mutations are applied to an Accumulo table, several files are created per tablet. If
-bloom filters are enabled, Accumulo will create and load a small data structure into
-memory to determine whether a file contains a given key before opening the file.
-This can speed up lookups considerably.
-
-To enable bloom filters, enter the following command in the Shell:
-
-  user@myinstance> config -t mytable -s table.bloom.enabled=true
-
-The https://github.com/apache/accumulo-examples/blob/master/docs/bloom.md[bloom filter example]
-contains an extensive example of using Bloom Filters.
-
-=== Iterators
-
-Iterators provide a modular mechanism for adding functionality to be executed by
-TabletServers when scanning or compacting data. This allows users to efficiently
-summarize, filter, and aggregate data. In fact, the built-in features of cell-level
-security and column fetching are implemented using Iterators.
-Some useful Iterators are provided with Accumulo and can be found in the
-*+org.apache.accumulo.core.iterators.user+* package.
-In each case, any custom Iterators must be included in Accumulo's classpath,
-typically by including a jar in +lib/+ or +lib/ext/+, although the VFS classloader
-allows for classpath manipulation using a variety of schemes including URLs and HDFS URIs.
-
-==== Setting Iterators via the Shell
-
-Iterators can be configured on a table at scan, minor compaction and/or major
-compaction scopes. If the Iterator implements the OptionDescriber interface, the
-setiter command can be used which will interactively prompt the user to provide
-values for the required options.
-
-  usage: setiter [-?] -ageoff | -agg | -class <name> | -regex |
-      -reqvis | -vers   [-majc] [-minc] [-n <itername>] -p <pri>
-      [-scan] [-t <table>]
-
-  user@myinstance mytable> setiter -t mytable -scan -p 15 -n myiter -class com.company.MyIterator
-
-The config command can always be used to manually configure iterators which is useful
-in cases where the Iterator does not implement the OptionDescriber interface.
-
-  config -t mytable -s table.iterator.scan.myiter=15,com.company.MyIterator
-  config -t mytable -s table.iterator.minc.myiter=15,com.company.MyIterator
-  config -t mytable -s table.iterator.majc.myiter=15,com.company.MyIterator
-  config -t mytable -s table.iterator.scan.myiter.opt.myoptionname=myoptionvalue
-  config -t mytable -s table.iterator.minc.myiter.opt.myoptionname=myoptionvalue
-  config -t mytable -s table.iterator.majc.myiter.opt.myoptionname=myoptionvalue
-
-Typically, a table will have multiple iterators. Accumulo configures a set of
-system level iterators for each table. These iterators provide core
-functionality like visibility label filtering and may not be removed by
-users. User level iterators are applied in the order of their priority.
-Priority is a user configured integer; iterators with lower numbers go first,
-passing the results of their iteration on to the other iterators up the
-stack.
-
-==== Setting Iterators Programmatically
-
-[source,java]
-scanner.addScanIterator(new IteratorSetting(
-    15, // priority
-    "myiter", // name this iterator
-    "com.company.MyIterator" // class name
-));
-
-Some iterators take additional parameters from client code, as in the following
-example:
-
-[source,java]
-IteratorSetting iter = new IteratorSetting(...);
-iter.addOption("myoptionname", "myoptionvalue");
-scanner.addScanIterator(iter);
-
-Tables support separate Iterator settings to be applied at scan time, upon minor
-compaction and upon major compaction. For most uses, tables will have identical
-iterator settings for all three to avoid inconsistent results.
-
-==== Versioning Iterators and Timestamps
-
-Accumulo provides the capability to manage versioned data through the use of
-timestamps within the Key. If a timestamp is not specified in the key created by the
-client then the system will set the timestamp to the current time. Two keys with
-identical rowIDs and columns but different timestamps are considered two versions
-of the same key. If two inserts are made into Accumulo with the same rowID,
-column, and timestamp, then the behavior is non-deterministic.
-
-Timestamps are sorted in descending order, so the most recent data comes first.
-Accumulo can be configured to return the top k versions, or versions later than a
-given date. The default is to return the one most recent version.
-
-The version policy can be changed by changing the VersioningIterator options for a
-table as follows:
-
-----
-user@myinstance mytable> config -t mytable -s table.iterator.scan.vers.opt.maxVersions=3
-
-user@myinstance mytable> config -t mytable -s table.iterator.minc.vers.opt.maxVersions=3
-
-user@myinstance mytable> config -t mytable -s table.iterator.majc.vers.opt.maxVersions=3
-----
-
-When a table is created, by default it is configured to use the
-VersioningIterator and keep one version. A table can be created without the
-VersioningIterator with the -ndi option in the shell. Also the Java API
-has the following method
-
-[source,java]
-connector.tableOperations().create(String tableName, boolean limitVersion);
-
-===== Logical Time
-
-Accumulo 1.2 introduces the concept of logical time. This ensures that timestamps
-set by Accumulo always move forward. This helps avoid problems caused by
-TabletServers that have different time settings. The per-tablet counter gives unique,
-one-up timestamps on a per-mutation basis. When using time in milliseconds, if
-two things arrive within the same millisecond then both receive the same
-timestamp. When using time in milliseconds, Accumulo-set times will still
-always move forward and never backwards.
-
-A table can be configured to use logical timestamps at creation time as follows:
-
-  user@myinstance> createtable -tl logical
-
-===== Deletes
-
-Deletes are special keys in Accumulo that get sorted along with all the other data.
-When a delete key is inserted, Accumulo will not show anything that has a
-timestamp less than or equal to the delete key. During major compaction, any keys
-older than a delete key are omitted from the new file created, and the omitted keys
-are removed from disk as part of the regular garbage collection process.
-
-==== Filters
-
-When scanning over a set of key-value pairs it is possible to apply an arbitrary
-filtering policy through the use of a Filter. Filters are types of iterators that return
-only key-value pairs that satisfy the filter logic. Accumulo has a few built-in filters
-that can be configured on any table: AgeOff, ColumnAgeOff, Timestamp, NoVis, and RegEx. More can be added
-by writing a Java class that extends the
-+org.apache.accumulo.core.iterators.Filter+ class.
-
-The AgeOff filter can be configured to remove data older than a certain date or a fixed
-amount of time from the present. The following example sets a table to delete
-everything inserted over 30 seconds ago:
-
-----
-user@myinstance> createtable filtertest
-
-user@myinstance filtertest> setiter -t filtertest -scan -minc -majc -p 10 -n myfilter -ageoff
-AgeOffFilter removes entries with timestamps more than <ttl> milliseconds old
-----------> set org.apache.accumulo.core.iterators.user.AgeOffFilter parameter negate, default false
-                keeps k/v that pass accept method, true rejects k/v that pass accept method:
-----------> set org.apache.accumulo.core.iterators.user.AgeOffFilter parameter ttl, time to
-                live (milliseconds): 30000
-----------> set org.apache.accumulo.core.iterators.user.AgeOffFilter parameter currentTime, if set,
-                use the given value as the absolute time in milliseconds as the current time of day:
-
-user@myinstance filtertest>
-
-user@myinstance filtertest> scan
-
-user@myinstance filtertest> insert foo a b c
-
-user@myinstance filtertest> scan
-foo a:b [] c
-
-user@myinstance filtertest> sleep 40
-
-user@myinstance filtertest> scan
-
-user@myinstance filtertest>
-----
-
-To see the iterator settings for a table, use:
-
-  user@myinstance filtertest> config -t filtertest -f iterator
-  ---------+---------------------------------------------+------------------
-  SCOPE    | NAME                                        | VALUE
-  ---------+---------------------------------------------+------------------
-  table    | table.iterator.majc.myfilter .............. | 10,org.apache.accumulo.core.iterators.user.AgeOffFilter
-  table    | table.iterator.majc.myfilter.opt.ttl ...... | 30000
-  table    | table.iterator.majc.vers .................. | 20,org.apache.accumulo.core.iterators.VersioningIterator
-  table    | table.iterator.majc.vers.opt.maxVersions .. | 1
-  table    | table.iterator.minc.myfilter .............. | 10,org.apache.accumulo.core.iterators.user.AgeOffFilter
-  table    | table.iterator.minc.myfilter.opt.ttl ...... | 30000
-  table    | table.iterator.minc.vers .................. | 20,org.apache.accumulo.core.iterators.VersioningIterator
-  table    | table.iterator.minc.vers.opt.maxVersions .. | 1
-  table    | table.iterator.scan.myfilter .............. | 10,org.apache.accumulo.core.iterators.user.AgeOffFilter
-  table    | table.iterator.scan.myfilter.opt.ttl ...... | 30000
-  table    | table.iterator.scan.vers .................. | 20,org.apache.accumulo.core.iterators.VersioningIterator
-  table    | table.iterator.scan.vers.opt.maxVersions .. | 1
-  ---------+---------------------------------------------+------------------
-
-==== Combiners
-
-Accumulo supports on-the-fly, lazy aggregation of data using Combiners. Aggregation is
-done at compaction and scan time. No lookup is done at insert time, which greatly
-speeds up ingest.
-
-Accumulo allows Combiners to be configured on tables and column
-families. When a Combiner is set it is applied across the values
-associated with any keys that share rowID, column family, and column qualifier.
-This is similar to the reduce step in MapReduce, which applies some function to all
-the values associated with a particular key.
-
-For example, if a summing combiner were configured on a table and the following
-mutations were inserted:
-
-  Row     Family Qualifier Timestamp  Value
-  rowID1  colfA  colqA     20100101   1
-  rowID1  colfA  colqA     20100102   1
-
-The table would reflect only one aggregate value:
-
-  rowID1  colfA  colqA     -          2
-
-Combiners can be enabled for a table using the setiter command in the shell. Below is an example.
-
-----
-root@a14 perDayCounts> setiter -t perDayCounts -p 10 -scan -minc -majc -n daycount
-                       -class org.apache.accumulo.core.iterators.user.SummingCombiner
-TypedValueCombiner can interpret Values as a variety of number encodings
-  (VLong, Long, or String) before combining
-----------> set SummingCombiner parameter columns,
-            <col fam>[:<col qual>]{,<col fam>[:<col qual>]} : day
-----------> set SummingCombiner parameter type, <VARNUM|LONG|STRING>: STRING
-
-root@a14 perDayCounts> insert foo day 20080101 1
-root@a14 perDayCounts> insert foo day 20080101 1
-root@a14 perDayCounts> insert foo day 20080103 1
-root@a14 perDayCounts> insert bar day 20080101 1
-root@a14 perDayCounts> insert bar day 20080101 1
-
-root@a14 perDayCounts> scan
-bar day:20080101 []    2
-foo day:20080101 []    2
-foo day:20080103 []    1
-----
-
-Accumulo includes some useful Combiners out of the box. To find these, look in
-the *+org.apache.accumulo.core.iterators.user+* package.
-
-Additional Combiners can be added by creating a Java class that extends
-+org.apache.accumulo.core.iterators.Combiner+ and adding a jar containing that
-class to Accumulo's lib/ext directory.
-
-See the https://github.com/apache/accumulo-examples/blob/master/docs/combiner.md[combiner example]
-for example code.
-
-=== Block Cache
-
-In order to increase throughput of commonly accessed entries, Accumulo employs a block cache.
-This block cache buffers data in memory so that it doesn't have to be read off of disk.
-The RFile format that Accumulo prefers is a mix of index blocks and data blocks, where the index blocks are used to find the appropriate data blocks.
-Typical queries to Accumulo result in a binary search over several index blocks followed by a linear scan of one or more data blocks.
-
-The block cache can be configured on a per-table basis, and all tablets hosted on a tablet server share a single resource pool.
-To configure the size of the tablet server's block cache, set the following properties:
-
-  tserver.cache.data.size: Specifies the size of the cache for file data blocks.
-  tserver.cache.index.size: Specifies the size of the cache for file indices.
-
-To enable the block cache for your table, set the following properties:
-
-  table.cache.block.enable: Determines whether file (data) block cache is enabled.
-  table.cache.index.enable: Determines whether index cache is enabled.
-
-The block cache can have a significant effect on alleviating hot spots, as well as reducing query latency.
-It is enabled by default for the metadata tables.
-
-=== Compaction
-
-As data is written to Accumulo it is buffered in memory. The data buffered in
-memory is eventually written to HDFS on a per tablet basis. Files can also be
-added to tablets directly by bulk import. In the background tablet servers run
-major compactions to merge multiple files into one. The tablet server has to
-decide which tablets to compact and which files within a tablet to compact.
-This decision is made using the compaction ratio, which is configurable on a
-per table basis. To configure this ratio modify the following property:
-
-  table.compaction.major.ratio
-
-Increasing this ratio will result in more files per tablet and less compaction
-work. More files per tablet means higher query latency. So adjusting
-this ratio is a trade-off between ingest and query performance. The ratio
-defaults to 3.
-
-The way the ratio works is that a set of files is compacted into one file if the
-sum of the sizes of the files in the set is larger than the ratio multiplied by
-the size of the largest file in the set. If this is not true for the set of all
-files in a tablet, the largest file is removed from consideration, and the
-remaining files are considered for compaction. This is repeated until a
-compaction is triggered or there are no files left to consider.
-
-The number of background threads tablet servers use to run major compactions is
-configurable. To configure this modify the following property:
-
-  tserver.compaction.major.concurrent.max
-
-Also, the number of threads tablet servers use for minor compactions is
-configurable. To configure this modify the following property:
-
-  tserver.compaction.minor.concurrent.max
-
-The numbers of minor and major compactions running and queued are visible on the
-Accumulo monitor page. This allows you to see whether compactions are backing up
-and whether adjustments to the above settings are needed. When adjusting the number of
-threads available for compactions, consider the number of cores and other tasks
-running on the nodes, such as MapReduce map and reduce tasks.
-
-If major compactions are not keeping up, then the number of files per tablet
-will grow to a point such that query performance starts to suffer. One way to
-handle this situation is to increase the compaction ratio. For example, if the
-compaction ratio were set to 1, then every new file added to a tablet by minor
-compaction would immediately queue the tablet for major compaction. So if a
-tablet has a 200M file and minor compaction writes a 1M file, then the major
-compaction will attempt to merge the 200M and 1M files. If the tablet server
-has lots of tablets trying to do this sort of thing, then major compactions
-will back up and the number of files per tablet will start to grow, assuming
-data is being continuously written. Increasing the compaction ratio will
-alleviate backups by lowering the amount of major compaction work that needs to
-be done.
-
-Another option to deal with the files per tablet growing too large is to adjust
-the following property:
-
-  table.file.max
-
-When a tablet reaches this number of files and needs to flush its in-memory
-data to disk, it will choose to do a merging minor compaction. A merging minor
-compaction will merge the tablet's smallest file with the data in memory at
-minor compaction time. Therefore the number of files will not grow beyond this
-limit. This will make minor compactions take longer, which will cause ingest
-performance to decrease. This can cause ingest to slow down until major
-compactions have enough time to catch up. When adjusting this property, also
-consider adjusting the compaction ratio. Ideally, merging minor compactions
-never need to occur and major compactions will keep up. It is possible to
-configure the file max and compaction ratio such that only merging minor
-compactions occur and major compactions never occur. This should be avoided
-because doing only merging minor compactions causes O(_N_^2^) work to be done.
-The amount of work done by major compactions is O(_N_*log~_R_~(_N_)) where
-_R_ is the compaction ratio.
-
-Compactions can be initiated manually for a table. To initiate a minor
-compaction, use the flush command in the shell. To initiate a major compaction,
-use the compact command in the shell. The compact command will compact all
-tablets in a table to one file. Even tablets with one file are compacted. This
-is useful for the case where a major compaction filter is configured for a
-table. In 1.4 the ability to compact a range of a table was added. To use this
-feature specify start and stop rows for the compact command. This will only
-compact tablets that overlap the given row range.
-
-==== Compaction Strategies
-
-The default behavior of major compactions is defined in the class DefaultCompactionStrategy. 
-This behavior can be changed by overriding the following property with a fully qualified class name:
-
-  table.majc.compaction.strategy
-
-Custom compaction strategies can have additional properties that are specified following the prefix property:
-
-  table.majc.compaction.strategy.opts.*
-
-Accumulo provides a few classes that can be used as an alternative compaction strategy. These classes are located in the 
-org.apache.accumulo.tserver.compaction.* package. EverythingCompactionStrategy will simply compact all files. This is the 
-strategy used by the user "compact" command. SizeLimitCompactionStrategy compacts files no bigger than the limit set in the
-property table.majc.compaction.strategy.opts.sizeLimit. 
-
-TwoTierCompactionStrategy is a hybrid compaction strategy that supports two types of compression. If the total size of
-files being compacted is larger than table.majc.compaction.strategy.opts.file.large.compress.threshold, then the compression
-type for large files will be used. That compression type is specified in table.majc.compaction.strategy.opts.file.large.compress.type.
-Otherwise, the configured table compression will be used. To use this strategy with minor compactions set table.file.compress.type=snappy
-and set a different compress type in table.majc.compaction.strategy.opts.file.large.compress.type for larger files.
-
-=== Pre-splitting tables
-
-Accumulo will balance and distribute tables across servers. Before a
-table gets large, it will be maintained as a single tablet on a single
-server. This limits the speed at which data can be added or queried
-to the speed of a single node. To improve performance when a table
-is new, or small, you can add split points and generate new tablets.
-
-In the shell:
-
-  root@myinstance> createtable newTable
-  root@myinstance> addsplits -t newTable g n t
-
-This will create a new table with 4 tablets. The table will be split
-on the letters ``g'', ``n'', and ``t'' which will work nicely if the
-row data start with lower-case alphabetic characters. If your row
-data includes binary information or numeric information, or if the
-distribution of the row information is not flat, then you would pick
-different split points. Now ingest and query can proceed on 4 nodes
-which can improve performance.
-
-=== Merging tablets
-
-Over time, a table can get very large, so large that it has hundreds
-of thousands of split points. Once there are enough tablets to spread
-a table across the entire cluster, additional splits may not improve
-performance, and may create unnecessary bookkeeping. The distribution
-of data may change over time. For example, if row data contains date
-information, and data is continually added and removed to maintain a
-window of current information, tablets for older rows may be empty.
-
-Accumulo supports tablet merging, which can be used to reduce
-the number of split points. The following command will merge all rows
-from ``A'' to ``Z'' into a single tablet:
-
-  root@myinstance> merge -t myTable -b A -e Z
-
-If the result of a merge produces a tablet that is larger than the
-configured split size, the tablet may be split by the tablet server.
-Be sure to increase your tablet size prior to any merges if the goal
-is to have larger tablets:
-
-  root@myinstance> config -t myTable -s table.split.threshold=2G
-
-In order to merge small tablets, you can ask Accumulo to merge
-sections of a table smaller than a given size.
-
-  root@myinstance> merge -t myTable -s 100M
-
-By default, small tablets will not be merged into tablets that are
-already larger than the given size. This can leave isolated small
-tablets. To force small tablets to be merged into larger tablets use
-the +--force+ option:
-
-  root@myinstance> merge -t myTable -s 100M --force
-
-Merging away small tablets works on one section at a time. If your
-table contains many sections of small split points, or you are
-attempting to change the split size of the entire table, it will be
-faster to set the split threshold and merge the entire table:
-
-  root@myinstance> config -t myTable -s table.split.threshold=256M
-  root@myinstance> merge -t myTable
-
-=== Delete Range
-
-Consider an indexing scheme that uses date information in each row.
-For example ``20110823-15:20:25.013'' might be a row that specifies a
-date and time. In some cases, we might like to delete rows based on
-this date, say to remove all the data older than the current year.
-Accumulo supports a delete range operation which efficiently
-removes data between two rows. For example:
-
-  root@myinstance> deleterange -t myTable -b 2010 -e 2011
-
-This will delete all rows starting with ``2010'' and it will stop at
-any row starting with ``2011''. You can delete any data prior to 2011
-with:
-
-  root@myinstance> deleterange -t myTable -e 2011 --force
-
-The shell will not allow you to delete an unbounded range (no start)
-unless you provide the +--force+ option.
-
-Range deletion is implemented using splits at the given start/end
-positions, and will affect the number of splits in the table.
-
-=== Cloning Tables
-
-A new table can be created that points to an existing table's data. This is a
-very quick metadata operation; no data is actually copied. The cloned table
-and the source table can change independently after the clone operation. One
-use case for this feature is testing. For example, to test a new filtering
-iterator, clone the table, add the filter to the clone, and force a major
-compaction. To perform a test on less data, clone a table and then use delete
-range to efficiently remove a lot of data from the clone. Another use case is
-generating a snapshot to guard against human error. To create a snapshot,
-clone a table and then disable write permissions on the clone.
-
-The clone operation will point to the source table's files. This is why the
-flush option is present and is enabled by default in the shell. If the flush
-option is not enabled, then any data the source table currently has in memory
-will not exist in the clone.
-
-A cloned table copies the configuration of the source table. However the
-permissions of the source table are not copied to the clone. After a clone is
-created, only the user that created the clone can read and write to it.
-
-In the following example we see that data inserted after the clone operation is
-not visible in the clone.
-
-----
-root@a14> createtable people
-
-root@a14 people> insert 890435 name last Doe
-root@a14 people> insert 890435 name first John
-
-root@a14 people> clonetable people test
-
-root@a14 people> insert 890436 name first Jane
-root@a14 people> insert 890436 name last Doe
-
-root@a14 people> scan
-890435 name:first []    John
-890435 name:last []    Doe
-890436 name:first []    Jane
-890436 name:last []    Doe
-
-root@a14 people> table test
-
-root@a14 test> scan
-890435 name:first []    John
-890435 name:last []    Doe
-
-root@a14 test>
-----
-
-The du command in the shell shows how much space a table is using in HDFS.
-This command can also show how much overlapping space two cloned tables have in
-HDFS. In the example below du shows table ci is using 428M. Then ci is cloned
-to cic and du shows that both tables share 428M. After three entries are
-inserted into cic and it is flushed, du shows the two tables still share 428M but
-cic has 226 bytes to itself. Finally, table cic is compacted and then du shows
-that each table uses 428M.
-
-----
-root@a14> du ci
-             428,482,573 [ci]
-
-root@a14> clonetable ci cic
-
-root@a14> du ci cic
-             428,482,573 [ci, cic]
-
-root@a14> table cic
-
-root@a14 cic> insert r1 cf1 cq1 v1
-root@a14 cic> insert r1 cf1 cq2 v2
-root@a14 cic> insert r1 cf1 cq3 v3
-
-root@a14 cic> flush -t cic -w
-27 15:00:13,908 [shell.Shell] INFO : Flush of table cic completed.
-
-root@a14 cic> du ci cic
-             428,482,573 [ci, cic]
-                     226 [cic]
-
-root@a14 cic> compact -t cic -w
-27 15:00:35,871 [shell.Shell] INFO : Compacting table ...
-27 15:03:03,303 [shell.Shell] INFO : Compaction of table cic completed for given range
-
-root@a14 cic> du ci cic
-             428,482,573 [ci]
-             428,482,612 [cic]
-
-root@a14 cic>
-----
-
-=== Exporting Tables
-
-Accumulo supports exporting tables for the purpose of copying tables to another
-cluster. Exporting and importing tables preserves the table's configuration,
-splits, and logical time. Tables are exported and then copied via the hadoop
-distcp command. To export a table, it must be offline and stay offline while
-distcp runs. The reason it needs to stay offline is to prevent files from being
-deleted. A table can be cloned and the clone taken offline in order to avoid
-losing access to the table. See the https://github.com/apache/accumulo-examples/blob/master/docs/export.md[export example]
-for example code.
\ No newline at end of file
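
The compaction-ratio rule described in the Compaction section of the chapter above can be sketched in a few lines of Java. This is only an illustration of the rule as the text states it, not Accumulo's DefaultCompactionStrategy: the class name CompactionRatioSketch and the method selectFiles are hypothetical, file sizes are plain numbers in arbitrary units, and the sketch assumes a single remaining file is never selected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the ratio rule; not part of the Accumulo API.
public class CompactionRatioSketch {

  // Returns the sizes of the files that would be compacted together,
  // or an empty list if no set of files satisfies the ratio.
  public static List<Long> selectFiles(List<Long> fileSizes, double ratio) {
    List<Long> candidates = new ArrayList<>(fileSizes);
    // Sort descending so the largest remaining file is always at index 0.
    candidates.sort(Collections.reverseOrder());

    // Assumption: a set containing a single file is never worth compacting.
    while (candidates.size() > 1) {
      long sum = 0;
      for (long size : candidates) {
        sum += size;
      }
      long largest = candidates.get(0);
      if (sum > ratio * largest) {
        return candidates; // sum of sizes exceeds ratio * largest file
      }
      candidates.remove(0); // drop the largest file and try again
    }
    return Collections.emptyList();
  }

  public static void main(String[] args) {
    // 160 <= 3 * 100, 60 <= 3 * 30, 30 <= 3 * 20: nothing is selected.
    System.out.println(selectFiles(Arrays.asList(100L, 30L, 20L, 10L), 3.0));
    // 340 > 3 * 100: all four files are selected.
    System.out.println(selectFiles(Arrays.asList(100L, 90L, 80L, 70L), 3.0));
    // Ratio 1: 201 > 200, so the 200M and 1M files from the text are merged.
    System.out.println(selectFiles(Arrays.asList(200L, 1L), 1.0));
  }
}
```

The third call mirrors the 200M/1M case from the text: with a ratio of 1, every newly flushed file makes the whole set eligible, which is why a low ratio produces so much major compaction work.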

http://git-wip-us.apache.org/repos/asf/accumulo/blob/e99ec9f0/docs/src/main/asciidoc/chapters/table_design.txt
----------------------------------------------------------------------
diff --git a/docs/src/main/asciidoc/chapters/table_design.txt b/docs/src/main/asciidoc/chapters/table_design.txt
deleted file mode 100644
index 31fa49a..0000000
--- a/docs/src/main/asciidoc/chapters/table_design.txt
+++ /dev/null
@@ -1,336 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements.  See the NOTICE file distributed with
-// this work for additional information regarding copyright ownership.
-// The ASF licenses this file to You under the Apache License, Version 2.0
-// (the "License"); you may not use this file except in compliance with
-// the License.  You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-== Table Design
-
-=== Basic Table
-
-Since Accumulo tables are sorted by row ID, each table can be thought of as being
-indexed by the row ID. Lookups performed by row ID can be executed quickly, by doing
-a binary search, first across the tablets, and then within a tablet. Clients should
-choose a row ID carefully in order to support their desired application. A simple rule
-is to select a unique identifier as the row ID for each entity to be stored and store
-all the other attributes to be tracked as columns under this row ID. For example,
-if we have the following data in a comma-separated file:
-
-  userid,age,address,account-balance
-
-We might choose to store this data using the userid as the rowID, the column
-name in the column family, and a blank column qualifier:
-
-[source,java]
-----
-Mutation m = new Mutation(userid);
-final String column_qualifier = "";
-m.put("age", column_qualifier, age);
-m.put("address", column_qualifier, address);
-m.put("balance", column_qualifier, account_balance);
-
-writer.add(m);
-----
-
-We could then retrieve any of the columns for a specific userid by specifying the
-userid as the range of a scanner and fetching specific columns:
-
-[source,java]
-----
-Range r = new Range(userid, userid); // single row
-Scanner s = conn.createScanner("userdata", auths);
-s.setRange(r);
-s.fetchColumnFamily(new Text("age"));
-
-for(Entry<Key,Value> entry : s) {
-  System.out.println(entry.getValue().toString());
-}
-----
-
-=== RowID Design
-
-Often it is necessary to transform the rowID in order to have rows ordered in a way
-that is optimal for anticipated access patterns. A good example of this is reversing
-the order of components of internet domain names in order to group rows of the
-same parent domain together:
-
-  com.google.code
-  com.google.labs
-  com.google.mail
-  com.yahoo.mail
-  com.yahoo.research
-
-Some data may result in the creation of very large rows - rows with many columns.
-In this case the table designer may wish to split up these rows for better load
-balancing while keeping them sorted together for scanning purposes. This can be
-done by appending a random substring at the end of the row:
-
-  com.google.code_00
-  com.google.code_01
-  com.google.code_02
-  com.google.labs_00
-  com.google.mail_00
-  com.google.mail_01
-
-It could also be done by adding a string representation of some period of time, such as the date
-truncated to the week or month:
-
-  com.google.code_201003
-  com.google.code_201004
-  com.google.code_201005
-  com.google.labs_201003
-  com.google.mail_201003
-  com.google.mail_201004
-
-Appending dates provides the additional capability of restricting a scan to a given
-date range.
-
-=== Lexicoders
-Since Keys in Accumulo are sorted lexicographically by default, it's often useful to encode
-common data types into a byte format in which their sort order corresponds to the sort order
-in their native form. An example of this is encoding dates and numerical data so that they can
-be seeked to or searched by range more effectively.
-
-The lexicoders are a standard and extensible way of encoding Java types. Here's an example
-of a lexicoder that encodes a Java Date object so that it sorts lexicographically:
-
-[source,java]
-----
-// create new date lexicoder
-DateLexicoder dateEncoder = new DateLexicoder();
-
-// truncate time to hours
-long epoch = System.currentTimeMillis();
-Date hour = new Date(epoch - (epoch % 3600000));
-
-// encode the rowId so that it is sorted lexicographically
-Mutation mutation = new Mutation(dateEncoder.encode(hour));
-mutation.put(new Text("colf"), new Text("colq"), new Value(new byte[]{}));
-----
-
-If we want to return the most recent date first, we can reverse the sort order
-with the reverse lexicoder:
-
-[source,java]
-----
-// create new date lexicoder and reverse lexicoder
-DateLexicoder dateEncoder = new DateLexicoder();
-ReverseLexicoder reverseEncoder = new ReverseLexicoder(dateEncoder);
-
-// truncate date to hours
-long epoch = System.currentTimeMillis();
-Date hour = new Date(epoch - (epoch % 3600000));
-
-// encode the rowId so that it sorts in reverse lexicographic order
-Mutation mutation = new Mutation(reverseEncoder.encode(hour));
-mutation.put(new Text("colf"), new Text("colq"), new Value(new byte[]{}));
-----
-
-
-=== Indexing
-In order to support lookups via more than one attribute of an entity, additional
-indexes can be built. However, because Accumulo tables can support any number of
-columns without specifying them beforehand, a single additional index will often
-suffice for supporting lookups of records in the main table. Here, the index has, as
-the rowID, the Value or Term from the main table, the column families are the same,
-and the column qualifier of the index table contains the rowID from the main table.
-
-[width="75%",cols="^,^,^,^"]
-[grid="rows"]
-[options="header"]
-|=============================================
-|RowID |Column Family |Column Qualifier |Value
-|Term  |Field Name    |MainRowID        |
-|=============================================
-
-Note: We store rowIDs in the column qualifier rather than the Value so that we can
-have more than one rowID associated with a particular term within the index. If we
-stored this in the Value we would only see one of the rows in which the value
-appears since Accumulo is configured by default to return the one most recent
-value associated with a key.
-
-Lookups can then be done by scanning the Index Table first for occurrences of the
-desired values in the columns specified, which returns a list of row IDs from the main
-table. These can then be used to retrieve each matching record, in its entirety or as a
-subset of its columns, from the Main Table.
-
-To support efficient lookups of multiple rowIDs from the same table, the Accumulo
-client library provides a BatchScanner. Users specify a set of Ranges to the
-BatchScanner, which performs the lookups in multiple threads to multiple servers
-and returns an Iterator over all the rows retrieved. Unlike with the basic Scanner
-interface, the rows returned are NOT in sorted order.
-
-[source,java]
-----
-// first we scan the index for IDs of rows matching our query
-Text term = new Text("mySearchTerm");
-
-HashSet<Range> matchingRows = new HashSet<Range>();
-
-Scanner indexScanner = conn.createScanner("index", auths);
-indexScanner.setRange(new Range(term, term));
-
-// we retrieve the matching rowIDs and create a set of ranges
-for(Entry<Key,Value> entry : indexScanner) {
-    matchingRows.add(new Range(entry.getKey().getColumnQualifier()));
-}
-
-// now we pass the set of rowIDs to the batch scanner to retrieve them
-BatchScanner bscan = conn.createBatchScanner("table", auths, 10);
-bscan.setRanges(matchingRows);
-bscan.fetchColumnFamily(new Text("attributes"));
-
-for(Entry<Key,Value> entry : bscan) {
-    System.out.println(entry.getValue());
-}
-----
-
-One advantage of the dynamic schema capabilities of Accumulo is that different
-fields may be indexed into the same physical table. However, it may be necessary to
-create different index tables if the terms must be formatted differently in order to
-maintain proper sort order. For example, real numbers must be formatted
-differently than their usual notation in order to be sorted correctly. In these cases,
-usually one index per unique data type will suffice.
-
-=== Entity-Attribute and Graph Tables
-
-Accumulo is ideal for storing entities and their attributes, especially if the
-attributes are sparse. It is often useful to join several datasets together on common
-entities within the same table. This can allow for the representation of graphs,
-including nodes, their attributes, and connections to other nodes.
-
-Rather than storing individual events, Entity-Attribute or Graph tables store
-aggregate information about the entities involved in the events and the
-relationships between entities. This is often preferable when single events aren't
-very useful and when a continuously updated summarization is desired.
-
-The physical schema for an entity-attribute or graph table is as follows:
-
-[width="75%",cols="^,^,^,^"]
-[grid="rows"]
-[options="header"]
-|==================================================
-|RowID    |Column Family  |Column Qualifier |Value
-|EntityID |Attribute Name |Attribute Value  |Weight
-|EntityID |Edge Type      |Related EntityID |Weight
-|==================================================
-
-For example, to keep track of employees, managers and products the following
-entity-attribute table could be used. Note that the weights are not always necessary
-and are set to 0 when not used.
-
-[width="75%",cols="^,^,^,^"]
-[grid="rows"]
-[options="header"]
-|=============================================
-|RowID |Column Family |Column Qualifier |Value
-| E001 | name         | bob             | 0
-| E001 | department   | sales           | 0
-| E001 | hire_date    | 20030102        | 0
-| E001 | units_sold   | P001            | 780
-| E002 | name         | george          | 0
-| E002 | department   | sales           | 0
-| E002 | manager_of   | E001            | 0
-| E002 | manager_of   | E003            | 0
-| E003 | name         | harry           | 0
-| E003 | department   | accounts_recv   | 0
-| E003 | hire_date    | 20000405        | 0
-| E003 | units_sold   | P002            | 566
-| E003 | units_sold   | P001            | 232
-| P001 | product_name | nike_airs       | 0
-| P001 | product_type | shoe            | 0
-| P001 | in_stock     | germany         | 900
-| P001 | in_stock     | brazil          | 200
-| P002 | product_name | basic_jacket    | 0
-| P002 | product_type | clothing        | 0
-| P002 | in_stock     | usa             | 3454
-| P002 | in_stock     | germany         | 700
-|=============================================
-
-To allow efficient updating of edge weights, an aggregating iterator can be
-configured to add the value of all mutations applied with the same key. These types
-of tables can easily be created from raw events by simply extracting the entities,
-attributes, and relationships from individual events and inserting the keys into
-Accumulo each with a count of 1. The aggregating iterator will take care of
-maintaining the edge weights.
-
-=== Document-Partitioned Indexing
-
-Using a simple index as described above works well when looking for records that
-match one of a set of given criteria. When looking for records that match more than
-one criterion simultaneously, such as documents that contain all of
-the words `the' and `white' and `house', two problems arise.
-
-The first is that the set of all records matching any one of the search terms must be sent
-to the client, which incurs a lot of network traffic. The second is that the
-client is responsible for intersecting the returned sets of records
-to eliminate all but the records matching all search terms. The memory of the client
-may easily be overwhelmed during this operation.
-
-For these reasons Accumulo includes support for a scheme known as sharded
-indexing, in which these set operations can be performed at the TabletServers and
-decisions about which records to include in the result set can be made without
-incurring network traffic.
-
-This is accomplished by partitioning records into bins that each reside on at most
-one TabletServer, and then creating an index of terms per record within each bin as
-follows:
-
-[width="75%",cols="^,^,^,^"]
-[grid="rows"]
-[options="header"]
-|==============================================
-|RowID |Column Family |Column Qualifier |Value
-|BinID |Term          |DocID            |Weight
-|==============================================
-
-Documents or records are mapped into bins by a user-defined ingest application. By
-storing the BinID as the RowID we ensure that all the information for a particular
-bin is contained in a single tablet and hosted on a single TabletServer since
-Accumulo never splits rows across tablets. Storing the Terms as column families
-serves to enable fast lookups of all the documents within this bin that contain the
-given term.
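
As a rough illustration of the layout above, the following plain-Java sketch shows how an ingest application might derive index entries for one document. Everything here is an assumption made for illustration: the class and method names, the hash-modulo bin assignment, the `bin%03d` naming, and the use of term frequency as the weight. The manual leaves all of these choices to the user-defined ingest application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical ingest helper; it only builds entries and does not write to Accumulo.
final class ShardedIndexSketch {
    // Assumed bin assignment: hash of the DocID modulo the number of bins.
    static String binFor(String docId, int numBins) {
        return String.format("bin%03d", Math.floorMod(docId.hashCode(), numBins));
    }

    // Returns one entry per distinct term, laid out as
    // {RowID = BinID, Column Family = Term, Column Qualifier = DocID, Value = Weight}.
    static List<String[]> indexEntries(String docId, String text, int numBins) {
        String bin = binFor(docId, numBins);
        // Count occurrences of each term; the count serves as the weight here.
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        List<String[]> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            entries.add(new String[] {bin, e.getKey(), docId, String.valueOf(e.getValue())});
        }
        return entries;
    }
}
```

Because every entry for a document shares the same BinID as its RowID, all of the document's terms land in the same row and therefore in the same tablet.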
-
-Finally, we perform set intersection operations on the TabletServer via a special
-iterator called the Intersecting Iterator. Since documents are partitioned into many
-bins, a search of all documents must search every bin. We can use the BatchScanner
-to scan all bins in parallel. The Intersecting Iterator should be enabled on a
-BatchScanner within user query code as follows:
-
-[source,java]
-----
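-// Terms that every matching document must contain.
-Text[] terms = {new Text("the"), new Text("white"), new Text("house")};
-
-// Query with 20 threads; conn, table, and auths are defined elsewhere.
-BatchScanner bscan = conn.createBatchScanner(table, auths, 20);
-try {
-    // Configure the server-side intersection at iterator priority 20.
-    IteratorSetting iter = new IteratorSetting(20, "ii", IntersectingIterator.class);
-    IntersectingIterator.setColumnFamilies(iter, terms);
-
-    bscan.addScanIterator(iter);
-    // An unbounded Range covers every bin in the table.
-    bscan.setRanges(Collections.singleton(new Range()));
-
-    // Each returned key carries a matching DocID in its column qualifier.
-    for (Entry<Key,Value> entry : bscan) {
-        System.out.println(" " + entry.getKey().getColumnQualifier());
-    }
-} finally {
-    // Release the scanner's query threads.
-    bscan.close();
-}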
-----
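
To make the result of the server-side intersection concrete, here is a plain-Java simulation of what each bin contributes to a query. It is a sketch of the outcome only, under the assumption that each bin can be modelled as a map from term to the set of DocIDs containing it. The class and method names are hypothetical, and the sketch does not reproduce how the Intersecting Iterator actually walks the sorted keys.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Plain-Java model of per-bin set intersection; not Accumulo code.
final class BinIntersectionSketch {
    // Intersect the DocID sets of all terms within a single bin.
    static SortedSet<String> intersect(Map<String, SortedSet<String>> bin, List<String> terms) {
        SortedSet<String> result = null;
        for (String term : terms) {
            SortedSet<String> docs = bin.getOrDefault(term, Collections.emptySortedSet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the candidate set
            } else {
                result.retainAll(docs);         // later terms narrow it
            }
            if (result.isEmpty()) {
                break;                          // no document in this bin can match
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // A full search intersects inside every bin and unions the per-bin results,
    // mirroring a BatchScanner that visits all bins.
    static SortedSet<String> search(Collection<Map<String, SortedSet<String>>> bins,
                                    List<String> terms) {
        SortedSet<String> all = new TreeSet<>();
        for (Map<String, SortedSet<String>> bin : bins) {
            all.addAll(intersect(bin, terms));
        }
        return all;
    }
}
```

Only the DocIDs that survive the intersection in each bin would travel to the client, which is the saving in network traffic and client memory that the scheme is designed to provide.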
-
-This code effectively has the BatchScanner scan all tablets of a table, looking for
-documents that match all the given terms. Because all tablets are being scanned for
-every query, each query is more expensive than other Accumulo scans, which
-typically involve a small number of TabletServers. This reduces the number of
-concurrent queries supported and exposes queries to what is known as the `straggler'
-problem, in which every query runs as slowly as the slowest server participating.
-
-Of course, fast servers will return their results to the client, which can display them
-to the user immediately while they wait for the rest of the results to arrive. If the
-results are unordered, this is quite effective, as the first results to arrive are as good
-as any others to the user.

