From: mwalch@apache.org
To: commits@accumulo.apache.org
Date: Mon, 22 May 2017 17:30:02 -0000
Subject: [13/16] accumulo-website git commit: ACCUMULO-4630 Refactored documentation for website

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/development/analytics.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/analytics.md b/_docs-unreleased/development/analytics.md (new file)

---
title: Analytics
category: development
order: 8
---

Accumulo supports more advanced data processing than simply keeping keys sorted and performing efficient lookups. Analytics can be developed by using MapReduce and Iterators in conjunction with Accumulo tables.

## MapReduce

Accumulo tables can be used as the source and destination of MapReduce jobs. To use an Accumulo table with a MapReduce job (specifically with the new Hadoop API, as of version 0.20), configure the job parameters to use the AccumuloInputFormat and AccumuloOutputFormat. Accumulo-specific parameters can be set via these two format classes to do the following:

* Authenticate and provide user credentials for the input
* Restrict the scan to a range of rows
* Restrict the input to a subset of available columns

### Mapper and Reducer classes

To read from an Accumulo table, create a Mapper with the following class parameterization and be sure to configure the AccumuloInputFormat.

```java
class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
    public void map(Key k, Value v, Context c) {
        // transform key and value data here
    }
}
```

To write to an Accumulo table, create a Reducer with the following class parameterization and be sure to configure the AccumuloOutputFormat.
The key emitted from the Reducer identifies the table to which the mutation is sent. This allows a single Reducer to write to more than one table if desired. A default table can be configured using the AccumuloOutputFormat, in which case the output table name does not have to be passed to the Context object within the Reducer.

```java
class MyReducer extends Reducer<WritableComparable, Writable, Text, Mutation> {
    public void reduce(WritableComparable key, Iterable<Text> values, Context c) {
        Mutation m;
        // create the mutation based on input key and value
        c.write(new Text("output-table"), m);
    }
}
```

The Text object passed as the output should contain the name of the table to which this mutation should be applied. The Text can be null, in which case the mutation will be applied to the default table name specified in the AccumuloOutputFormat options.

### AccumuloInputFormat options

```java
Job job = new Job(getConf());
AccumuloInputFormat.setInputInfo(job,
    "user",
    "passwd".getBytes(),
    "table",
    new Authorizations());

AccumuloInputFormat.setZooKeeperInstance(job, "myinstance",
    "zooserver-one,zooserver-two");
```

**Optional Settings:**

To restrict Accumulo to a set of row ranges:

```java
ArrayList<Range> ranges = new ArrayList<Range>();
// populate array list of row ranges ...
AccumuloInputFormat.setRanges(job, ranges);
```

To restrict Accumulo to a list of columns:

```java
ArrayList<Pair<Text,Text>> columns = new ArrayList<Pair<Text,Text>>();
// populate list of columns
AccumuloInputFormat.fetchColumns(job, columns);
```

To use a regular expression to match row IDs:

```java
IteratorSetting is = new IteratorSetting(30, RegExFilter.class);
RegExFilter.setRegexs(is, ".*suffix", null, null, null, true);
AccumuloInputFormat.addIterator(job, is);
```

### AccumuloMultiTableInputFormat options

The AccumuloMultiTableInputFormat allows scanning over multiple tables in a single MapReduce job. Separate ranges, columns, and iterators can be used for each table.

```java
InputTableConfig tableOneConfig = new InputTableConfig();
InputTableConfig tableTwoConfig = new InputTableConfig();
```

To set the configuration objects on the job:

```java
Map<String,InputTableConfig> configs = new HashMap<String,InputTableConfig>();
configs.put("table1", tableOneConfig);
configs.put("table2", tableTwoConfig);
AccumuloMultiTableInputFormat.setInputTableConfigs(job, configs);
```

**Optional settings:**

To restrict to a set of ranges:

```java
ArrayList<Range> tableOneRanges = new ArrayList<Range>();
ArrayList<Range> tableTwoRanges = new ArrayList<Range>();
// populate array lists of row ranges for tables...
tableOneConfig.setRanges(tableOneRanges);
tableTwoConfig.setRanges(tableTwoRanges);
```

To restrict Accumulo to a list of columns:

```java
ArrayList<Pair<Text,Text>> tableOneColumns = new ArrayList<Pair<Text,Text>>();
ArrayList<Pair<Text,Text>> tableTwoColumns = new ArrayList<Pair<Text,Text>>();
// populate lists of columns for each of the tables ...
tableOneConfig.fetchColumns(tableOneColumns);
tableTwoConfig.fetchColumns(tableTwoColumns);
```

To set scan iterators:

```java
List<IteratorSetting> tableOneIterators = new ArrayList<IteratorSetting>();
List<IteratorSetting> tableTwoIterators = new ArrayList<IteratorSetting>();
// populate the lists of iterator settings for each of the tables ...
tableOneConfig.setIterators(tableOneIterators);
tableTwoConfig.setIterators(tableTwoIterators);
```

The name of the table can be retrieved from the input split:

```java
class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
    public void map(Key k, Value v, Context c) {
        RangeInputSplit split = (RangeInputSplit)c.getInputSplit();
        String tableName = split.getTableName();
        // do something with table name
    }
}
```

### AccumuloOutputFormat options

```java
boolean createTables = true;
String defaultTable = "mytable";

AccumuloOutputFormat.setOutputInfo(job,
    "user",
    "passwd".getBytes(),
    createTables,
    defaultTable);

AccumuloOutputFormat.setZooKeeperInstance(job, "myinstance",
    "zooserver-one,zooserver-two");
```

**Optional Settings:**

```java
AccumuloOutputFormat.setMaxLatency(job, 300000); // milliseconds
AccumuloOutputFormat.setMaxMutationBufferSize(job, 50000000); // bytes
```

The [MapReduce example](https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md) contains a complete example of using MapReduce with Accumulo.

## Combiners

Many applications can benefit from the ability to aggregate values across common keys. This can be done via Combiner iterators and is similar to the Reduce step in MapReduce. This provides the ability to define online, incrementally updated analytics without the overhead or latency associated with batch-oriented MapReduce jobs.

All that is needed to aggregate values of a table is to identify the fields over which values will be grouped, insert mutations with those fields as the key, and configure the table with a combining iterator that supports the summarizing operation desired.

The only restriction on a combining iterator is that the combiner developer should not assume that all values for a given key have been seen, since new mutations can be inserted at any time. This precludes, for example, using the total number of values in the aggregation, such as when calculating an average.

### Feature Vectors

An interesting use of combining iterators within an Accumulo table is to store feature vectors for use in machine learning algorithms. For example, many algorithms such as k-means clustering, support vector machines, anomaly detection, etc. use the concept of a feature vector and the calculation of distance metrics to learn a particular model. The columns in an Accumulo table can be used to efficiently store sparse features and their weights to be incrementally updated via the use of a combining iterator.

## Statistical Modeling

Statistical models that need to be updated by many machines in parallel could be similarly stored within an Accumulo table. For example, a MapReduce job that is iteratively updating a global statistical model could have each map or reduce worker reference the parts of the model to be read and updated through an embedded Accumulo client.

Using Accumulo this way enables efficient and fast lookups and updates of small pieces of information in a random access pattern, which is complementary to MapReduce's sequential access model.
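As a concrete illustration of the Combiners section above, a summing combiner could be attached to a table through the Java client API. This is only a sketch, not text from the original page; the table name `featvecs`, the existing `connector` object, and the choice of SummingCombiner are assumptions made for the example.

```java
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

// Attach a SummingCombiner so that values written to the same key are summed
// incrementally. Priority 10 keeps it below the VersioningIterator (priority 20).
IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
SummingCombiner.setCombineAllColumns(setting, true);
connector.tableOperations().attachIterator("featvecs", setting);
```

With this configured, repeated inserts of numeric values under the same key are collapsed into their running sum at scan and compaction time, which is the incremental-update pattern described for feature vectors above.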
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/development/development_tools.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/development_tools.md b/_docs-unreleased/development/development_tools.md (new file)

---
title: Development Tools
category: development
order: 3
---

Normally, Accumulo consists of lots of moving parts. Even a stand-alone version of Accumulo requires Hadoop, Zookeeper, the Accumulo master, a tablet server, etc. If you want to write a unit test that uses Accumulo, you need a lot of infrastructure in place before your test can run.

## Mock Accumulo

Mock Accumulo supplies mock implementations for much of the client API. It presently does not enforce users, logins, permissions, etc. It does support Iterators and Combiners. Note that MockAccumulo holds all data in memory and will not retain any data or settings between runs.

While normal interaction with the Accumulo client looks like this:

```java
Instance instance = new ZooKeeperInstance(...);
Connector conn = instance.getConnector(user, passwordToken);
```

To interact with MockAccumulo, just replace the ZooKeeperInstance with MockInstance:

```java
Instance instance = new MockInstance();
```

In fact, you can use the `--fake` option to the Accumulo shell and interact with MockAccumulo:

```
$ accumulo shell --fake -u root -p ''

Shell - Apache Accumulo Interactive Shell
-
- version: 2.x.x
- instance name: fake
- instance id: mock-instance-id
-
- type 'help' for a list of available commands
-

root@fake> createtable test

root@fake test> insert row1 cf cq value
root@fake test> insert row2 cf cq value2
root@fake test> insert row3 cf cq value3

root@fake test> scan
row1 cf:cq []    value
row2 cf:cq []    value2
row3 cf:cq []    value3

root@fake test> scan -b row2 -e row2
row2 cf:cq []    value2

root@fake test>
```

When testing MapReduce jobs, you can also set the Mock Accumulo on the AccumuloInputFormat and AccumuloOutputFormat classes:

```java
// ... set up job configuration
AccumuloInputFormat.setMockInstance(job, "mockInstance");
AccumuloOutputFormat.setMockInstance(job, "mockInstance");
```

## Mini Accumulo Cluster

While Mock Accumulo provides a lightweight implementation of the client API for unit testing, it is often necessary to write more realistic end-to-end integration tests that take advantage of the entire ecosystem. The Mini Accumulo Cluster makes this possible by configuring and starting Zookeeper, initializing Accumulo, and starting the Master as well as some Tablet Servers. It runs against the local filesystem instead of having to start up HDFS.
To start it up, you will need to supply an empty directory and a root password as arguments:

```java
File tempDirectory = // JUnit and Guava supply mechanisms for creating temp directories
MiniAccumuloCluster accumulo = new MiniAccumuloCluster(tempDirectory, "password");
accumulo.start();
```

Once we have our mini cluster running, we will want to interact with the Accumulo client API:

```java
Instance instance = new ZooKeeperInstance(accumulo.getInstanceName(), accumulo.getZooKeepers());
Connector conn = instance.getConnector("root", new PasswordToken("password"));
```

Upon completion of our development code, we will want to shut down our MiniAccumuloCluster:

```java
accumulo.stop();
// delete your temporary folder
```

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/development/high_speed_ingest.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/high_speed_ingest.md b/_docs-unreleased/development/high_speed_ingest.md (new file)

---
title: High-Speed Ingest
category: development
order: 7
---

Accumulo is often used as part of a larger data processing and storage system. To maximize the performance of a parallel system involving Accumulo, the ingestion and query components should be designed to provide enough parallelism and concurrency to avoid creating bottlenecks for users and other systems writing to and reading from Accumulo. There are several ways to achieve high ingest performance.

## Pre-Splitting New Tables

New tables consist of a single tablet by default. As mutations are applied, the table grows and splits into multiple tablets which are balanced by the Master across TabletServers. This implies that the aggregate ingest rate will be limited to fewer servers than are available within the cluster until the table has reached the point where there are tablets on every TabletServer.

Pre-splitting a table ensures that there are as many tablets as desired available before ingest begins, taking advantage of all the parallelism possible with the cluster hardware. Tables can be split at any time by using the shell:

    user@myinstance mytable> addsplits -sf /local_splitfile -t mytable

For the purposes of providing parallelism to ingest, it is not necessary to create more tablets than there are physical machines within the cluster, as the aggregate ingest rate is a function of the number of physical machines. Note that the aggregate ingest rate is still subject to the number of machines running ingest clients and the distribution of rowIDs across the table. The aggregate ingest rate will be suboptimal if there are many inserts into a small number of rowIDs.

## Multiple Ingester Clients

Accumulo is capable of scaling to very high rates of ingest, which is dependent upon not just the number of TabletServers in operation but also the number of ingest clients. This is because a single client, while capable of batching mutations and sending them to all TabletServers, is ultimately limited by the amount of data that can be processed on a single machine. The aggregate ingest rate will scale linearly with the number of clients up to the point at which either the aggregate I/O of TabletServers or total network bandwidth capacity is reached.
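A minimal sketch of the kind of ingest client just described, in which each client batches mutations through a BatchWriter. The `connector` object, the table name `mytable`, and the column names are assumptions for the example, not part of the original text.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

// One ingest client: the BatchWriter buffers mutations in memory and sends them
// to the appropriate TabletServers in batches using multiple send threads.
BatchWriterConfig config = new BatchWriterConfig();
config.setMaxMemory(50 * 1024 * 1024); // buffer up to 50 MB before flushing
config.setMaxWriteThreads(4);          // threads used to send data to TabletServers

BatchWriter writer = connector.createBatchWriter("mytable", config);
for (int i = 0; i < 1000000; i++) {
  Mutation m = new Mutation(new Text(String.format("row_%08d", i)));
  m.put(new Text("cf"), new Text("cq"), new Value(("value" + i).getBytes()));
  writer.addMutation(m);
}
writer.close();
```

Running several such clients in parallel, each over a different slice of the input, is what drives the linear scaling described above.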
In operational settings where high rates of ingest are paramount, clusters are often configured to dedicate some number of machines solely to running Ingester Clients. The exact ratio of clients to TabletServers necessary for optimum ingestion rates will vary according to the distribution of resources per machine and by data type.

## Bulk Ingest

Accumulo supports the ability to import files produced by an external process such as MapReduce into an existing table. In some cases it may be faster to load data this way rather than ingesting through clients using BatchWriters. This allows a large number of machines to format data the way Accumulo expects. The new files can then simply be introduced to Accumulo via a shell command.

To configure MapReduce to format data in preparation for bulk loading, the job should be set to use a range partitioner instead of the default hash partitioner. The range partitioner uses the split points of the Accumulo table that will receive the data. The split points can be obtained from the shell and used by the MapReduce RangePartitioner. Note that this is only useful if the existing table is already split into multiple tablets.

    user@myinstance mytable> getsplits
    aa
    ab
    ac
    ...
    zx
    zy
    zz

Run the MapReduce job, using the AccumuloFileOutputFormat to create the files to be introduced to Accumulo. Once this is complete, the files can be added to Accumulo via the shell:

    user@myinstance mytable> importdirectory /files_dir /failures

Note that the paths referenced are directories within the same HDFS instance over which Accumulo is running. Accumulo places any files that failed to be added to the second directory specified.

See the [bulk ingest example](https://github.com/apache/accumulo-examples/blob/master/docs/bulkIngest.md) for a complete example.

## Logical Time for Bulk Ingest

Logical time is important for bulk imported data, for which the client code may be choosing a timestamp. At bulk import time, the user can choose to enable logical time for the set of files being imported. When it is enabled, Accumulo uses a specialized system iterator to lazily set times in a bulk imported file. This mechanism guarantees that times set by unsynchronized multi-node applications (such as those running on MapReduce) will maintain some semblance of causal ordering. This mitigates the problem of the time being wrong on the system that created the file for bulk import. These times are not set when the file is imported, but whenever it is read by scans or compactions. At import, a time is obtained and always used by the specialized system iterator to set that time.

The timestamp assigned by Accumulo will be the same for every key in the file. This could cause problems if the file contains multiple keys that are identical except for the timestamp. In this case, the sort order of the keys will be undefined. This could occur if an insert and an update were in the same bulk import file.

## MapReduce Ingest

It is possible to efficiently write many mutations to Accumulo in parallel via a MapReduce job. In this scenario the MapReduce job is written to process data that lives in HDFS and write mutations to Accumulo using the AccumuloOutputFormat. See the MapReduce section under Analytics for details. The [MapReduce example](https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md) is also a good reference for example code.
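A rough sketch of how such a MapReduce ingest job might be wired up, reusing the same AccumuloOutputFormat calls shown in the Analytics page. It assumes it runs inside a Tool's `run()` method (so `getConf()` is available), and `MyIngestMapper` is a hypothetical mapper that emits (table name, Mutation) pairs; instance, user, and table names are placeholders.

```java
import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Map-only ingest job: the mapper parses lines from HDFS and emits
// (Text tableName, Mutation) pairs for AccumuloOutputFormat to write.
Job job = new Job(getConf());
job.setJobName("hdfs-to-accumulo-ingest");
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(MyIngestMapper.class); // hypothetical mapper emitting Mutations
job.setNumReduceTasks(0);                 // map-only ingest
job.setOutputFormatClass(AccumuloOutputFormat.class);

AccumuloOutputFormat.setOutputInfo(job, "user", "passwd".getBytes(), true, "mytable");
AccumuloOutputFormat.setZooKeeperInstance(job, "myinstance", "zooserver-one,zooserver-two");
```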
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/development/iterator_design.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/iterator_design.md b/_docs-unreleased/development/iterator_design.md (new file)

---
title: Iterator Design
category: development
order: 1
---

Accumulo SortedKeyValueIterators, commonly referred to as Iterators for short, are server-side programming constructs that allow users to implement custom retrieval or computational logic within Accumulo TabletServers. The name rightly brings forward similarities to the Java Iterator interface; however, Accumulo Iterators are more complex than Java Iterators. Notably, in addition to the expected methods to retrieve the current element and advance to the next element in the iteration, Accumulo Iterators must also support the ability to "move" (`seek`) to a specified point in the iteration (the Accumulo table). Accumulo Iterators are designed to be concatenated together, similar to applying a series of transformations to a list of elements. Accumulo Iterators can duplicate their underlying source to create multiple "pointers" over the same underlying data (which is extremely powerful since each stream is sorted) or they can merge multiple Iterators into a single view. In this sense, a collection of Iterators operating in tandem is closer to a tree structure than a list, but there is always a sense of a flow of Key-Value pairs through some Iterators. Iterators are not designed to act as triggers, nor are they designed to operate outside of the purview of a single table.

Understanding how TabletServers invoke the methods on a SortedKeyValueIterator can be obtuse, as the actual code is buried within the implementation of the TabletServer; however, it is generally unnecessary to have a strong understanding of this as the interface provides clear definitions about what action each method should take. This chapter aims to provide a more detailed description of how Iterators are invoked, some best practices, and some common pitfalls.

## Instantiation

To invoke an Accumulo Iterator inside of the TabletServer, the Iterator class must be on the classpath of every TabletServer. For production environments, it is common to place a JAR file which contains the Iterator in `lib/`. In development environments, it is convenient to instead place the JAR file in `lib/ext/`, as JAR files in this directory are dynamically reloaded by the TabletServers, alleviating the need to restart Accumulo while testing an Iterator. Advanced classloader features enable other types of filesystems and per-table classpath configurations (as opposed to process-wide classpaths). These features are not covered here, but elsewhere in the user manual.

Accumulo references the Iterator class by name and uses Java reflection to instantiate the Iterator. This means that Iterators must have a public no-args constructor.
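Once the class is on the TabletServer classpath, a client typically configures it on a scan or attaches it to a table with an IteratorSetting. The following is only a sketch; the class name `com.example.MyIterator`, the option name `pattern`, and the existing `scanner`/`connector` objects are assumptions for the example.

```java
import org.apache.accumulo.core.client.IteratorSetting;

// Configure a custom iterator for a single scan. The priority (30) controls
// where it sits in the iterator stack, and the options map is passed to init().
IteratorSetting setting = new IteratorSetting(30, "myIter", "com.example.MyIterator");
setting.addOption("pattern", ".*suffix");
scanner.addScanIterator(setting);

// Alternatively, attach it to the table so it also runs at compaction time:
// connector.tableOperations().attachIterator("mytable", setting);
```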
## Interface

A normal implementation of the SortedKeyValueIterator defines functionality for the following methods:

```java
void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException;

boolean hasTop();

void next() throws IOException;

void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException;

Key getTopKey();

Value getTopValue();

SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env);
```

### init

The `init` method is called by the TabletServer after it constructs an instance of the Iterator. This method should clear/reset any internal state in the Iterator and prepare it to process data. The first argument, the `source`, is the Iterator "below" this Iterator (where the client is at the "top" and the Iterator reading files in HDFS is at the "bottom"). The "source" Iterator provides the Key-Value pairs which this Iterator will operate upon.

The second argument, a Map of options, is made up of options provided by the user, options set in the table's configuration, and/or options set in the containing namespace's configuration. These options allow Iterators to dynamically configure themselves on the fly. If no options are used in the current context (a Scan or Compaction), the Map will be empty. An example of a configuration item for an Iterator could be a pattern used to filter Key-Value pairs in a regular expression Iterator.

The third argument, the `IteratorEnvironment`, is a special object which provides information to this Iterator about the context in which it was invoked. Commonly, this information is not necessary to inspect. For example, if an Iterator knows that it is running in the context of a full major compaction (reading all of the data) as opposed to a user scan (which may strongly limit the number of columns), the Iterator might make different algorithmic decisions in an attempt to optimize itself.

### seek

The `seek` method is likely the most confusing method on the Iterator interface. The purpose of this method is to advance the stream of Key-Value pairs to a certain point in the iteration (the Accumulo table). It is common that before the implementation of this method returns, some additional processing is performed which may further advance the current position past the `startKey` of the `Range`. This, however, is dependent on the functionality the iterator provides. For example, a filtering iterator would consume a number of Key-Value pairs which do not meet its criteria before `seek` returns. The important condition for `seek` to meet is that this Iterator should be ready to return the first Key-Value pair, or none if no such pair is available, when the method returns. The Key-Value pair would be returned by `getTopKey` and `getTopValue`, respectively, and `hasTop` should return a boolean denoting whether or not there is a Key-Value pair to return.

The arguments passed to seek are as follows:

The TabletServer first provides a `Range`, an object which defines some collection of Accumulo `Key`s, which defines the Key-Value pairs that this Iterator should return. Each `Range` has a `startKey` and `endKey` with an inclusive flag for both. While this Range is often similar to the Range(s) set by the client on a Scanner or BatchScanner, it is not guaranteed to be a Range that the client set. Accumulo will split up larger ranges and group them together based on Tablet boundaries per TabletServer.
Iterators should not attempt to implement any custom logic based on the Range(s) provided to `seek`, and Iterators should not return any Keys that fall outside of the provided Range.

The second argument, a `Collection<ByteSequence>`, is the set of column families which should be retained or excluded by this Iterator. The third argument, a boolean, defines whether the collection of column families should be treated as an inclusion collection (true) or an exclusion collection (false).

It is likely that all implementations of `seek` will first make a call to the `seek` method on the "source" Iterator that was provided in the `init` method. The collection of column families and the boolean `include` argument should be passed down as well as the `Range`. Somewhat commonly, the Iterator will also implement some sort of additional logic to find or compute the first Key-Value pair in the provided Range. For example, a regular expression Iterator would consume all records which do not match the given pattern before returning from `seek`.

It is important to retain the original Range passed to this method to know when this Iterator should stop reading more Key-Value pairs. Ignoring this typically does not affect scans from a Scanner, but it will result in duplicate keys emitting from a BatchScanner if the scanned table has more than one tablet. Best practice is to never emit entries outside the seek range.

### next

The `next` method is analogous to the `next` method on a Java Iterator: this method should advance the Iterator to the next Key-Value pair. For implementations that perform some filtering or complex logic, this may result in more than one Key-Value pair being inspected. This method alters some internal state that is exposed via the `hasTop`, `getTopKey`, and `getTopValue` methods.

The result of this method is commonly caching a Key-Value pair which `getTopKey` and `getTopValue` can later return. While there is another Key-Value pair to return, `hasTop` should return true. If there are no more Key-Value pairs to return from this Iterator since the last call to `seek`, `hasTop` should return false.

### hasTop

The `hasTop` method is similar to the `hasNext` method on a Java Iterator in that it informs the caller if there is a Key-Value pair to be returned. If there is no pair to return, this method should return false. Like a Java Iterator, multiple calls to `hasTop` (without calling `next`) should not alter the internal state of the Iterator.

### getTopKey and getTopValue

These methods simply return the current Key-Value pair for this iterator. If `hasTop` returns true, both of these methods should return non-null objects. If `hasTop` returns false, it is undefined what these methods should return. Like `hasTop`, multiple calls to these methods should not alter the state of the Iterator.

Users should take caution when either

1. caching the Key/Value from `getTopKey`/`getTopValue` for use after calling `next` on the source iterator. In this case, the cached Key/Value object is aliased to the reference returned by the source iterator. Iterators may reuse the same Key/Value object in a `next` call for performance reasons, changing the data that the cached Key/Value object references and resulting in a logic bug.
2. modifying the Key/Value from `getTopKey`/`getTopValue`. If the source iterator reuses data stored in the Key/Value, then the source iterator may use the modified data that the Key/Value references. This may or may not result in a logic bug.
In both cases, copying the Key/Value's data into a new object ensures iterator correctness. If neither case applies, it is safe to not copy the Key/Value. The general guideline is to be aware of who else may use Key/Value objects returned from `getTopKey`/`getTopValue`.

### deepCopy

The `deepCopy` method is similar to the `clone` method from the Java `Cloneable` interface. Implementations of this method should return a new object of the same type as the Accumulo Iterator instance it was called on. Any internal state from the instance `deepCopy` was called on should be carried over to the returned copy. The returned copy should be ready to have `seek` called on it. The SortedKeyValueIterator interface guarantees that `init` will be called on an iterator before `deepCopy` and that `init` will not be called on the iterator returned by `deepCopy`.

Typically, implementations of `deepCopy` call a copy constructor which will initialize internal data structures. As with `seek`, it is common for the `IteratorEnvironment` argument to be ignored, as most Iterator implementations can be written without the explicit information the environment provides.

In the analogy of a series of Iterators representing a tree, `deepCopy` can be thought of like early programming assignments which implement their own tree data structures: `deepCopy` calls copy on its sources (the children), copies itself, attaches the copies of the children, and then returns itself.

## TabletServer invocation of Iterators

The following code is a general outline for how TabletServers invoke Iterators.

```java
List<KeyValue> batch;
Range range = getRangeFromClient();
while(!overSizeLimit(batch)){
  SortedKeyValueIterator source = getSystemIterator();

  for(String clzName : getUserIterators()){
    Class<?> clz = Class.forName(clzName);
    SortedKeyValueIterator iter = (SortedKeyValueIterator) clz.newInstance();
    iter.init(source, opts, env);
    source = iter;
  }

  // read a batch of data to return to the client
  // the last iterator, the "top"
  SortedKeyValueIterator topIter = source;
  topIter.seek(getRangeFromUser(), ...);

  while(topIter.hasTop() && !overSizeLimit(batch)){
    key = topIter.getTopKey();
    val = topIter.getTopValue();
    batch.add(new KeyValue(key, val));
    if(systemDataSourcesChanged()){
      // code does not show the isolation case, which will
      // keep using the same data sources until a row boundary is hit
      range = new Range(key, false, range.endKey(), range.endKeyInclusive());
      break;
    }
  }
}
// return batch of key values to client
```

Additionally, the obtuse "re-seek" case can be outlined as the following:

```java
// Given the above
List<KeyValue> batch = getNextBatch();

// Store off lastKeyReturned for this client
lastKeyReturned = batch.get(batch.size() - 1).getKey();

// thread goes away (client stops asking for the next batch).

// Eventually the client comes back
// Setup as before...

Range userRange = getRangeFromUser();
Range actualRange = new Range(lastKeyReturned, false,
    userRange.getEndKey(), userRange.isEndKeyInclusive());

// Use the actualRange, not the user-provided one
topIter.seek(actualRange);
```

## Isolation

Accumulo provides a feature which clients can enable to prevent the viewing of partially applied mutations within the context of rows. If a client is submitting multiple column updates to rows at a time, isolation ensures that the client sees either all of the updates made to a row or none of the updates (until they are all applied).
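On the client side, isolation is enabled on a scanner. A brief sketch, assuming an existing `connector` and a table named `mytable`:

```java
import org.apache.accumulo.core.client.IsolatedScanner;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.security.Authorizations;

// Ask the server-side scan session for a row-isolated view...
Scanner scanner = connector.createScanner("mytable", new Authorizations());
scanner.enableIsolation();

// ...or wrap a Scanner client-side; IsolatedScanner buffers rows so that a
// partially applied row is never handed back to the caller.
Scanner isolated = new IsolatedScanner(connector.createScanner("mytable", new Authorizations()));
```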
When using Isolation, there are additional concerns in iterator design. A scan-time iterator in Accumulo reads from a set of data sources. While an iterator is reading data, it has an isolated view. However, after it returns a key/value, it is possible that Accumulo may switch data sources and re-seek the iterator. This is done so that resources may be reclaimed. When the user does not request isolation, this can occur after any key is returned. When a user enables Isolation, this will only occur after a new row is returned, in which case it will re-seek to the very beginning of the next possible row.

## Abstract Iterators

A number of abstract implementations of Iterators are provided to allow for faster creation of common patterns. The most commonly used abstract implementations are the `Filter` and `Combiner` classes. When possible these classes should be used instead, as they have been thoroughly tested inside Accumulo itself.

### Filter

The `Filter` abstract Iterator provides a very simple implementation which allows subclasses to define whether or not a Key-Value pair should be returned via an `accept(Key, Value)` method.

Filters are extremely simple to implement; however, when the implementation is filtering a large percentage of Key-Value pairs with respect to the total number of pairs examined, it can be very inefficient. For example, if a Filter implementation can determine after examining part of the row that no other pairs in this row will be accepted, there is no mechanism to efficiently skip the remaining Key-Value pairs. Concretely, take a row which is comprised of 1000 Key-Value pairs. After examining the first 10 Key-Value pairs, it is determined that no other Key-Value pairs in this row will be accepted. The Filter must still examine each of the remaining 990 Key-Value pairs in this row. Another way to express this deficiency is that Filters have no means to leverage the `seek` method to efficiently skip large portions of Key-Value pairs.

As such, the `Filter` class functions well for filtering small amounts of data, but is inefficient for filtering large amounts of data. The decision to use a `Filter` strongly depends on the use case and distribution of data being filtered.

### Combiner

The `Combiner` class is another common abstract Iterator. Similar to the `Combiner` interface defined in Hadoop's MapReduce framework, implementations of this abstract class reduce multiple Values for different versions of a Key (Keys which only differ by timestamp) into one Key-Value pair. Combiners provide a simple way to implement common operations like summation and aggregation without the need to implement the entire Accumulo Iterator interface.

One important consideration when choosing to design a Combiner is that the "reduction" operation is often best represented when it is associative and commutative. Operations which do not meet these criteria can be implemented; however, the implementation can be difficult.

A second consideration is that a Combiner is not guaranteed to see every Key-Value pair which differs only by timestamp every time it is invoked. For example, if there are 5 Key-Value pairs in a table which only differ by the timestamps 1, 2, 3, 4, and 5, it is not guaranteed that every invocation of the Combiner will see 5 timestamps. One invocation might see the Values for Keys with timestamps 1 and 4, while another invocation might see the Values for Keys with the timestamps 1, 2, 4 and 5.
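For illustration, a minimal sketch of what extending these abstract classes looks like. The class names and the specific behaviors (accept only the "meta" column family; concatenate all Values for a Key) are invented for the example and not part of the original text.

```java
import java.util.Iterator;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Combiner;
import org.apache.accumulo.core.iterators.Filter;

// A hypothetical Filter: only entries in the "meta" column family are returned.
// Filter supplies seek/next/hasTop; the subclass only decides accept or reject.
class MetaFamilyFilter extends Filter {
  @Override
  public boolean accept(Key k, Value v) {
    return k.getColumnFamily().toString().equals("meta");
  }
}

// A hypothetical Combiner: all Values that differ only by timestamp for a Key
// are concatenated into a single Value.
class ConcatCombiner extends Combiner {
  @Override
  public Value reduce(Key key, Iterator<Value> iter) {
    StringBuilder sb = new StringBuilder();
    while (iter.hasNext()) {
      sb.append(new String(iter.next().get()));
    }
    return new Value(sb.toString().getBytes());
  }
}
```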
Finally, when configuring an Accumulo table to use a Combiner, be sure to disable the Versioning Iterator or set the Combiner at a priority less than the Versioning Iterator (the Versioning Iterator is added at a priority of 20 by default). The Versioning Iterator will filter out multiple Key-Value pairs that differ only by timestamp and return only the Key-Value pair that has the largest timestamp.

## Best practices

Because of the flexibility that the `SortedKeyValueIterator` interface provides, it doesn't directly disallow many implementations which are poor design decisions. The following are some common recommendations to follow and pitfalls to avoid in Iterator implementations.

### Avoid special logic encoded in Ranges

Commonly, granular Ranges that a client passes to an Iterator from a `Scanner` or `BatchScanner` are unmodified. If a `Range` falls within the boundaries of a Tablet, an Iterator will often see that same Range in the `seek` method. However, there is no guarantee that the `Range` will remain unaltered from client to server. As such, Iterators should *never* make assumptions about the current state/context based on the `Range`.

The common failure condition is referred to as a "re-seek". In the context of a Scan, TabletServers construct the "stack" of Iterators and batch up Key-Value pairs to send back to the client. When a sufficient number of Key-Value pairs are collected, it is common for the Iterators to be "torn down" until the client asks for the next batch of Key-Value pairs. This is done by the TabletServer to add fairness in ensuring one Scan does not monopolize the available resources. When the client asks for the next batch, the implementation modifies the original Range so that servers know the point to resume the iteration (to avoid returning duplicate Key-Value pairs). Specifically, the new Range is created from the original but is shortened by setting the startKey of the original Range to the Key last returned by the Scan, non-inclusive.

### `seek`'ing backwards

The ability for an Iterator to "skip over" large blocks of Key-Value pairs is a major tenet behind Iterators. `seek`'ing when it is known that there is a collection of Key-Value pairs which can be ignored can greatly increase the speed of a scan, as many Key-Value pairs do not have to be deserialized and processed.

While the `seek` method provides the `Range` that should be used to `seek` the underlying source Iterator, there is no guarantee that the implementing Iterator uses that `Range` to perform the `seek` on its "source" Iterator. As such, it is possible to seek to any `Range` and the interface has no assertions to prevent this from happening.

Since Iterators are allowed to `seek` to arbitrary Keys, it is also possible for Iterators to create infinite loops inside Scans that repeatedly read the same data without end. If an arbitrary Range must be constructed, a completely new Range should be created, as reusing and mutating Ranges allows for bugs to be introduced which will break Accumulo.

Thus, `seek`s should always be thought of as making "forward progress" in the view of the total iteration. The `startKey` of a `Range` should always be greater than the current Key seen by the Iterator, while the `endKey` of the `Range` should always retain the original `endKey` (and `endKey` inclusivity) of the last `Range` seen by your Iterator's implementation of seek.
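A sketch of what such a forward-progress re-seek might look like inside an iterator. It assumes the iterator extends WrappingIterator (so `getSource()` is available) and that `lastSeekRange`, `columnFamilies`, and `inclusive` were saved off in the iterator's own `seek` method; those names are illustrative, not part of the original text.

```java
// Skip ahead to the start of the next row while preserving the end of the
// range that was last passed to seek().
Key currentTop = getSource().getTopKey();
Key nextRowStart = currentTop.followingKey(PartialKey.ROW);

Range forward = new Range(nextRowStart, true,
    lastSeekRange.getEndKey(), lastSeekRange.isEndKeyInclusive());
getSource().seek(forward, columnFamilies, inclusive);
```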
### Take caution in constructing new data in an Iterator

Implementations of Iterator might be tempted to open BatchWriters inside of an Iterator as a means to implement triggers for writing additional data outside of their client application. The lifecycle of an Iterator is *not* managed in such a way that guarantees that this is safe or efficient. Specifically, there is no way to guarantee that the internal ThreadPool inside of the BatchWriter is closed (and the thread(s) are reaped) without calling the close() method. `close`'ing and recreating a `BatchWriter` after every Key-Value pair is also too prohibitively performance-limiting to be considered an option.

The only safe way to generate additional data in an Iterator is to alter the current Key-Value pair. For example, the `WholeRowIterator` serializes all of the Key-Value pairs that fall within each row. A safe way to generate more data in an Iterator would be to construct an Iterator that is "higher" (at a larger priority) than the `WholeRowIterator`; that is, the Iterator receives the Key-Value pairs which are a serialization of many Key-Value pairs. The custom Iterator could deserialize the pairs, compute some function, add a new Key-Value pair to the original collection, and re-serialize the collection of Key-Value pairs back into a single Key-Value pair.

Any other situation is likely not guaranteed to ensure that the caller (a Scan or a Compaction) will always see all intended data that is generated.

## Final things to remember

Some simple recommendations/points to keep in mind:

### Method call order

On an instance of an Iterator: `init` is always called before `seek`, `seek` is always called before `hasTop`, and `getTopKey` and `getTopValue` will not be called if `hasTop` returns false.

### Teardown

As mentioned, instances of Iterators may be torn down inside of the server transparently. When a complex collection of iterators is performing some advanced functionality, they will not be torn down until a Key-Value pair is returned out of the "stack" of Iterators (and added into the batch of Key-Values to be returned to the caller). Being torn down is equivalent to a new instance of the Iterator being created and `deepCopy` being called on the new instance with the old instance provided as the argument to `deepCopy`. References to the old instance are removed and the object is lazily garbage collected by the JVM.

## Compaction-time Iterators

When Iterators are configured to run during compactions, at the `minc` or `majc` scope, these Iterators sometimes need to make different assertions than those which only operate at scan time. Iterators won't see the delete entries; however, Iterators will not necessarily see all of the Key-Value pairs in every invocation. Because compactions often do not rewrite all files (only a subset of them), the logic may need to take this into consideration.

For example, a Combiner that runs over data during compactions might not see all of the values for a given Key. The Combiner must recognize this and not perform any function that would be incorrect due to the missing values.
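One way an iterator can notice the compaction context described above is through the IteratorEnvironment passed to `init`. The following is a sketch under the assumption that the iterator extends WrappingIterator; the class name and the flag it computes are illustrative only.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
import org.apache.accumulo.core.iterators.WrappingIterator;

// Sketch: record whether this invocation is guaranteed to see all data for a Key.
class ScopeAwareIterator extends WrappingIterator {
  private boolean seesAllData;

  @Override
  public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options,
      IteratorEnvironment env) throws IOException {
    super.init(source, options, env);
    // A scan or a full major compaction reads all files for the tablet; a minc
    // or partial majc only sees a subset, so reductions must stay safe there.
    seesAllData = env.getIteratorScope() == IteratorScope.scan
        || (env.getIteratorScope() == IteratorScope.majc && env.isFullMajorCompaction());
  }
}
```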
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/development/iterator_testing.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/iterator_testing.md b/_docs-unreleased/development/iterator_testing.md (new file)

---
title: Iterator Testing
category: development
order: 2
---

Iterators, while extremely powerful, are notoriously difficult to test. While the API defines the methods an Iterator must implement and each method's functionality, the actual invocation of these methods by Accumulo TabletServers can be surprisingly difficult to mimic in unit tests.

The Apache Accumulo "Iterator Test Harness" is designed to provide a generalized testing framework for all Accumulo Iterators to leverage to identify common pitfalls in user-created Iterators.

## Framework Use

The harness provides an abstract class for use with JUnit4. Users must define the following for this abstract class:

 * A `SortedMap` of input data (`Key`-`Value` pairs)
 * A `Range` to use in tests
 * A `Map` of options (`String` to `String` pairs)
 * A `SortedMap` of output data (`Key`-`Value` pairs)
 * A list of `IteratorTestCase`s (these can be automatically discovered)

The majority of effort a user must make is in creating the input dataset and the expected output dataset for the iterator being tested.

## Normal Test Outline

Most iterator tests will follow the given outline:

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.iteratortest.IteratorTestCaseFinder;
import org.apache.accumulo.iteratortest.IteratorTestInput;
import org.apache.accumulo.iteratortest.IteratorTestOutput;
import org.apache.accumulo.iteratortest.junit4.BaseJUnit4IteratorTest;
import org.apache.accumulo.iteratortest.testcases.IteratorTestCase;
import org.junit.runners.Parameterized.Parameters;

public class MyIteratorTest extends BaseJUnit4IteratorTest {

  @Parameters
  public static Object[][] parameters() {
    final IteratorTestInput input = createIteratorInput();
    final IteratorTestOutput output = createIteratorOutput();
    final List<IteratorTestCase> testCases = IteratorTestCaseFinder.findAllTestCases();
    return BaseJUnit4IteratorTest.createParameters(input, output, testCases);
  }

  private static SortedMap<Key,Value> INPUT_DATA = createInputData();
  private static SortedMap<Key,Value> OUTPUT_DATA = createOutputData();

  private static SortedMap<Key,Value> createInputData() {
    // TODO -- implement this method
  }

  private static SortedMap<Key,Value> createOutputData() {
    // TODO -- implement this method
  }

  private static IteratorTestInput createIteratorInput() {
    final Map<String,String> options = createIteratorOptions();
    final Range range = createRange();
    return new IteratorTestInput(MyIterator.class, options, range, INPUT_DATA);
  }

  private static Map<String,String> createIteratorOptions() {
    // TODO -- implement this method
    // Tip: Use INPUT_DATA if helpful in generating output
  }

  private static Range createRange() {
    // TODO -- implement this method
  }

  private static IteratorTestOutput createIteratorOutput() {
    return new IteratorTestOutput(OUTPUT_DATA);
  }
}
```
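The input and output maps are ordinary sorted maps of Key to Value. A sketch of what `createInputData` might look like (assuming a `java.util.TreeMap` import alongside the others; the row and column names are made up for the example):

```java
private static SortedMap<Key,Value> createInputData() {
  // A tiny, hand-built dataset; TreeMap keeps the entries in sorted Key order.
  TreeMap<Key,Value> data = new TreeMap<>();
  data.put(new Key("row1", "cf", "cq1"), new Value("1".getBytes()));
  data.put(new Key("row1", "cf", "cq2"), new Value("2".getBytes()));
  data.put(new Key("row2", "cf", "cq1"), new Value("3".getBytes()));
  return data;
}
```

`createOutputData` is built the same way, containing the entries the iterator under test is expected to emit for the given range and options.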
## Limitations

While the provided `IteratorTestCase`s should exercise common edge cases in user iterators, there are still many limitations to the existing test harness. Some of them are:

 * Can only specify a single iterator, not many (a "stack")
 * No control over the provided IteratorEnvironment for tests
 * Exercising delete keys (especially with major compactions that do not include all files)

These are left as future improvements to the harness.

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/development/sampling.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/sampling.md b/_docs-unreleased/development/sampling.md (new file)

---
title: Sampling
category: development
order: 4
---

## Overview

Accumulo has the ability to generate and scan a per-table set of sample data. This sample data is kept up to date as a table is mutated. Which key-values are placed in the sample data is configurable per table.

This feature can be used for query estimation and optimization. As an example of estimation, assume an Accumulo table is configured to generate a sample containing one millionth of a table's data. If a query is executed against the sample and returns one thousand results, then the same query against all the data would probably return a billion results. A nice property of having Accumulo generate the sample is that it is always up to date, so estimations will be accurate even when querying the most recently written data.

An example of a query optimization is an iterator using sample data to get an estimate and then making decisions based on that estimate.

## Configuring

In order to use sampling, an Accumulo table must be configured with a class that implements `org.apache.accumulo.core.sample.Sampler` along with options for that class. For guidance on implementing a Sampler see that interface's javadoc. Accumulo provides a few implementations out of the box. For information on how to use the samplers that ship with Accumulo, look in the package `org.apache.accumulo.core.sample` and consult the javadoc of the classes there. See the [sampling example][example] for examples of how to configure a Sampler on a table.

Once a table is configured with a sampler, all writes after that point will generate sample data. For data written before sampling was configured, sample data will not be present. A compaction can be initiated that only compacts the files in the table that do not have sample data. The example readme shows how to do this.

If the sampling configuration of a table is changed, then Accumulo will start generating new sample data with the new configuration. However, old data will still have sample data generated with the previous configuration. A selective compaction can also be issued in this case to regenerate the sample data.

## Scanning sample data

In order to scan sample data, use the `setSamplerConfiguration(...)` method on `Scanner` or `BatchScanner`. Please consult this method's javadoc for more information.

Sample data can also be scanned from within an Accumulo `SortedKeyValueIterator`. To see how to do this, look at the example iterator referenced in the [sampling example][example]. Also, consult the javadoc on `org.apache.accumulo.core.iterators.IteratorEnvironment.cloneWithSamplingEnabled()`.
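For a scanner, this amounts to attaching a SamplerConfiguration that matches the one the table was configured with. A sketch, assuming an existing `connector`, a table named `mytable`, and a table configured with the RowSampler using the option values shown:

```java
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.sample.RowSampler;
import org.apache.accumulo.core.client.sample.SamplerConfiguration;
import org.apache.accumulo.core.security.Authorizations;

// Scan the sample instead of the full table. The SamplerConfiguration must
// match the configuration used to generate the table's sample data; otherwise
// the scan fails with a SampleNotPresentException.
SamplerConfiguration samplerConfig = new SamplerConfiguration(RowSampler.class.getName());
samplerConfig.addOption("hasher", "murmur3_32");
samplerConfig.addOption("modulus", "1009");

Scanner scanner = connector.createScanner("mytable", new Authorizations());
scanner.setSamplerConfiguration(samplerConfig);
```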
MapReduce jobs using the `AccumuloInputFormat` can also read sample data. See the javadoc for the `setSamplerConfiguration()` method on `AccumuloInputFormat`.

Scans over sample data will throw a `SampleNotPresentException` in the following cases:

1. sample data is not present,
2. sample data is present but was generated with multiple configurations, or
3. sample data is partially present.

So a scan over sample data can only succeed if all data written has sample data generated with the same configuration.

## Bulk import

When generating rfiles to bulk import into Accumulo, those rfiles can contain sample data. To use this feature, look at the javadoc on the `AccumuloFileOutputFormat.setSampler(...)` method.

[example]: https://github.com/apache/accumulo-examples/blob/master/docs/sample.md

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/development/security.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/security.md b/_docs-unreleased/development/security.md (new file)

---
title: Security
category: development
order: 6
---

Accumulo extends the BigTable data model to implement a security mechanism known as cell-level security. Every key-value pair has its own security label, stored under the column visibility element of the key, which is used to determine whether a given user meets the security requirements to read the value. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while preserving data confidentiality.

## Security Label Expressions

When mutations are applied, users can specify a security label for each value. This is done as the Mutation is created by passing a ColumnVisibility object to the put() method:

```java
Text rowID = new Text("row1");
Text colFam = new Text("myColFam");
Text colQual = new Text("myColQual");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();

Value value = new Value("myValue".getBytes());

Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);
```

## Security Label Expression Syntax

Security labels consist of a set of user-defined tokens that are required to read the value the label is associated with. The set of tokens required can be specified using syntax that supports logical AND `&` and OR `|` combinations of terms, as well as nesting groups `()` of terms together.

Each term is comprised of one to many alphanumeric characters, hyphens, underscores or periods. Optionally, each term may be wrapped in quotation marks, which removes the restriction on valid characters. In quoted terms, quotation marks and backslash characters can be used as characters in the term by escaping them with a backslash.

For example, suppose within our organization we want to label our data values with security labels defined in terms of user roles.
We might have tokens such as:

    admin
    audit
    system

These can be specified alone or combined using logical operators:

```
// Users must have admin privileges
admin

// Users must have admin and audit privileges
admin&audit

// Users with either admin or audit privileges
admin|audit

// Users must have audit and one or both of admin or system
(admin|system)&audit
```

When both `|` and `&` operators are used, parentheses must be used to specify precedence of the operators.

## Authorization

When clients attempt to read data from Accumulo, any security labels present are examined against the set of authorizations passed by the client code when the Scanner or BatchScanner is created. If the authorizations are determined to be insufficient to satisfy the security label, the value is suppressed from the set of results sent back to the client.

Authorizations are specified as a comma-separated list of tokens the user possesses:

```java
// user possesses both admin and system level access
Authorizations auths = new Authorizations("admin","system");

Scanner s = connector.createScanner("table", auths);
```

## User Authorizations

Each Accumulo user has a set of associated security labels. To manipulate these in the shell while using the default authorizor, use the setauths and getauths commands. These may also be modified for the default authorizor using the Java security operations API.

When a user creates a scanner, a set of Authorizations is passed. If the authorizations passed to the scanner are not a subset of the user's authorizations, then an exception will be thrown.

To prevent users from writing data they can not read, add the visibility constraint to a table. Use the -evc option in the createtable shell command to enable this constraint. For existing tables, use the following shell command to enable the visibility constraint. Ensure the constraint number does not conflict with any existing constraints.

    config -t table -s table.constraint.1=org.apache.accumulo.core.security.VisibilityConstraint

Any user with the alter table permission can add or remove this constraint. This constraint is not applied to bulk imported data; if this is a concern then disable the bulk import permission.

## Pluggable Security

New in 1.5 of Accumulo is a pluggable security mechanism. It can be broken into three actions -- authentication, authorization, and permission handling. By default all of these are handled in Zookeeper, which is how things were handled in Accumulo 1.4 and before. It is worth noting at this point that this is a new feature in 1.5 and may be adjusted in future releases without the standard deprecation cycle.

Authentication simply handles the ability for a user to verify their identity. A combination of principal and authentication token are used to verify a user is who they say they are. An authentication token can be constructed directly through its constructor, but it is advised to use the `init(Property)` method to populate an authentication token. It is expected that a user knows what the appropriate token to use for their system is. The default token is `PasswordToken`.

Once a user is authenticated by the Authenticator, the user has access to the other actions within Accumulo. All actions in Accumulo are ACLed, and this ACL check is handled by the Permission Handler. This is what manages all of the permissions, which are divided into system and per-table level.
From there, if a user is doing an action which requires authorizations, the Authorizor is queried to determine what authorizations the user has.

This setup allows a variety of different mechanisms to be used for handling different aspects of Accumulo's security. A system like Kerberos can be used for authentication, then a system like LDAP could be used to determine if a user has a specific permission, and then it may default back to the default ZookeeperAuthorizor to determine what Authorizations a user is ultimately allowed to use. This is a pluggable system so custom components can be created depending on your need.

## Secure Authorizations Handling

For applications serving many users, it is not expected that an Accumulo user will be created for each application user. In this case an Accumulo user with all authorizations needed by any of the application's users must be created. To service queries, the application should create a scanner with the application user's authorizations. These authorizations could be obtained from a trusted 3rd party.

Often production systems will integrate with Public-Key Infrastructure (PKI) and designate client code within the query layer to negotiate with PKI servers in order to authenticate users and retrieve their authorization tokens (credentials). This requires users to specify only the information necessary to authenticate themselves to the system. Once user identity is established, their credentials can be accessed by the client code and passed to Accumulo outside of the reach of the user.

## Query Services Layer

Since the primary method of interaction with Accumulo is through the Java API, production environments often call for the implementation of a Query layer. This can be done using web services in containers such as Apache Tomcat, but is not a requirement. The Query Services Layer provides a platform on which user-facing applications can be built. This allows the application designers to isolate potentially complex query logic, and enables a convenient point at which to perform essential security functions.

Several production environments choose to implement authentication at this layer, where user identifiers are used to retrieve their access credentials which are then cached within the query layer and presented to Accumulo through the Authorizations mechanism.

Typically, the query services layer sits between Accumulo and user workstations.

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/development/summaries.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/development/summaries.md b/_docs-unreleased/development/summaries.md
new file mode 100644
index 0000000..a86e30d
--- /dev/null
+++ b/_docs-unreleased/development/summaries.md
@@ -0,0 +1,222 @@
---
title: Summary Statistics
category: development
order: 5
---

## Overview

Accumulo has the ability to generate summary statistics about data in a table using user defined functions. Currently these statistics are only generated for data written to files. Data recently written to Accumulo that is still in memory will not contribute to summary statistics.

This feature can be used to inform a user about what data is in their table. Summary statistics can also be used by compaction strategies to make decisions about which files to compact.

Summary data is stored in each file Accumulo produces.
Accumulo can gather summary information from across a cluster, merging it along the way. In order for this to be fast, the summary information should fit in cache. There is a dedicated cache for summary data on each tserver with a configurable size. In order for summary data to fit in cache, it should probably be small.

For information on writing a custom summarizer see the javadoc for `org.apache.accumulo.core.client.summary.Summarizer`. The package `org.apache.accumulo.core.client.summary.summarizers` contains summarizer implementations that ship with Accumulo and can be configured for use.

## Inaccuracies

Summary data can be inaccurate when files are missing summary data or when files have extra summary data. Files can contain data outside of a tablet's boundaries. This can happen as a result of bulk imported files and tablet splits. When this happens, those files could contain extra summary information. Accumulo offsets this somewhat by storing summary information for multiple row ranges per file. However, the ranges are not granular enough to completely offset extra data.

Any source of inaccuracies is reported when summary information is requested. In the shell examples below this can be seen on the `File Statistics` line. For files missing summary information, the compact command in the shell has a `--sf-no-summary` option. This option compacts files that do not have the summary information configured for the table. The compact command also has the `--sf-extra-summary` option which will compact files with extra summary information.

## Configuring

The following tablet server and table properties configure summarization.

* [tserver.cache.summary.size]({{page.docs_baseurl}}/administration/configuration-properties#tserver_cache_summary_size)
* [tserver.summary.partition.threads]({{page.docs_baseurl}}/administration/configuration-properties#tserver_summary_partition_threads)
* [tserver.summary.remote.threads]({{page.docs_baseurl}}/administration/configuration-properties#tserver_summary_remote_threads)
* [tserver.summary.retrieval.threads]({{page.docs_baseurl}}/administration/configuration-properties#tserver_summary_retreival_threads)
* [table.summarizer.*]({{page.docs_baseurl}}/administration/configuration-properties#table_summarizer_prefix)
* [table.file.summary.maxSize]({{page.docs_baseurl}}/administration/configuration-properties#table_file_summary_maxSize)

## Permissions

Because summary data may be derived from sensitive data, requesting summary data requires a special permission. A user must have the table permission `GET_SUMMARIES` in order to retrieve summary data.

## Bulk import

When generating rfiles to bulk import into Accumulo, those rfiles can contain summary data. To use this feature, look at the javadoc on the `AccumuloFileOutputFormat.setSummarizers(...)` method. Also, `org.apache.accumulo.core.client.rfile.RFile` has options for creating RFiles with embedded summary data.

## Examples

This example walks through using summarizers in the Accumulo shell. Below a table is created and some data is inserted to summarize.
    root@uno> createtable summary_test
    root@uno summary_test> setauths -u root -s PI,GEO,TIME
    root@uno summary_test> insert 3b503bd name last Doe
    root@uno summary_test> insert 3b503bd name first John
    root@uno summary_test> insert 3b503bd contact address "123 Park Ave, NY, NY" -l PI&GEO
    root@uno summary_test> insert 3b503bd date birth "1/11/1942" -l PI&TIME
    root@uno summary_test> insert 3b503bd date married "5/11/1962" -l PI&TIME
    root@uno summary_test> insert 3b503bd contact home_phone 1-123-456-7890 -l PI
    root@uno summary_test> insert d5d18dd contact address "50 Lake Shore Dr, Chicago, IL" -l PI&GEO
    root@uno summary_test> insert d5d18dd name first Jane
    root@uno summary_test> insert d5d18dd name last Doe
    root@uno summary_test> insert d5d18dd date birth 8/15/1969 -l PI&TIME
    root@uno summary_test> scan -s PI,GEO,TIME
    3b503bd contact:address [PI&GEO]    123 Park Ave, NY, NY
    3b503bd contact:home_phone [PI]    1-123-456-7890
    3b503bd date:birth [PI&TIME]    1/11/1942
    3b503bd date:married [PI&TIME]    5/11/1962
    3b503bd name:first []    John
    3b503bd name:last []    Doe
    d5d18dd contact:address [PI&GEO]    50 Lake Shore Dr, Chicago, IL
    d5d18dd date:birth [PI&TIME]    8/15/1969
    d5d18dd name:first []    Jane
    d5d18dd name:last []    Doe

After inserting the data, summaries are requested below. No summaries are returned.

    root@uno summary_test> summaries

The visibility summarizer is configured below and the table is flushed. Flushing the table creates a file, creating summary data in the process. The summary data returned counts how many times each column visibility occurred. The statistics with a `c:` prefix are visibilities. The others are generic statistics created by the CountingSummarizer that VisibilitySummarizer extends.

    root@uno summary_test> config -t summary_test -s table.summarizer.vis=org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer
    root@uno summary_test> summaries
    root@uno summary_test> flush -w
    2017-02-24 19:54:46,090 [shell.Shell] INFO : Flush of table summary_test completed.
    root@uno summary_test> summaries
     Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {}
     File Statistics : [total:1, missing:0, extra:0, large:0]
     Summary Statistics :
        c: = 4
        c:PI = 1
        c:PI&GEO = 2
        c:PI&TIME = 3
        emitted = 10
        seen = 10
        tooLong = 0
        tooMany = 0

VisibilitySummarizer has an option `maxCounters` that determines the max number of column visibilities it will track. Below, this option is set and a compaction is forced to regenerate summary data. The new summary data only has three visibilities and now the `tooMany` statistic is 4. This is the number of visibilities that were not counted.

```
root@uno summary_test> config -t summary_test -s table.summarizer.vis.opt.maxCounters=3
root@uno summary_test> compact -w
2017-02-24 19:54:46,267 [shell.Shell] INFO : Compacting table ...
2017-02-24 19:54:47,127 [shell.Shell] INFO : Compaction of table summary_test completed for given range
root@uno summary_test> summaries
 Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3}
 File Statistics : [total:1, missing:0, extra:0, large:0]
 Summary Statistics :
    c:PI = 1
    c:PI&GEO = 2
    c:PI&TIME = 3
    emitted = 10
    seen = 10
    tooLong = 0
    tooMany = 4
```

Another summarizer is configured below that tracks the number of deletes. Also a compaction strategy that uses this summary data is configured.
The `TooManyDeletesCompactionStrategy` will force a compaction of the tablet when the ratio of deletes to non-deletes is over 25%. This threshold is configurable. Below a delete is added and it is reflected in the statistics. In this case there is 1 delete and 10 non-deletes, not enough to force a compaction of the tablet.

```
root@uno summary_test> config -t summary_test -s table.summarizer.del=org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer
root@uno summary_test> compact -w
2017-02-24 19:54:47,282 [shell.Shell] INFO : Compacting table ...
2017-02-24 19:54:49,236 [shell.Shell] INFO : Compaction of table summary_test completed for given range
root@uno summary_test> config -t summary_test -s table.compaction.major.ratio=10
root@uno summary_test> config -t summary_test -s table.majc.compaction.strategy=org.apache.accumulo.tserver.compaction.strategies.TooManyDeletesCompactionStrategy
root@uno summary_test> deletemany -r d5d18dd -c date -f
[DELETED] d5d18dd date:birth [PI&TIME]
root@uno summary_test> flush -w
2017-02-24 19:54:49,686 [shell.Shell] INFO : Flush of table summary_test completed.
root@uno summary_test> summaries
 Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3}
 File Statistics : [total:2, missing:0, extra:0, large:0]
 Summary Statistics :
    c:PI = 1
    c:PI&GEO = 2
    c:PI&TIME = 4
    emitted = 11
    seen = 11
    tooLong = 0
    tooMany = 4

 Summarizer : org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer del {}
 File Statistics : [total:2, missing:0, extra:0, large:0]
 Summary Statistics :
    deletes = 1
    total = 11
```

Some more deletes are added and the table is flushed below. This results in 4 deletes and 10 non-deletes, which triggers a full compaction. A full compaction of all files is the only time when delete markers are dropped. The compaction ratio was set to 10 above to show that the number of files did not trigger the compaction. After the compaction there are no deletes and 6 non-deletes.

```
root@uno summary_test> deletemany -r d5d18dd -f
[DELETED] d5d18dd contact:address [PI&GEO]
[DELETED] d5d18dd name:first []
[DELETED] d5d18dd name:last []
root@uno summary_test> flush -w
2017-02-24 19:54:52,800 [shell.Shell] INFO : Flush of table summary_test completed.
root@uno summary_test> summaries
 Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3}
 File Statistics : [total:1, missing:0, extra:0, large:0]
 Summary Statistics :
    c:PI = 1
    c:PI&GEO = 1
    c:PI&TIME = 2
    emitted = 6
    seen = 6
    tooLong = 0
    tooMany = 2

 Summarizer : org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer del {}
 File Statistics : [total:1, missing:0, extra:0, large:0]
 Summary Statistics :
    deletes = 0
    total = 6
root@uno summary_test>
```
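Summary data can also be requested programmatically. The following is a minimal sketch, not part of the original example, assuming an existing `Connector` named `conn` and the `summary_test` table configured as above; see the javadoc for `org.apache.accumulo.core.client.admin.SummaryRetriever` for the full set of options.

```java
// Retrieve all summary data for the table and print each statistic.
// Assumes 'conn' is a Connector and the summarizers above are configured.
SummaryRetriever retriever = conn.tableOperations().summaries("summary_test");
for (Summary summary : retriever.retrieve()) {
  System.out.println(summary.getSummarizerConfiguration());
  for (Entry<String, Long> stat : summary.getStatistics().entrySet()) {
    System.out.println("  " + stat.getKey() + " = " + stat.getValue());
  }
}
```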
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/getting-started/clients.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/getting-started/clients.md b/_docs-unreleased/getting-started/clients.md
new file mode 100644
index 0000000..a78a660
--- /dev/null
+++ b/_docs-unreleased/getting-started/clients.md
@@ -0,0 +1,381 @@
---
title: Accumulo Clients
category: getting-started
order: 2
---

## Running Client Code

There are multiple ways to run Java code that uses Accumulo. Below is a list of the different ways to execute client code.

* using the `java` command
* using the `accumulo` command
* using the `accumulo-util hadoop-jar` command

### Using the java command

To run Accumulo client code using the `java` command, use the `accumulo classpath` command to include all of Accumulo's dependencies on your classpath:

    java -classpath /path/to/my.jar:/path/to/dep.jar:$(accumulo classpath) com.my.Main arg1 arg2

If you would like to review which jars are included, the `accumulo classpath` command can output a more human readable format using the `-d` option which enables debugging:

    accumulo classpath -d

### Using the accumulo command

Another option for running your code is to use the Accumulo script which can execute a main class (if it exists on its classpath):

    accumulo com.foo.Client arg1 arg2

While the Accumulo script will add all of Accumulo's dependencies to the classpath, you will need to add any jars that you create or depend on beyond what Accumulo already depends on. This can be accomplished by either adding the jars to the `lib/ext` directory of your Accumulo installation or by adding jars to the CLASSPATH variable before calling the accumulo command.

    export CLASSPATH=/path/to/my.jar:/path/to/dep.jar; accumulo com.foo.Client arg1 arg2

### Using the 'accumulo-util hadoop-jar' command

If you are writing a MapReduce job that accesses Accumulo, then you can use `accumulo-util hadoop-jar` to run those jobs. See the MapReduce example.

## Connecting

All clients must first identify the Accumulo instance to which they will be communicating. Code to do this is as follows:

```java
String instanceName = "myinstance";
String zooServers = "zooserver-one,zooserver-two";
Instance inst = new ZooKeeperInstance(instanceName, zooServers);

Connector conn = inst.getConnector("user", new PasswordToken("passwd"));
```

The PasswordToken is the most common implementation of an `AuthenticationToken`. This general interface allows authentication as an Accumulo user to come from a variety of sources or means. The CredentialProviderToken leverages the Hadoop CredentialProviders (new in Hadoop 2.6).

For example, the CredentialProviderToken can be used in conjunction with a Java KeyStore to avoid storing passwords in cleartext. When stored in HDFS, a single KeyStore can be used across an entire instance. Be aware that KeyStores stored on the local filesystem must be made available to all nodes in the Accumulo cluster.

```java
KerberosToken token = new KerberosToken();
Connector conn = inst.getConnector(token.getPrincipal(), token);
```

The KerberosToken can be provided to use the authentication provided by Kerberos. Using Kerberos requires external setup and additional configuration, but provides a single point of authentication through HDFS, YARN and ZooKeeper, allowing for password-less authentication with Accumulo.

## Writing Data

Data are written to Accumulo by creating Mutation objects that represent all the changes to the columns of a single row. The changes are made atomically in the TabletServer. Clients then add Mutations to a BatchWriter which submits them to the appropriate TabletServers.
Mutations can be created thus:

```java
Text rowID = new Text("row1");
Text colFam = new Text("myColFam");
Text colQual = new Text("myColQual");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();

Value value = new Value("myValue".getBytes());

Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);
```

### BatchWriter

The BatchWriter is highly optimized to send Mutations to multiple TabletServers and automatically batches Mutations destined for the same TabletServer to amortize network overhead. Care must be taken to avoid changing the contents of any Object passed to the BatchWriter since it keeps objects in memory while batching.

Mutations are added to a BatchWriter thus:

```java
// BatchWriterConfig has reasonable defaults
BatchWriterConfig config = new BatchWriterConfig();
config.setMaxMemory(10000000L); // bytes available to batchwriter for buffering mutations

BatchWriter writer = conn.createBatchWriter("table", config);

writer.addMutation(mutation);

writer.close();
```

For more example code, see the [batch writing and scanning example](https://github.com/apache/accumulo-examples/blob/master/docs/batch.md).

### ConditionalWriter

The ConditionalWriter enables efficient, atomic read-modify-write operations on rows. The ConditionalWriter writes special Mutations which have a list of per-column conditions that must all be met before the mutation is applied. The conditions are checked in the tablet server while a row lock is held (Mutations written by the BatchWriter will not obtain a row lock). The conditions that can be checked for a column are equality and absence. For example, a conditional mutation can require that column A is absent in order to be applied. Iterators can be applied when checking conditions. Using iterators, many other operations besides equality and absence can be checked. For example, using an iterator that converts values less than 5 to 0 and everything else to 1, it is possible to only apply a mutation when a column is less than 5.

In the case when a tablet server dies after a client sent a conditional mutation, it is not known whether the mutation was applied. When this happens the ConditionalWriter reports a status of UNKNOWN for the ConditionalMutation. In many cases this situation can be dealt with by simply reading the row again and possibly sending another conditional mutation. If this is not sufficient, then a higher level of abstraction can be built by storing transactional information within a row.

See the [reservations example](https://github.com/apache/accumulo-examples/blob/master/docs/reservations.md) for example code that uses the conditional writer.
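As a quick illustration of the API described above, here is a minimal sketch. It is not from the original example; it assumes an existing `Connector` named `conn` and `Authorizations` named `auths`, and the table and `meta:lock` column names are placeholders.

```java
ConditionalWriterConfig cwConfig = new ConditionalWriterConfig().setAuthorizations(auths);
ConditionalWriter cw = conn.createConditionalWriter("table", cwConfig);

// Require the hypothetical "meta:lock" column to be absent before the update is applied.
ConditionalMutation cm = new ConditionalMutation("row1");
cm.addCondition(new Condition("meta", "lock"));
cm.put(new Text("meta"), new Text("lock"), new Value("owner-1234".getBytes()));

ConditionalWriter.Result result = cw.write(cm);
switch (result.getStatus()) {
  case ACCEPTED: // the condition held and the mutation was applied
    break;
  case REJECTED: // the column already existed, so the mutation was not applied
    break;
  case UNKNOWN:  // the tablet server died; re-read the row and retry if needed
    break;
  default:
    break;
}

cw.close();
```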
### Durability

By default, Accumulo writes out any updates to the Write-Ahead Log (WAL). Every change goes into a file in HDFS and is sync'd to disk for maximum durability. In the event of a failure, writes held in memory are replayed from the WAL. Like all files in HDFS, this file is also replicated. Sending updates to the replicas, and waiting for a permanent sync to disk, can significantly impact write speeds.

Accumulo allows users to use less tolerant forms of durability when writing. These levels are:

* none: no durability guarantees are made, the WAL is not used
* log: the WAL is used, but not flushed; loss of the server probably means recent writes are lost
* flush: updates are written to the WAL, and flushed out to replicas; loss of a single server is unlikely to result in data loss
* sync: updates are written to the WAL, and synced to disk on all replicas before the write is acknowledged. Data will not be lost even if the entire cluster suddenly loses power.

The user can set the default durability of a table in the shell. When writing, the user can configure the BatchWriter or ConditionalWriter to use a different level of durability for the session. This will override the default durability setting.

```java
BatchWriterConfig cfg = new BatchWriterConfig();
// We don't care about data loss with these writes:
// This is DANGEROUS:
cfg.setDurability(Durability.NONE);

Connector conn = ... ;
BatchWriter bw = conn.createBatchWriter(table, cfg);
```

## Reading Data

Accumulo is optimized to quickly retrieve the value associated with a given key, and to efficiently return ranges of consecutive keys and their associated values.

### Scanner

To retrieve data, clients use a Scanner, which acts like an Iterator over keys and values. Scanners can be configured to start and stop at particular keys, and to return a subset of the columns available.

```java
// specify which visibilities we are allowed to see
Authorizations auths = new Authorizations("public");

Scanner scan =
  conn.createScanner("table", auths);

scan.setRange(new Range("harry","john"));
scan.fetchColumnFamily(new Text("attributes"));

for(Entry<Key,Value> entry : scan) {
  Text row = entry.getKey().getRow();
  Value value = entry.getValue();
}
```

### Isolated Scanner

Accumulo supports the ability to present an isolated view of rows when scanning. There are three possible ways that a row could change in Accumulo:

* a mutation applied to a table
* iterators executed as part of a minor or major compaction
* bulk import of new files

Isolation guarantees that either all or none of the changes made by these operations on a row are seen. Use the IsolatedScanner to obtain an isolated view of an Accumulo table. When using the regular scanner it is possible to see a non-isolated view of a row. For example, if a mutation modifies three columns, it is possible that you will only see two of those modifications. With the isolated scanner either all three of the changes are seen or none.

The IsolatedScanner buffers rows on the client side so a large row will not crash a tablet server. By default rows are buffered in memory, but the user can easily supply their own buffer if they wish to buffer to disk when rows are large.

See the [isolation example](https://github.com/apache/accumulo-examples/blob/master/docs/isolation.md) for example code that uses the IsolatedScanner.
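As a small sketch of the wrapping described above (not part of the original example, reusing the `conn` and `auths` objects from the Scanner example and a placeholder table name):

```java
// Wrap a regular Scanner to get row-isolated reads; rows are buffered
// on the client side, in memory by default.
Scanner isolated = new IsolatedScanner(conn.createScanner("table", auths));
isolated.setRange(new Range("harry", "john"));

for (Entry<Key,Value> entry : isolated) {
  // all of a row's changes are seen together, or not at all
}
```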
### BatchScanner

For some types of access, it is more efficient to retrieve several ranges simultaneously. This arises when accessing a set of rows that are not consecutive, whose IDs have been retrieved from a secondary index, for example.

The BatchScanner is configured similarly to the Scanner; it can be configured to retrieve a subset of the columns available, but rather than passing a single Range, BatchScanners accept a set of Ranges. It is important to note that the keys returned by a BatchScanner are not in sorted order since the keys streamed are from multiple TabletServers in parallel.

```java
ArrayList<Range> ranges = new ArrayList<Range>();
// populate list of ranges ...

BatchScanner bscan =
  conn.createBatchScanner("table", auths, 10);
bscan.setRanges(ranges);
bscan.fetchColumnFamily(new Text("attributes"));

for(Entry<Key,Value> entry : bscan) {
  System.out.println(entry.getValue());
}
```

For more example code, see the [batch writing and scanning example](https://github.com/apache/accumulo-examples/blob/master/docs/batch.md).

At this time, there is no client side isolation support for the BatchScanner. You may consider using the WholeRowIterator with the BatchScanner to achieve isolation. The drawback of this approach is that entire rows are read into memory on the server side. If a row is too big, it may crash a tablet server.

## Proxy

The proxy API allows interaction with Accumulo from languages other than Java. A proxy server is provided in the codebase and a client can further be generated. The proxy API can also be used instead of the traditional ZooKeeperInstance class to provide a single TCP port through which clients can be securely routed through a firewall, without requiring access to all tablet servers in the cluster.

### Prerequisites

The proxy server can live on any node on which the basic client API would work. That means it must be able to communicate with the Master, ZooKeepers, NameNode, and the DataNodes. A proxy client only needs the ability to communicate with the proxy server.

### Configuration

The configuration options for the proxy server live inside of a properties file. At the very least, you need to supply the following properties:

    protocolFactory=org.apache.thrift.protocol.TCompactProtocol$Factory
    tokenClass=org.apache.accumulo.core.client.security.tokens.PasswordToken
    port=42424
    instance=test
    zookeepers=localhost:2181

You can find a sample configuration file in your distribution at `proxy/proxy.properties`.

This sample configuration file further demonstrates an ability to back the proxy server by MockAccumulo or the MiniAccumuloCluster.

### Running the Proxy Server

After the properties file holding the configuration is created, the proxy server can be started using the following command in the Accumulo distribution (assuming your properties file is named `config.properties`):

    accumulo proxy -p config.properties

### Creating a Proxy Client

Aside from installing the Thrift compiler, you will also need the language-specific library for Thrift installed to generate client code in that language. Typically, your operating system's package manager will be able to automatically install these for you in an expected location such as `/usr/lib/python/site-packages/thrift`.

You can find the thrift file for generating the client at `proxy/proxy.thrift`.

After a client is generated, the port specified in the configuration properties above will be used to connect to the server.

### Using a Proxy Client

The following examples have been written in Java and the method signatures may be slightly different depending on the language specified when generating the client with the Thrift compiler. After initiating a connection to the Proxy (see Apache Thrift's documentation for examples of connecting to a Thrift service), the methods on the proxy client will be available.
The first thing to do is log in:

```java
Map<String, String> password = new HashMap<String, String>();
password.put("password", "secret");
ByteBuffer token = client.login("root", password);
```

Once logged in, the token returned will be used for most subsequent calls to the client. Let's create a table, add some data, scan the table, and delete it.

First, create a table.

```java
client.createTable(token, "myTable", true, TimeType.MILLIS);
```

Next, add some data:

```java
// first, create a writer on the server
String writer = client.createWriter(token, "myTable", new WriterOptions());

// rowid
ByteBuffer rowid = ByteBuffer.wrap("UUID".getBytes());

// mutation like class
ColumnUpdate cu = new ColumnUpdate();
cu.setColFamily("MyFamily".getBytes());
cu.setColQualifier("MyQualifier".getBytes());
cu.setColVisibility("VisLabel".getBytes());
cu.setValue("Some Value.".getBytes());

List<ColumnUpdate> updates = new ArrayList<ColumnUpdate>();
updates.add(cu);

// build column updates
Map<ByteBuffer, List<ColumnUpdate>> cellsToUpdate = new HashMap<ByteBuffer, List<ColumnUpdate>>();
cellsToUpdate.put(rowid, updates);

// send updates to the server through the writer
client.update(writer, cellsToUpdate);

client.closeWriter(writer);
```

Scan for the data and batch the return of the results on the server:

```java
String scanner = client.createScanner(token, "myTable", new ScanOptions());
ScanResult results = client.nextK(scanner, 100);

for(KeyValue keyValue : results.getResults()) {
  // do something with results
}

client.closeScanner(scanner);
```
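Finally, the walkthrough above also promised to delete the table. A minimal sketch of that last step, assuming the same `client` and `token` and the `deleteTable` call generated from `proxy.thrift`:

```java
// clean up: drop the example table once we are done with it
client.deleteTable(token, "myTable");
```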