From: buildbot@apache.org
To: accumulo-commits@incubator.apache.org
Subject: svn commit: r863067 [3/7] - in /websites/staging/accumulo/trunk/content: ./ 1.5/ 1.5/examples/ 1.5/images/ downloads/
Date: Sat, 25 May 2013 02:56:29 -0000
Message-Id: <20130525025631.1C8C62388B1B@eris.apache.org>
Added: 
websites/staging/accumulo/trunk/content/1.5/examples/batch.html ============================================================================== --- websites/staging/accumulo/trunk/content/1.5/examples/batch.html (added) +++ websites/staging/accumulo/trunk/content/1.5/examples/batch.html Sat May 25 02:56:29 2013 @@ -0,0 +1,145 @@ + + + + + + Apache Accumulo Batch Writing and Scanning Example + + + + + + + + + + + + +
+ ™ +
+ +
+

Apache Accumulo Batch Writing and Scanning Example

+

This tutorial uses the following Java classes, which can be found in org.apache.accumulo.examples.simple.client in the simple-examples module:

+
    +
  • SequentialBatchWriter.java - writes mutations with sequential rows and random values
  • +
  • RandomBatchWriter.java - used by SequentialBatchWriter to generate random values
  • +
  • RandomBatchScanner.java - reads random rows and verifies their values
  • +
+

This is an example of how to use the batch writer and batch scanner. To compile the example, run Maven and copy the produced jar into the Accumulo lib directory. This is already done in the tar distribution.

+

Below are commands that add 10000 entries to Accumulo and then do 100 random queries. The write command generates random 50-byte values.

+

Be sure to use the name of your instance (given as instance here) and the appropriate +list of zookeeper nodes (given as zookeepers here).

+

Before you run this, you must ensure that the user you are running as has the "exampleVis" authorization. (You can set this in the shell with "setauths -u username -s exampleVis".)

+
$ ./bin/accumulo shell -u root -e "setauths -u username -s exampleVis"
+
+ + +

You must also create the table, batchtest1, ahead of time. (In the shell, use "createtable batchtest1")

+
$ ./bin/accumulo shell -u username -e "createtable batchtest1"
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.SequentialBatchWriter -i instance -z zookeepers -u username -p password -t batchtest1 --start 0 
 --num 10000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20 --vis exampleVis
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner -i instance -z zookeepers -u username -p password -t batchtest1 --num 100 --min 0 --max 10000 --size 50 --scanThreads 20 --vis exampleVis
+07 11:33:11,103 [client.CountingVerifyingReceiver] INFO : Generating 100 random queries...
+07 11:33:11,112 [client.CountingVerifyingReceiver] INFO : finished
+07 11:33:11,260 [client.CountingVerifyingReceiver] INFO : 694.44 lookups/sec   0.14 secs
+
+07 11:33:11,260 [client.CountingVerifyingReceiver] INFO : num results : 100
+
+07 11:33:11,364 [client.CountingVerifyingReceiver] INFO : Generating 100 random queries...
+07 11:33:11,370 [client.CountingVerifyingReceiver] INFO : finished
+07 11:33:11,416 [client.CountingVerifyingReceiver] INFO : 2173.91 lookups/sec   0.05 secs
+
+07 11:33:11,416 [client.CountingVerifyingReceiver] INFO : num results : 100
+
+
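The --start and --num options above control the row range the writer covers. As a rough sketch of the row-key scheme these examples appear to use ("row_" plus a 10-digit zero-padded row number, as seen in output such as row_0000001169; the exact format lives in RandomBatchWriter and is an assumption here):

```java
public class RowKeys {
    // Illustrative assumption: "row_" followed by the row number,
    // zero-padded to 10 digits, matching rows like row_0000001169.
    static String rowKey(long row) {
        return String.format("row_%010d", row);
    }

    public static void main(String[] args) {
        // Sequential rows starting at --start 0
        for (long r = 0; r < 3; r++) {
            System.out.println(rowKey(r));
        }
        System.out.println(rowKey(1169)); // row_0000001169
    }
}
```

Zero-padding keeps the lexicographic sort order of the row keys identical to the numeric order, which is what makes ranges like --start/--num meaningful.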
+ + + + + Added: websites/staging/accumulo/trunk/content/1.5/examples/bloom.html ============================================================================== --- websites/staging/accumulo/trunk/content/1.5/examples/bloom.html (added) +++ websites/staging/accumulo/trunk/content/1.5/examples/bloom.html Sat May 25 02:56:29 2013 @@ -0,0 +1,317 @@ + + + + + + Apache Accumulo Bloom Filter Example + + + + + + + + + + + + +
+ ™ +
+ +
+

Apache Accumulo Bloom Filter Example

+

This example shows how to create a table with bloom filters enabled. It also +shows how bloom filters increase query performance when looking for values that +do not exist in a table.

+

Below, a table named bloom_test is created and bloom filters are enabled.

+
$ ./bin/accumulo shell -u username -p password
+Shell - Apache Accumulo Interactive Shell
+- version: 1.5.0
+- instance name: instance
+- instance id: 00000000-0000-0000-0000-000000000000
+- 
+- type 'help' for a list of available commands
+- 
+username@instance> setauths -u username -s exampleVis
+username@instance> createtable bloom_test
+username@instance bloom_test> config -t bloom_test -s table.bloom.enabled=true
+username@instance bloom_test> exit
+
+ + +

Below, 1 million random values are inserted into Accumulo. The randomly generated rows range between 0 and 1 billion. The random number generator is initialized with the seed 7.

+
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis
+
+ + +

Below the table is flushed:

+
$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test -w'
+05 10:40:06,069 [shell.Shell] INFO : Flush of table bloom_test completed.
+
+ + +

After the flush completes, 500 random queries are done against the table. The +same seed is used to generate the queries, therefore everything is found in the +table.

+
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --vis exampleVis
+Generating 500 random queries...finished
+96.19 lookups/sec   5.20 secs
+num results : 500
+Generating 500 random queries...finished
+102.35 lookups/sec   4.89 secs
+num results : 500
+
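The reason the same --seed finds every row is ordinary seeded-PRNG determinism: two generators initialized with the same seed emit the identical sequence, so the scanner regenerates exactly the rows the writer inserted. A minimal illustration of the property (using java.util.Random; the example classes' internal generator is an assumption, not shown here):

```java
import java.util.Arrays;
import java.util.Random;

public class SeededLookups {
    // Draw n row numbers in [0, max) from a PRNG with the given seed.
    static long[] draw(long seed, int n, long max) {
        Random rng = new Random(seed);
        long[] rows = new long[n];
        for (int i = 0; i < n; i++) {
            rows[i] = Math.floorMod(rng.nextLong(), max);
        }
        return rows;
    }

    public static void main(String[] args) {
        long[] written = draw(7, 500, 1_000_000_000L); // what the writer inserted
        long[] queried = draw(7, 500, 1_000_000_000L); // what the scanner looks up
        // Same seed, same sequence: every query hits a written row.
        System.out.println(Arrays.equals(written, queried)); // true
    }
}
```

This is also why seed 8 in the next run finds nothing: it regenerates a different set of rows, none of which were written.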
+ + +

Below another 500 queries are performed, using a different seed which results +in nothing being found. In this case the lookups are much faster because of +the bloom filters.

+
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 8 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
+Generating 500 random queries...finished
+2212.39 lookups/sec   0.23 secs
+num results : 0
+Did not find 500 rows
+Generating 500 random queries...finished
+4464.29 lookups/sec   0.11 secs
+num results : 0
+Did not find 500 rows
+
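The speedup for absent rows comes from the bloom filter's one-sided error: it can answer "definitely not in this file" with no false negatives, so a negative lookup skips the file entirely. A toy version of the idea (not Accumulo's actual implementation, which is configured via table.bloom.* properties):

```java
import java.util.BitSet;

// Toy bloom filter: k hash probes into an m-bit array. Membership tests
// can yield false positives but never false negatives, which is exactly
// the property that lets a store skip files for rows never written.
public class ToyBloom {
    private final BitSet bits;
    private final int m, k;

    ToyBloom(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    // i-th probe position for a key (simple derived hash, for illustration).
    private int probe(String key, int i) {
        return Math.floorMod(key.hashCode() * 31 + i * 0x9E3779B9, m);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) if (!bits.get(probe(key, i))) return false;
        return true; // "maybe": caller must still check the file
    }

    public static void main(String[] args) {
        ToyBloom bf = new ToyBloom(1 << 16, 4);
        for (int r = 0; r < 1000; r++) bf.add(String.format("row_%010d", r));
        System.out.println(bf.mightContain("row_0000000042")); // true: no false negatives
        // Keys never added are usually rejected without reading the file at all.
    }
}
```

Every written key must answer "maybe", while most absent keys answer "no" immediately; the occasional false positive just costs one unnecessary file read.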
+ + +
+

Bloom filters can also speed up lookups for entries that exist. In Accumulo, data is divided into tablets, and each tablet has multiple map files. Every lookup in Accumulo goes to a specific tablet, where a lookup is done on each map file in the tablet. So if a tablet has three map files, lookup performance can be three times slower than for a tablet with one map file. However, if the map files contain unique sets of data, then bloom filters can help eliminate map files that do not contain the row being looked up. To illustrate this, two otherwise identical tables were created using the following process. One table had bloom filters, the other did not. Also, the major compaction ratio was increased to prevent the files from being compacted into one file.

+
    +
  • Insert 1 million entries using RandomBatchWriter with a seed of 7
  • +
  • Flush the table using the shell
  • +
  • Insert 1 million entries using RandomBatchWriter with a seed of 8
  • +
  • Flush the table using the shell
  • +
  • Insert 1 million entries using RandomBatchWriter with a seed of 9
  • +
  • Flush the table using the shell
  • +
+

After following the above steps, each table will have a tablet with three map +files. Flushing the table after each batch of inserts will create a map file. +Each map file will contain 1 million entries generated with a different seed. +This is assuming that Accumulo is configured with enough memory to hold 1 +million inserts. If not, then more map files will be created.

+

The commands for creating the first table without bloom filters are below.

+
$ ./bin/accumulo shell -u username -p password
+Shell - Apache Accumulo Interactive Shell
+- version: 1.5.0
+- instance name: instance
+- instance id: 00000000-0000-0000-0000-000000000000
+- 
+- type 'help' for a list of available commands
+- 
+username@instance> setauths -u username -s exampleVis
+username@instance> createtable bloom_test1
+username@instance bloom_test1> config -t bloom_test1 -s table.compaction.major.ratio=7
+username@instance bloom_test1> exit
+
+$ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test1 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --auths exampleVis"
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
+
+ + +

The commands for creating the second table with bloom filters are below.

+
$ ./bin/accumulo shell -u username -p password
+Shell - Apache Accumulo Interactive Shell
+- version: 1.5.0
+- instance name: instance
+- instance id: 00000000-0000-0000-0000-000000000000
+- 
+- type 'help' for a list of available commands
+- 
+username@instance> setauths -u username -s exampleVis
+username@instance> createtable bloom_test2
+username@instance bloom_test2> config -t bloom_test2 -s table.compaction.major.ratio=7
+username@instance bloom_test2> config -t bloom_test2 -s table.bloom.enabled=true
+username@instance bloom_test2> exit
+
+$ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test2 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --auths exampleVis"
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
+
+ + +

Below, 500 lookups are done against the table without bloom filters, using random number generator seed 7. Even though only one map file will likely contain entries for this seed, all map files will be interrogated.

+
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test1 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
+Generating 500 random queries...finished
+35.09 lookups/sec  14.25 secs
+num results : 500
+Generating 500 random queries...finished
+35.33 lookups/sec  14.15 secs
+num results : 500
+
+ + +

Below the same lookups are done against the table with bloom filters. The +lookups were 2.86 times faster because only one map file was used, even though three +map files existed.

+
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test2 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
+Generating 500 random queries...finished
+99.03 lookups/sec   5.05 secs
+num results : 500
+Generating 500 random queries...finished
+101.15 lookups/sec   4.94 secs
+num results : 500
+
+ + +

You can verify the table has three files by looking in HDFS. To look in HDFS you will need the table ID, because the ID is used in HDFS instead of the table name. The following command will show table IDs.

+
$ ./bin/accumulo shell -u username -p password -e 'tables -l'
+!METADATA       =>         !0
+bloom_test1     =>         o7
+bloom_test2     =>         o8
+trace           =>          1
+
+ + +

So the table ID for bloom_test2 is o8. The command below shows what files this table has in HDFS. This assumes Accumulo is at the default location in HDFS.

+
$ hadoop fs -lsr /accumulo/tables/o8
+drwxr-xr-x   - username supergroup          0 2012-01-10 14:02 /accumulo/tables/o8/default_tablet
+-rw-r--r--   3 username supergroup   52672650 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dj.rf
+-rw-r--r--   3 username supergroup   52436176 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dk.rf
+-rw-r--r--   3 username supergroup   52850173 2012-01-10 14:02 /accumulo/tables/o8/default_tablet/F00000dl.rf
+
+ + +

Running the rfile-info command shows that one of the files has a bloom filter and that it is 1.5 MB.

+
$ ./bin/accumulo rfile-info /accumulo/tables/o8/default_tablet/F00000dj.rf
+Locality group         : <DEFAULT>
+Start block          : 0
+Num   blocks         : 752
+Index level 0        : 43,598 bytes  1 blocks
+First key            : row_0000001169 foo:1 [exampleVis] 1326222052539 false
+Last key             : row_0999999421 foo:1 [exampleVis] 1326222052058 false
+Num entries          : 999,536
+Column families      : [foo]
+
+Meta block     : BCFile.index
+  Raw size             : 4 bytes
+  Compressed size      : 12 bytes
+  Compression type     : gz
+
+Meta block     : RFile.index
+  Raw size             : 43,696 bytes
+  Compressed size      : 15,592 bytes
+  Compression type     : gz
+
+Meta block     : acu_bloom
+  Raw size             : 1,540,292 bytes
+  Compressed size      : 1,433,115 bytes
+  Compression type     : gz
+
+
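A quick sanity check on the acu_bloom size: 1,540,292 bytes for 999,536 entries works out to roughly 12 bits per key, a typical bloom filter budget. (This is derived only from the numbers above; Accumulo's actual sizing parameters are not shown here.)

```java
public class BloomBudget {
    public static void main(String[] args) {
        long rawBytes = 1_540_292L; // acu_bloom raw size from rfile-info
        long entries = 999_536L;    // num entries from rfile-info
        double bitsPerKey = rawBytes * 8.0 / entries;
        System.out.printf("%.1f bits/key%n", bitsPerKey); // ~12.3
    }
}
```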
+ + + + + Added: websites/staging/accumulo/trunk/content/1.5/examples/bulkIngest.html ============================================================================== --- websites/staging/accumulo/trunk/content/1.5/examples/bulkIngest.html (added) +++ websites/staging/accumulo/trunk/content/1.5/examples/bulkIngest.html Sat May 25 02:56:29 2013 @@ -0,0 +1,125 @@ + + + + + + Apache Accumulo Bulk Ingest Example + + + + + + + + + + + + +
+ ™ +
+ +
+

Apache Accumulo Bulk Ingest Example

+

This is an example of how to bulk ingest data into Accumulo using MapReduce.

+

The following commands show how to run this example. This example creates a table called test_bulk which has two initial split points. Then 1000 rows of test data are created in HDFS. After that, the 1000 rows are ingested into Accumulo. Finally, we verify that the 1000 rows are in Accumulo.

+
$ PKG=org.apache.accumulo.examples.simple.mapreduce.bulk
+$ ARGS="-i instance -z zookeepers -u username -p password"
+$ ./bin/accumulo $PKG.SetupTable $ARGS -t test_bulk row_00000333 row_00000666
+$ ./bin/accumulo $PKG.GenerateTestData --start-row 0 --count 1000 --output bulk/test_1.txt
+$ ./bin/tool.sh lib/accumulo-examples-simple.jar $PKG.BulkIngestExample $ARGS -t test_bulk --inputDir bulk --workDir tmp/bulkWork
+$ ./bin/accumulo $PKG.VerifyIngest $ARGS -t test_bulk --start-row 0 --count 1000
+
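SetupTable's two split points divide test_bulk into three tablets before any data arrives, so the bulk-loaded files can be spread across servers. A sketch of how a row maps to a tablet under Accumulo's split semantics (a tablet covers rows up to and including its split point):

```java
public class SplitPoints {
    // Tablet index owning `row`, given sorted split (end-row) points.
    // A tablet covers (previous split, split], so a row equal to a
    // split point lands in that split's tablet.
    static int tabletFor(String row, String[] splits) {
        for (int i = 0; i < splits.length; i++) {
            if (row.compareTo(splits[i]) <= 0) return i;
        }
        return splits.length; // default tablet after the last split
    }

    public static void main(String[] args) {
        String[] splits = {"row_00000333", "row_00000666"};
        System.out.println(tabletFor("row_00000100", splits)); // 0
        System.out.println(tabletFor("row_00000333", splits)); // 0 (end row is inclusive)
        System.out.println(tabletFor("row_00000500", splits)); // 1
        System.out.println(tabletFor("row_00000999", splits)); // 2
    }
}
```

Because the comparison is lexicographic, the zero-padded row format keeps the split boundaries aligned with the numeric row ranges.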
+ + +

For a high-level discussion of bulk ingest, see the docs directory.

+
+ + + + + Added: websites/staging/accumulo/trunk/content/1.5/examples/classpath.html ============================================================================== --- websites/staging/accumulo/trunk/content/1.5/examples/classpath.html (added) +++ websites/staging/accumulo/trunk/content/1.5/examples/classpath.html Sat May 25 02:56:29 2013 @@ -0,0 +1,142 @@ + + + + + + Apache Accumulo Client Examples + + + + + + + + + + + + +
+ ™ +
+ +
+

Apache Accumulo Client Examples

+

This example shows how to use per-table classpaths. The example leverages a test jar which contains a Filter that suppresses rows containing "foo". The example shows copying the FooFilter.jar into HDFS and then making an Accumulo table reference that jar.

+

Execute the following command in the shell.

+

$ hadoop fs -copyFromLocal $ACCUMULO_HOME/test/src/test/resources/FooFilter.jar /user1/lib

+

Execute the following in the Accumulo shell to set up the classpath context.

+

root@test15> config -s general.vfs.context.classpath.cx1=hdfs://<namenode host>:<namenode port>/user1/lib

+

Create a table

+

root@test15> createtable nofoo

+

The following command makes this table use the configured classpath context

+

root@test15 nofoo> config -t nofoo -s table.classpath.context=cx1

+

The following command configures an iterator that's in FooFilter.jar

+

root@test15 nofoo> setiter -n foofilter -p 10 -scan -minc -majc -class org.apache.accumulo.test.FooFilter
Filter accepts or rejects each Key/Value pair
----------> set FooFilter parameter negate, default false keeps k/v that pass accept method, true rejects k/v that pass accept method: false

+

The commands below show the filter is working.

+

root@test15 nofoo> insert foo1 f1 q1 v1
root@test15 nofoo> insert noo1 f1 q1 v2
root@test15 nofoo> scan
noo1 f1:q1 []    v2
root@test15 nofoo>
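An Accumulo Filter iterator boils down to an accept(key, value) predicate applied to every entry; FooFilter rejects entries whose row contains "foo", which is why foo1 never appears in the scan. A standalone sketch of that predicate (FooFilter's exact accept logic is assumed from its observed behavior):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FooFilterSketch {
    // Mimics a Filter's accept(): keep the entry unless the row contains "foo".
    static boolean accept(String row) {
        return !row.contains("foo");
    }

    public static void main(String[] args) {
        // Two entries, keyed by "row family:qualifier" for illustration.
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("foo1 f1:q1", "v1");
        entries.put("noo1 f1:q1", "v2");
        entries.forEach((k, v) -> {
            if (accept(k)) System.out.println(k + " []    " + v); // only noo1 survives
        });
    }
}
```

Because the iterator was set with -scan, -minc, and -majc, the same predicate runs at scan time and during minor and major compactions.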

+

Below, an attempt is made to add the FooFilter to a table that's not configured to use the classpath context cx1. This fails until the table is configured to use cx1.

+

root@test15 nofoo> createtable nofootwo
root@test15 nofootwo> setiter -n foofilter -p 10 -scan -minc -majc -class org.apache.accumulo.test.FooFilter
2013-05-03 12:49:35,943 [shell.Shell] ERROR: java.lang.IllegalArgumentException: org.apache.accumulo.test.FooFilter
root@test15 nofootwo> config -t nofootwo -s table.classpath.context=cx1
root@test15 nofootwo> setiter -n foofilter -p 10 -scan -minc -majc -class org.apache.accumulo.test.FooFilter
Filter accepts or rejects each Key/Value pair
----------> set FooFilter parameter negate, default false keeps k/v that pass accept method, true rejects k/v that pass accept method: false

+
+ + + + +