From: buildbot@apache.org
To: accumulo-commits@incubator.apache.org
Subject: svn commit: r863067 [3/7] - in /websites/staging/accumulo/trunk/content: ./ 1.5/ 1.5/examples/ 1.5/images/ downloads/
Date: Sat, 25 May 2013 02:56:29 -0000
Message-Id: <20130525025631.1C8C62388B1B@eris.apache.org>
Added: 
websites/staging/accumulo/trunk/content/1.5/examples/batch.html ============================================================================== --- websites/staging/accumulo/trunk/content/1.5/examples/batch.html (added) +++ websites/staging/accumulo/trunk/content/1.5/examples/batch.html Sat May 25 02:56:29 2013 @@ -0,0 +1,145 @@ + + + + + + Apache Accumulo Batch Writing and Scanning Example + + + + + + + + + + + + +
+ ™ +
+ +
+

Apache Accumulo Batch Writing and Scanning Example

+

This tutorial uses the following Java classes, which can be found in org.apache.accumulo.examples.simple.client in the simple-examples module:

+
    +
  • SequentialBatchWriter.java - writes mutations with sequential rows and random values
  • +
  • RandomBatchWriter.java - used by SequentialBatchWriter to generate random values
  • +
  • RandomBatchScanner.java - reads random rows and verifies their values
  • +
+

This is an example of how to use the batch writer and batch scanner. To compile the example, run Maven and copy the produced jar into the Accumulo lib directory. This is already done in the tar distribution.

+

Below are commands that add 10000 entries to Accumulo and then do 100 random queries. The write command generates random 50-byte values.

+

Be sure to use the name of your instance (given as instance here) and the appropriate +list of zookeeper nodes (given as zookeepers here).

+

Before you run this, you must ensure that the user you are running as has the "exampleVis" authorization. (You can set this in the shell with "setauths -u username -s exampleVis".)

+
$ ./bin/accumulo shell -u root -e "setauths -u username -s exampleVis"
+
+ + +

You must also create the table, batchtest1, ahead of time. (In the shell, use "createtable batchtest1")

+
$ ./bin/accumulo shell -u username -e "createtable batchtest1"
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.SequentialBatchWriter -i instance -z zookeepers -u username -p password -t batchtest1 --start 0 
 --num 10000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20 --vis exampleVis
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner -i instance -z zookeepers -u username -p password -t batchtest1 --num 100 --min 0 --max 10000 --size 50 --scanThreads 20 --vis exampleVis
+07 11:33:11,103 [client.CountingVerifyingReceiver] INFO : Generating 100 random queries...
+07 11:33:11,112 [client.CountingVerifyingReceiver] INFO : finished
+07 11:33:11,260 [client.CountingVerifyingReceiver] INFO : 694.44 lookups/sec   0.14 secs
+
+07 11:33:11,260 [client.CountingVerifyingReceiver] INFO : num results : 100
+
+07 11:33:11,364 [client.CountingVerifyingReceiver] INFO : Generating 100 random queries...
+07 11:33:11,370 [client.CountingVerifyingReceiver] INFO : finished
+07 11:33:11,416 [client.CountingVerifyingReceiver] INFO : 2173.91 lookups/sec   0.05 secs
+
+07 11:33:11,416 [client.CountingVerifyingReceiver] INFO : num results : 100
+
+
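The --start and --num options above control the row range the writer covers. As a rough sketch of the row-key scheme these examples appear to use ("row_" plus a 10-digit zero-padded row number, as seen in output such as row_0000001169; the exact format lives in RandomBatchWriter and is an assumption here):

```java
public class RowKeys {
    // Illustrative assumption: "row_" followed by the row number,
    // zero-padded to 10 digits, matching rows like row_0000001169.
    static String rowKey(long row) {
        return String.format("row_%010d", row);
    }

    public static void main(String[] args) {
        // Sequential rows starting at --start 0
        for (long r = 0; r < 3; r++) {
            System.out.println(rowKey(r));
        }
        System.out.println(rowKey(1169)); // row_0000001169
    }
}
```

Zero-padding keeps the lexicographic sort order of the row keys identical to the numeric order, which is what makes ranges like --start/--num meaningful.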
+ + + + + Added: websites/staging/accumulo/trunk/content/1.5/examples/bloom.html ============================================================================== --- websites/staging/accumulo/trunk/content/1.5/examples/bloom.html (added) +++ websites/staging/accumulo/trunk/content/1.5/examples/bloom.html Sat May 25 02:56:29 2013 @@ -0,0 +1,317 @@ + + + + + + Apache Accumulo Bloom Filter Example + + + + + + + + + + + + +
+ ™ +
+ +
+

Apache Accumulo Bloom Filter Example

+

This example shows how to create a table with bloom filters enabled. It also +shows how bloom filters increase query performance when looking for values that +do not exist in a table.

+

Below, a table named bloom_test is created and bloom filters are enabled.

+
$ ./bin/accumulo shell -u username -p password
+Shell - Apache Accumulo Interactive Shell
+- version: 1.5.0
+- instance name: instance
+- instance id: 00000000-0000-0000-0000-000000000000
+- 
+- type 'help' for a list of available commands
+- 
+username@instance> setauths -u username -s exampleVis
+username@instance> createtable bloom_test
+username@instance bloom_test> config -t bloom_test -s table.bloom.enabled=true
+username@instance bloom_test> exit
+
+ + +

Below, 1 million random values are inserted into Accumulo. The randomly generated rows range between 0 and 1 billion. The random number generator is initialized with the seed 7.

+
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis
+
+ + +

Below the table is flushed:

+
$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test -w'
+05 10:40:06,069 [shell.Shell] INFO : Flush of table bloom_test completed.
+
+ + +

After the flush completes, 500 random queries are done against the table. The +same seed is used to generate the queries, therefore everything is found in the +table.

+
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --vis exampleVis
+Generating 500 random queries...finished
+96.19 lookups/sec   5.20 secs
+num results : 500
+Generating 500 random queries...finished
+102.35 lookups/sec   4.89 secs
+num results : 500
+
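The reason the same --seed finds every row is ordinary seeded-PRNG determinism: two generators initialized with the same seed emit the identical sequence, so the scanner regenerates exactly the rows the writer inserted. A minimal illustration of the property (using java.util.Random; the example classes' internal generator is an assumption, not shown here):

```java
import java.util.Arrays;
import java.util.Random;

public class SeededLookups {
    // Draw n row numbers in [0, max) from a PRNG with the given seed.
    static long[] draw(long seed, int n, long max) {
        Random rng = new Random(seed);
        long[] rows = new long[n];
        for (int i = 0; i < n; i++) {
            rows[i] = Math.floorMod(rng.nextLong(), max);
        }
        return rows;
    }

    public static void main(String[] args) {
        long[] written = draw(7, 500, 1_000_000_000L); // what the writer inserted
        long[] queried = draw(7, 500, 1_000_000_000L); // what the scanner looks up
        // Same seed, same sequence: every query hits a written row.
        System.out.println(Arrays.equals(written, queried)); // true
    }
}
```

This is also why seed 8 in the next run finds nothing: it regenerates a different set of rows, none of which were written.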
+ + +

Below another 500 queries are performed, using a different seed which results +in nothing being found. In this case the lookups are much faster because of +the bloom filters.

+
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 8 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
+Generating 500 random queries...finished
+2212.39 lookups/sec   0.23 secs
+num results : 0
+Did not find 500 rows
+Generating 500 random queries...finished
+4464.29 lookups/sec   0.11 secs
+num results : 0
+Did not find 500 rows
+
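The speedup for absent rows comes from the bloom filter's one-sided error: it can answer "definitely not in this file" with no false negatives, so a negative lookup skips the file entirely. A toy version of the idea (not Accumulo's actual implementation, which is configured via table.bloom.* properties):

```java
import java.util.BitSet;

// Toy bloom filter: k hash probes into an m-bit array. Membership tests
// can yield false positives but never false negatives, which is exactly
// the property that lets a store skip files for rows never written.
public class ToyBloom {
    private final BitSet bits;
    private final int m, k;

    ToyBloom(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    // i-th probe position for a key (simple derived hash, for illustration).
    private int probe(String key, int i) {
        return Math.floorMod(key.hashCode() * 31 + i * 0x9E3779B9, m);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) if (!bits.get(probe(key, i))) return false;
        return true; // "maybe": caller must still check the file
    }

    public static void main(String[] args) {
        ToyBloom bf = new ToyBloom(1 << 16, 4);
        for (int r = 0; r < 1000; r++) bf.add(String.format("row_%010d", r));
        System.out.println(bf.mightContain("row_0000000042")); // true: no false negatives
        // Keys never added are usually rejected without reading the file at all.
    }
}
```

Every written key must answer "maybe", while most absent keys answer "no" immediately; the occasional false positive just costs one unnecessary file read.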
+ + +
+

Bloom filters can also speed up lookups for entries that exist. In Accumulo, data is divided into tablets, and each tablet has multiple map files. Every lookup in Accumulo goes to a specific tablet, where a lookup is done on each map file in the tablet. So if a tablet has three map files, lookup performance can be three times slower than for a tablet with one map file. However, if the map files contain unique sets of data, then bloom filters can help eliminate map files that do not contain the row being looked up. To illustrate this, two otherwise identical tables were created using the following process. One table had bloom filters, the other did not. Also, the major compaction ratio was increased to prevent the files from being compacted into one file.

+
    +
  • Insert 1 million entries using RandomBatchWriter with a seed of 7
  • +
  • Flush the table using the shell
  • +
  • Insert 1 million entries using RandomBatchWriter with a seed of 8
  • +
  • Flush the table using the shell
  • +
  • Insert 1 million entries using RandomBatchWriter with a seed of 9
  • +
  • Flush the table using the shell
  • +
+

After following the above steps, each table will have a tablet with three map +files. Flushing the table after each batch of inserts will create a map file. +Each map file will contain 1 million entries generated with a different seed. +This is assuming that Accumulo is configured with enough memory to hold 1 +million inserts. If not, then more map files will be created.

+

The commands for creating the first table without bloom filters are below.

+
$ ./bin/accumulo shell -u username -p password
+Shell - Apache Accumulo Interactive Shell
+- version: 1.5.0
+- instance name: instance
+- instance id: 00000000-0000-0000-0000-000000000000
+- 
+- type 'help' for a list of available commands
+- 
+username@instance> setauths -u username -s exampleVis
+username@instance> createtable bloom_test1
+username@instance bloom_test1> config -t bloom_test1 -s table.compaction.major.ratio=7
+username@instance bloom_test1> exit
+
+$ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test1 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --auths exampleVis"
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
+
+ + +

The commands for creating the second table with bloom filters are below.

+
$ ./bin/accumulo shell -u username -p password
+Shell - Apache Accumulo Interactive Shell
+- version: 1.5.0
+- instance name: instance
+- instance id: 00000000-0000-0000-0000-000000000000
+- 
+- type 'help' for a list of available commands
+- 
+username@instance> setauths -u username -s exampleVis
+username@instance> createtable bloom_test2
+username@instance bloom_test2> config -t bloom_test2 -s table.compaction.major.ratio=7
+username@instance bloom_test2> config -t bloom_test2 -s table.bloom.enabled=true
+username@instance bloom_test2> exit
+
+$ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test2 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --auths exampleVis"
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
+$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS
+$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
+
+ + +

Below, 500 lookups are done against the table without bloom filters, using random number generator seed 7. Even though only one map file will likely contain entries for this seed, all map files will be interrogated.

+
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test1 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
+Generating 500 random queries...finished
+35.09 lookups/sec  14.25 secs
+num results : 500
+Generating 500 random queries...finished
+35.33 lookups/sec  14.15 secs
+num results : 500
+
+ + +

Below the same lookups are done against the table with bloom filters. The +lookups were 2.86 times faster because only one map file was used, even though three +map files existed.

+
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test2 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
+Generating 500 random queries...finished
+99.03 lookups/sec   5.05 secs
+num results : 500
+Generating 500 random queries...finished
+101.15 lookups/sec   4.94 secs
+num results : 500
+
+ + +

You can verify the table has three files by looking in HDFS. To look in HDFS you will need the table ID, because the ID is used in HDFS instead of the table name. The following command will show table IDs.

+
$ ./bin/accumulo shell -u username -p password -e 'tables -l'
+!METADATA       =>         !0
+bloom_test1     =>         o7
+bloom_test2     =>         o8
+trace           =>          1
+
+ + +

So the table ID for bloom_test2 is o8. The command below shows what files this table has in HDFS. This assumes Accumulo is at the default location in HDFS.

+
$ hadoop fs -lsr /accumulo/tables/o8
+drwxr-xr-x   - username supergroup          0 2012-01-10 14:02 /accumulo/tables/o8/default_tablet
+-rw-r--r--   3 username supergroup   52672650 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dj.rf
+-rw-r--r--   3 username supergroup   52436176 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dk.rf
+-rw-r--r--   3 username supergroup   52850173 2012-01-10 14:02 /accumulo/tables/o8/default_tablet/F00000dl.rf
+
+ + +

Running the rfile-info command shows that one of the files has a bloom filter and that it is 1.5 MB.

+
$ ./bin/accumulo rfile-info /accumulo/tables/o8/default_tablet/F00000dj.rf
+Locality group         : <DEFAULT>
+Start block          : 0
+Num   blocks         : 752
+Index level 0        : 43,598 bytes  1 blocks
+First key            : row_0000001169 foo:1 [exampleVis] 1326222052539 false
+Last key             : row_0999999421 foo:1 [exampleVis] 1326222052058 false
+Num entries          : 999,536
+Column families      : [foo]
+
+Meta block     : BCFile.index
+  Raw size             : 4 bytes
+  Compressed size      : 12 bytes
+  Compression type     : gz
+
+Meta block     : RFile.index
+  Raw size             : 43,696 bytes
+  Compressed size      : 15,592 bytes
+  Compression type     : gz
+
+Meta block     : acu_bloom
+  Raw size             : 1,540,292 bytes
+  Compressed size      : 1,433,115 bytes
+  Compression type     : gz
+
+
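A quick sanity check on the acu_bloom size: 1,540,292 bytes for 999,536 entries works out to roughly 12 bits per key, a typical bloom filter budget. (This is derived only from the numbers above; Accumulo's actual sizing parameters are not shown here.)

```java
public class BloomBudget {
    public static void main(String[] args) {
        long rawBytes = 1_540_292L; // acu_bloom raw size from rfile-info
        long entries = 999_536L;    // num entries from rfile-info
        double bitsPerKey = rawBytes * 8.0 / entries;
        System.out.printf("%.1f bits/key%n", bitsPerKey); // ~12.3
    }
}
```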
+ + + + + Added: websites/staging/accumulo/trunk/content/1.5/examples/bulkIngest.html ============================================================================== --- websites/staging/accumulo/trunk/content/1.5/examples/bulkIngest.html (added) +++ websites/staging/accumulo/trunk/content/1.5/examples/bulkIngest.html Sat May 25 02:56:29 2013 @@ -0,0 +1,125 @@ + + + + + + Apache Accumulo Bulk Ingest Example + + + + + + + + + + + + +
+ ™ +
+ +
+

Apache Accumulo Bulk Ingest Example

+

This is an example of how to bulk ingest data into Accumulo using MapReduce.

+

The following commands show how to run this example. This example creates a table called test_bulk which has two initial split points. Then 1000 rows of test data are created in HDFS. After that, the 1000 rows are ingested into Accumulo. Finally, we verify that the 1000 rows are in Accumulo.

+
$ PKG=org.apache.accumulo.examples.simple.mapreduce.bulk
+$ ARGS="-i instance -z zookeepers -u username -p password"
+$ ./bin/accumulo $PKG.SetupTable $ARGS -t test_bulk row_00000333 row_00000666
+$ ./bin/accumulo $PKG.GenerateTestData --start-row 0 --count 1000 --output bulk/test_1.txt
+$ ./bin/tool.sh lib/accumulo-examples-simple.jar $PKG.BulkIngestExample $ARGS -t test_bulk --inputDir bulk --workDir tmp/bulkWork
+$ ./bin/accumulo $PKG.VerifyIngest $ARGS -t test_bulk --start-row 0 --count 1000
+
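SetupTable's two split points divide test_bulk into three tablets before any data arrives, so the bulk-loaded files can be spread across servers. A sketch of how a row maps to a tablet under Accumulo's split semantics (a tablet covers rows up to and including its split point):

```java
public class SplitPoints {
    // Tablet index owning `row`, given sorted split (end-row) points.
    // A tablet covers (previous split, split], so a row equal to a
    // split point lands in that split's tablet.
    static int tabletFor(String row, String[] splits) {
        for (int i = 0; i < splits.length; i++) {
            if (row.compareTo(splits[i]) <= 0) return i;
        }
        return splits.length; // default tablet after the last split
    }

    public static void main(String[] args) {
        String[] splits = {"row_00000333", "row_00000666"};
        System.out.println(tabletFor("row_00000100", splits)); // 0
        System.out.println(tabletFor("row_00000333", splits)); // 0 (end row is inclusive)
        System.out.println(tabletFor("row_00000500", splits)); // 1
        System.out.println(tabletFor("row_00000999", splits)); // 2
    }
}
```

Because the comparison is lexicographic, the zero-padded row format keeps the split boundaries aligned with the numeric row ranges.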
+ + +

For a high-level discussion of bulk ingest, see the docs directory.

+
+ + + + + Added: websites/staging/accumulo/trunk/content/1.5/examples/classpath.html ============================================================================== --- websites/staging/accumulo/trunk/content/1.5/examples/classpath.html (added) +++ websites/staging/accumulo/trunk/content/1.5/examples/classpath.html Sat May 25 02:56:29 2013 @@ -0,0 +1,142 @@ + + + + + + Apache Accumulo Client Examples + + + + + + + + + + + + +
+ ™ +
+ +
+

Apache Accumulo Client Examples

+

This example shows how to use per-table classpaths. The example leverages a test jar which contains a Filter that suppresses rows containing "foo". The example shows copying the FooFilter.jar into HDFS and then making an Accumulo table reference that jar.

+

Execute the following command in the shell.

+

$ hadoop fs -copyFromLocal $ACCUMULO_HOME/test/src/test/resources/FooFilter.jar /user1/lib

+

Execute the following in the Accumulo shell to set up the classpath context.

+

root@test15> config -s general.vfs.context.classpath.cx1=hdfs://<namenode host>:<namenode port>/user1/lib

+

Create a table

+

root@test15> createtable nofoo

+

The following command makes this table use the configured classpath context

+

root@test15 nofoo> config -t nofoo -s table.classpath.context=cx1

+

The following command configures an iterator that's in FooFilter.jar

+

root@test15 nofoo> setiter -n foofilter -p 10 -scan -minc -majc -class org.apache.accumulo.test.FooFilter
Filter accepts or rejects each Key/Value pair
----------> set FooFilter parameter negate, default false keeps k/v that pass accept method, true rejects k/v that pass accept method: false

+

The commands below show the filter is working.

+

root@test15 nofoo> insert foo1 f1 q1 v1
root@test15 nofoo> insert noo1 f1 q1 v2
root@test15 nofoo> scan
noo1 f1:q1 []    v2
root@test15 nofoo>
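An Accumulo Filter iterator boils down to an accept(key, value) predicate applied to every entry; FooFilter rejects entries whose row contains "foo", which is why foo1 never appears in the scan. A standalone sketch of that predicate (FooFilter's exact accept logic is assumed from its observed behavior):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FooFilterSketch {
    // Mimics a Filter's accept(): keep the entry unless the row contains "foo".
    static boolean accept(String row) {
        return !row.contains("foo");
    }

    public static void main(String[] args) {
        // Two entries, keyed by "row family:qualifier" for illustration.
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("foo1 f1:q1", "v1");
        entries.put("noo1 f1:q1", "v2");
        entries.forEach((k, v) -> {
            if (accept(k)) System.out.println(k + " []    " + v); // only noo1 survives
        });
    }
}
```

Because the iterator was set with -scan, -minc, and -majc, the same predicate runs at scan time and during minor and major compactions.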

+

Below, an attempt is made to add the FooFilter to a table that's not configured to use the classpath context cx1. This fails until the table is configured to use cx1.

+

root@test15 nofoo> createtable nofootwo
root@test15 nofootwo> setiter -n foofilter -p 10 -scan -minc -majc -class org.apache.accumulo.test.FooFilter
2013-05-03 12:49:35,943 [shell.Shell] ERROR: java.lang.IllegalArgumentException: org.apache.accumulo.test.FooFilter
root@test15 nofootwo> config -t nofootwo -s table.classpath.context=cx1
root@test15 nofootwo> setiter -n foofilter -p 10 -scan -minc -majc -class org.apache.accumulo.test.FooFilter
Filter accepts or rejects each Key/Value pair
----------> set FooFilter parameter negate, default false keeps k/v that pass accept method, true rejects k/v that pass accept method: false

+
+ + + + +