accumulo-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject [08/11] accumulo git commit: ACCUMULO-4531 Remove bulkIngest.html
Date Tue, 06 Dec 2016 19:14:47 GMT
ACCUMULO-4531 Remove bulkIngest.html

* Similar documentation exists in 'High-speed Ingest' section of user manual


Branch: refs/heads/master
Commit: c0e68a3a82bb6fa7eb370039537aac8337f22e47
Parents: 43e98d4
Author: Mike Walch <>
Authored: Tue Dec 6 10:58:23 2016 -0500
Committer: Mike Walch <>
Committed: Tue Dec 6 14:04:54 2016 -0500

 docs/src/main/resources/bulkIngest.html | 114 ---------------------------
 1 file changed, 114 deletions(-)
diff --git a/docs/src/main/resources/bulkIngest.html b/docs/src/main/resources/bulkIngest.html
deleted file mode 100644
index 9e9896e..0000000
--- a/docs/src/main/resources/bulkIngest.html
+++ /dev/null
@@ -1,114 +0,0 @@
-  Licensed to the Apache Software Foundation (ASF) under one or more
-  contributor license agreements.  See the NOTICE file distributed with
-  this work for additional information regarding copyright ownership.
-  The ASF licenses this file to You under the Apache License, Version 2.0
-  (the "License"); you may not use this file except in compliance with
-  the License.  You may obtain a copy of the License at
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
-  See the License for the specific language governing permissions and
-  limitations under the License.
-<title>Accumulo Bulk Ingest</title>
-<link rel='stylesheet' type='text/css' href='documentation.css' media='screen'/>
-<h1>Apache Accumulo Documentation : Bulk Ingest</h2>
-<p>Accumulo supports the ability to import sorted files produced by an
-external process into an online table. Often, it is much faster to churn
-through large amounts of data using map/reduce to produce the these files.
-The new files can be incorporated into Accumulo using bulk ingest.
-<li>Construct an <code>org.apache.accumulo.core.client.Connector</code>
-<li>Call <code>connector.tableOperations().getSplits()</code></li>
-<li>Run a map/reduce job using <code>RangePartitioner</code>
-with splits from the previous step</li>
-<li>Call <code>connector.tableOperations().importDirectory()</code> passing
the output directory of the MapReduce job</li>
-<p>Files can also be imported using the "importdirectory" shell command.
-<p>A complete example is available in <a href='examples/README.bulkIngest'>README.bulkIngest</a>
-<p>Importing data using whole files of sorted data can be very efficient, but it differs
-from live ingest in the following ways:
- <li>Table constraints are not applied against they data in the file.
- <li>Adding new files to tables are likely to trigger major compactions.
- <li>The timestamp in the file could contain strange values. Accumulo can be asked
to use the ingest timestamp for all values if this is a concern.
- <li>It is possible to create invalid visibility values (for example "&|"). This
will cause errors when the data is accessed.
- <li>Bulk imports do not effect the entry counts in the monitor page until the files
are compacted.
-<h2>Best Practices</h2>
-<p>Consider two approaches to creating ingest files using map/reduce.
- <li>A large file containing the Key/Value pairs for only a single tablet.
- <li>A set of small files containing Key/Value pairs for every tablet.
-<p>In the first case, adding the file requires telling a single tablet server about
a single file. Even if the file
-is 20G in size, it is one call to the tablet server. The tablet server makes one extra file
entry in the
-tablet's metadata, and the data is now part of the tablet.
-<p>In the second case, an request must be made for each tablet for each file to be
added. If there
-100 files and 100 tablets, this will be 10K requests, and the number of files needed to be
-for scans on these tablets will be very large. Major compactions will most likely start which
will eventually
-fix the problem, but a lot more work needs to be done by accumulo to read these files.
-<p>Getting good, fast, bulk import performance depends on creating files like the first,
and avoiding files like
-the second.
-<p>For this reason, a RangePartitioner should be used to create files when
-writing with the AccumuloFileOutputFormat.
-<p>Hash partition is not recommended because it will put keys in random
-groups, exactly like our bad approach.
-<P>Any set of cut points for range partitioning can be used in a map
-reduce job, but using Accumulo's current splits is probably the most
-optimal thing to do. However in some cases there may be too many
-splits. For example if there are 2000 splits, you would need to run
-2001 reducers. To overcome this problem use the
-<code>connector.tableOperations.getSplits(&lt;table name&gt;,&lt;max
-splits&gt;)</code> method. This method will not return more than
-<code> &lt;max splits&gt; </code> splits, but the splits it returns
-will optimally partition the data for Accumulo.
-<p>Remember that Accumulo never splits rows across tablets.
-Therefore the range partitioner only considers rows when partitioning.
-<p>When bulk importing many files into a new table, it might be good to pre-split the
table to bring
-additional resources to accepting the data. For example, if you know your data is indexed
based on the
-date, pre-creating splits for each day will allow files to fall into natural splits. Having
more tablets
-accept the new data means that more resources can be used to import the data right away.
-<p>An alternative to bulk ingest is to have a map/reduce job use
-<code>AccumuloOutputFormat</code>, which can support billions of inserts per
-hour, depending on the size of your cluster. This is sufficient for
-most users, but bulk ingest remains the fastest way to incorporate
-data into Accumulo. In addition, bulk ingest has one advantage over
-AccumuloOutputFormat: there is no duplicate data insertion. When one uses
-map/reduce to output data to accumulo, restarted jobs may re-enter
-data from previous failed attempts. Generally, this only matters when
-there are aggregators. With bulk ingest, reducers are writing to new
-map files, so it does not matter. If a reduce fails, you create a new
-map file. When all reducers finish, you bulk ingest the map files
-into Accumulo. The disadvantage to bulk ingest over <code>AccumuloOutputFormat</code>
-greater latency: the entire map/reduce job must complete
-before any data is available.

View raw message