accumulo-dev mailing list archives

From joshelser
Subject [GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization
Date Tue, 07 Mar 2017 00:50:11 GMT
Github user joshelser commented on a diff in the pull request:
    --- Diff: docs/src/main/asciidoc/chapters/summaries.txt ---
    @@ -0,0 +1,211 @@
    +// Licensed to the Apache Software Foundation (ASF) under one or more
    +// contributor license agreements.  See the NOTICE file distributed with
    +// this work for additional information regarding copyright ownership.
    +// The ASF licenses this file to You under the Apache License, Version 2.0
    +// (the "License"); you may not use this file except in compliance with
    +// the License.  You may obtain a copy of the License at
    +// Unless required by applicable law or agreed to in writing, software
    +// distributed under the License is distributed on an "AS IS" BASIS,
    +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +// See the License for the specific language governing permissions and
    +// limitations under the License.
    +== Summary Statistics
    +=== Overview
    +Accumulo has the ability to generate summary statistics about data in a table
    +using user-defined functions.  Currently these statistics are only generated for
    +data written to files.  Data recently written to Accumulo that is still in
    +memory will not contribute to summary statistics.
    +This feature can be used to inform a user about what data is in their table.
    +Summary statistics can also be used by compaction strategies to make decisions
    +about which files to compact.  
    +Summary data is stored in each file Accumulo produces.  Accumulo can gather
    +summary information from across a cluster, merging it along the way.  In order
    +for this to be fast, the summary information should fit in cache.  There is a
    +dedicated cache for summary data on each tserver with a configurable size.  In
    +order for summary data to fit in cache, it should probably be small.
    +For information on writing a custom summarizer see the javadoc for
    ++org.apache.accumulo.core.client.summary.Summarizer+.  The package
    ++org.apache.accumulo.core.client.summary.summarizers+ contains summarizer
    +implementations that ship with Accumulo and can be configured for use.
    +=== Configuring
    +The following tablet server and table properties configure summarization.
    +* <<appendices/config.txt#_tserver_cache_summary_size>>
    +* <<appendices/config.txt#_tserver_summary_retrieval_threads>>
    +* <<appendices/config.txt#TABLE_SUMMARIZER_PREFIX>>
    +* <<appendices/config.txt#_table_file_summary_maxsize>>
    +=== Permissions
    +Because summary data may be derived from sensitive data, requesting summary data
    +requires a special permission.  A user must have the table permission
    ++GET_SUMMARIES+ in order to retrieve summary data.
    +=== Bulk import
    +When generating rfiles to bulk import into Accumulo, those rfiles can contain
    +summary data.  To use this feature, look at the javadoc on the
    ++AccumuloFileOutputFormat.setSummarizers(...)+ method.  Also,
    ++org.apache.accumulo.core.client.rfile.RFile+ has options for creating RFiles
    +with embedded summary data.
    +=== Examples
    +This example walks through using summarizers in the Accumulo shell.  Below a
    +table is created and some data is inserted to summarize.
    + root@uno> createtable summary_test
    + root@uno summary_test> setauths -u root -s PI,GEO,TIME
    + root@uno summary_test> insert 3b503bd name last Doe
    + root@uno summary_test> insert 3b503bd name first John
    + root@uno summary_test> insert 3b503bd contact address "123 Park Ave, NY, NY" -l PI&GEO
    + root@uno summary_test> insert 3b503bd date birth "1/11/1942" -l PI&TIME
    + root@uno summary_test> insert 3b503bd date married "5/11/1962" -l PI&TIME
    + root@uno summary_test> insert 3b503bd contact home_phone 1-123-456-7890 -l PI
    + root@uno summary_test> insert d5d18dd contact address "50 Lake Shore Dr, Chicago,
    + root@uno summary_test> insert d5d18dd name first Jane
    + root@uno summary_test> insert d5d18dd name last Doe
    + root@uno summary_test> insert d5d18dd date birth 8/15/1969 -l PI&TIME
    + root@uno summary_test> scan -s PI,GEO,TIME
    + 3b503bd contact:address [PI&GEO]    123 Park Ave, NY, NY
    + 3b503bd contact:home_phone [PI]    1-123-456-7890
    + 3b503bd date:birth [PI&TIME]    1/11/1942
    + 3b503bd date:married [PI&TIME]    5/11/1962
    + 3b503bd name:first []    John
    + 3b503bd name:last []    Doe
    + d5d18dd contact:address [PI&GEO]    50 Lake Shore Dr, Chicago, IL
    + d5d18dd date:birth [PI&TIME]    8/15/1969
    + d5d18dd name:first []    Jane
    + d5d18dd name:last []    Doe
    +After inserting the data, summaries are requested below.  No summaries are returned.
    + root@uno summary_test> summaries
    +The visibility summarizer is configured below and the table is flushed.
    +Flushing the table creates a file, generating summary data in the process. The
    +summary data returned counts how many times each column visibility occurred.
    +The statistics with a +c:+ prefix are visibilities.  The others are generic
    +statistics created by the CountingSummarizer that VisibilitySummarizer extends. 
    + root@uno summary_test> config -t summary_test -s table.summarizer.vis=org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer
    + root@uno summary_test> summaries
    + root@uno summary_test> flush -w
    + 2017-02-24 19:54:46,090 [shell.Shell] INFO : Flush of table summary_test completed.
    + root@uno summary_test> summaries
    +  Summarizer         : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {}
    +  File Statistics    : [total:1, missing:0, extra:0, large:0]
    +  Summary Statistics : 
    --- End diff --
    So cool.
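As a rough illustration of the counting and merging that the quoted documentation describes, the self-contained Java sketch below tallies column visibilities per file and then merges the per-file tallies. It does not use the Accumulo API: the class name `VisibilityCountSketch` and its methods are hypothetical, the split of the example entries into two "files" is an assumption made for illustration, and only the `c:` key prefix is taken from the text above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: NOT the Accumulo Summarizer API, only the idea behind it.
public class VisibilityCountSketch {

    // Count how often each column visibility occurs in one "file".
    // Keys get a "c:" prefix, mirroring the prefix mentioned in the docs.
    static Map<String, Long> summarize(List<String> visibilities) {
        Map<String, Long> counts = new TreeMap<>();
        for (String vis : visibilities) {
            counts.merge("c:" + vis, 1L, Long::sum);
        }
        return counts;
    }

    // Merge two per-file summaries by adding the counts for matching keys.
    static Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> merged = new TreeMap<>(a);
        for (Map.Entry<String, Long> e : b.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Visibilities of the entries for row 3b503bd in the shell example
        // (an empty string stands for an entry with no visibility).
        List<String> file1 = Arrays.asList("PI&GEO", "PI", "PI&TIME", "PI&TIME", "", "");
        // Visibilities of the entries for row d5d18dd.
        List<String> file2 = Arrays.asList("PI&GEO", "PI&TIME", "", "");

        Map<String, Long> total = merge(summarize(file1), summarize(file2));
        System.out.println(total);
    }
}
```

Because each per-file tally is a small map whose size depends on the number of distinct visibilities rather than the number of entries, merging stays cheap, which is in line with the documentation's advice to keep summary data small enough to fit in cache.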
