Reply-To: dev@accumulo.apache.org
In-Reply-To: <57FD3B69.2080309@gmail.com>
References: <57FD3B69.2080309@gmail.com>
From: Dylan Hutchison
Date: Tue, 11 Oct 2016 12:42:32 -0700
Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
To: Accumulo Dev List

Interesting idea. It raises the question: should we allow any custom index at the RFile level? If RFile indexes were user-extensible, then a visibility index would be something any developer could write. That said, we could still include such an index as an example, and if we did, the Accumulo monitor could use it.

The RFile-level sampling followed this path. I would support further work similar to it, though I admit I don't know how difficult a job it entails. Bonus points if the index information could be accessed from iterators the same way that sampled data can.

I can't speak to the appropriateness of visibility histograms on the monitor *by default*, but it would be a strictly useful feature if it could be enabled via a conf option.

On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser wrote:

> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
> mentioned was the lack of insight into the distribution of data marked with
> certain visibilities in a table. He presented an example similar to this:
>
> Imagine a hypothetical system backed by Accumulo which stores medical
> information. There are three labels in the system: PRIVATE, ANONYMIZED, and
> PUBLIC. PRIVATE data is that which could reasonably be considered to
> identify the individual. ANONYMIZED data is some altered version of the
> attribute that retains some portion of the original value, but is missing
> enough context to not identify the individual (e.g. converting the name
> "Josh Elser" to "J E").
> PUBLIC data is for attributes which cannot identify the individual.
>
> Doctors would be able to read the PRIVATE data, while researchers could
> only read the ANONYMIZED and PUBLIC data. This leads to a question: how
> much of each kind of data is in the system? Without knowing how much data
> is in the system, how can an application developer (who does not have the
> ability to read all of the PRIVATE data) know that their application is
> returning a reasonably correct amount of data? (There are many examples of
> questions which could be answered from this data alone.)
>
> Concretely, this histogram would look like this (50 records with PRIVATE, 50
> with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>
> ```
> PRIVATE: 50
> ANONYMIZED: 50
> PUBLIC: 20
> ```
>
> Technically, I think this would actually be relatively simple to
> implement. Inside of each RFile, we could maintain some histogram of the
> visibilities observed in that file. This would allow us to very easily
> report how much data in each table has each visibility label.
>
> However, would this feature be harmful to one of the core tenets of
> Accumulo? Or, is acknowledging the existence of data in Accumulo with a
> certain visibility acceptable? Would a new permission to use such an API to
> access this information be sufficient to protect the data?
>
> - Josh
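
To make the per-file histogram idea from the quoted message concrete, here is a minimal Python sketch. It uses no Accumulo or RFile API: the functions `file_histogram` and `table_histogram` and the two example "files" are hypothetical, invented purely for illustration, and each visibility is treated as an opaque string rather than a parsed visibility expression. The only point is that per-file counts can be summed into a table-level report without rereading the data.

```python
from collections import Counter
from typing import Iterable, Mapping


def file_histogram(visibilities: Iterable[str]) -> Counter:
    """Count entries per visibility label for one (hypothetical) file."""
    return Counter(visibilities)


def table_histogram(per_file: Iterable[Mapping[str, int]]) -> Counter:
    """Merge per-file histograms into one table-level histogram."""
    total = Counter()
    for hist in per_file:
        total.update(hist)  # Counter.update adds counts, it does not overwrite
    return total


# Two made-up files whose contents add up to the example in the thread.
file_a = file_histogram(["PRIVATE"] * 30 + ["ANONYMIZED"] * 50)
file_b = file_histogram(["PRIVATE"] * 20 + ["PUBLIC"] * 20)

table = table_histogram([file_a, file_b])
for label in ("PRIVATE", "ANONYMIZED", "PUBLIC"):
    print(f"{label}: {table[label]}")
```

Because the merge is a plain sum, a histogram kept per file could in principle be combined at report time. How such counts would stay accurate across compactions, deletes, and versioning is not addressed by this sketch.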