From: Russ Weeks
To: dev
Reply-To: dev@accumulo.apache.org
Date: Tue, 11 Oct 2016 20:54:54 +0000
Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
References: <57FD3B69.2080309@gmail.com>

> I've always been under the impression that Accumulo was not supposed to
> confirm the existence of data that a user did not have permission to read.

OK, that makes sense; I can see the need for that. But if we follow this
path of keeping the summary data structure in the RFile header (footer?),
then it's just a convenience that's available to anybody who can read the
RFile. At that point it seems like it's just a question of who else should
be allowed to read it and how to grant that access. A system permission
makes a lot of sense to me.

-Russ

On Tue, Oct 11, 2016 at 4:33 PM Mike Drob wrote:

> I've always been under the impression that Accumulo was not supposed to
> confirm the existence of data that a user did not have permission to read.
>
> On Tue, Oct 11, 2016, 2:20 PM Josh Elser wrote:
>
> > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
> > mentioned was the lack of insight into the distribution of data marked
> > with certain visibilities in a table. He presented an example similar to
> > this:
> >
> > Imagine a hypothetical system backed by Accumulo which stores medical
> > information. There are three labels in the system: PRIVATE, ANONYMIZED,
> > and PUBLIC. PRIVATE data is that which could reasonably be considered to
> > identify the individual. ANONYMIZED data is some altered version of the
> > attribute that retains some portion of the original value, but is
> > missing enough context to not identify the individual (e.g. converting
> > the name "Josh Elser" to "J E"). PUBLIC data is for attributes which
> > cannot identify the individual.
> >
> > Doctors would be able to read the PRIVATE data, while researchers could
> > only read the ANONYMIZED and PUBLIC data. This leads to a question: how
> > much of each kind of data is in the system? Without knowing how much
> > data is in the system, how can some application developer (who does not
> > have the ability to read all of the PRIVATE data) know that their
> > application is returning a reasonably correct amount of data? (There
> > are many examples of questions which could be answered from this data
> > alone.)
> >
> > Concretely, this histogram would look like this (50 records with
> > PRIVATE, 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> >
> > ```
> > PRIVATE: 50
> > ANONYMIZED: 50
> > PUBLIC: 20
> > ```
> >
> > Technically, I think this would actually be relatively simple to
> > implement. Inside of each RFile, we could maintain some histogram of the
> > visibilities observed in that file. This would allow us to very easily
> > report how much data in each table has each visibility label.
> >
> > However, would this feature be harmful to one of the core tenets of
> > Accumulo? Or, is acknowledging the existence of data in Accumulo with a
> > certain visibility acceptable? Would a new permission to use such an API
> > to access this information be sufficient to protect the data?
> >
> > - Josh
> >
>
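
For illustration only, the idea quoted in this thread (a histogram kept per
RFile, merged into a per-table report, and gated by a new permission) could
be sketched roughly as follows. This is plain Python, not Accumulo code:
the permission name VIEW_VISIBILITY_SUMMARY and the functions
file_histogram and table_histogram are hypothetical names invented for the
sketch, and each visibility label is treated as an opaque string.

```python
from collections import Counter

# Hypothetical permission name, invented for this sketch.
# It is not an existing Accumulo permission.
SUMMARY_PERMISSION = "VIEW_VISIBILITY_SUMMARY"


def file_histogram(visibilities):
    """Count entries per visibility label for one file (a stand-in for an RFile)."""
    return Counter(visibilities)


def table_histogram(per_file_histograms, user_permissions):
    """Merge per-file counts, but only for users holding the summary permission."""
    if SUMMARY_PERMISSION not in user_permissions:
        raise PermissionError("user may not view the visibility summary")
    total = Counter()
    for hist in per_file_histograms:
        total.update(hist)  # adds counts label by label
    return dict(total)


# Two illustrative files whose counts add up to the 50/50/20 example.
file_a = file_histogram(["PRIVATE"] * 30 + ["ANONYMIZED"] * 50)
file_b = file_histogram(["PRIVATE"] * 20 + ["PUBLIC"] * 20)

print(table_histogram([file_a, file_b], {SUMMARY_PERMISSION}))
```

Because the per-file counts are simply added together, the table-level
report never has to read the underlying entries. That is also why the
access question matters: a caller without the hypothetical permission gets
an error rather than a count, so the summary does not confirm the existence
of data to users who were not granted that access.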