accumulo-user mailing list archives

From Sean Busbey <busbey+li...@cloudera.com>
Subject Re: "NOT" operator in visibility string
Date Wed, 19 Mar 2014 20:11:17 GMT
I don't see how NOT helps this use case. From what I've heard so far, we're
still talking about a positive assertion (someone in the sandbox "group1"
flagged the data as to be hidden) and then restricting who has access to
data with that positive assertion (by default excluding everyone using the
sandbox "group1").

1) I agree that the easy way to go here is to use a clone to take a
snapshot of the table when a sandbox is created. If namespaces are
available and cloning allows cloning into a new namespace, I agree with
David that you should probably use a namespace for the sandbox.

The big advantage of using clone is that it's very easy to abandon a
sandbox.

2) Since he might want to propagate the delete to the original table, I
don't think just writing deletes is what he should do. In addition he
should write a new version of the cell with a visibility that appends "&
sandbox_deleted". Then default user view requests can simply omit the
"sandbox_deleted" authorization and those cells won't show up. That allows him
to implement "undelete" as well as a way to scan for cells to then delete
in the original table.
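
To make (2) concrete, here is a rough sketch in plain Python. It is a toy
model and not the Accumulo API: a visibility is reduced to a set of labels
that must all be held, the dict stands in for the cloned table, and the
function names (scan, sandbox_delete, sandbox_undelete, pending_deletes) are
made up for illustration.

```python
# Toy model, not the Accumulo API. A visibility here is a frozenset of labels
# that must ALL be held (a pure "&" expression); "|" and parentheses are ignored.
SANDBOX_DELETED = "sandbox_deleted"


def scan(table, auths):
    """Return {key: value} for cells whose labels are all covered by auths."""
    auths = set(auths)
    return {key: value for key, (labels, value) in table.items() if labels <= auths}


def sandbox_delete(table, key):
    """Hide a cell by rewriting it with '& sandbox_deleted' appended."""
    labels, value = table[key]
    table[key] = (labels | {SANDBOX_DELETED}, value)


def sandbox_undelete(table, key):
    """Restore a hidden cell by rewriting it with its original visibility."""
    labels, value = table[key]
    table[key] = (labels - {SANDBOX_DELETED}, value)


def pending_deletes(table):
    """Keys flagged in the sandbox: candidates to delete in the original table."""
    return sorted(key for key, (labels, _) in table.items() if SANDBOX_DELETED in labels)


# The dict stands in for the cloned sandbox table.
clone = {
    ("row1", "name"): (frozenset({"public"}), "alice"),
    ("row2", "name"): (frozenset({"public"}), "bob"),
}
sandbox_delete(clone, ("row2", "name"))
default_view = scan(clone, {"public"})                   # no sandbox_deleted auth
audit_view = scan(clone, {"public", SANDBOX_DELETED})    # sees hidden cells too
```

The default view never asks for "sandbox_deleted", so the flagged cell drops
out of it, while a scan that does include that authorization can find it
again for undelete or for publishing the delete.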

This will require writing just as many cells as updating things to append a
"& !group1" to visibility strings and should require similar read logic.

3) If live updates of new data are needed, then this same approach can be
applied with a little more complication. Notably, there is no table clone.
Additionally, the delete isn't issued but an additional cell is still
included with an appended visibility like "& group1" and a marker in the
key (chosen so that the cell will sort prior to the cells to be hidden).
The application will then need to handle the logic of applying the
sandbox-specific suppression in the normal view. Since the actions will
include what sandbox is being used as a lens, only the "filter me out"
flags for the appropriate one will show up in the scan results.

This requires writing less data than the NOT implementation (no deletes),
but does require some additional logic and data transfer. It could be done
via a scan iterator to obviate the extra data transfer.
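
A rough sketch of the suppression logic in (3), again as a toy Python model
rather than Accumulo code. The MARKER/DATA key component, the tuple layout of
a cell, and the function names are all assumptions made for illustration; the
only point is that a marker cell sorts ahead of the cell it hides and is only
returned when the scan carries that sandbox's label.

```python
# Toy model, not the Accumulo API. A cell is (key, labels, value) where
# key = (row, column, kind) and labels must ALL be held by the scanner.
MARKER, DATA = 0, 1  # hypothetical key component; markers sort before data


def scan(cells, auths):
    """Return visible (key, value) pairs in sorted key order."""
    auths = set(auths)
    return sorted((key, value) for key, labels, value in cells if labels <= auths)


def sandbox_view(cells, auths):
    """Apply the 'filter me out' markers that the given auths can see."""
    hidden = set()
    view = {}
    for (row, column, kind), value in scan(cells, auths):
        if kind == MARKER:
            # Seen first because it sorts before the data cell it suppresses.
            hidden.add((row, column))
        elif (row, column) not in hidden:
            view[(row, column)] = value
    return view


cells = [
    (("row1", "name", DATA), frozenset({"public"}), "alice"),
    (("row2", "name", DATA), frozenset({"public"}), "bob"),
    # Marker written by sandbox "group1": original visibility plus "& group1".
    (("row2", "name", MARKER), frozenset({"public", "group1"}), ""),
]
group1_view = sandbox_view(cells, {"public", "group1"})
group2_view = sandbox_view(cells, {"public", "group2"})
```

Here the loop runs in the application; the same check could in principle live
in a scan iterator so the marker and the suppressed cell never leave the
server.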

4) I don't know the full story around your edit logic, but be aware that
there is a bunch of conflict handling you're going to need to do (esp if
edits are staged in a sandbox). The conditional mutations added in 1.6
should make that much easier to implement. Unfortunately, the best I could
point you to for now is the LaTeX source of the new user manual. There's a
design proposal on the jira for the addition[1], which might help some.
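
For a feel of the kind of conflict check in (4), here is a toy check-and-set
in plain Python. It only illustrates the idea of "apply this write if the
cell still looks the way I expect"; the Table class, put_if, and publish are
hypothetical names and do not reflect the actual 1.6 conditional mutation
API.

```python
class Table:
    """Toy key/value table with a check-and-set write (not the Accumulo API)."""

    def __init__(self):
        self._cells = {}

    def get(self, key):
        return self._cells.get(key)

    def put_if(self, key, expected, new_value):
        """Write new_value only if the current value equals expected."""
        if self._cells.get(key) != expected:
            return False  # someone else changed the cell; caller must resolve
        self._cells[key] = new_value
        return True


def publish(table, key, base_value, staged_value):
    """Publish a staged sandbox edit only if the cell still holds the value
    the edit was based on."""
    return table.put_if(key, base_value, staged_value)


live = Table()
live.put_if("row1:name", None, "alice")   # initial write: cell expected absent

# Two sandboxes both staged an edit on top of "alice".
first = publish(live, "row1:name", "alice", "alicia")
second = publish(live, "row1:name", "alice", "allie")
```

The second publish is rejected because its base value is stale, which is the
point where the application has to decide how to merge or retry.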

-Sean

[1]: https://issues.apache.org/jira/browse/ACCUMULO-1000


On Wed, Mar 19, 2014 at 2:12 PM, Christopher <ctubbsii@apache.org> wrote:

> It sounds like you'd get some of your requirements to hide data by
> simply cloning a table to create a sandbox, in which one can issue
> actual deletes to remove it from that sandbox's view. Accumulo's clone
> feature will not duplicate data unnecessarily, so you could have many
> clones, each with different data removed from view (deleted).
>
> This would only work for snapshots, though. You wouldn't get updates
> from the original table that was shared/cloned. I'm not sure you need
> that, though.
>
> (Sorry if it sounds like I'm trying to argue against your use case for
> NOT. I'm not. I'm just trying to think if there's an alternative that
> can get you what you need today, whether or not we decide to include
> NOT.)
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Wed, Mar 19, 2014 at 2:15 PM, Jeff Kunkle <kunklejr@gmail.com> wrote:
> > The sandboxes are really just sharing pointers to data. Users might only
> > see a subset of that data depending on their authorizations.
> >
> > On Mar 19, 2014, at 2:09 PM, David Medinets <david.medinets@gmail.com>
> > wrote:
> >
> > Is data shared between sandboxes? Could namespaces proxy for sandboxes?
> >
> >
> > On Wed, Mar 19, 2014 at 1:46 PM, Mike Drob <madrob@cloudera.com> wrote:
> >>
> >> Thanks, that's really helpful. Couple more questions.
> >>
> >> Is a sandbox the same thing as a workspace? Can the terms be used
> >> interchangeably? Just want to make sure I'm not misinterpreting your
> >> answers.
> >>
> >> Is it fair to describe each sandbox as a separate index table for the
> >> global data set? And then when users do deletes, it is only reflected in
> >> the index fields, right?
> >> But you can't just delete values from the index because you need to keep
> >> track of the changes in case the user decides to delete globally (after
> >> appropriate authorization checks, etc...)
> >>
> >> Because the visibility is part of the key, changing it involves
> >> re-writing the data. Which might be just an index record in your case.
> >> However, this is generally an expensive operation.
> >>
> >> I think I need to think on this use case some more, it's definitely
> >> interesting and not something I had considered before.
> >>
> >>
> >>
> >> On Wed, Mar 19, 2014 at 1:24 PM, Jeff Kunkle <kunklejr@gmail.com> wrote:
> >>>
> >>> You have a large amount of data, that is generally readable by all users.
> >>>
> >>> Not necessarily. All data has some visibility constraint that a user's
> >>> authorizations may or may not satisfy.
> >>>
> >>> Users create their own sandbox, from which they can later exclude
> >>> portions of the global data set.
> >>>
> >>> Yes, users create their own sandboxes which are populated with global
> >>> data. They may decide to delete some of that data and the change needs
> >>> to be scoped to their sandbox until the change is published globally.
> >>>
> >>>
> >>> Users can share their sandbox with others, so really we are talking about
> >>> sandbox permissions and not so much user permissions.
> >>>
> >>> Yes, users can share their sandbox with others, but a sandbox is just a
> >>> collection of pointers to data. Users sharing a workspace may not
> >>> necessarily see all of the same data depending on their authorizations.
> >>>
> >>> Sandboxes are created often. Or, at least much more often than the data
> >>> changes.
> >>>
> >>> Yes, sandboxes are created often. The data is likely to be ingested more
> >>> frequently than sandboxes will be created.
> >>>
> >>> Do users typically remove large amounts of data from their sandbox? 1%?
> >>> 10%? 99%?
> >>>
> >>> I don’t have good numbers to share here.
> >>>
> >>> Assuming data is removed via rules, are the rules applied automatically
> >>> to new data under ingest?
> >>>
> >>> I would say no, although I’m not positive I understand the question.
> >>> Users are not removing data from their sandbox per se, but they may
> >>> delete data that should then be hidden from their workspace. The data is
> >>> not really deleted though and is still visible to other users in other
> >>> sandboxes. Only when the deletion is published does it get deleted for
> >>> everyone.
> >>>
> >>> On Mar 19, 2014, at 1:03 PM, Mike Drob <madrob@cloudera.com> wrote:
> >>>
> >>> Wait, I'm really confused by what you are describing, Jeff. Sorry if
> >>> these are obvious questions, but can you help me get a better grasp of
> >>> your use case?
> >>>
> >>> You have a large amount of data, that is generally readable by all users.
> >>> Users create their own sandbox, from which they can later exclude
> >>> portions of the global data set.
> >>> Users can share their sandbox with others, so really we are talking about
> >>> sandbox permissions and not so much user permissions.
> >>> Sandboxes are created often. Or, at least much more often than the data
> >>> changes.
> >>>
> >>> Are those all accurate statements? If so, can you clarify the following
> >>> points:
> >>>
> >>> Do users typically remove large amounts of data from their sandbox? 1%?
> >>> 10%? 99%?
> >>> Assuming data is removed via rules, are the rules applied automatically
> >>> to new data under ingest?
> >>>
> >>> Thanks,
> >>> Mike
> >>>
> >>>
> >>> On Wed, Mar 19, 2014 at 12:54 PM, Jeff Kunkle <kunklejr@gmail.com> wrote:
> >>>>
> >>>> Hi John,
> >>>>
> >>>> Yes it’s accurate that the system controls the label and who is
> >>>> associated with it; there are no Accumulo-internal user accounts. But I
> >>>> don’t think it’s feasible to remove a sandbox label from something that
> >>>> should be hidden. Such a scenario would imply that all data is “tagged”
> >>>> with the labels of every sandbox that is allowed to see the data, which
> >>>> would be most. It would also imply that the creation of a new sandbox
> >>>> would necessitate changing the visibility of everything in Accumulo to
> >>>> include the new sandbox label, effectively rewriting the entire database.
> >>>> Sandboxes are created and deleted all the time in our application, so it
> >>>> doesn’t seem like a feasible solution to me.
> >>>>
> >>>> -Jeff
> >>>>
> >>>> On Mar 19, 2014, at 12:16 PM, Josh Elser <josh.elser@gmail.com> wrote:
> >>>>
> >>>> > It kind of sounds like you could manage this much easier by
> >>>> > controlling the authorizations a user gets (notably the workspace
> >>>> > name) and the grant/revoke above the Accumulo level.
> >>>> >
> >>>> > A sandbox has a unique label and the external system controls which
> >>>> > users are granted that label. This way, each sandbox can be modified
> >>>> > individually (using authorizations that contain the data visibility
> >>>> > and the sandbox label) or the original data set could be modified (by
> >>>> > omitting a sandbox label in the authorizations used).
> >>>> >
> >>>> > Is that accurate?
> >>>> >
> >>>> > On 3/19/14, 12:05 PM, Jeff Kunkle wrote:
> >>>> >> I attempted to simplify the scenario to facilitate discussion, which
> >>>> >> on second thought may have been a mistake. Here’s the whole scenario:
> >>>> >>
> >>>> >> Different users have access to different subsets of the data
> >>>> >> depending on their authorizations and the visibility of the data.
> >>>> >> Users “work with” the data in what we call a sandbox. Sandboxes can be
> >>>> >> shared with other users (this is the group creation I was talking
> >>>> >> about earlier). Deletes to the data would be “scoped” to the sandbox
> >>>> >> by changing the visibility to add “& !workspace_name” so that people
> >>>> >> viewing the workspace wouldn’t see the data but everyone else would.
> >>>> >>
> >>>> >> On Mar 19, 2014, at 11:48 AM, Sean Busbey <busbey+lists@cloudera.com>
> >>>> >> wrote:
> >>>> >>
> >>>> >>> On Wed, Mar 19, 2014 at 10:43 AM, Jeff Kunkle <kunklejr@gmail.com>
> >>>> >>> wrote:
> >>>> >>>
> >>>> >>>    New groups are created on the fly by our application when needed.
> >>>> >>>    Under the scenario you describe we’d have to go through all the
> >>>> >>>    data in Accumulo whenever a group is created so that users in the
> >>>> >>>    group can see the existing data.
> >>>> >>>
> >>>> >>> Ah! So your use case is that all data defaults to world readable and
> >>>> >>> then users have the option of opting out of seeing subsets. Right?
> >>>> >>>
> >>>> >>> In your scenario user groups also get to opt-out of seeing data on
> >>>> >>> the fly, yes? Both require rewriting the data. Does the group creation
> >>>> >>> happen more often?
> >>>> >>
> >>>>
> >>>
> >>>
> >>
> >
> >
>
