accumulo-dev mailing list archives

From Adam Fuchs <adam.p.fu...@ugov.gov>
Subject Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Date Thu, 22 Dec 2011 21:00:45 GMT
Aaron,

I think it would be more accurate to describe Accumulo as an underlying
multi-map with support for aggregation overlays. A map can be thought of as
a multi-map with an overlay that takes the first of the multiple entries.
This is in fact the default configuration of Accumulo tables, where the
VersioningIterator defines this overlay. Other Iterator configurations
provide different overlays.
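
As a rough illustration of the multi-map-plus-overlay view, here is a minimal
Python sketch. It is not Accumulo code: build_multimap, versioning_overlay and
summing_overlay are hypothetical names, and the table is modelled as a plain
in-memory dictionary under the assumption that a cell is identified by row,
column family and column qualifier.

```python
from collections import defaultdict

def build_multimap(entries):
    """Hypothetical stand-in for a table: each cell keeps every entry written to it."""
    table = defaultdict(list)
    for row, family, qualifier, timestamp, value in entries:
        table[(row, family, qualifier)].append((timestamp, value))
    # Order each cell's entries newest timestamp first.
    for versions in table.values():
        versions.sort(key=lambda tv: tv[0], reverse=True)
    return table

def versioning_overlay(table):
    """Map-like view: take only the first (newest) entry for each cell."""
    return {cell: versions[0][1] for cell, versions in table.items()}

def summing_overlay(table):
    """Aggregating view: combine every entry for each cell."""
    return {cell: sum(v for _, v in versions) for cell, versions in table.items()}

entries = [
    ("a", "foo", "bar", 1, 1),
    ("a", "foo", "bar", 2, 2),
    ("b", "foo", "bar", 1, 5),
]
table = build_multimap(entries)
latest = versioning_overlay(table)
totals = summing_overlay(table)
```

Both views read the same underlying entries; only the overlay differs.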

There are two challenges that make it difficult to cast the underlying
representation as a map. The first is that the definition of uniqueness of
a Key is a bit muddy. I think that many users consider the uniqueness to
include row, column family, and column qualifier. Those that use cell-level
security also include the column visibility. Timestamp doesn't usually make
it into the uniqueness concept, from a user's perspective, even though that
affects the sort order of Keys. In fact, most users let Accumulo set the
timestamp for them. I think your definition of uniqueness takes timestamp
into account, and from that perspective what we're doing is sort of like
providing a finer grained timestamp instead of using one timestamp for an
entire Mutation (or for all Mutations that show up within a millisecond).

The second challenge is that the overlay is persisted and is not
reversible. Aggregators don't keep the Keys that they aggregate, so if a
user wants to replace a Key in the underlying map and have that replacement
operation be reflected in the overlay, we can't really do that. However, we
can do that if the underlying store is a multi-map (which is what we do
now).
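
A small Python sketch of why the persisted overlay is not reversible, under
the assumption that aggregation is a sum and that persisting it discards the
entries it combined. The compact function is a hypothetical stand-in, not
Accumulo's API.

```python
def compact(entries):
    """Fold all (timestamp, value) entries for one cell into a single summed entry.

    The individual entries are discarded, so later writes cannot refer to them.
    """
    newest = max(ts for ts, _ in entries)
    return [(newest, sum(v for _, v in entries))]

cell = [(1, 1)]          # a:foo:bar -> 1 is written
cell = compact(cell)     # the aggregate is persisted; the original entry is gone
cell.append((2, 2))      # a:foo:bar -> 2 arrives later
cell = compact(cell)     # the new entry is folded into the running sum
total = cell[0][1]
```

Treating the second write as a replacement would require subtracting the
earlier 1, which can no longer be identified once it has been folded into the
sum. Treating it as one more entry in a multi-map needs no such bookkeeping.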

Adam

On Thu, Dec 22, 2011 at 3:41 PM, Aaron Cordova <aaron@cordovas.org> wrote:

> Rather than aggregation functionality being defined as some operation
> performed across a set of the values of different keys, you're advocating
> allowing inserting identical keys and aggregating their values as well?
> This just seems semantically sloppy to me.
>
> These types of changes just incur a cost in terms of understanding for the
> user. Rather than being able to describe Accumulo as a map, a well defined
> and understood concept, that also supports aggregations over a set of keys
> that share a subkey, we would then have to describe Accumulo as a map, most
> of the time, except when it functions more like a multi-map, in the case of
> aggregation in the presence of multiple values for the same key ... it's
> just confusing.
>
> Even with aggregators configured over a table, it still functions as a map
> - in fact like two maps, one 'underlying' map, in which each key has one
> value, and an 'aggregate' map, in which keys also have one value, defined
> as an aggregation over the 'underlying' map. Perhaps one could argue that
> what I just described could be termed a multi-map, but from the user's
> point of view, thinking of it as an 'underlying' map, which is how the user
> sees the table when writing, and an 'aggregate' map, which is how the user
> sees the table when reading, is cleaner. Users are used to this situation
> if they've ever used views in a relational database.
>
> For you and John, who are steeped in this field, this distinction, and
> this change, probably doesn't seem like a big deal. But when telling a new
> user about Accumulo, being able to explain to them that Accumulo is a map
> is very useful. It makes predicting the behavior of Accumulo possible. If
> users can put identical key-value pairs into a mutation, and if Accumulo
> treats them as distinct, users' predictions will be wrong.
>
> Feel free to make this change, but just consider the collective cognitive
> cost incurred by altering the semantics. Earlier you argued that the cost
> of extending the times aggregations are executed to include the client
> would be too great. Yet making it possible for Accumulo to cease acting
> like a map sometimes doesn't give you pause?
>
> On Dec 22, 2011, at 2:52 PM, Adam Fuchs wrote:
>
> > Aaron,
> >
> > I have to disagree with you. By default, Accumulo tables are distributed
> > maps. However, as soon as you configure an aggregator or some other
> > interesting iterator on a table, the semantics for that table change and
> > it is no longer a "proper" distributed map. Therefore I claim that the
> > basic tenet to which you refer does not exist as such.
> >
> > Users generally don't set the timestamps in a mutation, and aggregators
> > certainly don't preserve the keys that they aggregate. Are you suggesting
> > that modifying the value associated with a key that has already
> > contributed to a persisted aggregate should have an effect that is
> > dependent on the original value? So, if I sum a:foo:bar->1 and then
> > a:foo:bar->2 I should get 2?
> >
> > The fix that is suggested in this ticket just makes the behavior
> > consistent between the cases of putting two identical entries in one
> > mutation versus putting the two entries in two mutations. However we
> > account for the semantics of aggregation, we should be for this change.
> >
> > Adam
> >
> >
> > On Thu, Dec 22, 2011 at 12:31 PM, Aaron Cordova (Commented) (JIRA) <
> > jira@apache.org> wrote:
> >
> >>
> >>   [ https://issues.apache.org/jira/browse/ACCUMULO-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174913#comment-13174913 ]
> >>
> >> Aaron Cordova commented on ACCUMULO-227:
> >> ----------------------------------------
> >>
> >> What the client should expect is that Accumulo will only store/process
> >> one value per unique key: Accumulo is a distributed map. Even if it's only
> >> for aggregation's sake, allowing Mutations to submit multiple values per
> >> unique key and processing all those values, rather than arbitrarily
> >> choosing one, violates the concept of a map, which will cause more
> >> confusion on the part of users.
> >>
> >> The right thing to do for users who want to submit lots of values to
> >> aggregate under a sub key is to insist that they make their cells differ
> >> by at least one element in the key. Again, aggregating multiple values
> >> under the same key violates the basic tenet that Accumulo is a map.
> >> Aggregation is performed across different keys sharing a sub key.
> >>
> >> If having the users generate unique timestamps is a problem, there are
> >> several strategies for dealing with that. One is to generate random
> >> timestamps. If aggregation is being done over timestamps, the actual
> >> timestamp shouldn't matter / ever be interpreted. If there are worries
> >> about Accumulo doing something undesired with random timestamps, one
> >> could generate random column qualifiers, etc. and aggregate over those.
> >>
> >> To address what Adam said about versioning - aggregating tables should
> >> probably turn off the iterator that only keeps the latest version. But
> >> that has nothing to do with the policy for handling multiple identical
> >> cells.
> >>
> >> Finally, I'm not advocating we do anything to support aggregation on the
> >> client side, but rather leave it up to the application developer to
> >> exploit any opportunities for aggregation in their application.
> >>
> >>
> >>> Improve in memory map counts to provide cell level uniqueness for
> >>> repeated columns in mutation
> >>>
> >>> -----------------------------------------------------------------------------------------------
> >>>
> >>>                Key: ACCUMULO-227
> >>>                URL: https://issues.apache.org/jira/browse/ACCUMULO-227
> >>>            Project: Accumulo
> >>>         Issue Type: Improvement
> >>>         Components: tserver
> >>>           Reporter: John Vines
> >>>           Assignee: John Vines
> >>>            Fix For: 1.5.0
> >>>
> >>>
> >>> Currently for isolation we only isolate mutations. This doesn't allow
> >>> mutations with identical cells within them. We should increase the
> >>> mutation counts to account for each individual cell instead of each
> >>> mutation.
> >>
