couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Garren Smith <gar...@apache.org>
Subject Re: [DISCUSS] Reduce on FDB take 3
Date Mon, 27 Jul 2020 07:24:41 GMT
Hi Bob,

That makes sense.

Cheers
Garren

On Sun, Jul 26, 2020 at 2:57 PM Robert Newson <rnewson@apache.org> wrote:

>
> Both approaches will collate keys the same way, it’s a showstopping bug if
> they don’t match existing couch collation order.
>
> I’d probably not even document that right now. We can document what aegis
> does and doesn’t cover when we make the first 4.0 RC.
> --
>   Robert Samuel Newson
>   rnewson@apache.org
>
> On Sun, 26 Jul 2020, at 13:27, Jan Lehnardt wrote:
> > Heya Bob,
> >
> > this sounds like a good path forward that allows us to validate the
> > remaining pieces of this.
> >
> > Just two side notes:
> >
> > - do we expect that the map-keys-in-fdb approach would sort differently
> > from map-keys-in-ebtree? (I assume no)
> > - I further assume that we’ll find a nicer UX for “make sure my views
> > are encrypted” than “make sure your view has a reduce fun”, no details
> > or drafts necessary at this point.
> >
> > Best
> > Jan
> > —
> >
> > > On 26. Jul 2020, at 14:13, Robert Samuel Newson <rnewson@apache.org>
> wrote:
> > >
> > > I'd like to find consensus on a path forward and, with that in mind, I
> have a suggestion.
> > >
> > > Before that, I think it's useful to lay out the points of contention,
> which I've ranked in I hope an uncontroversial order;
> > >
> > > The principal concern with using ebtree for map as well as reduce is
> that it precludes the use of foundationdb's ordering, on which much
> performance and scalability likely hinges. I agree with this on principle
> (that is, its demonstrably true) though none of us yet know the extent to
> which it matters in practice, or whether any gap could be closed with
> optimization.
> > >
> > > The secondary concern is the proposed use of ebtree for reduce-only
> where it receives only the reduce values from a document. I understand
> garren's stated desire to avoid duplication with the map side but I don't
> think it works out. Users of the reduce function are using it to get the
> reductions _across_ documents. It is rare (that is, I don't recall seeing
> an example by a cloudant customer) to emit multiple rows in a map function
> for the same key. In the cases where it doesn't, the value inserted into
> ebtree is the same as the map value anyway. Where it does happen, we're
> still duplicating the key and _a_ document level value. It seems a very
> modest saving of space for a more complicated design (two separate levels
> of reduce calculated in two separate modules).
> > >
> > > The tertiary concern is that the map-only indexes place user data
> outside of aegis (native database encryption) as, necessarily, the order of
> keys is important and a straightforward encryption there would not preserve
> it.
> > >
> > > Looking at all the options available I don't think there is a single
> path that addresses everything. There are also significant unknowns around
> practical performance for all of this.
> > >
> > > So, I suggest that the map-only code (specifically, the code we will
> use for an index definition that does not specify a reduce function) uses
> the existing couch_views approach, where the fdb keys are suffixed with the
> emitted key from the map function, and the fdb values are the emitted
> values. This vertical index should support the fastest ?key= and
> startkey/endkey= queries that fdb is capable of. There might be
> simplifications we can make to this code if we accept that it only needs to
> handle map-only indexes.
> > >
> > > For index definitions that _do_ specify a reduce function, I suggest
> this is done entirely with ebtree.We would call ebtree:insert(Db, Tree,
> Key, Value) where Key and Value are the emitted key and value from the map
> function. ebtree:open is handed a reduce_fun which handles the reduce and
> rereduce for the users reduction and, like couchdb 1/2/3, we also mix in a
> system reduction that calculates the view size (paul's diff above appears
> to do just this portion and looks broadly right).
> > >
> > > This avoids the duplication that garren's idea of storing the reduces
> in ebtree was trying to avoid, but does so in all cases. This approach
> allows us to compare the performance of map queries against map-only and
> map-reduce indexes, it allows users to opt in or out of encrypting their
> emitted view keys, and it doesn't prevent anyone from choosing to do both
> (You can define a view with map function A and another view with map
> function A and reduce function B, the former will be the couch_views
> vertical index, the latter an ebtree "horizontal" index).
> > >
> > > As we learn more over time we could enhance one type of index or the
> other and reach a point where we eliminate one, or the two could coexist
> indefinitely.
> > >
> > >> On 24 Jul 2020, at 20:00, Paul Davis <paul.joseph.davis@gmail.com>
> wrote:
> > >>
> > >> FWIW, a first pass at views entirely on ebtree turned out to be fairly
> > >> straightforward. Almost surprisingly simple in some cases.
> > >>
> > >>
> https://github.com/apache/couchdb/compare/prototype/fdb-layer...prototype/fdb-layer-ebtree-views
> > >>
> > >> Its currently passing all tests in `couch_views_map_test.erl` but is
> > >> failing on other suites. I only did a quick skim on the failures but
> > >> they all look superficial around some APIs I changed.
> > >>
> > >> I haven't added the APIs to query reduce functions via HTTP but the
> > >> reduce functions are being executed to calculate row counts and KV
> > >> sizes. Adding the builtin reduce functions and extending those to user
> > >> defined reduce functions should be straightforward.
> > >>
> > >> On Fri, Jul 24, 2020 at 9:39 AM Robert Newson <rnewson@apache.org>
> wrote:
> > >>>
> > >>>
> > >>> I’m happy to restrict my PR comments to the actual diff, yes. So
I’m
> not +1 yet.
> > >>>
> > >>> I fixed the spurious conflicts at
> https://github.com/apache/couchdb/pull/3033.
> > >>>
> > >>> --
> > >>> Robert Samuel Newson
> > >>> rnewson@apache.org
> > >>>
> > >>> On Fri, 24 Jul 2020, at 14:59, Garren Smith wrote:
> > >>>> Ok so just to confirm, we keep my PR as-is with ebtree only for
> reduce. We
> > >>>> can get that ready to merge into fdb master. We can then use that
> to battle
> > >>>> test ebtree and then look at using it for the map side as well.
At
> that
> > >>>> point we would combine the reduce and map index into a ebtree
> index. Are
> > >>>> you happy with that?
> > >>>>
> > >>>> Cheers
> > >>>> Garren
> > >>>>
> > >>>> On Fri, Jul 24, 2020 at 3:48 PM Robert Newson <rnewson@apache.org>
> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> It’s not as unknown as you think but certainly we need empirical
> data to
> > >>>>> guide us on the reduce side. I’m also fine with continuing
with the
> > >>>>> map-only code as it stands today until such time as we demonstrate
> ebtree
> > >>>>> meets or exceeds our needs (and I freely accept the possibility
> that it
> > >>>>> might not).
> > >>>>>
> > >>>>> I think the principal enhancement to ebtree that would address
most
> > >>>>> concerns is if it could store the leaf entries vertically (as
they
> are
> > >>>>> currently). I have some thoughts which I’ll try to realise
as
> working code.
> > >>>>>
> > >>>>> I’ve confirmed that I do create spurious conflicts and will
have a
> PR up
> > >>>>> today to fix that.
> > >>>>>
> > >>>>> --
> > >>>>> Robert Samuel Newson
> > >>>>> rnewson@apache.org
> > >>>>>
> > >>>>> On Fri, 24 Jul 2020, at 12:43, Garren Smith wrote:
> > >>>>>> Hi Bob,
> > >>>>>>
> > >>>>>> Thanks for that explanation, that is really helpful and
it is
> good we
> > >>>>> have
> > >>>>>> some options but it also does highlight a lot of unknowns.
> Whereas our
> > >>>>>> current map indexes are really simple and we know its behaviour.
> There
> > >>>>> are
> > >>>>>> no real unknowns. Adding ebtree here could make map indexes
a
> fair bit
> > >>>>> more
> > >>>>>> complicated since we don't know the effect of managing
different
> node
> > >>>>>> sizes, concurrent doc updates, and querying performance.
> > >>>>>>
> > >>>>>> Could we do a compromise here, could we look at using ebtree
only
> for
> > >>>>>> reduces now? That means that we can have reduce indexes
working
> quite
> > >>>>> soon
> > >>>>>> on CouchDB on FDB. At the same time, we can work on ebtree
and run
> > >>>>>> performance tests on ebtree for map indexes. Some interesting
> tests we
> > >>>>> can
> > >>>>>> do is see if a user emits KV's near the limits (8KB for
keys and
> 50KB for
> > >>>>>> values) how does ebtree handle that? How it handles being
updated
> in the
> > >>>>>> doc update transaction. And general query performance.
What do
> you think?
> > >>>>>>
> > >>>>>> Cheers
> > >>>>>> Garren
> > >>>>>>
> > >>>>>>
> > >>>>>> On Fri, Jul 24, 2020 at 10:06 AM Robert Samuel Newson <
> > >>>>> rnewson@apache.org>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> A short preface at this stage is needed I think:
> > >>>>>>>
> > >>>>>>> My goal with ebtree was to implement a complete and
correct
> b+tree that
> > >>>>>>> also calculated and maintained inner reductions. I
consciously
> chose
> > >>>>> not to
> > >>>>>>> go further than that before presenting it to the group
for wider
> debate
> > >>>>>>> (indeed, the very debate we're having in this thread).
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>>
> > >>>>>>> Ebtree is not single writer, at least not inherently.
Two
> updates to
> > >>>>> the
> > >>>>>>> same ebtree should both succeed as long as they don't
modify the
> same
> > >>>>> nodes.
> > >>>>>>>
> > >>>>>>> Modifying the same node is likely where there's a reduce
> function,
> > >>>>> though
> > >>>>>>> only if the reduction value actually changes, as that
percolates
> up the
> > >>>>>>> tree. Where the reduce value does not change and when
neither
> > >>>>> transaction
> > >>>>>>> causes a split, rebalance, or merge that affects the
other, they
> should
> > >>>>>>> both commit without conflict. There is a rich history
of
> optimizations
> > >>>>> in
> > >>>>>>> this space specifically around btrees. ebtree might
cause
> spurious
> > >>>>>>> conflicts today, I will investigate and propose fixes
if so
> (e.g, I
> > >>>>> think I
> > >>>>>>> probably call erlfdb:set on nodes that have not changed).
> > >>>>>>>
> > >>>>>>> I certainly envisaged updating ebtree within the same
txn as a
> doc
> > >>>>> update,
> > >>>>>>> which is why the first argument to all the public functions
can
> be
> > >>>>> either
> > >>>>>>> an erlfdb Db or open Tx.
> > >>>>>>>
> > >>>>>>> Parallelising the initial build is more difficult with
ebtree
> than
> > >>>>>>> parallelising the existing couch_views map-only code,
though
> it's worth
> > >>>>>>> noting that ebtree:insert benefits a great deal from
batching
> (multiple
> > >>>>>>> calls to :insert in the same transaction). For an offline
build
> (i.e,
> > >>>>> an
> > >>>>>>> index that the client cannot see until the entire build
is
> complete)
> > >>>>> the
> > >>>>>>> batch size can be maximised. That is still a serial
process in
> that
> > >>>>> there
> > >>>>>>> is only one transaction at a time updating the ebtree.
I can't
> say
> > >>>>> offhand
> > >>>>>>> how fast that is in practice, but it is clearly less
powerful
> than a
> > >>>>> fully
> > >>>>>>> parallelisable approach could be.
> > >>>>>>>
> > >>>>>>> Any parallel build would require a way to divide the
database
> into
> > >>>>>>> non-overlapping subsets of emitted keys. This is easy
and
> natural if
> > >>>>> the
> > >>>>>>> fdb key is the emitted key, which is the case for the
couch_views
> > >>>>> map-only
> > >>>>>>> code. For ebtree it might be enough to simply grab
a large chunk
> of
> > >>>>>>> documents, perform the map transform, and then issues
multiple
> > >>>>> transactions
> > >>>>>>> on subsets of those.
> > >>>>>>>
> > >>>>>>> Another common technique for btrees is bulk loading
(more or less
> > >>>>>>> literally constructing the btree nodes directly from
the source,
> as
> > >>>>> long as
> > >>>>>>> you can sort it), which might be an option as well.
> > >>>>>>>
> > >>>>>>> Parallelising a build _with_ a reduce function seems
hard
> however we do
> > >>>>>>> it. The non-ebtree approach is parallelisable by virtue
of
> paring down
> > >>>>> the
> > >>>>>>> reduce functionality itself (only whole key groups,
only those
> > >>>>> functions
> > >>>>>>> that fdb has atomic operations for).
> > >>>>>>>
> > >>>>>>> I will first of all verify the multi-writer nature
of ebtree as
> it
> > >>>>> stands
> > >>>>>>> today and make a PR which fixes any spurious conflicts,
and then
> ponder
> > >>>>>>> further  how true parallel builds might be possible.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> On 24 Jul 2020, at 07:30, Garren Smith <garren@apache.org>
> wrote:
> > >>>>>>>>
> > >>>>>>>> We haven't spoken much about updates with ebtree.
> > >>>>>>>> From my understanding of ebtree it can only do
single writer,
> is that
> > >>>>>>>> correct?. If that is true it means we would not
be able to
> > >>>>> parallelize
> > >>>>>>> the
> > >>>>>>>> background building of views to speed up view builds.
> > >>>>>>>> The other thing that would mean is that we cannot
use it for
> mango
> > >>>>> where
> > >>>>>>> we
> > >>>>>>>> update the view in the doc transaction.
> > >>>>>>>>
> > >>>>>>>> Cheers
> > >>>>>>>> Garren
> > >>>>>>>>
> > >>>>>>>> On Fri, Jul 24, 2020 at 2:35 AM Kyle Snavely <
> kjsnavely@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> When in doubt, throw us a build at Cloudant
with ebtree maps
> and
> > >>>>> we'll
> > >>>>>>> see
> > >>>>>>>>> if it comes close to the crazy fast KV map
queries.
> > >>>>>>>>>
> > >>>>>>>>> Kyle
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Jul 23, 2020, 2:17 PM Robert Samuel
Newson <
> > >>>>> rnewson@apache.org>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I (perhaps obviously) don't agree that
I'm tying myself to old
> > >>>>> CouchDB
> > >>>>>>> or
> > >>>>>>>>>> failing to embrace FDB. FDB is a toolkit
and does not, to my
> mind,
> > >>>>>>> force
> > >>>>>>>>> us
> > >>>>>>>>>> down any particular path.
> > >>>>>>>>>>
> > >>>>>>>>>> I haven't sat down to modify couch_views
in the manner I've
> > >>>>> suggested
> > >>>>>>>>>> (where ebtree is used as designed; being
fed the output of the
> > >>>>> emit()
> > >>>>>>>>> calls
> > >>>>>>>>>> and calculating reductions as it does so)
but I think it's a
> > >>>>> worthwhile
> > >>>>>>>>>> exercise. I'd be surprised if performance
of map-only
> traversals
> > >>>>> would
> > >>>>>>> be
> > >>>>>>>>>> disappointing but who knows? I also expect
it would allow for
> > >>>>>>> significant
> > >>>>>>>>>> simplification of the code, which is one
of the higher
> virtues.
> > >>>>>>>>>>
> > >>>>>>>>>> Adam, can you describe in a little more
detail how you picture
> > >>>>> "b+tree
> > >>>>>>> is
> > >>>>>>>>>> only used for incremental aggregations,"?
It's implied in your
> > >>>>> reply
> > >>>>>>> that
> > >>>>>>>>>> it would preserve the "interesting property"
of keeping user
> data
> > >>>>> out
> > >>>>>>> of
> > >>>>>>>>>> FDB Keys (for casual readers: the new native
database
> encryption,
> > >>>>>>> called
> > >>>>>>>>>> "aegis", only encrypts the FDB value. It
can't encrypt the
> key as
> > >>>>> this
> > >>>>>>>>>> would change the order of keys, which the
current code
> depends on).
> > >>>>>>> Did I
> > >>>>>>>>>> misread you?
> > >>>>>>>>>>
> > >>>>>>>>>> B.
> > >>>>>>>>>>
> > >>>>>>>>>>> On 23 Jul 2020, at 20:11, Adam Kocoloski
<
> kocolosk@apache.org>
> > >>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> OK thanks for the clarification. As
I said I wasn’t all that
> > >>>>> confident
> > >>>>>>>>> I
> > >>>>>>>>>> understood the design :)
> > >>>>>>>>>>>
> > >>>>>>>>>>> I like the idea that the b+tree is
only used for incremental
> > >>>>>>>>>> aggregations, rather than storing the entire
materialized
> view,
> > >>>>> for the
> > >>>>>>>>>> same reasons that Garren stated.
> > >>>>>>>>>>>
> > >>>>>>>>>>> An index maintained entirely in ebtree
has the interesting
> > >>>>> property
> > >>>>>>>>> that
> > >>>>>>>>>> it does not leak any user data into FDB
Keys, which could be
> > >>>>> attractive
> > >>>>>>>>> for
> > >>>>>>>>>> security reasons.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Adam
> > >>>>>>>>>>>
> > >>>>>>>>>>>> On Jul 23, 2020, at 1:54 PM, Garren
Smith <
> garren@apache.org>
> > >>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Thu, Jul 23, 2020 at 6:55 PM
Paul Davis <
> > >>>>>>>>> paul.joseph.davis@gmail.com
> > >>>>>>>>>>>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>> I would like to keep ebtree
to use just for the reduce
> index.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Could you expand on your reasoning
here, Garren? I haven't
> done
> > >>>>> any
> > >>>>>>>>>>>>> experiments on my own to understand
if I'm missing
> something
> > >>>>>>>>>>>>> important. My initial reaction
is to not diverge too far
> from
> > >>>>> the
> > >>>>>>>>>>>>> previous shape of the implementation
since we have a decent
> > >>>>> idea of
> > >>>>>>>>>>>>> how that behaves already but
perhaps you've seen or
> measured
> > >>>>>>>>> something
> > >>>>>>>>>>>>> I'm not thinking of?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I think this must have been a misunderstanding
on my part. I
> > >>>>> always
> > >>>>>>>>>> thought
> > >>>>>>>>>>>> of using ebtree to solve reduce
and wasn't planning to use
> it
> > >>>>> for the
> > >>>>>>>>>> map
> > >>>>>>>>>>>> index.
> > >>>>>>>>>>>> I don't like the idea that we have
ordered distributed
> key/value
> > >>>>>>> store
> > >>>>>>>>>> and
> > >>>>>>>>>>>> then implementing a b-tree on top
of that for indexing.
> > >>>>> Especially
> > >>>>>>>>>> since we
> > >>>>>>>>>>>> know that the current map index
is fast,
> > >>>>>>>>>>>> easy to use, and follows recommended
practices in the FDB
> > >>>>> community
> > >>>>>>> on
> > >>>>>>>>>> how
> > >>>>>>>>>>>> to do indexing. ebtree makes sense
for reduce and is a
> means to
> > >>>>> an
> > >>>>>>> end
> > >>>>>>>>>> to
> > >>>>>>>>>>>> give us CouchDB's reduce api, which
is heavily reliant on a
> > >>>>> b-tree,
> > >>>>>>>>> with
> > >>>>>>>>>>>> CouchDB on FDB. This feels like
a step backward and I worry
> we
> > >>>>> are
> > >>>>>>>>> tying
> > >>>>>>>>>>>> ourselves heavily to old CouchDB
instead of using the fact
> that
> > >>>>> we
> > >>>>>>>>>> moving
> > >>>>>>>>>>>> to FDB which then allows us to
design new api's and
> > >>>>> functionality.
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message