beam-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jan Lukavský <je...@seznam.cz>
Subject Re: DISCUSS: Sorted MapState API
Date Thu, 18 Jun 2020 20:10:53 GMT
My questions were just an example. I fully agree there is a fundamental 
need for a sorted state (of some form, and I also think this links to 
efficient implementation of retrations) - I was reacting to Kenn's 
question about BIP. This one would be pretty nice example why it would 
be good to have such a "process" - not everything can be solved on ML 
and there are fundamental decisions that might need a closer attention.

On 6/18/20 5:28 PM, Reuven Lax wrote:
> Jan - my proposal is exactly TimeSortedBagState (more accurately - 
> TimeSortedListState), though I went a bit further and also proposed a 
> way to have a dynamic number of tagged TimeSortedBagStates.
>
> You are correct that the runner doesn't really have to store the data 
> time sorted - what's actually needed is the ability to fetch and 
> remove timestamp ranges of data (though that does include fetching the 
> entire list); TimeOrderedState is probably a more accurate name then 
> TimeSortedState. I don't think we could get away with operations that 
> only act on the smallest timestamp, however we could limit the API to 
> only being able to fetch and remove prefixes of data (ordered by 
> timestamp). However if we support prefixes, we might as well support 
> arbitrary subranges.
>
> On Thu, Jun 18, 2020 at 7:26 AM Jan Lukavský <je.ik@seznam.cz 
> <mailto:je.ik@seznam.cz>> wrote:
>
>     Big +1 for a BIP, as this might really help clarify all the pros
>     and cons of all possibilities. There seem to be questions that
>     need answering and motivating use cases - do we need sorted map
>     state or can we solve our use cases by something simpler - e.g.
>     the mentioned TimeSortedBagState? Does that really have to be
>     time-sorted structure, or does it "only" have to have operations
>     that can efficiently find and remove element with smallest
>     timestamp (like a PriorityQueue)?
>
>     Jan
>
>     On 6/18/20 5:32 AM, Kenneth Knowles wrote:
>>     Zooming in from generic philosophy to be clear: adding time
>>     ordered buffer to the Fn state API is *not* a shortcut.It has
>>     benefits that will not be achieved by SDK-side implementation on
>>     top of either ordered or unordered multimap. Are those benefits
>>     worth expanding the API? I don't know.
>>
>>     A change to allow a runner to have a specialized implementation
>>     for time-buffered state would be one or more StateKey types,
>>     right? Reuven, maybe put this and your Java API in a doc? A BIP?
>>     Seems like there's at least the following to explore:
>>
>>      - how that Java API would map to an SDK-side implementation on
>>     top of multimap state key
>>      - how that Java API would map to a new StateKey
>>      - whether there's actually more than one relevant implementation
>>     of that StateKey
>>      - whether SDK-side implementation on some other state key would
>>     be performant enough in all SDK languages (present and future)
>>
>>     Zooming back out to generic philosophy: Proliferation of StateKey
>>     types tuned by runners (which can very easily still share
>>     implementation) is probably better than proliferation of complex
>>     SDK-side implementations with varying completeness and performance.
>>
>>     Kenn
>>
>>     On Wed, Jun 17, 2020 at 3:24 PM Reuven Lax <relax@google.com
>>     <mailto:relax@google.com>> wrote:
>>
>>         It might help for me to describe what I have in mind. I'm
>>         still proposing that we build multimap, just not a
>>         globally-sorted multimap.
>>
>>         My previous proposal was that we provide a Multimap<Key,
>>         Value> state type, sorted by key. this would have two
>>         additional operations - multimap.getRange(startKey, endKey)
>>         and multimap.deleteRange(startKey, endKey). The primary use
>>         case was timestamp sorting, but I felt that a sorted multimap
>>         provided a nice generalization - after all, you can simply
>>         key the multimap by timestamp to get timestamp sorting.
>>
>>         This approach had some issues immediately that would take
>>         some work to solve. Since a multimap key can have any type
>>         and a runner will only be able to sort by encoded type, we
>>         would need to introduce a concept of order-preserving coders
>>         into Beam and plumb that through. Robert pointed out that
>>         even our existing standard coders for simple integral types
>>         don't preserve order, so there will likely be surprises here.
>>
>>         My current proposal is for a multimap that is not sorted by
>>         key, but that can support.ordered values for a single key.
>>         Remember that a multimap maps K -> Iterable<V>, so this means
>>         that each individual Iterable<V> is ordered, but the keys
>>         have no specific order relative to each other. This is not
>>         too different from many multimap implementations where the
>>         keys are unordered, but the list of values for a single key
>>         at least has a stable order.
>>
>>         The interface would look like this:
>>
>>         public interface MultimapState<K, V> extends State {
>>           // Add a value with a default timestamp.
>>           void put(K key, V value);
>>
>>           // Add a timestamped value.
>>           void put(K, key, TimestampedValue<V> value);
>>
>>           // Remove all values for a key.
>>           void remove (K key);
>>
>>           // Remove all values for a key with timestamps within the
>>         specified range.
>>           void removeRange(K key, Instant startTs, Instant endTs);
>>
>>           // Get an Iterable of values for V. The Iterable will be
>>         returned sorted by timestamp.
>>         ReadableState<Iterable<TimestampedValue<V>>> get(K key);
>>
>>           // Get an Iterable of values for V in the specified range.
>>         The Iterable will be returned sorted by timestamp.
>>         ReadableState<Iterable<TimestampedValue<V>>> getRange(K key,
>>         Instant startTs, Instant endTs);
>>
>>           ReadableState<Iterable<K>> keys();
>>         ReadableState<Iterable<TimestampedValue<V>>> values();
>>           ReadableState<Iterable<Map.Entry<K, TimestampedValue<V>>
>>         entries;
>>         }
>>
>>         We can of course provide helper functions that allow using
>>         MultimapState without deailing with TimestampValue for users
>>         who only want a multimap and don't want sorting.
>>
>>         I think many users will only need a single sorted list - not
>>         a full multimap. It's worth offering this as well, and we can
>>         simply build it on top of MultimapState. It will look like an
>>         extension of BagState
>>
>>         public interface TimestampSortedListState<T> extends State {
>>           void add(TimestampedValue<T> value);
>>           Iterable<TimestampedValue<T>> read();
>>           Iterable<TimestampedValue<T>> readRange(Instant startTs,
>>         Instant endTs);
>>           void clearRange(Instant startTs, Instant endTs);
>>         }
>>
>>
>>         On Wed, Jun 17, 2020 at 2:47 PM Luke Cwik <lcwik@google.com
>>         <mailto:lcwik@google.com>> wrote:
>>
>>             The portability layer is meant to live across multiple
>>             versions of Beam and I don't think it should be treated
>>             by doing the simple and useful thing now since I believe
>>             it will lead to a proliferation of the API.
>>
>>             On Wed, Jun 17, 2020 at 2:30 PM Kenneth Knowles
>>             <kenn@apache.org <mailto:kenn@apache.org>> wrote:
>>
>>                 I have thoughts on the subject of whether to have
>>                 APIs just for the lowest-level building blocks versus
>>                 having APIs for higher-level constructs. Specifically
>>                 this applies to providing only unsorted multimap vs
>>                 what I will call "time-ordered buffer". TL;DR: I'd
>>                 vote to focus on time-ordered buffer; if it turns out
>>                 to be easy to go all the way to sorted multimap
>>                 that's nice-to-have; if it turns out to be easy to
>>                 implement on top of unsorted map state that should
>>                 probably be under the hood
>>
>>                 Reasons to build low-level multimap in the runner &
>>                 fn api and layer higher-level things in the SDK:
>>
>>                  - It is less implementation for runners if they have
>>                 to only provide fewer lower-level building blocks
>>                 like multimap state.
>>                  - There are many more runners than SDKs (and will be
>>                 even more and more) so this saves overall.
>>
>>                 Reasons to build higher-level constructs directly in
>>                 the runner and fn api:
>>
>>                  - Having multiple higher-level state types may
>>                 actually be less implementation than one complex
>>                 state type, especially if they map to runner primitives.
>>                  - The runner may have better specialized
>>                 implementations, especially for something like a
>>                 time-ordered buffer.
>>                  - The particular access patterns in an SDK-based
>>                 implementation may not be ideal for each runner's
>>                 underlying implementation of the low-level building
>>                 block.
>>                  - There may be excessive gRPC overhead even for
>>                 optimal access patterns.
>>
>>                 There are ways to have best of both worlds, like:
>>
>>                 1. Define multiple state types according to
>>                 fundamental access patterns, like we did this before
>>                 portability.
>>                 2. If it is easy to layer one on top of the other, do
>>                 that inside the runner. Provide shared code so for
>>                 runners providing the lowest-level primitive they get
>>                 all the types for free.
>>
>>                 I understand that this is an oversimplification. It
>>                 still creates some more work. And APIs are a burden
>>                 so it is good to introduce as few as possible for
>>                 maintenance. But it has performance benefits and also
>>                 unblocks "just doing the simple and useful thing now"
>>                 which I always like to do as long as it is compatible
>>                 with future changes. If the APIs are fundamental,
>>                 like sets, maps, timestamp ordering, then it is safe
>>                 to guess that they will change rarely and be useful
>>                 forever.
>>
>>                 Kenn
>>
>>                 On Tue, Jun 16, 2020 at 2:54 PM Luke Cwik
>>                 <lcwik@google.com <mailto:lcwik@google.com>> wrote:
>>
>>                     I would be glad to take a stab at how to provide
>>                     sorting on top of unsorted multimap state.
>>                     Based upon your description, you want integer
>>                     keys representing timestamps and arbitrary user
>>                     value for the values, is that correct?
>>                     What kinds of operations do you need on the
>>                     sorted map state in order of efficiency requirements?
>>                     (e.g. Next(x), Previous(x), GetAll(Range[x, y)),
>>                     ClearAll(Range[x, y))
>>                     What kinds of operations do we expect the
>>                     underlying unsorted map state to be able to provide?
>>                     (at a minimum Get(K), Append(K), Clear(K) but
>>                     what else e.g. enumerate(K)?)
>>
>>                     I went through a similar exercise of how to
>>                     provide a list like side input view over a
>>                     multimap[1] side input which efficiently allowed
>>                     computation of size and provided random access
>>                     while only having access to get(K) and enumerate K's.
>>
>>                     1:
>>                     https://github.com/lukecwik/incubator-beam/blob/ec8769f6163ca8a4daecc2fb29708bc1da430917/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionViews.java#L568
>>
>>                     On Tue, Jun 16, 2020 at 8:47 AM Reuven Lax
>>                     <relax@google.com <mailto:relax@google.com>> wrote:
>>
>>                         Bringing this subject up again,
>>
>>                         I've spent some time looking into
>>                         implementing this for the Dataflow runner.
>>                         I'm unable to find a way to implement the
>>                         arbitrary sorted multimap efficiently for the
>>                         case where there are large numbers of unique
>>                         keys. Since the primary driving use case is
>>                         timestamp ordering (i.e. key is event
>>                         timestamp), you would expect to have nearly a
>>                         new key per element. I considered Luke's
>>                         suggestion above, but unfortunately it
>>                         doesn't really solve this issue.
>>
>>                         The primary use case for sorting always seems
>>                         to be sorting by timestamp. I want to propose
>>                         that instead of building the fully-general
>>                         sorted multimap, we instead focus on a state
>>                         type where the sort key is an integral type
>>                         (like a timestamp or an integer). There is
>>                         still a valid use case for multimap, but we
>>                         can provide that as an unordered state. At
>>                         least for Dataflow, it will be much easier
>>
>>                         While my difficulties here may be specific to
>>                         the Dataflow runner, any such support would
>>                         have to be built into other runners as well,
>>                         and limiting to integral sorting likely makes
>>                         it easier for other runners to implement
>>                         this. Also, if you look at this
>>                         <https://github.com/apache/flink/blob/0ab1549f52f1f544e8492757c6b0d562bf50a061/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/runtime/join/TemporalRowtimeJoin.scala#L95> Flink
>>                         comment pointed out by Aljoscha, for Flink
>>                         the main use case identified was also
>>                         timestamp sorting. This will also simplify
>>                         the API design for this feature: Sorted
>>                         multimap with arbitrary keys would require us
>>                         to introduce a way of mapping natural
>>                         ordering to encoded ordering (i.e. a new
>>                         OrderPreservingCoder), but if we limit sort
>>                         keys to integral types, the API design is
>>                         simpler as integral types can be represented
>>                         directly.
>>
>>                         Reuven
>>
>>                         On Sun, Jun 2, 2019 at 7:04 AM Reuven Lax
>>                         <relax@google.com <mailto:relax@google.com>>
>>                         wrote:
>>
>>                             This sounds to me like a potential runner
>>                             strategy. However if a runner can
>>                             natively support sorted maps (e.g. we
>>                             expect the Dataflow runner to be able to
>>                             do so, and I think it would be useful for
>>                             other runners as well), then it's
>>                             probably preferable to allow the runner
>>                             to use its native capabilities.
>>
>>                             On Fri, May 24, 2019 at 11:05 AM Lukasz
>>                             Cwik <lcwik@google.com
>>                             <mailto:lcwik@google.com>> wrote:
>>
>>                                 For the API that you proposed, the
>>                                 map key is always "void" and the sort
>>                                 key == user key. So in my example of
>>                                 key: dummy value
>>                                 key.000: token, (0001, value4)
>>                                 key.001: token, (0010, value1),
>>                                 (0011, value2)
>>                                 key.01: token
>>                                 key.1: token, (1011, value3)
>>                                 you would have:
>>                                 "void": dummy value
>>                                 "void".000: token, (0001, value4)
>>                                 "void".001: token, (0010, value1),
>>                                 (0011, value2)
>>                                 "void".01: token
>>                                 "void".1: token, (1011, value3)
>>
>>                                 Iterable<KV<K, V>> entriesUntil(K
>>                                 limit) translates into walking the
>>                                 the prefixes until you find a common
>>                                 prefix for K and then filter for
>>                                 values where they have a sort key <=
>>                                 K. Using the example above, to find
>>                                 entriesUntil(0010) you would:
>>                                 look for key."", miss
>>                                 look for key.0, miss
>>                                 look for key.00, miss
>>                                 look for key.000, hit, sort all
>>                                 contained values using secondary key,
>>                                 provide value4 to user
>>                                 look for key.001, hit, notice that
>>                                 001 is a prefix of 0010 so we sort
>>                                 all contained values using secondary
>>                                 key, filter out value2 and provide value1
>>
>>                                 void removeUntil(K limit) also
>>                                 translates into walking the prefixes
>>                                 but instead we will clear them when
>>                                 we have a "hit" with some special
>>                                 logic for when the sort key is a
>>                                 prefix of the key. Used the example,
>>                                 to removeUntil(0010) you would:
>>                                 look for key."", miss
>>                                 look for key.0, miss
>>                                 look for key.00, miss
>>                                 look for key.000, hit, clear
>>                                 look for key.001, hit, notice that
>>                                 001 is a prefix of 0010 so we sort
>>                                 all contained values using secondary
>>                                 key, store in memory all values that
>>                                 > 0010, clear and append values
>>                                 stored in memory.
>>
>>                                 On Fri, May 24, 2019 at 10:36 AM
>>                                 Reuven Lax <relax@google.com
>>                                 <mailto:relax@google.com>> wrote:
>>
>>                                     Can you explain how fetching and
>>                                     deleting ranges of keys would
>>                                     work with this data structure?
>>
>>                                     On Fri, May 24, 2019 at 9:50 AM
>>                                     Lukasz Cwik <lcwik@google.com
>>                                     <mailto:lcwik@google.com>> wrote:
>>
>>                                         Reuven, for the example, I
>>                                         assume that we never want to
>>                                         store more then 2 values at a
>>                                         given sort key prefix, and if
>>                                         we do then we will create a
>>                                         new longer prefix splitting
>>                                         up the values based upon the
>>                                         sort key.
>>
>>                                         Tuple representation in
>>                                         examples below is (key, sort
>>                                         key, value) and . is a
>>                                         character outside of the
>>                                         alphabet which can be
>>                                         represented by using an
>>                                         escaping encoding that wraps
>>                                         the key + sort key encoding.
>>
>>                                         To insert (key, 0010,
>>                                         value1), we lookup "key" +
>>                                         all the prefixes of 0010
>>                                         finding one that is not
>>                                         empty. In this case its 0, so
>>                                         we append value to the map at
>>                                         key.0 ending up with (we also
>>                                         set the key to any dummy
>>                                         value to know that it it
>>                                         contains values):
>>                                         key: dummy value
>>                                         key."": token, (0010, value1)
>>                                         Now we insert (key, 0011,
>>                                         value2), we again lookup
>>                                         "key" + all the prefixes of
>>                                         0010, finding "", so we
>>                                         append value2 to key.""
>>                                         ending up with:
>>                                         key: dummy value
>>                                         key."": token, (0010,
>>                                         value1), (0011, value2)
>>                                         Now we insert (key, 1011,
>>                                         value3), we again lookup
>>                                         "key" + all the prefixes of
>>                                         1011 finding "" but notice
>>                                         that it is full, so we
>>                                         partition all the values into
>>                                         two prefixes 0 and 1. We also
>>                                         clear the "" prefix ending up
>>                                         with:
>>                                         key: dummy value
>>                                         key.0: token, (0010, value1),
>>                                         (0011, value2)
>>                                         key.1: token, (1011, value3)
>>                                         Now we insert (key, 0001,
>>                                         value4), we again lookup
>>                                         "key" + all the prefixes of
>>                                         the value finding 0 but
>>                                         notice that it is full, so we
>>                                         partition all the values into
>>                                         two prefixes 00 and 01 but
>>                                         notice this doesn't help us
>>                                         since 00 will be too full so
>>                                         we split 00 again to 000,
>>                                         001. We also clear the 0
>>                                         prefix ending up with:
>>                                         key: dummy value
>>                                         key.000: token, (0001, value4)
>>                                         key.001: token, (0010,
>>                                         value1), (0011, value2)
>>                                         key.01: token
>>                                         key.1: token, (1011, value3)
>>
>>                                         We are effectively building a
>>                                         trie[1] where we only have
>>                                         values at the leaves and
>>                                         control how full each leaf
>>                                         can be. There are other trie
>>                                         representations like a radix
>>                                         tree that may be better.
>>
>>                                         Looking up the values in
>>                                         sorted order for "key" would
>>                                         go like this:
>>                                         Is key set, yes
>>                                         look for key."", miss
>>                                         look for key.0, miss
>>                                         look for key.00, miss
>>                                         look for key.000, hit, sort
>>                                         all contained values using
>>                                         secondary key, provide value4
>>                                         to user
>>                                         look for key.001, hit, sort
>>                                         all contained values using
>>                                         secondary key, provide value1
>>                                         followed by value2 to user
>>                                         look for key.01, hit, empty,
>>                                         return no values to user
>>                                         look for key.1, hit, sort all
>>                                         contained values using
>>                                         secondary key, provide value3
>>                                         to user
>>                                         we have walked the entire
>>                                         prefix space, signal end of
>>                                         iterable
>>
>>                                         Some notes for the above:
>>                                         * The dummy value is used to
>>                                         know that the key contains
>>                                         values and the token is to
>>                                         know whether there are any
>>                                         values deeper in the trie so
>>                                         when we know when to stop
>>                                         searching.
>>                                         * If we can recalculate the
>>                                         sort key from the combination
>>                                         of the key and value, then we
>>                                         don't need to store it.
>>                                         * Keys with lots of values
>>                                         will perform worse then keys
>>                                         with less values since we
>>                                         have to look up more keys but
>>                                         they will be empty reads. The
>>                                         number of misses can be
>>                                         controlled by how many
>>                                         elements we are willing to
>>                                         store at a given node before
>>                                         we subdivide.
>>
>>                                         In reality you could build a
>>                                         lot of structures (e.g. red
>>                                         black tree, binary tree)
>>                                         using the sort key, the issue
>>                                         is the cost of
>>                                         rebalancing/re-organizing the
>>                                         structure in map form and
>>                                         whether it has a convenient
>>                                         pre-order traversal for lookups.
>>
>>
>>
>>                                         On Fri, May 24, 2019 at 8:14
>>                                         AM Reuven Lax
>>                                         <relax@google.com
>>                                         <mailto:relax@google.com>> wrote:
>>
>>                                             Some great comments!
>>
>>                                             *Aljoscha*: absolutely
>>                                             this would have to be
>>                                             implemented by runners to
>>                                             be efficient. We can of
>>                                             course provide a default
>>                                             (inefficient)
>>                                             implementation, but
>>                                             ideally runners would
>>                                             provide better ones.
>>
>>                                             *Jan* Exactly. I think
>>                                             MapState can be dropped
>>                                             or backed by this. E.g.
>>
>>                                             *Robert* Great point
>>                                             about standard coders not
>>                                             satisfying this. That's
>>                                             why I suggested that we
>>                                             provide a way to tag the
>>                                             coders that do preserve
>>                                             order, and only accept
>>                                             those as key coders
>>                                             Alternatively we could
>>                                             present a more limited
>>                                             API - e.g. only allowing
>>                                             a hard-coded set of types
>>                                             to be used as keys - but
>>                                             that seems counter to the
>>                                             direction Beam usually
>>                                             goes. So users will have
>>                                             two ways .of creating
>>                                             multimap state specs:
>>
>>                                                private final
>>                                             StateSpec<MultimapState<Long,
>>                                             String>> state =
>>                                             StateSpecs.multimap(VarLongCoder.of(),
>>                                             StringUtf8Coder.of());
>>
>>                                             or
>>                                              private final
>>                                             StateSpec<MultimapState<Long,
>>                                             String>> state =
>>                                             StateSpecs.orderedMultimap(VarLongCoder.of(),
>>                                             StringUtf8Coder.of());
>>
>>                                             The second one will
>>                                             validate that the key
>>                                             coder preserves order,
>>                                             and fails otherwise
>>                                             (similar to coder
>>                                             determinism checking in
>>                                             GroupByKey). (BTW we
>>                                             would also have versions
>>                                             of these functions that
>>                                             use coder inference to
>>                                             "guess" the coder, but
>>                                             those will do the same
>>                                             checking)
>>
>>                                             Also the API I proposed
>>                                             did support
>>                                             random access! We could
>>                                             separate out
>>                                             OrderedBagState again if
>>                                             we think the use cases
>>                                             are fundamentally
>>                                             different. I merged the
>>                                             proposal into that of
>>                                             MultimapState because
>>                                             there seemed be 99% overlap.
>>
>>                                             Reuven
>>
>>                                             On Fri, May 24, 2019 at
>>                                             6:19 AM Robert Bradshaw
>>                                             <robertwb@google.com
>>                                             <mailto:robertwb@google.com>>
>>                                             wrote:
>>
>>                                                 On Fri, May 24, 2019
>>                                                 at 5:32 AM Reuven Lax
>>                                                 <relax@google.com
>>                                                 <mailto:relax@google.com>>
>>                                                 wrote:
>>                                                 >
>>                                                 > On Thu, May 23,
>>                                                 2019 at 1:53 PM Ahmet
>>                                                 Altay
>>                                                 <altay@google.com
>>                                                 <mailto:altay@google.com>>
>>                                                 wrote:
>>                                                 >>
>>                                                 >>
>>                                                 >>
>>                                                 >> On Thu, May 23,
>>                                                 2019 at 1:38 PM
>>                                                 Lukasz Cwik
>>                                                 <lcwik@google.com
>>                                                 <mailto:lcwik@google.com>>
>>                                                 wrote:
>>                                                 >>>
>>                                                 >>>
>>                                                 >>>
>>                                                 >>> On Thu, May 23,
>>                                                 2019 at 11:37 AM Rui
>>                                                 Wang
>>                                                 <ruwang@google.com
>>                                                 <mailto:ruwang@google.com>>
>>                                                 wrote:
>>                                                 >>>>>
>>                                                 >>>>> A few obvious
>>                                                 problems with this code:
>>                                                 >>>>>   1. Removing
>>                                                 the elements already
>>                                                 processed from the
>>                                                 bag requires clearing
>>                                                 and rewriting the
>>                                                 entire bag. This is
>>                                                 O(n^2) in the number
>>                                                 of input trades.
>>                                                 >>>>
>>                                                 >>>> why it's not O(2
>>                                                 * n) to clearing and
>>                                                 rewriting trade state?
>>                                                 >>>>
>>                                                 >>>>>
>>                                                 >>>>> public
>>                                                 interface
>>                                                 SortedMultimapState<K,
>>                                                 V> extends State {
>>                                                 >>>>>   // Add a
>>                                                 value to the map.
>>                                                 >>>>>   void put(K
>>                                                 key, V value);
>>                                                 >>>>>   // Get all
>>                                                 values for a given key.
>>                                                 >>>>>
>>                                                  ReadableState<Iterable<V>>
>>                                                 get(K key);
>>                                                 >>>>>  // Return all
>>                                                 entries in the map.
>>                                                 >>>>>
>>                                                  ReadableState<Iterable<KV<K,
>>                                                 V>>> allEntries();
>>                                                 >>>>>   // Return all
>>                                                 entries in the map
>>                                                 with keys <= limit.
>>                                                 returned elements are
>>                                                 sorted by the key.
>>                                                 >>>>>
>>                                                  ReadableState<Iterable<KV<K,
>>                                                 V>>> entriesUntil(K
>>                                                 limit);
>>                                                 >>>>>
>>                                                 >>>>>  // Remove all
>>                                                 values with the given
>>                                                 key;
>>                                                 >>>>>   void remove(K
>>                                                 key);
>>                                                 >>>>>  // Remove all
>>                                                 entries in the map
>>                                                 with keys <= limit.
>>                                                 >>>>>   void
>>                                                 removeUntil(K limit);
>>                                                 >>>>
>>                                                 >>>> Will
>>                                                 removeUntilExcl(K
>>                                                 limit) also useful?
>>                                                 It will remove all
>>                                                 entries in the map
>>                                                 with keys < limit.
>>                                                 >>>>
>>                                                 >>>>>
>>                                                 >>>>> Runners will
>>                                                 sort based on the
>>                                                 encoded value of the
>>                                                 key. In order to make
>>                                                 this easier for
>>                                                 users, I propose that
>>                                                 we introduce a new
>>                                                 tag on Coders
>>                                                 PreservesOrder. A
>>                                                 Coder that contains
>>                                                 this tag guarantees
>>                                                 that the encoded
>>                                                 value preserves the
>>                                                 same ordering as the
>>                                                 base Java type.
>>                                                 >>>>
>>                                                 >>>>
>>                                                 >>>> Could you
>>                                                 clarify what is 
>>                                                 "encoded value
>>                                                 preserves the same
>>                                                 ordering as the base
>>                                                 Java type"?
>>                                                 >>>
>>                                                 >>>
>>                                                 >>> Lets say A and B
>>                                                 represent two
>>                                                 different instances
>>                                                 of the same Java type
>>                                                 like a double, then A
>>                                                 < B (using the
>>                                                 languages comparison
>>                                                 operator) iff
>>                                                 encode(A) < encode(B)
>>                                                 (note the encoded
>>                                                 versions are compared
>>                                                 lexicographically)
>>                                                 >>
>>                                                 >>
>>                                                 >> Since coders are
>>                                                 shared across SDKs,
>>                                                 do we expect A < B
>>                                                 iff e(A) < e(P)
>>                                                 property to hold for
>>                                                 all languages we
>>                                                 support? What happens
>>                                                 A, B sort differently
>>                                                 in different languages?
>>                                                 >
>>                                                 >
>>                                                 > That would have to
>>                                                 be the property of
>>                                                 the coder (which
>>                                                 means that this
>>                                                 property probably
>>                                                 needs to be
>>                                                 represented in the
>>                                                 portability
>>                                                 representation of the
>>                                                 coder). I imagine the
>>                                                 common use cases will
>>                                                 be for simple coders
>>                                                 like int, long,
>>                                                 string, etc., which
>>                                                 are likely to sort
>>                                                 the same in most
>>                                                 languages.
>>
>>                                                 The standard coders
>>                                                 for both double and
>>                                                 integral types do not
>>                                                 respect
>>                                                 the natural ordering
>>                                                 (consider negative
>>                                                 values). KV coders
>>                                                 violate the
>>                                                 "natural"
>>                                                 lexicographic
>>                                                 ordering on
>>                                                 components as well. I
>>                                                 think
>>                                                 implicitly sorting on
>>                                                 encoded value would
>>                                                 yield many surprises.
>>                                                 (The
>>                                                 state, of course,
>>                                                 could take a
>>                                                 order-preserving, bytes
>>                                                 (string?)-producing
>>                                                 callable as a
>>                                                 parameter of course).
>>                                                 (As for
>>                                                 naming, I'd probably
>>                                                 call this
>>                                                 OrderedBagState or
>>                                                 something like
>>                                                 that...rather than
>>                                                 Map which tends to
>>                                                 imply random access.)
>>

Mime
View raw message