beam-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kenneth Knowles <>
Subject Re: Existing transactionality inconsistency in the Beam Java State API
Date Thu, 24 May 2018 17:37:51 GMT
I see what you mean but I don't agree that futures imply anything other
than "it is a value that you have to force", with deliberately many
possible implementations. When at the point in 1 and you've got

    interface ReadableState<T> {
        T read()

and you want to improve performance, both approaches "void readLater()" and
"StateFuture<T> read()" are natural evolutions. They both gain the same 10x
and they both support the "unchanging committed state plus buffered
mutations" implementation well. And snapshots are essentially free for that
implementation if the buffered mutations are stored in a decent data

My recollection was that futures were seen as more cumbersome. They affect
the types even for simple uses. The only appealing future API was Guava's,
but we didn't want that on the API surface. And we did not intend for these
to be used in complex ways, so the usability & perf benefits of a
future-based API weren't really realized anyhow.

The only reason I belabor this is that if we ever wanted to support more
complex use cases, such as concurrent use of state, then my preference
would flip. I wouldn't want to make XYZState a synchronized monitor. At
that point I'd favor using a snapshots-are-free concurrent data structure
under the hood of a future-based API.

Since there is really only one implementation in mind for this, maybe only
one that works reasonably, we should just document it as such. The docs on
ReadableState do make it sound like writes will invalidate the usefulness
of readLater, even though that isn't the case for the intended
implementation strategy.


On Thu, May 24, 2018 at 9:40 AM Ben Chambers <> wrote:

> I think Kenn's second option accurately reflects my memory of the original
> intentions:
> 1. I remember we we considered either using the Future interface or
> calling the ReadableState interface a future, and explicitly said "no,
> future implies asynchrony and that the value returned by `get` won't change
> over multiple calls, but we want the latest value each time". So, I
> remember us explicitly considering and rejecting Future, thus the name
> "ReadableState".
> 2. The intuition behind the implementation was analogous to a
> mutable-reference cell in languages like ML / Scheme / etc. The
> ReadableState is just a pointer to the the reference cell. Calling read
> returns the value currently in the cell. If we have 100 ReadableStates
> pointing at the same cell, they all get the same value regardless of when
> they were created. This avoids needing to duplicate/snapshot values at any
> point in time.
> 3. ReadLater was added, as noted by Charles, to suggest prefetching the
> associated value. This was added after benchmarks showed 10x (if I remember
> correctly) performance improvements in things like GroupAlsoByWindows by
> minimizing round-trips asking for more state. The intuition being -- if we
> need to make an RPC to load one state value, we are better off making an
> RPC to load all the values we need.
> Overall, I too lean towards maintaining the second interpretation since it
> seems to be consistent and I believe we had additional reasons for
> preferring it over futures.
> Given the confusion, I think strengthening the class documentation makes
> sense -- I note the only hint of the current behavior is that ReadableState
> indicates it gets the *current* value (emphasis mine). We should emphasize
> that and perhaps even mention that the ReadableState should be understood
> as just a reference or handle to the underlying state, and thus its value
> will reflect the latest write.
> Charles, if it helps, the plan I remember regarding prefetching was
> something like:
> interface ReadableMapState<K, V> {
>    ReadableState<V> get(K key);
>    ReadableState<Iterable<V>> getIterable();
>    ReadableState<Map<K, V>> get();
>    // ... more things ...
> }
> Then prefetching a value is `mapState.get(key).readLater()` and
> prefetching the entire map is `mapState.get().readLater()`, etc.
> On Wed, May 23, 2018 at 7:13 PM Charles Chen <> wrote:
>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>> should allow for some sort of prefetching / batching / background I/O for
>> state; and (2) it should be clear what the semantics are for reading (e.g.
>> so we don't have confusing read after write behavior).
>> The approach I'm leaning towards for (1) is to allow a state.prefetch()
>> method (to prefetch a value, iterable or [entire] map state) and maybe
>> something like state.prefetch_key(key) to prefetch a specific KV in the
>> map.  Issue (2) seems to be okay in either of Kenn's positions.
>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <>
>> wrote:
>>> Thanks for laying this out so well, Kenn. I'm also leaning towards the
>>> second option, despite its drawbacks. (In particular, readLater should
>>> not influence what's returned at read(), it's just a hint.)
>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <> wrote:
>>>> Great idea to bring it to dev@. I think it is better to focus here
>>>> than long doc comment threads.
>>>> I had strong opinions that I think were a bit confused and wrong. Sorry
>>>> for that. I stated this position:
>>>>  - XYZState class is a handle to a mutable location
>>>>  - its methods like isEmpty() or contents() should return immutable
>>>> future values (implicitly means their contents are semantically frozen when
>>>> they are created)
>>>>  - the fact that you created the future is a hint that all necessary
>>>> fetching/computation should be kicked off
>>>>  - later forced with get()
>>>>  - when it was designed, pure async style was not a viable option
>>>> I see now that the actual position of some of its original designers is:
>>>>  - XYZState class is a view on a mutable location
>>>>  - its methods return new views on that mutable location
>>>>  - calling readLater() is a hint that some fetching/computation should
>>>> be kicked off
>>>>  - later read() will combine whatever readLater() did with additional
>>>> local info to give the current value
>>>>  - async style not applicable nor desirable as per Beam's focus on
>>>> naive straight-line coding + autoscaling
>>>> These are both internally consistent I think. In fact, I like the
>>>> second perspective better than the one I have been promoting. There are
>>>> some weaknesses: readLater() is pretty tightly coupled to a particular
>>>> implementation style, and futures are decades old so you can get good APIs
>>>> and performance without inventing anything. But I still like the non-future
>>>> version a little better.
>>>> Kenn
>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <> wrote:
>>>>> During the design of the Beam Python State API, we noticed some
>>>>> transactionality inconsistencies in the existing Beam Java State API
>>>>> are the unresolved bugs BEAM-2980
>>>>> <> and BEAM-2975
>>>>> <>).  We are therefore
>>>>> having a discussion about this API which can have implications for its
>>>>> future development in all Beam languages:
>>>>> If you have an opinion on the possible design approaches, it would be
>>>>> very helpful to bring up in the doc or on this thread.  Thanks!
>>>>> Best,
>>>>> Charles

View raw message