aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mark Chu-Carroll" <mchucarr...@twopensource.com>
Subject Re: Review Request 22457: Improve aurora "job diff" command.
Date Wed, 25 Jun 2014 17:04:12 GMT


> On June 16, 2014, 5:45 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/client/cli/jobs.py, lines 181-182
> > <https://reviews.apache.org/r/22457/diff/3/?file=609592#file609592line181>
> >
> >     I don't think it's enough to json-serialize a thrift task. This is bound to
set/dict serialization sorting problem we have battled with in updater.py. Specifically this
block: https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/client/api/updater.py#L200-L223
> >     
> >     Without sorting, the diff will only work 99% of the time. This has been demonstrated
in a few real-life job updates.
> 
> Mark Chu-Carroll wrote:
>     This isn't a new thing in this updated diff system - this is the current behavior.
>     
>     If you use the new json-tree-diff, you won't have this problem; but for people who
rely on the current diff
>     behavior, I'd rather not change that.
>
> 
> Maxim Khutornenko wrote:
>     | If you use the new json-tree-diff, you won't have this problem; 
>     
>     Perhaps I am missing something but how would the new approach ensure equivalence
of these two structs?
>     { 'set1': { ['k1':'v1', 'k2':'v2']} }
>     { 'set2': { ['k2':'v2', 'k1':'v1']} }
>     
>     It's "normal" to see element re-ordering like this during thrift struct serialization.
> 
> Maxim Khutornenko wrote:
>     Here is a real TaskConfig json fragment and its reordered twin:
>     
>     1: u'constraints': [{u'name': u'value', u'constraint': {u'values': [u'1', u'2']}},
{u'name': u'limit', u'constraint': {u'limit': {u'limit': 10}}}]
>     2: u'constraints': [{u'constraint': {u'values': [u'2', u'1']}, u'name': u'value'},
{u'name': u'limit', u'constraint': {u'limit': {u'limit': 10}}}]
> 
> Mark Chu-Carroll wrote:
>     I don't think that the reordering that you show in your first example isn't valid
json. Key-value pairs belong in a dictionary, not a list.  In a dictionary, the comparison
routine does the right thing: it compares key by key, without regard to ordering.
>     
>     In the second example: I'd argue that that's actually a bug in our serializer: it's
using an ordered list for an unordered constraint. Diff can't tell that the list structure
isn't really supposed to be ordered; the serializer should be canonicalizing it.
>
> 
> Maxim Khutornenko wrote:
>     Not sure how else you could represent a python set in JSON without completely changing
the underlying data structure (e.g. emitting fake keys for every element and converting set
into a dictionary with sorted keys). Agree, the TJSONProtocol could be a bit smarter when
dealing with python sets but the reality is that python sets are inherently "unhashable" structs.
I have seen cases when completely repr-equivalent python sets were not comparing as equal.
> 
> Mark Chu-Carroll wrote:
>     All the serializer would need to do is just canonicalize when it converts to a list.
It's not an issue of being unhashable. But it should at least bother to be *uniform*. If you're
converting a set to a list, chose a convention for ordering elements. Put 'em in lexicographic
order, problem solved.
>     
>     As it stands, that problem is unsolvable for this diff tool: there's no way to tell
that a given field was supposed to be a set, and should be compared without ordering. 
>     
>     I'm not sure what you'd like me to do here, short of giving up and throwing away
the new diff tool. I don't see any way to solve this, short of creating something that uses
thrift introspection.
>
> 
> Maxim Khutornenko wrote:
>     We solved this problem in updater diff before (see the link above) and I am sure
it can be solved here. Converting to sorted tuples does not produce a pretty diff but if we
are targeting value-comparison diff output rather than serialized string compare the problem
shifts into pretty-fying the sorted tuple output rather than handling diff mechanics.
> 
> Mark Chu-Carroll wrote:
>     Not sure what you mean by "converting to sorted tuples". Is there any documentation
of what the "sorted tuple" format looks like? Or do I just need to reverse-engineer? (I hate
doing that, because if I make any wrong assumptions, things get ugly.)
>
> 
> Maxim Khutornenko wrote:
>     I have modified the _diff_configs() in updater.py to print out the output and ran
"./pants src/test/python/apache/aurora/client/api:updater -s -k test_diff_unordered_configs":
>     
>     ((u'constraints', (((u'constraint', ((u'limit', ((u'limit', 10),)),)), (u'name',
u'limit')), ((u'constraint', ((u'values', (u'1', u'2')),)), (u'name', u'value')))), (u'contactEmail',
u'foo@bar.com'), (u'diskMb', 1), (u'environment', u'test'), (u'executorConfig', ((u'data',
u'test data'), (u'name', u'test'))), (u'jobName', u'jimbob'), (u'maxTaskFailures', 1), (u'metadata',
(((u'key', u'k1'), (u'value', u'v1')), ((u'key', u'k2'), (u'value', u'v2')))), (u'numCpus',
1.0), (u'owner', ((u'role', u'mesos'),)), (u'priority', 0), (u'production', 1), (u'ramMb',
1), (u'requestedPorts', (u'142', u'3424', u'45235')), (u'taskLinks', ((u'task1', u'link1'),
(u'task2', u'link2')))) 
>     
>     The idea here is to eliminate the ordering unpredictability by replacing all thrift
structs with tuples, which are sortable. It sure isn't pretty but it works.
> 
> Mark Chu-Carroll wrote:
>     I understand the reason for doing it; what I still don't understand is just what
the format here *is*. Is there any documentation on it? I can make a reasonable guess at just
what the "tuplization" format is and what it means, but it's just a reasonable guess, based
on looking at sample outputs and reverse engineering.
>     
>     I don't understand how this fixes anything. As far as I can tell, what it does is
just replace the sets, dicts, and structs with tuples. But that doesn't change anything about
consistency, as far as I can understand.
>     
>     Let me try to ask in a slightly different way, to get at what I'm not understanding.
How does this solve the problem? At some point, you're still taking the set of values whose
order is unpredictable, and stuffing them into a tuple, which is ordered. What about this
format guarantees that the order in which the elements of a set are converted into a tuple
will be consistent?
>     
>     In the example you gave before, you've got a constraints field, which is a set of
strings. In normal JSON serialization, that turns into:
>        {u'values': [u'2', u'1']}
>     
>     In the serialization process, the json serializer took a field which was actually
a set([u'1', u'2']), and serialized it, producing [u'2', u'1']. The serialization into a list
imposed an order on the set. The problem is that we can't count on consistency in how that
order is chosen.
>     
>     Now, in this new serialization format, we're doing something very similar - except
that instead of serializing set([u'1', u'2']) into [u'1', u'2'], it serializes it as (u'1',
u'2'). To serialize it into a tuple, the serialized needed to chose an order for the elements
of the set. If I understand you correctly, you're saying that this tuplized format solves
the ordering-consistency problem, somehow. But I don't see how. What about tuples guarantees
order consistency where lists don't?
>     
>     
>     
>     
>
> 
> Maxim Khutornenko wrote:
>     The key point here is that tuples are sorted at every step during construction and
specifically this line [1] ensures the consistency of sets on compare. So, the {u'values':
[u'2', u'1']} becomes ((u'values', (u'1', u'2')),)) and so on when unwinding the recursion.

>     
>     [1] - https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/client/api/updater.py#L202
>     
>     There is no documentation besides what is already there in the _diff_configs(). The
concept is quite simple, so I am not sure what else needs to be documented there. Perhaps
add an output example?

An output example would definitely help.

Also, more on the rationale - the explanation in that code doesn't explain why it makes to
go so far - particularly in the case of the new diff algorithm that we're discussing.

For example: why convert lists to tuples? Lists have the same ordering guarantees and comparison
guarantees as tuples. So you're not increasing the comparability, but you are making the resulting
diffs harder to read?

(In fact, what's going on in that _hashable and _diff_configs seems to not just be over-aggressive,
but incorrect. A list [b, a] in a constraint expression can't be converted to [a, b]. As it
stands, that will claim no diffs in cases where there's a significant diff, because it erases
ordering even when it shouldn't.)

Why convert dictionaries to tuples? Dictionaries don't have an ordering constraint; key-by-key
comparison is fine. If you're dumping the config out to a file for comparison by the unix
diff tool, that sort-of makes sense, because you need the keys to be in the same order for
a text diff. But why convert to a tuple of tuples? Why not just do a list of tuples? Again,
comparison wise it's equivalent, but readability-wise, a list of pairs is provides an extra
bit of visual cue about the structure, which is helpful for the person reading the diff.


- Mark


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22457/#review45835
-----------------------------------------------------------


On June 18, 2014, 10:59 a.m., Mark Chu-Carroll wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22457/
> -----------------------------------------------------------
> 
> (Updated June 18, 2014, 10:59 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Brian Wickman.
> 
> 
> Bugs: aurora-520
>     https://issues.apache.org/jira/browse/aurora-520
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new diff method, which uses field-by-field comparison of JSON trees for comparing
running job configurations to potentially updated configs.
> 
> - Allow exclusion of semantically irrelevant fields.
> - Provide a clearer list of the differences between configs.
> - Provide a scripting-friendly alternative JSON syntax for diffs.
> 
> The old diff behavior is still available under the "--use-shell-diff" option.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/client/cli/BUILD ebe681a0d1735b7cc695dc3b7a14c4292d87ae32

>   src/main/python/apache/aurora/client/cli/jobs.py 4fa03a6c9919651551238b0dc211ed69a8dfe565

>   src/main/python/apache/aurora/client/cli/json_tree_diff.py PRE-CREATION 
>   src/test/python/apache/aurora/client/cli/BUILD 3c88ed7cf9f654bbbd80d1d44aa1dd1c8655e378

>   src/test/python/apache/aurora/client/cli/test_diff.py 38629b63c082cf81cb891dace2a70d9e8f418e18

>   src/test/python/apache/aurora/client/cli/test_json_diff.py PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22457/diff/
> 
> 
> Testing
> -------
> 
> New unit tests of the JSON tree diff code, plus a bunch of new "job diff" tests of the
new functionality.
> All tests pass.
> 
> 
> Thanks,
> 
> Mark Chu-Carroll
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message