nifi-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joe Percivall <joeperciv...@yahoo.com.INVALID>
Subject Re: [Discuss] Expression Language Options to Add Decimal Support
Date Wed, 27 Apr 2016 03:25:29 GMT
The only time the regex the Regex would be applied to a FlowFile attribute or escaped string
is when it's passed to a function that expects a number. If we decide to allow literal decimals
in expression language (${value:plus(3.2)} instead of ${value:plus("3.2")}) then a double
regex would be used by the EL parser to evaluate the expression. I looked for un-quoted literal
dots and I don't see any in any EL examples in the documentation. The replace example does
have a dot but it is within a String. 

Mark, can you comment on how you designed EL to use dots and why they are classified as their
own literal?

Andy, that is a good point about users just not knowing. The best we can do though is to make
sure to call it out as a new feature in the release notes and, like you said, have bullet-proof
documentation.


- - - - - - 
Joseph Percivall
linkedin.com/in/Percivall
e: joepercivall@yahoo.com



On Monday, April 25, 2016 11:29 PM, Andy LoPresto <alopresto.apache@gmail.com> wrote:



My question about auto-casting was when the regex would be applied -- after every function
execution or just ones that return a numeric value. I think you answered that only the second
would apply. Also, the documentation uses "." in a String replacement example without any
escaping (can't find link now as I am on mobile but I believe it's on the Expression Language
Guide page under "replace"). This supports my point that our documentation must be bullet-proof
for people who are not already developers and have had to deal with integer/decimal math operations
or escaped characters on a regular basis. 

As a developer, I agree with prioritizing dev-usability and completeness of features over
ease of adoption for general users. However, I do think as NiFi adoption grows, the percentage
of our users who are non-developers will grow substantially, perhaps eventually becoming a
majority. 

I'm also not sure that we can safely assume that all users are aware that currently, only
integer operations are supported. Unless they read the documentation carefully or manually
inspect/unit test their flows before deploying, I can see users existing who believe they
are performing decimal division (having used variable replacement and never seen actual value
substitution) and will be surprised by this revelation when it is announced in release notes
(or perhaps will never notice a change). 

Andy LoPresto
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69


> On Apr 25, 2016, at 19:35, Joe Percivall <joepercivall@yahoo.com.INVALID> wrote:
> 
> Thanks for going through each of your concerns Andy and I will address each:
> 
> 
> 1: I agree that option 3 would be most concise and would address this loss of precision.
> 2: There are a couple points here:
>        1. Using decimal points: currently decimal points are reserved characters in expression
language[1] but I'm not sure that they are used anywhere (I only see two matches for "DOT",
in the lexer and parser). I'm not entirely sure how prioritization would work when adding
the Regex to match a decimal point for a double. In the PR I have open[2]  (uses option 1
and will be closed) I explicitly require quoting the decimal to allow easier parsing. Also
of note, you can only pass a quoted String or attribute name as the expression value subject
(the thing getting a method called on it). So “${1.2:toNumber()}” would not be valid but
“${"1.2":toNumber()}” would be.
>        2. Evaluation of String which would parse to a double: When you run "${“1-2”:replace(“-“,
“.”)}” you merely create a String with the contents "1.2". This is because the type
only changes after being run through a method which returns a specific type.
>        3. Long Term users: I agree that long term users would need to adjust their thinking
but for a different reason. They have only ever used/had access to using whole numbers. This
change would require users to change their thinking to think of new use-cases. If they don't
want to adjust then they can continue only using whole numbers the exact same way they always
have.
>        4: User acceptance: I agree that specifying input type to get a specific output
type will be something new to non-dev users and would require good documentation with good
examples. That being said I think the added learning curve is worth the added functionality.

> 
> [1] https://github.com/apache/nifi/blob/32398693d5917e35d3a6cd1549edd795e16fa8ec/nifi-commons/nifi-expression-language/src/main/antlr3/org/apache/nifi/attribute/expression/language/antlr/AttributeExpressionLexer.g#L80-L80
> [2] https://github.com/apache/nifi/pull/310
> 
> Joe
> 
> - - - - - - 
> Joseph Percivall
> linkedin.com/in/Percivall
> e: joepercivall@yahoo.com
> 
> 
> 
> On Monday, April 25, 2016 9:31 PM, Andy LoPresto <alopresto@apache.org> wrote:
> 
> 
> 
> Mark made some good points in his support of option 3. I’ll try to list some of the
scenarios I’m worried about, and if we can address each, I’m fine with that choice. For
the rest of this message, “user” refers to the person writing the EL expression. They
may or may not be a developer, but have read our existing EL documentation. 
> 
> 1. Loss of precision — currently there is an issue where division of Long values can
result in a loss of precision. I have provided a failing test here [1]. I believe this would
be resolved by any of the proposed solutions, but the auto-infer mechanism of option 3 would
be the most concise. 
> 2. Auto-casting — I’m not sure how the “auto-interpretation” would occur. Would
it only happen on the result of mathematical operations? Would “${1.2:toNumber()}” return
“1” as a Long or “1.2” as a Decimal? What about “${“1-2”:replace(“-“, “.”)}”?
Is this a String with contents “1.2” or a Number (either Long or double)? The current
“single numeric type” language has (I imagine) conditioned users to associate any numerical
value with the Number type. I believe splitting decimal values out to “Decimal” type is
correct, but will require very clear documentation and probably a concerted education effort
for long-time users. As Joe pointed out, if a user wanted a decimal result of division, they
would need to specify a decimal dividend or divisor, rather than operate on the quotient.
This is accepted practice by developers, but is unlikely to be familiar with non-developers
(i.e. writing “${“10”:divide(“3”):toDecimal()}” would result in “3”; “${“10”:divide(“3”:toDecimal())}”
or ${“10”:toDecimal():divide(“3”)}” are necessary to get a result of “3.33…”)
> 
> I think as long as the user experience and concise and explicit documentation is held
in paramount, I am ok with option 3. I just see the most potential for user confusion there,
especially after some recent mailing list questions about the expression language functions
and documentation. 
> 
> [1] https://github.com/alopresto/nifi/blob/el_double/nifi-commons/nifi-expression-language/src/test/groovy/org/apache/nifi/attribute/expression/language/QueryGroovyTest.groovy#L108
> 
> 
> 
> 
> 
> 
> Andy LoPresto
> alopresto@apache.org
> alopresto.apache@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> 
>> On Apr 25, 2016, at 7:22 AM, Pierre Villard <pierre.villard.fr@gmail.com> wrote:
>> 
>> With explanations from Mark, option 3 is fine with me especially as it is
>> backward compatible.
>> 
>> Pierre
>> 
>> 2016-04-25 16:14 GMT+02:00 Joe Percivall <joepercivall@yahoo.com.invalid>:
>> 
>> 
>> Pierre and Andy,
>>> 
>>> How do you guys feel about option 3 as Mark has explained it? Keeping in
>>> mind that it would be backwards compatible. As it stands we have differing
>>> opinions, which is a good thing, but we need to reach an agreement in order
>>> to proceed.
>>> 
>>> Joe
>>> - - - - - -
>>> Joseph Percivall
>>> linkedin.com/in/Percivall
>>> e: joepercivall@yahoo.com
>>> 
>>> 
>>> 
>>> 
>>> On Thursday, April 21, 2016 11:12 AM, Joe Percivall
>>> <joepercivall@yahoo.com.INVALID> wrote:
>>> Mark,
>>> 
>>> 
>>> After thinking through this a bit more I believe you're right. We can
>>> assume this logic for every operation:
>>> 
>>> Two whole numbers as input returns a whole number
>>> A whole number and a decimal as input returns a decimal
>>> Two decimals as input returns a Decimal
>>> 
>>> I believe, one problem with your explanation as written is "...infer the
>>> return type and allow the user to explicitly change it via toNumber() and
>>> toDecimal() calls if need be". More specifically you would need to call
>>> toNumber/toDecimal on the inputs in order to change the return type. You
>>> can't change the return type after the expression runs (without the
>>> potential loss of information/precision of converting between the two)
>>> because once the expression evaluator exits it needs to have already
>>> interpreted it as a long or double which would occur before the user is
>>> able to call toNumber/toDecimal. So the user would have to control the
>>> return type would by calling toNumber/toDecimal on the inputs.
>>> 
>>> That leads to the one confusing bit about this approach, when a user
>>> attempts to do an operation on two whole numbers that they would expect to
>>> return a decimal (like 8/5 = 1.6). In order to get their expected output
>>> the user would have to know to explicitly convert at least one of them to a
>>> double by using the toDecimal operation.
>>> 
>>> That being said it's probably that's worth it to keep down the verbosity
>>> and still have backwards compatibility.
>>> 
>>> 
>>> Joe
>>> - - - - - - Joseph Percivall
>>> linkedin.com/in/Percivall
>>> e: joepercivall@yahoo.com
>>> 
>>> 
>>> 
>>> 
>>> On Thursday, April 21, 2016 10:20 AM, Mark Payne <markap14@hotmail.com>
>>> wrote:
>>> 
>>> 
>>> 
>>> Joe,
>>> 
>>> I am definitely in favor of #3. I think if we perform some sort of binary
>>> operation on two numbers,
>>> we should return a Decimal type if either of the operands is a Decimal and
>>> a Number (Long) type
>>> if both operands are Longs. We would also need a toDecimal() method that
>>> would convert a Number
>>> type ot a Decimal type and we would need to support calling toNumber() on
>>> a decimal as well.
>>> 
>>> Given this, though, I think it makes sense to infer the return type and
>>> allow the user to explicitly
>>> change it via toNumber() and toDecimal() calls if need be. With the
>>> addition of these functions, I don't
>>> think it would be 'shady' to interpret the types at all. We already do
>>> interpret types in many cases, for
>>> instance when we have an expression such as ${num1:divide(${num2}) } -- in
>>> this case, num1 and num2
>>> are attributes, so they are strings. We already implicitly convert them to
>>> numbers. Going forward, we should
>>> convert them to either Number or Decimal type based on the regex that they
>>> match. I believe this also
>>> addresses the concern of not being able to override. I don't think
>>> backward compatibility is an issue here
>>> either, as we currently would just fail and throw an Exception if
>>> attempting to divide a string like "2.3"
>>> 
>>> Number 3 seems like a slam-dunk to me in terms of pros vs cons.
>>> 
>>> Thanks
>>> -Mark
>>> 
>>> 
>>> 
>>> 
>>> On Apr 20, 2016, at 10:16 PM, Joe Percivall
>>>> <joepercivall@yahoo.com.INVALID> wrote:
>>> 
>>> 
>>>> Hello Dev list,
>>>> I've been working on a couple improvements that deal with utilizing
>>>> expression language to do data analysis and this has exposed a couple
>>> issues with the typing of numbers in Expression Language (EL). The
>>> improvements are a continuation of the topic I presented on at the Apache
>>> NiFi MeetUp group in MD[1].
>>> 
>>> The primary issue I came across is that currently in EL all numbers
>>>> interpreted as longs[2], which only store whole numbers. This leads to
>>> problems when trying to do things like dividing whole numbers or just
>>> trying to add/subtract decimals. I am actually surprised that the request
>>> for decimals this hasn't come up before. That being said, after some
>>> initial discussion with Tony and Joe, I believe that there are four
>>> potential ways forward.
>>> 
>>> 1: Create a new EL type "decimal" backed by a double[3] and new methods
>>>> to support it ("add_decimal"): This allows the user to explicitly choose
>>> whether or not they want to use decimal or whole numbers. It retains the
>>> simple use-cases that use whole numbers while opening up the new use-cases
>>> using decimals. One down side is that it is more verbose. It means adding a
>>> new function for each math operation. This is backwards compatible.
>>> 
>>> 
>>>> 2: Back all numbers by doubles instead of longs: The easy to implement
>>>> and retains very concise methods ("add", "subtract", etc..). A few cons,
>>> doubles have a lower precision than longs[4],  can lead to rounding
>>> errors[5] and could be confusing for users who only work with whole numbers
>>> to suddenly find decimals.This is not backwards compatible.
>>> 
>>> 3: Create a new EL type "decimal" back by a double[3] and attempt to
>>>> smartly automatically cast depending on method/input/output: This would
>>> allow for the positives of having decimals and whole numbers in addition to
>>> having concise methods. The main cons being a much longer implementation
>>> time to do it right and the "shadiness" of doing things automatically for
>>> the user. Also this would mean the user wouldn't have the option to
>>> explicitly override  This is not backwards compatible.
>>> 
>>> 4: Create a new EL type "decimal" backed by a double[3] and overload the
>>>> existing methods with an additional parameter to specify return type to
>>> support it: This would allow for the positives of having decimals and whole
>>> numbers in addition to having concise method names but this may cause
>>> confusion with less technical users who aren't used to specifying return
>>> types. This is backwards compatible.
>>> 
>>> 
>>>> The options that are not backwards compatible would need to wait to be
>>>> implemented until 1.0.
>>> 
>>> The current option I am leaning towards is number 1 due to the
>>>> explicitness and greater control it gives the user. While it is more
>>> verbose I think the decimal vs whole number syntax will be easy for even
>>> non-technical users to pick up. Also I currently have a PR for it up
>>> here[6].
>>> 
>>> Any other ideas or suggestions are welcome!
>>>> [1] http://www.meetup.com/ApacheNiFi/events/229158777/
>>>> [2] https://docs.oracle.com/javase/7/docs/api/java/lang/Long.html
>>>> [3] https://docs.oracle.com/javase/7/docs/api/java/lang/Double.html
>>>> [4]
>>>> https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
>>> 
>>> [5] http://stackoverflow.com/questions/960072/rounding-errors
>>>> [6] https://issues.apache.org/jira/browse/NIFI-1662
>>>> Joe- - - - - - Joseph Percivalllinkedin.com/in/Percivalle:
>>>> joepercivall@yahoo.com
>>> 
>>>

Mime
View raw message