flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
Date Tue, 02 Jun 2015 15:50:27 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569312#comment-14569312

ASF GitHub Bot commented on FLINK-1319:

Github user uce commented on the pull request:

    Hey Timo,
    I've tested this locally so far and it's working smoothly! Let's write a blog post about
this very soon. :-) As soon as I have access to more machines, I will test it on a cluster.
The stuff denoted with [RB] should be fixed in any case before merging imo. The rest we can
also do afterwards.
    1. Disable the analysis by default. [RB]
    2. This is cumbersome, but we should go for Annotation-based exclusions instead of package
based (see inline comments). [RB]
    3. Currently manual annotations trump automatic ones, which is a good thing, because people
won't have unexpected results. Could you add a test for this?
    4. I would give the analyzer util the call location based function name (String callLocation
= Utils.getCallLocationName()). I think that will have better output than just `Function 'Job$2'
has been analyzed with the following result: ...`
    4. I would rename `UdfAnalyisMode` (see inline comments).
    5. The hints are currently logged on a singe line w/o a whitespace after each hint, like
    Function modifies static fields. This can lead to unexpected behaviour during runtime.Function
returns 'null' values. This can lead to errors during runtime.A need for forwarded fields
annotations could not be found.
    6. Some analysis like a wrong tuple index access or returning null lead fail the program
before submitting it, which is very nice. I would actually like to log these problems on a
log level WARN (instead of INFO) when only hints are enabled.
    7. When you have optimize enabled and run a filter, which changes the input, the error
you get is this. You can't tell what's wrong.
    Exception in thread "main" org.apache.flink.api.java.sca.UdfErrorException: UDF contains
obvious errors.
    	at org.apache.flink.api.java.sca.UdfAnalyzer.analyze(UdfAnalyzer.java:300)
        For a wrong tuple index access, it is correct:
    Exception in thread "main" org.apache.flink.api.java.sca.UdfErrorException: UDF contains
obvious errors.
    	at org.apache.flink.api.java.sca.UdfAnalyzer.analyze(UdfAnalyzer.java:300)
    Caused by: org.apache.flink.api.java.sca.UdfErrorException: Function contains tuple accesses
with invalid indexes. This can lead to errors during runtime.
    	at org.apache.flink.api.java.sca.UdfAnalyzer.addHintOrThrowException(UdfAnalyzer.java:413)
        I'm not sure if the thrown Execption should be an `UdfErrorException` or an `UdfAnalysisException`?
    8. When optimizing is enabled, I would still print the inferred forwarded fields.
    9. Regarding the result messages: personally, I think the hints/messages could be more
consise, e.g. for example skip the `(should be kept to a minimum)` in the number of object
creations msg or just say `Forwarded fields: none` instead of `A need for forwarded fields
annotations could not be found.`
    1. I think the internals could use more comments. It's easy to get the general idea, but
it would be nice to also get some high-level comments about the ASM-related stuff. I didn't
have time to dig deep into it.
    Will report back after cluster tests.

> Add static code analysis for UDFs
> ---------------------------------
>                 Key: FLINK-1319
>                 URL: https://issues.apache.org/jira/browse/FLINK-1319
>             Project: Flink
>          Issue Type: New Feature
>          Components: Java API, Scala API
>            Reporter: Stephan Ewen
>            Assignee: Timo Walther
>            Priority: Minor
> Flink's Optimizer takes information that tells it for UDFs which fields of the input
elements are accessed, modified, or frwarded/copied. This information frequently helps to
reuse partitionings, sorts, etc. It may speed up programs significantly, as it can frequently
eliminate sorts and shuffles, which are costly.
> Right now, users can add lightweight annotations to UDFs to provide this information
(such as adding {{@ConstandFields("0->3, 1, 2->1")}}.
> We worked with static code analysis of UDFs before, to determine this information automatically.
This is an incredible feature, as it "magically" makes programs faster.
> For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this works surprisingly
well in many cases. We used the "Soot" toolkit for the static code analysis. Unfortunately,
Soot is LGPL licensed and thus we did not include any of the code so far.
> I propose to add this functionality to Flink, in the form of a drop-in addition, to work
around the LGPL incompatibility with ALS 2.0. Users could simply download a special "flink-code-analysis.jar"
and drop it into the "lib" folder to enable this functionality. We may even add a script to
"tools" that downloads that library automatically into the lib folder. This should be legally
fine, since we do not redistribute LGPL code and only dynamically link it (the incompatibility
with ASL 2.0 is mainly in the patentability, if I remember correctly).
> Prior work on this has been done by [~aljoscha] and [~skunert], which could provide a
code base to start with.
> *Appendix*
> Hompage to Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/
> Papers on static analysis and for optimization: http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf
and http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
> Quick introduction to the Optimizer: http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf
(Section 6)
> Optimizer for Iterations: http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf
(Sections 4.3 and 5.3)

This message was sent by Atlassian JIRA

View raw message