beam-commits mailing list archives

From c0b <>
Subject [GitHub] beam pull request #3694: could you allow github issues here? [dummy pr for i...
Date Mon, 07 Aug 2017 05:42:21 GMT
GitHub user c0b opened a pull request:

    could you allow github issues here? [dummy pr for issue comment only]

    _I don't understand why you require a JIRA ticket instead of GitHub issues here; I only
want to comment on the tickets, but creating a JIRA account just to comment is a broken user
experience (compared to GitHub issues)_
    - for Go SDK
    - for NodeJS SDK
    - for a generic declarative DSL that SDK writers for any language can use
    My first test run was to see how many runners are supported by the existing languages
(Java & Python). I tested the wordcount example with both; judging from this error, Python
does not have most of the other runners, and so far supports only the DirectRunner and
DataflowRunner, while still lacking important features like triggers:
        ValueError: Unexpected pipeline runner: ApexRunner. Valid values are DirectRunner,
        EagerRunner, DataflowRunner, TestDataflowRunner or the fully qualified name of a PipelineRunner
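    For reference, here is a minimal sketch of how the runner is selected in the Python SDK
(a sketch assuming the 2.0.0 SDK; the runner names are the ones from the error above):

        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions

        # DirectRunner is one of the few values the 2.0-era Python SDK accepts;
        # passing e.g. --runner=ApexRunner raises the ValueError quoted above.
        options = PipelineOptions(['--runner=DirectRunner'])
        with beam.Pipeline(options=options) as p:
            (p
             | beam.Create(['hello', 'world'])
             | beam.Map(lambda word: (word, 1)))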
    So either focus on the DataflowRunner with Python, or try another programming language.
Dataflow exposes REST API calls, but the difficulty for another programming language is how
to build the job-creation request body, and especially how to define and encode the job
steps (a hypothetical sketch of such a call follows).
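    For illustration, a hypothetical sketch of that raw REST call (PROJECT, TOKEN, and the
bucket are placeholders, and the empty `properties` dicts mark exactly the open question):

        import json
        import requests  # third-party HTTP client, assumed available

        PROJECT = 'my-project'  # placeholder project id
        TOKEN = '...'           # an OAuth2 access token, obtained elsewhere

        job = {
            'name': 'wordcount-by-hand',
            'type': 'JOB_TYPE_BATCH',
            'environment': {'tempStoragePrefix': 'gs://my-bucket/tmp'},  # placeholder
            'steps': [
                # each step is just kind/name/properties; what goes inside
                # properties is exactly what this message is asking about
                {'kind': 'ParallelRead', 'name': 's1', 'properties': {}},
                {'kind': 'ParallelDo', 'name': 's2', 'properties': {}},
            ],
        }
        resp = requests.post(
            'https://dataflow.googleapis.com/v1b3/projects/%s/jobs' % PROJECT,
            headers={'Authorization': 'Bearer ' + TOKEN,
                     'Content-Type': 'application/json'},
            data=json.dumps(job))
        print(resp.status_code)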
    From two test runs of the wordcount examples, I have found these clues so far:
    1. With the jobs list API and `view=JOB_VIEW_ALL` I can see that Java and Python use
different **workerHarnessContainerImage** values, so I pulled these images locally with
docker to look inside (a sketch of that list call closes this item); but where is the source
code for each? Are these images open-sourced? What is the default entrypoint
`/opt/google/dataflow/boot`?
        "workerHarnessContainerImage": ""
        "workerHarnessContainerImage": ""
    $ docker images --filter='*:*'
    REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
                        2.0.0               2a1e69afbef9        2 months ago        1.3GB
                        beam-2.0.0          2686ad94cb93        5 months ago        393MB
    $ docker run -it --rm --entrypoint=/bin/bash
    root@ddfe741352d6:/# \du -sh /usr/local/gcloud/google-cloud-sdk \
        /usr/local/lib/python2.7/dist-packages/tensorflow \
        /usr/local/lib/python2.7/dist-packages/scipy \
        /usr/local/lib/python2.7/dist-packages/sklearn /opt/google/dataflow
    226M  /usr/local/gcloud/google-cloud-sdk
    167M  /usr/local/lib/python2.7/dist-packages/tensorflow
    155M  /usr/local/lib/python2.7/dist-packages/scipy
    72M   /usr/local/lib/python2.7/dist-packages/sklearn
    26M   /opt/google/dataflow
    root@ddfe741352d6:/# ls -lih /opt/google/dataflow
    total 26M
    19005540 -r-xr-xr-x 1 root root  43K Jan  1  1970 NOTICES.shuffle
    19005538 -r-xr-xr-x 1 root root  14M Jan  1  1970 boot
    19005539 -r-xr-xr-x 1 root root 680K Jan  1  1970 dataflow_python_worker.tar
    19005541 -r-xr-xr-x 1 root root  12M Jan  1  1970
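    As promised above, a sketch of the jobs-list call that surfaces these images (PROJECT
and TOKEN are placeholders again):

        import requests  # third-party HTTP client, assumed available

        PROJECT = 'my-project'  # placeholder project id
        TOKEN = '...'           # an OAuth2 access token, obtained elsewhere

        resp = requests.get(
            'https://dataflow.googleapis.com/v1b3/projects/%s/jobs' % PROJECT,
            params={'view': 'JOB_VIEW_ALL'},
            headers={'Authorization': 'Bearer ' + TOKEN})
        for job in resp.json().get('jobs', []):
            # the harness image hangs off each worker pool in the job environment
            for pool in job.get('environment', {}).get('workerPools', []):
                print(pool.get('workerHarnessContainerImage'))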
    2. The REST API only defines that each step requires a `kind`, a `name`, and
`properties`; but what is the internal structure of `properties`? For the Python one I spent
some time figuring out that `serialized_fn` is the base64 encoding of a zlib compression of
a pickle-serialized object of the Python function's code, while the Java version's
`serialized_fn` uses another kind of function serialization (it looks like snappy
compression of Java bytecode?).
        So the question is: does this mean `properties` is left completely up to SDK
writers? If somebody is going to do Go or NodeJS, since different languages have very
different ways of serializing a function's code, all of this looks like duplicating a lot of
effort; would BEAM-14 then be a better approach?
        But generally, could you share more of the documentation SDK writers need? So far I
feel these are necessary: **1) a defined function-serialization protocol, to be used in the
`steps / properties`**; **2) a language-specific docker image, to be used as the
`workerHarnessContainerImage`, which will need to interpret the serialization protocol from
`steps / properties`**. (A decoding sketch for the Python case follows the JSON excerpt
below.)
          "kind": "ParallelDo",
          "name": "s2",
          "properties": {
            "serialized_fn": {
              "value": "eNq1VvtX3EQUnuwuLGRBhWKtrdbYShukZLU+sFTbIvTl2m3dMmIfGCfJ7E4gm+xNJgUq0VYO1T/UP8Q72WzpntPHT27OLjPf3Pvdm/sanpRM1/ODwFI/uu3GnEluy90e14EMHQUR84oDjY6t4UJ9oXQA5RZU6PR6FHtXd2TMXOmHndXoWggjLusxV3Db4axr4VGYtKO4m1huFHNdiegwegDVDMZMOmnbSRDJkHV5YtswvgE61W27G3lpwBGo0THcMT/E9QSt5a7Ywg9lApPDdvAgxy2PoyEmozjRb95W3t5QsA5vzT2Ft5sZvGPSmh/2UpmTJTDVpBNRKg+B6Wb6DI44tNqLI5cnCcy8JFjtNMRXjvBd3jUFhmYl8vqhOXoA77XgmNkoNUYalVsrV+jBnkY2CdknJCuRvRJJZskeIZsIloinkX2NaCEhskw2K8TLJfZLJCuTnetkr0xWNy6TrJJrjCgNpNHUZlRtZJXsVRSDIlHoGNkcH0bhEllHmrvwPv23xWUah4nBQsOXPA+SET3isSEFN7YxlYkRtXHjJwYPeJeH0tJ1Az9reF4gBh4yI/BDnsvyHWkZxs12TpGjeO4ELNw6Z4SR5Igzec6QUVRQLcedZClfGQPKpVx5wO9wrCSjiD33+lqF5wNF5c9ziUNf1REcb2i09mB54T5bePzw7MY8nPgbPjDpSCJjvwcf0mne7cldW/lqu1EaYiDgJC37oQsf0VLMwaDVth96LAjgYzqZh+W54Ck6owA74GFHikP8NC0jBJ/QqaFjz08kzNLRtOdh0cAZCWdNWkl40AaTVgu/YY6OqzjmLsGndCQ3CfNU24ZzEhbo/fojFtdF1OV1j8d8y8KqDzt1K4hcFtQD36n3dqWIwvPWYj3BxC5gX2yxDk/qL3RIne+wbi9AUNHnflu9XbDETMOi06Q2oY1qR7RpbRKfo9pxDepzcxI+a8Hnrm07qR9I1Y
              "@type": ""
            "display_data": [
                "value": {
                  "@type": "",
                  "value": "__main__.WordExtractingDoFn"
    3. I see the Python API uses a lot of operator overloading like `|` and `>>`; is that a
well-thought-out decision? For reading input, `p | 'label' >> beam.ReadFrom...()` does not
feel very intuitive to me; why not use `<<` to mean read from? Are there any good writings
from the engineers behind it?
        It is similar for other languages, which have many other kinds of syntactic sugar;
it would be more interesting if one could program in languages other than Java or Python,
but will that become true? Will Google dedicate more effort, or is the answer **never**?
Could you first fill in parity for Python, so it has all the features of the Java API? The
missing triggers feature is an important one.
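    For reference, the idiom in question looks like this (a sketch along the lines of the
bundled wordcount example; the paths are placeholders):

        import apache_beam as beam

        with beam.Pipeline() as p:
            counts = (p
                      # '|' applies a PTransform; 'label' >> transform attaches a name
                      | 'read' >> beam.io.ReadFromText('gs://my-bucket/input.txt')
                      | 'split' >> beam.FlatMap(lambda line: line.split())
                      | 'pair_with_one' >> beam.Map(lambda word: (word, 1))
                      | 'group_and_sum' >> beam.CombinePerKey(sum))
            counts | 'write' >> beam.io.WriteToText('gs://my-bucket/counts')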
    I don't see that the big-data processing area (streaming or batch; the competition led
by Spark vs. Apex vs. Flink vs. Gearpump vs. Dataflow) is mature in any programming language
other than Java so far. To do any serious big-data processing work, I feel the choices are
still limited to Java, at least for this year, 2017. Would you say more programming
languages will be able to do big data, with more SDKs coming next year?

You can merge this pull request into a Git repository by running:

    $ git pull patch-1

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3694
commit e4f6d0502e9d85f0760dd657314e890909d9cca3
Author: c0b <>
Date:   2017-08-06T20:03:31Z

    could you allow github issues here?


If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at or file a JIRA ticket
with INFRA.
