beam-commits mailing list archives

From "Kenneth Knowles (JIRA)" <>
Subject [jira] [Commented] (BEAM-2516) User reports 4 minutes to process 1 million line CSV in DirectRunner
Date Thu, 31 Aug 2017 04:09:00 GMT


Kenneth Knowles commented on BEAM-2516:

I think for 2.2.0 it is best to remove the translation to/from a proto by hiding it behind

There's a lot of overhead right now because of the impedance mismatch between the parts that
are still Java-specific and the parts which are SDK-agnostic. In the full story for the portability
framework, the DoFns and other UDFs can't even be deserialized, but are instead shipped to the SDK harness.
The harness will own the caching, so it probably doesn't make sense to add it to the DirectRunner
unless there's one silly repeated deserialization we can eliminate. Based on the profiling
results, perhaps there is, but no need to block anything on it.
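
To illustrate what eliminating one repeated deserialization could look like, here is a minimal sketch that memoizes Java deserialization keyed by the serialized bytes. It is not Beam code and not a proposed patch: the class DeserializationCache and its methods are hypothetical names made up for this example, and it assumes the cached object can be shared safely (that is, it is effectively immutable), which may not hold for a real DoFn.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of Beam): memoizes deserialization by payload content. */
final class DeserializationCache {
  // ByteBuffer has content-based equals/hashCode, unlike byte[], so it works as a map key.
  private final Map<ByteBuffer, Object> cache = new ConcurrentHashMap<>();
  private final AtomicInteger deserializations = new AtomicInteger();

  /** Returns the object for this payload, deserializing only on the first request. */
  Object get(byte[] payload) {
    // Copy the bytes so later mutation of the caller's array cannot corrupt the key.
    ByteBuffer key = ByteBuffer.wrap(payload.clone());
    return cache.computeIfAbsent(key, k -> deserialize(payload));
  }

  /** Number of times a payload was actually deserialized (for illustration only). */
  int deserializationCount() {
    return deserializations.get();
  }

  private Object deserialize(byte[] payload) {
    deserializations.incrementAndGet();
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
      return in.readObject();
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("Could not deserialize payload", e);
    }
  }

  /** Convenience for the example: plain Java serialization of a value. */
  static byte[] serialize(Serializable value) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(value);
    } catch (IOException e) {
      throw new IllegalStateException("Could not serialize value", e);
    }
    return bytes.toByteArray();
  }
}
```

Note that every lookup still hashes the full payload, so this only pays off when deserialization is much more expensive than hashing, and callers receive the same instance each time rather than a fresh copy.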

> User reports 4 minutes to process 1 million line CSV in DirectRunner
> --------------------------------------------------------------------
>                 Key: BEAM-2516
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-direct
>            Reporter: Kenneth Knowles
>            Priority: Minor
>             Fix For: 2.2.0
> I don't know what the expectations are here, so I wasn't ready to say this is WAI. Low
priority since it isn't what the runner is for anyhow, but this seems like the scale of data
that should be snappy. Worth investigating, or maybe you can quickly indicate why it is expected?

This message was sent by Atlassian JIRA