beam-commits mailing list archives

From Jérémie Vexiau (JIRA) <j...@apache.org>
Subject [jira] [Created] (BEAM-2803) JdbcIO read is very slow when query return a lot of rows
Date Thu, 24 Aug 2017 09:34:00 GMT
Jérémie Vexiau created BEAM-2803:
------------------------------------

             Summary: JdbcIO read is very slow when query return a lot of rows
                 Key: BEAM-2803
                 URL: https://issues.apache.org/jira/browse/BEAM-2803
             Project: Beam
          Issue Type: Improvement
          Components: sdk-java-extensions
    Affects Versions: Not applicable
            Reporter: Jérémie Vexiau
            Assignee: Reuven Lax
             Fix For: Not applicable


Hi,

I'm using the JdbcIO reader in batch mode with the PostgreSQL driver.
My select query returns more than 5 million rows,
read through a cursor with Statement.setFetchSize().
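
For reference, a minimal sketch of what cursor-based fetching usually requires with the PostgreSQL JDBC driver: the driver only streams rows through a server-side cursor when autocommit is off, the result set is forward-only, and the fetch size is positive. `CursorFetchSketch` and `prepareStreaming` are hypothetical names for illustration only; they are not part of JdbcIO or the driver, and this is not necessarily how JdbcIO configures its statement.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical helper for illustration; not part of JdbcIO. */
final class CursorFetchSketch {
  private CursorFetchSketch() {}

  /** Prepares a query so the PostgreSQL driver can fetch rows in batches. */
  static PreparedStatement prepareStreaming(Connection connection, String query, int fetchSize)
      throws SQLException {
    // With autocommit on, the PostgreSQL driver ignores the fetch size
    // and loads the whole result set into memory.
    connection.setAutoCommit(false);
    // Cursor-based fetching needs a forward-only result set.
    PreparedStatement statement = connection.prepareStatement(
        query, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    // Number of rows pulled from the server per round trip.
    statement.setFetchSize(fetchSize);
    return statement;
  }
}
```

The caller would still need to iterate the `ResultSet` and close the statement and connection; the fetch size itself is a tuning value, not something fixed by the report.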

These ParDos are OK:
{code:java}
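          // Read the rows, then pair each one with a random key ahead of the GroupByKey.
          .apply(ParDo.of(new ReadFn<>(this))).setCoder(getCoder())
          .apply(ParDo.of(new DoFn<T, KV<Integer, T>>() {
            private Random random;
            @Setup
            public void setup() {
              random = new Random();
            }
            @ProcessElement
            public void processElement(ProcessContext context) {
              // nextInt() draws from the full int range, so almost every row gets its own key.
              context.output(KV.of(random.nextInt(), context.element()));
            }
          }))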
{code}

But the reshuffle is very, very slow.
It must be the GroupByKey, which has to handle more than 5 million keys.
{code:java}
.apply(GroupByKey.<Integer, T>create())
{code}
Is there a way to optimize the reshuffle, or another method to prevent fusion?
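
The key-count concern can be illustrated with a small plain-Java sketch (no Beam dependency) that counts how many distinct keys the random-key step produces. `KeyCardinalitySketch`, its method names, the scaled-down row count, and the key range of 200 are all assumptions made for illustration. The bounded variant is only a point of comparison: whether fewer keys actually speeds up the GroupByKey depends on the runner, and a small key range also limits parallelism after the shuffle.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** Hypothetical illustration of key cardinality; plain Java, not Beam code. */
final class KeyCardinalitySketch {
  private KeyCardinalitySketch() {}

  /** Distinct keys when every element gets random.nextInt(), as in the ParDo above. */
  static int distinctUnboundedKeys(int elementCount, long seed) {
    Random random = new Random(seed);
    Set<Integer> keys = new HashSet<>();
    for (int i = 0; i < elementCount; i++) {
      keys.add(random.nextInt());
    }
    return keys.size();
  }

  /** Distinct keys when every element gets a key drawn from [0, keyRange). */
  static int distinctBoundedKeys(int elementCount, int keyRange, long seed) {
    Random random = new Random(seed);
    Set<Integer> keys = new HashSet<>();
    for (int i = 0; i < elementCount; i++) {
      keys.add(random.nextInt(keyRange));
    }
    return keys.size();
  }

  public static void main(String[] args) {
    int rows = 200_000; // scaled down from the 5 million rows in the report
    System.out.println("distinct keys, nextInt():    " + distinctUnboundedKeys(rows, 42L));
    System.out.println("distinct keys, nextInt(200): " + distinctBoundedKeys(rows, 200, 42L));
  }
}
```

With full-range keys the number of groups stays close to the number of rows, so the GroupByKey ends up with roughly one element per key.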

Thanks in advance,



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
