Mailing-List: contact commits-help@beam.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@beam.apache.org
Date: Fri, 25 Aug 2017 09:58:00 +0000 (UTC)
From: =?utf-8?Q?J=C3=A9r=C3=A9mie_Vexiau_=28JIRA=29?= <jira@apache.org>
To: commits@beam.apache.org
Message-ID: <JIRA.13097276.1503567221000.119801.1503655080356@Atlassian.JIRA>
In-Reply-To: <JIRA.13097276.1503567221000@Atlassian.JIRA>
References: <JIRA.13097276.1503567221000@Atlassian.JIRA> <JIRA.13097276.1503567221474@jira-lw-us.apache.org>
Subject: [jira] [Updated] (BEAM-2803) JdbcIO read is very slow when query
 return a lot of rows
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Fri, 25 Aug 2017 09:58:05 -0000


     [ https://issues.apache.org/jira/browse/BEAM-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jérémie Vexiau updated BEAM-2803:
---------------------------------
    Attachment:     (was: test2M.png)

> JdbcIO read is very slow when query return a lot of rows
> --------------------------------------------------------
>
>                 Key: BEAM-2803
>                 URL: https://issues.apache.org/jira/browse/BEAM-2803
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>    Affects Versions: Not applicable
>            Reporter: Jérémie Vexiau
>            Assignee: Reuven Lax
>              Labels: performance
>             Fix For: Not applicable
>
>
> Hi,
> I'm using the JdbcIO reader in batch mode with the PostgreSQL driver.
> My select query returns more than 5 million rows,
> using cursors with Statement.setFetchSize().
> These ParDos are OK:
> {code:java}
>           .apply(ParDo.of(new ReadFn<>(this))).setCoder(getCoder())
>           .apply(ParDo.of(new DoFn<T, KV<Integer, T>>() {
>             private Random random;
>             @Setup
>             public void setup() {
>               random = new Random();
>             }
>             @ProcessElement
>             public void processElement(ProcessContext context) {
>               context.output(KV.of(random.nextInt(), context.element()));
>             }
>           }))
> {code}
> But the reshuffle is very slow.
> It must be the GroupByKey, with more than 5 million keys.
> {code:java}
> .apply(GroupByKey.<Integer, T>create())
> {code}
> Is there a way to optimize the reshuffle, or another method to prevent fusion?
> Thanks in advance,
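
A minimal sketch, in plain Java with no Beam dependency, of why the keying step quoted above yields roughly one GroupByKey key per row: `random.nextInt()` draws from the full int range, so millions of rows get almost all-distinct keys. The bounded variant and its `numShards` parameter are hypothetical and not part of JdbcIO. Whether fewer keys actually speeds up the shuffle on Dataflow is an assumption this sketch does not test.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class KeySpaceSketch {

    // Mirrors KV.of(random.nextInt(), element): keys span the full int range.
    static int distinctUnboundedKeys(int rows, long seed) {
        Random random = new Random(seed);
        Set<Integer> keys = new HashSet<>();
        for (int i = 0; i < rows; i++) {
            keys.add(random.nextInt());
        }
        return keys.size();
    }

    // Hypothetical variant: keys are limited to [0, numShards).
    static int distinctBoundedKeys(int rows, int numShards, long seed) {
        Random random = new Random(seed);
        Set<Integer> keys = new HashSet<>();
        for (int i = 0; i < rows; i++) {
            keys.add(random.nextInt(numShards));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        int rows = 100_000; // scaled down from the 5 million rows in the report
        System.out.println("unbounded keys: " + distinctUnboundedKeys(rows, 42L));
        System.out.println("bounded keys:   " + distinctBoundedKeys(rows, 64, 42L));
    }
}
```

The unbounded count stays close to the row count, while the bounded count can never exceed `numShards`.
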
> Edit:
> I added some tests.
> I use Google Dataflow as the runner, with 1 worker (2 max), workerMachineType n1-standard-2,
> and autoscalingAlgorithm THROUGHPUT_BASED.
> First: the query returns 500 000 results:
> !test500k.png|thumbnail!
> As we can see,
> ParDo(Read) is about 1300 r/s
> GroupByKey is about 1080 r/s
> Second: the query returns 1 000 000 results:
> !test1M.png|thumbnail!
> ParDo(Read) => 1480 r/s
> GroupByKey => 634 r/s
> Third: the query returns 1 500 000 results:
> !test1500K.png|thumbnail!
> ParDo(Read) => 1700 r/s
> GroupByKey => 565 r/s
> Fourth: the query returns 2 000 000 results:
> !test2M.jpg|thumbnail!
> ParDo(Read) => 1485 r/s
> GroupByKey => 537 r/s
> As we can see, the GroupByKey rate decreases as the number of records grows.
> PS: the 2nd worker starts just after ParDo(Read) has succeeded.
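
To make the four measurements above easier to compare, here is a small plain-Java sketch that computes the GroupByKey rate as a fraction of the read rate, plus a rough time estimate for the GroupByKey step. The class and method names are made up for illustration, and the time estimate assumes the reported r/s figure is a steady average over the whole step, which the report does not state.

```java
import java.util.Locale;

final class ThroughputSketch {

    // GroupByKey rate relative to the read rate (1.0 means equally fast).
    static double ratio(double groupByKeyRate, double readRate) {
        return groupByKeyRate / readRate;
    }

    // Rough duration in seconds, assuming a constant rate (an assumption).
    static double secondsFor(long rows, double ratePerSecond) {
        return rows / ratePerSecond;
    }

    public static void main(String[] args) {
        // Figures copied from the four tests in the report.
        long[] rows = {500_000L, 1_000_000L, 1_500_000L, 2_000_000L};
        double[] readRate = {1300, 1480, 1700, 1485};
        double[] groupByKeyRate = {1080, 634, 565, 537};

        for (int i = 0; i < rows.length; i++) {
            System.out.printf(Locale.ROOT,
                "%d rows: GroupByKey/read = %.2f, GroupByKey time ~ %.0f s%n",
                rows[i],
                ratio(groupByKeyRate[i], readRate[i]),
                secondsFor(rows[i], groupByKeyRate[i]));
        }
    }
}
```

Under that constant-rate assumption, the estimated GroupByKey time grows faster than the row count, which matches the falling r/s figures.
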


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)