From: Fabian Hueske
Date: Thu, 7 Dec 2017 18:39:43 +0100
Subject: Re: Flink Batch Performance degradation at scale
To: Garrett Barton
Cc: user@flink.apache.org

Ah, no direct memory buffer...
Can you try to disable off-heap memory?

2017-12-07 18:35 GMT+01:00 Garrett Barton:

> Stacktrace generates every time with the following settings (tried
> different memory fractions):
> yarn-session.sh -n 400 -s 2 -tm 9200 -jm 5120
> akka.ask.timeout: 60s
> containerized.heap-cutoff-ratio: 0.15
> taskmanager.memory.fraction: 0.7/0.3/0.1
> taskmanager.memory.off-heap: true
> taskmanager.memory.preallocate: true
> env.getConfig().setExecutionMode(ExecutionMode.BATCH)
>
> Hand-jammed top of the stack:
> java.lang.RuntimeException: Error obtaining the sorted input: Thread
> 'SortMerger Reading Thread' terminated due to an exception:
> java.lang.OutOfMemoryError: Direct buffer memory
>   at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
>   at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1095)
>   at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:82)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:490)
>   at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception:
> java.lang.OutOfMemoryError: Direct buffer memory
>   at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
> Caused by: org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
> java.lang.OutOfMemoryError: Direct buffer memory
>   at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:149)
>   ... lots of netty stuff
>
> While I observe the taskmanagers, I never see their JVM heaps get high at all.
> Mind you, I can't tell which task will blow up and then see its TM in time to
> see what it looks like. But for each one I do look at, the heap usage is
> ~150MB/6.16GB (with fraction: 0.1).
>
> On Thu, Dec 7, 2017 at 11:59 AM, Fabian Hueske wrote:
>
>> Hmm, the OOM sounds like a bug to me. Can you provide the stacktrace?
>> The managed memory should be divided among all possible consumers. In the
>> case of your simple job, this should just be the Sorter.
>> In fact, I'd try to reduce the fraction to give more memory to the JVM
>> heap (an OOM means there was not enough (heap) memory).
>>
>> Enabling BATCH mode means that the records are not shipped to the sorter
>> in a pipelined fashion but buffered at (and written to the disk of) the
>> sender task. Once the input has been consumed, the data is shipped to the
>> receiver tasks (the sorter). This mode decouples tasks and also reduces
>> the number of network buffers, because fewer connections must be active
>> at the same time.
>> Here's a link to an internal design document (not sure how up to date it
>> is, though...) [1].
>>
>> Did you try to check whether the problem is caused by data skew?
>> You could add a MapPartition task instead of the PartitionSorter to
>> count the number of records per partition.
>>
>> Best, Fabian
>>
>> [1] https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
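To make that last suggestion concrete, a minimal sketch of such a skew check follows. It is not code from this thread: the tuple type and the flatMapped data set are stand-ins for whatever the real job produces ahead of the sort, and only the field indices [0,2] are taken from the posted plan.

    import org.apache.flink.api.common.functions.RichMapPartitionFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.util.Collector;

    // Hash-partition on the same keys the job uses ([0,2]) and, instead of sorting,
    // count what every parallel subtask receives so skewed partitions stand out.
    DataSet<Tuple2<Integer, Long>> countsPerPartition = flatMapped
        .partitionByHash(0, 2)
        .mapPartition(new RichMapPartitionFunction<Tuple3<String, String, String>, Tuple2<Integer, Long>>() {
            @Override
            public void mapPartition(Iterable<Tuple3<String, String, String>> records,
                                     Collector<Tuple2<Integer, Long>> out) {
                long count = 0L;
                for (Tuple3<String, String, String> ignored : records) {
                    count++;
                }
                // emit (subtask index, number of records seen by that subtask)
                out.collect(new Tuple2<>(getRuntimeContext().getIndexOfThisSubtask(), count));
            }
        });

    countsPerPartition.print();

A heavily skewed key distribution shows up as one or two subtasks reporting counts orders of magnitude larger than the rest, which matches the 6.7GB/7.3GB partitions reported further down the thread.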
>> 2017-12-07 16:30 GMT+01:00 Garrett Barton:
>>
>>> Thanks for the reply again,
>>>
>>> I'm currently doing runs with:
>>> yarn-session.sh -n 700 -s 2 -tm 9200 -jm 5120
>>> akka.ask.timeout: 60s
>>> containerized.heap-cutoff-ratio: 0.15
>>> taskmanager.memory.fraction: 0.7
>>> taskmanager.memory.off-heap: true
>>> taskmanager.memory.preallocate: true
>>>
>>> When I change the config setExecutionMode() to BATCH, no matter what
>>> memory fraction I choose, the sort instantly fails with SortMerger OOM
>>> exceptions, even when I set the fraction to 0.95. The data source part is
>>> ridiculously fast though, ~30 seconds! Disabling batch mode and keeping
>>> the other changes seems to give the same behavior as before; the jobs have
>>> been running for ~20 minutes now. Does BATCH mode disable spilling to
>>> disk, or does BATCH combined with off-heap disable spilling to disk? Is
>>> there more documentation on what BATCH mode does under the covers?
>>>
>>> As for the flow itself, yes, it used to be a lot smaller. I broke it out
>>> manually by adding the sort/partition to see which steps were causing the
>>> slowdown; thinking it was my code, I wanted to separate the operations.
>>>
>>> Thank you again for your help.
>>>
>>> On Thu, Dec 7, 2017 at 4:49 AM, Fabian Hueske wrote:
>>>
>>>> That doesn't look like a bad configuration.
>>>>
>>>> I have to correct myself regarding the size of the managed memory. The
>>>> fraction (70%) is applied to the free memory after the TM initialization.
>>>> This means that memory for network buffers (and other data structures) is
>>>> subtracted before the managed memory is allocated.
>>>> The actual size of the managed memory is logged in the TM log file
>>>> during start-up.
>>>>
>>>> You could also try to decrease the number of slots per TM to 1 but add
>>>> more vCores (yarn.containers.vcores [1]) because the sorter runs in
>>>> multiple threads.
>>>>
>>>> Adding a GroupCombineFunction for pre-aggregation (if possible...)
>>>> would help to mitigate the effects of the data skew.
>>>> Another thing I'd like to ask: are you adding the partitioner and
>>>> sorter explicitly to the plan, and if so, why? Usually, the partitioning
>>>> and sorting is done as part of the GroupReduce.
>>>>
>>>> Best, Fabian
>>>>
>>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#yarn
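For the pre-aggregation idea, a minimal sketch is shown below. It is only one way to use a GroupCombineFunction, under the assumption that the records can be partially aggregated before the shuffle (Fabian's "if possible"); the tuple type, the key field and the records data set are hypothetical and not taken from the job in this thread. A sketch of letting the GroupReduce drive the partitioning and sorting appears at the end of the thread.

    import org.apache.flink.api.common.functions.GroupCombineFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    // Partially aggregate (key, count) pairs before they are shuffled to the
    // final GroupReduce, so hot keys ship far fewer records across the network.
    DataSet<Tuple2<String, Long>> preAggregated = records
        .groupBy(0)
        .combineGroup(new GroupCombineFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
            @Override
            public void combine(Iterable<Tuple2<String, Long>> values,
                                Collector<Tuple2<String, Long>> out) {
                String key = null;
                long sum = 0L;
                for (Tuple2<String, Long> value : values) {
                    key = value.f0;
                    sum += value.f1;
                }
                if (key != null) {
                    // one partially aggregated record per key and local partition
                    out.collect(new Tuple2<>(key, sum));
                }
            }
        });

    // A final groupBy(0) + reduceGroup(...) is still needed afterwards, because
    // the combine step only sees the records co-located on each subtask.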
>>>> 2017-12-06 23:32 GMT+01:00 Garrett Barton:
>>>>
>>>>> Wow, thank you for the reply, you gave me a lot to look into and mess
>>>>> with. I'll start testing with the various memory options and env
>>>>> settings tomorrow.
>>>>>
>>>>> BTW the current Flink cluster is launched like:
>>>>> yarn-session.sh -n 700 -s 2 -tm 9200 -jm 5120
>>>>>
>>>>> with flink-conf.yaml property overrides of:
>>>>> # so bigger clusters don't fail to init
>>>>> akka.ask.timeout: 60s
>>>>> # so more memory is given to the JVM from the yarn container
>>>>> containerized.heap-cutoff-ratio: 0.15
>>>>>
>>>>> So each Flink slot doesn't necessarily get a lot of RAM. You said 70% of
>>>>> the RAM goes to the job by default, so that's (9200*0.85)*0.70 = 5474MB,
>>>>> meaning each slot is sitting with ~2737MB of usable space. Would you
>>>>> have a different config for taking overall the same amount of RAM?
>>>>>
>>>>> On Wed, Dec 6, 2017 at 11:49 AM, Fabian Hueske wrote:
>>>>>
>>>>>> Hi Garrett,
>>>>>>
>>>>>> data skew might be a reason for the performance degradation.
>>>>>>
>>>>>> The plan you shared is pretty simple. The following happens when you
>>>>>> run the program:
>>>>>> - The data source starts to read data and pushes the records to the
>>>>>> FlatMapFunction. From there the records are shuffled (using
>>>>>> hash-partitioning) to the sorter.
>>>>>> - The sorter tasks consume the records and write them into a memory
>>>>>> buffer. When the buffer is full, it is sorted and spilled to disk. Once
>>>>>> the buffer has been spilled, it is filled again with records, sorted,
>>>>>> and spilled.
>>>>>> - The initially fast processing happens because at the beginning the
>>>>>> sorter is not waiting for buffers to be sorted or spilled, because they
>>>>>> are still empty.
>>>>>>
>>>>>> The performance of the plan depends (among other things) on the size of
>>>>>> the sort buffers. The sort buffers are taken from Flink's managed
>>>>>> memory. Unless you configured something else, 70% of the TaskManager
>>>>>> heap memory is reserved as managed memory.
>>>>>> If you use Flink only for batch jobs, I would enable preallocation and
>>>>>> off-heap memory (see configuration options [1]). You can also configure
>>>>>> a fixed size for the managed memory. The more memory you configure, the
>>>>>> more is available for sorting.
>>>>>>
>>>>>> The managed memory of a TM is evenly distributed to all its processing
>>>>>> slots. Hence, having more slots per TM means that each slot has less
>>>>>> managed memory (for sorting or joins or ...).
>>>>>> So many slots are not necessarily good for performance (unless you
>>>>>> increase the number of TMs / memory as well), especially in case of
>>>>>> data skew, when most slots receive only little data and cannot leverage
>>>>>> their memory.
>>>>>> If your data is heavily skewed, it might make sense to have fewer slots
>>>>>> such that each slot has more memory for sorting.
>>>>>>
>>>>>> Skew also has an effect on downstream operations. In case of skew, some
>>>>>> of the sorter tasks are overloaded and cannot accept more data.
>>>>>> Due to the pipelined shuffles, this leads to a back-pressure behavior
>>>>>> that propagates down to the sources.
>>>>>> You can disable pipelining by setting the execution mode on the
>>>>>> execution configuration to BATCH [2]. This will break the pipeline but
>>>>>> write the result of the FlatMap to disk.
>>>>>> This might help if the FlatMap is compute-intensive or filters many
>>>>>> records.
>>>>>>
>>>>>> The data sizes don't sound particularly large, so this should be
>>>>>> something that Flink should be able to handle.
>>>>>>
>>>>>> Btw. you don't need to convert the JSON plan output. You can paste it
>>>>>> into the plan visualizer [3].
>>>>>> I would not worry about the missing statistics. The optimizer does not
>>>>>> leverage them at the current state.
>>>>>>
>>>>>> Best, Fabian
>>>>>>
>>>>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#managed-memory
>>>>>> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/execution_configuration.html
>>>>>> [3] http://flink.apache.org/visualizer/
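The execution-mode switch referenced in [2] is set on the ExecutionConfig from the job code; for reference, a minimal sketch with no job-specific assumptions:

    import org.apache.flink.api.common.ExecutionMode;
    import org.apache.flink.api.java.ExecutionEnvironment;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    // PIPELINED is the default; BATCH materializes the sender side of a shuffle
    // first, which is what decouples the FlatMap from the sorter described above.
    env.getConfig().setExecutionMode(ExecutionMode.BATCH);

This is the same line that appears in the settings posted at the top of the thread.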
>>>>>> 2017-12-06 16:45 GMT+01:00 Garrett Barton:
>>>>>>
>>>>>>> Fabian,
>>>>>>>
>>>>>>> Thank you for the reply. Yes, I do watch via the UI; is there another
>>>>>>> way to see progress through the steps?
>>>>>>>
>>>>>>> I think I just figured it out: the hang-up is in the sort phase (ID 4),
>>>>>>> where 2 slots take all the time. Looking in the UI, most slots get less
>>>>>>> than 500MB of data to sort; these two have 6.7GB and 7.3GB each.
>>>>>>> Together that's about 272M records, and these will run for hours at
>>>>>>> this point. Looks like I need to figure out a different
>>>>>>> partitioning/sort strategy. I never noticed before because when I run
>>>>>>> the system at ~1400 slots I don't use the UI anymore, as it gets
>>>>>>> unresponsive. 400 slots is painfully slow, but still works.
>>>>>>>
>>>>>>> The getEnv output is very cool! Also very big; I've tried to summarize
>>>>>>> it here in more of a yaml format, as it's on a different network. Note
>>>>>>> the parallelism was just set to 10, as I didn't know if that affected
>>>>>>> the output. Hopefully I didn't flub a copy/paste step; it looks good to
>>>>>>> me.
>>>>>>>
>>>>>>> This flow used to be far fewer steps, but as it wasn't scaling I broke
>>>>>>> it out into all the distinct pieces so I could see where it failed.
>>>>>>> Source and sink are both Hive tables. I wonder if the inputformat is
>>>>>>> expected to give more info to seed some of these stat values?
>>>>>>>
>>>>>>> nodes
>>>>>>>     id: 6
>>>>>>>     type: source
>>>>>>>     pact: Data Source
>>>>>>>     contents: at CreateInput(ExecutionEnvironment.java:533)
>>>>>>>     parallelism: 10
>>>>>>>     global_properties:
>>>>>>>         name: partitioning v: RANDOM_PARTITIONED
>>>>>>>         name: Partitioning Order value: none
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     local_properties:
>>>>>>>         name: Order value: none
>>>>>>>         name: Grouping value: not grouped
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     estimates:
>>>>>>>         name: Est. Output Size value: unknown
>>>>>>>         name: Est Cardinality value: unknown
>>>>>>>     costs:
>>>>>>>         name: Network value: 0
>>>>>>>         name: Disk I/O value: 0
>>>>>>>         name: CPU value: 0
>>>>>>>         name: Cumulative Network value: 0
>>>>>>>         name: Cumulative Disk I/O value: 0
>>>>>>>         name: Cumulative CPU value: 0
>>>>>>>     compiler_hints:
>>>>>>>         name: Output Size (bytes) value: none
>>>>>>>         name: Output Cardinality value: none
>>>>>>>         name: Avg. Output Record Size (bytes) value: none
>>>>>>>         name: Filter Factor value: none
>>>>>>>
>>>>>>>     id: 5
>>>>>>>     type: pact
>>>>>>>     pact: FlatMap
>>>>>>>     contents: FlatMap at main()
>>>>>>>     parallelism: 10
>>>>>>>     predecessors:
>>>>>>>         id: 6, ship_strategy: Forward, exchange_mode: PIPELINED
>>>>>>>     driver_strategy: FlatMap
>>>>>>>     global_properties:
>>>>>>>         name: partitioning v: RANDOM_PARTITIONED
>>>>>>>         name: Partitioning Order value: none
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     local_properties:
>>>>>>>         name: Order value: none
>>>>>>>         name: Grouping value: not grouped
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     estimates:
>>>>>>>         name: Est. Output Size value: unknown
>>>>>>>         name: Est Cardinality value: unknown
>>>>>>>     costs:
>>>>>>>         name: Network value: 0
>>>>>>>         name: Disk I/O value: 0
>>>>>>>         name: CPU value: 0
>>>>>>>         name: Cumulative Network value: 0
>>>>>>>         name: Cumulative Disk I/O value: 0
>>>>>>>         name: Cumulative CPU value: 0
>>>>>>>     compiler_hints:
>>>>>>>         name: Output Size (bytes) value: none
>>>>>>>         name: Output Cardinality value: none
>>>>>>>         name: Avg. Output Record Size (bytes) value: none
>>>>>>>         name: Filter Factor value: none
>>>>>>>
>>>>>>>     id: 4
>>>>>>>     type: pact
>>>>>>>     pact: Sort-Partition
>>>>>>>     contents: Sort at main()
>>>>>>>     parallelism: 10
>>>>>>>     predecessors:
>>>>>>>         id: 5, ship_strategy: Hash Partition on [0,2],
>>>>>>>         local_strategy: Sort on [0:ASC,2:ASC,1:ASC], exchange_mode: PIPELINED
>>>>>>>     driver_strategy: No-Op
>>>>>>>     global_properties:
>>>>>>>         name: partitioning v: HASH_PARTITIONED
>>>>>>>         name: Partitioned on value: [0,2]
>>>>>>>         name: Partitioning Order value: none
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     local_properties:
>>>>>>>         name: Order value: [0:ASC,2:ASC,1:ASC]
>>>>>>>         name: Grouping value: [0,2,1]
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     estimates:
>>>>>>>         name: Est. Output Size value: unknown
>>>>>>>         name: Est Cardinality value: unknown
>>>>>>>     costs:
>>>>>>>         name: Network value: 0
>>>>>>>         name: Disk I/O value: 0
>>>>>>>         name: CPU value: 0
>>>>>>>         name: Cumulative Network value: unknown
>>>>>>>         name: Cumulative Disk I/O value: unknown
>>>>>>>         name: Cumulative CPU value: unknown
>>>>>>>     compiler_hints:
>>>>>>>         name: Output Size (bytes) value: none
>>>>>>>         name: Output Cardinality value: none
>>>>>>>         name: Avg. Output Record Size (bytes) value: none
>>>>>>>         name: Filter Factor value: none
>>>>>>>
>>>>>>>     id: 3
>>>>>>>     type: pact
>>>>>>>     pact: GroupReduce
>>>>>>>     contents: GroupReduce at first(SortedGrouping.java:210)
>>>>>>>     parallelism: 10
>>>>>>>     predecessors:
>>>>>>>         id: 4, ship_strategy: Forward, exchange_mode: PIPELINED
>>>>>>>     driver_strategy: Sorted Group Reduce
>>>>>>>     global_properties:
>>>>>>>         name: partitioning v: RANDOM_PARTITIONED
>>>>>>>         name: Partitioning Order value: none
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     local_properties:
>>>>>>>         name: Order value: none
>>>>>>>         name: Grouping value: not grouped
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     estimates:
>>>>>>>         name: Est. Output Size value: unknown
>>>>>>>         name: Est Cardinality value: unknown
>>>>>>>     costs:
>>>>>>>         name: Network value: 0
>>>>>>>         name: Disk I/O value: 0
>>>>>>>         name: CPU value: 0
>>>>>>>         name: Cumulative Network value: unknown
>>>>>>>         name: Cumulative Disk I/O value: unknown
>>>>>>>         name: Cumulative CPU value: unknown
>>>>>>>     compiler_hints:
>>>>>>>         name: Output Size (bytes) value: none
>>>>>>>         name: Output Cardinality value: none
>>>>>>>         name: Avg. Output Record Size (bytes) value: none
>>>>>>>         name: Filter Factor value: none
>>>>>>>
>>>>>>>     id: 2
>>>>>>>     type: pact
>>>>>>>     pact: Map
>>>>>>>     contents: Map at ()
>>>>>>>     parallelism: 10
>>>>>>>     predecessors:
>>>>>>>         id: 3, ship_strategy: Forward, exchange_mode: PIPELINED
>>>>>>>     driver_strategy: Map
>>>>>>>     global_properties:
>>>>>>>         name: partitioning v: RANDOM_PARTITIONED
>>>>>>>         name: Partitioning Order value: none
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     local_properties:
>>>>>>>         name: Order value: none
>>>>>>>         name: Grouping value: not grouped
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     estimates:
>>>>>>>         name: Est. Output Size value: unknown
>>>>>>>         name: Est Cardinality value: unknown
>>>>>>>     costs:
>>>>>>>         name: Network value: 0
>>>>>>>         name: Disk I/O value: 0
>>>>>>>         name: CPU value: 0
>>>>>>>         name: Cumulative Network value: unknown
>>>>>>>         name: Cumulative Disk I/O value: unknown
>>>>>>>         name: Cumulative CPU value: unknown
>>>>>>>     compiler_hints:
>>>>>>>         name: Output Size (bytes) value: none
>>>>>>>         name: Output Cardinality value: none
>>>>>>>         name: Avg. Output Record Size (bytes) value: none
>>>>>>>         name: Filter Factor value: none
>>>>>>>
>>>>>>>     id: 1
>>>>>>>     type: pact
>>>>>>>     pact: Map
>>>>>>>     contents: map at main()
>>>>>>>     parallelism: 10
>>>>>>>     predecessors:
>>>>>>>         id: 2, ship_strategy: Forward, exchange_mode: PIPELINED
>>>>>>>     driver_strategy: Map
>>>>>>>     global_properties:
>>>>>>>         name: partitioning v: RANDOM_PARTITIONED
>>>>>>>         name: Partitioning Order value: none
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     local_properties:
>>>>>>>         name: Order value: none
>>>>>>>         name: Grouping value: not grouped
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     estimates:
>>>>>>>         name: Est. Output Size value: unknown
>>>>>>>         name: Est Cardinality value: unknown
>>>>>>>     costs:
>>>>>>>         name: Network value: 0
>>>>>>>         name: Disk I/O value: 0
>>>>>>>         name: CPU value: 0
>>>>>>>         name: Cumulative Network value: unknown
>>>>>>>         name: Cumulative Disk I/O value: unknown
>>>>>>>         name: Cumulative CPU value: unknown
>>>>>>>     compiler_hints:
>>>>>>>         name: Output Size (bytes) value: none
>>>>>>>         name: Output Cardinality value: none
>>>>>>>         name: Avg. Output Record Size (bytes) value: none
>>>>>>>         name: Filter Factor value: none
>>>>>>>
>>>>>>>     id: 0
>>>>>>>     type: sink
>>>>>>>     pact: Data Sink
>>>>>>>     contents: org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat
>>>>>>>     parallelism: 10
>>>>>>>     predecessors:
>>>>>>>         id: 1, ship_strategy: Forward, exchange_mode: PIPELINED
>>>>>>>     driver_strategy: Map
>>>>>>>     global_properties:
>>>>>>>         name: partitioning v: RANDOM_PARTITIONED
>>>>>>>         name: Partitioning Order value: none
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     local_properties:
>>>>>>>         name: Order value: none
>>>>>>>         name: Grouping value: not grouped
>>>>>>>         name: Uniqueness value: not unique
>>>>>>>     estimates:
>>>>>>>         name: Est. Output Size value: unknown
>>>>>>>         name: Est Cardinality value: unknown
>>>>>>>     costs:
>>>>>>>         name: Network value: 0
>>>>>>>         name: Disk I/O value: 0
>>>>>>>         name: CPU value: 0
>>>>>>>         name: Cumulative Network value: unknown
>>>>>>>         name: Cumulative Disk I/O value: unknown
>>>>>>>         name: Cumulative CPU value: unknown
>>>>>>>     compiler_hints:
>>>>>>>         name: Output Size (bytes) value: none
>>>>>>>         name: Output Cardinality value: none
>>>>>>>         name: Avg. Output Record Size (bytes) value: none
>>>>>>>         name: Filter Factor value: none
>>>>>>>
>>>>>>> On Tue, Dec 5, 2017 at 5:36 PM, Fabian Hueske wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Flink's operators are designed to work in memory as long as possible
>>>>>>>> and spill to disk once the memory budget is exceeded.
>>>>>>>> Moreover, Flink aims to run programs in a pipelined fashion, such that
>>>>>>>> multiple operators can process data at the same time.
>>>>>>>> This behavior can make it a bit tricky to analyze the runtime behavior
>>>>>>>> and progress of operators.
>>>>>>>>
>>>>>>>> It would be interesting to have a look at the execution plan for the
>>>>>>>> program that you are running.
>>>>>>>> The plan can be obtained from the ExecutionEnvironment by calling
>>>>>>>> env.getExecutionPlan() instead of env.execute().
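A minimal sketch of that plan dump, for reference; nothing here is specific to the job in this thread, and the plan is only available once at least one sink has been defined:

    import org.apache.flink.api.java.ExecutionEnvironment;

    // inside a main(String[] args) throws Exception
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // ... assemble the full job here: sources, transformations, sinks ...

    // Print the JSON plan instead of running the job; getExecutionPlan()
    // takes the place of the env.execute() call for this run.
    System.out.println(env.getExecutionPlan());

The JSON that comes out is what was summarized as yaml above, and it can be pasted into the plan visualizer mentioned earlier in the thread.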
>>>>>>>>
>>>>>>>> I would also like to know how you track the progress of the program.
>>>>>>>> Are you looking at the record counts displayed in the WebUI?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Fabian
>>>>>>>>
>>>>>>>> 2017-12-05 22:03 GMT+01:00 Garrett Barton:
>>>>>>>>
>>>>>>>>> I have been moving some old MR and Hive workflows into Flink because
>>>>>>>>> I'm enjoying the APIs, and the ease of development is wonderful.
>>>>>>>>> Things have largely worked great until I tried to really scale some
>>>>>>>>> of the jobs recently.
>>>>>>>>>
>>>>>>>>> I have, for example, one ETL job that reads in about 12B records at a
>>>>>>>>> time and does a sort, some simple transformations, validation, a
>>>>>>>>> re-partition and then output to a Hive table.
>>>>>>>>> When I built it with the sample set, ~200M, it worked great: it took
>>>>>>>>> maybe a minute and blew through it.
>>>>>>>>>
>>>>>>>>> What I have observed is that there is some kind of saturation
>>>>>>>>> reached, depending on the number of slots, the number of nodes and
>>>>>>>>> the overall size of data to move. When I run the 12B set, the first
>>>>>>>>> 1B go through in under 1 minute, really really fast. But it's an
>>>>>>>>> extremely sharp drop-off after that: the next 1B might take 15
>>>>>>>>> minutes, and then if I wait for the next 1B, it's well over an hour.
>>>>>>>>>
>>>>>>>>> What I can't find is any obvious indicator or thing to look at;
>>>>>>>>> everything just grinds to a halt. I don't think the job would ever
>>>>>>>>> actually complete.
>>>>>>>>>
>>>>>>>>> Is there something in the design of Flink in batch mode that is
>>>>>>>>> perhaps memory bound? Adding more nodes/tasks does not fix it, it
>>>>>>>>> just gets me a little further along. I'm already running around
>>>>>>>>> ~1,400 slots at this point; I'd postulate needing 10,000+ to
>>>>>>>>> potentially make the job run, but that's too much of my cluster gone,
>>>>>>>>> and I have yet to get Flink to be stable past 1,500.
>>>>>>>>>
>>>>>>>>> Any ideas on where to look, or what to debug? The GUI is also very
>>>>>>>>> cumbersome to use at this slot count, so other measurement ideas are
>>>>>>>>> welcome too!
>>>>>>>>>
>>>>>>>>> Thank you all.
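Tying the posted plan back to code: below is a rough, purely illustrative skeleton of a pipeline with the same shape (source, FlatMap, grouped sort, first(n), sink). Every type, field index and the in-memory source/print sink are placeholders; the real job reads from and writes to Hive-backed Hadoop formats, the trailing Map steps of the posted plan are omitted, and the actual transformations never appear in this thread.

    import org.apache.flink.api.common.ExecutionMode;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.operators.Order;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.util.Collector;

    public class EtlSkeleton {

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            env.getConfig().setExecutionMode(ExecutionMode.BATCH); // the switch discussed above

            // Placeholder source; the real job uses a Hive-backed input format.
            DataSet<Tuple3<String, String, String>> input = env.fromElements(
                    new Tuple3<>("k1", "a", "x"),
                    new Tuple3<>("k1", "b", "x"),
                    new Tuple3<>("k2", "c", "y"));

            DataSet<Tuple3<String, String, String>> result = input
                    .flatMap(new FlatMapFunction<Tuple3<String, String, String>, Tuple3<String, String, String>>() {
                        @Override
                        public void flatMap(Tuple3<String, String, String> value,
                                            Collector<Tuple3<String, String, String>> out) {
                            out.collect(value); // parsing / validation / transformation would go here
                        }
                    })
                    .groupBy(0, 2)                 // the keys the posted plan hash-partitions on
                    .sortGroup(1, Order.ASCENDING) // ordering within each group
                    .first(1);                     // the "GroupReduce at first(...)" node in the plan

            // Placeholder sink; the real job writes through a HadoopOutputFormat into Hive.
            result.print();
        }
    }

Letting groupBy/sortGroup drive the partitioning and sorting, as above, is the pattern Fabian contrasts with adding the partitioner and sorter to the plan explicitly.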