beam-user mailing list archives

From Carlos Alonso <>
Subject Re: Chasing OOM errors
Date Tue, 30 Jan 2018 18:11:38 GMT
Not all dumps are 1GB; here you can see a couple more of them with bigger sizes.

A couple of job ids: 2018-01-30_02_57_28-751284895952373783
and 2018-01-29_03_09_28-5756483832988685011

About the relevant parts of my code, here: you can
see the most relevant bits with comments. I hope that is easy to
understand; let me know otherwise.

Thanks for your help!!

[image: Screen Shot 2018-01-30 at 18.27.11.png]
[image: Screen Shot 2018-01-30 at 18.26.07.png]

On Mon, Jan 29, 2018 at 10:53 PM Eugene Kirpichov <>
wrote:
> Regarding how dynamic writes work: it's considerably more complex than
> just using the destination as the key; it depends also on how you configure
> your sharding, how many destinations there are, etc. - see the code (it
> is probably the second most complex transform in all of Beam, second only
> to BigQueryIO.write()...).
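The keying scheme Eugene describes can be approximated outside Beam. Here is a minimal plain-Python sketch (not the actual WriteFiles implementation; `destination_fn` and `num_shards` are hypothetical stand-ins for the user-supplied destination function and sharding configuration) of grouping elements by (destination, shard) before writing:

```python
import random
from collections import defaultdict

def group_for_dynamic_writes(elements, destination_fn, num_shards):
    """Group elements by (destination, shard), roughly as a dynamic-
    destination file write does before producing one file per group."""
    groups = defaultdict(list)
    for element in elements:
        destination = destination_fn(element)
        shard = random.randrange(num_shards)  # the runner picks a shard
        groups[(destination, shard)].append(element)
    return groups  # each key eventually becomes one output file

# More destinations (or more shards) mean more keys, so the grouping
# can spread across more workers.
events = [{"type": "click"}, {"type": "view"}, {"type": "click"}]
grouped = group_for_dynamic_writes(events, lambda e: e["type"], num_shards=1)
```

This is only the grouping step; the real transform also handles spilling, windowing, and finalization, which is why it is far more complex than a plain reduce-by-key.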
> On Mon, Jan 29, 2018 at 1:50 PM Eugene Kirpichov <>
> wrote:
>> Hmm, it's weird: this heap dump seems to be just 1GB, which is way
>> below the limit your workers should have. Are the dumps all small like that?
>> Can you share a Dataflow job ID and some relevant part of your code?
>> On Mon, Jan 29, 2018 at 12:02 PM Carlos Alonso <>
>> wrote:
>>> We're using n1-standard-4 (4 vCPUs, 15 GB memory) machines (the default
>>> ones). And please, find the dominator tree view of one of our heap dumps.
>>> About the code for dynamic writes... could you quickly summarise what
>>> it does? From what I've seen diving into the code, I think I saw a
>>> reduce-by-key operation that I guessed uses the file's path as the key.
>>> Is that correct? Does that mean that the more files there are, the more
>>> the work can be parallelised?
>>> Thanks!
>>> [image: Screen Shot 2018-01-29 at 20.57.15.png]
>>> On Mon, Jan 29, 2018 at 8:04 PM Eugene Kirpichov <>
>>> wrote:
>>>> The code for doing dynamic writes tries to limit its memory usage, so
>>>> it wouldn't be my first place to look (but I wouldn't rule it out). Are you
>>>> using workers with a large number of cores or threads per worker?
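The worker shape matters because buffered state is held per concurrent bundle. A back-of-the-envelope sketch, using illustrative assumptions (thread count from the n1-standard-4 machines and the 16MB buffer cap mentioned elsewhere in the thread; the per-thread key count is a guess):

```python
# Rough worst-case memory held in in-flight buffers on one worker.
# All numbers are illustrative assumptions, not measured values.
threads_per_worker = 4           # e.g. one harness thread per vCPU
buffer_limit_mb = 16             # the pipeline's per-key buffer cap
concurrent_keys_per_thread = 50  # ~one buffered key per message type

worst_case_mb = threads_per_worker * buffer_limit_mb * concurrent_keys_per_thread
# 4 * 16 * 50 = 3200 MB in buffers alone, before any other pipeline state
```

The point of the exercise: per-key limits multiply across threads and keys, so a seemingly modest buffer cap can still dominate a 15GB machine.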
>>>> In MAT or YourKit, the most useful tool is the Dominator Tree. Can you
>>>> paste a screenshot of the dominator tree expanded to some reasonable depth?
>>>> On Mon, Jan 29, 2018, 10:45 AM Carlos Alonso <>
>>>> wrote:
>>>>> Hi Eugene!
>>>>> Thanks for your comments. Yes, we've downloaded a couple of dumps but,
>>>>> TBH, couldn't understand anything (using Eclipse MAT). I was wondering
>>>>> if the trace, and the fact that "the smaller the buffer, the more OOM
>>>>> errors", could give any of you a hint, as I think it may be on the
>>>>> writing part...
>>>>> Do you know how the dynamic writes are distributed on workers? Based
>>>>> on the full path?
>>>>> Regards
>>>>> On Mon, Jan 29, 2018 at 7:38 PM Eugene Kirpichov <>
>>>>> wrote:
>>>>>> Hi, I assume you're using the Dataflow runner. Have you tried using
>>>>>> the OOM debugging flags at
>>>>>>  ?
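The flags Eugene refers to are passed at pipeline launch. A hedged sketch of what such an invocation might look like (`--dumpHeapOnOOM` and `--saveHeapDumpsToGcsPath` are Dataflow debug options in the Java SDK; the bucket path is hypothetical, and you should check the linked page for the exact flags your SDK version supports):

```python
# Illustrative launch arguments for a Dataflow pipeline with
# OOM debugging enabled; assembled as a list for clarity.
args = [
    "--runner=DataflowRunner",
    "--dumpHeapOnOOM=true",                               # dump heap on OutOfMemoryError
    "--saveHeapDumpsToGcsPath=gs://my-bucket/heap-dumps",  # hypothetical bucket
]
```

With these set, the worker uploads a heap dump to GCS when it OOMs, which is where dumps like the ones discussed in this thread come from.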
>>>>>> On Mon, Jan 29, 2018 at 2:05 AM Carlos Alonso <>
>>>>>> wrote:
>>>>>>> Hi everyone!!
>>>>>>> In our pipeline we're experiencing OOM errors. We've been trying to
>>>>>>> nail them down for a while now, but without any luck.
>>>>>>> Our pipeline:
>>>>>>> 1. Reads messages from PubSub of about 50 different types.
>>>>>>> 2. Outputs KV elements, the key being the type and the value the
>>>>>>> message itself (JSON).
>>>>>>> 3. Applies windowing (fixed windows of one hour, with early firings
>>>>>>> one minute after processing the first element in the pane), two days'
>>>>>>> allowed lateness, and discarding fired panes.
>>>>>>> 4. Buffers the elements (using stateful and timely processing) for 15
>>>>>>> minutes or until the buffer reaches a maximum size of 16MB. This
>>>>>>> step's objective is to avoid the window's panes growing too big.
>>>>>>> 5. Writes the outputted elements to files in GCS using dynamic
>>>>>>> routing, the route being: type/window/buffer_index-paneInfo.
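The size-or-time buffering in step 4 can be sketched as a simple flush policy. A minimal plain-Python model (not the actual stateful DoFn; the `flush` callback and `clock` parameter are assumptions for illustration):

```python
import time

class SizeOrTimeBuffer:
    """Flush when the buffer reaches max_bytes, or when max_age_seconds
    have passed since the first element, mirroring the 15-minute / 16MB
    policy described above."""
    def __init__(self, max_bytes=16 * 1024 * 1024, max_age_seconds=15 * 60,
                 flush=print, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.max_age = max_age_seconds
        self.flush_fn = flush
        self.clock = clock
        self.items, self.size, self.started = [], 0, None

    def add(self, payload: bytes):
        if self.started is None:
            self.started = self.clock()  # the age timer starts at the first element
        self.items.append(payload)
        self.size += len(payload)
        if self.size >= self.max_bytes or self.clock() - self.started >= self.max_age:
            self.flush()

    def flush(self):
        if self.items:
            self.flush_fn(self.items)  # e.g. hand the batch to the GCS writer
        self.items, self.size, self.started = [], 0, None  # reset for the next batch
```

In the real pipeline the buffered elements live in Beam state and the age check is a timer, but the flush conditions are the same: whichever of size or age is hit first triggers the write.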
>>>>>>> Our first step was without buffering, just windows with early and
>>>>>>> late firings on 15 minutes but, guessing that the OOMs were because
>>>>>>> of panes growing too big, we built that buffering step to trigger on
>>>>>>> size as well.
>>>>>>> The full trace we're seeing can be found here:
>>>>>>> There's an annoying thing we've experienced: the smaller the buffer
>>>>>>> size, the more OOM errors we see, which was a bit disappointing...
>>>>>>> Can anyone please give us any hint?
>>>>>>> Thanks in advance!
