apex-dev mailing list archives

From Vlad Rozov <v.ro...@datatorrent.com>
Subject Re: Thread and Container locality
Date Mon, 28 Sep 2015 15:19:30 GMT
Ram,

The stream between operators in the CONTAINER_LOCAL case is InlineStream. 
InlineStream extends DefaultReservoir, which extends CircularBuffer. 
CircularBuffer does not use synchronized methods or locks; it uses 
volatile fields. My guess is that the volatile accesses cause CPU cache 
invalidation, and that this, together with memory locality (in the 
thread-local case a tuple is always local to both operators, while in the 
container-local case the second operator's thread may see the data 
significantly later than the first thread produced it), is what hurts 
CONTAINER_LOCAL performance. It is still quite surprising that the impact 
is so significant.
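
For illustration, here is a minimal sketch of the general pattern (this is 
not the actual netlet CircularBuffer; the class and names are made up): a 
single-producer/single-consumer ring buffer that avoids locks by publishing 
through volatile head/tail indices. Every put ends in a volatile write and 
every poll starts with a volatile read, which is exactly the kind of 
cross-core cache traffic suspected above.

----
// Minimal SPSC ring buffer sketch: lock-free, with volatile-published indices.
public class SpscRingBuffer<T>
{
  private final Object[] buffer;
  private final int mask;
  private volatile long head; // next slot to read; written only by the consumer
  private volatile long tail; // next slot to write; written only by the producer

  public SpscRingBuffer(int capacity)
  {
    if (Integer.bitCount(capacity) != 1) {
      throw new IllegalArgumentException("capacity must be a power of two");
    }
    buffer = new Object[capacity];
    mask = capacity - 1;
  }

  /** Producer thread only. Returns false when the buffer is full. */
  public boolean offer(T item)
  {
    long t = tail;
    if (t - head >= buffer.length) {
      return false; // full; the caller decides whether to spin or sleep
    }
    buffer[(int)(t & mask)] = item;
    tail = t + 1; // volatile write publishes the slot to the consumer
    return true;
  }

  /** Consumer thread only. Returns null when the buffer is empty. */
  @SuppressWarnings("unchecked")
  public T poll()
  {
    long h = head;
    if (h >= tail) { // volatile read; the consumer may lag the producer
      return null;
    }
    T item = (T)buffer[(int)(h & mask)];
    buffer[(int)(h & mask)] = null;
    head = h + 1; // volatile write frees the slot for the producer
    return item;
  }
}
----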

Thank you,

Vlad

On 9/27/15 16:45, Munagala Ramanath wrote:
> Vlad,
>
> That's a fascinating and counter-intuitive result. I wonder if some 
> internal synchronization is happening
> (maybe the stream between them is a shared data structure that is lock 
> protected) to
> slow down the 2 threads in the CONTAINER_LOCAL case. If they are both 
> going as fast as possible
> it is likely that they will be frequently blocked by the lock. If that 
> is indeed the case, some sort of lock
> striping or a near-lockless protocol for stream access should tilt the 
> balance in favor of CONTAINER_LOCAL.
>
> In the thread-local case of course there is no need for such locking.
>
> Ram
>
> On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <v.rozov@datatorrent.com> wrote:
>
>     Changed subject to reflect shift of discussion.
>
>     After I recompiled netlet and hardcoded a 0 wait time in the
>     CircularBuffer.put() method, I still see the same difference, even
>     when I increased operator memory to 10 GB and set "-D
>     dt.application.*.operator.*.attr.SPIN_MILLIS=0 -D
>     dt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU % is
>     close to 100% for both the thread-local and container-local
>     settings. Note that in the thread-local case the two operators share
>     100% CPU, while in the container-local case each gets its own 100%
>     load. It seems that container local will outperform thread local
>     only when the number of emitted tuples is (relatively) low, for
>     example when it is CPU costly to produce tuples (hash computations,
>     compression/decompression, aggregations, filtering with complex
>     expressions). In cases where an operator may emit 5 or more million
>     tuples per second, thread local may outperform container local even
>     when both operators are CPU intensive.
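
For reference, here is a sketch of applying the same two settings from 
application code instead of -D flags. It assumes the DAG attribute API 
(setAttribute / setInputPortAttribute); the operator and port arguments are 
placeholders, and treating QUEUE_CAPACITY as an input-port attribute is an 
assumption, not something stated in this thread.

----
import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.Context.PortContext;
import com.datatorrent.api.DAG;
import com.datatorrent.api.Operator;

// Sketch only: benchmark tuning applied in populateDAG() rather than via -D flags.
class BenchmarkTuning
{
  static void tune(DAG dag, Operator op, Operator.InputPort<?> inputPort)
  {
    dag.setAttribute(op, OperatorContext.SPIN_MILLIS, 0);                      // do not sleep when there is nothing to process
    dag.setInputPortAttribute(inputPort, PortContext.QUEUE_CAPACITY, 1024000); // larger buffer between the operators
  }
}
----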
>
>
>
>
>     Thank you,
>
>     Vlad
>
>     On 9/26/15 22:52, Timothy Farkas wrote:
>>     Hi Vlad,
>>
>>     I just took a look at the CircularBuffer. Why are threads polling the state
>>     of the buffer before doing operations? Couldn't polling be avoided entirely
>>     by using something like Condition variables to signal when the buffer is
>>     ready for an operation to be performed?
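
For reference, a minimal sketch of that alternative using the standard 
java.util.concurrent.locks pattern (this is not netlet code; the names are 
made up): put() and take() park on Condition variables and get signalled, 
instead of sleeping and re-checking the buffer state.

----
import java.util.ArrayDeque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Bounded buffer where "full" and "empty" waits are signalled instead of polled.
public class SignallingBuffer<T>
{
  private final ArrayDeque<T> queue = new ArrayDeque<>();
  private final int capacity;
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition notFull = lock.newCondition();
  private final Condition notEmpty = lock.newCondition();

  public SignallingBuffer(int capacity)
  {
    this.capacity = capacity;
  }

  public void put(T item) throws InterruptedException
  {
    lock.lock();
    try {
      while (queue.size() == capacity) {
        notFull.await(); // parked until a consumer frees a slot
      }
      queue.addLast(item);
      notEmpty.signal();
    } finally {
      lock.unlock();
    }
  }

  public T take() throws InterruptedException
  {
    lock.lock();
    try {
      while (queue.isEmpty()) {
        notEmpty.await(); // parked until a producer adds an item
      }
      T item = queue.removeFirst();
      notFull.signal();
      return item;
    } finally {
      lock.unlock();
    }
  }
}
----

The trade-off, relevant to the numbers discussed above, is that every put/take 
now acquires a lock, so under sustained contention this can end up slower than 
the lock-free volatile design.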
>>
>>     Tim
>>
>>     On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <v.rozov@datatorrent.com> wrote:
>>
>>>     After looking at a few stack traces, I think that in the benchmark
>>>     application the operators compete for the circular buffer that passes
>>>     slices from the emitter output to the consumer input, and the sleeps
>>>     that avoid busy waiting are too long for the benchmark operators. I
>>>     don't see a stack similar to the one below every time I take a thread
>>>     dump, but I see it often enough to suspect that the sleep is the root
>>>     cause. I'll recompile with a smaller sleep time and see how that
>>>     affects performance.
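
To make the suspected pattern concrete, here is a rough sketch of sleep-based 
backoff (it reuses the hypothetical SpscRingBuffer sketched earlier; spinMillis 
stands in for the wait interval and is not the actual netlet constant). A 
producer emitting millions of tuples per second pays the full sleep every time 
it runs ahead of the consumer.

----
// Sketch only: a put() that backs off by sleeping whenever the buffer is full.
final class BackoffPut
{
  static <T> void put(SpscRingBuffer<T> buffer, T item, long spinMillis)
      throws InterruptedException
  {
    while (!buffer.offer(item)) {
      Thread.sleep(spinMillis); // the producer parks for the full interval each time the buffer fills up
    }
  }
}
----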
>>>
>>>     ----
>>>     "1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000 nid=0x780f waiting on condition [0x00007f78abb17000]
>>>         java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>          at java.lang.Thread.sleep(Native Method)
>>>          at com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
>>>          at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
>>>          at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
>>>          at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
>>>          at com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
>>>          at com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
>>>          at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
>>>
>>>     "2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d waiting on condition [0x00007f78abc18000]
>>>         java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>          at java.lang.Thread.sleep(Native Method)
>>>          at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
>>>          at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
>>>
>>>     ----
>>>
>>>
>>>     On 9/26/15 20:59, Amol Kekre wrote:
>>>
>>>>     A good read -
>>>>     http://preshing.com/20111118/locks-arent-slow-lock-contention-is/
>>>>
>>>>     Though it does not explain the order of magnitude difference.
>>>>
>>>>     Amol
>>>>
>>>>
>>>>     On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <v.rozov@datatorrent.com> wrote:
>>>>
>>>>>     In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an
>>>>>     order of magnitude, and both operators compete for CPU. I'll take a
>>>>>     closer look at why.
>>>>>
>>>>>     Thank you,
>>>>>
>>>>>     Vlad
>>>>>
>>>>>
>>>>>     On 9/26/15 14:52, Thomas Weise wrote:
>>>>>
>>>>>>     THREAD_LOCAL - operators share a thread
>>>>>>     CONTAINER_LOCAL - each operator has its own thread
>>>>>>
>>>>>>     So as long as the operators utilize the CPU sufficiently (compete),
>>>>>>     the latter will perform better.
>>>>>>
>>>>>>     There will be cases where a single thread can accommodate multiple
>>>>>>     operators. For example, a socket reader (mostly waiting for IO) and
>>>>>>     a decompress operator (CPU hungry) can share a thread.
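
For context, here is a sketch of how the two cases are selected when wiring an 
Apex application (the stream name and port arguments are placeholders): the 
locality set on a stream decides whether its operators share a thread, share a 
container, or run separately.

----
import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.Operator;

// Sketch only: stream locality selects between the cases compared in this thread.
class LocalityExample
{
  static void wire(DAG dag, Operator.OutputPort<Object> out, Operator.InputPort<Object> in)
  {
    dag.addStream("words", out, in)
        .setLocality(Locality.THREAD_LOCAL); // both operators run in one thread
    // Locality.CONTAINER_LOCAL: separate threads inside the same JVM/container
    // no locality set: each operator may be deployed to its own container
  }
}
----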
>>>>>>
>>>>>>     But to get back to the original question, stream locality generally
>>>>>>     does not reduce the total memory requirement. If you add multiple
>>>>>>     operators into one container, that container will also require more
>>>>>>     memory, and that is how the container size is calculated in the
>>>>>>     physical plan. You may get some extra mileage when multiple operators
>>>>>>     share the same heap, but the need to identify the memory requirement
>>>>>>     per operator does not go away.
>>>>>>
>>>>>>     Thomas
>>>>>>
>>>>>>
>>>>>>     On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <ram@datatorrent.com> wrote:
>>>>>>
>>>>>>>     Would CONTAINER_LOCAL achieve the same thing and perform a little
>>>>>>>     better on a multi-core box?
>>>>>>>
>>>>>>>     Ram
>>>>>>>
>>>>>>>     On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <sandeep@datatorrent.com> wrote:
>>>>>>>
>>>>>>>>     Yes, with this approach only two containers are required: one for
>>>>>>>>     stram and another for all operators. You can easily fit around 10
>>>>>>>>     operators in less than 1GB.
>>>>>>>>
>>>>>>>>     On 27 Sep 2015 00:32, "Timothy Farkas" <tim@datatorrent.com> wrote:
>>>>>>>>
>>>>>>>>>     Hi Ram,
>>>>>>>>>
>>>>>>>>>     You could make all the operators thread local. This cuts down on
>>>>>>>>>     the overhead of separate containers and maximizes the memory
>>>>>>>>>     available to each operator.
>>>>>>>>>
>>>>>>>>>     Tim
>>>>>>>>>
>>>>>>>>>     On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <ram@datatorrent.com> wrote:
>>>>>>>>
>>>>>>>>>>     Hi,
>>>>>>>>>>
>>>>>>>>>>     I was running into memory issues when deploying my app on the
>>>>>>>>>>     sandbox, where all the operators were stuck forever in the PENDING
>>>>>>>>>>     state because they were being continually aborted and restarted
>>>>>>>>>>     because of the limited memory on the sandbox. After some
>>>>>>>>>>     experimentation, I found that the following config values seem to
>>>>>>>>>>     work:
>>>>>>>>>>     ------------------------------------------
>>>>>>>>>>     <https://datatorrent.slack.com/archives/engineering/p1443263607000010>
>>>>>>>>>>     <property>
>>>>>>>>>>       <name>dt.attr.MASTER_MEMORY_MB</name>
>>>>>>>>>>       <value>500</value>
>>>>>>>>>>     </property>
>>>>>>>>>>     <property>
>>>>>>>>>>       <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
>>>>>>>>>>       <value>200</value>
>>>>>>>>>>     </property>
>>>>>>>>>>     <property>
>>>>>>>>>>       <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
>>>>>>>>>>       <value>512</value>
>>>>>>>>>>     </property>
>>>>>>>>>>     ------------------------------------------------
>>>>>>>>>>     Are these reasonable values? Is there a more systematic way of
>>>>>>>>>>     coming up with these values than trial-and-error? Most of my
>>>>>>>>>>     operators -- with the exception of fileWordCount -- need very
>>>>>>>>>>     little memory; is there a way to cut all values down to the bare
>>>>>>>>>>     minimum and maximize available memory for this one operator?
>>>>>>>>>>
>>>>>>>>>>     Thanks.
>>>>>>>>>>
>>>>>>>>>>     Ram
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>
>

