apex-dev mailing list archives

From Vlad Rozov <v.ro...@datatorrent.com>
Subject Re: Thread and Container locality
Date Mon, 28 Sep 2015 17:51:15 GMT
I created a simple test to check how quickly Java can count to 
Integer.MAX_VALUE. The result that I see is consistent with the 
CONTAINER_LOCAL behavior:

counting long in a single thread: 0.9 sec
counting volatile long in a single thread: 17.7 sec
counting volatile long shared between two threads: 186.3 sec
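
In sketch form, the test looks like this (names are illustrative; the shared case deliberately tolerates lost updates, since only the timing matters):

```java
// Sketch of the counting benchmark: plain long vs. volatile long vs. a
// volatile long incremented from two threads. Illustrative, not the exact
// test code.
public class VolatileCountBench {
    static long plain;
    static volatile long shared;

    // Case 1: plain long, single thread.
    static long countPlain(long n) {
        plain = 0;
        for (long i = 0; i < n; i++) {
            plain++;
        }
        return plain;
    }

    // Case 2: volatile long, single thread. Every increment is a volatile
    // read plus a volatile write that must go through the memory system.
    static long countVolatile(long n) {
        shared = 0;
        for (long i = 0; i < n; i++) {
            shared++;
        }
        return shared;
    }

    // Case 3: volatile long shared between two threads. The cache line
    // holding the counter ping-pongs between the two cores. The increment
    // is not atomic, so updates can be lost; fine for a timing demo.
    static long countShared(long n) throws InterruptedException {
        shared = 0;
        Runnable half = () -> {
            for (long i = 0; i < n / 2; i++) {
                shared++;
            }
        };
        Thread a = new Thread(half);
        Thread b = new Thread(half);
        a.start(); b.start();
        a.join(); b.join();
        return shared;
    }

    public static void main(String[] args) throws InterruptedException {
        long n = Integer.MAX_VALUE;  // 2^31 - 1, as in the timings above
        long t0 = System.nanoTime();
        countPlain(n);
        System.out.printf("plain:    %.1f sec%n", (System.nanoTime() - t0) / 1e9);
        t0 = System.nanoTime();
        countVolatile(n);
        System.out.printf("volatile: %.1f sec%n", (System.nanoTime() - t0) / 1e9);
        t0 = System.nanoTime();
        countShared(n);
        System.out.printf("shared:   %.1f sec%n", (System.nanoTime() - t0) / 1e9);
    }
}
```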

I suggest that we look into the lock-free algorithms described in 
https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf

or a similar approach.
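
The core trick from those slides, applied to a single-producer/single-consumer ring buffer, is that each side keeps a cached copy of the other side's index and only re-reads the volatile index when the buffer looks full or empty. A minimal sketch under those assumptions (illustrative, not the current CircularBuffer code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Thompson-style single-producer/single-consumer ring buffer. The cached
// index copies cut cross-core cache-coherence traffic on the hot path.
public class SpscRingBuffer<T> {
    private final Object[] buffer;
    private final int mask;                           // capacity is a power of two
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write
    private long cachedHead;                          // producer's view of head
    private long cachedTail;                          // consumer's view of tail

    public SpscRingBuffer(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        buffer = new Object[capacity];
        mask = capacity - 1;
    }

    // Called only from the producer thread.
    public boolean offer(T e) {
        long t = tail.get();
        if (t - cachedHead >= buffer.length) {
            cachedHead = head.get();          // refresh only when it looks full
            if (t - cachedHead >= buffer.length) {
                return false;                 // really full
            }
        }
        buffer[(int) t & mask] = e;
        tail.lazySet(t + 1);                  // ordered store, no full fence
        return true;
    }

    // Called only from the consumer thread.
    @SuppressWarnings("unchecked")
    public T poll() {
        long h = head.get();
        if (h >= cachedTail) {
            cachedTail = tail.get();          // refresh only when it looks empty
            if (h >= cachedTail) {
                return null;                  // really empty
            }
        }
        T e = (T) buffer[(int) h & mask];
        buffer[(int) h & mask] = null;        // let GC reclaim the slot
        head.lazySet(h + 1);
        return e;
    }
}
```

A production version would also pad the fields so head and tail do not share a cache line, which the sketch omits for brevity.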

Thank you,

Vlad


On 9/28/15 08:19, Vlad Rozov wrote:
> Ram,
>
> The stream between operators in the CONTAINER_LOCAL case is 
> InlineStream. InlineStream extends DefaultReservoir, which extends 
> CircularBuffer. CircularBuffer does not use synchronized methods or 
> locks; it uses volatile. I suspect that the volatile accesses cause CPU 
> cache invalidation, and together with memory locality (in the thread 
> local case a tuple is always local to both threads, while in the 
> container local case the second operator's thread may see the data 
> significantly later than the first thread produced it) these two 
> factors negatively impact CONTAINER_LOCAL performance. It is still 
> quite surprising that the impact is so significant.
>
> Thank you,
>
> Vlad
>
> On 9/27/15 16:45, Munagala Ramanath wrote:
>> Vlad,
>>
>> That's a fascinating and counter-intuitive result. I wonder if some 
>> internal synchronization is happening
>> (maybe the stream between them is a shared data structure that is 
>> lock protected) to
>> slow down the 2 threads in the CONTAINER_LOCAL case. If they are both 
>> going as fast as possible
>> it is likely that they will be frequently blocked by the lock. If 
>> that is indeed the case, some sort of lock
>> striping or a near-lockless protocol for stream access should tilt 
>> the balance in favor of CONTAINER_LOCAL.
>>
>> In the thread-local case of course there is no need for such locking.
>>
>> Ram
>>
>> On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <v.rozov@datatorrent.com> 
>> wrote:
>>
>>     Changed subject to reflect shift of discussion.
>>
>>     After I recompiled netlet and hardcoded 0 wait time in the
>>     CircularBuffer.put() method, I still see the same difference even
>>     when I increased operator memory to 10 GB and set "-D
>>     dt.application.*.operator.*.attr.SPIN_MILLIS=0 -D
>>     dt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU %
>>     is close to 100% both for thread and container local locality
>>     settings. Note that in thread local two operators share 100% CPU,
>>     while in container local each gets its own 100% load. It sounds
>>     that container local will outperform thread local only when
>>     number of emitted tuples is (relatively) low, for example when it
>>     is CPU costly to produce tuples (hash computations,
>>     compression/decompression, aggregations, filtering with complex
>>     expressions). In cases where operator may emit 5 or more million
>>     tuples per second, thread local may outperform container local
>>     even when both operators are CPU intensive.
>>
>>
>>
>>
>>     Thank you,
>>
>>     Vlad
>>
>>     On 9/26/15 22:52, Timothy Farkas wrote:
>>>     Hi Vlad,
>>>
>>>     I just took a look at the CircularBuffer. Why are threads polling the state
>>>     of the buffer before doing operations? Couldn't polling be avoided entirely
>>>     by using something like Condition variables to signal when the buffer is
>>>     ready for an operation to be performed?
>>>
>>>     Tim
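
For reference, the condition-based alternative Tim describes would look roughly like this (an illustrative sketch built on java.util.concurrent locks, not a patch against CircularBuffer):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Bounded buffer that blocks on Condition variables instead of polling
// with sleeps. Names and structure are illustrative.
public class BlockingRingBuffer<T> {
    private final Object[] buffer;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notFull = lock.newCondition();
    private final Condition notEmpty = lock.newCondition();
    private int head, tail, count;

    public BlockingRingBuffer(int capacity) {
        buffer = new Object[capacity];
    }

    public void put(T e) throws InterruptedException {
        lock.lock();
        try {
            while (count == buffer.length) {
                notFull.await();              // park until a slot frees up
            }
            buffer[tail] = e;
            tail = (tail + 1) % buffer.length;
            count++;
            notEmpty.signal();                // wake a waiting consumer
        } finally {
            lock.unlock();
        }
    }

    @SuppressWarnings("unchecked")
    public T take() throws InterruptedException {
        lock.lock();
        try {
            while (count == 0) {
                notEmpty.await();             // park until an element arrives
            }
            T e = (T) buffer[head];
            buffer[head] = null;
            head = (head + 1) % buffer.length;
            count--;
            notFull.signal();                 // wake a waiting producer
            return e;
        } finally {
            lock.unlock();
        }
    }
}
```

The trade-off is that every put/take now acquires a lock, which under heavy contention can cost more than the volatile-based design this thread is measuring.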
>>>
>>>     On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <v.rozov@datatorrent.com>
>>>     wrote:
>>>
>>>>     After looking at a few stack traces, I think that in the benchmark
>>>>     application the operators compete for the circular buffer that passes
>>>>     slices from the emitter output to the consumer input, and the sleeps
>>>>     that avoid busy waiting are too long for the benchmark operators. I
>>>>     don't see a stack similar to the one below every time I take a thread
>>>>     dump, but still often enough to suspect that the sleep is the root
>>>>     cause. I'll recompile with a smaller sleep time and see how this
>>>>     affects performance.
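
A common middle ground between busy waiting and a fixed sleep is a graduated backoff: spin briefly, then yield, then fall back to short timed sleeps. Roughly (an illustrative sketch with arbitrary thresholds, not the netlet code):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Graduated backoff: spin, then yield, then short timed sleeps.
// Thresholds are arbitrary, not taken from netlet.
public class Backoff {
    static final int SPIN_LIMIT = 1000;
    static final int YIELD_LIMIT = 1100;

    // Wait until the condition holds, escalating the wait strategy.
    public static void awaitUntil(BooleanSupplier ready) throws InterruptedException {
        int attempts = 0;
        while (!ready.getAsBoolean()) {
            if (attempts < SPIN_LIMIT) {
                Thread.onSpinWait();               // busy-spin (Java 9+ hint)
            } else if (attempts < YIELD_LIMIT) {
                Thread.yield();                    // give up the CPU slice
            } else {
                TimeUnit.MICROSECONDS.sleep(100);  // last resort: timed sleep
            }
            attempts++;
        }
    }
}
```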
>>>>
>>>>     ----
>>>>     "1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000
>>>>     nid=0x780f waiting on condition [0x00007f78abb17000]
>>>>         java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>>          at java.lang.Thread.sleep(Native Method)
>>>>          at
>>>>     com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
>>>>          at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
>>>>          at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
>>>>          at
>>>>     com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
>>>>          at
>>>>     com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
>>>>          at com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
>>>>          at
>>>>     com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
>>>>
>>>>     "2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d
>>>>     waiting on condition [0x00007f78abc18000]
>>>>         java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>>          at java.lang.Thread.sleep(Native Method)
>>>>          at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
>>>>          at
>>>>     com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
>>>>
>>>>     ----
>>>>
>>>>
>>>>     On 9/26/15 20:59, Amol Kekre wrote:
>>>>
>>>>>     A good read -
>>>>>     http://preshing.com/20111118/locks-arent-slow-lock-contention-is/
>>>>>
>>>>>     Though it does not explain an order-of-magnitude difference.
>>>>>
>>>>>     Amol
>>>>>
>>>>>
>>>>>     On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <v.rozov@datatorrent.com>
>>>>>     wrote:
>>>>>
>>>>>>     In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an
>>>>>>     order of magnitude and both operators compete for CPU. I'll take a
>>>>>>     closer look why.
>>>>>>
>>>>>>     Thank you,
>>>>>>
>>>>>>     Vlad
>>>>>>
>>>>>>
>>>>>>     On 9/26/15 14:52, Thomas Weise wrote:
>>>>>>
>>>>>>>     THREAD_LOCAL - operators share a thread
>>>>>>>     CONTAINER_LOCAL - each operator has its own thread
>>>>>>>
>>>>>>>     So as long as operators utilize the CPU sufficiently (compete), the
>>>>>>>     latter will perform better.
>>>>>>>
>>>>>>>     There will be cases where a single thread can accommodate multiple
>>>>>>>     operators. For example, a socket reader (mostly waiting for IO) and a
>>>>>>>     decompress (CPU hungry) can share a thread.
>>>>>>>
>>>>>>>     But to get back to the original question, stream locality does generally
>>>>>>>     not reduce the total memory requirement. If you add multiple operators
>>>>>>>     into one container, that container will also require more memory and
>>>>>>>     that's how the container size is calculated in the physical plan. You
>>>>>>>     may get some extra mileage when multiple operators share the same heap,
>>>>>>>     but the need to identify the memory requirement per operator does not
>>>>>>>     go away.
>>>>>>>
>>>>>>>     Thomas
>>>>>>>
>>>>>>>
>>>>>>>     On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath
>>>>>>>     <ram@datatorrent.com> wrote:
>>>>>>>
>>>>>>>>     Would CONTAINER_LOCAL achieve the same thing and perform a little
>>>>>>>>     better on a multi-core box ?
>>>>>>>>
>>>>>>>>     Ram
>>>>>>>>
>>>>>>>>     On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh
>>>>>>>>     <sandeep@datatorrent.com> wrote:
>>>>>>>>
>>>>>>>>>     Yes, with this approach only two containers are required: one for
>>>>>>>>>     stram and another for all operators. You can easily fit around
>>>>>>>>>     10 operators in less than 1GB.
>>>>>>>>>     On 27 Sep 2015 00:32, "Timothy Farkas" <tim@datatorrent.com> wrote:
>>>>>>>>>
>>>>>>>>>>     Hi Ram,
>>>>>>>>>>
>>>>>>>>>>     You could make all the operators thread local. This cuts down on
>>>>>>>>>>     the overhead of separate containers and maximizes the memory
>>>>>>>>>>     available to each operator.
>>>>>>>>>>
>>>>>>>>>>     Tim
>>>>>>>>>>
>>>>>>>>>>     On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath
>>>>>>>>>>     <ram@datatorrent.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>     Hi,
>>>>>>>>>>>
>>>>>>>>>>>     I was running into memory issues when deploying my app on the
>>>>>>>>>>>     sandbox where all the operators were stuck forever in the PENDING
>>>>>>>>>>>     state because they were being continually aborted and restarted
>>>>>>>>>>>     because of the limited memory on the sandbox. After some
>>>>>>>>>>>     experimentation, I found that the following config values seem
>>>>>>>>>>>     to work:
>>>>>>>>>>>     ------------------------------------------
>>>>>>>>>>>     <property>
>>>>>>>>>>>       <name>dt.attr.MASTER_MEMORY_MB</name>
>>>>>>>>>>>       <value>500</value>
>>>>>>>>>>>     </property>
>>>>>>>>>>>     <property>
>>>>>>>>>>>       <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
>>>>>>>>>>>       <value>200</value>
>>>>>>>>>>>     </property>
>>>>>>>>>>>     <property>
>>>>>>>>>>>       <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
>>>>>>>>>>>       <value>512</value>
>>>>>>>>>>>     </property>
>>>>>>>>>>>     ------------------------------------------------
>>>>>>>>>>>     Are these reasonable values ? Is there a more systematic way of
>>>>>>>>>>>     coming up with these values than trial-and-error ? Most of my
>>>>>>>>>>>     operators -- with the exception of fileWordCount -- need very
>>>>>>>>>>>     little memory; is there a way to cut all values down to the bare
>>>>>>>>>>>     minimum and maximize available memory for this one operator ?
>>>>>>>>>>>
>>>>>>>>>>>     Thanks.
>>>>>>>>>>>
>>>>>>>>>>>     Ram
>>
>>
>

