apex-dev mailing list archives

From Vlad Rozov <v.ro...@datatorrent.com>
Subject Thread and Container locality
Date Sun, 27 Sep 2015 19:17:50 GMT
Changed subject to reflect shift of discussion.

After I recompiled netlet and hardcoded a 0 wait time in the 
CircularBuffer.put() method, I still see the same difference, even after 
increasing operator memory to 10 GB and setting "-D 
dt.application.*.operator.*.attr.SPIN_MILLIS=0 -D 
dt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU usage is close 
to 100% for both the thread local and container local locality settings. Note 
that with thread local the two operators share a single 100% CPU, while with 
container local each gets its own 100% load. It seems that container local 
will outperform thread local only when the number of emitted tuples is 
(relatively) low, for example when it is CPU costly to produce tuples (hash 
computations, compression/decompression, aggregations, filtering with complex 
expressions). In cases where an operator may emit 5 million or more tuples 
per second, thread local may outperform container local even when both 
operators are CPU intensive.
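
For reference, this is roughly how the two localities are chosen when wiring 
the benchmark DAG (a minimal sketch, not the actual benchmark application; 
the port names on the two operators are assumptions):

----
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.benchmark.RandomWordInputModule;
import com.datatorrent.benchmark.WordCountOperator;

public class LocalityBenchmarkApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    RandomWordInputModule wordGenerator = dag.addOperator("wordGenerator", new RandomWordInputModule());
    WordCountOperator counter = dag.addOperator("counter", new WordCountOperator());

    // THREAD_LOCAL: both operators run in the same thread and share its 100% CPU.
    // CONTAINER_LOCAL: same JVM, but each operator runs in its own thread, so each
    // can use its own 100% CPU while tuples cross a queue between the threads.
    dag.addStream("words", wordGenerator.output, counter.input)
      .setLocality(Locality.THREAD_LOCAL);
  }
}
----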




Thank you,

Vlad

On 9/26/15 22:52, Timothy Farkas wrote:
> Hi Vlad,
>
> I just took a look at the CircularBuffer. Why are threads polling the state
> of the buffer before doing operations? Couldn't polling be avoided entirely
> by using something like Condition variables to signal when the buffer is
> ready for an operation to be performed?
>
> Tim
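
For illustration, a rough sketch of the Condition-based signaling suggested 
above, using a simplified bounded buffer as a stand-in (this is not the 
netlet CircularBuffer):

----
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Simplified bounded buffer that blocks on Condition variables instead of
// polling or sleeping. Producers wait on notFull, consumers wait on notEmpty.
public class SignallingBuffer<T>
{
  private final Object[] buffer;
  private int head, tail, count;
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition notFull = lock.newCondition();
  private final Condition notEmpty = lock.newCondition();

  public SignallingBuffer(int capacity)
  {
    buffer = new Object[capacity];
  }

  public void put(T item) throws InterruptedException
  {
    lock.lock();
    try {
      while (count == buffer.length) {
        notFull.await();               // block until a consumer frees a slot
      }
      buffer[tail] = item;
      tail = (tail + 1) % buffer.length;
      count++;
      notEmpty.signal();               // wake a waiting consumer
    } finally {
      lock.unlock();
    }
  }

  @SuppressWarnings("unchecked")
  public T take() throws InterruptedException
  {
    lock.lock();
    try {
      while (count == 0) {
        notEmpty.await();              // block until a producer adds an item
      }
      T item = (T)buffer[head];
      buffer[head] = null;
      head = (head + 1) % buffer.length;
      count--;
      notFull.signal();                // wake a waiting producer
      return item;
    } finally {
      lock.unlock();
    }
  }
}
----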
>
> On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <v.rozov@datatorrent.com>
> wrote:
>
>> After looking at a few stack traces I think that in the benchmark
>> application the operators compete for the circular buffer that passes
>> slices from the emitter output to the consumer input, and the sleeps that
>> avoid busy waiting are too long for the benchmark operators. I don't see a
>> stack similar to the one below every time I take a thread dump, but still
>> often enough to suspect that the sleep is the root cause. I'll recompile
>> with a smaller sleep time and see how this affects performance.
>>
>> ----
>> "1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000
>> nid=0x780f waiting on condition [0x00007f78abb17000]
>>     java.lang.Thread.State: TIMED_WAITING (sleeping)
>>      at java.lang.Thread.sleep(Native Method)
>>      at
>> com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
>>      at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
>>      at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
>>      at
>> com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
>>      at
>> com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
>>      at com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
>>      at
>> com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
>>
>> "2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d
>> waiting on condition [0x00007f78abc18000]
>>     java.lang.Thread.State: TIMED_WAITING (sleeping)
>>      at java.lang.Thread.sleep(Native Method)
>>      at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
>>      at
>> com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
>>
>> ----
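
The TIMED_WAITING frames above correspond to the kind of backoff sketched 
below (a simplified illustration, not the actual netlet code; spinMillis 
stands in for the SPIN_MILLIS attribute and the consumer side is omitted):

----
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of a put() that backs off by sleeping when the buffer is
// full. If spinMillis is long relative to how quickly the consumer drains the
// buffer, the producer spends most of its time asleep.
public class SleepingBuffer<T>
{
  private final Object[] buffer;
  private final AtomicInteger count = new AtomicInteger(); // consumer (omitted) decrements it
  private final long spinMillis;                           // analogous to the SPIN_MILLIS attribute
  private int tail;

  public SleepingBuffer(int capacity, long spinMillis)
  {
    this.buffer = new Object[capacity];
    this.spinMillis = spinMillis;
  }

  public void put(T item) throws InterruptedException
  {
    while (count.get() == buffer.length) {
      Thread.sleep(spinMillis);        // back off instead of busy-waiting
    }
    buffer[tail] = item;
    tail = (tail + 1) % buffer.length;
    count.incrementAndGet();
  }
}
----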
>>
>>
>> On 9/26/15 20:59, Amol Kekre wrote:
>>
>>> A good read -
>>> http://preshing.com/20111118/locks-arent-slow-lock-contention-is/
>>>
>>> Though it does not explain the order of magnitude difference.
>>>
>>> Amol
>>>
>>>
>>> On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <v.rozov@datatorrent.com>
>>> wrote:
>>>
>>>> In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an order
>>>> of magnitude and both operators compete for CPU. I'll take a closer look
>>>> at why.
>>>>
>>>> Thank you,
>>>>
>>>> Vlad
>>>>
>>>>
>>>> On 9/26/15 14:52, Thomas Weise wrote:
>>>>
>>>>> THREAD_LOCAL - operators share a thread
>>>>> CONTAINER_LOCAL - each operator has its own thread
>>>>>
>>>>> So as long as the operators utilize the CPU sufficiently (compete), the
>>>>> latter will perform better.
>>>>>
>>>>> There will be cases where a single thread can accommodate multiple
>>>>> operators. For example, a socket reader (mostly waiting for IO) and a
>>>>> decompress operator (CPU hungry) can share a thread.
>>>>>
>>>>> But to get back to the original question, stream locality does not
>>>>> generally reduce the total memory requirement. If you add multiple
>>>>> operators into one container, that container will also require more
>>>>> memory, and that's how the container size is calculated in the
>>>>> physical plan. You may get some extra mileage when multiple operators
>>>>> share the same heap, but the need to identify the memory requirement
>>>>> per operator does not go away.
>>>>>
>>>>> Thomas
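
That per-operator requirement can also be declared programmatically while 
wiring the DAG (a sketch equivalent to the ...attr.MEMORY_MB properties 
quoted further down the thread; FileWordCount is a hypothetical stand-in for 
the one memory-hungry operator):

----
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;

public class MemorySettingsApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // FileWordCount stands in for the memory-hungry operator; the remaining
    // operators keep the small application-wide default configured via
    // dt.application.*.operator.*.attr.MEMORY_MB in the properties file.
    FileWordCount fileWordCount = dag.addOperator("fileWordCount", new FileWordCount());
    dag.setAttribute(fileWordCount, OperatorContext.MEMORY_MB, 512);
  }
}
----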
>>>>>
>>>>>
>>>>> On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <
>>>>> ram@datatorrent.com>
>>>>> wrote:
>>>>>
>>>>>> Would CONTAINER_LOCAL achieve the same thing and perform a little
>>>>>> better on a multi-core box?
>>>>>>
>>>>>> Ram
>>>>>>
>>>>>> On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <
>>>>>> sandeep@datatorrent.com> wrote:
>>>>>>
>>>>>>> Yes, with this approach only two containers are required: one for
>>>>>>> stram and another for all operators. You can easily fit around 10
>>>>>>> operators in less than 1GB.
>>>>>>>
>>>>>>> On 27 Sep 2015 00:32, "Timothy Farkas" <tim@datatorrent.com> wrote:
>>>>>>>
>>>>>>>> Hi Ram,
>>>>>>>>
>>>>>>>> You could make all the operators thread local. This cuts down on the
>>>>>>>> overhead of separate containers and maximizes the memory available
>>>>>>>> to each operator.
>>>>>>>>
>>>>>>>> Tim
>>>>>>>>
>>>>>>>> On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <
>>>>>>>> ram@datatorrent.com> wrote:
>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I was running into memory issues when deploying my app on the
>>>>>>>>> sandbox where all the operators were stuck forever in the PENDING
>>>>>>>>> state because they were being continually aborted and restarted
>>>>>>>>> because of the limited memory on the sandbox. After some
>>>>>>>>> experimentation, I found that the following config values seem to
>>>>>>>>> work:
>>>>>>>>> ------------------------------------------
>>>>>>>>> https://datatorrent.slack.com/archives/engineering/p1443263607000010
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>dt.attr.MASTER_MEMORY_MB</name>
>>>>>>>>>   <value>500</value>
>>>>>>>>> </property>
>>>>>>>>> <property>
>>>>>>>>>   <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
>>>>>>>>>   <value>200</value>
>>>>>>>>> </property>
>>>>>>>>> <property>
>>>>>>>>>   <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
>>>>>>>>>   <value>512</value>
>>>>>>>>> </property>
>>>>>>>>> ------------------------------------------
>>>>>>>>>
>>>>>>>>> Are these reasonable values? Is there a more systematic way of
>>>>>>>>> coming up with these values than trial-and-error? Most of my
>>>>>>>>> operators -- with the exception of fileWordCount -- need very
>>>>>>>>> little memory; is there a way to cut all values down to the bare
>>>>>>>>> minimum and maximize available memory for this one operator?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> Ram

