From: Avi Kivity <avi@scylladb.com>
Organization: ScyllaDB
To: user@cassandra.apache.org
Subject: Re: scylladb
Date: Sun, 12 Mar 2017 11:48:39 +0200

We already quantified it; the result is Scylla. Now, Scylla's performance is only in part due to the threading model, so I can't give you a number that quantifies how much just this aspect of the design is worth. Removing it (or adding it to Cassandra) is a multi-man-year effort that I can't justify for this conversation.


If you want to continue to use kernel threads for your applications, by all means continue to do so.  They're the right choice for all but the most I/O-intensive applications.  But for those I/O-intensive applications, thread-per-core is the right choice, regardless of the points you raise.


I encourage you to study the seastar code base [1] and documentation [2] to see how we handled those problems.  I'll also comment a bit below.


[1] https://github.com/scylladb/seastar

[2] http://www.seastar-project.org/


On 03/12/2017 11:07 AM, Kant Kodali wrote:
@Avi

"User-level scheduling is great for high performance I/O intensive applications like databases and file systems." This is generally a claim made by people you want to use user-level threads but I rarely had seen any significant performance gain. Since you are claiming that you do. It would be great if you can quantify that. The other day I have seen a benchmark of a Golang server which supports user level threads/green threads natively and it was able to handle 10K concurrent requests. Even Nginx which is written and C and uses kernel threads can handle that many with Non-blocking I/O. We all know concurrency is not parallelism. 

You may have to pay for it in any of the following ways.

Duplication of the schedulers
M:N requires two schedulers which basically do the same work, one at user level and one in the kernel. This is undesirable: it requires frequent communication between kernel and user space to transfer scheduling information.

Duplication takes more space in both Dcache and Icache for scheduling than a single scheduler. It is highly undesirable if cache misses are caused by the schedulers rather than the application, because an L2 cache miss can be more expensive than a kernel thread switch. Then the additional scheduler might become a troublemaker! In that case, saving kernel traps does not justify a user-level scheduler, which is all the more true as processors provide faster and faster kernel trap execution.


That's not a problem, at least in my experience. The kernel scheduler needs to schedule only one thread, and that very infrequently. It is completely out of any hot path.


Thread-local data maintenance
M:N has to maintain thread-specific data that the kernel already provides for kernel threads, such as TLS data and the error number. Providing the same feature for user threads is not straightforward because, for example, the error number is set on system call failure and is maintained by the kernel. User-level support degrades system performance and increases system complexity.

This is also not a problem: we capture error codes in exceptions immediately after a system call, so we don't need to rely on TLS for errno.
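
A minimal sketch of that pattern (illustrative only; checked_read is a hypothetical helper, not Scylla's actual code):

    #include <cerrno>
    #include <system_error>
    #include <unistd.h>

    // Convert a failing syscall's errno into an exception on the spot,
    // before any other user-level task can run and clobber it.
    ssize_t checked_read(int fd, void* buf, size_t count) {
        ssize_t r = ::read(fd, buf, count);
        if (r < 0) {
            throw std::system_error(errno, std::generic_category(), "read");
        }
        return r;
    }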


System info oblivious 
The kernel scheduler is close to the underlying platform and architecture, and can take advantage of their features. This is difficult for a user thread library, because it is a layer at user level. User threads are second-order entities in the system: if a kernel thread uses a GDT slot for TLS data, a user thread can perhaps only use an LDT slot. With more and more threading/scheduling support available from new processors (hyperthreading, NUMA, many-core), this second-order nature seriously limits the ability of M:N threading.

Those are non-issues, in my experience.  In fact it's the other way around: the kernel scheduler cannot assume anything about the threads it is preempting, and so has to save more state.  The threads being preempted also cannot assume anything about the kernel scheduler, and so have to use atomic read-modify-write instructions for synchronization, and perform a system call whenever they need to block or wake another thread.




On Sun, Mar 12, 2017 at 1:05 AM, Avi Kivity <avi@scylladb.com> wrote:

btw, for an example of how user-level tasks can be scheduled in a way that cannot be done with kernel threads, see this pair of blog posts:


  http://www.scylladb.com/2016/04/14/io-scheduler-1/

  http://www.scylladb.com/2016/04/29/io-scheduler-2/


There's simply no way to get this kind of control when you rely on the kernel for scheduling and page cache management.  As a result you have to overprovision your node and then you mostly underutilize it.


On 03/12/2017 10:23 AM, Avi Kivity wrote:



On 03/12/2017 12:19 AM, Kant Kodali wrote:
My response is inline.

On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity <avi@scylladb.com> wrote:
There are several issues at play here.

First, a database runs a large number of concurrent operations, each of which only consumes a small amount of CPU. The high concurrency is needed to hide latency: disk latency, or the latency of contacting a remote node.
 
Ok, so you are talking about hiding I/O latency.  If all this I/O uses non-blocking system calls, then a thread per core and a callback mechanism should suffice, shouldn't it?
 

Scylla uses a mix of user-level threads and callbacks. Most of the code uses callbacks (fronted by a future/promise API). SSTable writers  (memtable flush, compaction) use a user-level thread (internally implemented using callbacks).  The important bit is multiplexing many concurrent operations onto a single kernel thread.
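
For a feel of the future/promise style, here is a hedged sketch (using the modern seastar spellings; the 2017 tree used unprefixed names, and this is not an excerpt from Scylla):

    #include <seastar/core/future.hh>
    #include <seastar/core/sleep.hh>
    #include <chrono>

    // Each then() queues a continuation on the reactor; thousands of such
    // chains interleave on a single kernel thread, none of them blocking it.
    seastar::future<int> delayed_answer() {
        return seastar::sleep(std::chrono::milliseconds(10)).then([] {
            return seastar::make_ready_future<int>(42);
        });
    }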


This means that the scheduler will need to switch contexts very often. A kernel thread scheduler knows very little about the application, so it has to switch a lot of context.  A user level scheduler is tightly bound to the application, so it can perform the switching faster. 

Sure, but this applies in the other direction as well. A user-level scheduler has no idea about the kernel-level scheduler either.  There is literally no coordination between the kernel-level scheduler and the user-level scheduler in Linux or any major OS. It may be possible with OSes that support scheduler activations (LWPs) and an upcall mechanism.

There is no need for coordination, because the kernel scheduler has no scheduling decisions to make.  With one thread per core, bound to its core, the kernel scheduler can't make the wrong decision because it has just one choice.
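
Concretely, the setup is roughly this (a simplified, Linux-only sketch; start_shards and reactor_loop are hypothetical names, not Scylla's startup code):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    // One kernel thread per core, each pinned to its core: on any given
    // core the kernel scheduler has exactly one runnable thread to choose.
    void start_shards(void (*reactor_loop)(unsigned)) {
        unsigned n = std::thread::hardware_concurrency();
        std::vector<std::thread> shards;
        for (unsigned cpu = 0; cpu < n; ++cpu) {
            shards.emplace_back([=] {
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(cpu, &set);
                pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
                reactor_loop(cpu);  // this shard's event loop runs here
            });
        }
        for (auto& t : shards) {
            t.join();
        }
    }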


Even then it is hard to say if it is all worth it (the research shows the performance may not outweigh the complexity). Golang's problem is exactly this: if one creates 1000 goroutines/green threads and each of them makes a blocking system call, it would create 1000 kernel threads underneath, because it has no way to know that the kernel thread is blocked (no upcall).

All of the significant system calls we issue are through the main thread, either asynchronous or non-blocking.
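
By non-blocking I mean the usual O_NONBLOCK discipline; roughly (a generic sketch, not Scylla's I/O layer):

    #include <fcntl.h>

    // Mark a file descriptor non-blocking, so the reactor thread can issue
    // reads and writes without ever stalling inside the kernel.
    int make_nonblocking(int fd) {
        int flags = ::fcntl(fd, F_GETFL, 0);
        if (flags < 0) {
            return -1;
        }
        return ::fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }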

And in the non-blocking case I still don't see a significant performance gain compared to a few kernel threads with a callback mechanism.

We do.

If you are saying user-level scheduling is the future (perhaps I would just let the researchers argue about it): as of today that is not the case, else languages would have it natively instead of relying on third-party frameworks or libraries.

User-level scheduling is great for high performance I/O intensive applications like databases and file systems.  It's not a general solution, and it involves a lot of effort to set up the infrastructure. However, for our use case, it was worth it.

 
There are also implications on the concurrency primitives in use (locks etc.) -- they will be much faster for the user-level scheduler, because they cooperate with the scheduler.  For example, no atomic read-modify-write instructions need to be executed.
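
For example, a toy cooperative mutex (an illustrative sketch, not seastar's actual semaphore) needs only plain loads and stores, because all tasks share one kernel thread and yield only at well-defined points:

    #include <deque>
    #include <functional>

    class coop_mutex {
        bool _locked = false;
        std::deque<std::function<void()>> _waiters;  // parked continuations
    public:
        void lock(std::function<void()> cont) {
            if (!_locked) {
                _locked = true;   // plain store: no atomics, no barriers
                cont();
            } else {
                _waiters.push_back(std::move(cont));  // park this task
            }
        }
        void unlock() {
            if (_waiters.empty()) {
                _locked = false;
            } else {
                auto next = std::move(_waiters.front());
                _waiters.pop_front();
                next();           // hand the lock straight to the next task
            }
        }
    };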

    
Second, how many (kernel) threads should you run? This is a question one will always have. If there are 10K user-level threads that map to only one kernel thread, then they cannot exploit parallelism. So there is no single right answer, but a thread per core is a reasonable/good choice.

Only if you can multiplex many operations on top of each of those threads.  Otherwise, the CPUs end up underutilized.

 
If you run too few threads, then you will not be able to saturate the CPU resources.  This is a common problem with Cassandra -- it's very hard to get it to consume all of the CPU power on even a moderately large machine. On the other hand, if you have too many threads, you will see latency rise very quickly, because kernel scheduling granularity is on the order of milliseconds.  User-level scheduling, because it leaves control in the hands of the application, allows you to both saturate the CPU and maintain low latency.

For my workload, and probably others I have seen, Cassandra has always been CPU bound.



Yes, but does it consume 100% of all of the cores on your machine?  Cassandra generally doesn't (on a larger machine), and when you profile it, you see it spending much of its time in atomic operations, or parking/unparking threads -- fighting with itself.  It doesn't scale within the machine.  Scylla will happily utilize all of the cores that it is assigned (all of them by default in most configurations), and the bigger the machine you give it, the happier it will be.

There are other factors, like NUMA-friendliness, but in the end it all boils down to efficiency and control.

None of this is new btw, it's pretty common in the storage world.

Avi


On 03/11/2017 11:18 PM, Kant Kodali wrote:
Here is the Java version (http://docs.paralleluniverse.co/quasar/), but I still don't see how user-level scheduling can be beneficial (this is a well-debated problem). How can this add to the performance? Or, why is user-level scheduling necessary, given the thread-per-core design and the callback mechanism?

On Sat, Mar 11, 2017 at 12:51 PM, Avi Kivity <avi@scylladb.com> wrote:
Scylla uses the seastar framework, which provides both user-level thread scheduling and simple run-to-completion tasks.

Huge pages are limited to 2MB (and 1GB, but these aren't available as transparent hugepages).
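
To make the distinction concrete, here is a hedged sketch of explicitly requesting hugetlb-backed memory; MAP_HUGETLB requires pages reserved beforehand (vm.nr_hugepages), whereas transparent hugepages are applied by the kernel to ordinary anonymous mappings:

    #include <sys/mman.h>
    #include <cstddef>

    // Map len bytes backed by explicit (non-transparent) huge pages.
    // Fails unless hugepages were reserved beforehand.
    void* alloc_huge(std::size_t len) {
        void* p = ::mmap(nullptr, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        return p == MAP_FAILED ? nullptr : p;
    }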


On 03/11/2017 10:26 PM, Kant Kodali wrote:
@Dor 

1) You guys have a CPU scheduler? You mean a user-level thread scheduler that maps user-level threads to kernel-level threads? I thought C++ by default creates native kernel threads, but sure, nothing stops someone from creating a user-level scheduling library, if that's what you are talking about.
2) How can one create a THP of size 1KB? According to this post it looks like the valid values are 2MB and 1GB.

Thanks,
kant

On Sat, Mar 11, 2017 at 11:41 AM, Avi Kivity <avi@scylladb.com> wrote:
Agreed; I'd recommend treating benchmarks as a rough guide to see where there is potential, and following through with your own tests.

On 03/11/2017 09:37 PM, Edward Capriolo wrote:

Benchmarks are great for FUDly blog posts. Real world workloads matter more. Every NoSQL vendor wins their benchmarks.