cassandra-user mailing list archives

From Avi Kivity <...@scylladb.com>
Subject Re: scylladb
Date Sun, 12 Mar 2017 10:32:25 GMT
If you have thread-per-core and N (logical) cores, and have M tasks 
running concurrently where M > N, then you need a scheduler to decide 
which of those M tasks gets to run on those N kernel threads.  Whether 
those M tasks are user-level threads, or callbacks, or a mix of the two 
is immaterial.  In such cases a scheduler always exists, even if it is a 
simple FIFO queue.


Scheduling happens either voluntarily (the task issues I/O) or 
involuntarily (the scheduler decides it needs to run another task to 
satisfy a latency SLA), but it has to happen.  The only case where it 
doesn't need to happen is if M <= N, in which case your server will be 
underutilized whenever a task has to wait.
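

As a minimal illustration -- a toy sketch in C++, not Seastar's actual 
code -- such a scheduler can be nothing more than a FIFO queue of 
continuations drained by the one kernel thread on each core:

    #include <deque>
    #include <functional>

    // Toy run-to-completion scheduler: one instance per kernel
    // thread/core.  The M tasks are queued continuations; "scheduling"
    // is just picking the next one off the queue.
    class scheduler {
        std::deque<std::function<void()>> _runnable;
    public:
        void add_task(std::function<void()> task) {
            _runnable.push_back(std::move(task));
        }
        // Voluntary scheduling point: a task that must wait returns
        // after arranging for its continuation to be re-queued when
        // the wait completes.
        void run() {
            while (!_runnable.empty()) {
                auto task = std::move(_runnable.front());
                _runnable.pop_front();
                task();
            }
        }
    };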


On 03/12/2017 12:17 PM, Kant Kodali wrote:
> @Avi
>
> I don't disagree with the thread-per-core design; in fact, I said it 
> is a reasonable/good choice. But I am having a hard time seeing how 
> user-level scheduling can make a significant difference, even in the 
> non-blocking I/O case. My question really is: if you already have 
> TPC, why do you need user-level scheduling? And if the answer is to 
> switch between user-level tasks, then I am simply trying to say 
> "concurrency is not parallelism" (just because one was able to 
> switch between user-level threads doesn't mean they are running in 
> parallel underneath). Why not simply schedule those on kernel threads 
> running on those cores and have a callback mechanism? Why would one 
> need to deal with user-level scheduling overhead and all the problems 
> that come with it? This to me just sounds like a difference in design 
> paradigm that doesn't add much to the performance.
>
> Seastar sounds very similar to Quasar, and I am not seeing great 
> benefits from it.
>
>
>
>
> On Sun, Mar 12, 2017 at 1:48 AM, Avi Kivity <avi@scylladb.com> wrote:
>
>     We already quantified it; the result is Scylla. Now, Scylla's
>     performance is only in part due to the threading model, so I can't
>     give you a number that quantifies how much just this aspect of the
>     design is worth.  Removing it (or adding it to Cassandra) is a
>     multi-man-year effort that I can't justify for this conversation.
>
>
>     If you want to continue to use kernel threads for your
>     applications, by all means continue to do so.  They're the right
>     choice for all but the most I/O intensive applications.  But for
>     these I/O intensive applications thread-per-core is the right
>     choice, regardless of the points you raise.
>
>
>     I encourage you to study the seastar code base [1] and
>     documentation [2] to see how we handled those problems. I'll also
>     comment a bit below.
>
>
>     [1] https://github.com/scylladb/seastar
>
>     [2] http://www.seastar-project.org/
>
>
>     On 03/12/2017 11:07 AM, Kant Kodali wrote:
>>     @Avi
>>
>>     "User-level scheduling is great for high performance I/O
>>     intensive applications like databases and file systems." This is
>>     generally a claim made by people who want to use user-level
>>     threads, but I have rarely seen any significant performance gain.
>>     Since you are claiming that you do, it would be great if you
>>     could quantify that. The other day I saw a benchmark of a Golang
>>     server, which supports user-level threads/green threads natively,
>>     and it was able to handle 10K concurrent requests. Even Nginx,
>>     which is written in C and uses kernel threads, can handle that
>>     many with non-blocking I/O. We all know concurrency is not
>>     parallelism.
>>
>>     You may have to pay for something which could be any of the
>>     following.
>>
>>     *Duplication of the schedulers*
>>     M:N requires two schedulers that basically do the same work, one
>>     at user level and one in the kernel. This is undesirable: it
>>     requires frequent communication between kernel and user space to
>>     transfer scheduling information.
>>
>>     Duplication takes more space in both the D-cache and I-cache than
>>     a single scheduler. It is highly undesirable if cache misses are
>>     caused by the schedulers rather than the application, because an
>>     L2 cache miss can be more expensive than a kernel thread switch.
>>     The additional scheduler might then become a troublemaker! In
>>     that case, saving kernel traps does not justify a user-level
>>     scheduler, which is all the more true as processors provide
>>     faster and faster trap handling.
>
>
>     That's not a problem, at least in my experience. The kernel
>     scheduler needs to schedule only one thread, and that very
>     infrequently. It is completely out of any hot path.
>
>>
>>     *Thread local data maintenance*
>>     M:N has to maintain thread-specific data that the kernel already
>>     provides for kernel threads, such as TLS data and the error
>>     number. Providing the same features for user threads is not
>>     straightforward because, for example, the error number is
>>     returned for system call failures and is maintained by the
>>     kernel. User-level support degrades system performance and
>>     increases system complexity.
>
>     This is also not a problem: we capture error codes in exceptions
>     immediately after a system call, so we don't need to rely on TLS
>     for errno.
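>
>     For example, a simplified sketch of the idea (not our exact code):
>
>         #include <cerrno>
>         #include <system_error>
>         #include <unistd.h>
>
>         // Read errno right at the call site; the error code travels
>         // inside the exception, so no later TLS lookup is needed.
>         ssize_t checked_read(int fd, void* buf, size_t count) {
>             ssize_t r = ::read(fd, buf, count);
>             if (r == -1) {
>                 throw std::system_error(errno, std::system_category(),
>                                         "read");
>             }
>             return r;
>         }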
>
>>
>>     *System info oblivious*
>>     The kernel scheduler is close to the underlying platform and
>>     architecture and can take advantage of their features. This is
>>     difficult for a user-level thread library, because it is a layer
>>     at user level. User threads are second-order entities in the
>>     system. If a kernel thread uses a GDT slot for TLS data, a user
>>     thread can perhaps only use an LDT slot for TLS data. With more
>>     and more support for threading/scheduling available in new
>>     processors (hyperthreading, NUMA, many-core), this second-order
>>     nature seriously limits the ability of M:N threading.
>
>     Those are non-issues, in my experience.  In fact it's the other
>     way around: the kernel scheduler cannot assume anything about the
>     threads it is preempting, and so has to save more state.  The
>     threads being preempted also cannot assume anything about the
>     kernel scheduler, and so have to use atomic read-modify-write
>     instructions for synchronization, and to perform a system call
>     whenever they need to block or wake another thread.
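>
>     To make the contrast concrete, here is a toy sketch (not
>     Seastar's actual semaphore): under cooperative scheduling on a
>     single kernel thread, the counter can be a plain integer, because
>     nothing can preempt a task in the middle of an update, and waking
>     a waiter is an ordinary function call rather than a system call:
>
>         #include <deque>
>         #include <functional>
>
>         // Toy cooperative semaphore: needs no atomics and no system
>         // calls, because all users run on the same kernel thread and
>         // switch only at well-defined points.
>         class semaphore {
>             size_t _count;
>             std::deque<std::function<void()>> _waiters;
>         public:
>             explicit semaphore(size_t count) : _count(count) {}
>             void wait(std::function<void()> on_acquired) {
>                 if (_count > 0) {
>                     --_count;        // plain decrement, no lock prefix
>                     on_acquired();
>                 } else {
>                     _waiters.push_back(std::move(on_acquired));
>                 }
>             }
>             void signal() {
>                 if (!_waiters.empty()) {
>                     auto next = std::move(_waiters.front());
>                     _waiters.pop_front();
>                     next();          // wake a waiter: just a call
>                 } else {
>                     ++_count;
>                 }
>             }
>         };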
>
>
>
>
>>
>>     On Sun, Mar 12, 2017 at 1:05 AM, Avi Kivity
>>     <avi@scylladb.com> wrote:
>>
>>         btw, for an example of how user-level tasks can be scheduled
>>         in a way that cannot be done with kernel threads, see this
>>         pair of blog posts:
>>
>>
>>         http://www.scylladb.com/2016/04/14/io-scheduler-1/
>>
>>         http://www.scylladb.com/2016/04/29/io-scheduler-2/
>>
>>
>>         There's simply no way to get this kind of control when you
>>         rely on the kernel for scheduling and page cache management. 
>>         As a result you have to overprovision your node and then you
>>         mostly underutilize it.
>>
>>
>>         On 03/12/2017 10:23 AM, Avi Kivity wrote:
>>>
>>>
>>>
>>>         On 03/12/2017 12:19 AM, Kant Kodali wrote:
>>>>         My response is inline.
>>>>
>>>>         On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity
>>>>         <avi@scylladb.com> wrote:
>>>>
>>>>             There are several issues at play here.
>>>>
>>>>             First, a database runs a large number of concurrent
>>>>             operations, each of which only consumes a small amount
>>>>             of CPU. The high concurrency is needed to hide latency:
>>>>             disk latency, or the latency of contacting a remote node.
>>>>
>>>>         *Ok, so you are talking about hiding I/O latency. If all
>>>>         this I/O is done with non-blocking system calls, then a
>>>>         thread per core and a callback mechanism should suffice,
>>>>         shouldn't it?*
>>>
>>>         Scylla uses a mix of user-level threads and callbacks. Most
>>>         of the code uses callbacks (fronted by a future/promise
>>>         API). SSTable writers (memtable flush, compaction) use a
>>>         user-level thread (internally implemented using callbacks). 
>>>         The important bit is multiplexing many concurrent operations
>>>         onto a single kernel thread.
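>>>
>>>         As a toy sketch of the multiplexing (not Seastar's actual
>>>         future/promise types): each pending operation is just a
>>>         stored continuation, so thousands of them can be in flight
>>>         on one kernel thread at once:
>>>
>>>             #include <functional>
>>>             #include <map>
>>>
>>>             // Toy event loop: a pending I/O is a saved continuation,
>>>             // resumed when the kernel reports completion.
>>>             class event_loop {
>>>                 std::map<int, std::function<void()>> _pending;
>>>             public:
>>>                 void when_ready(int fd, std::function<void()> k) {
>>>                     _pending[fd] = std::move(k);  // no thread parks
>>>                 }
>>>                 void on_completion(int fd) {
>>>                     auto it = _pending.find(fd);
>>>                     if (it == _pending.end()) return;
>>>                     auto k = std::move(it->second);
>>>                     _pending.erase(it);
>>>                     k();  // resume where the operation left off
>>>                 }
>>>             };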
>>>
>>>
>>>>             This means that the scheduler will need to switch
>>>>             contexts very often. A kernel thread scheduler knows
>>>>             very little about the application, so it has to switch
>>>>             a lot of context.  A user level scheduler is tightly
>>>>             bound to the application, so it can perform the
>>>>             switching faster.
>>>>
>>>>
>>>>         *Sure, but this applies in the other direction as well. A
>>>>         user-level scheduler has no idea about the kernel-level
>>>>         scheduler either.  There is literally no coordination
>>>>         between the kernel-level scheduler and the user-level
>>>>         scheduler in Linux or any major OS. It may be possible on
>>>>         OSes that support scheduler activations (LWPs) and an
>>>>         upcall mechanism.*
>>>
>>>         There is no need for coordination, because the kernel
>>>         scheduler has no scheduling decisions to make.  With one
>>>         thread per core, bound to its core, the kernel scheduler
>>>         can't make the wrong decision because it has just one choice.
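>>>
>>>         For illustration, pinning on Linux can look roughly like
>>>         this (a sketch, not necessarily how Seastar does it):
>>>
>>>             #include <pthread.h>   // needs _GNU_SOURCE; g++ on
>>>             #include <sched.h>     // Linux defines it by default
>>>
>>>             // Pin the calling thread to one core; with one such
>>>             // thread per core, the kernel scheduler has exactly
>>>             // one runnable choice per CPU.
>>>             void pin_to_core(int core) {
>>>                 cpu_set_t set;
>>>                 CPU_ZERO(&set);
>>>                 CPU_SET(core, &set);
>>>                 pthread_setaffinity_np(pthread_self(),
>>>                                        sizeof(set), &set);
>>>             }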
>>>
>>>
>>>>         *Even then, it is hard to say if it is all worth it
>>>>         (research shows the performance may not outweigh the
>>>>         complexity). Golang's problem is exactly this: if one
>>>>         creates 1000 goroutines/green threads where each of them
>>>>         makes a blocking system call, it will create 1000 kernel
>>>>         threads underneath, because it has no way to know that the
>>>>         kernel thread is blocked (no upcall).*
>>>
>>>         All of the significant system calls we issue go through the
>>>         main thread, and are either asynchronous or non-blocking.
>>>
>>>>         *And in the non-blocking case I still don't see a
>>>>         significant performance gain compared to a few kernel
>>>>         threads with a callback mechanism.*
>>>
>>>         We do.
>>>
>>>>         *If you are saying user-level scheduling is the future,
>>>>         perhaps I would just let the researchers argue about it.
>>>>         As of today that is not the case, or else languages would
>>>>         have it natively instead of relying on third-party
>>>>         frameworks or libraries.*
>>>
>>>         User-level scheduling is great for high performance I/O
>>>         intensive applications like databases and file systems. 
>>>         It's not a general solution, and it involves a lot of effort
>>>         to set up the infrastructure. However, for our use case, it
>>>         was worth it.
>>>
>>>>             There are also implications for the concurrency
>>>>             primitives in use (locks etc.) -- they will be much
>>>>             faster for the user-level scheduler, because they
>>>>             cooperate with the scheduler.  For example, no atomic
>>>>             read-modify-write instructions need to be executed.
>>>>
>>>>
>>>>              Second, how many (kernel) threads should you run?
>>>>
>>>>         *This question always arises. If there are 10K user-level
>>>>         threads that map to only one kernel thread, then they
>>>>         cannot exploit parallelism. So there is no right answer,
>>>>         but a thread per core is a reasonable/good choice.*
>>>
>>>         Only if you can multiplex many operations on top of each of
>>>         those threads.  Otherwise, the CPUs end up underutilized.
>>>
>>>>             If you run too few threads, then you will not be able
>>>>             to saturate the CPU resources.  This is a common
>>>>             problem with Cassandra -- it's very hard to get it to
>>>>             consume all of the CPU power on even a moderately large
>>>>             machine. On the other hand, if you have too many
>>>>             threads, you will see latency rise very quickly,
>>>>             because kernel scheduling granularity is on the order
>>>>             of milliseconds. User-level scheduling, because it
>>>>             leaves control in the hands of the application, allows
>>>>             you to both saturate the CPU and maintain low latency.
>>>>
>>>>
>>>>         *For my workload, and probably others I have seen,
>>>>         Cassandra has always been CPU bound.*
>>>>
>>>>
>>>
>>>
>>>         Yes, but does it consume 100% of all of the cores on your
>>>         machine? Cassandra generally doesn't (on a larger machine),
>>>         and when you profile it, you see it spending much of its
>>>         time in atomic operations, or parking/unparking threads --
>>>         fighting with itself.  It doesn't scale within the machine. 
>>>         Scylla will happily utilize all of the cores that it is
>>>         assigned (all of them by default in most configurations),
>>>         and the bigger the machine you give it, the happier it will be.
>>>
>>>>             There are other factors, like NUMA-friendliness, but in
>>>>             the end it all boils down to efficiency and control.
>>>>
>>>>             None of this is new btw, it's pretty common in the
>>>>             storage world.
>>>>
>>>>             Avi
>>>>
>>>>
>>>>             On 03/11/2017 11:18 PM, Kant Kodali wrote:
>>>>>             Here is the Java version:
>>>>>             http://docs.paralleluniverse.co/quasar/. But I still
>>>>>             don't see how user-level scheduling can be beneficial
>>>>>             (this is a well-debated problem). How can this add to
>>>>>             the performance? Or, say, why is user-level scheduling
>>>>>             necessary, given the thread-per-core design and the
>>>>>             callback mechanism?
>>>>>
>>>>>             On Sat, Mar 11, 2017 at 12:51 PM, Avi Kivity
>>>>>             <avi@scylladb.com> wrote:
>>>>>
>>>>>                 Scylla uses the Seastar framework, which provides
>>>>>                 both user-level thread scheduling and simple
>>>>>                 run-to-completion tasks.
>>>>>
>>>>>                 Huge pages are limited to 2MB (and 1GB, but these
>>>>>                 aren't available as transparent hugepages).
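>>>>>
>>>>>                 For illustration (a sketch, not Scylla's actual
>>>>>                 allocator), a mapping can be hinted to use 2MB
>>>>>                 transparent huge pages on Linux like this:
>>>>>
>>>>>                     #include <sys/mman.h>
>>>>>
>>>>>                     // Hint the kernel to back this region with
>>>>>                     // transparent huge pages.
>>>>>                     void* alloc_huge(size_t len) {
>>>>>                         void* p = mmap(nullptr, len,
>>>>>                                        PROT_READ | PROT_WRITE,
>>>>>                                        MAP_PRIVATE | MAP_ANONYMOUS,
>>>>>                                        -1, 0);
>>>>>                         if (p != MAP_FAILED) {
>>>>>                             madvise(p, len, MADV_HUGEPAGE);
>>>>>                         }
>>>>>                         return p;
>>>>>                     }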
>>>>>
>>>>>
>>>>>                 On 03/11/2017 10:26 PM, Kant Kodali wrote:
>>>>>>                 @Dor
>>>>>>
>>>>>>                 1) You guys have a CPU scheduler? You mean a
>>>>>>                 user-level thread scheduler that maps user-level
>>>>>>                 threads to kernel-level threads? I thought C++ by
>>>>>>                 default creates native kernel threads, but sure,
>>>>>>                 nothing stops someone from creating a user-level
>>>>>>                 scheduling library, if that's what you are
>>>>>>                 talking about.
>>>>>>                 2) How can one create THP of size 1KB? According
>>>>>>                 to this post
>>>>>>                 <https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html>
>>>>>>                 it looks like the valid values are 2MB and 1GB.
>>>>>>
>>>>>>                 Thanks,
>>>>>>                 kant
>>>>>>
>>>>>>                 On Sat, Mar 11, 2017 at 11:41 AM, Avi Kivity
>>>>>>                 <avi@scylladb.com> wrote:
>>>>>>
>>>>>>                     Agreed. I'd recommend treating benchmarks as
>>>>>>                     a rough guide to see where there is
>>>>>>                     potential, and following through with your
>>>>>>                     own tests.
>>>>>>
>>>>>>                     On 03/11/2017 09:37 PM, Edward Capriolo wrote:
>>>>>>>
>>>>>>>                     Benchmarks are great for FUDly blog posts.
>>>>>>>                     Real-world workloads matter more. Every
>>>>>>>                     NoSQL vendor wins their benchmarks.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>

