From: Avi Kivity <avi@scylladb.com>
Organization: ScyllaDB
To: user@cassandra.apache.org
Subject: Re: scylladb
Date: Sun, 12 Mar 2017 11:48:39 +0200

We already quantified it; the result is Scylla. Now, Scylla's performance is only in part due to the threading model, so I can't give you a number that quantifies how much just this aspect of the design is worth. Removing it (or adding it to Cassandra) is a multi-man-year effort that I can't justify for this conversation.


If you want to continue to use kernel threads for your applications, by all means continue to do so.  They're the right choice for all but the most I/O-intensive applications.  But for those I/O-intensive applications, thread-per-core is the right choice, regardless of the points you raise.


I encourage you to study the seastar code base [1] and documentation [2] to see how we handled those problems.  I'll also comment a bit below.


[1] https://github.com/scylladb/seastar

[2] http://www.seastar-project.org/


On 03/12/2017 11:07 AM, Kant Kodali wrote:
@Avi

"User-level scheduling is great for high performance I/O intensive applications like databases and file systems." This is generally a claim made by people you want to use user-level threads but I rarely had seen any significant performance gain. Since you are claiming that you do. It would be great if you can quantify that. The other day I have seen a benchmark of a Golang server which supports user level threads/green threads natively and it was able to handle 10K concurrent requests. Even Nginx which is written and C and uses kernel threads can handle that many with Non-blocking I/O. We all know concurrency is not parallelism. 

You may have to pay for it in any of the following ways.

Duplication of the schedulers
M:N requires two schedulers which basically do the same work, one at user level and one in the kernel. This is undesirable: it requires frequent communication between kernel and user space to transfer scheduling information.

Duplication takes more space in both Dcache and Icache for scheduling than a single scheduler. It is highly undesirable if cache misses are caused by the schedulers rather than the application, because an L2 cache miss can be more expensive than a kernel thread switch. Then the additional scheduler might become a troublemaker! In that case, saving kernel traps does not justify a user-level scheduler, which is all the more true as processors provide faster and faster kernel trap execution.


That's not a problem, at least in my experience. The kernel scheduler needs to schedule only one thread, and that very infrequently. It is completely out of any hot path.


Thread-local data maintenance
M:N has to maintain thread-specific data that the kernel already provides for kernel threads, such as TLS data and the error number. Providing the same feature for user threads is not straightforward because, for example, the error number is set on system call failure and is maintained by the kernel. User-level support degrades system performance and increases system complexity.

This is also not a problem: we capture error codes in exceptions immediately after a system call, so we don't need to rely on TLS for errno.
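
A minimal sketch of that pattern (illustrative only; checked_read is a hypothetical helper, not Scylla's actual code):

    #include <cerrno>
    #include <system_error>
    #include <unistd.h>

    // Convert a failing syscall's errno into an exception on the spot,
    // before any other user-level task can run and clobber it.
    ssize_t checked_read(int fd, void* buf, size_t count) {
        ssize_t r = ::read(fd, buf, count);
        if (r < 0) {
            throw std::system_error(errno, std::generic_category(), "read");
        }
        return r;
    }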


System info oblivious 
The kernel scheduler is close to the underlying platform and architecture, and can take advantage of their features. This is difficult for a user thread library, because it is a layer at user level. User threads are second-order entities in the system: if a kernel thread uses a GDT slot for TLS data, a user thread can perhaps only use an LDT slot. With more and more threading/scheduling support available from new processors (hyperthreading, NUMA, many-core), this second-order nature seriously limits the ability of M:N threading.

Those are non-issues, in my experience.  In fact it's the other way around: the kernel scheduler cannot assume anything about the threads it is preempting, and so has to save more state.  The threads being preempted also cannot assume anything about the kernel scheduler, and so have to use atomic read-modify-write instructions for synchronization, and perform a system call whenever they need to block or wake another thread.




On Sun, Mar 12, 2017 at 1:05 AM, Avi Kivity <avi@scylladb.com> wrote:

btw, for an example of how user-level tasks can be scheduled in a way that cannot be done with kernel threads, see this pair of blog posts:


  http://www.scylladb.com/2016/04/14/io-scheduler-1/

  http://www.scylladb.com/2016/04/29/io-scheduler-2/


There's simply no way to get this kind of control when you rely on the kernel for scheduling and page cache management.  As a result you have to overprovision your node and then you mostly underutilize it.


On 03/12/2017 10:23 AM, Avi Kivity wrote:



On 03/12/2017 12:19 AM, Kant Kodali wrote:
My response is inline.

On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity <avi@scylladb.com> wrote:
There are several issues at play here.

First, a database runs a large number of concurrent operations, each of which only consumes a small amount of CPU. The high concurrency is needed to hide latency: disk latency, or the latency of contacting a remote node.
 
Ok, so you are talking about hiding I/O latency.  If all this I/O uses non-blocking system calls, then a thread per core and a callback mechanism should suffice, shouldn't it?
 

Scylla uses a mix of user-level threads and callbacks. Most of the code uses callbacks (fronted by a future/promise API). SSTable writers  (memtable flush, compaction) use a user-level thread (internally implemented using callbacks).  The important bit is multiplexing many concurrent operations onto a single kernel thread.
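
For a feel of the future/promise style, here is a hedged sketch (using the modern seastar spellings; the 2017 tree used unprefixed names, and this is not an excerpt from Scylla):

    #include <seastar/core/future.hh>
    #include <seastar/core/sleep.hh>
    #include <chrono>

    // Each then() queues a continuation on the reactor; thousands of such
    // chains interleave on a single kernel thread, none of them blocking it.
    seastar::future<int> delayed_answer() {
        return seastar::sleep(std::chrono::milliseconds(10)).then([] {
            return seastar::make_ready_future<int>(42);
        });
    }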


This means that the scheduler will need to switch contexts very often. A kernel thread scheduler knows very little about the application, so it has to switch a lot of context.  A user level scheduler is tightly bound to the application, so it can perform the switching faster. 

Sure, but this applies in the other direction as well. A user-level scheduler has no idea about the kernel-level scheduler either.  There is literally no coordination between the kernel-level scheduler and the user-level scheduler in Linux or any major OS. It may be possible with OSes that support scheduler activations (LWPs) and an upcall mechanism.

There is no need for coordination, because the kernel scheduler has no scheduling decisions to make.  With one thread per core, bound to its core, the kernel scheduler can't make the wrong decision because it has just one choice.
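
Concretely, the setup is roughly this (a simplified, Linux-only sketch; start_shards and reactor_loop are hypothetical names, not Scylla's startup code):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    // One kernel thread per core, each pinned to its core: on any given
    // core the kernel scheduler has exactly one runnable thread to choose.
    void start_shards(void (*reactor_loop)(unsigned)) {
        unsigned n = std::thread::hardware_concurrency();
        std::vector<std::thread> shards;
        for (unsigned cpu = 0; cpu < n; ++cpu) {
            shards.emplace_back([=] {
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(cpu, &set);
                pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
                reactor_loop(cpu);  // this shard's event loop runs here
            });
        }
        for (auto& t : shards) {
            t.join();
        }
    }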


Even then it is hard to say if it is all worth it (the research shows the performance may not outweigh the complexity). Golang's problem is exactly this: if one creates 1000 goroutines/green threads and each of them makes a blocking system call, it would create 1000 kernel threads underneath, because it has no way to know that the kernel thread is blocked (no upcall).

All of the significant system calls we issue are through the main thread, either asynchronous or non-blocking.
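
By non-blocking I mean the usual O_NONBLOCK discipline; roughly (a generic sketch, not Scylla's I/O layer):

    #include <fcntl.h>

    // Mark a file descriptor non-blocking, so the reactor thread can issue
    // reads and writes without ever stalling inside the kernel.
    int make_nonblocking(int fd) {
        int flags = ::fcntl(fd, F_GETFL, 0);
        if (flags < 0) {
            return -1;
        }
        return ::fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }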

And in the non-blocking case I still don't see a significant performance gain compared to a few kernel threads with a callback mechanism.

We do.

If you are saying user-level scheduling is the future (perhaps I would just let the researchers argue about it): as of today that is not the case, else languages would have it natively instead of relying on third-party frameworks or libraries.

User-level scheduling is great for high performance I/O intensive applications like databases and file systems.  It's not a general solution, and it involves a lot of effort to set up the infrastructure. However, for our use case, it was worth it.

 
There are also implications on the concurrency primitives in use (locks etc.) -- they will be much faster for the user-level scheduler, because they cooperate with the scheduler.  For example, no atomic read-modify-write instructions need to be executed.
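
For example, a toy cooperative mutex (an illustrative sketch, not seastar's actual semaphore) needs only plain loads and stores, because all tasks share one kernel thread and yield only at well-defined points:

    #include <deque>
    #include <functional>

    class coop_mutex {
        bool _locked = false;
        std::deque<std::function<void()>> _waiters;  // parked continuations
    public:
        void lock(std::function<void()> cont) {
            if (!_locked) {
                _locked = true;   // plain store: no atomics, no barriers
                cont();
            } else {
                _waiters.push_back(std::move(cont));  // park this task
            }
        }
        void unlock() {
            if (_waiters.empty()) {
                _locked = false;
            } else {
                auto next = std::move(_waiters.front());
                _waiters.pop_front();
                next();           // hand the lock straight to the next task
            }
        }
    };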

    
Second, how many (kernel) threads should you run? This is a question one will always have. If there are 10K user-level threads that map to only one kernel thread, then they cannot exploit parallelism. So there is no single right answer, but a thread per core is a reasonable/good choice.

Only if you can multiplex many operations on top of each of those threads.  Otherwise, the CPUs end up underutilized.

 
If you run too few threads, then you will not be able to saturate the CPU resources.  This is a common problem with Cassandra -- it's very hard to get it to consume all of the CPU power on even a moderately large machine. On the other hand, if you have too many threads, you will see latency rise very quickly, because kernel scheduling granularity is on the order of milliseconds.  User-level scheduling, because it leaves control in the hands of the application, allows you to both saturate the CPU and maintain low latency.

For my workload, and probably others I have seen, Cassandra has always been CPU bound.



Yes, but does it consume 100% of all of the cores on your machine?  Cassandra generally doesn't (on a larger machine), and when you profile it, you see it spending much of its time in atomic operations, or parking/unparking threads -- fighting with itself.  It doesn't scale within the machine.  Scylla will happily utilize all of the cores that it is assigned (all of them by default in most configurations), and the bigger the machine you give it, the happier it will be.

There are other factors, like NUMA-friendliness, but in the end it all boils down to efficiency and control.

None of this is new btw, it's pretty common in the storage world.

Avi


On 03/11/2017 11:18 PM, Kant Kodali wrote:
Here is the Java version (http://docs.paralleluniverse.co/quasar/), but I still don't see how user-level scheduling can be beneficial (this is a well-debated problem). How can this add to the performance? Or, why is user-level scheduling necessary, given the thread-per-core design and the callback mechanism?

On Sat, Mar 11, 2017 at 12:51 PM, Avi Kivity <avi@scylladb.com> wrote:
Scylla uses the seastar framework, which provides both user-level thread scheduling and simple run-to-completion tasks.

Huge pages are limited to 2MB (and 1GB, but these aren't available as transparent hugepages).
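
To make the distinction concrete, here is a hedged sketch of explicitly requesting hugetlb-backed memory; MAP_HUGETLB requires pages reserved beforehand (vm.nr_hugepages), whereas transparent hugepages are applied by the kernel to ordinary anonymous mappings:

    #include <sys/mman.h>
    #include <cstddef>

    // Map len bytes backed by explicit (non-transparent) huge pages.
    // Fails unless hugepages were reserved beforehand.
    void* alloc_huge(std::size_t len) {
        void* p = ::mmap(nullptr, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        return p == MAP_FAILED ? nullptr : p;
    }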


On 03/11/2017 10:26 PM, Kant Kodali wrote:
@Dor 

1) You guys have a CPU scheduler? You mean a user-level thread scheduler that maps user-level threads to kernel-level threads? I thought C++ by default creates native kernel threads, but sure, nothing stops someone from creating a user-level scheduling library, if that's what you are talking about.
2) How can one create a THP of size 1KB? According to this post it looks like the valid values are 2MB and 1GB.

Thanks,
kant

On Sat, Mar 11, 2017 at 11:41 AM, Avi Kivity <avi@scylladb.com> wrote:
Agreed; I'd recommend treating benchmarks as a rough guide to see where there is potential, and following through with your own tests.

On 03/11/2017 09:37 PM, Edward Capriolo wrote:

Benchmarks are great for FUDly blog posts. Real world workloads matter more. Every NoSQL vendor wins their benchmarks.