incubator-s4-user mailing list archives

From gaurav parashar <>
Subject Re: What is the difference between S4 and Twitter Storm projects?
Date Thu, 15 Jan 2015 05:25:22 GMT
I would like to know whether load shedding is available in other real-time 
systems such as Storm, Spark, or Kafka.
If not, why has it not been implemented in these systems?

On Tuesday, 11 October 2011 23:18:39 UTC+5:30, Leo Neumeyer wrote:
> I think that both projects share similar goals. I haven't done a detailed 
> study of Storm so I can't give you a very detailed comparison. However, it 
> is great to see more of these systems so we can keep learning from each 
> other. 
> Let me summarize the approach we are taking in S4-piper which is the next 
> release that will be done in the Apache Incubator. (Here for now: 
> The high-level goal is unchanged: distributed stream processing and ease 
> of use. This implies hiding the distributed nature of the application from 
> the user. 
> We basically embrace symmetry, strict OO design, no multithreading 
> required, the Prototype design pattern, extensive use of the Java platform, 
> static typing, no string literals, no large XML configuration files.
> - Symmetry: all nodes in a cluster are identical. Makes management, 
> configuration and deployment easier. Fewer parts that can break.
> - OO Design: not much I can say here except that processing elements (PEs) 
> are objects, streams are objects, and events are objects. The core API is 
> very small and works in a coordinated way to build the platform.
> - In most cases, application developers don't need to write any 
> multithreaded code. By default PE threads are synchronized.
> - Prototype pattern design: once a PE is configured, we instantiate a PE 
> object that we call a PE prototype. The PE prototype has the PE type and 
> the configuration. PE instances are cloned from the prototype and keyed 
> using a (key_finder, key_value) tuple. The key_finder is a function that 
> extracts a value from the event and that value is the key for the PE 
> instance. This guarantees that we have a unique instance of a PE for a 
> given key. This can be used in a single node or in a cluster with multiple 
> nodes. For the app developer, this means that the app is a graph of 
> prototype PEs connected by streams. The key logic resides in the stream 
> object. Building an app is as easy as drawing a graph. We have a basic API 
> but people will be able to build tools on top of it (GUI, DSLs, etc.) We 
> are planning to provide an API on top of Guice so we can take advantage of 
> Guice's dependency injection under the hood.
> - Using the Java platform. We feel that having a consistent platform that 
> is familiar to many developers helps make things easier. While dealing with 
> some complexity and lesser-known frameworks under the hood (Generics, Guice, 
> OSGI for dynamic app loading, ..) we try to leave the external API as pure 
> and simple as possible. We also want to take advantage of new language 
> features coming in Java 7,8... We also have the option to easily add a Scala 
> API. 
> - Static typing: I always thought that dynamic languages were much nicer 
> but I can see the trade-offs. In a large system with multiple programmers 
> and modules, having static typing with a good OO design makes understanding 
> and maintaining the code much easier. To accomplish this we use Guice to 
> replace XML config files, which are a nightmare if you have lots of PEs to 
> configure. I'd rather use a Java API to build the graph. We eliminate the 
> use of string literals to find objects, this is cleaner, more efficient, 
> easy to refactor, and we can easily find things in Eclipse. For remote 
> objects, we send POJOs around and pretty much hide all the details. The 
> stream object is smart enough to pass events to the comm layer when the 
> target PE instance is in another JVM. The event magically arrives in the 
> stream's queue at the destination. 
> - In terms of pluggability, S4 can use any comm protocol and serialization 
> scheme. You just need to implement it in the comm layer. Right now we have 
> a simple UDP implementation and a TCP implementation using Netty. We tested 
> ZMQ but saw no performance advantage and no need for it since everything is 
> Java in our case. However, I may be missing something. Netty gives you the 
> flexibility to implement alternate protocols.
> - In terms of losing events, it is not a matter of losing or not losing; 
> you often need to plan how to handle peak traffic in a real-time system, 
> otherwise it is probably over-engineered. Our plan is to provide an API for 
> load shedding so app developers can decide what to do when the queues are 
> getting full. Some may want to throw away events, others may want to switch 
> to a less computationally expensive algorithm. I prefer to think in a 
> probabilistic way, that is, event loss or load shedding are part of the 
> algorithm and are evaluated in terms of a performance metric, as is done 
> in a communications system.
> - I also experimented with how to run batch tasks in S4. Unlike streaming, 
> in batch mode, we can afford infinite delays and prevent event loss. This 
> is important for offline tasks and for debugging. For this we need to 
> support blocking queues in the stream so when a queue is full, it blocks 
> instead of losing events. In a cluster, this means, that blocking needs to 
> propagate upstream across nodes. The downside is that you may end up having 
> deadlocks if the graph has cycles but that is a whole other story.  In the 
> model example I push events at a high data rate into S4 to train a model. 
> The exact same model is also used at run time. This brings consistency for 
> training and execution because it uses the same code base and the same 
> framework. It doesn't have all the functionality of Hadoop but can be used 
> for some tasks. 
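
To make the last two points concrete, here is a minimal sketch of a bounded stream queue that either sheds events when full (streaming mode) or blocks the producer (batch mode). This is not the S4 API: the class name BoundedStream, the blockWhenFull flag, and the onShed callback are all hypothetical names chosen for illustration, and only the standard java.util.concurrent classes are used.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical sketch, not the S4 API: a stream backed by a bounded queue.
final class BoundedStream<E> {
    private final BlockingQueue<E> queue;
    private final boolean blockWhenFull;   // true = batch mode, false = streaming mode
    private final Consumer<E> onShed;      // app-defined reaction to a shed event
    private final AtomicLong shedCount = new AtomicLong();

    BoundedStream(int capacity, boolean blockWhenFull, Consumer<E> onShed) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.blockWhenFull = blockWhenFull;
        this.onShed = onShed;
    }

    // Returns true if the event was queued, false if it was shed
    // (or if the thread was interrupted while waiting in batch mode).
    boolean put(E event) {
        if (blockWhenFull) {
            try {
                queue.put(event);          // waits for space: no loss, unbounded delay
                return true;
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        if (queue.offer(event)) {          // non-blocking: fails at once when full
            return true;
        }
        shedCount.incrementAndGet();
        onShed.accept(event);              // e.g. drop, sample, or switch algorithm
        return false;
    }

    E poll() { return queue.poll(); }      // consumer side; null when empty
    int size() { return queue.size(); }
    long shedCount() { return shedCount.get(); }
}

final class LoadSheddingSketch {
    public static void main(String[] args) {
        // Streaming mode with room for two events; the rest are shed.
        BoundedStream<String> stream =
            new BoundedStream<>(2, false, e -> System.out.println("shed " + e));
        for (int i = 0; i < 5; i++) {
            stream.put("event-" + i);
        }
        System.out.println("queued=" + stream.size() + " shed=" + stream.shedCount());
    }
}
```

The onShed callback is where an application could count losses for a performance metric or fall back to a cheaper algorithm. In batch mode the producer simply waits, which is why blocking has to propagate upstream and why a cyclic graph can deadlock.
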
> Hope this helps for now. 
> -leo
> On Oct 11, 2011, at 6:54 AM, OG wrote:
> You may find this useful: 
> Otis
> --
> Sematext is hiring!
> On Oct 10, 5:29 am, yasith tharindu <> wrote:
> Does anybody have an idea about the $Subject?
> --
> Thanks..
> Regards...