incubator-s4-user mailing list archives

From Andriy Pidgurskiy <>
Subject Re: S4 use-case questions
Date Wed, 16 Jan 2013 08:51:35 GMT
Dear Kishore,

thanks a lot for the prompt response.

The aggregation state is stored in memory, so data is persisted only for end
results, trade analytic metrics and specific intermediate values. However, I
am considering the ability to move data out of memory onto disk in case memory
is scarce (e.g. for portfolios that are not updated often).
The user scenario is this: trades arrive and the portfolio is aggregated; no
more trades arrive for this portfolio for a while, so its state is spilled from
memory into HBase (or HDFS) and the PE is killed. Some time later a new trade
arrives for this portfolio, and I want to be able to start the PE on the node
where the data is stored.
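
The spill-and-reload scenario above could be sketched roughly as follows. This is plain Java and not the S4, HBase or HDFS API: the class `PortfolioStateCache`, its methods, and the use of a `Map` as a stand-in for the durable store are all assumptions made for illustration, and the aggregate is reduced to a running sum.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch: none of these types come from S4, HBase, or HDFS.
class PortfolioStateCache {
    // Aggregation state held in memory, keyed by portfolio ID.
    private final Map<String, Double> inMemory = new HashMap<>();
    // Last update time per portfolio, used to find idle entries.
    private final Map<String, Long> lastUpdate = new HashMap<>();
    // Stand-in for a durable store such as HBase or HDFS (assumption).
    private final Map<String, Double> spillStore;

    PortfolioStateCache(Map<String, Double> spillStore) {
        this.spillStore = spillStore;
    }

    // Apply one trade; if the portfolio was spilled earlier, reload it first.
    double onTrade(String portfolioId, double amount, long nowMillis) {
        Double current = inMemory.get(portfolioId);
        if (current == null) {
            Double spilled = spillStore.remove(portfolioId);
            current = (spilled != null) ? spilled : 0.0;
        }
        double updated = current + amount;
        inMemory.put(portfolioId, updated);
        lastUpdate.put(portfolioId, nowMillis);
        return updated;
    }

    // Move every portfolio idle for longer than maxIdleMillis out of memory.
    // Returns the number of portfolios spilled.
    int spillIdle(long nowMillis, long maxIdleMillis) {
        int spilled = 0;
        Iterator<Map.Entry<String, Long>> it = lastUpdate.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > maxIdleMillis) {
                String id = entry.getKey();
                spillStore.put(id, inMemory.remove(id));
                it.remove();
                spilled++;
            }
        }
        return spilled;
    }

    boolean isInMemory(String portfolioId) {
        return inMemory.containsKey(portfolioId);
    }
}
```

The sketch only covers the state hand-off; where the PE is restarted relative to the stored data is a placement decision it does not address.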

In case a node fails, I want the task to be taken over by another node and the
aggregation state rebuilt by re-aggregating the trade analytic metrics.
It is important, though, that new data streams (trades) with the same
affinity (portfolio ID) are processed by the newly created task/PE.
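
A minimal sketch of the failover behaviour described above, assuming keys are routed by hashing the portfolio ID over the list of live nodes. `AffinityRouter`, its method names and the node names are hypothetical and are not taken from S4; the rebuild step is reduced to summing persisted metrics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch in plain Java; this is not the S4 API.
class AffinityRouter {
    private final List<String> liveNodes = new ArrayList<>();

    AffinityRouter(List<String> nodes) {
        liveNodes.addAll(nodes);
    }

    // The same portfolio ID maps to the same node while membership is unchanged.
    String nodeFor(String portfolioId) {
        if (liveNodes.isEmpty()) {
            throw new IllegalStateException("no live nodes");
        }
        int bucket = Math.floorMod(portfolioId.hashCode(), liveNodes.size());
        return liveNodes.get(bucket);
    }

    // Remove a failed node so its keys are taken over by the remaining nodes.
    void markFailed(String node) {
        liveNodes.remove(node);
    }

    // Rebuild an aggregate on the new owner by re-aggregating persisted metrics.
    static double rebuild(List<Double> persistedMetrics) {
        double total = 0.0;
        for (double metric : persistedMetrics) {
            total += metric;
        }
        return total;
    }
}
```

After `markFailed`, later trades for the affected portfolio go to the new owner, which matches the requirement that new events with the same affinity reach the newly created task. Plain modulo hashing also moves many unaffected keys when membership changes; consistent hashing is a common way to limit that.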

Hope it makes sense.


On Wed, Jan 16, 2013 at 6:58 AM, kishore g <> wrote:

> Hi Andriy,
> Before answering your questions, it will be helpful to get some additional
> info.
> How are you storing the aggregation state? Are you storing it on HDFS? If
> yes, are you using the HBase API or the HDFS client directly?
> You mention you need affinity of task and aggregation state. What is the
> desired behavior when the node fails? Do you want the task to be
> launched/taken over on another node, in which case the aggregation state
> will not be available locally, or are you willing to wait until the task is
> restarted on the same node?
> thanks,
> Kishore G
> On Tue, Jan 15, 2013 at 5:55 PM, Andriy Pidgurskiy <> wrote:
>> Hello S4 community.
>> I have spent some time reading everything about S4 I could find, including
>> the original 2010 paper and the 2011 presentations. I have managed to build
>> and start the test applications and tried some scenarios myself.
>> Now I am trying to find out whether S4 fits a very specific use case,
>> and it seems I need your advice on it.
>> Let's assume I have some trades to be analysed separately (one S4 PE) and
>> aggregated (other PEs) by portfolio and by country.
>> - Is it possible to guarantee data locality? I.e. with many tasks per
>> node, how do I ensure that trade analytic results from the same portfolio
>> will always be processed by the same task? The task will hold aggregation
>> state that builds up incrementally.
>> - I've found references to S4 being tested on 8 servers with a total of
>> 128 GB of RAM. Do you think having 20 servers run 200 million PEs using a
>> total of 1 TB of RAM (64 GB per server) should be relatively straightforward?
>> - Is it allowed / appropriate to manipulate streams at runtime? I.e. define
>> and route events based on a trade's attributes without re-deploying the app.
>> - I saw the wiki page about S4 Piper with Hadoop YARN and the S4-25 JIRA
>> issue. Does this integration mean that a "data locality" agreement can be
>> set between Hadoop/HBase and S4? I.e. having PEs instantiated on the server
>> where intermediate results are stored (so my portfolio or country trade
>> analytics are aggregated on the same server every time).
>> Would much appreciate your help.
>> --
>> Regards.
>> Andriy Pidgurskiy.
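
On the sizing question in the original message (20 servers, 64 GB each, 200 million PEs), a back-of-the-envelope budget can be worked out as below. The class `PeMemoryBudget` is hypothetical, "GB" is assumed to mean GiB, and the result ignores JVM and framework overhead, so it is an upper bound on the state each PE could hold rather than a prediction.

```java
// Rough memory budget per PE; the input figures come from the question,
// the class and method names are illustrative only.
class PeMemoryBudget {
    // Average bytes available per PE if all RAM were usable for PE state.
    static long bytesPerPe(int servers, long bytesPerServer, long totalPes) {
        return servers * bytesPerServer / totalPes;
    }

    // Number of PEs each server hosts under an even spread.
    static long pesPerServer(int servers, long totalPes) {
        return totalPes / servers;
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024L * 1024L; // assumption: "GB" means GiB
        System.out.println(bytesPerPe(20, 64 * gib, 200_000_000L)); // prints 6871
        System.out.println(pesPerServer(20, 200_000_000L));         // prints 10000000
    }
}
```

So under these assumptions each server would host about 10 million PEs with under 7 KB available per PE, before any overhead is taken into account.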

Andriy Pidgurskiy.
