incubator-s4-user mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: S4 use-case questions
Date Wed, 16 Jan 2013 23:28:04 GMT
Thanks for the clarification. Please find my answers inline.

- Is it possible to guarantee data locality? I.e. having many tasks per
node, how do I ensure that trade analytic results from the same portfolio
will always be processed by the same task? The task will have aggregation
state and it will build up incrementally.
[KG]: In the current implementation of S4, when a node starts it grabs
whatever task is available; there is no affinity between a process and a
task (or partition). However, with
S4-110 <https://issues.apache.org/jira/browse/S4-110> it will have
affinity: you can assign a unique id to a process
and it will always grab the same partition, which means a given
portfolio will always end up on the same node.
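To illustrate why partition affinity gives portfolio locality, here is a
minimal sketch (PortfolioPartitioner is a hypothetical class, not part of
the S4 API): the event key is hashed to a fixed partition, so once a
process owns that partition it sees every event for the portfolio.

```java
// Hypothetical sketch of key-based partitioning; names are illustrative.
public class PortfolioPartitioner {
    private final int numPartitions;

    public PortfolioPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // The same portfolioId always hashes to the same partition, so the
    // process that owns that partition sees all events for the portfolio.
    public int partitionFor(String portfolioId) {
        return (portfolioId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With S4-110's fixed process-to-partition assignment, this mapping is what
makes a given portfolio land on the same node every time.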

- I've found references to testing S4 on 8 servers with a total of 128 GB of
RAM. Do you think having 20 servers run 200 million PEs using a total of 1 TB
of RAM (64 GB per server) should be relatively straightforward?
[KG]: This depends on what you are storing in each PE.
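As a back-of-envelope check (my own arithmetic, not from the thread): 1 TB
spread across 200 million PEs leaves roughly 5 KB per PE, before JVM object
and PE bookkeeping overhead, which is why the per-PE payload size decides
feasibility here.

```java
// Rough per-PE memory budget: total heap across the cluster divided by
// the PE count. Ignores JVM object headers and framework overhead.
public class PeMemoryBudget {
    public static long bytesPerPe(long totalRamBytes, long peCount) {
        return totalRamBytes / peCount;
    }
}
```

For the numbers in the question, `bytesPerPe(1_000_000_000_000L, 200_000_000L)`
works out to 5000 bytes per PE.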

- Is it allowed / appropriate to manipulate Streams at runtime? I.e. define
and route events based on a trade's attributes without re-deploying the App.
[KG]: Can you elaborate on this? Routing of events is based purely on the
event key. Currently you have to decide upfront which keys you need to
route on. If you want to change that dynamically, you can maintain the key
set in ZooKeeper or HDFS and watch for changes. This might become easier with
S4-110.
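A minimal sketch of the "watch for changes" idea (DynamicKeyRouter is
hypothetical, not an S4 or ZooKeeper class): the current routing-key set is
held behind an atomic reference, and the update would in practice be
triggered by a ZooKeeper watch or an HDFS poll, which is left out here to
keep the sketch self-contained.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical routing-key table that can be swapped at runtime.
// In practice a ZooKeeper watch callback would invoke update().
public class DynamicKeyRouter {
    private final AtomicReference<Set<String>> routingKeys;

    public DynamicKeyRouter(Set<String> initialKeys) {
        this.routingKeys = new AtomicReference<>(Set.copyOf(initialKeys));
    }

    // Atomically replace the key set; readers never see a partial update.
    public void update(Set<String> newKeys) {
        routingKeys.set(Set.copyOf(newKeys));
    }

    public boolean routesOn(String attribute) {
        return routingKeys.get().contains(attribute);
    }
}
```

The atomic swap means event-dispatch threads can keep reading the key set
without locks while the watcher thread replaces it.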

- I saw the wiki page about S4 Piper with Hadoop YARN and the S4-25 JIRA issue.
Does this integration mean that a "data locality" agreement can be set up
between Hadoop/HBase and S4? I.e. having PEs instantiated on the server
where intermediate results are stored (so my portfolio or country trade
analytics are aggregated on the same server every time).
[KG]: This is possible with S4-25, though I don't think we have that
support yet.


Regarding your failover requirement, I think it is feasible. You might also
want to look at the checkpointing framework; it may need added support for
HDFS/HBase.
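To illustrate what such checkpointing might look like (a hypothetical sketch
using plain Java serialization, not the S4 checkpointing API): PE state is
turned into bytes, which a backend could then write to HDFS or HBase and
read back on failover.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical checkpoint helper: serialize PE state to bytes and back.
// A checkpointing backend would store the bytes in HDFS or HBase.
public class PeCheckpoint {
    public static byte[] snapshot(Serializable state) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(state);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static Object restore(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

On failover, a replacement PE on another node would call restore() on the
last stored snapshot instead of re-aggregating from scratch, if the
checkpoint is recent enough.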

Hope this helps.

Thanks
Kishore G




On Wed, Jan 16, 2013 at 12:51 AM, Andriy Pidgurskiy
<a.pidgurskiy@gmail.com> wrote:

> Dear Kishore,
>
> thanks a lot for prompt response.
>
> The aggregation state is stored in memory, so data is persisted only for end
> results, trade analytic metrics and specific intermediate values. However I
> am considering the ability to move data out of memory to disk in case memory
> is scarce (e.g. for not-so-often-updated portfolios).
> The user scenario is this: trades arrive, the portfolio is aggregated, no more
> trades arrive for this portfolio for a while, hence the state is spilled over
> into HBase (or HDFS) and the PE is killed. Some time later a new trade arrives
> for this portfolio and I want to be able to start a PE on the node where the
> data is stored.
>
> In case a node fails I want the task to be taken over by another node and the
> aggregation state rebuilt by re-aggregating the trade analytic metrics.
> It is important though that new data streams (trades) with the same
> affinity (portfolio ID) should be processed by the newly created Task/PE.
>
> Hope it makes sense.
>
> Regards,
> Andriy.
>
>
> On Wed, Jan 16, 2013 at 6:58 AM, kishore g <g.kishore@gmail.com> wrote:
>
>> Hi Andriy,
>>
>> Before answering your questions, it will be helpful to get some
>> additional info.
>>
>> How are you storing the aggregation state? Are you storing it on HDFS? If
>> yes, are you using the HBase API or the HDFS client directly?
>>
>> You mention you need affinity between the task and the aggregation state;
>> what is the desired behavior when a node fails? Do you want the task to be
>> launched/taken over on another node, in which case the aggregation state will
>> not be available locally, or are you willing to wait until the task is
>> restarted on the same node?
>>
>> thanks,
>> Kishore G
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jan 15, 2013 at 5:55 PM, Andriy Pidgurskiy <
>> a.pidgurskiy@gmail.com> wrote:
>>
>>> Hello S4 community.
>>>
>>>
>>> I have spent some time reading all I could find about S4, including the
>>> original 2010 paper and presentations from 2011. I have managed to build and
>>> start test applications and tried some scenarios myself.
>>>
>>> Now I am trying to find out if S4 can fit a very specific use case,
>>> and it seems I need your advice regarding it.
>>>
>>> Let's assume I have some trades to be separately analysed (S4 PEs) and
>>> aggregated (other PEs) by portfolio and by country.
>>>
>>> - Is it possible to guarantee data locality? I.e. having many tasks per
>>> node, how do I ensure that trade analytic results from the same portfolio
>>> will always be processed by the same task? The task will have aggregation
>>> state and it will build up incrementally.
>>> - I've found references to testing S4 on 8 servers with a total of 128 GB of
>>> RAM. Do you think having 20 servers run 200 million PEs using a total of 1 TB
>>> of RAM (64 GB per server) should be relatively straightforward?
>>> - Is it allowed / appropriate to manipulate Streams at runtime? I.e. define
>>> and route events based on a trade's attributes without re-deploying the App.
>>> - I saw the wiki page about S4 Piper with Hadoop YARN and the S4-25 JIRA
>>> issue. Does this integration mean that a "data locality" agreement can be set
>>> up between Hadoop/HBase and S4? I.e. having PEs instantiated on the server
>>> where intermediate results are stored (so my portfolio or country trade
>>> analytics are aggregated on the same server every time).
>>>
>>>
>>> Would much appreciate your help.
>>>
>>>
>>>
>>> --
>>> Regards.
>>> Andriy Pidgurskiy.
>>>
>>
>>
>
>
> --
> Regards.
> Andriy Pidgurskiy.
