incubator-s4-user mailing list archives

From Andriy Pidgurskiy <a.pidgurs...@gmail.com>
Subject Re: S4 use-case questions
Date Thu, 17 Jan 2013 00:09:00 GMT
Thanks a lot Kishore,


Your assistance is very comprehensive and much appreciated.

Have a good evening.


Regards,
Andriy.



On Wed, Jan 16, 2013 at 11:28 PM, kishore g <g.kishore@gmail.com> wrote:

> Thanks for the clarification.  Please find my answers inline.
>
> - Is it possible to guarantee data locality? I.e., having many tasks per
> node, how do I ensure that trade analytic results from the same portfolio
> will always be processed by the same task? The task will hold aggregation
> state and build it up incrementally.
> [KG]: In the current implementation of S4, when a node starts it grabs
> whatever task is available. There is no affinity of a process to a task
> or, say, a partition. However, with S4-110
> <https://issues.apache.org/jira/browse/S4-110> it will have affinity.
> Basically, you can assign a unique id to a process and it will always
> grab the same partition, which means a given portfolio will always land
> on the same node.
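The key-to-partition mapping behind that affinity can be sketched as follows. This is only an illustration of hash-based partitioning under the assumption that S4 hashes the event key to pick a partition; the class and method names are hypothetical, not S4 API:

```java
// Hypothetical sketch: hash the event key (here, a portfolio ID) to a
// partition. If each process is pinned to a fixed partition (as S4-110
// proposes), the same portfolio always lands on the same node.
public class PartitionSketch {
    static int partitionFor(String portfolioId, int partitionCount) {
        // Mask with Integer.MAX_VALUE to keep the result non-negative,
        // even when hashCode() is Integer.MIN_VALUE
        return (portfolioId.hashCode() & Integer.MAX_VALUE) % partitionCount;
    }

    public static void main(String[] args) {
        int partitions = 20; // e.g. one partition per server
        String portfolio = "PORTFOLIO-42";
        // The same key always maps to the same partition
        System.out.println(partitionFor(portfolio, partitions)
                == partitionFor(portfolio, partitions)); // true
    }
}
```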
>
> - I've found references to testing S4 on 8 servers with a total of 128 GB
> of RAM. Do you think having 20 servers run 200 million PEs using a total
> of 1 TB of RAM (64 GB per server) should be relatively straightforward?
> [KG]: This depends on what you are storing in your PE.
>
> - Is it allowed/appropriate to manipulate streams at runtime? I.e., define
> and route events based on a trade's attributes without redeploying the app.
> [KG]: Can you elaborate on this? Routing of events is purely based on the
> event key. Currently you have to decide upfront on the keys you need to
> route on. If you want to change that dynamically, you can maintain the key
> set in ZooKeeper/HDFS and watch for changes. This might become easier with
> S4-110.
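The "watch for changes" idea above could look roughly like this on the application side: the routing keys live in shared storage (ZooKeeper or HDFS, per the suggestion) and the app swaps in the new set atomically when notified. The wiring to an actual ZooKeeper watcher is deliberately elided, and the class name and comma-separated encoding are assumptions for illustration only:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative holder for a dynamically updated routing-key set.
// A real setup would invoke onKeysChanged() from a ZooKeeper watch
// callback when the znode holding the key list changes.
public class DynamicRoutingKeys {
    private final AtomicReference<List<String>> keys =
            new AtomicReference<>(List.of());

    // Called by the (elided) watch callback with the new key list
    public void onKeysChanged(String commaSeparated) {
        keys.set(List.of(commaSeparated.split(",")));
    }

    public boolean isRoutingKey(String attribute) {
        return keys.get().contains(attribute);
    }
}
```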
>
> - I saw the wiki page about S4 Piper with Hadoop YARN and the S4-25 JIRA
> issue. Does this integration mean that a "data locality" agreement can be
> set between Hadoop/HBase and S4? I.e., having PEs instantiated on the
> server where intermediate results are stored (so my portfolio or country
> trade analytics are aggregated on the same server every time).
> [KG]: This is possible with S4-25, though I don't think we have that
> support yet.
>
>
> Regarding your failover requirement, I think it's feasible. You might also
> want to look at the checkpointing framework; it might need added support
> for HDFS/HBase.
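The checkpoint/restore cycle alluded to above can be sketched minimally: serialize a PE's aggregation state to bytes (which a backend such as HDFS or HBase would then store) and rebuild it on another node after failover. The class name and fields are assumptions for illustration, not part of S4's checkpointing API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustrative PE aggregation state with a checkpoint/restore round trip.
public class PortfolioState implements Serializable {
    private static final long serialVersionUID = 1L;
    double aggregateValue;
    long tradeCount;

    // Serialize this state; a storage backend would persist these bytes
    byte[] checkpoint() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(this);
        }
        return bos.toByteArray();
    }

    // Rebuild the state on whichever node takes over the task
    static PortfolioState restore(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (PortfolioState) ois.readObject();
        }
    }
}
```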
>
> Hope this helps.
>
> Thanks
> Kishore G
>
>
>
>
> On Wed, Jan 16, 2013 at 12:51 AM, Andriy Pidgurskiy <
> a.pidgurskiy@gmail.com> wrote:
>
>> Dear Kishore,
>>
>> thanks a lot for prompt response.
>>
>> The aggregation state is stored in memory, so data is persisted only for
>> end results, trade analytic metrics, and specific intermediate values.
>> However, I am considering the ability to move data out of memory onto
>> disk in case memory is scarce (e.g. for not-so-often-updated portfolios).
>> The user scenario is: trades arrive, the portfolio is aggregated, and no
>> more trades arrive for this portfolio for a while, so memory is spilled
>> over into HBase (or HDFS) and the PE is killed. Some time later a new
>> trade arrives for this portfolio, and I want to be able to start the PE
>> on the node where the data is stored.
>>
>> In case a node fails, I want the task to be taken over by another node
>> and the aggregation state rebuilt by re-aggregating the trade analytic
>> metrics. It is important, though, that new data streams (trades) with the
>> same affinity (portfolio ID) be processed by the newly created task/PE.
>>
>> Hope it makes sense.
>>
>> Regards,
>> Andriy.
>>
>>
>> On Wed, Jan 16, 2013 at 6:58 AM, kishore g <g.kishore@gmail.com> wrote:
>>
>>> Hi Andriy,
>>>
>>> Before answering your questions, it will be helpful to get some
>>> additional info.
>>>
>>> How are you storing the aggregation state? Are you storing it on HDFS?
>>> If yes, are you using the HBase API or the HDFS client directly?
>>>
>>> You mention you need affinity between the task and the aggregation
>>> state; what is the desired behavior when a node fails? Do you want the
>>> task to be launched/taken over on another node, in which case the
>>> aggregation state will not be available locally, or are you willing to
>>> wait until the task is restarted on the same node?
>>>
>>> thanks,
>>> Kishore G
>>>
>>> On Tue, Jan 15, 2013 at 5:55 PM, Andriy Pidgurskiy <
>>> a.pidgurskiy@gmail.com> wrote:
>>>
>>>> Hello S4 community.
>>>>
>>>>
>>>> I have spent some time reading everything about S4 I could find,
>>>> including the original 2010 paper and the 2011 presentations. I have
>>>> managed to build and start test applications and have tried some
>>>> scenarios myself.
>>>>
>>>> Now I am trying to find out whether S4 can fit a very specific use
>>>> case, and it seems I need your advice on it.
>>>>
>>>> Let's assume I have some trades to be separately analysed (by an S4
>>>> PE) and aggregated (by other PEs) by portfolio and by country.
>>>>
>>>> - Is it possible to guarantee data locality? I.e., having many tasks
>>>> per node, how do I ensure that trade analytic results from the same
>>>> portfolio will always be processed by the same task? The task will
>>>> hold aggregation state and build it up incrementally.
>>>> - I've found references to testing S4 on 8 servers with a total of
>>>> 128 GB of RAM. Do you think having 20 servers run 200 million PEs
>>>> using a total of 1 TB of RAM (64 GB per server) should be relatively
>>>> straightforward?
>>>> - Is it allowed/appropriate to manipulate streams at runtime? I.e.,
>>>> define and route events based on a trade's attributes without
>>>> redeploying the app.
>>>> - I saw the wiki page about S4 Piper with Hadoop YARN and the S4-25
>>>> JIRA issue. Does this integration mean that a "data locality"
>>>> agreement can be set between Hadoop/HBase and S4? I.e., having PEs
>>>> instantiated on the server where intermediate results are stored (so
>>>> my portfolio or country trade analytics are aggregated on the same
>>>> server every time).
>>>>
>>>>
>>>> Would much appreciate your help.
>>>>
>>>>
>>>>
>>>> --
>>>> Regards.
>>>> Andriy Pidgurskiy.
>>>>
>>>
>>>
>>
>>
>> --
>> Regards.
>> Andriy Pidgurskiy.
>
>
>


-- 
Regards.
Andriy Pidgurskiy.
