incubator-s4-user mailing list archives

From Jagmohan Chauhan <>
Subject Re: Performance issue in Yahoo S4
Date Thu, 22 Mar 2012 07:14:25 GMT

Kishore: Thanks for bringing up a very important question. We were actually
thinking along the same lines today.

Let us give you a brief overview of our setup and of what we understand about
the Yahoo S4 architecture as far as the client adapter is concerned.

1. Our client application runs on one node and reads its input
sequentially from a file, line by line.
2. The client application reads a line, builds a message, and dispatches it
to the client adapter, which runs on the same node. So we connect to
localhost on port 2334.
3. We think the client adapter is an internal part of Yahoo S4 that we do
not have to touch. It should read the incoming message from our client
application, convert it into a JSON object to make an event, and then send
the event across the network to the S4 nodes for further processing. We
have three other nodes in the cluster for S4 processing. So we have not
changed anything in the adapter.
4. We are using an NFS filesystem, and all nodes are on the same switch.
5. We measure the total time as the interval between the moment our client
application starts sending data and the moment it finishes.
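
For reference, the measurement described in steps 1, 2 and 5 can be sketched
roughly as below. This is only an illustration, not our actual client: the
function name send_lines and the newline-delimited message format are
hypothetical, and a real client would speak the S4 client protocol rather than
write raw lines to a socket. The demo uses a local socket pair so that it runs
without an adapter listening on localhost:2334.

```python
import socket
import time


def send_lines(lines, sock):
    """Send each line sequentially over sock; return (count, elapsed_seconds).

    The newline framing is a placeholder, not the S4 wire protocol.
    """
    start = time.perf_counter()
    count = 0
    for line in lines:
        # "Make message" step: hypothetical format, one line per message.
        message = line.rstrip("\n") + "\n"
        sock.sendall(message.encode("utf-8"))
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed


if __name__ == "__main__":
    # Stand-in for the adapter connection: a connected local socket pair.
    client, server = socket.socketpair()
    n, secs = send_lines(["hello world.\n", "second line.\n"], client)
    client.close()
    received = server.makefile("rb").read()
    server.close()
    print(n, "lines sent in", round(secs, 6), "s;", len(received), "bytes received")
```

In a sketch like this the timer stops once the last line has been handed to
the socket, so it covers the sending side only.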

There is one other thing that is confusing us: the port number in
client-stub.xml is 2334, and that is where we send our messages from the
client application. But we also see a client adapter port in clusters.xml.
What is the relation between the two?

We thought of using two client adapters and tried it today, but could not
get it to work.
If someone can clear up our doubts and shed some light on how we can run
multiple client adapters, or make the existing adapter multi-threaded, it
would help us investigate our issue.
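
To make the question concrete, the kind of multi-threaded sending we have in
mind could look roughly like the sketch below. It is not S4 code: fan_out,
partition_round_robin and send_fn are hypothetical names, and send_fn stands
in for whatever actually delivers a batch of events to one node. The point is
only to illustrate sending to several destinations concurrently instead of
one after another.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_round_robin(events, n):
    """Split events into n batches, assigning event i to batch i % n."""
    batches = [[] for _ in range(n)]
    for i, event in enumerate(events):
        batches[i % n].append(event)
    return batches


def fan_out(events, destinations, send_fn):
    """Send one batch per destination concurrently.

    send_fn(destination, batch) is a hypothetical callable that delivers the
    batch and returns the number of events sent.
    """
    batches = partition_round_robin(events, len(destinations))
    with ThreadPoolExecutor(max_workers=len(destinations)) as pool:
        futures = [pool.submit(send_fn, dest, batch)
                   for dest, batch in zip(destinations, batches)]
        # Collect results in the same order as destinations.
        return [f.result() for f in futures]


if __name__ == "__main__":
    sent = fan_out(list(range(10)), ["node1", "node2", "node3"],
                   lambda dest, batch: len(batch))
    print(sent)  # [4, 3, 3]
```

Round-robin partitioning ignores event keys; it is used here only to keep the
sketch short, and keyed events would need to be routed by key instead.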

On Wed, Mar 21, 2012 at 11:35 PM, kishore g <> wrote:

> Is the client adapter reading and sending the data sequentially? If so, the
> time taken will be roughly the same in all three cases. You will only see
> an improvement if the client adapter is able to send data to all nodes
> simultaneously. Is the adapter multi-threaded?
> Can you tell us more about the setup and how the total time is measured?
> On Wed, Mar 21, 2012 at 11:04 PM, Jagmohan Chauhan <> wrote:
>> Hi,
>> We checked this today and saw that the different nodes are getting
>> different events. We checked by printing the words, and they were
>> different. So the client adapter is not sending the same data to each node.
>> On Wed, Mar 21, 2012 at 9:38 AM, Matthieu Morel <> wrote:
>>> Hi,
>>> I wonder whether you are sending the same data from the adapter to each
>>> of the nodes. Can you check that? (You could compare the final word counts,
>>> between settings with 1 or more nodes).
>>> Regards,
>>> Matthieu
>>> On 3/21/12 4:38 AM, Jagmohan Chauhan wrote:
>>>> Hi,
>>>> We are working with Yahoo S4 for a project. We are using a simple
>>>> application that reads words from a file, assembles them into
>>>> sentences, and prints the sentences on the console. We have written two
>>>> PEs for it. The first PE receives the words emitted by the client
>>>> adapter, looks for a "." (which marks the end of a sentence), forms a
>>>> sentence, and sends it to the next PE. The second PE takes the sentence
>>>> and prints it on the console. The file from which our client application
>>>> reads and feeds input to the adapter is 1 GB in size. The first PE is
>>>> keyless, while for the second one we ran experiments with the same key
>>>> as well as with different keys.
>>>> We are seeing an unusual issue when we try different configurations of
>>>> nodes. We are running the application on a cluster of 4 systems: 1
>>>> system for the client adapter and the other three as processing nodes.
>>>> The issue is that the execution time increases with the number of nodes
>>>> for the same data set (file).
>>>> Here are some statistics:
>>>> 1-node configuration: 2 min 10 sec
>>>> 2-node configuration: 2 min 30 sec
>>>> 3-node configuration: 2 min 40 sec
>>>> We cannot explain this, as we expected the execution time to improve
>>>> with more nodes. Can anyone please shed some light on this issue? Is
>>>> the overhead of disseminating events so high that it does not improve
>>>> the execution time?
>>>> --
>>>> Thanks and Regards
>>>> Jagmohan Chauhan
>>>> MSc student,CS
>>>> Univ. of Saskatchewan
>>>> IEEE Graduate Student Member
>> --
>> Thanks and Regards
>> Jagmohan Chauhan
>> MSc student,CS
>> Univ. of Saskatchewan
>> IEEE Graduate Student Member

Thanks and Regards
Jagmohan Chauhan
MSc student,CS
Univ. of Saskatchewan
IEEE Graduate Student Member
