asterixdb-dev mailing list archives

From abdullah alamoudi <bamou...@gmail.com>
Subject Re: Parallel feed ingestion
Date Wed, 24 May 2017 01:28:55 GMT
Sorry for not replying sooner. I am catching up with the email overload.

Xikui is right. The socket adapter would be a poor choice for saturating the nodes. I think
something like the firehose would be best for that, and you can run it on multiple NCs.
Another option would be to use the filesystem feed, or to update the socket adapter to accept
and process multiple connections (I would go with this one, since it is the most interesting
and you would end up with a very useful adapter).
Multiple feeds running concurrently should work too.
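
To make the multi-connection idea concrete, here is a minimal Python sketch. It is not AsterixDB code (the real adapter is Java), and every name in it (RecordHandler, MultiClientServer, run_demo) is hypothetical. It assumes newline-delimited records and serves each client connection on its own thread, so a second sender is not stuck waiting behind the first:

```python
import queue
import socket
import socketserver
import threading


class RecordHandler(socketserver.StreamRequestHandler):
    """Reads newline-delimited records from one client connection."""

    def handle(self):
        for line in self.rfile:
            record = line.rstrip(b"\r\n")
            if record:
                # Hand the record to the shared sink (a stand-in for the
                # ingestion pipeline; an assumption of this sketch).
                self.server.records.put(record)


class MultiClientServer(socketserver.ThreadingTCPServer):
    """TCP server that handles every accepted connection in its own thread."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address):
        self.records = queue.Queue()
        super().__init__(address, RecordHandler)


def send_records(port, payloads):
    """Client side: open one connection and send each payload as a line."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        for payload in payloads:
            conn.sendall(payload + b"\n")


def run_demo(num_clients=2, per_client=500):
    """Run several clients concurrently and return every record received."""
    with MultiClientServer(("127.0.0.1", 0)) as server:
        port = server.server_address[1]
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            clients = [
                threading.Thread(
                    target=send_records,
                    args=(port, [b"client%d-rec%d" % (c, i)
                                 for i in range(per_client)]),
                )
                for c in range(num_clients)
            ]
            for t in clients:
                t.start()
            for t in clients:
                t.join()
            expected = num_clients * per_client
            # Block until every record has arrived (or fail after 5 s each).
            return [server.records.get(timeout=5) for _ in range(expected)]
        finally:
            server.shutdown()
```

Calling run_demo(num_clients=3, per_client=200) should hand back all 600 records, with records from different clients arriving in no particular order relative to each other.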

Cheers,
~Abdullah.

> On May 17, 2017, at 1:25 PM, Xikui Wang <xikuiw@uci.edu> wrote:
> 
> Hi,
> 
> Firstly, 3) won't work well, as the socket server inside AsterixDB accepts
> connections from the client side one at a time. What you will observe with two
> clients sending data to one socket simultaneously is that the 1st client goes
> through and the 2nd is blocked after several hundred records. This continues
> until the 1st one finishes.
> 
> The comparison between 1) and 2) is interesting. (@Abdullah please correct
> me if I'm wrong.)
> IMO, 1) achieves parallelism at the operator level by having the intake
> operator run on the designated nodes simultaneously. 2) achieves it at the job
> level by simply starting several jobs that run independently. I think 1) may
> have less overhead than 2), since the part of the workflow that could be shared
> is duplicated multiple times in 2).
> It would be useful to see how these two perform under saturated conditions.
> 
> Best,
> Xikui
> 
> On Wed, May 17, 2017 at 12:11 PM, Mike Carey <dtabass@gmail.com> wrote:
> 
>> @Xikui?  @Abdullah?
>> 
>> 
>> 
>> On 5/17/17 11:40 AM, Ildar Absalyamov wrote:
>> 
>>> In light of Steven’s discussion about feeds in a parallel thread, I was
>>> wondering what the correct way would be to push parallel ingestion as far as
>>> possible in a multinode/multipartition environment.
>>> In one of my experiments I am trying to saturate the ingestion to see the
>>> effect of computing stats in background.
>>> Several things I’ve tried:
>>> 1) Open a socket adapter on all NCs:
>>> create feed Feed using socket_adapter
>>> (
>>>     ("sockets"="NC1:10001,NC2:10001,…"),
>>> …)
>>> 
>>> 2) Connect several feeds to a single dataset:
>>> create feed Feed1 using socket_adapter
>>> (
>>>     ("sockets"="NC1:10001"),
>>> …)
>>> create feed Feed2 using socket_adapter
>>> (
>>>     ("sockets"="NC2:10001"),
>>> …)
>>> 
>>> 3) Have several nodes sending data into a single socket.
>>> 
>>> In my previous experiments the parallelization did not quite show that
>>> the bottleneck was on the sender side, but I am wondering if that will
>>> still be the case, since a lot of things happened under the hood since the
>>> last time.
>>> 
>>> Best regards,
>>> Ildar
>>> 
>> 
>> 

