asterixdb-users mailing list archives

From Raman Grover <ramangrove...@gmail.com>
Subject Re: Socket feed questions
Date Tue, 27 Oct 2015 16:48:10 GMT
Hi,


When data is being received from an external source (e.g. during feed
ingestion), a slow arrival rate may result in excessive delays before the
data is deposited into the target dataset and made accessible to queries.
Data moves along a data ingestion pipeline between operators as packed,
fixed-size frames. The default behavior is to wait for a frame to be full
before dispatching the contained data to the downstream operator. However,
as noted, this may not suit all scenarios, particularly when the data
source is sending data at a low rate. To cater to different scenarios,
AsterixDB allows this behavior to be configured. The different options are
described next.

*Push data downstream when*
(a) the frame is full (default)
(b) at least N records (data items) have been collected into a partially
filled frame
(c) at least T seconds have elapsed since the last record was put into the
frame

*How to configure the behavior?*
When defining a feed, an end user may specify configuration parameters
that determine the runtime behavior (option (a), (b), or (c) from above).

The parameters are described below:

*"parser-policy"*: A specific strategy chosen from a set of pre-defined
values -
  (i)  *  "frame_full"*
 This is the default value. As the name suggests, this choice causes frames
to be pushed by the feed adaptor only when there isn't sufficient space for
an additional record to fit in. This corresponds to option (a).

 (ii)   * "counter_timer_expired"  *
 Use this as the value if you wish to set either option (b) or (c)  or a
combination of both.
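
For reference, a feed definition that states the default explicitly would
follow the same pattern as the examples below (a minimal sketch, reusing
the placeholder names my_feed and my_adaptor):

 create feed my_feed using my_adaptor
 (("parser-policy"="frame_full"), ... other parameters);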

*Some Examples*

1) Pack a maximum of 100 records into a data frame and then push it
downstream.

 create feed my_feed using my_adaptor
 (("parser-policy"="counter_timer_expired"), ("batch-size"="100"), ... other parameters);

2) Wait up to 2 seconds and then send downstream however many records have
been collected in the frame.

 create feed my_feed using my_adaptor
 (("parser-policy"="counter_timer_expired"), ("batch-interval"="2"), ... other parameters);

3) Push the frame downstream once 100 records have been collected in it,
or once 2 seconds have elapsed since the last record was put into the
current frame, whichever happens first.

 create feed my_feed using my_adaptor
 (("parser-policy"="counter_timer_expired"), ("batch-interval"="2"),
 ("batch-size"="100"), ... other parameters);


*Note*
The above config parameters are not specific to a particular
implementation of an adaptor but are available for use with any feed
adaptor. Some adaptors that ship with AsterixDB use different default
values for these parameters to suit their specific scenarios. E.g., the
pull-based Twitter adaptor uses "counter_timer_expired" as the
"parser-policy" and sets the "batch-interval" parameter.
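
To make that concrete, such an adaptor effectively behaves as if the feed
had been defined along these lines (a sketch only; the adaptor name
"pull_twitter" and the 2-second interval are illustrative, not the actual
shipped defaults):

 create feed twitter_feed using pull_twitter
 (("parser-policy"="counter_timer_expired"), ("batch-interval"="2"), ... other parameters);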


Regards,
Raman
PS: The names of the parameters described above are not as intuitive as one
would like them to be. The names need to be changed.








On Thu, Oct 22, 2015 at 9:09 AM, Mike Carey <dtabass@gmail.com> wrote:

> I think we need to have tuning parameters - like batch size and maximum
> tolerable latency (in case there's a lull and you still want to push stuff
> with some worst-case delay). @Raman Grover - remind me (us) what's
> available in this regard?
>
> On 10/22/15 4:29 AM, Pääkkönen Pekka wrote:
>
> Hi,
>
>
>
> Yes, you are right. I tried sending a larger amount of data, and data is
> now stored to the database.
>
>
>
> Does it make sense to configure a smaller batch size in order to get more
> frequent writes?
>
> Or would it significantly impact performance?
>
>
>
> -Pekka
>
>
>
>
>
> Data moves through the pipeline in frame-sized batches, so one
>
> (uninformed :-)) guess is that you aren't running very long, and you're
>
> only seeing the data flow when you close because only then do you have a
>
> batch's worth.  Is that possible?  You can test this by running longer
>
> (more data) and seeing if you start to see the expected incremental
>
> flow/inserts.  (And we need tunability in this area, e.g., parameters on
>
> how much batching and/or how much latency to tolerate on each feed.)
>
>
>
> On 10/21/15 4:45 AM, Pääkkönen Pekka wrote:
>
> >
>
> > Hi,
>
> >
>
> > Thanks, now I am able to create a socket feed, and save items to the
>
> > dataset from the feed.
>
> >
>
> > It seems that data items are written to the dataset after I close the
>
> > socket at the client.
>
> >
>
> > Is there some way to indicate to AsterixDB feed (with a newline or
>
> > other indicator) that data can be written to the database, when the
>
> > connection is open?
>
> >
>
> > After I close the socket at the client, the feed seems to close down.
>
> > Or is it only paused, until it is resumed?
>
> >
>
> > -Pekka
>
> >
>
> > Hi Pekka,
>
> >
>
> > That's interesting, I'm not sure why the CC would appear as being down
>
> >
>
> > to Managix. However, if you can access the web console, then that
>
> >
>
> > evidently isn't the case.
>
> >
>
> > As for data ingestion via sockets, yes it is possible, but it kind of
>
> >
>
> > depends on what's meant by sockets. There's no tutorial for it, but
>
> >
>
> > take a look at SocketBasedFeedAdapter in the source, as well as
>
> >
>
> >
> https://github.com/kisskys/incubator-asterixdb/blob/kisskys/indexonlyhilbertbtree/asterix-experiments/src/main/java/org/apache/asterix/experiment/client/SocketTweetGenerator.java
>
> >
>
> > for some examples of how it works.
>
> >
>
> > Hope that helps!
>
> >
>
> > Thanks,
>
> >
>
> > -Ian
>
> >
>
> > On Mon, Oct 19, 2015 at 10:15 PM, Pääkkönen Pekka
>
> > <Pekka.Paakkonen@vtt.fi> <Pekka.Paakkonen@vtt.fi> wrote:
>
> > > Hi Ian,
>
> > >
>
> > >
>
> > >
>
> > > Thanks for the reply.
>
> > >
>
> > > I compiled AsterixDB v0.87 and started it.
>
> > >
>
> > >
>
> > >
>
> > > However, I get the following warnings:
>
> > >
>
> > > INFO: Name:my_asterix
>
> > >
>
> > > Created:Mon Oct 19 08:37:16 UTC 2015
>
> > >
>
> > > Web-Url:http://192.168.101.144:19001
>
> > >
>
> > > State:UNUSABLE
>
> > >
>
> > >
>
> > >
>
> > > WARNING!:Cluster Controller not running at master
>
> > >
>
> > >
>
> > >
>
> > > Also, I see the following warnings in my_asterixdb1.log. There are no
>
> > > warnings or errors in cc.log
>
> > >
>
> > > “
>
> > >
>
> > > Oct 19, 2015 8:37:39 AM
>
> > > org.apache.hyracks.api.lifecycle.LifeCycleComponentManager configure
>
> > >
>
> > > SEVERE: LifecycleComponentManager configured
>
> > > org.apache.hyracks.api.lifecycle.LifeCycleComponentManager@7559ec47
>
> > >
>
> > > ..
>
> > >
>
> > > INFO: Completed sharp checkpoint.
>
> > >
>
> > > Oct 19, 2015 8:37:40 AM
> org.apache.asterix.om.util.AsterixClusterProperties
>
> > > getIODevices
>
> > >
>
> > > WARNING: Configuration parameters for nodeId my_asterix_node1 not
> found. The
>
> > > node has not joined yet or has left.
>
> > >
>
> > > Oct 19, 2015 8:37:40 AM
> org.apache.asterix.om.util.AsterixClusterProperties
>
> > > getIODevices
>
> > >
>
> > > WARNING: Configuration parameters for nodeId my_asterix_node1 not
> found. The
>
> > > node has not joined yet or has left.
>
> > >
>
> > > Oct 19, 2015 8:38:38 AM
>
> > > org.apache.hyracks.control.common.dataset.ResultStateSweeper sweep
>
> > >
>
> > > INFO: Result state cleanup instance successfully completed.”
>
> > >
>
> > >
>
> > >
>
> > > It seems that AsterixDB is running, and I can access it at port 19001.
>
> > >
>
> > >
>
> > >
>
> > > The documentation shows ingestion of tweets, but I would be interested
> in
>
> > > using sockets.
>
> > >
>
> > > Is it possible to ingest data from sockets?
>
> > >
>
> > >
>
> > >
>
> > > Regards,
>
> > >
>
> > > -Pekka
>
> > >
>
> > >
>
> > >
>
> > >
>
> > >
>
> > >
>
> > >
>
> > > Hey there Pekka,
>
> > >
>
> > > Your intuition is correct, most of the newer feed features are in the
>
> > >
>
> > > current master branch and not in the (very) old 0.8.6 release. If you'd
>
> > >
>
> > > like to experiment with them you'll have to build from source. The
> details
>
> > >
>
> > > about that are here:
>
> > >
>
> > >
> https://asterixdb.incubator.apache.org/dev-setup.html#setting-up-an-asterix-development-environment-in-eclipse
>
> > >
>
> > > , but they're probably a bit overkill for just trying to get the
> compiled
>
> > >
>
> > > binaries. For that all you really need to do is :
>
> > >
>
> > > - Clone Hyracks from git
>
> > >
>
> > > - 'mvn clean install -DskipTests'
>
> > >
>
> > > - Clone AsterixDB
>
> > >
>
> > > - 'mvn clean package -DskipTests'
>
> > >
>
> > > Then, the binaries will sit in asterix-installer/target
>
> > >
>
> > >
>
> > >
>
> > >
>
> > >
>
> > > For an example, the documentation shows how to set up a feed that's
>
> > >
>
> > > ingesting Tweets:
>
> > >
>
> > >
> https://asterix-jenkins.ics.uci.edu/job/asterix-test-full/site/asterix-doc/feeds/tutorial.html
>
> > >
>
> > >
>
> > >
>
> > >
>
> > >
>
> > > Thanks,
>
> > >
>
> > > -Ian
>
> > >
>
> > >
>
> > >
>
> > >
>
> > >
>
> > > On Wed, Oct 7, 2015 at 9:48 PM, Pääkkönen Pekka
> <Pekka.Paakkonen@vtt.fi> <Pekka.Paakkonen@vtt.fi>
>
> > >
>
> > > wrote:
>
> > >
>
> > >
>
> > >
>
> > >> Hi,
>
> > >
>
> > >>
>
> > >
>
> > >>
>
> > >
>
> > >>
>
> > >
>
> > >> I would like to experiment with a socket-based feed.
>
> > >
>
> > >>
>
> > >
>
> > >> Can you point me to an example on how to utilize them?
>
> > >
>
> > >>
>
> > >
>
> > >> Do I need to install 0.8.7-snapshot version of AsterixDB in order to
>
> > >
>
> > >> experiment with feeds?
>
> > >
>
> > >>
>
> > >
>
> > >>
>
> > >
>
> > >>
>
> > >
>
> > >> Regards,
>
> > >
>
> > >>
>
> > >
>
> > >> -Pekka Pääkkönen
>
> > >
>
> > >>
>
> > >
>
> > >
>
> >
>
>
>
>
>


-- 
Raman
