spark-dev mailing list archives

From "Mario Ds Briggs" <mario.bri...@in.ibm.com>
Subject Re: submissionTime vs batchTime, DirectKafka
Date Thu, 10 Mar 2016 05:21:28 GMT

Look at
  org.apache.spark.streaming.scheduler.JobGenerator

It has a RecurringTimer (timer) that simply posts GenerateJobs events to an
EventLoop at every batch interval.

The EventLoop's thread then picks up these events and uses the
streamingContext.graph to generate a Job (the InputDStream's compute method).
batchInfo.submissionTime is the time recorded after this job generation
completes. The Job is then handed to
org.apache.spark.streaming.scheduler.JobScheduler, which uses a thread pool
to execute it.

GenerateJobs events are not the only events posted to the
JobGenerator.eventLoop. Others, such as DoCheckpoint, ClearMetadata, and
ClearCheckpointData, are also posted, and all of them are serviced by the
EventLoop's single thread. So, for instance, if DoCheckpoint,
ClearCheckpointData, and ClearMetadata events are queued ahead of your nth
GenerateJobs event, there will be a time difference between the batchTime and
the submissionTime for that nth batch.
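The queuing effect described above can be sketched with a plain
single-threaded executor standing in for Spark's EventLoop. This is an
illustration only, not Spark's actual classes; the event names in the
comments refer to the discussion above and the sleep durations are made up:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// A single-threaded executor stands in for JobGenerator's EventLoop: all
// events, whatever their type, are serviced one at a time by one thread.
val loop = Executors.newSingleThreadExecutor()

val batchTime = System.currentTimeMillis()

// Events like DoCheckpoint / ClearMetadata queued ahead of the GenerateJobs
// event each occupy the single thread for a while (50 ms each here).
for (_ <- 1 to 3) loop.execute(() => Thread.sleep(50))

// The GenerateJobs event records its submission time only when it finally runs.
@volatile var submissionTime = 0L
loop.execute(() => submissionTime = System.currentTimeMillis())

loop.shutdown()
loop.awaitTermination(5, TimeUnit.SECONDS)

val delay = submissionTime - batchTime
println(s"submissionTime - batchTime = $delay ms")
```

With three 50 ms events queued ahead, the gap comes out to roughly 150 ms or
more, even though nothing about the batch itself was slow.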


thanks
Mario






On Thu, Mar 10, 2016 at 10:29 AM, Sachin Aggarwal <
different.sachin@gmail.com> wrote:
  Hi cody,

  let me try once again to explain with an example.

  In the BatchInfo class of Spark, "scheduling delay" is defined as

  def schedulingDelay: Option[Long] = processingStartTime.map(_ -
  submissionTime)

  I am dumping the BatchInfo object in my LatencyListener, which
  extends StreamingListener:
  batchTime = 1457424695400 ms
  submissionTime = 1457425630780 ms
  difference = 935380 ms

  Can this be considered a lag in the processing of events? What is a
  possible explanation for this lag?
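The relationship between these timestamps can be checked in plain Scala,
using the numbers from the dump above. Note that processingStartTime below is
a made-up value, since it is not given in the message:

```scala
// Timestamps from the listener dump above; processingStartTime is hypothetical.
val batchTime: Long      = 1457424695400L
val submissionTime: Long = 1457425630780L
val processingStartTime: Option[Long] = Some(1457425630900L) // made up

// Mirrors BatchInfo.schedulingDelay: it only covers submission -> processing
// start, so it is small even when the batch was generated very late.
val schedulingDelay: Option[Long] = processingStartTime.map(_ - submissionTime)

// The gap observed in the dump is batchTime -> submissionTime, which
// schedulingDelay does not include.
val generationLag: Long = submissionTime - batchTime

println(s"schedulingDelay = $schedulingDelay ms")
println(s"generationLag   = $generationLag ms")
```

This makes the question concrete: the roughly 935-second gap sits entirely
between batchTime and submissionTime, before schedulingDelay even starts
counting.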

  On Thu, Mar 10, 2016 at 12:22 AM, Cody Koeninger <cody@koeninger.org>
  wrote:
   I'm really not sure what you're asking.

   On Wed, Mar 9, 2016 at 12:43 PM, Sachin Aggarwal
   <different.sachin@gmail.com> wrote:
   > where are we capturing this delay?
   > I am aware of the scheduling delay, which is defined as processing
   > time minus submission time, not the batch create time.
   >
   > On Wed, Mar 9, 2016 at 10:46 PM, Cody Koeninger <cody@koeninger.org>
   wrote:
   >>
   >> Spark streaming by default will not start processing a batch until the
   >> current batch is finished.  So if your processing time is larger than
   >> your batch time, delays will build up.
   >>
   >> On Wed, Mar 9, 2016 at 11:09 AM, Sachin Aggarwal
   >> <different.sachin@gmail.com> wrote:
   >> > Hi All,
   >> >
   >> > we have batchTime and submissionTime.
   >> >
   >> > @param batchTime   Time of the batch
   >> >
   >> > @param submissionTime  Clock time of when the jobs of this batch were
   >> > submitted to the streaming scheduler queue
   >> >
   >> > 1) We are seeing a difference between batchTime and submissionTime,
   >> > even in minutes, for small batches (300 ms) with direct Kafka. We see
   >> > this only when the processing time is more than the batch interval.
   >> > How can we explain this delay?
   >> >
   >> > 2) In one case the batch processing time is more than the batch
   >> > interval. Will Spark fetch the next batch's data from Kafka in
   >> > parallel while processing the current batch, or will it wait for the
   >> > current batch to finish first?
   >> >
   >> > I would be thankful if you could give me some pointers.
   >> >
   >> > Thanks!
   >> > --
   >> >
   >> > Thanks & Regards
   >> >
   >> > Sachin Aggarwal
   >> > 7760502772
   >
   >
   >
   >
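Cody's point in the quoted thread above (batches run serially by default, so
an overrun on each batch accumulates) can be reduced to a toy calculation.
Every number here is made up purely for illustration:

```scala
// Toy model of serial batch execution: each batch must wait for all earlier
// batches to finish, so the per-batch overrun accumulates linearly.
val batchIntervalMs  = 300L
val processingTimeMs = 500L // longer than the interval, so a backlog builds

// Delay between the nth batch's batchTime and when it can actually start.
def startDelay(n: Int): Long = n.toLong * (processingTimeMs - batchIntervalMs)

println(startDelay(0))   // the first batch starts on time
println(startDelay(100)) // after 100 batches the backlog is substantial
```

Under these assumed numbers, each batch falls a further 200 ms behind, so a
multi-minute gap between batchTime and submissionTime only needs the job to
run for a while with processing time above the batch interval.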







