spark-reviews mailing list archives

From marmbrus <>
Subject [GitHub] spark issue #15102: [SPARK-17346][SQL] Add Kafka source for Structured Strea...
Date Fri, 30 Sep 2016 01:37:40 GMT
Github user marmbrus commented on the issue:
    I spent a while playing around with this today on a real cluster, and overall it is pretty
cool!  I have a few suggestions we should implement in the long run, but these can probably
be done in follow-up PRs.
     - It would be nice to be able to do something other than earliest/latest.  In particular,
I have some low-volume streams where I want to go back a bit as I'm working interactively.
 Going back all the way results in a really long job.  If there was a time index, I'd love
to say `-10 minutes`, but even just `-100 offsets` would be great.
     - When specifying `earliest`, you end up with really big partitions.  As a result you
get virtually no progress reporting until it's done.  I'd consider having a max number of offsets
in a task.  It should be easy to split them up in `getBatch`.  One question: is it a problem
if two tasks are pulling from the same topic partition in parallel?  Does this break the assumptions
of our caching?
     - Similarly, you get no results until the end.  We might also want to have a `maxOffsetsInBatch`
similar to what we have in the `FileStreamSource`.
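To make the first suggestion concrete, here is a minimal sketch of how a relative starting offset like `-100 offsets` could be resolved against a partition's actual offset range. The object and method names are illustrative assumptions, not the PR's API; the only real constraint is that the resolved offset must be clamped so we never seek before the partition's earliest available offset.

```scala
// Hypothetical helper for resolving a relative starting offset
// (e.g. "go back 100 offsets from latest") for one topic partition.
// Names and semantics here are assumptions for illustration only.
object RelativeOffsets {

  /**
   * Resolve `latest - back`, clamped to `earliest` so that a stream
   * with fewer than `back` retained offsets simply starts at the
   * beginning instead of requesting an out-of-range offset.
   */
  def resolve(earliest: Long, latest: Long, back: Long): Long =
    math.max(earliest, latest - back)
}
```

For example, with `earliest = 0` and `latest = 5000`, going back 100 offsets resolves to 4900; if retention has already advanced `earliest` to 4950, the same request is clamped to 4950.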
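The second and third suggestions both come down to chunking an offset range. A sketch of how `getBatch` might split one topic partition's half-open range `[from, until)` into tasks of at most N offsets each (names are assumptions, not the PR's code):

```scala
// Illustrative sketch: split a single topic-partition offset range
// into smaller ranges so no one task reads more than
// `maxOffsetsPerTask` offsets. Not the PR's actual implementation.
object OffsetRangeSplitter {

  /** Half-open range of Kafka offsets: [from, until). */
  final case class OffsetRange(from: Long, until: Long)

  def split(from: Long, until: Long, maxOffsetsPerTask: Long): Seq[OffsetRange] = {
    require(maxOffsetsPerTask > 0, "maxOffsetsPerTask must be positive")
    // Step through the range, capping the last chunk at `until`.
    (from until until by maxOffsetsPerTask).map { start =>
      OffsetRange(start, math.min(start + maxOffsetsPerTask, until))
    }
  }
}
```

Splitting `[0, 10)` with a cap of 4 yields `[0, 4)`, `[4, 8)`, `[8, 10)`. The same clamping idea applies at the batch level for a `maxOffsetsInBatch`-style limit: cap each partition's end offset before planning tasks, and carry the remainder into the next batch.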
    I can do a final pass over the code, but do we think we are getting close to something
that we can merge and iterate on?
