spark-reviews mailing list archives

From koeninger <...@git.apache.org>
Subject [GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Date Tue, 21 Aug 2018 01:54:18 GMT
Github user koeninger commented on the issue:

    https://github.com/apache/spark/pull/22138
  
    Seeking means the pre-fetched data is wasted, so it's not a light
    operation.  But it shouldn't be unavoidable, e.g. if consumers were cached
    keyed by (topicpartition, groupid, next offset to be processed).  One concern
    there would be how to make sure you don't accumulate lots of idle consumers.
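    The caching scheme suggested above could be sketched roughly as follows. This is an illustrative sketch only, not Spark's actual `KafkaDataConsumer` code: the names `ConsumerKey`, `ConsumerCache`, and the idle-eviction policy are all hypothetical, and a plain `String` stands in for a real Kafka consumer. The idea is that keying the cache on the next offset to be processed makes a cache hit mean "no seek needed", while an idle timeout addresses the concern about consumers piling up.

    ```scala
    import scala.collection.mutable

    // Hypothetical cache key: a hit means the cached consumer's position
    // already matches the next offset, so no seek (and no wasted prefetch).
    case class ConsumerKey(topic: String, partition: Int, groupId: String, nextOffset: Long)

    // `now` is injectable so idle eviction can be tested deterministically.
    final class ConsumerCache[C](idleTimeoutMs: Long, now: () => Long) {
      private case class Entry(consumer: C, lastUsedMs: Long)
      private val entries = mutable.Map[ConsumerKey, Entry]()

      // Take a cached consumer already positioned at the requested offset, if any.
      def acquire(key: ConsumerKey): Option[C] = synchronized {
        entries.remove(key).map(_.consumer)
      }

      // Return a consumer to the cache under the *next* offset it will read,
      // then drop any entries that have sat idle past the timeout.
      def release(key: ConsumerKey, consumer: C): Unit = synchronized {
        entries(key) = Entry(consumer, now())
        val cutoff = now() - idleTimeoutMs
        val stale = entries.collect { case (k, e) if e.lastUsedMs < cutoff => k }.toList
        stale.foreach(entries.remove)
      }

      def size: Int = synchronized(entries.size)
    }
    ```

    A real version would also need a cap on total cached consumers and would close the underlying Kafka connection on eviction.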
    
    The question of how serious an issue this is could be settled by measurement, but
    I don't have production structured streaming jobs, much less ones that
    exhibit the kind of behavior tdas was talking about in the original ticket.
    
    On Mon, Aug 20, 2018 at 7:36 PM, Jungtaek Lim <notifications@github.com>
    wrote:
    
    > @koeninger <https://github.com/koeninger>
    >
    > I'm not sure, but are you saying that an executor handles multiple
    > queries (multiple jobs) concurrently? I honestly didn't notice that. If that
    > is going to be a problem, we should add something (could we get the query id
    > at that time?) to the cache key to differentiate consumers. If we want to
    > avoid extra seeking due to different offsets, consumers should not be reused
    > among multiple queries, and that's just a matter of the cache key.
    >
    > If you are thinking about sharing consumers among multiple queries
    > in order to reuse the connection to Kafka, I think extra seeking is
    > unavoidable (I guess the fetched data would be the more critical issue, unless
    > we never reuse it after returning to the pool). If seeking is a light
    > operation, we may even go with reusing only the connection (not the position
    > we already sought to): always resetting the position (and fetched data,
    > maybe?) when borrowing from the pool or returning a consumer to the pool.
    >
    > Btw, the rationale of this patch is not to solve the issue you're
    > referring to. This patch is also based on #20767
    > <https://github.com/apache/spark/pull/20767>, but it addresses other
    > improvements pointed out in the comments: adopting a pool library rather
    > than reinventing the wheel, and enabling metrics for the pool.
    >
    > I'm not sure the issue you're referring to is a serious one (a show-stopper):
    > if it were that serious, someone should have handled it once we became
    > aware of it in March, or at least a relevant JIRA issue should have been
    > filed with a detailed explanation. I'd like to ask you to handle (or file)
    > the issue, since you may know it best.
    >


