spark-user mailing list archives

From Alan Braithwaite <a...@cloudflare.com>
Subject Re: Spark Streaming Specify Kafka Partition
Date Thu, 03 Dec 2015 23:21:28 GMT
One quick newbie question since I got another chance to look at this
today.  We're using Java for our Spark applications.  The createDirectStream
we were using previously [1] returns a JavaPairInputDStream, but the
overload of createDirectStream that takes fromOffsets expects a recordClass
argument to pass into the generic constructor.

In the code for the first function signature (without fromOffsets), the
record is constructed in Scala as just a tuple (K, V).  How do I pass that
same class/type information from Java as the record class to get a
JavaPairInputDStream<K, V>?

I understand this might be a question better suited to a Scala mailing list,
but Google is failing me at the moment for hints on the interoperability of
Scala and Java generics.

[1] The original createDirectStream:
https://github.com/apache/spark/blob/branch-1.5/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala#L395-L423
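For concreteness, the pattern I suspect is needed is the usual unchecked
double cast through a raw Class token, since type erasure means there is no
Tuple2<K, V>.class at runtime.  A minimal stdlib-only sketch of the pattern,
with Map.Entry standing in for Scala's Tuple2 purely for illustration (I
haven't verified this against the Spark API itself):

```java
import java.util.AbstractMap;
import java.util.Map;

public class RecordClassDemo {
    // Generic type arguments are erased at runtime, so there is no
    // Map.Entry<K, V>.class (and likewise no Tuple2<K, V>.class).  The
    // standard workaround is an unchecked double cast through the raw
    // class object.
    @SuppressWarnings("unchecked")
    static <K, V> Class<Map.Entry<K, V>> entryClass() {
        return (Class<Map.Entry<K, V>>) (Class<?>) Map.Entry.class;
    }

    public static void main(String[] args) {
        Class<Map.Entry<String, String>> recordClass = entryClass();
        Map.Entry<String, String> entry =
                new AbstractMap.SimpleEntry<>("key", "value");
        // The raw token still works for runtime instance checks.
        System.out.println(recordClass.isInstance(entry)); // true
        System.out.println(recordClass.getName());         // java.util.Map$Entry
    }
}
```

If that pattern is the right one, then presumably the same cast over
scala.Tuple2, i.e. (Class<Tuple2<K, V>>) (Class<?>) Tuple2.class, could be
passed as recordClass alongside a messageHandler that returns
new Tuple2<>(mm.key(), mm.message()) — but again, that's a guess I haven't
compile-checked.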

Thanks,
- Alan

On Tue, Dec 1, 2015 at 8:12 AM, Cody Koeninger <cody@koeninger.org> wrote:

> I actually haven't tried that, since I tend to do the offset lookups if
> necessary.
>
> It's possible that it will work, try it and let me know.
>
> Be aware that if you're doing a count() or take() operation directly on
> the rdd it'll definitely give you the wrong result if you're using -1 for
> one of the offsets.
>
>
>
> On Tue, Dec 1, 2015 at 9:58 AM, Alan Braithwaite <alan@cloudflare.com>
> wrote:
>
>> Neat, thanks.  If I specify something like -1 as the offset, will it
>> consume from the latest offset or do I have to instrument that manually?
>>
>> - Alan
>>
>> On Tue, Dec 1, 2015 at 6:43 AM, Cody Koeninger <cody@koeninger.org>
>> wrote:
>>
>>> Yes, there is a version of createDirectStream that lets you specify
>>> fromOffsets: Map[TopicAndPartition, Long]
>>>
>>> On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite <alan@cloudflare.com>
>>> wrote:
>>>
>>>> Is there any mechanism in the kafka streaming source to specify the
>>>> exact partition id that we want a streaming job to consume from?
>>>>
>>>> If not, is there a workaround besides writing our own custom receiver?
>>>>
>>>> Thanks,
>>>> - Alan
>>>>
>>>
>>>
>>
>
