crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-606) Create a KafkaSource
Date Fri, 06 May 2016 19:18:12 GMT


Micah Whitacre commented on CRUNCH-606:

[~joshwills] what's the best way to handle converting data coming out of Kafka?

Currently I'm having the consumers pass in instances of the new Kafka Deserializer class for
the key and value, so that when I read data from Kafka I don't get byte[] but instead get values
like String, Avro SpecificRecord, etc.  I'm also having them provide a PTableType<K, V>
where K and V correspond to whatever is coming out of Kafka.
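A minimal sketch of the pattern described above: the consumer supplies deserializer instances so reads yield typed values rather than byte[]. The Deserializer interface here is a simplified stand-in for Kafka's org.apache.kafka.common.serialization.Deserializer (the real one also has configure/close methods), so the snippet stays self-contained:

```java
import java.nio.charset.StandardCharsets;

// Simplified stand-in for org.apache.kafka.common.serialization.Deserializer<T>.
interface Deserializer<T> {
    T deserialize(String topic, byte[] data);
}

// Turns the raw Kafka payload into a String; an Avro SpecificRecord
// deserializer would follow the same shape.
class StringDeserializer implements Deserializer<String> {
    public String deserialize(String topic, byte[] data) {
        return data == null ? null : new String(data, StandardCharsets.UTF_8);
    }
}

public class KafkaDeserializerSketch {
    public static void main(String[] args) {
        // The source would hold one deserializer for keys and one for values.
        Deserializer<String> valueDeserializer = new StringDeserializer();
        byte[] raw = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(valueDeserializer.deserialize("my-topic", raw)); // prints "hello"
    }
}
```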

Using the existing PTypes, I'm hitting errors like:

java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.mapred.AvroWrapper
	at org.apache.crunch.types.avro.AvroKeyConverter.convertInput(
	at org.apache.hadoop.mapred.MapTask.runNewMapper(

Is there a workaround I'm missing, or should I create something similar to Orcs/HBaseTypes
where I have them provide instances of the Kafka Serializer/Deserializer for the payloads?  The
downside is requiring them to have a Serializer when they normally wouldn't need one, but that's
not a deal breaker b/c I'm guessing most people already have one written somewhere on the
produce side of things.
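The HBaseTypes-style alternative could look roughly like the sketch below: the caller hands over both sides of the codec and the type keeps the data as byte[] internally, mapping to the typed value at the boundary. KafkaTypeSketch and its method names are hypothetical illustration, not Crunch API; the real version would wrap this into a PType via Crunch's derived-type machinery:

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical KafkaTypes-style helper (by analogy with HBaseTypes):
// the Serializer covers the produce side, the Deserializer the consume side.
final class KafkaTypeSketch<T> {
    private final Function<T, byte[]> serializer;
    private final Function<byte[], T> deserializer;

    KafkaTypeSketch(Function<T, byte[]> serializer, Function<byte[], T> deserializer) {
        this.serializer = serializer;
        this.deserializer = deserializer;
    }

    // What Crunch would call when writing the value back out.
    byte[] toBytes(T value) { return serializer.apply(value); }

    // What the source would call on each record read from Kafka.
    T fromBytes(byte[] raw) { return deserializer.apply(raw); }
}

public class KafkaTypesDemo {
    public static void main(String[] args) {
        KafkaTypeSketch<String> stringType = new KafkaTypeSketch<>(
                s -> s.getBytes(StandardCharsets.UTF_8),
                b -> new String(b, StandardCharsets.UTF_8));
        byte[] wire = stringType.toBytes("payload");
        System.out.println(stringType.fromBytes(wire)); // prints "payload"
    }
}
```

Requiring the serializer half is what makes the round trip possible, which is why it seems worth the extra ask even though consumers only strictly need the deserializer.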

> Create a KafkaSource
> --------------------
>                 Key: CRUNCH-606
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>          Components: IO
>            Reporter: Micah Whitacre
>            Assignee: Micah Whitacre
>         Attachments: CRUNCH-606.patch
> Pulling data out of Kafka is a common use case, and some of the ways to do it (Kafka Connect,
> Camus, Gobblin) do not integrate nicely with existing processing pipelines like Crunch.  With
> Kafka 0.9, the consuming API is a lot easier, so we should build a Source implementation that
> can read from Kafka.

This message was sent by Atlassian JIRA
