crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Micah Whitacre (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-606) Create a KafkaSource
Date Mon, 09 May 2016 17:46:13 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Micah Whitacre updated CRUNCH-606:
----------------------------------
    Attachment: CRUNCH-606.diff

So the way I have the source currently written it will support reading from 1:M topics as
long as they produce the same payloads.  This fits our use case where our topics are segregated
by source but payloads are all the same.  In theory the flexibility in the Serializer/Deserializer
is by topic to support things like migration of the serialized format (e.g. person_json, person_avro
could both produce Person objects but their byte form might be different)  

To be honest we don't really have this use case but it has been indicated that this is how
things like the SchemaRegistry from Confluent does passivity.

Attached is the patch of my code so far.  I think the only missing piece is the converter
and need for more integration testing.  

You'll notice that right now I have the KafkaSource taking in a PTableType.  While I'd love
for the source to be PTypeFamily agnostic it seems like if I could change the InputFormat/RecordReader
to only support AvroTypeFamily or write a ConverterShim.  I'll play around with this some
more.

> Create a KafkaSource
> --------------------
>
>                 Key: CRUNCH-606
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-606
>             Project: Crunch
>          Issue Type: New Feature
>          Components: IO
>            Reporter: Micah Whitacre
>            Assignee: Micah Whitacre
>         Attachments: CRUNCH-606.diff, CRUNCH-606.patch
>
>
> Pulling data out of Kafka is a common use case and some of the ways to do it Kafka Connect,
Camus, Gobblin do not integrate nicely with existing processing pipelines like Crunch.  With
Kafka 0.9, the consuming API is a lot easier so we should build a Source implementation that
can read from Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message