manifoldcf-dev mailing list archives

From "Rafa Haro (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
Date Thu, 13 Aug 2015 16:10:46 GMT


Rafa Haro commented on CONNECTORS-1162:

Hi [~tugbadogan]. The aim of using Kafka as a repository connector within ManifoldCF is really
for use cases where Kafka transports something that can be reconstructed as a "Document"
that you would like to index or push to an output connector. So the intention shouldn't be
to preserve the Kafka message structure within the repository connector (i.e. a Record in
Kafka shouldn't be equivalent to a RepositoryDocument). That doesn't make any sense (to me
at least). In the repository connector, I would do something like a reduce stage and join
together all the fields belonging to the same "document". For that to work, the topic of
all Kafka messages must correspond to the Document URI or identifier. That is something you
have to impose on the integrator. You can seed the topics, and you should create different
document identifiers if new messages for the same topic/document have arrived at seeding
time in the next job.
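
Just to make the idea concrete, here is a minimal sketch of that reduce stage. The class
name KafkaDocumentReducer is hypothetical, and it assumes record keys carry field names and
record values carry field content; it is not the connector's actual code:

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical reduce stage: group consumed records by topic, where the
// topic name doubles as the document URI/identifier, and merge each
// record's key/value pair into a single field map per document.
public class KafkaDocumentReducer {

    public static Map<String, Map<String, String>> reduce(
            KafkaConsumer<String, String> consumer) {
        Map<String, Map<String, String>> documents = new HashMap<>();
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
            documents
                .computeIfAbsent(record.topic(), topic -> new HashMap<>())
                .put(record.key(), record.value());
        }
        // Each entry can now be turned into one RepositoryDocument,
        // keyed by the topic-derived document identifier.
        return documents;
    }
}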

Now, a situation that could happen is that you are not able to rebuild the whole document
from the already consumed data for a topic plus the new data coming from the stream. In that
situation, an OutputConnector should allow you to update the document rather than replace
it. Is that possible, [~daddywri]?

> Apache Kafka Output Connector
> -----------------------------
>                 Key: CONNECTORS-1162
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>         Attachments: 1.JPG, 2.JPG
> Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality
of a messaging system, but with a unique design. A single Kafka broker can handle hundreds
of megabytes of reads and writes per second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use Kafka as
a feeding system for streaming BigData processes, in both Apache Spark and Hadoop environments.
A Kafka output connector could be used for streaming or dispatching crawled documents or metadata,
putting them into a BigData processing pipeline.
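
A minimal sketch of such a dispatch path, assuming String keys/values, a hypothetical topic
name "crawled-documents", and an invented JSON payload (none of these come from the connector
itself):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical dispatcher that pushes a crawled document onto a Kafka
// topic, keyed by its URI so downstream consumers can reassemble it.
public class CrawledDocumentDispatcher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Document URI as the key, document body/metadata as the value.
            producer.send(new ProducerRecord<>(
                "crawled-documents",
                "http://example.com/doc1",
                "{\"title\": \"Example\", \"body\": \"...\"}"));
        }
    }
}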
