manifoldcf-dev mailing list archives

From "Rafa Haro (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
Date Thu, 13 Aug 2015 18:01:46 GMT


Rafa Haro commented on CONNECTORS-1162:

Hi [~daddywri], please don't misunderstand me. I didn't mean to add new requirements;
I was just trying to shed light on the problem that Tugba reported. What I meant was
that, with Kafka, it would be theoretically possible to receive at seeding time only part
of a document (a set of Kafka messages, but not the whole document). A solution for this could
be to index what you receive in one job and then update the document in the final
index (let's call it the OutputConnector instead of the index). With Kafka it would probably be
impossible to retrieve the whole document if you need to reindex, so in that situation the
OutputConnector API would have to support updates instead of reindexing. I was just asking whether
that is possible, not requiring it :-). Anyway, Kafka does not seem suitable for a Repository
Connector. That was the reason I created the issue only for an Output Connector, if I remember correctly.
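
To make the update-versus-reindex distinction concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: the class PartialUpdateSketch and its methods reindex, update and get are hypothetical names, and nothing here is taken from the ManifoldCF OutputConnector API or from the Kafka client library. An in-memory map stands in for the target index.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: these names are NOT part of the ManifoldCF or Kafka APIs.
class PartialUpdateSketch {

    // Stand-in for the target index behind an output connector.
    private final Map<String, Map<String, String>> index = new LinkedHashMap<>();

    // Reindex semantics: the stored document is replaced wholesale,
    // so any field missing from `fields` is lost.
    void reindex(String docId, Map<String, String> fields) {
        index.put(docId, new LinkedHashMap<>(fields));
    }

    // Update semantics: only the supplied fields are written;
    // fields stored by earlier jobs are kept.
    void update(String docId, Map<String, String> fields) {
        index.computeIfAbsent(docId, id -> new LinkedHashMap<>()).putAll(fields);
    }

    // Returns the stored fields, or null if the document is unknown.
    Map<String, String> get(String docId) {
        return index.get(docId);
    }

    public static void main(String[] args) {
        PartialUpdateSketch out = new PartialUpdateSketch();
        // A first job sees only part of the document.
        out.update("doc-1", Map.of("title", "Kafka connector"));
        // A later job receives the remaining messages and merges them in.
        out.update("doc-1", Map.of("body", "rest of the document"));
        System.out.println(out.get("doc-1"));
    }
}
```

With reindex, the second call would have dropped the title, which is the case where the whole document can no longer be retrieved from Kafka.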

> Apache Kafka Output Connector
> -----------------------------
>                 Key: CONNECTORS-1162
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>         Attachments: 1.JPG, 2.JPG
> Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality
of a messaging system, but with a unique design. A single Kafka broker can handle hundreds
of megabytes of reads and writes per second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use Kafka as
a feeding system for streaming BigData processes, in both Apache Spark and Hadoop environments.
A Kafka output connector could be used for streaming or dispatching crawled documents or metadata
and putting them into a BigData processing pipeline.

This message was sent by Atlassian JIRA
