manifoldcf-dev mailing list archives

From "Tugba Dogan (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
Date Thu, 13 Aug 2015 15:39:47 GMT


Tugba Dogan commented on CONNECTORS-1162:

I think the Kafka API doesn't have a method to fetch a document by its document identifier,
because Kafka is designed primarily as a messaging queue rather than as a store of documents
addressed by some path or ID. But if we want to fetch documents one by one, we can use message
offsets as their document IDs: we can seek to that offset and fetch a single message from the
queue. This method might solve our problem, but I think it's going to be a little slower
compared to a continuous read of the streaming data.
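The offset-as-document-ID idea could be sketched roughly as follows. The `topic:partition:offset` identifier format and the class name are my own assumptions for illustration, not anything ManifoldCF or the connector prescribes:

```java
// Sketch: encode a Kafka message coordinate (topic, partition, offset) as a
// document identifier string, and parse it back. The "topic:partition:offset"
// format is a hypothetical choice for illustration only.
class OffsetDocumentId {
    static String encode(String topic, int partition, long offset) {
        return topic + ":" + partition + ":" + offset;
    }

    // Returns {topic, partition, offset} parsed from the identifier.
    // Splits on the LAST two colons, so topic names containing ':' still work.
    static Object[] decode(String documentId) {
        int last = documentId.lastIndexOf(':');
        int mid = documentId.lastIndexOf(':', last - 1);
        String topic = documentId.substring(0, mid);
        int partition = Integer.parseInt(documentId.substring(mid + 1, last));
        long offset = Long.parseLong(documentId.substring(last + 1));
        return new Object[] { topic, partition, offset };
    }
}
```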

As you can see in the JavaDoc of KafkaConsumer, there isn't a method to get a single message.
Instead, there is a poll method that fetches ConsumerRecords containing all of the messages
from the offset it starts at.
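A per-document fetch along those lines would seek to the stored offset and take only the first record from the poll. To keep this sketch self-contained (without a Kafka broker or the kafka-clients dependency), the consumer is abstracted behind a hypothetical `SeekableLog` interface; the real code would call `KafkaConsumer.seek` followed by `poll`:

```java
import java.util.List;

// Stand-in for the real KafkaConsumer; illustrative only.
interface SeekableLog {
    void seek(long offset);              // models KafkaConsumer.seek(tp, offset)
    List<String> poll(long timeoutMs);   // models KafkaConsumer.poll(timeout)
}

class SingleMessageFetcher {
    // Seek to the offset, poll once, and return only the first record,
    // discarding the rest of the batch -- the per-document cost mentioned above.
    static String fetchAt(SeekableLog log, long offset) {
        log.seek(offset);
        List<String> batch = log.poll(1000L);
        return batch.isEmpty() ? null : batch.get(0);
    }
}
```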

I thought we might fetch the data, store it in some cache, and use it later in processDocuments.
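That caching idea could look roughly like this: poll a batch once, park the records in a bounded in-memory cache keyed by offset, and serve later processDocuments calls from it. The capacity bound and LRU eviction are illustrative assumptions, not part of the proposal:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: a small offset-keyed cache for records pulled in one poll, so that
// later processDocuments calls can reuse them instead of re-polling Kafka.
// The capacity bound and LRU eviction policy are illustrative assumptions.
class RecordCache {
    private final Map<Long, String> byOffset;

    RecordCache(final int capacity) {
        // Access-ordered LinkedHashMap gives LRU eviction for free.
        this.byOffset = new LinkedHashMap<Long, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
                return size() > capacity;
            }
        };
    }

    void put(long offset, String record) { byOffset.put(offset, record); }

    // Returns the cached record, or null if it was evicted or never fetched.
    String get(long offset) { return byOffset.get(offset); }
}
```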

> Apache Kafka Output Connector
> -----------------------------
>                 Key: CONNECTORS-1162
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>         Attachments: 1.JPG, 2.JPG
> Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality
> of a messaging system, but with a unique design. A single Kafka broker can handle hundreds
> of megabytes of reads and writes per second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use Kafka as
> a feeding system for streaming BigData processes, in Apache Spark or Hadoop environments.
> A Kafka output connector could be used for streaming or dispatching crawled documents or metadata,
> putting them into a BigData processing pipeline.

This message was sent by Atlassian JIRA
