Mailing-List: contact dev-help@manifoldcf.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@manifoldcf.apache.org
Date: Thu, 13 Aug 2015 15:39:47 +0000 (UTC)
From: "Tugba Dogan (JIRA)" <jira@apache.org>
To: dev@manifoldcf.apache.org
Message-ID: <JIRA.12774589.1423752111000.57336.1439480387639@Atlassian.JIRA>
In-Reply-To: <JIRA.12774589.1423752111000@Atlassian.JIRA>
References: <JIRA.12774589.1423752111000@Atlassian.JIRA>
 <JIRA.12774589.1423752111327@arcas>
Subject: [jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695381#comment-14695381 ] 

Tugba Dogan commented on CONNECTORS-1162:
-----------------------------------------

I think the Kafka API doesn't have a method to fetch a document by its document identifier, because Kafka is designed mainly as a messaging queue rather than as a store for documents addressed by a path or ID. However, if we want to fetch documents one by one, we can use message offsets as their document IDs: we can seek to an offset and fetch a single message from the queue. This might solve our problem, but I think it will be somewhat slower than reading the streaming data continuously.
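
As a rough sketch of the offset-as-identifier idea: offsets are only unique within a partition, so an identifier would probably need to carry the topic and partition as well. The class name KafkaDocumentId and the "topic:partition:offset" format below are assumptions made for illustration; neither is part of Kafka or ManifoldCF.

```java
/** Hypothetical helper: encodes a Kafka message position as a document identifier. */
final class KafkaDocumentId {
    final String topic;
    final int partition;
    final long offset;

    KafkaDocumentId(String topic, int partition, long offset) {
        if (topic == null || topic.isEmpty() || partition < 0 || offset < 0) {
            throw new IllegalArgumentException("invalid message position");
        }
        this.topic = topic;
        this.partition = partition;
        this.offset = offset;
    }

    /** Assumed identifier format: topic:partition:offset. */
    String encode() {
        return topic + ":" + partition + ":" + offset;
    }

    /** Parses an identifier produced by encode(), splitting from the right. */
    static KafkaDocumentId parse(String id) {
        int last = id.lastIndexOf(':');
        int mid = id.lastIndexOf(':', last - 1);
        if (mid <= 0) {
            throw new IllegalArgumentException("expected topic:partition:offset, got " + id);
        }
        return new KafkaDocumentId(
            id.substring(0, mid),
            Integer.parseInt(id.substring(mid + 1, last)),
            Long.parseLong(id.substring(last + 1)));
    }
}
```

With the parsed partition and offset, the consumer could seek to that position and poll, keeping only the record whose offset matches.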

As you can see in the JavaDoc of KafkaConsumer, there isn't a method to get a single message. Instead, there is a poll method that fetches ConsumerRecords, which contain all of the messages from the offset where the consumer starts.
http://kafka.apache.org/083/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

I thought we might fetch the data, store it in some cache, and use it later in the processDocuments method.
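
The caching idea could look roughly like the sketch below. It is a plain in-memory, size-bounded map keyed by document identifier; the class name MessageCache and the eviction policy are assumptions for illustration, not existing ManifoldCF code.

```java
/** Hypothetical bounded cache holding message payloads between polling and processing. */
final class MessageCache {
    private final java.util.Map<String, byte[]> entries;

    MessageCache(final int capacity) {
        // Insertion-ordered map that drops its oldest entry once capacity is exceeded.
        this.entries = new java.util.LinkedHashMap<String, byte[]>() {
            @Override
            protected boolean removeEldestEntry(java.util.Map.Entry<String, byte[]> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Called while iterating over a polled batch. */
    synchronized void put(String documentId, byte[] payload) {
        entries.put(documentId, payload);
    }

    /** Returns and removes the payload, or null if it was never cached or was evicted. */
    synchronized byte[] take(String documentId) {
        return entries.remove(documentId);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

If take returns null in processDocuments, the connector could fall back to seeking to the message's offset and fetching it again.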

> Apache Kafka Output Connector
> -----------------------------
>
>                 Key: CONNECTORS-1162
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>
>         Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use Kafka as a feeding system for streaming BigData processes, in both Apache Spark and Hadoop environments. A Kafka output connector could be used for streaming or dispatching crawled documents or metadata and putting them into a BigData processing pipeline.

