Date: Thu, 13 Aug 2015 15:39:47 +0000 (UTC)
From: "Tugba Dogan (JIRA)"
To: dev@manifoldcf.apache.org
Subject: [jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

    [ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695381#comment-14695381 ]

Tugba Dogan commented on CONNECTORS-1162:
-----------------------------------------

I don't think the Kafka API has a method to fetch a document by its document identifier, because Kafka is designed mainly as a message queue rather than as a store for documents addressed by a path or ID. However, if we want to fetch documents one by one, we can use message offsets as their document IDs: we can seek to a given offset and fetch a single message from the queue.
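
To make the offset-as-document-ID idea concrete, here is a minimal sketch in plain Java. It does not use the Kafka client at all: OffsetLogSketch and its methods are hypothetical names for an in-memory stand-in for a single partition, and they only mimic the seek-then-poll pattern described above. With the real client, the seek and poll calls on KafkaConsumer would presumably play the same roles, and a poll may return more than the one record we asked about.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * In-memory stand-in for one partition of a topic.
 * Hypothetical illustration only; this is NOT the Kafka client API.
 */
class OffsetLogSketch {
    private final List<String> messages = new ArrayList<>();
    private long position = 0; // next offset that poll() will read

    /** Appends a message and returns its offset, which doubles as the document ID. */
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Moves the read position to the given offset. */
    void seek(long offset) {
        if (offset < 0 || offset > messages.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        position = offset;
    }

    /** Returns up to maxRecords messages starting at the current position and advances past them. */
    List<String> poll(int maxRecords) {
        int from = (int) position;
        int to = Math.min(messages.size(), from + maxRecords);
        List<String> batch = new ArrayList<>(messages.subList(from, to));
        position = to;
        return batch;
    }

    /** Fetches one document by ID: seek to its offset, poll, keep only the first record. */
    String fetchByOffset(long offset) {
        seek(offset);
        List<String> batch = poll(1);
        return batch.isEmpty() ? null : batch.get(0);
    }

    public static void main(String[] args) {
        OffsetLogSketch log = new OffsetLogSketch();
        long idA = log.append("doc-a");
        long idB = log.append("doc-b");
        log.append("doc-c");

        // Random access: one seek plus one poll per document.
        System.out.println(log.fetchByOffset(idB)); // prints doc-b

        // Continuous read: one seek, then a single poll returns the whole batch.
        log.seek(idA);
        System.out.println(log.poll(10)); // prints [doc-a, doc-b, doc-c]
    }
}
```

The two calls in main show the trade-off: fetching by ID costs a seek and a poll for every document, while a continuous read gets many documents from a single poll.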

Seeking to an offset for each document might solve our problem, but I think it will be somewhat slower than reading the streaming data continuously. As you can see in the JavaDoc of KafkaConsumer, there is no method to get a single message. Instead, there is a poll method that fetches ConsumerRecords containing all of the messages from the offset at which the consumer starts.

http://kafka.apache.org/083/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

I thought we might fetch the data, store it in a cache, and use it later in the processDocuments method.

> Apache Kafka Output Connector
> -----------------------------
>
>                 Key: CONNECTORS-1162
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>
>         Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
> Apache Kafka is used for a number of use cases. One of them is as a feeding system for streaming BigData processes, in either Apache Spark or Hadoop environments. A Kafka output connector could be used to stream or dispatch crawled documents or metadata and put them into a BigData processing pipeline.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)