Date: Wed, 29 Jul 2015 23:47:04 +0000 (UTC)
From: "Karl Wright (JIRA)"
To: dev@manifoldcf.apache.org
Subject: [jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

    [ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646925#comment-14646925 ]

Karl Wright commented on CONNECTORS-1162:
-----------------------------------------

bq. Are we going to get topic messages from the beginning or as from the job started?

It is standard practice for a job to represent all documents in a repository, unless there is an explicit way in the UI to limit the documents taken based on a timestamp. I don't think such a UI feature is necessary for the first version of the Kafka connector, though.

bq.
Also, I want to ask that how can I store offset value so that it can resume to consume when another job starts.

I assume that you mean, "how do I get ManifoldCF to crawl only the new documents that were created since the last job run?" If that is correct, then have a look at the javadoc for the addSeedDocuments() method:

{code}
/** Queue "seed" documents.  Seed documents are the starting places for crawling activity.
 * Documents are seeded when this method calls appropriate methods in the passed-in
 * ISeedingActivity object.
 *
 * This method can choose to find repository changes that happen only during the specified
 * time interval.  The seeds recorded by this method will be viewed by the framework based
 * on what the getConnectorModel() method returns.
 *
 * It is not a big problem if the connector chooses to create more seeds than are strictly
 * necessary; it is merely a question of overall work required.
 *
 * The end time and seeding version string passed to this method may be interpreted for
 * greatest efficiency.  For continuous crawling jobs, this method will be called once,
 * when the job starts, and at various periodic intervals as the job executes.
 *
 * When a job's specification is changed, the framework automatically resets the seeding
 * version string to null.  The seeding version string may also be set to null on each job
 * run, depending on the connector model returned by getConnectorModel().
 *
 * Note that it is always ok to send MORE documents rather than less to this method.
 * The connector will be connected before this method can be called.
 *@param activities is the interface this method should use to perform whatever framework actions are desired.
 *@param spec is a document specification (that comes from the job).
 *@param lastSeedVersionString is the last seeding version string for this job, or null if the job has no previous seeding version string.
 *@param seedTime is the end of the time range of documents to consider, exclusive.
 *@param jobMode is an integer describing how the job is being run, whether continuous or once-only.
 *@return an updated seeding version string, to be stored with the job.
 */
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
  String lastSeedVersion, long seedTime, int jobMode)
  throws ManifoldCFException, ServiceInterruption;
{code}

For "lastSeedVersion", your connector will initially receive null. You should return a seeding version string that MCF will store. On the next job run, the string you returned is passed back in as "lastSeedVersion". You can put whatever you like in that string, such as the date of the last crawl, an offset value, or whatever makes sense for your repository.

Hope this helps.

> Apache Kafka Output Connector
> -----------------------------
>
>                 Key: CONNECTORS-1162
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>
>         Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use Kafka as a feeding system for streaming BigData processes, in both Apache Spark and Hadoop environments. A Kafka output connector could be used for streaming or dispatching crawled documents or metadata, putting them into a BigData processing pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
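[Editorial sketch] To make the seeding-version round trip above concrete, here is a minimal, hypothetical example of how a Kafka connector might encode its last-consumed offset in the string returned from addSeedDocuments(). The class and method names below are illustrative only; they are not part of ManifoldCF or the actual connector, and a real implementation would do this inside addSeedDocuments() itself:

{code}
// Hypothetical helper: pack a Kafka consumer offset into the seeding
// version string that ManifoldCF stores per job, and unpack it on the
// next run. A null version string means "first run, start from offset 0".
public class SeedVersion {

  // Encode the last-consumed offset as the seeding version string
  // that addSeedDocuments() would return to the framework.
  static String encode(long offset) {
    return Long.toString(offset);
  }

  // Decode the string the framework passes back in as lastSeedVersion.
  static long decode(String lastSeedVersion) {
    return lastSeedVersion == null ? 0L : Long.parseLong(lastSeedVersion);
  }

  public static void main(String[] args) {
    // First job run: no previous version string, so start at offset 0.
    long start = decode(null);
    // ... consume messages, ending (say) at offset 1523 ...
    String stored = encode(1523L);
    // Next job run: MCF hands the stored string back, and we resume there.
    long resumeAt = decode(stored);
    System.out.println(start + " " + resumeAt);
  }
}
{code}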