manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
Date Wed, 29 Jul 2015 23:47:04 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646925#comment-14646925 ]

Karl Wright commented on CONNECTORS-1162:
-----------------------------------------

bq. Are we going to get topic messages from the beginning or as from the job started?

It is standard practice for a job to represent all documents in a repository, unless there
is an explicit way in the UI to limit which documents are included based on timestamp.  I don't
think such a UI feature is necessary for the first version of the Kafka connector, though.

bq. Also, I want to ask that how can I store offset value so that it can resume to consume
when another job starts.

I assume that you mean, "how do I get ManifoldCF to crawl only the new documents that were
created since the last job run?"  If that is correct, then have a look at the javadoc for
the addSeedDocuments() method:

{code}
  /** Queue "seed" documents.  Seed documents are the starting places for crawling activity.  Documents
  * are seeded when this method calls appropriate methods in the passed in ISeedingActivity object.
  *
  * This method can choose to find repository changes that happen only during the specified time interval.
  * The seeds recorded by this method will be viewed by the framework based on what the
  * getConnectorModel() method returns.
  *
  * It is not a big problem if the connector chooses to create more seeds than are
  * strictly necessary; it is merely a question of overall work required.
  *
  * The end time and seeding version string passed to this method may be interpreted for greatest
  * efficiency.  For continuous crawling jobs, this method will
  * be called once, when the job starts, and at various periodic intervals as the job executes.
  *
  * When a job's specification is changed, the framework automatically resets the seeding
  * version string to null.  The seeding version string may also be set to null on each job run,
  * depending on the connector model returned by getConnectorModel().
  *
  * Note that it is always ok to send MORE documents rather than less to this method.
  * The connector will be connected before this method can be called.
  *@param activities is the interface this method should use to perform whatever framework actions are desired.
  *@param spec is a document specification (that comes from the job).
  *@param lastSeedVersion is the last seeding version string for this job, or null if the job has no previous seeding version string.
  *@param seedTime is the end of the time range of documents to consider, exclusive.
  *@param jobMode is an integer describing how the job is being run, whether continuous or once-only.
  *@return an updated seeding version string, to be stored with the job.
  */
  public String addSeedDocuments(ISeedingActivity activities, Specification spec,
    String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption;
{code}

For "lastSeedVersion", your connector will initially receive null.  You should return a seeding
version string that MCF will store.  On the next job run, that string you returned is passed
back in as "lastSeedVersion".  You can put whatever you like in that string, such as the date
of the last crawl, or offset value, or whatever makes sense for your repository.
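
For example, a Kafka connector might round-trip an offset through the seeding version string
like this.  This is just an untested sketch: fetchMessageOffsetsSince() is a placeholder for
whatever Kafka consumer logic you end up writing; only the method signature and
ISeedingActivity.addSeedDocument() come from the framework.

{code}
import java.util.Collections;
import java.util.List;

import org.apache.manifoldcf.agents.interfaces.ServiceInterruption;
import org.apache.manifoldcf.core.interfaces.ManifoldCFException;
import org.apache.manifoldcf.core.interfaces.Specification;
import org.apache.manifoldcf.crawler.interfaces.ISeedingActivity;

public class KafkaSeedingSketch
{
  public String addSeedDocuments(ISeedingActivity activities, Specification spec,
    String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption
  {
    // On the first run (or after a spec change) the version string is null: start at offset 0.
    long startOffset = (lastSeedVersion == null) ? 0L : Long.parseLong(lastSeedVersion);

    long nextOffset = startOffset;
    for (long offset : fetchMessageOffsetsSince(startOffset))
    {
      // Seed each message, using its offset as the document identifier.
      activities.addSeedDocument(Long.toString(offset));
      nextOffset = offset + 1L;
    }

    // Whatever we return here comes back as lastSeedVersion on the next job run.
    return Long.toString(nextOffset);
  }

  // Placeholder: a real connector would consume from Kafka here.
  private List<Long> fetchMessageOffsetsSince(long startOffset)
  {
    return Collections.emptyList();
  }
}
{code}

Note that on the very first run this would seed from the beginning of the topic, which answers
your earlier question; if you wanted to start from the time the job began instead, you could use
seedTime to pick a starting offset, but that's optional for a first version.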

Hope this helps.

> Apache Kafka Output Connector
> -----------------------------
>
>                 Key: CONNECTORS-1162
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>
>         Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality
> of a messaging system, but with a unique design. A single Kafka broker can handle hundreds
> of megabytes of reads and writes per second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use Kafka as
> a feeding system for streaming Big Data processes, in both Apache Spark and Hadoop environments.
> A Kafka output connector could be used for streaming or dispatching crawled documents or metadata
> and putting them into a Big Data processing pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
