manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: processing document addition and delete in order
Date Fri, 13 Jun 2014 17:21:46 GMT
I see; so you are not crawling a repository but instead a sequence of
commands, and you don't know what the actual state of the "repository" is
until all the commands are processed.

ManifoldCF is not really designed to crawl sequentially-ordered commands.
If you can process the commands in sequence first into a "repository" of
your own construction, then ManifoldCF would be well-suited to picking
documents out of there.  I'm trying to think of a good way to do this
without actually doing that preprocessing step, but at the moment I'm
coming up with nothing useful.
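
A minimal sketch of that preprocessing step, assuming the commands can be read as an ordered list of adds and deletes. The names CommandLog, Command and replay are hypothetical and are not part of ManifoldCF; the sketch only shows why the order in which commands are replayed determines the final state.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a ManifoldCF class): folds an ordered command
 *  log into the final document state that a crawler could then pick up. */
class CommandLog {
  enum Op { ADD, DELETE }

  /** One entry of the command log; body is null for deletes. */
  static final class Command {
    final Op op;
    final String docId;
    final String body;

    private Command(Op op, String docId, String body) {
      this.op = op;
      this.docId = docId;
      this.body = body;
    }

    static Command add(String docId, String body) {
      return new Command(Op.ADD, docId, body);
    }

    static Command delete(String docId) {
      return new Command(Op.DELETE, docId, null);
    }
  }

  /** Replays the commands strictly in list order, so a later command for
   *  the same docId always overrides an earlier one. */
  static Map<String, String> replay(List<Command> commands) {
    Map<String, String> state = new LinkedHashMap<>();
    for (Command c : commands) {
      if (c.op == Op.ADD) {
        state.put(c.docId, c.body);
      } else {
        state.remove(c.docId);
      }
    }
    return state;
  }

  public static void main(String[] args) {
    List<Command> inOrder = Arrays.asList(
        Command.add("doc1", "v1"), Command.delete("doc1"), Command.add("doc1", "v2"));
    List<Command> reordered = Arrays.asList(
        Command.add("doc1", "v1"), Command.add("doc1", "v2"), Command.delete("doc1"));
    System.out.println(replay(inOrder).keySet());   // prints [doc1]
    System.out.println(replay(reordered).keySet()); // prints []
  }
}
```

The resulting map plays the role of the "repository" of your own construction: a connector would then only need to enumerate its keys and fetch the corresponding content, with no ordering requirement left.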

Karl



On Fri, Jun 13, 2014 at 1:14 PM, Matteo Grolla <m.grolla@sourcesense.com>
wrote:

> Hi Karl,
>         the reason is that if I read the commands from the files in this
> order:
>
>         add{doc1}, delete{doc1}, add{doc1}
>
>         then after the crawl I should find doc1 in Solr,
>         but if I process them in this order:
>
>         add{doc1}, add{doc1}, delete{doc1}
>
>         doc1 won't be in Solr after the crawl.
>
> The concern about sequential performance is valid, but my use cases
> typically involve few deletions and lots of adds.
>
>         Suppose I have
>
>         add{doc1}, add{doc2}, add{doc3}, delete{doc1}, add{doc1}
>
>
>         I could process
>         add{doc1}, add{doc2}, add{doc3} in parallel,
>         then delete{doc1} on its own,
>         then proceed in parallel until the next delete.
>
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
> On 13 Jun 2014, at 19:06, Karl Wright wrote:
>
> > One other point: if the reason that you would be trying to order things
> > is because you'd want to process the XML document before processing its
> > children, you don't need to worry about that at all either, because the
> > framework takes care of that automatically.  All you need to do is handle
> > the case where the carrydown data is not present.
> >
> > Thanks,
> > Karl
> >
> >
> >
> > On Fri, Jun 13, 2014 at 1:03 PM, Karl Wright <daddywri@gmail.com> wrote:
> >
> >> Hi Matteo,
> >>
> >> The prerequisite event logic is the only way to order document
> >> processing in ManifoldCF.  The javadoc for the event methods is probably
> >> the best reference to use.  I can't say from your description how it
> >> would map, but here's the description in question:
> >>
> >> /** This interface abstracts from the activities that use and govern
> >>  * events.
> >>  *
> >>  * The purpose of this model is to allow a connector to:
> >>  * (a) ensure that documents whose prerequisites have not been met do not
> >>  *     get processed until those prerequisites are completed
> >>  * (b) guarantee that only one thread at a time deals with sequencing of
> >>  *     documents
> >>  *
> >>  * The way it works is as follows.  We define the notion of an "event",
> >>  * which is described by a simple string (and thus can be global, local to
> >>  * a connection, or local to a job, whichever is appropriate).  An event is
> >>  * managed solely by the connector that knows about it.  Effectively it can
> >>  * be in either of two states: "completed", or "pending".  The only time
> >>  * the framework ever changes an event state is when the crawler is
> >>  * restarted, at which point all pending events are marked "completed".
> >>  *
> >>  * Documents, when they are added to the processing queue, specify the set
> >>  * of events on which they will block.  If an event is in the "pending"
> >>  * state, no documents that block on that event will be processed at that
> >>  * time.  Of course, it is possible that a document could be handed to
> >>  * processing just before an event entered the "pending" state - in which
> >>  * case it is the responsibility of the connector itself to avoid any
> >>  * problems or conflicts.  This can usually be handled by proper handling
> >>  * of event signalling.  More on that later.
> >>  *
> >>  * The presumed underlying model of flow inside the connector's processing
> >>  * method is as follows:
> >>  * (1) The connector examines the document in question, and decides
> >>  *     whether it can be processed successfully or not, based on what it
> >>  *     knows about sequencing
> >>  * (2) If the connector determines that the document can properly be
> >>  *     processed, it does so, and that's it.
> >>  * (3) If the connector finds a sequencing-related problem, it:
> >>  *     (a) Begins an appropriate event sequence.
> >>  *     (b) If the framework indicates that this event is already in the
> >>  *         "pending" state, then some other thread is already handling the
> >>  *         event, and the connector should abort processing of the current
> >>  *         document.
> >>  *     (c) If the framework successfully begins the event sequence, then
> >>  *         the connector code knows unequivocally that it is the only
> >>  *         thread processing the event.  It should take whatever action it
> >>  *         needs to - which might be requesting special documents, for
> >>  *         instance.  [Note well: At this time, there is no way to
> >>  *         guarantee that special documents added to the queue are in fact
> >>  *         properly synchronized by this mechanism, so I recommend avoiding
> >>  *         this practice, and instead handling any special document
> >>  *         sequences without involving the queue.]
> >>  *     (d) If the connector CANNOT successfully take the action it needs
> >>  *         to push the sequence along, it MUST set the event back to the
> >>  *         "completed" state.  Otherwise, the event will remain in the
> >>  *         "pending" state until the next time the crawler is restarted.
> >>  *     (e) If the current document cannot yet be processed, its processing
> >>  *         should be aborted.
> >>  * (4) When the connector determines that the event's conditions have been
> >>  *     met, or when it determines that an event sequence is no longer
> >>  *     viable and has been aborted, it must set the event status to
> >>  *     "completed".
> >>  *
> >>  * In summary, a connector may perform the following event-related
> >>  * actions:
> >>  * (a) Set an event into the "pending" state
> >>  * (b) Set an event into the "completed" state
> >>  * (c) Add a document to the queue with a specified set of prerequisite
> >>  *     events attached
> >>  * (d) Request that the current document be requeued for later processing
> >>  *     (i.e. abort processing of a document due to sequencing reasons)
> >>  *
> >>  */
> >> public interface IEventActivity extends INamingActivity
> >> {
> >>  public static final String _rcsid = "@(#)$Id: IEventActivity.java 988245 2010-08-23 18:39:35Z kwright $";
> >>
> >>  /** Begin an event sequence.
> >>  * This method should be called by a connector when a sequencing event
> >>  * should enter the "pending" state.  If the event is already in that
> >>  * state, this method will return false, otherwise true.  The connector
> >>  * has the responsibility of appropriately managing sequencing given the
> >>  * response status.
> >>  *@param eventName is the event name.
> >>  *@return false if the event is already in the "pending" state.
> >>  */
> >>  public boolean beginEventSequence(String eventName)
> >>    throws ManifoldCFException;
> >>
> >>  /** Complete an event sequence.
> >>  * This method should be called to signal that an event is no longer in
> >>  * the "pending" state.  This can mean that the prerequisite processing is
> >>  * completed, but it can also mean that prerequisite processing was
> >>  * aborted or cannot be completed.
> >>  * Note well: This method should not be called unless the connector is
> >>  * CERTAIN that an event is in progress, and that the current thread has
> >>  * the sole right to complete it.  Otherwise, race conditions can develop
> >>  * which would be difficult to diagnose.
> >>  *@param eventName is the event name.
> >>  */
> >>  public void completeEventSequence(String eventName)
> >>    throws ManifoldCFException;
> >>
> >>  /** Abort processing a document (for sequencing reasons).
> >>  * This method should be called in order to cause the specified document
> >>  * to be requeued for later processing.  While this is similar in some
> >>  * respects to the semantics of a ServiceInterruption, it is applicable to
> >>  * only one document at a time, and also does not specify any delay
> >>  * period, since it is presumed that the reason for the requeue is because
> >>  * of sequencing issues synchronized around an underlying event.
> >>  *@param localIdentifier is the document identifier to requeue
> >>  */
> >>  public void retryDocumentProcessing(String localIdentifier)
> >>    throws ManifoldCFException;
> >>
> >>
> >> }
> >>
> >>
> >> As you can see, these constraints are significant and can cause
> >> single-threaded behavior, so unless you've got a real requirement for
> >> ordering, it's better not to do it.
> >>
> >> Furthermore, the question of deletions is really not germane, because
> >> ManifoldCF does not in fact order deletions at all.  They are done
> >> either as a side-effect of document processing (when a document is
> >> discovered to not be there anymore), or at the end of a job (when
> >> orphaned documents are removed).  They are also deleted when the job
> >> that owns them is deleted.
> >>
> >> Karl
> >>
> >>
> >>
> >> On Fri, Jun 13, 2014 at 12:52 PM, Matteo Grolla
> >> <m.grolla@sourcesense.com> wrote:
> >>
> >>> Hi,
> >>>        I'm going to develop a ManifoldCF connector, and one requirement
> >>> is that it should be able to handle document insertions and deletions
> >>> in order (details below).
> >>> I've already built such a crawler as a standalone application, and
> >>> the design was conceptually this:
> >>>
> >>> instead of a Document Queue I have a CommandQueue
> >>>        commands can be delete (specifying the docId) or add (specifying
> >>> the doc to be added)
> >>> when a worker thread takes a delete, no other worker is allowed to pick
> >>> other commands from the queue until the delete has been committed
> >>>
> >>>
> >>> Ex. suppose I have the following chunk of CommandQueue:
> >>>
> >>> add{doc1}, delete{doc1}, add{doc1}
> >>>
> >>> I need to avoid the situation where commands are processed in this
> >>> order:
> >>> add{doc1}, add{doc1}, delete{doc1}
> >>>
> >>>
> >>> I think the EventSequence could help me implement this synchronization
> >>> in ManifoldCF.
> >>> When seeding the identifiers I could embed the command in the
> >>> identifier.
> >>> Ex.
> >>>        instead of stuffing the identifier "hd-samsung-500GB"
> >>>        I could stuff "add hd-samsung-500GB"
> >>>
> >>> The question is: am I running into huge trouble trying to implement
> >>> this requirement or not?
> >>>
> >>> --
> >>> Matteo Grolla
> >>> Sourcesense - making sense of Open Source
> >>> http://www.sourcesense.com
> >>>
> >>>
> >>
>
>
