manifoldcf-dev mailing list archives

From Matteo Grolla <m.gro...@sourcesense.com>
Subject Re: processing document addition and delete in order
Date Sat, 14 Jun 2014 13:09:07 GMT
You perfectly described the situation.
If I could get sets of xml files, where each set represents a snapshot of the source system state,
then my crawler would fit the ManifoldCF design much better.
I'll see if it's possible. Concurrency can certainly be exploited better this way.

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 13 Jun 2014, at 19:21, Karl Wright wrote:

> I see; so you are not crawling a repository but instead a sequence of
> commands, and you don't know what the actual state of the "repository" is
> until all the commands are processed.
> 
> ManifoldCF is not really designed to crawl sequentially-ordered commands.
> If you can process the commands in sequence first into a "repository" of
> your own construction, then ManifoldCF would be well-suited to picking
> documents out of there.  I'm trying to think of a good way to do this
> without actually doing that preprocessing step, but at the moment I'm
> coming up with nothing useful.
> 
> Karl
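
A minimal sketch of the preprocessing step Karl describes above, assuming the command log can be read start to finish: replay it into a map keyed by document identifier, and let the connector crawl whatever remains. The class and method names here are illustrative only, not ManifoldCF APIs.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: replay an ordered command log into the final
// "repository" state that a ManifoldCF connector could then crawl
// without caring about command order.
public class CommandReplayer {

  public static final class Command {
    final boolean isDelete;
    final String docId;
    final String content;   // null for deletes

    Command(boolean isDelete, String docId, String content) {
      this.isDelete = isDelete;
      this.docId = docId;
      this.content = content;
    }
  }

  /** Apply the commands in order; the surviving entries are the snapshot to crawl. */
  public static Map<String, String> replay(List<Command> commands) {
    Map<String, String> snapshot = new LinkedHashMap<>();
    for (Command c : commands) {
      if (c.isDelete)
        snapshot.remove(c.docId);          // a later add can re-insert the doc
      else
        snapshot.put(c.docId, c.content);
    }
    return snapshot;
  }
}

For "add{doc1}, delete{doc1}, add{doc1}" the snapshot ends with doc1 present, which is exactly the result the ordering requirement in this thread is meant to guarantee.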
> 
> 
> 
> On Fri, Jun 13, 2014 at 1:14 PM, Matteo Grolla <m.grolla@sourcesense.com>
> wrote:
> 
>> Hi Karl
>>        the reason is that if I read the commands in this order from the
>> files
>> 
>>        add{doc1}, delete{doc1}, add{doc1}
>> 
>>        after the crawl I should find doc1 in solr
>>        but if I process them in this order
>> 
>>        add{doc1}, add{doc1}, delete{doc1}
>> 
>>        there won't be doc1 in solr after the crawl
>> 
>> The concern about sequential performance is right, but my use cases
>> typically involve few deletions and lots of adds.
>> 
>>        suppose I have
>> 
>>        add{doc1}, add{doc2}, add{doc3}, delete{doc1}, add{doc1}
>> 
>> 
>>        I could process
>>        add{doc1}, add{doc2}, add{doc3} in parallel
>>        then  delete{doc1}
>>        then proceed in parallel till the next delete
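
A rough sketch of this batching scheme, assuming the commands arrive as an ordered list: adds are submitted to a thread pool, and every delete first waits for the in-flight adds to finish, so it acts as a barrier. Command, indexDoc and deleteDoc are placeholders for this sketch, not part of ManifoldCF or Solr.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BarrierProcessor {

  // Placeholder command abstraction for this sketch.
  public interface Command {
    boolean isDelete();
    String docId();
    String content();
  }

  private final ExecutorService pool = Executors.newFixedThreadPool(8);

  public void process(List<Command> commands)
      throws InterruptedException, ExecutionException {
    List<Future<?>> inFlight = new ArrayList<>();
    for (Command c : commands) {
      if (c.isDelete()) {
        for (Future<?> f : inFlight)      // barrier: wait for pending adds
          f.get();
        inFlight.clear();
        deleteDoc(c.docId());             // apply the delete on its own
      } else {
        inFlight.add(pool.submit(() -> indexDoc(c.docId(), c.content())));
      }
    }
    for (Future<?> f : inFlight)          // flush the trailing adds
      f.get();
    pool.shutdown();
  }

  private void indexDoc(String docId, String content) { /* send the add to the index */ }
  private void deleteDoc(String docId) { /* send the delete to the index */ }
}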
>> 
>> 
>> --
>> Matteo Grolla
>> Sourcesense - making sense of Open Source
>> http://www.sourcesense.com
>> 
>> On 13 Jun 2014, at 19:06, Karl Wright wrote:
>> 
>>> One other point: if the reason that you would be trying to order things is
>>> because you'd want to process the xml document before processing its
>>> children, you don't need to worry about that at all either, because the
>>> framework takes care of that automatically.  All you need to do is handle
>>> the case where the carrydown data is not present.
>>> 
>>> Thanks,
>>> Karl
>>> 
>>> 
>>> 
>>> On Fri, Jun 13, 2014 at 1:03 PM, Karl Wright <daddywri@gmail.com> wrote:
>>> 
>>>> Hi Matteo,
>>>> 
>>>> The prerequisite event logic is the only way to order document processing
>>>> in ManifoldCF.  The javadoc for the event methods is probably the best
>>>> reference to use.  I can't say from your description how it would map, but
>>>> here's the description in question:
>>>> 
>>>> /** This interface abstracts from the activities that use and govern events.
>>>> *
>>>> * The purpose of this model is to allow a connector to:
>>>> * (a) insure that documents whose prerequisites have not been met do not get processed until those prerequisites are completed
>>>> * (b) guarantee that only one thread at a time deal with sequencing of documents
>>>> *
>>>> * The way it works is as follows.  We define the notion of an "event", which is described by a simple string (and thus can be global,
>>>> * local to a connection, or local to a job, whichever is appropriate).  An event is managed solely by the connector that knows about it.
>>>> * Effectively it can be in either of two states: "completed", or "pending".  The only time the framework ever changes an event state is when
>>>> * the crawler is restarted, at which point all pending events are marked "completed".
>>>> *
>>>> * Documents, when they are added to the processing queue, specify the set of events on which they will block.  If an event is in the "pending" state,
>>>> * no documents that block on that event will be processed at that time.  Of course, it is possible that a document could be handed to processing just before
>>>> * an event entered the "pending" state - in which case it is the responsibility of the connector itself to avoid any problems or conflicts.  This can
>>>> * usually be handled by proper handling of event signalling.  More on that later.
>>>> *
>>>> * The presumed underlying model of flow inside the connector's processing method is as follows:
>>>> * (1) The connector examines the document in question, and decides whether it can be processed successfully or not, based on what it knows about sequencing
>>>> * (2) If the connector determines that the document can properly be processed, it does so, and that's it.
>>>> * (3) If the connector finds a sequencing-related problem, it:
>>>> *     (a) Begins an appropriate event sequence.
>>>> *     (b) If the framework indicates that this event is already in the "pending" state, then some other thread is already handling the event, and the connector
>>>> *          should abort processing of the current document.
>>>> *     (c) If the framework successfully begins the event sequence, then the connector code knows unequivocably that it is the only thread processing the event.
>>>> *         It should take whatever action it needs to - which might be requesting special documents, for instance.  [Note well: At this time, there is no way
>>>> *         to guarantee that special documents added to the queue are in fact properly synchronized by this mechanism, so I recommend avoiding this practice,
>>>> *         and instead handling any special document sequences without involving the queue.]
>>>> *     (d) If the connector CANNOT successfully take the action it needs to to push the sequence along, it MUST set the event back to the "completed" state.
>>>> *         Otherwise, the event will remain in the "pending" state until the next time the crawler is restarted.
>>>> *     (e) If the current document cannot yet be processed, its processing should be aborted.
>>>> * (4) When the connector determines that the event's conditions have been met, or when it determines that an event sequence is no longer viable and has been
>>>> *     aborted, it must set the event status to "completed".
>>>> *
>>>> * In summary, a connector may perform the following event-related actions:
>>>> * (a) Set an event into the "pending" state
>>>> * (b) Set an event into the "completed" state
>>>> * (c) Add a document to the queue with a specified set of prerequisite events attached
>>>> * (d) Request that the current document be requeued for later processing (i.e. abort processing of a document due to sequencing reasons)
>>>> *
>>>> */
>>>> public interface IEventActivity extends INamingActivity
>>>> {
>>>>   public static final String _rcsid = "@(#)$Id: IEventActivity.java 988245 2010-08-23 18:39:35Z kwright $";
>>>>
>>>>   /** Begin an event sequence.
>>>>   * This method should be called by a connector when a sequencing event should enter the "pending" state.  If the event is already in that state,
>>>>   * this method will return false, otherwise true.  The connector has the responsibility of appropriately managing sequencing given the response
>>>>   * status.
>>>>   *@param eventName is the event name.
>>>>   *@return false if the event is already in the "pending" state.
>>>>   */
>>>>   public boolean beginEventSequence(String eventName)
>>>>     throws ManifoldCFException;
>>>>
>>>>   /** Complete an event sequence.
>>>>   * This method should be called to signal that an event is no longer in the "pending" state.  This can mean that the prerequisite processing is
>>>>   * completed, but it can also mean that prerequisite processing was aborted or cannot be completed.
>>>>   * Note well: This method should not be called unless the connector is CERTAIN that an event is in progress, and that the current thread has
>>>>   * the sole right to complete it.  Otherwise, race conditions can develop which would be difficult to diagnose.
>>>>   *@param eventName is the event name.
>>>>   */
>>>>   public void completeEventSequence(String eventName)
>>>>     throws ManifoldCFException;
>>>>
>>>>   /** Abort processing a document (for sequencing reasons).
>>>>   * This method should be called in order to cause the specified document to be requeued for later processing.  While this is similar in some respects
>>>>   * to the semantics of a ServiceInterruption, it is applicable to only one document at a time, and also does not specify any delay period, since it is
>>>>   * presumed that the reason for the requeue is because of sequencing issues synchronized around an underlying event.
>>>>   *@param localIdentifier is the document identifier to requeue
>>>>   */
>>>>   public void retryDocumentProcessing(String localIdentifier)
>>>>     throws ManifoldCFException;
>>>>
>>>>
>>>> }
>>>> 
>>>> 
>>>> As you can see, these constraints are significant and can cause
>>>> single-threaded behavior, so unless you've got a real requirement for
>>>> ordering, it's better not to do it.
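
Pieced together from the javadoc above, the per-document flow inside a connector's processing code might look roughly like the sketch below. Only beginEventSequence, completeEventSequence and retryDocumentProcessing come from IEventActivity; the surrounding method and the needsSequencing/ingest/performOrderedStep helpers are assumptions made purely for illustration.

// Rough sketch of the event flow; helper methods are hypothetical.
void handleDocument(String docId, IEventActivity activities)
  throws ManifoldCFException
{
  // Event names are defined by the connector; this one is made up.
  String event = "myconnection:ordered-step";

  if (!needsSequencing(docId))
  {
    ingest(docId);                               // normal, unordered case
    return;
  }

  if (!activities.beginEventSequence(event))
  {
    // Some other thread already owns the event: requeue this document
    // so it is retried after the event completes.
    activities.retryDocumentProcessing(docId);
    return;
  }

  try
  {
    performOrderedStep(docId);                   // e.g. the step that must run alone
  }
  finally
  {
    // Always release the event, otherwise it stays "pending" until the
    // crawler is restarted.
    activities.completeEventSequence(event);
  }
}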
>>>> 
>>>> Furthermore, the question of deletions is really not germane, because
>>>> ManifoldCF does not in fact order deletions at all.  They are done either
>>>> as a side-effect of document processing (when a document is discovered to
>>>> not be there anymore), or at the end of a job (when orphaned documents are
>>>> removed).  They are also deleted when the job that owns them is deleted.
>>>> 
>>>> Karl
>>>> 
>>>> 
>>>> 
>>>> On Fri, Jun 13, 2014 at 12:52 PM, Matteo Grolla <m.grolla@sourcesense.com>
>>>> wrote:
>>>> 
>>>>> Hi
>>>>>       I'm going to develop a ManifoldCF connector, and one requirement
>>>>> is that it should be able to handle document insertion and deletion in
>>>>> order (details below).
>>>>> I've actually already built such a crawler as a standalone application, and
>>>>> the design was conceptually this:
>>>>> 
>>>>> instead of a Document Queue I have a CommandQueue
>>>>>       commands can be delete (specifying the docId) or add (specifying
>>>>> the doc to be added)
>>>>> when a worker thread takes a delete, no other worker is allowed to pick
>>>>> other commands from the queue until the delete has been committed
>>>>> 
>>>>> 
>>>>> Ex. suppose I have the following chunk of CommandQueue:
>>>>> 
>>>>> add{doc1}, delete{doc1}, add{doc1}
>>>>> 
>>>>> I need to avoid the situation where commands are processed in this order:
>>>>> 
>>>>> add{doc1}, add{doc1}, delete{doc1}
>>>>> 
>>>>> 
>>>>> I think the EventSequence could help me implement this synchronization in
>>>>> ManifoldCF.
>>>>> When seeding the identifiers I could embed the command in the identifier.
>>>>> Ex.
>>>>>       instead of stuffing the identifier "hd-samsung-500GB"
>>>>>       I could stuff "add hd-samsung-500GB"
>>>>> 
>>>>> The question is: am I running into huge trouble trying to implement this
>>>>> requirement or not?
>>>>> 
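
If the command really is embedded in the seeded identifier, the connector has to split it back out when each document is processed. A minimal sketch of such an encoding (the prefix convention is purely illustrative, not a ManifoldCF convention):

// Illustrative helpers for composite "command docId" identifiers.
final class CommandIdentifier {

  static String encode(String command, String docId) {
    return command + " " + docId;               // e.g. "add hd-samsung-500GB"
  }

  static String commandOf(String identifier) {
    return identifier.substring(0, identifier.indexOf(' '));
  }

  static String docIdOf(String identifier) {
    return identifier.substring(identifier.indexOf(' ') + 1);
  }
}

Note that with this scheme "add doc1" and "delete doc1" become distinct identifiers, so the framework would no longer see them as the same document, which may be part of what makes this approach awkward in practice.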
>>>>> --
>>>>> Matteo Grolla
>>>>> Sourcesense - making sense of Open Source
>>>>> http://www.sourcesense.com
>>>>> 
>>>>> 
>>>> 
>> 
>> 

