manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CONNECTORS-989) Support virtual child document model
Date Thu, 03 Jul 2014 12:06:25 GMT
Karl Wright created CONNECTORS-989:
--------------------------------------

             Summary: Support virtual child document model
                 Key: CONNECTORS-989
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-989
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Framework agents process
    Affects Versions: ManifoldCF 1.7
            Reporter: Karl Wright
            Assignee: Karl Wright


In some cases, the documents that are indexed are virtual children of the documents that
are queued. A good example is an RSS feed, where all of the data being indexed comes from the feed.

In order to implement this, the following changes would be required:

(1) IProcessActivity.ingestDocument() has a variant which allows you to include a virtual
child document identifier in addition to the main document identifier.
(2) IIncrementalIngester's addOrReplaceDocument receives TWO document keys -- one for the main
(queued) document identifier, one for the child virtual document identifier.
(3) IIncrementalIngester has two new methods: beginDocument() and endDocument(), both of which
take a main (queued) document identifier as an argument.
(4) ingeststatus table has two additional columns: a state, and a child key.
(5) The flow is: at beginDocument() time, put all records relating to a document into a "processing"
state.  Documents that are seen have their state changed.  Documents never encountered are
deleted at the end.
(6) Incremental decisions not to update an output record STILL will require that the record
be touched and its state set.
(7) DocumentIngest records for the entire set of children will be fetched when the document
is queued.
(8) The getDocumentVersions() method must be modified to allow return of version strings for
all children, although there can be "shortcuts" as well (where a single version string
applies to all children).
(9) The decision about whether to refetch a document is based on the returned version strings
and on those fetched by the stuffer thread.
(10) Similarly, processDocuments() receives version strings for all virtual children.
(11) There is no need to actively reset the state of document records on restart; the current
logic should be robust enough to be able to generate the required deletions.
(12) Deleting a document deletes ALL child virtual documents.  This happens within the incremental
ingester.
(13) Requeuing interval must be computed across all children, taking the minimum, since there's
no requirement that an ingeststatus record exist for the parent.
(14) All other logic, including making sure only one agent operates on a URL at a time, is
the same.
(15) Interrupting the delete phase is safe because next time the doc is processed the records
will be removed.
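
The flow in steps (3), (5), (6), (12) and (15) can be sketched roughly as follows. This is
only an in-memory illustration, not ManifoldCF code: the class IngestStatusSketch, its Row
type, the State values, the return types, and the use of "" to mean "no virtual child" are
all assumptions made for the example. Only the method names beginDocument(), endDocument()
and addOrReplaceDocument() come from the proposal, and their signatures here are guesses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the ingeststatus table with the two proposed
// extra columns (a state and a child key). Not the real IIncrementalIngester.
class IngestStatusSketch {
  enum State { NORMAL, PROCESSING }

  static final class Row {
    String version;
    State state;
    Row(String version, State state) { this.version = version; this.state = state; }
  }

  // parent (queued) key -> child key -> row; "" stands for "no virtual child".
  private final Map<String, Map<String, Row>> table = new HashMap<>();

  // Step (5): put every existing record of the parent into the "processing" state.
  void beginDocument(String parentKey) {
    Map<String, Row> children = table.get(parentKey);
    if (children == null) return;
    for (Row row : children.values()) {
      row.state = State.PROCESSING;
    }
  }

  // Steps (2) and (6): returns true if the output record would be (re)indexed,
  // false if the update is skipped. Either way the record is touched and its
  // state is set back to NORMAL so that endDocument() keeps it.
  boolean addOrReplaceDocument(String parentKey, String childKey, String version) {
    Map<String, Row> children = table.computeIfAbsent(parentKey, k -> new HashMap<>());
    Row row = children.get(childKey);
    if (row == null) {
      children.put(childKey, new Row(version, State.NORMAL));
      return true;
    }
    boolean changed = !row.version.equals(version);
    row.version = version;
    row.state = State.NORMAL;
    return changed;
  }

  // Step (5): delete every record never encountered since beginDocument().
  // Returns the child keys that were removed.
  List<String> endDocument(String parentKey) {
    List<String> deleted = new ArrayList<>();
    Map<String, Row> children = table.get(parentKey);
    if (children == null) return deleted;
    Iterator<Map.Entry<String, Row>> it = children.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Row> entry = it.next();
      if (entry.getValue().state == State.PROCESSING) {
        deleted.add(entry.getKey());
        it.remove();
      }
    }
    if (children.isEmpty()) table.remove(parentKey);
    return deleted;
  }

  // Step (12): deleting the parent deletes ALL child virtual documents.
  // Returns the number of records removed.
  int deleteDocument(String parentKey) {
    Map<String, Row> children = table.remove(parentKey);
    return children == null ? 0 : children.size();
  }

  // Number of records currently held for a parent.
  int childCount(String parentKey) {
    Map<String, Row> children = table.get(parentKey);
    return children == null ? 0 : children.size();
  }
}
```

In this sketch, a record left in the PROCESSING state by an interrupted endDocument() is
simply marked again by the next beginDocument() and removed by the next endDocument(),
which is the behavior step (15) relies on.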

Analysis:
- The critical thing is making the non-virtual case no worse.
- For a virtual child document, instead of one db access, there are two.
- For document records that are not changed, there are two additional writes that were not
needed before.
- There's an additional index (or the document key index has another subfield).
- If the queries can be written in such a way as to treat the standard (no child document)
case specially, we may be able to avoid much of the impact: only two index queries per document,
each returning zero rows.
- If we handle the standard case using the same mechanism, the WorkerThread logic dealing
with deletions can go away.

Summary:
- Additional database overhead in the non-virtual indexing case consists of one additional
write and one additional zero-row query, OR two additional zero-row queries.
- Additional database overhead in the non-virtual skip case consists of two additional writes,
OR two additional zero-row queries.
- The overhead per document is low, but it is not negligible and will impact overall framework performance.
- The up-sides are as follows: (a) handling an important but infrequent case better; (b) less
connector involvement in indexing (e.g., IProcessActivity.deleteDocument() does nothing now,
and can be deprecated).




--
This message was sent by Atlassian JIRA
(v6.2#6252)
