manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1118) Documents processed by the shared drive connector incur an unnecessary synchronisation hit
Date Tue, 09 Dec 2014 16:20:12 GMT


Karl Wright commented on CONNECTORS-1118:

The best solution would be to promote the PipelineConnections and PipelineConnectionsWithVersions
classes to be first-class API-level objects, preferably with some interface depiction, e.g.
IPipelineConnections and IPipelineConnectionsWithVersions.  All the methods in IIncrementalIngester
would be changed to take IPipelineConnections inputs instead of IPipelineSpecification objects.
 Then it would be possible to cache the objects for at least the duration of a single document's

This is not a trivial change and will require some time to implement.
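
As a rough illustration of the caching idea, here is a minimal Java sketch. It is not ManifoldCF code: the interface name IPipelineConnections is taken from the suggestion above, but its methods, the ExpensiveConnections stand-in, and the DocumentScope holder are all hypothetical and exist only to show the pattern of building the connections object once per document instead of once per check.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class PipelineCacheSketch {

  /** Hypothetical interface; the name follows the proposal, the methods are assumed for illustration. */
  interface IPipelineConnections {
    boolean checkMimeTypeIndexable(String mimeType);
    boolean checkLengthIndexable(long length);
  }

  /** Stand-in for the expensive object; it only counts how often it is constructed. */
  static final class ExpensiveConnections implements IPipelineConnections {
    static final AtomicInteger constructions = new AtomicInteger();

    ExpensiveConnections() {
      // In the real system this is where connection loading and locking would happen.
      constructions.incrementAndGet();
    }

    public boolean checkMimeTypeIndexable(String mimeType) {
      return mimeType.startsWith("text/");
    }

    public boolean checkLengthIndexable(long length) {
      return length <= 1_000_000L;
    }
  }

  /**
   * Hypothetical per-document holder. It builds the connections object lazily
   * and reuses it for every later check on the same document.
   * Not thread-safe: assumes one worker thread handles a given document.
   */
  static final class DocumentScope {
    private final Supplier<IPipelineConnections> factory;
    private IPipelineConnections cached;

    DocumentScope(Supplier<IPipelineConnections> factory) {
      this.factory = factory;
    }

    IPipelineConnections connections() {
      if (cached == null) {
        cached = factory.get();
      }
      return cached;
    }
  }

  public static void main(String[] args) {
    DocumentScope scope = new DocumentScope(ExpensiveConnections::new);
    boolean indexable = scope.connections().checkMimeTypeIndexable("text/plain")
        && scope.connections().checkLengthIndexable(4096L);
    // Two checks, one construction.
    System.out.println(indexable + " " + ExpensiveConnections.constructions.get());
  }
}
```

With a holder like this, both checks for a document share a single construction. The real change would still need to decide when such an object becomes stale.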

It's also worth noting that the *reason* for the locking in this case is cache management.
The objects being loaded are in fact cached objects constructed from their database
images -- locking is needed only to ensure cache consistency.  If ZooKeeper is so slow that
it is dragging down even our caching implementation, we should seriously consider chucking
it in favor of another solution.

> Documents processed by the shared drive connector incur an unnecessary synchronisation hit
> ------------------------------------------------------------------------------------------
>                 Key: CONNECTORS-1118
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework core
>    Affects Versions: ManifoldCF 1.7.2
>            Reporter: Aeham Abushwashi
>            Assignee: Karl Wright
> Each document processed by the shared drive connector is passed through SharedDriveConnector#checkInclude
> to verify whether the document is eligible for ingestion. The calls made here to WorkerThread$ProcessActivity#checkMimeTypeIndexable
> and WorkerThread$ProcessActivity#checkLengthIndexable are unnecessarily costly, as they each
> create a fresh instance of IncrementalIngester$PipelineConnections on every call. The constructor
> of IncrementalIngester$PipelineConnections can be very expensive due to the loading of output
> connection objects, which in turn requires some locking (via ZK, in a distributed environment).
>
> The other area of inefficiency is in WorkerThread$ProcessActivity#processDocumentReferences.
> This method creates new instances of PriorityCalculator using the less-efficient 3-arg constructor.
> This can be addressed using the same pattern implemented for CONNECTORS-1094.
>
> To highlight the impact of the above calls, I profiled an active worker thread for 40
> minutes. During that window, it spent ~23 minutes in SharedDriveConnector#checkInclude and
> its callees, plus 9 minutes creating instances of PriorityCalculator.
>
> I've seen the above issues when using the shared drive connector, but I think other connectors
> could be impacted too, depending on how they're implemented.

