manifoldcf-dev mailing list archives

From Karl Wright <>
Subject Re: proposals for writing manifold connectors
Date Tue, 01 Jul 2014 10:16:34 GMT
Hi Matteo,

Thank you for the work you put in here.

I have a response to one particular question buried deep in your code:


          FIXME: carrydown values
            serializing the documents seems a quick way to reach the goal
            what is the size limit for this data in the sql table?

            What I'd really like is to avoid using a db table for carrydown data and keep it in memory
            something like:
              MCF starts processing File A
              docs A1, A2, ... AN are added to the queue
              MCF starts processing File B
              docs B1, B2, ... are added to the queue
              and so on...
              as soon as all docs A1..AN have been processed, A is considered processed
              in case of failure (manifold is restarted in the middle of a crawl)
                all files (A, B...) should be reprocessed
              the size of the queue should be bounded
                once filled MCF should stop processing files until more docs are processed

            -I'd like to avoid putting pressure on the db if possible, so that it doesn't become a concern in production

Carrydown is explicitly designed to use unlimited-length database
fields, so the field itself imposes no size limit on this data.  Your
proposal would work only within a single cluster member; it could not
work across multiple cluster members.  The database is the ManifoldCF
medium of choice for handling stateful information and for handling
cross-cluster data requirements.
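
To make the trade-off concrete, here is a minimal Java sketch of the in-memory scheme described in the FIXME above. It is an illustration only: the class InMemoryCarrydownSketch and all of its methods are hypothetical names, not part of the ManifoldCF API, and it assumes the connector tells the tracker when a file has finished emitting child documents.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical sketch of the proposed in-memory carrydown queue; not ManifoldCF code. */
final class InMemoryCarrydownSketch {

    /** A child document plus the data carried down from its parent file. */
    static final class Child {
        final String parentId;
        final String childId;
        final String carrydown;

        Child(String parentId, String childId, String carrydown) {
            this.parentId = parentId;
            this.childId = childId;
            this.carrydown = carrydown;
        }
    }

    private final int capacity;
    private final Queue<Child> queue = new ArrayDeque<>();
    // Children queued or in flight, per parent file.
    private final Map<String, Integer> pending = new HashMap<>();
    // Parent files that have finished emitting children.
    private final Set<String> closed = new HashSet<>();

    InMemoryCarrydownSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Returns false when the queue is full, i.e. file processing should pause. */
    synchronized boolean offerChild(String parentId, String childId, String carrydown) {
        if (queue.size() >= capacity) {
            return false;
        }
        queue.add(new Child(parentId, childId, carrydown));
        pending.merge(parentId, 1, Integer::sum);
        return true;
    }

    /** Records that no more children will be added for this parent file. */
    synchronized void finishedEmitting(String parentId) {
        closed.add(parentId);
    }

    /** Returns the next child to process, or null if nothing is queued. */
    synchronized Child pollChild() {
        return queue.poll();
    }

    /** Marks one child as processed and drops the parent's counter when it reaches zero. */
    synchronized void markProcessed(Child child) {
        pending.computeIfPresent(child.parentId, (id, n) -> n == 1 ? null : n - 1);
    }

    /** A parent counts as processed once it is closed and has no outstanding children. */
    synchronized boolean isParentDone(String parentId) {
        return closed.contains(parentId) && !pending.containsKey(parentId);
    }
}
```

Everything in this sketch lives on the heap of one JVM. A second cluster member cannot see the queue or the counters, and a restart loses them, which is why the state would have to go through the database to meet cross-cluster requirements.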


On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <> wrote:

> Hi,
>         I wrote a repository connector for crawling solrxml files.
> The work is based on the filesystem connector, but I made several hopefully
> interesting changes which could be applied elsewhere.
> I also have a couple of questions.
> For details, see the README file.
> Matteo
