manifoldcf-dev mailing list archives

From Matteo Grolla <m.gro...@sourcesense.com>
Subject Re: proposals for writing manifold connectors
Date Tue, 01 Jul 2014 14:36:48 GMT
Hi Karl

I read the book and in general all the principles seem sound.
What I'm thinking is that for some specific connectors (that is, under specific conditions) people
may want to exploit that specificity by taking different approaches.
	
> I don't think database management is as difficult as you seem to think.

Maybe I wasn't clear on this, but what I mean is:
if I propose to my typical customer
a crawler that requires Postgres even for simple crawls, they'd probably prefer to write a
custom app for those simple crawls.
If I could at least say that the db doesn't grow much, that would mitigate the problem.
I don't know if I'm the only one with this problem.


> You have forgotten what happens when either errors occur during processing,
> or the agents process working on your documents dies

Let's say FileA
contains DocA1..DocA100.

As expressed in the comment on carry down data:
if I have errors, or the crawler dies while processing DocA50,
then, since I want FileA to be considered processed only when all of its docs have been processed,
at restart the system should
reparse FileA,
skip DocA1..DocA49 (if I'm handling versioning for them),
and process DocA50..DocA100.

If there's a failure I have to reparse FileA, but I avoid storing 100 docs in the db.
For me that's a good trade-off, since failures are not frequent; a rough sketch of this skip logic is below.
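
A minimal sketch of that restart behaviour, assuming the per-document version
string is derived from the containing file (my assumption; the helper names and
types below are illustrative, not ManifoldCF API):

    import java.util.List;
    import java.util.Map;
    import java.util.function.BiConsumer;

    class RestartSketch {
      record Doc(String id, String body) {}

      // After a restart, FileA is parsed again from the beginning.
      static void reprocess(List<Doc> docsInFileA,
                            String fileVersion,               // e.g. FileA's modification date
                            Map<String, String> storedVersions,
                            BiConsumer<Doc, String> index) {  // hypothetical "send to output" hook
        for (Doc doc : docsInFileA) {
          // DocA1..DocA49: this version was already recorded, so skip.
          if (fileVersion.equals(storedVersions.get(doc.id())))
            continue;
          // DocA50..DocA100: process now, then record the version.
          index.accept(doc, fileVersion);
          storedVersions.put(doc.id(), fileVersion);
        }
      }
    }

The only persistent state this relies on is the per-document version, not the
carry down payload itself.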

> You are forgetting the fact that MCF is incremental.
Let's say:

in the first crawl MCF processes
FileA dated 2014-01-01
containing Doc1..Doc10
all docs are versioned  2014-01-01

in the second crawl MCF processes
FileB dated 2014-01-02
containing Doc1..Doc5
all docs are versioned  2014-01-02

so Doc1..Doc5 are overwritten with data from FileB,
and I don't need the carry down data from the previous crawl.
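
Put as a small sketch, under the assumption that a doc's version string is
simply the date of the file that last contained it (my assumption, not
something MCF mandates):

    import java.util.Map;

    class IncrementalSketch {
      // Doc1..Doc5: indexed at "2014-01-01", FileB now supplies "2014-01-02" -> reindex.
      // Doc6..Doc10: FileB doesn't mention them, so they are simply not revisited.
      static boolean needsReindex(String docId,
                                  String newVersion,
                                  Map<String, String> indexedVersions) {
        return !newVersion.equals(indexedVersions.get(docId));
      }
    }

The comparison only needs the stored version strings, not the carried-down
content from the first crawl.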








-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 1 Jul 2014, at 15:04, Karl Wright wrote:

> Hi Matteo,
> 
> I don't think database management is as difficult as you seem to think.
> But more importantly, you seem to have issues with very basic aspects of
> the ManifoldCF design.  It may help to read the book (see
> https://manifoldcf.apache.org/en_US/books-and-presentations.html), to
> understand more completely where the design decisions came from.
> 
> In short, ManifoldCF is built on top of a database for the following
> reasons:
> - resilience
> - restartability
> - control of memory footprint
> - synchronization
> 
> It was, in fact, designed around the capabilities of databases (not just
> postgresql, but all modern databases), so it is not surprising that it uses
> the database for everything persistent at all.  Incremental crawling means
> that even more things need to be persistent in ManifoldCF than they might
> in other crawling designs.
> 
> So I suggest that you carefully read the first chapter of MCF in Action,
> and consider the design point of this crawler carefully.
> 
> As for your specific questions:
> 
> 'I can't say: "before processing another file process all the docs from
> previous files"' - Yes, and the reason for that is because the MCF queue is
> built once again in the database.  Documents that are to be processed must
> be queried, and that query must be performant.
> 
> 'do I need intra cluster synchronization to process the docs contained in a
> file?
> If I state that the machine that processed the file is the one that
> processes the docs contained in it then I don't.'
> 
> You have forgotten what happens when either errors occur during processing,
> or the agents process working on your documents dies (because it got
> killed, say).  Unless you want to lose all context, then you need
> persistent storage.
> 
> 'If it's difficult to do without a db for carry down data I'd like that
> table to remain small, maybe empty it at the end of every crawl.
> How could I do that?'
> 
> You are forgetting the fact that MCF is incremental.  If you want it to do
> the minimum work on a subsequent crawl, it has to keep track of what inputs
> are around for each document to be processed.
> 
> Karl
> 
> 
> 
> On Tue, Jul 1, 2014 at 8:48 AM, Matteo Grolla <m.grolla@sourcesense.com>
> wrote:
> 
>> Hi Karl,
>>        glad it's appreciated.
>> 
>> concerning the answer:
>> 
>> My intention is not to avoid using the db, just to limit its use to what is
>> strictly necessary.
>> And surely I don't want to find new way to handle intra-cluster
>> communication.
>> 
>> For me it would be OK to keep track of crawled docIds and versions in the
>> db; I'd just like to avoid putting carry down data there.
>> 
>> Motivations:
>> I recently worked on a couple of projects dealing with crawling data from
>> external data sources and importing it into Solr.
>> One was implemented as a Lucidworks Search connector
>> The other was a custom crawling app.
>> In both cases the crawling part was simple: the data source gave me
>> all the modified/deleted documents within a time interval.
>> The processing pipeline to enrich and transform the documents was more
>> involved.
>> 
>> In cases like these I'd like to just focus on sizing the crawler and solr
>> instances.
>> If I have to size a db I'll have to deal with a DBA, and many customers are
>> not experienced with Postgres, so the MCF solution becomes less appealing.
>> Even if I find a Postgres DBA, I'll have to deal with him for things like
>> performance problems, size…
>> All things I'd like to avoid if not strictly necessary.
>> 
>> please correct me if I'm wrong in what follows
>> Why do I need carry down data in the db?
>> Because I want bounded memory usage and have no control over the order MCF
>> follows in processing docs.
>> I can't say: "before processing another file process all the docs from
>> previous files"
>> do I need intra cluster synchronization to process the docs contained in a
>> file?
>> If I state that the machine that processed the file is the one that
>> processes the docs contained in it then I don't.
>> 
>> What do you think?
>> If it's difficult to do without a db for carry down data I'd like that
>> table to remain small, maybe empty it at the end of every crawl.
>> How could I do that?
>> 
>> If I were to synthesize this mail in one sentence I'd say:
>> "Given simple crawling requirements I'd like o be able to implement an MCF
>> solution that is performant and simple to manage"
>> 
>> thanks
>> 
>> --
>> Matteo Grolla
>> Sourcesense - making sense of Open Source
>> http://www.sourcesense.com
>> 
>> On 1 Jul 2014, at 12:16, Karl Wright wrote:
>> 
>>> Hi Matteo,
>>> 
>>> Thank you for the work you put in here.
>>> 
>>> I have a response to one particular question buried deep in your code:
>>> 
>>>>>>>>> 
>>> 
>>>         FIXME: carrydown values
>>>           serializing the documents seems a quick way to reach the goal
>>>           what is the size limit for this data in the sql table?
>>> 
>>>           PROPOSAL
>>>           What I'd really like is avoiding the use of a db table for
>>> carrydown data and keep them in memory
>>>           something like:
>>>             MCF starts processing File A
>>>             docs A1, A2, ... AN are added to the queue
>>>             MCF starts processing File B
>>>             docs B1, B2, ... are added to the queue
>>>             and so on...
>>>             as soon as all docs A1..AN have been processed, A is
>>> considered processed
>>>             in case of failure (manifold is restarted in the middle
>>> of a crawl)
>>>               all files (A, B...) should be reprocessed
>>>             the size of the queue should be bounded
>>>               once filled MCF should stop processing files until
>>> more docs are processed
>>> 
>>>           MOTIVATION
>>>           -I'd like to avoid putting pressure on the db if possible,
>>> so that it doesn't become a concern in production
>>>           -performance
>>> <<<<<<
>>> 
>>> Carrydown is explicitly designed to use unlimited-length database
>>> fields.  Your proposal would work OK only within a single cluster
>>> member; however, among multiple cluster
>>> members it could not work.  The database is the ManifoldCF medium of
>>> choice for handling stateful information and for handling
>>> cross-cluster data requirements.
>>> 
>>> Thanks,
>>> Karl
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <matteo.grolla@gmail.com>
>>> wrote:
>>> 
>>>> Hi,
>>>>       I wrote a repository connector for crawling solrxml files
>>>> 
>>>> https://github.com/matteogrolla/mcf-filesystem-xml-connector
>>>> 
>>>> The work is based on the filesystem connector but I made several
>> hopefully
>>>> interesting changes which could be applied elsewhere.
>>>> I have also a couple of questions
>>>> For details see the README file
>>>> 
>>>> Matteo
>> 
>> 

