manifoldcf-dev mailing list archives

From: Matteo Grolla <m.gro...@sourcesense.com>
Subject: Re: proposals for writing manifold connectors
Date: Wed, 02 Jul 2014 21:22:02 GMT
Hi Karl,
	what I'm saying is that this connector doesn't need them,
and if this means that it can be implemented efficiently without modifying the MCF framework, that
would be great.

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 02 Jul 2014, at 20:44, Karl Wright wrote:

> Hi Matteo,
> 
> You may not need document tracking and synchronization, but that is
> mcf's major purpose, so any general feature would have to handle that
> case.
> 
> Karl
> 
> Sent from my Windows Phone
> From: Matteo Grolla
> Sent: 7/2/2014 10:49 AM
> To: dev@manifoldcf.apache.org
> Subject: Re: proposals for writing manifold connectors
> Hi Karl,
> 	one note (maybe obvious):
> parent documents (Files) are not to be indexed in Solr; I'm only
> interested in keeping track of them in the log reports
> 
>> What is not controversial is that the IProcessActivity.ingestDocument()
> tell me if I've got it right about why ingestDocument must change:
> in case of crashes/errors, when MCF restarts
> it finds Docs to be processed in the queue and it has to know the
> corresponding File to resume processing
> 
> if I had a hook at the beginning of a crawl that would allow me to
> remove all Docs instances from the queue (but leave the File
> instances)
> then I could just recreate the Docs instances in the queue (Files are
> reprocessed)
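> 
> (purely to illustrate the kind of hook I mean -- nothing like this exists in
> MCF today, and all of these names are made up:)
> 
>   // Hypothetical callback invoked once when a job (re)starts, before any
>   // document is handed to the connector.
>   public interface CrawlStartHook
>   {
>     void crawlStarting(Long jobId, QueueAccess queue) throws Exception;
>   }
> 
>   // Also invented: just enough queue access to drop the synthetic Doc entries
>   // while leaving the File entries, which would then be reparsed.
>   interface QueueAccess
>   {
>     void removeDocumentsMatching(String identifierPrefix) throws Exception;
>   }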
> 
>> But the basic problem is that, in order to be able to delete child
> Am I misunderstanding, or are you thinking that
> 	if I delete FileA then I must delete DocA1..DocAN?
> Because I don't need this.
> 
> thanks
> 
> 
> -- 
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
> 
> On 02 Jul 2014, at 12:57, Karl Wright wrote:
> 
>> Hi Matteo,
>> 
>> I looked a bit into what it would require to be able to index multiple
>> documents given a single parent document, WITHOUT the child documents
>> hitting the ManifoldCF document queue.
>> 
>> What is not controversial is that the IProcessActivity.ingestDocument()
>> method would change so that *two* document identifiers were passed in.  The
>> first would be the parent document identifier, and the second would be a
>> child document identifier, which you would make up presumably based on the
>> parent as a starting point.  This would be a requirement for the
>> incremental indexer.
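>> 
>> Concretely, something along these lines -- a sketch only, not an actual
>> signature, and documentData just stands in for the repository document object:
>> 
>>   // Hypothetical two-identifier variant of ingestDocument().
>>   public interface TwoIdentifierIngestSketch
>>   {
>>     void ingestDocument(String parentDocumentIdentifier,  // the File on the queue
>>                         String childDocumentIdentifier,   // e.g. parentId + ":" + n
>>                         String version,                   // the child's version string
>>                         String documentURI,
>>                         Object documentData)              // placeholder for content/metadata
>>       throws Exception;
>>   }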
>> 
>> But the basic problem is that, in order to be able to delete child
>> documents from the index without involving the repository connector, we
>> need to relate parent documents with child documents in some way, inside
>> the incremental indexer.  There are a number of possible ways of doing
>> this; the simplest would be to just add another column to the ingeststatus
>> table which would allow the separation of parent and child document
>> identifiers.  However, the simple solution is not very good because it
>> greatly exacerbates a problem which we already have in the incremental
>> indexer: there are multiple copies of the document version string being
>> kept, one for each record.  Also, there is currently no logic at all in
>> place to deal with the situation where the list of child documents shrinks;
>> that logic would have to be worked out and there would need to be tracking
>> to identify records that needed to go away as a result.
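>> 
>> Just to make the "shrinking child list" case concrete, a tiny illustrative
>> fragment -- not MCF code, and the example sets are invented:
>> 
>>   import java.util.*;
>> 
>>   public class ChildDiffSketch {
>>     public static void main(String[] args) {
>>       // children recorded for a parent on the previous crawl...
>>       Set<String> previous = new HashSet<>(Arrays.asList("DocA1", "DocA2", "DocA3"));
>>       // ...versus the children the connector produced this time
>>       Set<String> current = new HashSet<>(Arrays.asList("DocA1", "DocA3"));
>>       // whatever disappeared must also be removed from the index
>>       Set<String> toDelete = new HashSet<>(previous);
>>       toDelete.removeAll(current);
>>       System.out.println("children to remove from index: " + toDelete); // [DocA2]
>>     }
>>   }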
>> 
>> In short, this would be a significant change -- which is OK, but before
>> considering it I'd have to work it through carefully, and make sure we
>> don't lose performance etc.
>> 
>> Thanks,
>> Karl
>> 
>> 
>> 
>> On Tue, Jul 1, 2014 at 12:12 PM, Matteo Grolla <m.grolla@sourcesense.com>
>> wrote:
>> 
>>> Hi Karl,
>>>       first of all thanks
>>> 
>>>> The reason MCF is not currently structured this way is because a decision
>>> I think that in general the MCF design is sound and generic.
>>> As a connector developer I'd just like to have more flexibility in
>>> particular situations.
>>> Maybe what I'm searching for is already there or wouldn't be disruptive to
>>> introduce.
>>> An email exchange doesn't make this kind of discussion easy;
>>> to make precise proposals I should probably take a detailed look at the
>>> framework source code.
>>> 
>>> --
>>> Matteo Grolla
>>> Sourcesense - making sense of Open Source
>>> http://www.sourcesense.com
>>> 
>>> On 01 Jul 2014, at 16:47, Karl Wright wrote:
>>> 
>>>> Hi Matteo,
>>>> 
>>>> Ok, from your description it sounds like what you primarily want is for the
>>>> processing of one document to generate N index entries, each with its own
>>>> version string etc.  This would never go near the queue since effectively
>>>> you'd only be dealing with the large files there (FileA and FileB in your
>>>> example).  You are planning to get away with doing no incremental
>>>> management because you will simply repeat yourself if something goes wrong
>>>> in the middle and document processing is not completed.
>>>> 
>>>> The reason MCF is not currently structured this way is because a decision
>>>> needs to be made *up front* whether to process the document or not, and
>>>> that cannot be done in your model without actually fetching and processing
>>>> the large file.  So it is in fact a chicken-and-egg problem.  I will think
>>>> if I can see a solution to it but I've considered this in the past and not
>>>> found a good way to structure this kind of arrangement.  Indeed, carry-down
>>>> was designed in part to solve this problem.
>>>> 
>>>> Karl
>>>> 
>>>> 
>>>> 
>>>> On Tue, Jul 1, 2014 at 10:36 AM, Matteo Grolla <m.grolla@sourcesense.com>
>>>> wrote:
>>>> 
>>>>> Hi Karl
>>>>> 
>>>>> I read the book and in general all the principles seem sound.
>>>>> What I'm thinking is that for some specific connectors (that is, in specific
>>>>> conditions) people may want to exploit that specificity by taking different
>>>>> approaches.
>>>>> 
>>>>>> I don't think database management is as difficult as you seem to think.
>>>>> 
>>>>> Maybe I wasn't clear on this but what I mean is this:
>>>>> If I propose to my typical customer
>>>>> a crawler that requires Postgres even for simple crawls, they'd probably
>>>>> prefer to write a custom app for simple crawls.
>>>>> If I could at least say that the db doesn't grow a lot, that would mitigate
>>>>> the problem.
>>>>> I don't know if I'm the only one with this problem.
>>>>> 
>>>>> 
>>>>>> You have forgotten what happens when either errors occur during processing,
>>>>>> or the agents process working on your documents dies
>>>>> 
>>>>> Let's say fileA
>>>>> contains DocA1.. DocA100
>>>>> 
>>>>> As expressed in the comment on carry down data:
>>>>> if I have errors or the crawler dies in the processing of DocA50,
>>>>> since I want FileA to be considered processed only when all its docs have
>>>>> been processed,
>>>>> at restart the system should:
>>>>> reparse FileA
>>>>> skip DocA1..DocA49 (if I'm handling versioning for them)
>>>>> process DocA50..DocA100
>>>>> 
>>>>> If there's a failure I have to reparse FileA, but I avoid storing 100 docs
>>>>> in the db.
>>>>> For me that's good, failures are not so frequent.
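>>>>> 
>>>>> (a sketch of the resume logic I have in mind, with made-up names -- the only
>>>>> point is the per-doc version check:)
>>>>> 
>>>>>   import java.util.*;
>>>>>   import java.util.function.*;
>>>>> 
>>>>>   class ResumeSketch {
>>>>>     // docsInFile: ids parsed from FileA (DocA1..DocA100);
>>>>>     // indexedVersion: version already in the index for an id, or null;
>>>>>     // fileVersion: FileA's date, used as the version of every doc in it.
>>>>>     static List<String> docsToProcess(List<String> docsInFile,
>>>>>                                       Function<String, String> indexedVersion,
>>>>>                                       String fileVersion) {
>>>>>       List<String> result = new ArrayList<>();
>>>>>       for (String docId : docsInFile) {
>>>>>         // DocA1..DocA49 are already indexed at fileVersion, so they are skipped;
>>>>>         // DocA50..DocA100 are returned for processing.
>>>>>         if (!fileVersion.equals(indexedVersion.apply(docId)))
>>>>>           result.add(docId);
>>>>>       }
>>>>>       return result;
>>>>>     }
>>>>>   }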
>>>>> 
>>>>>> You are forgetting the fact that MCF is incremental.
>>>>> Let's say:
>>>>> 
>>>>> in the first crawl MCF processes
>>>>> FileA dated 2014-01-01
>>>>> containing Doc1..Doc10
>>>>> all docs are versioned  2014-01-01
>>>>> 
>>>>> in the second crawl
>>>>> FileB dated 2014-01-02
>>>>> containing Doc1..Doc5
>>>>> all docs are versioned  2014-01-02
>>>>> 
>>>>> so Doc1..Doc5 are overwritten with data from FileB;
>>>>> I don't need carry-down data from the previous crawl.
>>>>> 
>>>>> --
>>>>> Matteo Grolla
>>>>> Sourcesense - making sense of Open Source
>>>>> http://www.sourcesense.com
>>>>> 
>>>>> On 01 Jul 2014, at 15:04, Karl Wright wrote:
>>>>> 
>>>>>> Hi Matteo,
>>>>>> 
>>>>>> I don't think database management is as difficult as you seem to think.
>>>>>> But more importantly, you seem to have issues with very basic aspects of
>>>>>> the ManifoldCF design.  It may help to read the book (see
>>>>>> https://manifoldcf.apache.org/en_US/books-and-presentations.html), to
>>>>>> understand more completely where the design decisions came from.
>>>>>> 
>>>>>> In short, ManifoldCF is built on top of a database for the following
>>>>>> reasons:
>>>>>> - resilience
>>>>>> - restartability
>>>>>> - control of memory footprint
>>>>>> - synchronization
>>>>>> 
>>>>>> It was, in fact, designed around the capabilities of databases (not just
>>>>>> postgresql, but all modern databases), so it is not surprising that it uses
>>>>>> the database for everything persistent at all.  Incremental crawling means
>>>>>> that even more things need to be persistent in ManifoldCF than they might
>>>>>> in other crawling designs.
>>>>>> 
>>>>>> So I suggest that you carefully read the first chapter of MCF in Action,
>>>>>> and consider the design point of this crawler carefully.
>>>>>> 
>>>>>> As for your specific questions:
>>>>>> 
>>>>>> 'I can't say: "before processing another file process all the docs from
>>>>>> previous files"' - Yes, and the reason for that is because the MCF queue is
>>>>>> built once again in the database.  Documents that are to be processed must
>>>>>> be queried, and that query must be performant.
>>>>>> 
>>>>>> 'do I need intra cluster synchronization to process the docs contained in a
>>>>>> file?
>>>>>> If I state that the machine that processed the file is the one that
>>>>>> processes the docs contained in it then I don't.'
>>>>>> 
>>>>>> You have forgotten what happens when either errors occur during processing,
>>>>>> or the agents process working on your documents dies (because it got
>>>>>> killed, say).  Unless you want to lose all context, then you need
>>>>>> persistent storage.
>>>>>> 
>>>>>> 'If it's difficult to do without a db for carry down data I'd like that
>>>>>> table to remain small, maybe empty it at the end of every crawl.
>>>>>> How could I do that?'
>>>>>> 
>>>>>> You are forgetting the fact that MCF is incremental.  If you want it to do
>>>>>> the minimum work on a subsequent crawl, it has to keep track of what inputs
>>>>>> are around for each document to be processed.
>>>>>> 
>>>>>> Karl
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Jul 1, 2014 at 8:48 AM, Matteo Grolla <m.grolla@sourcesense.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Karl,
>>>>>>>     glad it's appreciated.
>>>>>>> 
>>>>>>> concerning the answer:
>>>>>>> 
>>>>>>> My intention is not to avoid using the db, just limiting its use to what is
>>>>>>> strictly necessary.
>>>>>>> And surely I don't want to find a new way to handle intra-cluster
>>>>>>> communication.
>>>>>>> 
>>>>>>> For me it would be ok to keep track of crawled docIds and versions in the
>>>>>>> db; I'd just like to avoid putting carry-down data there.
>>>>>>> 
>>>>>>> Motivations:
>>>>>>> I recently worked on a couple of projects dealing with crawling data from
>>>>>>> external data sources and importing it into Solr.
>>>>>>> One was implemented as a Lucidworks Search connector;
>>>>>>> the other was a custom crawling app.
>>>>>>> In both cases the crawling part was simple: the data source was giving me
>>>>>>> all the modified / deleted documents within a time interval.
>>>>>>> The processing pipeline to enrich and transform the documents was more
>>>>>>> involved.
>>>>>>> 
>>>>>>> In cases like these I'd like to just focus on sizing the crawler and Solr
>>>>>>> instances.
>>>>>>> If I have to size a db I'll have to deal with a DBA, and many customers are
>>>>>>> not experienced with Postgres, so the MCF solution becomes less appealing.
>>>>>>> Even if I find a Postgres DBA I'll have to deal with him for things like
>>>>>>> performance problems, size…
>>>>>>> All things I'd like to avoid if not strictly necessary.
>>>>>>> 
>>>>>>> Please correct me if I'm wrong in what follows.
>>>>>>> Why do I need carry-down data in the db?
>>>>>>> Because I want bounded memory usage and have no control over the order MCF
>>>>>>> follows in processing docs.
>>>>>>> I can't say: "before processing another file, process all the docs from
>>>>>>> previous files".
>>>>>>> Do I need intra-cluster synchronization to process the docs contained in a
>>>>>>> file?
>>>>>>> If I state that the machine that processed the file is the one that
>>>>>>> processes the docs contained in it, then I don't.
>>>>>>> 
>>>>>>> What do you think?
>>>>>>> If it's difficult to do without a db for carry-down data, I'd like that
>>>>>>> table to remain small, maybe empty it at the end of every crawl.
>>>>>>> How could I do that?
>>>>>>> 
>>>>>>> If I were to synthesize this mail in one sentence I'd say:
>>>>>>> "Given simple crawling requirements, I'd like to be able to implement an MCF
>>>>>>> solution that is performant and simple to manage."
>>>>>>> 
>>>>>>> thanks
>>>>>>> 
>>>>>>> --
>>>>>>> Matteo Grolla
>>>>>>> Sourcesense - making sense of Open Source
>>>>>>> http://www.sourcesense.com
>>>>>>> 
>>>>>>> On 01 Jul 2014, at 12:16, Karl Wright wrote:
>>>>>>> 
>>>>>>>> Hi Matteo,
>>>>>>>> 
>>>>>>>> Thank you for the work you put in here.
>>>>>>>> 
>>>>>>>> I have a response to one particular question buried deep in your code:
>>>>>>>> 
>>>>>>>> >>>>>>
>>>>>>>> 
>>>>>>>>      FIXME: carrydown values
>>>>>>>>        serializing the documents seems a quick way to reach the goal
>>>>>>>>        what is the size limit for this data in the sql table?
>>>>>>>> 
>>>>>>>>        PROPOSAL
>>>>>>>>        What I'd really like is avoiding the use of a db table for
>>>>>>>>        carrydown data and keeping it in memory
>>>>>>>>        something like:
>>>>>>>>          MCF starts processing File A
>>>>>>>>          docs A1, A2, ... AN are added to the queue
>>>>>>>>          MCF starts processing File B
>>>>>>>>          docs B1, B2, ... are added to the queue
>>>>>>>>          and so on...
>>>>>>>>          as soon as all docs A1..AN have been processed, A is
>>>>>>>>          considered processed
>>>>>>>>          in case of failure (manifold is restarted in the middle
>>>>>>>>          of a crawl)
>>>>>>>>            all files (A, B...) should be reprocessed
>>>>>>>>          the size of the queue should be bounded
>>>>>>>>            once filled, MCF should stop processing files until
>>>>>>>>            more docs are processed
>>>>>>>> 
>>>>>>>>        MOTIVATION
>>>>>>>>        -I'd like to avoid putting pressure on the db if possible,
>>>>>>>> so that it doesn't become a concern in production
>>>>>>>>        -performance
>>>>>>>> <<<<<<
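>>>>>>>> 
>>>>>>>> For concreteness, here is a minimal in-memory version of the bookkeeping the
>>>>>>>> proposal above describes -- the classes are invented, and it only works inside
>>>>>>>> a single process, which is exactly the limitation discussed below:
>>>>>>>> 
>>>>>>>>   import java.util.concurrent.*;
>>>>>>>>   import java.util.concurrent.atomic.*;
>>>>>>>> 
>>>>>>>>   class InMemoryCarrydownSketch {
>>>>>>>>     // bounded queue of child docs: put() blocks when full, so file
>>>>>>>>     // processing pauses until enough docs have been drained
>>>>>>>>     final BlockingQueue<String> docQueue = new ArrayBlockingQueue<>(1000);
>>>>>>>>     // per-file count of children still pending; a file counts as
>>>>>>>>     // processed only when its counter reaches zero
>>>>>>>>     final ConcurrentMap<String, AtomicInteger> pending = new ConcurrentHashMap<>();
>>>>>>>> 
>>>>>>>>     void addChild(String fileId, String docId) throws InterruptedException {
>>>>>>>>       pending.computeIfAbsent(fileId, k -> new AtomicInteger()).incrementAndGet();
>>>>>>>>       docQueue.put(fileId + ":" + docId);  // blocks while the queue is full
>>>>>>>>     }
>>>>>>>> 
>>>>>>>>     // called after a child doc has been indexed; true means fileId is done
>>>>>>>>     boolean childProcessed(String fileId) {
>>>>>>>>       return pending.get(fileId).decrementAndGet() == 0;
>>>>>>>>     }
>>>>>>>>   }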
>>>>>>>> 
>>>>>>>> Carrydown is explicitly designed to use unlimited-length database
>>>>>>>> fields.  Your proposal would work OK only within a single cluster
>>>>>>>> member; however, among multiple cluster members it could not work.
>>>>>>>> The database is the ManifoldCF medium of choice for handling stateful
>>>>>>>> information and for handling cross-cluster data requirements.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <matteo.grolla@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>>    I wrote a repository connector for crawling solrxml files:
>>>>>>>>> 
>>>>>>>>> https://github.com/matteogrolla/mcf-filesystem-xml-connector
>>>>>>>>> 
>>>>>>>>> The work is based on the filesystem connector, but I made several hopefully
>>>>>>>>> interesting changes which could be applied elsewhere.
>>>>>>>>> I also have a couple of questions.
>>>>>>>>> For details see the readme file.
>>>>>>>>> 
>>>>>>>>> Matteo
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 

