manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Re-sending docs to output connector
Date Tue, 24 May 2011 23:28:56 GMT
The one requirement you may have overlooked is that Solr be able to
take advantage of the item cache automatically if it happens to be
restarted in the middle of an indexing pass.  If you think about it,
you will realize that this cannot be done externally to Solr, unless
Solr learns how to "pull" documents from the item cache and somehow
keeps track of the last item/operation it successfully committed.
That's why I proposed putting the whole cache under Solr's auspices.
Deletions would also need to be recorded in the "cache", so it would
not really be a cache but more like a transaction log.  But I agree
that the right place for such a transaction log is effectively
between MCF and Solr.
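
To make that concrete, here is a minimal sketch of what such a
transaction log might look like.  None of these classes exist in MCF
or Solr today; every name and signature below is invented purely for
illustration:

    // Hypothetical sketch of a transaction log between MCF and Solr;
    // nothing here is an existing ManifoldCF or Solr API.
    import java.util.List;

    enum Op { INDEX, DELETE }

    // One durable entry per document event handed off by ManifoldCF.
    class LogEntry {
        final long seq;          // monotonically increasing log position
        final Op op;
        final String docId;
        final byte[] content;    // null for deletions

        LogEntry(long seq, Op op, String docId, byte[] content) {
            this.seq = seq;
            this.op = op;
            this.docId = docId;
            this.content = content;
        }
    }

    interface TransactionLog {
        long append(Op op, String docId, byte[] content); // returns seq
        List<LogEntry> readFrom(long seq);  // all entries at or after seq
        void checkpoint(long seq);          // last seq Solr durably committed
        long lastCheckpoint();
    }

    // On restart, Solr (or a feeder beside it) replays everything after
    // the last checkpoint, advancing the checkpoint only after commit.
    class Replay {
        static void recover(TransactionLog log, Feeder solr) {
            long last = log.lastCheckpoint();
            for (LogEntry e : log.readFrom(last + 1)) {
                if (e.op == Op.INDEX) solr.index(e.docId, e.content);
                else solr.delete(e.docId);
                last = e.seq;
            }
            solr.commit();
            log.checkpoint(last);
        }
    }

    interface Feeder {  // stand-in for whatever update path Solr exposes
        void index(String id, byte[] content);
        void delete(String id);
        void commit();
    }

The key invariant is that the checkpoint moves only after Solr has
durably committed, so a crash at any point just means some entries
get replayed.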

Obviously the cache would also need to be disk-based, or once again
guaranteed delivery would not be possible.  Compression might be
useful, as would checkpoints in case the data got large.  This is
very database-like, so CouchDB might be a reasonable way to do it,
especially if this code is considered part of Solr.  If it is part of
ManifoldCF, we should see whether PostgreSQL would suffice, since it
will likely already be installed and ready to go.
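
If the log did end up on the ManifoldCF side, a rough JDBC sketch
against PostgreSQL might look like the following (the table and
column names are invented; compression and checkpointing are
omitted):

    // Rough sketch of a PostgreSQL-backed log table; names are invented.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.sql.Types;

    class PostgresFeedLog {
        private final Connection conn;

        PostgresFeedLog(Connection conn) throws SQLException {
            this.conn = conn;
            try (Statement s = conn.createStatement()) {
                s.execute("CREATE TABLE IF NOT EXISTS feed_log ("
                        + " seq BIGSERIAL PRIMARY KEY,"    // replay order
                        + " op CHAR(1) NOT NULL,"          // 'I' index, 'D' delete
                        + " doc_id VARCHAR(1024) NOT NULL,"
                        + " content BYTEA)");              // null for deletions
            }
        }

        void append(char op, String docId, byte[] content) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO feed_log (op, doc_id, content) VALUES (?, ?, ?)")) {
                ps.setString(1, String.valueOf(op));
                ps.setString(2, docId);
                if (content != null) ps.setBytes(3, content);
                else ps.setNull(3, Types.BINARY);
                ps.executeUpdate();
            }
        }
    }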

Karl

On Tue, May 24, 2011 at 5:01 PM, Jan Høydahl <jan.asf@cominvent.com> wrote:
> The "Refetch all ingested documents" works, but with Web crawling the problem is that
it will take almost as long as a new crawl to re-feed.
>
> The solutions could be
> A) Add a stand-alone cache in front of Solr
> B) Add a caching proxy in front of MCF - will allow speedy re-crawl (but clunky to administer)
> C) Extend MCF with an optional item cache. This could allow a "refeed from cache" button
somewhere...
>
> The cache in C could be realized externally to MCF, e.g. as a CouchDB cluster. To enable,
you'd add the CouchDB access into to properties.xml.
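>
> Something along these lines, say - the property names here are purely made up, since MCF defines no CouchDB settings today:
>
>     <configuration>
>       <!-- hypothetical entries, for illustration only -->
>       <property name="org.apache.manifoldcf.itemcache.couchdb.url" value="http://localhost:5984"/>
>       <property name="org.apache.manifoldcf.itemcache.couchdb.database" value="mcf_item_cache"/>
>     </configuration>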
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 24 May 2011, at 15:11, Karl Wright wrote:
>
>> ManifoldCF is designed to deal with the problem of repeated or
>> continuous crawling, doing only what is needed on subsequent crawls.
>> It is thus a true incremental crawler.  But in order for this to work
>> for you, you need to let ManifoldCF do its job of keeping track of
>> which documents (and which document versions) have been handed to the
>> output connection.  For the situation where you change something in
>> Solr, the ManifoldCF solution is the "Refetch all ingested
>> documents" button in the Crawler UI.  This is on the view page for
>> the output connection.  Clicking that button will cause ManifoldCF to
>> re-index all documents - but it will also require ManifoldCF to
>> recrawl them, because ManifoldCF does not keep copies of the
>> documents it crawls.
>>
>> If you need to avoid recrawling at all costs when you change Solr
>> configurations, you may well need to put some software of your own
>> devising between ManifoldCF and Solr.  You basically want to develop
>> a content repository that ManifoldCF outputs to, and that can then
>> be scanned and fed to your Solr instance.  I actually proposed this
>> design for a Solr "guaranteed delivery" mechanism, because until
>> Solr commits a document it can still be lost if the Solr instance is
>> shut down.  Clearly something like this is needed, and it would
>> likely solve your problem as well.  The main issue, though, is that
>> it would need to be integrated with Solr itself, because you'd
>> really want it to pick up where it left off if Solr is cycled, etc.
>> In my opinion, this functionality really can't live inside
>> ManifoldCF, for that reason.
>>
>> Karl
>>
>> On Tue, May 24, 2011 at 8:57 AM, Jan Høydahl <jan.asf@cominvent.com> wrote:
>>> Hi,
>>>
>>> Is there an easy way to separate fetching from ingestion?
>>> I'd like to first run a crawl for several days, and then feed it to my Solr output as fast as possible.
>>> Also, after schema changes in Solr, there is a need to re-feed all docs.
>>>
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>>
>>>
>
>
