manifoldcf-user mailing list archives

From: Adrian Conlon <Adrian.Con...@arup.com>
Subject: RE: Pushing extra items into an index (outside normal crawl job)
Date: Fri, 16 Aug 2013 09:52:40 GMT

Thanks Karl,

That's an interesting thought.  So if I've understood you correctly, I could create a
temporary job, set its priority to one, start it, and that's it?  Individual job queues
are effectively handled separately?  There might be a number of temporary jobs on the go
at any one time, I guess, since a job couldn't be deleted until it had finished.  Do you
think that would be an issue?  In any event, that's given me food for thought, so I'll take
a look on that basis.
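
A rough sketch of that temporary-job workflow, not tested against a live instance: it assumes the ManifoldCF JSON API service is reachable at the base URL below, that the job field names and the `start`, `startminimal`, and `jobs` resources match the programmatic API documentation for your version, and that the connection names and document specification are supplied by the caller. The helper names are hypothetical. The code only builds the HTTP requests; nothing is sent.

```python
import json
from urllib import request

# Assumption: default address of the API service in the example deployment.
API_BASE = "http://localhost:8345/mcf-api-service/json"


def temporary_job_body(description, repository_connection, output_connection,
                       document_specification, priority=1):
    """Build the JSON body for a one-off job (field names are assumptions
    taken from the API documentation; verify them for your version)."""
    if not 1 <= priority <= 10:
        raise ValueError("priority must be between 1 and 10")
    return {
        "job": {
            "description": description,
            "repository_connection": repository_connection,
            "output_connection": output_connection,
            # Connector-specific; describes which documents to push.
            "document_specification": document_specification,
            "priority": str(priority),
            "run_mode": "scan once",
            "start_mode": "manual",
        }
    }


def create_job_request(body):
    """POST to the jobs resource to create the temporary job."""
    data = json.dumps(body).encode("utf-8")
    return request.Request(API_BASE + "/jobs", data=data, method="POST",
                           headers={"Content-Type": "application/json"})


def start_job_request(job_id, minimal=False):
    """PUT to start the job; minimal=True requests a "minimal" run instead."""
    resource = "startminimal" if minimal else "start"
    return request.Request(f"{API_BASE}/{resource}/{job_id}", data=b"",
                           method="PUT")


def delete_job_request(job_id):
    """DELETE the temporary job; only sensible once the job has finished."""
    return request.Request(f"{API_BASE}/jobs/{job_id}", method="DELETE")
```

Each request could then be sent with `request.urlopen(req)`. Since a job can't be deleted while it is still running, the caller would need to check the job's status before issuing the delete.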

With regard to your second thought: I had high hopes for minimal job runs, but they seem to
take almost as long as full job runs.  I haven't really sat down and worked out timings,
but a speed-up of about 10% on a reasonably sized (400,000 documents or so) JCIFS repository
was all I saw.  Is that what you'd expect?

Adrian

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: 15 August 2013 18:39
To: Adrian Conlon; user@manifoldcf.apache.org
Subject: RE: Pushing extra items into an index (outside normal crawl job)

Hi Adrian,

There is already a concept of job priority.  It is on a scale of one to ten; the default
value is 5.

Your second idea is also somewhat similar to a "minimal" job run.  Might want to look into
that as well.  Depending on your connector these two constructs together might well work for
you.

Karl

Sent from my Windows Phone
________________________________
From: Adrian Conlon
Sent: 8/15/2013 12:18 PM
To: user@manifoldcf.apache.org
Subject: Pushing extra items into an index (outside normal crawl job)

Hi All,

I've been asked to consider adding items to an index outside normal repository crawl job processing
(e.g. to reduce the latency between a document being added to a repository and becoming
available in the index).

My initial thoughts on this are that this doesn't really fit in with the current ManifoldCF
architecture.

With that in mind, I've come up with a couple of ideas (neither tested, nor thought through!)
that I'd like to run past the list to see whether they:


a) Have the possibility of being reasonable

b) Might be something that could be passed back into the ManifoldCF project (perhaps
as a contrib)

Idea one (probably the most work, but perhaps architecturally cleanest):

1) Introduce the idea of priority into ManifoldCF queues

2) Add an extra "mcf" web service that allows queue injection

Idea two (easiest, if it works, but quite "hacky"):

1) Add a web service that uses some "mcf" code to send documents directly to the output
connector

2) Obviously, this can't go through the ManifoldCF queues

3) Relies upon a normal mcf job to tidy up any anomalies that might have occurred (deleting
and re-ingesting would be fine, I think)

How do these sound?  Are they worth thinking about?  Or indeed (better yet!), is there a better
way I haven't thought of...?

Thanks,

Adrian

