manifoldcf-user mailing list archives

From Timo Selvaraj <timo.selva...@gmail.com>
Subject Re: File system continuous crawl settings
Date Fri, 08 May 2015 22:46:07 GMT
Hi Karl,

The only error message which seems to be continuously thrown in manifold log is :

FATAL 2015-05-08 18:42:47,043 (Worker thread '40') - Error tossed: null
java.lang.NullPointerException

I do notice that the file that needs to be deleted is shown under the Queue Status report and keeps jumping between “Processing” and “About to Process” statuses every 30 seconds.

Timo


> On May 8, 2015, at 1:40 PM, Karl Wright <daddywri@gmail.com> wrote:
> 
> Hi Timo,
> 
> As I said, I don't think your configuration is the source of the delete issue. I suspect
> the searchblox connector.
> 
> In the absence of a thread dump, can you look for exceptions in the manifoldcf log?
> 
> Karl
> 
> Sent from my Windows Phone
> From: Timo Selvaraj
> Sent: 5/8/2015 10:06 AM
> To: user@manifoldcf.apache.org
> Subject: Re: File system continuous crawl settings
> 
> When I change the settings to the following, updated or modified documents are now indexed
> but deleting the documents that are removed is still an issue:
> 
> Schedule type:	Rescan documents dynamically
> Minimum recrawl interval:	5 minutes
> Maximum recrawl interval:	10 minutes
> Expiration interval:	Infinity
> Reseed interval:	60 minutes
> No scheduled run times
> Maximum hop count for link type 'child':	Unlimited
> Hop count mode:	Delete unreachable documents
> 
> Do I need to set the reseed interval to Infinity?
> 
> Any thoughts?
> 
> 
>> On May 8, 2015, at 6:18 AM, Karl Wright <daddywri@gmail.com> wrote:
>> 
>> I just tried your configuration here.  A deleted document in the file system was
>> indeed picked up as expected.
>> 
>> I did notice that your "expiration" setting is, essentially, cleaning out documents
>> at a rapid clip.  With this setting, documents will be expired before they are recrawled.
>> You probably want one strategy or the other but not both.
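The interval arithmetic behind this point can be sketched as follows (illustrative only, not ManifoldCF code): with an expiration interval no larger than the minimum recrawl interval, a document always reaches its expiration time at or before its earliest possible recrawl time.

```java
public class Intervals {
    // true if a document would expire at or before it can next be recrawled
    static boolean expiresFirst(long expirationMin, long minRecrawlMin) {
        return expirationMin <= minRecrawlMin;
    }

    public static void main(String[] args) {
        System.out.println(expiresFirst(5, 5));   // the settings quoted above
        System.out.println(expiresFirst(60, 5));  // a longer expiration avoids the race
    }
}
```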
>> 
>> As for why a deleted document is "stuck" in Processing: the only thing I can think
>> of is that the output connection you've chosen is having trouble deleting the document
>> from the index.  What output connector are you using?
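A hypothetical, heavily simplified model (not ManifoldCF's actual code) of why a document whose delete keeps failing would bounce between "About to Process" and "Processing": the worker takes it off the queue, the output delete throws, and the framework requeues it for another attempt.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RequeueModel {
    interface Output { void delete(String uri) throws Exception; }

    // Drain the queue, requeueing any document whose delete fails;
    // maxCycles bounds the simulation.
    static int attemptDeletes(Deque<String> queue, Output out, int maxCycles) {
        int attempts = 0;
        while (!queue.isEmpty() && attempts < maxCycles) {
            String doc = queue.poll();   // "About to Process" -> "Processing"
            attempts++;
            try {
                out.delete(doc);         // delete from the index
            } catch (Exception e) {
                queue.add(doc);          // failed: back to "About to Process"
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        Deque<String> q = new ArrayDeque<>();
        q.add("file:///test/deleted.txt");
        // An output whose delete always throws NPE, as the log excerpt suggests:
        Output broken = uri -> { throw new NullPointerException(); };
        int attempts = attemptDeletes(q, broken, 3);
        System.out.println("attempts=" + attempts + " stillQueued=" + q.size());
    }
}
```

Under this model the document never leaves the queue, which matches the 30-second status flapping described above.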
>> 
>> Karl
>> 
>> 
>> On Fri, May 8, 2015 at 4:36 AM, Timo Selvaraj <timo.selvaraj@gmail.com> wrote:
>> Hi,
>> 
>> We are testing the continuous crawl feature of the file system connector on a small
>> folder, to verify that the continuous crawl job handles newly added documents, removed
>> documents, and modified documents:
>> 
>> Here are the settings we use:
>> 
>> Schedule type:	Rescan documents dynamically
>> Minimum recrawl interval:	5 minutes
>> Maximum recrawl interval:	10 minutes
>> Expiration interval:	5 minutes
>> Reseed interval:	10 minutes
>> No scheduled run times
>> Maximum hop count for link type 'child':	Unlimited
>> Hop count mode:	Delete unreachable documents
>> 
>> 
>> New documents seem to be getting picked up by the job; however, the removal of a
>> document or an update to a document is not being picked up.
>> 
>> Am I missing any settings for the deletions or updates? I do see that the document
>> that has been removed is showing as Processing under Queue Status, and the others are
>> showing as Waiting for Processing.
>> 
>> Any idea what setting is missing for the deletes/updates to be recognized and re-indexed?
>> 
>> Thanks,
>> Timo 
>> 
> 

