manifoldcf-user mailing list archives

From Ameya Aware <ameya.aw...@gmail.com>
Subject Re: Crawling and indexing very slow
Date Thu, 31 Jul 2014 18:44:21 GMT
The thing is, I am not looking for the data or content of any of the
files. I am only interested in each file's metadata.

So I thought it should be possible to skip reading the files altogether,
just get each file's metadata, and pass that to Solr.

That should save a lot of time.

Is it possible to do this?

Thanks,
Ameya
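
To make the metadata-only idea above concrete, here is a minimal standalone
sketch in Python. It is not ManifoldCF or Solr code and says nothing about
what the filesystem connector supports; it only illustrates that size and
modification time can be obtained from the filesystem without opening a file.
The function names, the dictionary keys, and the default ".exe" skip list are
hypothetical choices made for illustration.

```python
import os
import stat
from datetime import datetime, timezone


def file_metadata(path):
    """Return basic metadata for one path without opening or reading it."""
    st = os.stat(path)  # queries the filesystem entry only, not the contents
    return {
        "path": os.path.abspath(path),
        "size_bytes": st.st_size,
        "modified_utc": datetime.fromtimestamp(
            st.st_mtime, tz=timezone.utc
        ).isoformat(),
        "is_dir": stat.S_ISDIR(st.st_mode),
    }


def walk_metadata(root, skip_extensions=(".exe",)):
    """Yield metadata for each file under root, skipping the given extensions."""
    skip = tuple(ext.lower() for ext in skip_extensions)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(skip):
                continue  # excluded by extension (hypothetical default: .exe)
            try:
                yield file_metadata(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is not accessible; skip it
```

Each yielded dictionary could then be handed to whatever indexing step is in
use; how that maps onto a ManifoldCF job is left open here.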



On Thu, Jul 31, 2014 at 2:13 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Ameya,
>
> (1) Please look at the Simple History report.  Note what kinds of
> documents are being fetched, what kinds are being indexed, and how long it
> is taking.  I have noted from your previous posts that you seem to be
> indexing a lot of very large EXE files.  This is useless and you should be
> excluding them.
>
> (2) Please look in the manifoldcf.log file for evidence that fetches
> and/or Solr indexing requests are being retried due to errors.  It doesn't
> take many documents being chronically retried before forward progress drops
> to near zero.
>
> (3) If you look into (1) & (2) and everything seems fine, it may be a
> misalignment between availability of several kinds of resources that is the
> problem.  Please get a thread dump of the agents process while it is
> crawling, using jstack.  Post that thread dump and we can tell you what to
> look at next.
>
> Karl
>
>
>
> On Thu, Jul 31, 2014 at 2:07 PM, Ameya Aware <ameya.aware@gmail.com>
> wrote:
>
>> Hi,
>>
>>
>> I am using the filesystem connector to index my entire C drive, with Solr
>> as the output connector.
>>
>> The initial 100000 documents were crawled and indexed successfully in a
>> couple of hours, but after that indexing slowed down badly (around 15-20
>> documents per minute).
>>
>>
>> I am not able to figure out whether the issue is with MCF or Solr.
>>
>>
>> Can you advise me on how to proceed with this?
>>
>>
>> Thanks,
>> Ameya
>>
>
>
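
Step (3) in the quoted reply asks for a jstack thread dump of the agents
process. As a rough first look before posting the dump, the sketch below
tallies thread states from the dump text. It relies only on the
"java.lang.Thread.State:" lines that jstack prints for each thread; the
function name is hypothetical, and this is a summary aid, not a replacement
for reading the full dump.

```python
import re
from collections import Counter

# jstack prints one line such as
#    java.lang.Thread.State: WAITING (on object monitor)
# under each Java thread header.
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)")


def count_thread_states(dump_text):
    """Count how many threads are in each state in a jstack dump."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = STATE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Assuming the dump was saved with something like `jstack <pid> > dump.txt`,
calling `count_thread_states(open("dump.txt").read())` gives a per-state
count. A large share of threads in BLOCKED or WAITING would be one hint that
threads are stalled waiting on some resource.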
