From: Ameya Aware
To: "user@manifoldcf.apache.org"
Reply-To: user@manifoldcf.apache.org
Date: Thu, 31 Jul 2014 14:44:21 -0400
Subject: Re: Crawling and indexing very slow
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11c2d5ce38eef504ff81a90a

--001a11c2d5ce38eef504ff81a90a
Content-Type: text/plain; charset=UTF-8

The thing is, I am not looking for the data or content of any of the files.
I am only interested in the files' metadata.

So I thought it should be possible to skip reading the files and just pass
each file's metadata to Solr.

This should save a lot of time.

Is it possible to do this?

Thanks,
Ameya

On Thu, Jul 31, 2014 at 2:13 PM, Karl Wright wrote:

> Hi Ameya,
>
> (1) Please look at the Simple History report. Note what kinds of
> documents are being fetched, what kinds are being indexed, and how long it
> is taking. I have noted from your previous posts that you seem to be
> indexing a lot of very large EXE files. This is useless and you should be
> excluding them.
>
> (2) Please look in the manifoldcf.log file for evidence that fetches
> and/or Solr indexing requests are being retried due to errors. It doesn't
> take many documents being chronically retried before forward progress drops
> to near zero.
>
> (3) If you look into (1) & (2) and everything seems fine, the problem may
> be a misalignment between the availability of several kinds of resources.
> Please get a thread dump of the agents process while it is crawling, using
> jstack. Post that thread dump and we can tell you what to look at next.
>
> Karl
>
> On Thu, Jul 31, 2014 at 2:07 PM, Ameya Aware wrote:
>
>> Hi,
>>
>> I am using the filesystem connector to index my entire C drive, with
>> Solr as the output connector.
>>
>> The initial 100000 documents were crawled and indexed successfully in a
>> couple of hours, but after that indexing slowed down badly (around 15-20
>> documents per minute).
>>
>> I am not able to figure out whether the issue is with MCF or Solr.
>>
>> Can you advise me on how to proceed with this?
>>
>> Thanks,
>> Ameya

--001a11c2d5ce38eef504ff81a90a
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a11c2d5ce38eef504ff81a90a--
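
Editor's note on the question above about collecting only file metadata: the following is a minimal sketch, not ManifoldCF's or Solr's own mechanism. It only illustrates, with the standard java.nio.file API, that a file's size and timestamps can be read from the filesystem without opening the file's content. The class name FileMetadataSketch, the method readMetadata, and the map keys are hypothetical names chosen for illustration; how such fields would be wired into a connector or sent to Solr is not shown and is not confirmed by this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of ManifoldCF or Solr.
class FileMetadataSketch {

    // Reads basic attributes from the filesystem; the file content is never opened.
    static Map<String, Object> readMetadata(Path file) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        Map<String, Object> meta = new LinkedHashMap<>();
        meta.put("path", file.toAbsolutePath().toString());
        meta.put("size", attrs.size()); // size in bytes
        meta.put("lastModified", attrs.lastModifiedTime().toInstant().toString());
        meta.put("created", attrs.creationTime().toInstant().toString());
        meta.put("isDirectory", attrs.isDirectory());
        return meta;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no path argument is given.
        Path target = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(readMetadata(target));
    }
}
```

Run with a file path as the only argument to print that file's metadata map. Whether skipping content would actually restore crawl speed in this case is an open question; the checks Karl lists above (Simple History, manifoldcf.log, a jstack thread dump) still apply.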