Date: Mon, 10 Nov 2014 09:44:11 -0500
Subject: Re: Scaling in MCF
From: Karl Wright <daddywri@gmail.com>
To: dev@manifoldcf.apache.org

Hi Alessandro,

bq. is there any parallelism in the indexing process?

Yes, it is highly parallel. Many man-years of effort have gone into making
sure there are no bottlenecks in document processing and indexing.

bq. Is each row indexed sequentially from a datasource? Are jobs occurring
in parallel across different data sources?

You should read the architecture chapters of MCF in Action. Short answer:
there is no ordering, and worker threads handle documents from multiple jobs.

bq. Manifold is really slow in indexing GB of data (simply crawling from
Windows shares or Alfresco).

For Windows shares, the bottleneck is very likely to be Windows itself, and
you can't improve that by increasing parallelism, because Windows servers
will fall over and die if you try. We recommend, in fact, throttling JCIFS
connections heavily to prevent that from occurring.

For Alfresco, I have heard from others that Alfresco is often also a
bottleneck. I believe that people tend to severely under-resource their
Alfresco instances. You may get better results if you give more memory to
your instance.

In both cases I highly recommend getting a couple of thread dumps during
crawling. This is crude but very helpful in determining where the bottleneck
in fact lies. If it is the repository, as I suspect in your case, then you
cannot improve things by tweaking MCF in any way.
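The thread-dump approach above can be made more systematic with a small helper that tallies thread states across jstack-style output; the sample dump text and the helper function below are illustrative sketches, not part of ManifoldCF or its tooling:

```python
import re
from collections import Counter

def summarize_thread_states(dump_text):
    """Tally thread states in a jstack-style thread dump.

    Matches lines like:
        java.lang.Thread.State: BLOCKED (on object monitor)
    A large, stable count of BLOCKED/WAITING worker threads across several
    dumps suggests the repository, not MCF, is the bottleneck.
    """
    states = re.findall(r'java\.lang\.Thread\.State: (\w+)', dump_text)
    return Counter(states)

# Hypothetical excerpt of a thread dump taken during a crawl.
sample = '''\
"Worker thread 1" #12 prio=5
   java.lang.Thread.State: RUNNABLE
"Worker thread 2" #13 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"Worker thread 3" #14 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
'''

print(summarize_thread_states(sample))
# -> Counter({'BLOCKED': 2, 'RUNNABLE': 1})
```

Comparing these tallies across two or three dumps, as Karl suggests, shows whether the same threads stay blocked on repository I/O the whole time.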
bq. Thinking something like SolrCloud is for Solr, for Manifold...

You can spin up multiple agents processes, in fact, and have been able to do
so since MCF 1.5. However, I doubt this will help you, given your
description of the problem so far.

Thanks,
Karl

On Mon, Nov 10, 2014 at 9:30 AM, Alessandro Benedetti <
benedetti.alex85@gmail.com> wrote:

> Hi Karl,
> just thinking about how it works right now...
> Is there any parallelism in the indexing process?
> Is each row indexed sequentially from a datasource?
> Are jobs occurring in parallel across different data sources? (I think yes)
>
> Because I was thinking, at least in my use cases,
> Manifold is really slow in indexing GB of data (simply crawling from
> Windows shares or Alfresco).
> Maybe this performance can be helped with a proper cluster organization of
> different Manifold instances.
> Thinking something like SolrCloud is for Solr, for Manifold...
> Is there any thought about an architecture of different Manifold instances
> working together?
>
> Cheers
>
> 2014-11-10 13:22 GMT+00:00 Karl Wright :
>
> > Hi Aeham,
> >
> > I have a design which should improve reprioritization dramatically. It's
> > described in CONNECTORS-1100, and I'm actively working on it now. This
> > is, however, pretty complicated, in that document scheduling and
> > prioritization are anything but simple in ManifoldCF. I'm hoping that
> > when I'm satisfied with the work, you will have the ability to try it
> > out in a larger setting. But I'm not expecting it to be ready for some
> > weeks.
> >
> > When this work is done, the minimum time required for a job start etc.
> > will be the time needed to clear all existing document priority values
> > to a nullDocumentPriority value. If you have 100 million jobqueue rows,
> > and most of those are active, it will still undoubtedly take PostgreSQL
> > some time to update all of them. Could you come up with an estimate for
> > how long that would in fact take?
> > You mentioned one hour; how many documents was that for?
> >
> > Karl
> >
> > On Fri, Nov 7, 2014 at 12:12 PM, Aeham Abushwashi <
> > aeham.abushwashi@exonar.com> wrote:
> >
> > > Hi Karl,
> > >
> > > It's great to see performance and scalability emphasised as top
> > > priority items for Manifold! This has been clearly demonstrated
> > > through the attention and quick turnaround that a number of recent
> > > performance issues have received. This is very much appreciated!
> > >
> > > My team is happy to help in any way we can to make Manifold scale and
> > > perform better. We'll continue to report the results of our testing
> > > and analyses, and would certainly be willing to contribute
> > > best practices, fixes and enhancements where possible.
> > >
> > > Cheers,
> > > Aeham
> > >
> > > On 6 November 2014 00:39, Karl Wright wrote:
> > >
> > > > Hi all,
> > > >
> > > > We've lately had several users applying ManifoldCF to what I'd call
> > > > "large" crawls (10M - 100M documents). This is great news, and I
> > > > hope their experiences turn out well. I also hope that, once
> > > > successful, these users help us document best practices for crawls
> > > > of this size.
> > > >
> > > > It's also a good opportunity to revisit the sizing constraints for
> > > > MCF as they exist today. There are really two areas of interest when
> > > > we consider the large database instances needed to track this number
> > > > of documents. The first consideration is how quickly we can identify
> > > > records that need to be processed -- and ensure that they are
> > > > processed in an order that makes sense given throttling constraints
> > > > on the queue. The second consideration is what kind of system
> > > > overhead is needed to meet the first constraint, and whether this
> > > > becomes unwieldy at some point.
> > > >
> > > > I've been pleasantly surprised at how well the current MCF
> > > > architecture supports document queuing even when database tables get
> > > > very large. We recently encountered some bugs here, but those were
> > > > easily fixed, and I really see little getting in the way of MCF
> > > > scaling even to a billion documents now. However, the overhead
> > > > needed to manage that scheduling relies on keeping one specific
> > > > index in the proper document order. Under conditions where jobs are
> > > > stopped or started, the index often will need to be reordered. When
> > > > there are lots of documents that need to be reprioritized, this can
> > > > be a very time-consuming operation. In my opinion this is now the
> > > > limiting factor for MCF scaling. When it starts taking an hour or
> > > > more to start a job, or stop it, or restart the agents process,
> > > > working with MCF becomes clearly less than ideal. So I think this
> > > > deserves some thought and work.
> > > >
> > > > Over the next couple of weeks, I'm hoping to spend some time
> > > > thinking through alternatives to the current index structure, which
> > > > might permit faster starts and stops. There's no guarantee of a full
> > > > solution, but my hope would be that with some compound-index magic
> > > > there might be significant improvements here, at no cost to the
> > > > performance of queuing.
> > > >
> > > > Thanks,
> > > > Karl
> > > >
> > >
> >
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
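The reprioritization cost discussed in the thread boils down to one bulk UPDATE over MCF's jobqueue table. A toy sketch of that pattern follows, with sqlite3 standing in for PostgreSQL; the table name, `docpriority` column, and sentinel "null priority" value are modeled on the discussion and should be treated as assumptions, not MCF's actual schema or constants:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-in for MCF's jobqueue table (the real table has many more
# columns and, crucially, a composite index involving docpriority).
cur.execute("CREATE TABLE jobqueue (id INTEGER PRIMARY KEY, docpriority REAL)")
cur.executemany(
    "INSERT INTO jobqueue (id, docpriority) VALUES (?, ?)",
    [(i, i * 0.001) for i in range(100_000)],
)
conn.commit()

# The proposed design clears every live priority to a sentinel "null
# document priority" value at job start/stop. On 100 million rows this
# single statement is what takes PostgreSQL so long, since every row
# (and its index entry) must be rewritten.
NULL_DOC_PRIORITY = 1e9  # illustrative sentinel, not MCF's actual constant
cur.execute("UPDATE jobqueue SET docpriority = ?", (NULL_DOC_PRIORITY,))
conn.commit()

cur.execute("SELECT COUNT(*) FROM jobqueue WHERE docpriority = ?",
            (NULL_DOC_PRIORITY,))
print(cur.fetchone()[0])
# -> 100000
```

Timing this update at realistic row counts on the actual PostgreSQL instance would give the estimate Karl asks Aeham for.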