manifoldcf-user mailing list archives

From Scott Schneider <scottsc...@gmail.com>
Subject Re: Slow performance with a basic setup
Date Wed, 28 Mar 2012 18:09:22 GMT
Ah, thanks!  I set up PostgreSQL in my previous installation, but
missed it this time.

Scott


On Wed, Mar 28, 2012 at 11:06 AM, Karl Wright <daddywri@gmail.com> wrote:
> Now it sounds like you are running into known problems with Apache
> Derby.  That is why we suggest using PostgreSQL rather than Derby for
> any kind of real world crawling.  Derby is super convenient but it has
> problems handling deadlocks properly.
>
> You can also use HSQLDB if you prefer an integrated solution, but
> PostgreSQL is faster.
>
> I suggest you look at
> http://incubator.apache.org/connectors/en_US/performance-tuning.html
> to get an idea what all this is about, and also don't forget to look
> at how-to-build-and-deploy.html for a general idea how to set up both
> single-process and multi-process installations that use PostgreSQL.
>
> Thanks,
> Karl
>
> On Wed, Mar 28, 2012 at 1:56 PM, Scott Schneider <scottsch42@gmail.com> wrote:
>> Thanks for the quick response!  I had been using all the default
>> settings.  Once I deleted the bandwidth throttling, one phase of the
>> job goes much faster.  The number of active documents goes from 0 to the
>> total in just a minute or two.  The overall time seems to be shorter, but it
>> still takes about an hour to process ~600 files totaling ~800 KB.  I
>> also increased the max connections to 50 on the web, null, and Solr
>> connections and changed Solr to commit within 30,000 msec rather than
>> at the end of every job.  That does not seem to have made a
>> difference.
>>
>> Actually, I have no idea what state ManifoldCF is in right now.  I hit
>> restart a few hours ago and the status still says "Restarting".  There
>> is nothing in the command windows where I started ManifoldCF or Solr
>> or in the ManifoldCF log file.  The Solr command window does list
>> ManifoldCFSecurityFilter a few times.
>>
>> Scott
>>
>>
>> On Tue, Mar 27, 2012 at 5:37 PM, Karl Wright <daddywri@gmail.com> wrote:
>>> Let's start with some basics.
>>> First of all, how many web connections do you have configured?  What
>>> do you have for throttling?  If you have not modified the default
>>> settings for throttling and are pulling a number of documents off of
>>> ONE server, then throttling is probably severely limiting your crawl
>>> speed.
>>>
>>> Karl
>>>
>>> On Tue, Mar 27, 2012 at 6:24 PM, Scott Schneider <scottsch42@gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> I have a pretty simple ManifoldCF setup, but I'm getting very slow
>>>> performance.  Can someone help me understand and/or fix this?
>>>>
>>>> My input is a web connector that goes to an Apache HTTP server running
>>>> on the local machine, serving static text files.  I have a null
>>>> authority service.  I output to Solr, also running locally.
>>>>
>>>> The data I'm crawling is ~20 MB total in ~8,500 small files.  I started
>>>> the job one afternoon, and the next morning it was still not finished!
>>>> It had only processed ~2,500 documents.  Strangely, it listed ~10,000
>>>> total documents (and ~7,500 active).
>>>>
>>>> My ultimate goal is to figure out how much space the Solr index takes
>>>> as I add more access tokens.  That's why I'm using the web connector
>>>> and null authority, rather than just using a file system connector.
>>>>
>>>> Thanks,
>>>> Scott
