manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Performance issues
Date Fri, 18 Jul 2014 19:33:43 GMT
It depends on how much crawling you are doing.  We used to recommend once
per week.

Karl
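For reference, VACUUM is issued with SQL against the ManifoldCF database from psql (or any SQL client). The sketch below is illustrative, not from this thread: the database name and the choice of tables are assumptions (`hopcount` and `jobqueue` are tables that appear in the MCF schema and in the log excerpts later in this thread).

```sql
-- Connect first, e.g.:  psql -U manifoldcf dbname
-- VACUUM FULL rewrites tables and takes exclusive locks, so stop the
-- ManifoldCF agents process (or at least pause crawling) before running it.
VACUUM FULL;           -- reclaim space across the whole database
VACUUM FULL jobqueue;  -- or target just the large crawl tables
VACUUM FULL hopcount;
ANALYZE;               -- refresh planner statistics afterwards
```

A plain VACUUM (without FULL) does not take exclusive locks and can run more frequently; FULL is for reclaiming space after heavy churn, hence the once-per-week suggestion above.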


On Fri, Jul 18, 2014 at 3:32 PM, Ameya Aware <ameya.aware@gmail.com> wrote:

> cool.. working perfectly now.
>
> When do i really have to look into the VACUUM FULL command?
>
> Where and how does this command need to be executed?
>
>
> On Fri, Jul 18, 2014 at 3:17 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> If you make changes to the code, of course you have to rebuild.  It is up
>> to you to preserve your configuration and deployment should you do that.
>>
>> I will give you one hint though: if you are changing connector code only,
>> you can just build the connector.  From the connector directory, type "ant
>> deliver-connector" and the connector will be copied into the right place in
>> the distribution.
>>
>> Karl
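Karl's hint above, sketched as commands (the connector directory name is an example; use the directory of the connector you actually modified, and paths vary across MCF source layouts):

```shell
# Rebuild and redeploy a single connector after a code change,
# without rebuilding the whole distribution.
cd connectors/filesystem     # illustrative path inside the MCF source tree
ant deliver-connector        # copies the rebuilt connector into the dist area
```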
>>
>>
>>
>> On Fri, Jul 18, 2014 at 3:12 PM, Ameya Aware <ameya.aware@gmail.com>
>> wrote:
>>
>>> So if i make any changes to the code, do i need to issue the 'ant build'
>>> command, or can i simply restart the server for the changes to take effect?
>>>
>>>
>>> On Fri, Jul 18, 2014 at 3:10 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Ameya,
>>>>
>>>> Rebuilding will of course set your properties back to the build
>>>> defaults.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Fri, Jul 18, 2014 at 3:08 PM, Ameya Aware <ameya.aware@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Am i not supposed to run the 'ant build' command after changing the
>>>>> properties.xml file?
>>>>>
>>>>> Because that is what set my configured PostgreSQL back to Derby.
>>>>>
>>>>> Ameya
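For context, the database settings that a rebuild can reset are ordinary entries in properties.xml. A minimal PostgreSQL configuration looks roughly like the following (property names per the MCF how-to-build-and-deploy instructions; the values shown are placeholders to adapt):

```xml
<!-- Point ManifoldCF at PostgreSQL instead of the default Derby. -->
<property name="org.apache.manifoldcf.databaseimplementationclass"
          value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
<property name="org.apache.manifoldcf.database.name" value="dbname"/>
<property name="org.apache.manifoldcf.database.username" value="manifoldcf"/>
<property name="org.apache.manifoldcf.database.password" value="secret"/>
```

Keeping a backup copy of this file before running any ant target avoids losing the configuration again.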
>>>>>
>>>>>
>>>>> On Fri, Jul 18, 2014 at 2:27 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Yes.
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 18, 2014 at 2:26 PM, Ameya Aware <ameya.aware@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> So for Hop filters tab:
>>>>>>> [image: Inline image 1]
>>>>>>>
>>>>>>> are you suggesting to choose the 3rd option, i.e. "Keep unreachable
>>>>>>> documents, forever"?
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ameya
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 18, 2014 at 2:15 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Something else you should be aware of: Hop-count filtering is very
>>>>>>>> expensive.  If you are using a connector that uses it, and you don't
>>>>>>>> need it, you should consider disabling it.  Pick the bottom radio
>>>>>>>> button on the Hop Count tab to do that.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jul 18, 2014 at 1:34 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Ameya,
>>>>>>>>>
>>>>>>>>> If you are still using Derby, which apparently you are according
>>>>>>>>> to the stack trace, then a pause of 420 seconds is likely because
>>>>>>>>> Derby got itself stuck.  Derby is like that, which is why we don't
>>>>>>>>> recommend it for production.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jul 18, 2014 at 1:31 PM, Ameya Aware <ameya.aware@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> No Karl,
>>>>>>>>>>
>>>>>>>>>> I did not do VACUUM here.
>>>>>>>>>>
>>>>>>>>>> Why would queries stop after running for about 420 sec? Is it
>>>>>>>>>> because of the errors coming in?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 18, 2014 at 12:32 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Ameya,
>>>>>>>>>>>
>>>>>>>>>>> For future reference, when you see stuff like this in the log:
>>>>>>>>>>>
>>>>>>>>>>> >>>>>>
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '39') - Found a long-running query (458934 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '4') - Found a long-running query (420965 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '39') -   Parameter 0: 'D'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '19') - Found a long-running query (421120 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '10') - Found a long-running query (420985 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '11') - Found a long-running query (421173 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '4') -   Parameter 0: 'D'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '11') -   Parameter 0: 'D'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '10') -   Parameter 0: 'D'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '39') -   Parameter 1: '-1'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '19') -   Parameter 0: 'D'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '39') -   Parameter 2: '1405692432586'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '10') -   Parameter 1: '-1'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '22') - Found a long-running query (421052 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '11') -   Parameter 1: '-1'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '4') -   Parameter 1: '-1'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '11') -   Parameter 2: '1405692432586'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '22') -   Parameter 0: 'D'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '10') -   Parameter 2: '1405692432586'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '39') -   Parameter 3: '9ABFEB709B646CD0C84B4B7B6300E2C9BD5E3477'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,505 (Worker thread '19') -   Parameter 1: '-1'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '39') -   Parameter 4: 'B'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '10') -   Parameter 3: 'A932EC77CEF156EA26A4239F12BAB365E6B4F58D'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '22') -   Parameter 1: '-1'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '11') -   Parameter 3: '9DFF75EBE13D0AAE8AFF025E992C68AB203ED1CB'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '4') -   Parameter 2: '1405692432586'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '11') -   Parameter 4: 'B'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '22') -   Parameter 2: '1405692432586'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '22') -   Parameter 3: '023FDBD3638711F4E55A918B862A064161B0892A'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '22') -   Parameter 4: 'B'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '10') -   Parameter 4: 'B'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '19') -   Parameter 2: '1405692432586'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '4') -   Parameter 3: '0158B8EDFEE3DDB10113B6D6E378D5FBF165E1FD'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '19') -   Parameter 3: 'FD9641C67D0C1EC22B5F05671513D4DD71B4582C'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '4') -   Parameter 4: 'B'
>>>>>>>>>>>  WARN 2014-07-18 11:19:36,506 (Worker thread '19') -   Parameter 4: 'B'
>>>>>>>>>>> <<<<<<
>>>>>>>>>>>
>>>>>>>>>>> ... it means that MANY queries basically stopped running for
>>>>>>>>>>> about 420 seconds.  I bet you did a VACUUM then, right?
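One way to watch for this kind of stall from the PostgreSQL side (not something from this thread; a hedged sketch for the 9.x series, where pg_stat_activity still exposes the boolean `waiting` column rather than the later `wait_event` columns):

```sql
-- List active queries, how long each has been running, and whether
-- it is blocked waiting on a lock (pg_stat_activity, PostgreSQL 9.2-9.5).
SELECT pid,
       now() - query_start AS runtime,
       waiting,
       query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;
```

A long `runtime` with `waiting = true` on many rows at once is consistent with an exclusive lock, such as the one VACUUM FULL takes.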
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 18, 2014 at 12:30 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ameya,
>>>>>>>>>>>>
>>>>>>>>>>>> The log file is full of errors of all sorts.  For example:
>>>>>>>>>>>>
>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>  WARN 2014-07-17 17:32:38,709 (Worker thread '41') - IO exception during indexing file:/C:/Program%20Files/eclipse/configuration/org.eclipse.osgi/.manager/.tmp2043698995563843992.instance: The process cannot access the file because another process has locked a portion of the file
>>>>>>>>>>>> java.io.IOException: The process cannot access the file because another process has locked a portion of the file
>>>>>>>>>>>>     at java.io.FileInputStream.readBytes(Native Method)
>>>>>>>>>>>>     at java.io.FileInputStream.read(Unknown Source)
>>>>>>>>>>>>     at org.apache.http.entity.mime.content.InputStreamBody.writeTo(InputStreamBody.java:91)
>>>>>>>>>>>>     at org.apache.manifoldcf.agents.output.solr.ModifiedHttpMultipart.doWriteTo(ModifiedHttpMultipart.java:211)
>>>>>>>>>>>>     at org.apache.manifoldcf.agents.output.solr.ModifiedHttpMultipart.writeTo(ModifiedHttpMultipart.java:229)
>>>>>>>>>>>>     at org.apache.manifoldcf.agents.output.solr.ModifiedMultipartEntity.writeTo(ModifiedMultipartEntity.java:187)
>>>>>>>>>>>>     at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>>>>>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>>>>>>>>>>     at java.lang.reflect.Method.invoke(Unknown Source)
>>>>>>>>>>>>     at org.apache.http.impl.execchain.RequestEntityExecHandler.invoke(RequestEntityExecHandler.java:77)
>>>>>>>>>>>>     at com.sun.proxy.$Proxy0.writeTo(Unknown Source)
>>>>>>>>>>>>     at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:155)
>>>>>>>>>>>>     at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>>>>>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>>>>>>>>>>     at java.lang.reflect.Method.invoke(Unknown Source)
>>>>>>>>>>>>     at org.apache.http.impl.conn.CPoolProxy.invoke(CPoolProxy.java:138)
>>>>>>>>>>>>     at com.sun.proxy.$Proxy1.sendRequestEntity(Unknown Source)
>>>>>>>>>>>>     at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
>>>>>>>>>>>>     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
>>>>>>>>>>>>     at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:254)
>>>>>>>>>>>>     at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
>>>>>>>>>>>>     at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
>>>>>>>>>>>>     at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
>>>>>>>>>>>>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>>>>>>>>>>>>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>>>>>>>>>>>>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>>>>>>>>>>>>     at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:292)
>>>>>>>>>>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
>>>>>>>>>>>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>>>>>>>>>>     at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:951)
>>>>>>>>>>>> <<<<<
>>>>>>>>>>>>
>>>>>>>>>>>> This error occurs because you are trying to index a file on
>>>>>>>>>>>> Windows that is held open by another application.  When this
>>>>>>>>>>>> happens, ManifoldCF will requeue the document and try it again
>>>>>>>>>>>> later -- say, in 5 minutes -- and will keep retrying it for many
>>>>>>>>>>>> hours before it gives up.
>>>>>>>>>>>>
>>>>>>>>>>>> I suspect that you are not seeing "hangs", but rather situations
>>>>>>>>>>>> where MCF is simply waiting for a problem to resolve.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jul 18, 2014 at 11:27 AM, Ameya Aware <ameya.aware@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Attaching log file
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jul 18, 2014 at 11:15 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, please send the file logs/manifoldcf.log as well -- as
>>>>>>>>>>>>>> a text file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jul 18, 2014 at 11:12 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Could you please get a thread dump and send that to me?
>>>>>>>>>>>>>>> Please send it as a text file, not a screen shot.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To get a thread dump, get the process ID of the agents
>>>>>>>>>>>>>>> process, and use the JDK's jstack utility to obtain the dump.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
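Karl's two steps above, sketched as commands (jps and jstack both ship with the JDK; the output filename is arbitrary, and `<pid>` is a placeholder for the agents process ID):

```shell
# 1. Find the process ID of the ManifoldCF agents JVM.
jps -l                      # lists running JVMs with their main classes

# 2. Capture a full thread dump for that PID into a text file.
jstack <pid> > agents-thread-dump.txt
```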
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jul 18, 2014 at 11:08 AM, Ameya Aware <ameya.aware@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> yeah.. i thought so, that it should not affect 4000
>>>>>>>>>>>>>>>> documents.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am using the filesystem connector to crawl all of my C
>>>>>>>>>>>>>>>> drive, and the output connection is null.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There are no error logs in MCF. MCF has been stuck at the
>>>>>>>>>>>>>>>> same screen for half an hour.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Attaching some snapshots for your reference.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Ameya
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Jul 18, 2014 at 11:02 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Ameya,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 4000 documents is nothing at all.  We have load tests
>>>>>>>>>>>>>>>>> which I run on every release that include more than
>>>>>>>>>>>>>>>>> 100000 documents on a crawl.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Can you be more specific about the case that you say "hung
>>>>>>>>>>>>>>>>> up"?  Specifically:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (1) What kind of crawl is this?  SharePoint?  Web?
>>>>>>>>>>>>>>>>> (2) Are there any errors in the manifoldcf log?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Jul 18, 2014 at 10:59 AM, Ameya Aware <ameya.aware@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I spent some time going through the PostgreSQL 9.3 manual.
>>>>>>>>>>>>>>>>>> I configured PostgreSQL for MCF and saw a significant
>>>>>>>>>>>>>>>>>> change in performance.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I ran it yesterday for some 4000 documents. When i started
>>>>>>>>>>>>>>>>>> running it again today, the performance was very poor, and
>>>>>>>>>>>>>>>>>> after 200 documents it hung up.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is it because of the periodic maintenance it needs?  Also,
>>>>>>>>>>>>>>>>>> i would like to know where and how exactly the VACUUM FULL
>>>>>>>>>>>>>>>>>> command needs to be used.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Ameya
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 2:13 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It is fine; I am running PostgreSQL 9.3 here.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 2:08 PM, Ameya Aware <ameya.aware@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> is PostgreSQL version 9.3 good? because i already have
>>>>>>>>>>>>>>>>>>>> it on my machine.. Though the documentation says
>>>>>>>>>>>>>>>>>>>> "ManifoldCF has been tested against version 8.3.7, 8.4.5
>>>>>>>>>>>>>>>>>>>> and 9.1 of PostgreSQL."
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Ameya
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 1:09 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If you haven't configured MCF to use PostgreSQL, then
>>>>>>>>>>>>>>>>>>>>> you are using Derby, which is not recommended for
>>>>>>>>>>>>>>>>>>>>> production use.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Instructions on how to set up MCF to use PostgreSQL are
>>>>>>>>>>>>>>>>>>>>> available on the MCF site on the how-to-build-and-deploy
>>>>>>>>>>>>>>>>>>>>> page.  Configuring PostgreSQL for millions or tens of
>>>>>>>>>>>>>>>>>>>>> millions of documents will require someone to learn
>>>>>>>>>>>>>>>>>>>>> about PostgreSQL and how to administer it.  The
>>>>>>>>>>>>>>>>>>>>> how-to-build-and-deploy page provides some (old)
>>>>>>>>>>>>>>>>>>>>> guidelines and hints, but if I were you I'd read the
>>>>>>>>>>>>>>>>>>>>> PostgreSQL manual for the version you install.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 1:04 PM, Ameya Aware <ameya.aware@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Ooh ok.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Actually i have never configured PostgreSQL yet. i am
>>>>>>>>>>>>>>>>>>>>>> simply using the binary distribution of MCF to
>>>>>>>>>>>>>>>>>>>>>> configure file system connectors to connect to Solr.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Do i need to configure PostgreSQL?? How can i proceed
>>>>>>>>>>>>>>>>>>>>>> from here to check performance measurements?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Ameya
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 12:10 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Yes.  Also have a look at the how-to-build-and-deploy
>>>>>>>>>>>>>>>>>>>>>>> page for hints on how to configure PostgreSQL for
>>>>>>>>>>>>>>>>>>>>>>> maximum performance.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF's performance is almost entirely based on
>>>>>>>>>>>>>>>>>>>>>>> the database.  If you are using PostgreSQL, which is
>>>>>>>>>>>>>>>>>>>>>>> the fastest ManifoldCF choice, you should be able to
>>>>>>>>>>>>>>>>>>>>>>> see in the logs when queries take a long time, or
>>>>>>>>>>>>>>>>>>>>>>> when indexes are automatically rebuilt.  Could you
>>>>>>>>>>>>>>>>>>>>>>> provide any information as to what your overall
>>>>>>>>>>>>>>>>>>>>>>> system setup looks like?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 11:32 AM, Ameya Aware <ameya.aware@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> This page?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Ameya
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 11:28 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Ameya,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Have you read the performance page?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Sent from my Windows Phone
>>>>>>>>>>>>>>>>>>>>>>>>> ------------------------------
>>>>>>>>>>>>>>>>>>>>>>>>> From: Ameya Aware
>>>>>>>>>>>>>>>>>>>>>>>>> Sent: 7/17/2014 11:27 AM
>>>>>>>>>>>>>>>>>>>>>>>>> To: user@manifoldcf.apache.org
>>>>>>>>>>>>>>>>>>>>>>>>> Subject: Performance issues
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I have millions of documents to crawl and send to
>>>>>>>>>>>>>>>>>>>>>>>>> Solr.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> But when i run it for thousands of documents, it
>>>>>>>>>>>>>>>>>>>>>>>>> takes too much time, or sometimes it even hangs up.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> So what could be the way to reduce the processing
>>>>>>>>>>>>>>>>>>>>>>>>> time?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Also, i do not need the content of the documents, i
>>>>>>>>>>>>>>>>>>>>>>>>> just need metadata, so can i skip the content part
>>>>>>>>>>>>>>>>>>>>>>>>> from reading and fetching, and will that improve
>>>>>>>>>>>>>>>>>>>>>>>>> performance?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Ameya
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
