manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: How to delete unreachable documents on continuous crawling?
Date Wed, 27 Aug 2014 11:50:48 GMT
Hi Mario,

The default numbers are designed not to alarm the administrators of the
sites being crawled.  If you know that it is safe to crawl your targeted
sites harder, then you can change the numbers.  But I wouldn't do that
unless you have reason to believe it is OK.  It's not a question this
mailing list is in a position to answer.
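
For a rough sense of what those limits imply, here is a back-of-the-envelope
sketch (a sketch only; the 3800-document count comes from later in this
thread, and the 500 KB average PDF size is an assumption):

    # Lower bounds on crawl time implied by the Bandwidth tab throttles.
    # Assumptions: 3800 documents (from this thread); ~500 KB per PDF (guess).
    DOCS = 3800
    AVG_KB = 500

    def floor_hours(fetches_per_min, kbytes_per_sec):
        """Whichever throttle binds harder sets the minimum crawl time."""
        fetch_floor = DOCS / fetches_per_min / 60.0           # hours
        byte_floor = DOCS * AVG_KB / kbytes_per_sec / 3600.0  # hours
        return max(fetch_floor, byte_floor)

    print(floor_hours(12, 256))   # defaults: ~5.3 h (fetch rate binds)
    print(floor_hours(120, 256))  # proposed: ~2.1 h (byte rate binds)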

Karl

On Wed, Aug 27, 2014 at 7:41 AM, Bisonti Mario <Mario.Bisonti@vimar.com>
wrote:

> OK, you are right.
>
>
>
> In the Bandwidth tab I see:
>
> Max connections: 10
>
> Max kbytes/sec: 256
>
> Max fetches/min: 12
>
> Can I increase that to:
>
> Max connections: 100
>
> Max kbytes/sec: 256
>
> Max fetches/min: 120
>
> Would these be good values?
>
> Thanks
>
> *Mario*
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, August 27, 2014 13:19
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to delete unreachable documents on continuous crawling?
>
>
>
> Hi Bisonti,
>
> I meant the throttling parameters on the "Bandwidth" tab.
>
>
> http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#webrepository
>
> Karl
>
>
>
> On Wed, Aug 27, 2014 at 6:36 AM, Bisonti Mario <Mario.Bisonti@vimar.com>
> wrote:
>
> Thanks a lot.
>
> I understood about full crawls vs. minimal crawls.
>
>
>
> Regarding your third point, throttling:
>
> For the web repository connection, I set throttling = 100.
>
> For the Solr output connection, I set Throttling, max connections = 1000.
>
> I am using ManifoldCF 1.7.
>
> My documents are .pdf files, so Tika extracts their content.
>
> Karl, do you think those throttling parameters are right?
>
> Thanks a lot!
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, August 27, 2014 12:03
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to delete unreachable documents on continuous crawling?
>
>
>
> Hi Mario,
>
> First, you don't need a lot of memory for ManifoldCF, although you may
> need it for your search index (e.g. Solr).
>
> Second, different connectors behave differently for full crawls vs.
> minimal crawls.  The web connector makes no distinction, except for the
> removal of unreachable documents at the end of the crawl.
>
> Third, most of the time in your crawl is probably going into waiting
> because of throttling.  Depending on what you are crawling, and whether it
> is your own local pages, you might want to relax the throttling
> constraints.  It is also the case that ManifoldCF 1.5 had a bug in the
> throttling code that made byte-rate throttling 1000x too restrictive.  This
> was fixed in 1.6.
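>
> To put a rough number on that bug (a sketch; the 500 KB average document
> size is an assumption):
>
>     # Pre-1.6 bug: byte-rate throttling was 1000x too restrictive, so a
>     # nominal 256 KB/sec limit behaved like 0.256 KB/sec.
>     nominal_kb_per_sec = 256
>     effective_kb_per_sec = nominal_kb_per_sec / 1000.0
>     avg_doc_kb = 500  # assumed average PDF size
>     minutes_per_doc = avg_doc_kb / effective_kb_per_sec / 60.0
>     print(minutes_per_doc)  # ~32.6 minutes per document under the bug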
>
> Karl
>
> On Wed, Aug 27, 2014 at 5:38 AM, Bisonti Mario <Mario.Bisonti@vimar.com>
> wrote:
>
>
>
> Hello.
>
> I increased RAM to 4 GB and manually executed the job to crawl the "Web
> repository" containing 3800 PDF documents.
>
>
>
> I understood that "Start" executes a full scan, while "Start minimal"
> executes an incremental scan only on modified documents.
>
> I executed the job with "Start": it took nearly 20 hours.
>
> Then I executed the job with "Start minimal": it rescanned the same 3800
> documents, so it again took about 20 hours.
>
> Why is this?
>
> Note that no new documents were added between the moment I started the
> job with "Start" and the time I started the job with "Start minimal".
>
> Thanks for your help!
>
>
>
> Mario
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, August 12, 2014 17:26
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to delete unreachable documents on continuous crawling?
>
>
>
> Hi Mario,
>
> Setting up a schedule does not prevent you from starting the job manually.
>
> But it sounds like you understand the solution.
>
> Thanks,
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 10:30 AM, Bisonti Mario <Mario.Bisonti@vimar.com>
> wrote:
>
> OK, I think I understand better now.
>
>
>
> But I have 3800 .pdf documents, so a full crawl through Tika is very
> long: it takes 2 days. (Perhaps I need to increase RAM?)
>
>
>
> I am using the web connector, so I see the "Start minimal" option.
>
>
>
> I understand that I can do this:
> 1) a full crawl on Saturday night, so it deletes orphaned files
>
> 2) a minimal crawl every night except Saturday, so it crawls only
> changed documents
>
>
>
> Are 1) and 2) right, or have I misunderstood?
>
> Furthermore, the option "Start even inside a scheduled window" is not so
> clear to me: I tried "Start when scheduled window starts", but I am able
> to start the job manually, too.
>
>
>
> Thanks a lot!
>
> Mario
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, August 12, 2014 14:54
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to delete unreachable documents on continuous crawling?
>
>
>
> Hi Mario,
>
> What I would do is set up a single job.  (Multiple jobs that share the
> same documents may work but they aren't recommended because a document must
> vanish from ALL jobs that share it before it is removed.)  There are two
> different possibilities for the schedule, depending on the kind of
> connector you are using:
>
> (1) Repeated full crawls
>
> (2) Mostly minimal crawls, with periodic full crawls
>
> If the connector you are using makes any distinction between minimal and
> full crawls, then (2) would probably be more efficient for you.  But only
> on full crawls will unreachable documents be removed.
>
> To do the setup:
>
> -- you will need multiple scheduling records for (2), but may be able to
> do (1) with a single scheduling record
>
> -- for each day, you want the window to start at midnight, and its length
> to be the equivalent of 24 hours
>
> -- you want to select the option to start crawls in the middle of a
> window, not just at the beginning
>
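> As a sketch of how (2) plays out night to night (the Saturday-night full
> crawl is an example choice, not a requirement):
>
>     # Option (2): a minimal crawl nightly, with a periodic full crawl.
>     # Each window starts at midnight and is 24 hours long; starting
>     # inside a window lets a job begin even if the window is already open.
>     import datetime
>
>     def crawl_kind(day: datetime.date) -> str:
>         # weekday(): Monday == 0 ... Saturday == 5, Sunday == 6
>         return "full" if day.weekday() == 5 else "minimal"
>
>     print(crawl_kind(datetime.date(2014, 8, 16)))  # Saturday -> full
>     print(crawl_kind(datetime.date(2014, 8, 18)))  # Monday -> minimal
>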
> This should give you what you want.
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 8:43 AM, Bisonti Mario <Mario.Bisonti@vimar.com>
> wrote:
>
> So, I suppose, the best solution could be: continuous recrawling, plus a
> periodic full recrawl to delete orphaned documents.
>
>
>
> Can I overlap the two jobs?
>
>
>
> *Mario Bisonti*
>
> Information and Communications Technology
>
> VIMAR SpA
>
> Tel. +39 0424 488 644
>
> mario.bisonti@vimar.com
>
> Take care of the environment. Print only if necessary.
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, August 12, 2014 12:21
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to delete unreachable documents on continuous crawling?
>
>
>
> Hi Mario,
>
> Yes, periodic recrawling allows ManifoldCF the opportunity to discover
> abandoned documents and remove them.
>
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 6:18 AM, Bisonti Mario <Mario.Bisonti@vimar.com>
> wrote:
>
> OK, thanks.
>
>
>
> So you suggest that I not use continuous crawling, and instead schedule
> a periodic re-crawl of all documents?
>
> Is that better?
>
> Thanks a lot.
>
> *Mario*
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, August 12, 2014 12:16
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to delete unreachable documents on continuous crawling?
>
>
>
> Hi Mario,
>
> Please read ManifoldCF in Action Chapter 1.  Continuous crawling has no
> mechanism for deleting unreachable documents, and never will, because it is
> fundamentally impossible to do.
>
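> To see why, a minimal sketch: removing unreachable documents amounts to a
> set difference between what the index holds and what a *completed* crawl
> reached, and a continuous crawl never reaches a completed point:
>
>     # Unreachable-document deletion as a set difference (sketch).
>     indexed = {"a.pdf", "b.pdf", "c.pdf"}  # what the index holds now
>     reached = {"a.pdf", "c.pdf"}           # everything a FULL crawl found
>
>     # Valid only because the traversal finished: anything not reached in
>     # the completed pass is known to be unreachable.
>     print(indexed - reached)  # {'b.pdf'} -> safe to delete
>     # A continuous crawl never finishes, so this set is never known.
>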
> Thanks,
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 6:10 AM, Bisonti Mario <Mario.Bisonti@vimar.com>
> wrote:
>
> Hello.
>
> I set up continuous crawling on a folder of a website to index the PDF
> files it contains.
>
>
>
> Schedule type: Rescan documents dynamically
>
> Recrawl interval (if continuous): 5
>
>
>
> I see that if documents are added to the folder, they are indexed, but
> if documents are deleted they aren't removed from the index.
>
> I see that "ManifoldCF in Action" mentions "…that continuous crawling
> seems to be missing a phase – the 'delete unreachable documents' phase."
>
>
>
> But how could I solve this problem, please?
>
> Thanks a lot for your help.
> Mario
