From: Ahmet Arslan
To: user@manifoldcf.apache.org
Date: Mon, 1 Jul 2013 06:56:17 -0700 (PDT)
Subject: web crawler job settings

Hi,

I am crawling the main pages of some online newspaper web sites.
I don't need deletes at all. I am using the crawl-once model.

Here are the settings I use:

Schedule type: Scan every document once
Start Method: Start at beginning of schedule window

Scheduled time: Any day of week at 1 am 3 am 5 am 7 am 9 am 11 am 1 pm 3 pm 5 pm 7 pm 9 pm 11 pm plus 0 minutes
Maximum run time: No limit

Maximum hop count for link type 'link': 1
Maximum hop count for link type 'redirect': Unlimited
Hop count mode: No deletes, forever

Include only hosts matching seeds? yes
Seeds: A few URLs of the form http://main.page.com/{category}, where category is Sports, Politics, etc.

By setting the hop count to 1 (or 2) and the hop count mode to 'No deletes, forever', I am expecting this crawl to be as fast and efficient as possible, with minimal DB queries. Am I correct?
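
To make the hop count limit concrete, here is a minimal sketch of how I read it: seeds are at hop 0, pages linked directly from a seed are at hop 1, and nothing further is followed. This is plain Python for illustration only, not ManifoldCF code; the function name `pages_within_hops`, the link graph, and the article URLs are all made up.

```python
from collections import deque

def pages_within_hops(links, seeds, max_hops):
    """Return {url: hop_count} for every page reachable from the seeds
    in at most max_hops link traversals (breadth-first)."""
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hops:
            continue  # page is at the limit: do not follow its links
        for target in links.get(url, []):
            if target not in hops:  # first visit gives the shortest hop count
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops

# Hypothetical link graph: one category page, two articles, one older article.
links = {
    "http://main.page.com/Sports": [
        "http://main.page.com/Sports/a1",
        "http://main.page.com/Sports/a2",
    ],
    "http://main.page.com/Sports/a1": ["http://main.page.com/Sports/old"],
}

reached = pages_within_hops(links, ["http://main.page.com/Sports"], max_hops=1)
for url, hop in sorted(reached.items()):
    print(hop, url)
```

With a limit of 1, the older article (two hops from the seed) is never reached. With a limit of 2, it would be.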

Thanks,
Ahmet
