manifoldcf-dev mailing list archives

From "Steph van Schalkwyk (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-104) Make it easier to limit a web crawl to a single site
Date Thu, 23 Aug 2018 22:28:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590882#comment-16590882 ]

Steph van Schalkwyk commented on CONNECTORS-104:
------------------------------------------------

I'm running into a seeding issue on 2.10:

Seed [http://inside.xxx.net/inside/pages/elastic_test/|http://inside.rrd.net/insideRRD/pages/elastic_test/]

starts crawling [http://inside.rrd.net/inside/pages/|http://inside.rrd.net/insideRRD/pages/] and
seems to ignore the restriction to the last folder in the seed path.

I tried using each of these as an "include in crawl"/"include in index" filter, but then
nothing gets crawled:

http:\/\/inside.xxx.net\/inside\/pages\/elastic_test\/.*

http:\/\/inside\.xxx\.net\/inside\/pages\/elastic_test\/.*

What am I doing wrong? I know I've deployed this same config to many, many sites.
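
For reference, a minimal sketch that checks both filter expressions above against the seed URL using plain java.util.regex. The class SeedFilterCheck, the method findsMatch, and the page.html URL are hypothetical, made up for illustration; the sketch assumes unanchored (find) matching and does not reproduce how the Web connector itself applies its include filters.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of ManifoldCF.
class SeedFilterCheck {

    // Returns true if the regex matches anywhere in the URL (unanchored "find" semantics).
    // Assumption: this only approximates how an include filter might be evaluated.
    static boolean findsMatch(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        // The two filters from the comment, written as Java string literals.
        String[] filters = {
            "http:\\/\\/inside.xxx.net\\/inside\\/pages\\/elastic_test\\/.*",
            "http:\\/\\/inside\\.xxx\\.net\\/inside\\/pages\\/elastic_test\\/.*"
        };
        String[] urls = {
            "http://inside.xxx.net/inside/pages/elastic_test/",          // the seed
            "http://inside.xxx.net/inside/pages/elastic_test/page.html", // hypothetical child page
            "http://inside.xxx.net/inside/pages/"                        // parent folder
        };
        for (String filter : filters) {
            for (String url : urls) {
                System.out.println(findsMatch(filter, url) + "  " + url + "  <-  " + filter);
            }
        }
    }
}
```

Under these assumptions both expressions accept the seed and the child page and reject the parent folder, so the regex syntax by itself is valid Java; this says nothing about how the filters interact with the rest of the job configuration.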

> Make it easier to limit a web crawl to a single site
> ----------------------------------------------------
>
>                 Key: CONNECTORS-104
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-104
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>            Reporter: Jack Krupansky
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 0.1
>
>
> Unless the user carefully enters an explicit include regex, a web crawl can quickly
> get out of control and start crawling the entire web when all the user may really want is
> to crawl just a single web site or a portion thereof. So it would be preferable if, either by
> default or with a simple button, the crawl could be limited to the seed web site(s).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
