manifoldcf-dev mailing list archives

From "Steph van Schalkwyk (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-104) Make it easier to limit a web crawl to a single site
Date Thu, 23 Aug 2018 22:28:00 GMT


Steph van Schalkwyk commented on CONNECTORS-104:

I'm running into a seeding issue on 2.10:

Seed [|]

starts to crawl [|] and
seems to ignore the last folder restriction.

I tried using these as "include in crawl" / "include in index" filters, but then I get nothing.



What am I doing wrong? I know I've deployed this same config to many, many sites.
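
To make the expected behaviour concrete, here is a minimal sketch of a folder-level include restriction, written in plain Python rather than against ManifoldCF itself. The seed URL is a hypothetical placeholder (the real one is not shown in this message), the helper names `folder_include_regex` and `is_included` are made up for illustration, and it assumes the include filter keeps a URL when any regex is found somewhere within it. ManifoldCF uses Java regular expressions, so syntax details may differ for more complex patterns.

```python
import re

# Hypothetical seed; stands in for the real seed URL, which is not shown here.
SEED = "https://www.example.com/docs/manual/"


def folder_include_regex(seed):
    """Build an include regex that only matches URLs under the seed's folder."""
    # Anchor at the start and escape the seed so '.' matches a literal dot.
    return "^" + re.escape(seed)


def is_included(url, include_regexes):
    """Assumed filter semantics: keep a URL if any include regex is found in it."""
    return any(re.search(rx, url) for rx in include_regexes)


includes = [folder_include_regex(SEED)]

for candidate in (
    "https://www.example.com/docs/manual/page.html",  # under the seed folder
    "https://www.example.com/docs/other/page.html",   # sibling folder
    "https://www.example.com/",                       # site root
):
    print(candidate, is_included(candidate, includes))
```

Under these assumptions only the first URL is kept. If a crawl still reaches the sibling folder or the site root with an equivalent pattern in place, the pattern is probably not anchored, or it is not being applied to the URLs in the way assumed here.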

> Make it easier to limit a web crawl to a single site
> ----------------------------------------------------
>                 Key: CONNECTORS-104
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>            Reporter: Jack Krupansky
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 0.1
> Unless the user explicitly enters an include regex carefully, a web crawl can quickly
> get out of control and start crawling the entire web when all the user may really want is
> to crawl just a single web site or portion thereof. So, it would be preferable if either by
> default or with a simple button the crawl could be limited to the seed web site(s).

This message was sent by Atlassian JIRA
