manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1573) Web Crawler exclude from index matches too much?
Date Thu, 24 Jan 2019 23:15:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751689#comment-16751689
] 

Karl Wright commented on CONNECTORS-1573:
-----------------------------------------

Questions like this should be asked on the users@manifoldcf.apache.org list, not via a ticket.

The quick answer: the Simple History tells you whether the pages are fetched or not.  If they
are not fetched at all (that is, they do not appear), then your inclusion and exclusion lists
are wrong.  That doesn't sound like the problem here; it sounds like the pages are being
blocked *after* fetching.  There are a number of possible reasons for that; the Simple History
should give you a good idea which one applies.  If it reports "JOBDESCRIPTION", that means
that the *indexing* inclusion/exclusion rule discarded the page.  This is not the same as the
*fetching* inclusion/exclusion rules, which is what it sounds like you might be setting.  They're
on the same tabs, just farther down.  The manual does not include the indexing rules sections;
this should be addressed.

I suspect that, based on the regexps you've given, you're also overlooking the fact that a
regexp is considered a match if it matches ANYWHERE in the URL.  So if you want to match one
very specific URL, you need to delimit it with ^ at the beginning and $ at the end, to ensure
that the entire URL matches and ONLY that URL.
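
To illustrate the "matches anywhere" point, here is a minimal sketch in Python.  It is not
ManifoldCF code (the connector is written in Java), and the helper is_excluded and the child
URL are made up for the example; only the parent URL comes from the ticket.  It assumes the
exclusion rule behaves like an unanchored regexp search, as described above.  The dots are
escaped here so that they match literal dots only.

```python
import re

def is_excluded(url, patterns):
    """Hypothetical helper: True if any pattern matches ANYWHERE in the URL."""
    return any(re.search(p, url) for p in patterns)

parent = "http://www.website.com/nl/"
# Hypothetical child page, invented for this example.
child = "http://www.website.com/nl/products/page.html"

# Unanchored: also matches every URL that merely contains the parent URL.
unanchored = [r"http://www\.website\.com/nl/"]
# Anchored with ^ and $: matches the parent URL and only that URL.
anchored = [r"^http://www\.website\.com/nl/$"]

print(is_excluded(parent, unanchored))  # True
print(is_excluded(child, unanchored))   # True  (child is excluded as well)
print(is_excluded(parent, anchored))    # True
print(is_excluded(child, anchored))     # False (child is still indexed)
```

With the unanchored pattern the child page is excluded along with the parent, which would
explain "matches too much"; with the anchored pattern only the parent page is excluded.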




> Web Crawler exclude from index matches too much?
> ------------------------------------------------
>
>                 Key: CONNECTORS-1573
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1573
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Korneel Staelens
>            Priority: Major
>
> Hello, 
> I'm not sure this is a bug, or my misinterpretation of the exclusion rules:
> I want to set up a rule so that it does NOT index a parent page, but does index all child pages
> of that parent:
> I'm setting up a rule: 
> Inclusions: 
> .*
>  
> Exclusions:
> [http://www.website.com/nl/]
> (I've tried also: http://www.website.com/nl/(\s)* )
> No dice. If I look at the logs, I see the pages are crawled, but not indexed due
> to job restriction. Is my rule wrong? Or is this a small bug?
>  
> Thanks for advice!
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
