manifoldcf-dev mailing list archives

From "Issei Nishigata (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1264) HTML parsing doesn't handle unquoted attribute values with "/" characters right
Date Tue, 08 Dec 2015 15:22:10 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046963#comment-15046963 ]

Issei Nishigata commented on CONNECTORS-1264:
---------------------------------------------

The patch that I applied handles two cases.

1. Attribute values enclosed in quotes are parsed correctly, for example:
{code}
<a href="/hello/out/there">hello</a>
{code}
MCF's web crawler then extracts the link as "/hello/out/there".
The resolved URL (for example, "http://localhost/hello/out/there") becomes the next crawl target.

2. Attribute values without quotes are parsed correctly, for example:
{code}
<a href=/hello/out/there>hello</a>
{code}
MCF's web crawler handles this case the same way as described above.
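
To make the two cases concrete, here is a minimal sketch of the rule involved: inside an unquoted attribute value, "/" is an ordinary character, and the value ends only at whitespace or ">". This is not the code from the attached patches and not ManifoldCF's actual parser; the class name UnquotedAttrSketch, the method attributeValue, and the base URL are all hypothetical, chosen only for illustration.

```java
import java.net.URI;

// Illustrative sketch only; NOT ManifoldCF's parser and NOT the attached patch.
final class UnquotedAttrSketch {

    /** Returns the value of the named attribute in a single start tag, or null if absent. */
    static String attributeValue(String tag, String name) {
        int i = 0;
        final int n = tag.length();

        // Skip '<' and the tag name.
        if (i < n && tag.charAt(i) == '<') i++;
        while (i < n && !Character.isWhitespace(tag.charAt(i)) && tag.charAt(i) != '>') i++;

        while (i < n) {
            // Skip whitespace between attributes.
            while (i < n && Character.isWhitespace(tag.charAt(i))) i++;
            if (i >= n || tag.charAt(i) == '>') break;
            // A '/' outside a value (as in "/>") is not part of any attribute.
            if (tag.charAt(i) == '/') { i++; continue; }

            // Read the attribute name.
            int nameStart = i;
            while (i < n) {
                char c = tag.charAt(i);
                if (Character.isWhitespace(c) || c == '=' || c == '>' || c == '/') break;
                i++;
            }
            String attrName = tag.substring(nameStart, i);
            while (i < n && Character.isWhitespace(tag.charAt(i))) i++;

            String value = "";
            if (i < n && tag.charAt(i) == '=') {
                i++;
                while (i < n && Character.isWhitespace(tag.charAt(i))) i++;
                if (i < n && (tag.charAt(i) == '"' || tag.charAt(i) == '\'')) {
                    // Quoted value: runs to the matching quote.
                    char quote = tag.charAt(i++);
                    int start = i;
                    while (i < n && tag.charAt(i) != quote) i++;
                    value = tag.substring(start, i);
                    if (i < n) i++; // consume the closing quote
                } else {
                    // Unquoted value: runs to whitespace or '>'; '/' does not end it.
                    int start = i;
                    while (i < n && !Character.isWhitespace(tag.charAt(i)) && tag.charAt(i) != '>') i++;
                    value = tag.substring(start, i);
                }
            }
            if (attrName.equalsIgnoreCase(name)) return value;
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed base URL, matching the "http://localhost" example in the comment.
        URI base = URI.create("http://localhost/");
        String[] tags = {
            "<a href=\"/hello/out/there\">",
            "<a href=/hello/out/there>",
            "<a href=hello/out/there >"
        };
        for (String tag : tags) {
            String href = attributeValue(tag, "href");
            System.out.println(tag + " -> " + base.resolve(href));
        }
    }
}
```

The third tag in main is the form quoted in the issue description; under the rule sketched here it is read the same way as the two cases above.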




> HTML parsing doesn't handle unquoted attribute values with "/" characters right
> -------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1264
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1264
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.3
>
>         Attachments: CONNECTORS-1264-2.patch, CONNECTORS-1264-3.patch, CONNECTORS-1264.patch, alternative.patch
>
>
> HTML tags like "<a href=hello/out/there >" fail to parse properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
