incubator-droids-dev mailing list archives

From "Mingfai Ma (JIRA)" <j...@apache.org>
Subject [jira] Updated: (DROIDS-45) Fail to resovle outlink correctly
Date Thu, 02 Apr 2009 17:58:13 GMT

     [ https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-45:
-----------------------------

    Attachment: LinkResolverTests.java
                LinkResolver.java

There are some other cases:
- mailto: and news: links
- URL parameters containing spaces
- Unicode characters

I think this is still far from a full list of the special scenarios.
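
As a rough illustration of the cases listed above, here is a minimal sketch of how a crawler might skip mailto: and news: links and percent-encode spaces and non-ASCII characters. The class and method names (LinkCleanupSketch, isCrawlable, escape) are hypothetical and are not taken from the attached LinkResolver.java; the sketch only relies on java.net.URI.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not the attached LinkResolver.java.
class LinkCleanupSketch {

    // Returns false for links a web crawler normally cannot fetch.
    static boolean isCrawlable(String href) {
        String s = href.trim().toLowerCase(Locale.ROOT);
        return !(s.startsWith("mailto:") || s.startsWith("news:"));
    }

    // Percent-encodes spaces, then lets URI.toASCIIString() encode
    // non-ASCII characters as UTF-8 percent escapes.
    // Throws IllegalArgumentException if the result is still not a valid URI.
    static String escape(String href) {
        String noSpaces = href.trim().replace(" ", "%20");
        return URI.create(noSpaces).toASCIIString();
    }
}
```

Under these assumptions, escape("http://example.com/a b?q=x y") gives http://example.com/a%20b?q=x%20y. Other illegal characters (for example "|" or "^") would still be rejected, so this covers only the two cases named in the list.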

Attached is my implementation of some custom link transformations to handle more cases. The code
could be moved to LinkExtractor if you think it's OK. I don't use a SAX parser, so I don't use
LinkExtractor. It would be good if the URL/URI transformation and resolution could be refactored
into a standalone class.

Another thing is that I implemented some of the checks differently. I have not run any benchmark
on a modern JDK, but I believe my approach, which uses indexOf and avoids regex, is slightly more efficient.
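
To make the indexOf idea concrete, here is a minimal sketch of one such check: deciding whether an href carries its own scheme without using a regular expression. The class and method names are hypothetical and this is not the code from the attachment. It does not validate the scheme characters, and no performance claim is made for it.

```java
// Hypothetical example of an indexOf-based check, not the attached code.
class SchemeCheckSketch {

    // True if a ':' appears before any '/', '?' or '#', which is where
    // a scheme separator has to be. A ':' found later belongs to the
    // path, query, or fragment instead.
    static boolean hasScheme(String href) {
        int colon = href.indexOf(':');
        if (colon <= 0) {
            return false; // no colon, or nothing before it
        }
        int slash = href.indexOf('/');
        int query = href.indexOf('?');
        int hash = href.indexOf('#');
        return (slash < 0 || colon < slash)
                && (query < 0 || colon < query)
                && (hash < 0 || colon < hash);
    }
}
```

For example, hasScheme("mailto:dev@example.org") is true, while hasScheme("page.html?time=10:30") is false because the colon sits inside the query.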

> Fail to resovle outlink correctly
> ---------------------------------
>
>                 Key: DROIDS-45
>                 URL: https://issues.apache.org/jira/browse/DROIDS-45
>             Project: Droids
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: LinkResolver.java, LinkResolverTests.java
>
>
> I've encountered several cases where outlinks are not extracted correctly. Most are caused
> by the use of URI.resolve().
> 1. For a base URI of new URI("http://www.domain.com"), <a href="test.html">test.html</a>
> is resolved to http://www.domain.comtest.html
> 2. For a base URI of new URI("http://www.domain.com/index.php"), <a href="?test=true">test
> with param</a> is resolved to http://www.domain.com/?test=true
> 3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve() throws an
> exception, whereas a browser can resolve the link. (Remark: I didn't check whether this
> scenario affects the default Tika/NekoHTML parsing.)
> I suspect there are many different scenarios, many of them probably caused by non-standard
> usage (but a crawler has to handle non-standard usage in order to function). Obviously, we
> cannot cater for every case, so I suggest treating a resolve failure as a bug if a link works
> in a Mozilla browser but not in the Droids LinkExtractor.
> This issue is related to the LinkExtractor created in DROIDS-8.
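
The three numbered cases quoted above could be worked around along the following lines. This is a minimal sketch under stated assumptions, not the attached LinkResolver.java and not the Droids implementation: the class name OutlinkResolverSketch and its resolve method are hypothetical, and the query-only handling follows the browser-style behaviour described in RFC 3986 section 5.2.2.

```java
import java.net.URI;

// Hypothetical wrapper around java.net.URI.resolve(), for illustration only.
class OutlinkResolverSketch {

    static URI resolve(URI base, String href) {
        // Case 3: browsers ignore surrounding whitespace such as "\n".
        String link = href.trim();

        // Case 1: a base with an authority but an empty path gets the
        // root path "/" so that relative paths have something to join to.
        if (base.getRawAuthority() != null
                && (base.getRawPath() == null || base.getRawPath().isEmpty())) {
            String query = base.getRawQuery() == null ? "" : "?" + base.getRawQuery();
            base = URI.create(base.getScheme() + "://" + base.getRawAuthority() + "/" + query);
        }

        // Case 2: a query-only reference keeps the base path and replaces
        // only the base query and fragment.
        if (link.startsWith("?")) {
            String prefix = base.toString();
            int hash = prefix.indexOf('#');
            if (hash >= 0) {
                prefix = prefix.substring(0, hash);
            }
            int mark = prefix.indexOf('?');
            if (mark >= 0) {
                prefix = prefix.substring(0, mark);
            }
            return URI.create(prefix + link);
        }

        // Everything else is left to the standard resolution.
        return base.resolve(link);
    }
}
```

With this sketch, the three cases resolve to http://www.domain.com/test.html, http://www.domain.com/index.php?test=true and http://www.yahoo.com respectively. Links that are still not valid URIs after trimming cause an IllegalArgumentException, which a caller would need to handle.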

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

