From: "Mingfai Ma (JIRA)"
To: droids-dev@incubator.apache.org
Date: Thu, 2 Apr 2009 10:58:13 -0700 (PDT)
Subject: [jira] Updated: (DROIDS-45) Fail to resovle outlink correctly
Message-ID: <1880390253.1238695093379.JavaMail.jira@brutus>
In-Reply-To: <1093377227.1238676013090.JavaMail.jira@brutus>

[
https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-45:
-----------------------------

    Attachment: LinkResolverTests.java
                LinkResolver.java

There are some other cases:
- mailto: and news: links
- URL parameters containing spaces
- Unicode characters

I think this is still far from a full list of the special scenarios.

Attached is my implementation of some custom link transformations that handle more cases. The code could be moved to LinkExtractor if you think that is OK. I don't use a SAX parser, so I don't use LinkExtractor. It would be good if the URL/URI transformation and resolution could be refactored into a standalone class.

Another thing is that I implemented some of the checks differently. I have not run any benchmark on a modern JDK, but I believe my approach, which uses indexOf and avoids regex, is slightly more efficient.

> Fail to resovle outlink correctly
> ---------------------------------
>
>                 Key: DROIDS-45
>                 URL: https://issues.apache.org/jira/browse/DROIDS-45
>             Project: Droids
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: LinkResolver.java, LinkResolverTests.java
>
>
> I've encountered several cases where outlinks are not extracted correctly. Most are caused by the use of URI.resolve().
> 1. For a base URI of new URI("http://www.domain.com"), test.html is resolved to http://www.domain.comtest.html
> 2. For a base URI of new URI("http://www.domain.com/index.php"), the link "test with param" is resolved to http://www.domain.com/?test=true
> 3. For the link "line break!", URI.resolve throws an exception, whereas a browser can resolve the URI. (Remark: I didn't check whether this scenario affects the default Tika/NekoHTML parsing.)
> I suspect there are many different scenarios, many of them probably caused by non-standard usage.
> (but a crawler has to handle non-standard usage in order to function). Obviously, we cannot cater for every case, and I suggest treating a resolve failure as a bug if a link works in a Mozilla browser but not in the Droids LinkExtractor.
> This issue is related to the LinkExtractor created in DROIDS-8.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
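For illustration only, here is a minimal sketch of the kind of cleanup before URI.resolve() that the issue describes. It is not the attached LinkResolver.java: the class name LinkResolverSketch and its resolve method are hypothetical, and it assumes a hierarchical base with an authority (for example an http URL). It skips mailto: and news: links with a prefix check rather than a regex, strips line breaks, percent-encodes spaces, and gives the base a "/" path when it has none.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not the attached LinkResolver.java.
class LinkResolverSketch {

    /** Returns the absolute link as a string, or null when the link is skipped. */
    static String resolve(String base, String href) {
        // Drop line breaks and surrounding whitespace, which browsers tolerate.
        String link = href.replace("\r", "").replace("\n", "").trim();

        // Skip non-crawlable schemes with a plain prefix check instead of a regex.
        if (link.regionMatches(true, 0, "mailto:", 0, 7)
                || link.regionMatches(true, 0, "news:", 0, 5)) {
            return null;
        }

        // Percent-encode spaces so that java.net.URI accepts the reference.
        link = link.replace(" ", "%20");

        // Rebuild the base so that it always carries a path (at least "/").
        // Assumption: the base is hierarchical and has an authority.
        URI baseUri = URI.create(base);
        String path = baseUri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String prefix = baseUri.getScheme() + "://" + baseUri.getRawAuthority() + path;

        try {
            if (link.startsWith("?")) {
                // Query-only reference: keep the base path and swap in the new query.
                return new URI(prefix + link).toString();
            }
            String query = baseUri.getRawQuery();
            URI normalizedBase = new URI(query == null ? prefix : prefix + "?" + query);
            return normalizedBase.resolve(new URI(link)).toString();
        } catch (URISyntaxException e) {
            // Still unparsable after cleanup: skip the link rather than abort the crawl.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.domain.com", "test.html"));
        System.out.println(resolve("http://www.domain.com/index.php", "?test=true"));
        System.out.println(resolve("http://www.domain.com/", "\ntest.html\n"));
        System.out.println(resolve("http://www.domain.com/", "mailto:someone@example.com"));
    }
}
```

With this cleanup, the first two cases from the issue description come out as http://www.domain.com/test.html and http://www.domain.com/index.php?test=true. Unicode characters and other non-standard input are not specifically handled here and would need their own tests.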