Message-ID: <4E9ED02E.2070900@uni-kassel.de>
Date: Wed, 19 Oct 2011 15:27:10 +0200
From: Marek Bachmann
To: user@nutch.apache.org
Subject: Re: How does nutch handles javaScript in href
References: <4E9C31EA.9080708@uni-kassel.de> <201110171705.17361.markus.jelsma@openindex.io> <4E9C4608.7000704@uni-kassel.de> <4E9EBE58.9060101@uni-kassel.de>

On 19.10.2011 14:34, lewis john mcgibbney wrote:
> Hi Marek,
>
> This is v. interesting and I am looking forward to hearing from anyone with
> similar problems. Unfortunately I've not experienced this behaviour, however
> it is clearly a significant problem as you point out. Ultimately it should
> be ironed out.
>
> What a great tool the ParserChecker is.
>
> 11/10/19 13:58:05 INFO parse.ParserChecker: parsing:
>> http://www.uni-kassel.de/intranet/footernavi/redaktion.html
>> 11/10/19 13:58:05 INFO parse.ParserChecker: contentType:
>> application/xhtml+xml
>> 11/10/19 13:58:05 INFO conf.Configuration: found resource parse-plugins.xml
>> at file:/tmp/hadoop-nutch/hadoop-**unjar8228180125857982003/**
>> parse-plugins.xml
>> 11/10/19 13:58:05 WARN parse.ParserFactory: ParserFactory:Plugin:
>> org.apache.nutch.parse.html.HtmlParser mapped to contentType
>> application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does
>> not claim to support contentType: application/xhtml+xml
>>
>
> This indicates that parse-html was not used and the default for wildcard
> contentType defaults to parse-tika... am I correct here?

According to my parse-plugins.xml, yes:

BUT: I added

LOG.info("This is HtmlParser");

as the first line of getParse in HtmlParser.java and recompiled. After that I got:

(...)
11/10/19 15:20:08 WARN parse.ParserFactory: ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
11/10/19 15:20:08 INFO parse.html: This is HtmlParser
--------- Url ---------------
http://www.uni-kassel.de/intranet/footernavi/redaktion.html
--------- ParseData ---------
Version: 5
Status: success(1,0)
Title: Intranet: Redaktion
Outlinks: 23
  outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php anchor:
  outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef anchor:
(...)

As I understand it, the HtmlParser IS used, and NOT Tika?

> If this is the case then it means that parse-tika is not dealing with the
> problem as you describe it. However I must also comment, that we recently
> committed Ferdy's NUTCH-1097 for trunk-1.4 which meant that parse-html dealt
> with application/xhtml+xml material. It would be interesting to see if
> parse-html in trunk-1.4 deals with this now. If not then I think this needs
> to be filed as a JIRA issue and dealt with appropriately.
>
> Can you please check and get back to us...
>
> Thanks
>
> Lewis
>
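
P.S. To make the behaviour in the log output concrete, here is a minimal Java sketch of how I read it: the content type is looked up in the parse-plugins.xml mapping (with "*" as the fallback), and a missing claim in plugin.xml only produces a warning while the mapped parser is still used. This is a simplified model and NOT the real ParserFactory code. The class name, the two maps, and the claimed types listed for each plugin are my own assumptions for illustration; only the application/xhtml+xml mapping and the wildcard default to parse-tika come from the log and from Lewis's mail.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of the lookup discussed in this thread.
// NOT the actual Nutch ParserFactory; all names here are hypothetical.
final class ParserLookupSketch {

    // Stand-in for parse-plugins.xml: content type -> plugin id.
    static final Map<String, String> MAPPING = new LinkedHashMap<>();

    // Stand-in for each plugin's plugin.xml: content types it claims.
    // The claimed types are assumed values chosen to reproduce the warning.
    static final Map<String, Set<String>> CLAIMED = new LinkedHashMap<>();

    static {
        MAPPING.put("application/xhtml+xml", "parse-html");
        MAPPING.put("*", "parse-tika"); // wildcard default
        CLAIMED.put("parse-html", Set.of("text/html"));
        CLAIMED.put("parse-tika", Set.of("*"));
    }

    // Outcome of a lookup: the chosen plugin and whether a warning was raised.
    static final class Result {
        final String pluginId;
        final boolean warned;

        Result(String pluginId, boolean warned) {
            this.pluginId = pluginId;
            this.warned = warned;
        }
    }

    static Result resolve(String contentType) {
        // An explicit mapping wins; otherwise fall back to the wildcard entry.
        String pluginId = MAPPING.getOrDefault(contentType, MAPPING.get("*"));
        Set<String> claimed = CLAIMED.getOrDefault(pluginId, Set.of());
        // A missing claim is only warned about; the mapped plugin is still returned.
        boolean warned = !claimed.contains(contentType) && !claimed.contains("*");
        if (warned) {
            System.err.println("WARN: " + pluginId + " mapped to contentType "
                    + contentType + " but does not claim to support it");
        }
        return new Result(pluginId, warned);
    }

    public static void main(String[] args) {
        for (String type : List.of("application/xhtml+xml", "application/pdf")) {
            Result r = resolve(type);
            System.out.println(type + " -> " + r.pluginId
                    + (r.warned ? " (warning logged)" : ""));
        }
    }
}
```

Under these assumptions, application/xhtml+xml resolves to parse-html with a warning, which would match both the WARN line and the "This is HtmlParser" line above, while an unmapped type falls through to parse-tika.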