nutch-user mailing list archives

From Marek Bachmann <m.bachm...@uni-kassel.de>
Subject Re: How does nutch handles javaScript in href
Date Wed, 19 Oct 2011 13:27:10 GMT
On 19.10.2011 14:34, lewis john mcgibbney wrote:
> Hi Marek,
>
> This is very interesting and I am looking forward to hearing from anyone with
> similar problems. Unfortunately I've not experienced this behaviour, however
> it is clearly a significant problem as you point out. Ultimately it should
> be ironed out.
>
> What a great tool the ParserChecker is.
>
>> 11/10/19 13:58:05 INFO parse.ParserChecker: parsing:
>> http://www.uni-kassel.de/intranet/footernavi/redaktion.html
>> 11/10/19 13:58:05 INFO parse.ParserChecker: contentType:
>> application/xhtml+xml
>> 11/10/19 13:58:05 INFO conf.Configuration: found resource parse-plugins.xml
>> at file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/parse-plugins.xml
>> 11/10/19 13:58:05 WARN parse.ParserFactory: ParserFactory:Plugin:
>> org.apache.nutch.parse.html.HtmlParser mapped to contentType
>> application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does
>> not claim to support contentType: application/xhtml+xml
>>
>
> This indicates that parse-html was not used and that the wildcard
> contentType falls back to parse-tika... am I correct here?

According to my parse-plugins.xml, yes:

   <!-- by default if the mimeType is set to *, or
        if it can't be determined, use parse-tika -->
   <mimeType name="*">
     <plugin id="parse-tika" />
   </mimeType>
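
Read as a lookup, this entry is only a fallback: an explicit mimeType entry, if one exists, would win over "*". A minimal Java sketch of that reading follows. It is a simplified model and not Nutch's actual ParserFactory code; the class name, the map, and the explicit application/xhtml+xml entry are assumptions (the last one inferred from the WARN line quoted above).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of "exact mimeType first, wildcard second".
// Hypothetical class; NOT taken from the Nutch source.
public final class ParserLookupSketch {
    private static final Map<String, String> MAPPINGS = new LinkedHashMap<>();

    static {
        // Assumed explicit entry, inferred from the WARN line in the log.
        MAPPINGS.put("application/xhtml+xml", "parse-html");
        // The wildcard entry shown in parse-plugins.xml.
        MAPPINGS.put("*", "parse-tika");
    }

    private ParserLookupSketch() {}

    // Return the plugin id for a content type, falling back to "*".
    public static String pluginFor(String contentType) {
        String id = MAPPINGS.get(contentType);
        return id != null ? id : MAPPINGS.get("*");
    }

    public static void main(String[] args) {
        System.out.println(pluginFor("application/xhtml+xml"));
        System.out.println(pluginFor("application/x-unknown"));
    }
}
```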

BUT:

I added LOG.info("This is HtmlParser"); as the first line of getParse in
HtmlParser.java and recompiled. After that I got:

(...)
11/10/19 15:20:08 WARN parse.ParserFactory: ParserFactory:Plugin: 
org.apache.nutch.parse.html.HtmlParser mapped to contentType 
application/xhtml+xml via parse-plugins.xml, but its plugin.xml file 
does not claim to support contentType: application/xhtml+xml

11/10/19 15:20:08 INFO parse.html: This is HtmlParser

---------
Url
---------------
http://www.uni-kassel.de/intranet/footernavi/redaktion.html---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Intranet: Redaktion
Outlinks: 23
   outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php anchor:
   outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef anchor:
(...)

As I understand it, HtmlParser IS used here, and NOT Tika?
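
A side note on the second outlink: its last path segment does not look random. Shifting every character back by one position gives a string that reads as a mailto: URI, which would fit a JavaScript-obfuscated mail link in the page. The sketch below shows that shift in Java. The class name, the three character ranges, and the offset of -1 are assumptions chosen to match this one string; they are not taken from the page's actual script.

```java
// Hypothetical helper: undoes a per-character shift inside three
// assumed ASCII ranges, wrapping around at the range boundaries.
public final class ShiftedMailto {
    private ShiftedMailto() {}

    // Move c by offset within [start, end], wrapping at both ends.
    private static char shift(int c, int start, int end, int offset) {
        int size = end - start + 1;
        return (char) (start + Math.floorMod(c - start + offset, size));
    }

    // Apply the shift to every character that falls into one of the
    // assumed ranges; all other characters are copied unchanged.
    public static String decode(String encoded, int offset) {
        StringBuilder out = new StringBuilder(encoded.length());
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c >= 0x2B && c <= 0x3A) {          // '+' .. ':' (digits, punctuation)
                out.append(shift(c, 0x2B, 0x3A, offset));
            } else if (c >= 0x40 && c <= 0x5A) {   // '@' .. 'Z'
                out.append(shift(c, 0x40, 0x5A, offset));
            } else if (c >= 0x61 && c <= 0x7A) {   // 'a' .. 'z'
                out.append(shift(c, 0x61, 0x7A, offset));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Last path segment of the second outlink, shifted back by one.
        System.out.println(decode("nbjmup+jousbofuAvoj.lbttfm/ef", -1));
    }
}
```

If that reading is right, the parser is picking up the argument of a JavaScript call in the href and resolving it as a relative URL, regardless of which parser plugin is active.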



> If this is the case, then parse-tika is not dealing with the
> problem as you describe it. However, I should also mention that we recently
> committed Ferdy's NUTCH-1097 to trunk-1.4, which makes parse-html handle
> application/xhtml+xml material. It would be interesting to see whether
> parse-html in trunk-1.4 deals with this now. If not, then I think this needs
> to be filed as a JIRA issue and dealt with appropriately.
>
> Can you please check and get back to us...
>
> Thanks
>
> Lewis
>

