nutch-user mailing list archives

From Sebastian Nagel <wastl.na...@googlemail.com>
Subject Re: removing "\n"... Nutch 1.14
Date Mon, 26 Feb 2018 15:31:17 GMT
Hi,

Paragraph breaks have been added by

https://github.com/apache/nutch/pull/190
 and
https://issues.apache.org/jira/browse/NUTCH-2397

It's not configurable.

A simple
  s/\n/ /g
should restore the old "look" of extracted plain texts.
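
For illustration, a minimal Python sketch of that substitution, assuming the
extracted text is already available as a string. The function names are
hypothetical and are not part of Nutch; the second function is an optional
variant that goes beyond the plain s/\n/ /g.

```python
import re


def flatten_newlines(text: str) -> str:
    # Direct equivalent of s/\n/ /g: every newline becomes one space.
    return text.replace("\n", " ")


def collapse_whitespace(text: str) -> str:
    # Optional variant (an assumption, not the substitution above):
    # squeeze any run of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "First paragraph.\nSecond paragraph.\n\nThird paragraph."
    print(flatten_newlines(sample))
    print(collapse_whitespace(sample))
```

Note that the plain substitution turns consecutive newlines into consecutive
spaces; the variant avoids that if single spaces are preferred.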

Best,
Sebastian


On 02/26/2018 04:17 PM, BlackIce wrote:
> Hi,
> 
> I ran into a problem with Nutch 1.14 that I don't recall having in
> previous versions.
> 
> I'm finding a lot of "\n" (newline?) in the content of my crawled sites.
> 
> I've tried different configurations/constellations of the HTML parser and
> Tika, and just Tika, to no avail.
> 
> All the info I can find on this is about older versions of Nutch,
> like ancient versions.
> 
> Did something change so that an extra configuration step is now
> required?
> 
> Greetz
> 
> RRK
> 

