nutch-user mailing list archives

From Dave Beckstrom
Subject parser.html.NodesToExclude
Date Thu, 12 Sep 2019 19:24:42 GMT
Hi All,

I'm running Nutch 1.15.

In my nutch-site.xml I configured the parameters below. Specifically, under
parser.html.NodesToExclude I'm telling it not to index "div id=sidebar" or
"div id=footer", and yet it continues to index those regions of the page.

Does anyone have suggestions on why this isn't working and what I should do
to resolve this?

Thank you!

  Which text extraction algorithm to use. Valid values are: boilerpipe or

  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
  or CanolaExtractor.

      A list of nodes whose content will not be indexed separated by "|".
      Use this to tell the HTML parser to ignore, for example, site
      navigation text.

      Each node has three elements, separated by semi-colon:
      the first one is the tag name,
      the second one the attribute name,
      the third one the value of the attribute.

      Example: table;summary;header|div;id;navigation

      Note that nodes with these attributes, and their children, will be
      silently ignored by the parser so verify the indexed content
      with Luke to confirm results.
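
To make the format in that description concrete, here is a small Python sketch of how a value such as "div;id;sidebar|div;id;footer" breaks down into tag/attribute/value rules. This is only an illustration of the syntax as the description states it, not Nutch's actual parser code; the names ExcludeRule, parse_exclude_spec and is_excluded are made up for this example, and the case-insensitive tag comparison is an assumption.

```python
from __future__ import annotations

from typing import NamedTuple


class ExcludeRule(NamedTuple):
    """One 'tag;attribute;value' entry (hypothetical helper type)."""
    tag: str
    attribute: str
    value: str


def parse_exclude_spec(spec: str) -> list[ExcludeRule]:
    """Split a 'tag;attr;value|tag;attr;value' string into rules."""
    rules = []
    for entry in spec.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty segments such as a trailing "|"
        parts = [p.strip() for p in entry.split(";")]
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected tag;attribute;value, got {entry!r}")
        rules.append(ExcludeRule(*parts))
    return rules


def is_excluded(tag: str, attrs: dict[str, str], rules: list[ExcludeRule]) -> bool:
    """Return True if an element matches any rule.

    Assumption: tag names are compared case-insensitively and attribute
    values must match exactly.
    """
    return any(
        tag.lower() == r.tag.lower() and attrs.get(r.attribute) == r.value
        for r in rules
    )


# The value I would expect to need for my two regions.
rules = parse_exclude_spec("div;id;sidebar|div;id;footer")
print(rules)
print(is_excluded("div", {"id": "sidebar"}, rules))
print(is_excluded("div", {"id": "content"}, rules))
```

Under these assumptions, the value for my case has two entries, one for the sidebar div and one for the footer div, each with exactly three semicolon-separated fields.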


Dave Beckstrom
Technical Delivery Manager / Senior Developer
ph: 763.323.3499

*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.*
