nutch-user mailing list archives

From Eran Zinman
Subject DocumentFragment and XPath
Date Thu, 03 Sep 2009 10:05:53 GMT

I've created a plugin for Nutch 1.0 that implements the HtmlParseFilter
extension point.

I want to extract some additional information from the HTML document.

The filter function receives all of its parameters correctly, and I want to
run XPath queries against the DocumentFragment object it is given.

I tried something simple, like extracting all "h1" tags, but no matter what
I do I always get 0 results.

What is the relation between DocumentFragment and XPath?

Is it even possible to use XPath on a DocumentFragment object?

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    Metadata metadata = parse.getData().getParseMeta();

    XPathFactory factory = XPathFactory.newInstance();
    XPath xpath = factory.newXPath();

    try {
      XPathExpression expr = xpath.compile("//h1");
      // Evaluate against the DocumentFragment passed in as "doc"
      Object result = expr.evaluate(doc, XPathConstants.NODESET);

      NodeList nodes = (NodeList) result;

      System.out.println("Found " + nodes.getLength() + " matches!");

      for (int i = 0; i < nodes.getLength(); i++) {
        // (loop body not present in the archived message)
      }
    } catch (XPathExpressionException e) {
      System.out.println("Error: " + e);
    }

    return parseResult;
  }
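
One thing that may be worth checking, offered as a possibility rather than a confirmed diagnosis: XPath name tests are case-sensitive, so "//h1" will not match an element whose DOM name is "H1". Some HTML parsers (NekoHTML, by default) report element names in upper case, and if the fragment handed to the filter was built that way, the query above would return 0 nodes. The sketch below illustrates only the case-sensitivity effect. It assumes a hand-built Document with upper-case names standing in for the DocumentFragment that Nutch supplies, and the class and method names (XPathCaseDemo, countMatches, buildUpperCaseDom) are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XPathCaseDemo {

    // Hypothetical helper: number of nodes the expression selects from the context node.
    public static int countMatches(Node context, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes =
            (NodeList) xpath.evaluate(expression, context, XPathConstants.NODESET);
        return nodes.getLength();
    }

    // Assumption: stands in for a parser that emits upper-case element names.
    public static Document buildUpperCaseDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element html = doc.createElement("HTML");
        Element body = doc.createElement("BODY");
        Element h1 = doc.createElement("H1");
        h1.setTextContent("Title");
        body.appendChild(h1);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildUpperCaseDom();

        // Lower-case name test: does not match the "H1" element.
        System.out.println("//h1 -> " + countMatches(doc, "//h1"));   // 0

        // Upper-case name test: matches.
        System.out.println("//H1 -> " + countMatches(doc, "//H1"));   // 1

        // Case-insensitive variant: lower-case the element name before comparing.
        String anyCase = "//*[translate(name(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
            + "'abcdefghijklmnopqrstuvwxyz') = 'h1']";
        System.out.println("any case -> " + countMatches(doc, anyCase)); // 1
    }
}
```

If this is the cause, printing getNodeName() for a few children of the fragment inside the filter should show which case the parser actually uses, and the query can then be written to match it.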

