nutch-user mailing list archives

From Eran Zinman <zze...@gmail.com>
Subject DocumentFragment and XPath
Date Thu, 03 Sep 2009 10:05:53 GMT
Hi,

I've created a plugin for Nutch 1.0 that implements HtmlParseFilter.

I wanted to extract some more information from the HTML document.

All the parameters arrive in the filter function as expected, and from there
I wanted to run some XPath queries against the DocumentFragment object.

I tried to do something simple like extracting all "h1" tags but no matter
what I do I always get 0 results.

What is the relation between DocumentFragment and XPath?

Is it even possible to use XPath on a DocumentFragment object?

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc)
  {
      Parse parse = parseResult.get(content.getUrl());
      Metadata metadata = parse.getData().getParseMeta();

      XPathFactory factory = XPathFactory.newInstance();
      XPath xpath = factory.newXPath();

      try
      {
          XPathExpression expr = xpath.compile("//h1");
          Object result = expr.evaluate(doc, XPathConstants.NODESET);

          NodeList nodes = (NodeList) result;

          System.out.println("Found " + nodes.getLength() + " matches!");

          for (int i = 0; i < nodes.getLength(); i++)
          {
              // getNodeValue() is null for element nodes; print the text instead
              System.out.println(nodes.item(i).getTextContent());
          }

      }
      catch (XPathExpressionException e)
      {
          System.out.println("Error: " + e);
      }

      return parseResult;
  }
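
To separate the XPath question from Nutch itself, here is a minimal standalone sketch that uses only the JDK. It is not taken from Nutch: the class XPathFragmentSketch and its methods sampleFragment and countMatches are hypothetical names. It copies a fragment's children into a fresh Document and queries that document. The sample fragment uses upper-case element names, which is an assumption about what an HTML parser might hand to the filter and is not confirmed here for Nutch 1.0. Since XPath name tests are case-sensitive, it may be worth comparing "//h1" with "//H1".

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Nutch.
final class XPathFragmentSketch {

    // Builds a tiny fragment by hand: <BODY><H1>Title</H1></BODY>.
    // The upper-case names are an assumption about the parser's output.
    static DocumentFragment sampleFragment() {
        try {
            Document owner = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            DocumentFragment frag = owner.createDocumentFragment();
            Element body = owner.createElement("BODY");
            Element h1 = owner.createElement("H1");
            h1.setTextContent("Title");
            body.appendChild(h1);
            frag.appendChild(body);
            return frag;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Copies the fragment's children under a wrapper element in a new
    // Document, evaluates the expression there, and returns the match count.
    static int countMatches(DocumentFragment frag, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element wrapper = doc.createElement("wrapper");
            doc.appendChild(wrapper);
            for (Node child = frag.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                // importNode makes a deep copy; the fragment is left untouched
                wrapper.appendChild(doc.importNode(child, true));
            }
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (ParserConfigurationException | XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DocumentFragment frag = sampleFragment();
        System.out.println("//h1 -> " + countMatches(frag, "//h1"));
        System.out.println("//H1 -> " + countMatches(frag, "//H1"));
    }
}
```

If the two counts differ for a real page, the element-name case is the likely culprit. If both are zero, printing getNodeName() for the first few descendants of the fragment would show what names the parser actually produced.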

Thanks,
Eran
