nutch-user mailing list archives

From Cam Bazz <camb...@gmail.com>
Subject modifying parse implementation
Date Sat, 16 Jul 2011 00:21:14 GMT
Hello,

In my quest to create a custom parser, I have modified ParseImpl to
hold another ParseText called features, as follows:

  public ParseImpl(String text, String features, ParseData data) {
    this(new ParseText(text), new ParseText(features), data, true);
  }

  public ParseImpl(ParseText text, ParseText features, ParseData data,
      boolean isCanonical) {
    this.text = text;
    this.data = data;
    this.features = features;
    this.isCanonical = isCanonical;
  }

  public String getFeatures() {
    return this.features.getText();
  }


I create the ParseImpl in HtmlParser.java like this:

  ParseResult parseResult = ParseResult.createParseResult(
      content.getUrl(), new ParseImpl(text, features, parseData));

However, I get an error when indexing. parse.getText() returns the
correct text, but calling parse.getFeatures() in the index-basic
plugin gives:

SolrIndexer: starting at 2011-07-16 03:06:54
java.io.IOException: Job failed!
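
One possible explanation, which is an assumption and not something the
log above confirms: if the parse object is serialized between the parse
step and the indexing step, and the serialization code was not extended
to include the new field, then features would come back as null and
this.features.getText() would throw. The sketch below illustrates that
effect with plain java.io streams. SketchParse, its write/readFields
methods, and the serializeFeatures flag are hypothetical stand-ins, not
Nutch classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class FeaturesRoundTrip {

    /** Hypothetical stand-in for a serializable parse object; NOT Nutch's ParseImpl. */
    static class SketchParse {
        String text;
        String features;
        // Assumption for illustration: toggles whether the new field is serialized.
        final boolean serializeFeatures;

        SketchParse(boolean serializeFeatures) {
            this.serializeFeatures = serializeFeatures;
        }

        SketchParse(String text, String features, boolean serializeFeatures) {
            this(serializeFeatures);
            this.text = text;
            this.features = features;
        }

        // Writes text always; writes features only when the flag is set.
        void write(DataOutput out) throws IOException {
            out.writeUTF(text);
            if (serializeFeatures) {
                out.writeUTF(features);
            }
        }

        // Mirrors write(): a field that was never written cannot be restored.
        void readFields(DataInput in) throws IOException {
            text = in.readUTF();
            if (serializeFeatures) {
                features = in.readUTF();
            }
        }
    }

    // Serializes the object to bytes and reads it back into a fresh instance.
    static SketchParse roundTrip(SketchParse original) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));
        SketchParse copy = new SketchParse(original.serializeFeatures);
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        SketchParse lost = roundTrip(new SketchParse("body text", "f1 f2", false));
        SketchParse kept = roundTrip(new SketchParse("body text", "f1 f2", true));
        System.out.println(lost.text + " / " + lost.features); // body text / null
        System.out.println(kept.text + " / " + kept.features); // body text / f1 f2
    }
}
```

If this is what is happening, the text survives because it is part of
the serialized form, while the extra field is dropped on the way to the
indexer.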


I am getting a much better understanding of how Nutch works. I don't
think my approach of butchering HtmlParser and ParseImpl is the best
one, and I am sure all of this can be put inside another plugin.

Best Regards,
C.B.
