nutch-user mailing list archives

From Cam Bazz <>
Subject modifying parse implementation
Date Sat, 16 Jul 2011 00:21:14 GMT

In my quest to create a custom parser, I have modified ParseImpl to
hold another ParseText called features, like this:

  public ParseImpl(String text, String features, ParseData data) {
    this(new ParseText(text), new ParseText(features), data, true);
  }

  public ParseImpl(ParseText text, ParseText features, ParseData data,
      boolean isCanonical) {
    this.text = text;
    this.data = data;
    this.features = features;
    this.isCanonical = isCanonical;
  }

  public String getFeatures() {
    return this.features.getText();
  }

and although I create the parseImpl like

ParseResult parseResult = ParseResult.createParseResult(
    content.getUrl(), new ParseImpl(text, features, parseData));

in the

I get an error when indexing. parse.getText() returns the correct
text, but if I call parse.getFeatures() in the index-basic plugin I
get:

SolrIndexer: starting at 2011-07-16 03:06:54 Job failed!
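
One possible cause, which the log line above does not confirm: Nutch
persists parse output to the segment after parsing and reads it back
for indexing, so a field that exists only on the in-memory ParseImpl
would plausibly not survive that round trip. The sketch below is plain
Java with no Nutch or Hadoop dependency. SketchParse, the
serializeFeatures flag and roundTrip are hypothetical names used only
to illustrate the effect, not the real ParseImpl code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class FeaturesRoundTrip {

    /** Hypothetical stand-in for a parse object that gained an extra field. */
    static class SketchParse {
        String text;
        String features;
        final boolean serializeFeatures;

        SketchParse(String text, String features, boolean serializeFeatures) {
            this.text = text;
            this.features = features;
            this.serializeFeatures = serializeFeatures;
        }

        /** Writes the fields; features is written only when the flag is set. */
        void write(DataOutput out) throws IOException {
            out.writeUTF(text);
            if (serializeFeatures) {
                out.writeUTF(features);
            }
        }

        /** Reads the fields back in the same order they were written. */
        void readFields(DataInput in) throws IOException {
            text = in.readUTF();
            features = serializeFeatures ? in.readUTF() : null;
        }
    }

    /** Serializes the object to bytes and rebuilds a fresh copy from them. */
    static SketchParse roundTrip(SketchParse original) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));
        SketchParse copy = new SketchParse(null, null, original.serializeFeatures);
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        SketchParse lossy = roundTrip(new SketchParse("body", "f1 f2", false));
        SketchParse full = roundTrip(new SketchParse("body", "f1 f2", true));
        System.out.println(lossy.text + " / " + lossy.features); // body / null
        System.out.println(full.text + " / " + full.features);   // body / f1 f2
    }
}
```

In the lossy case the text survives but features comes back null, so
calling a method on it would fail, which would match getText() working
while getFeatures() does not.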

I am getting a much better understanding of how Nutch works. I don't
think my approach of butchering HtmlParser and ParseImpl is the best,
and I am sure all of this can be put inside another plugin.

Best Regards,
