I've checked in a fix to trunk for this issue, and included a second patch in the ticket.


On Wed, Oct 30, 2013 at 9:02 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Benjamin,

I will have to look at the feed itself to see why only four of the links are extracted.  It is not likely to be due to the patch, but rather the feed format.  As you know, RSS standards are fluid at best and feed publishers often do things in unique ways.

I can't look at this in detail though until this evening.


On Wed, Oct 30, 2013 at 8:57 AM, Benjamin Brandmeier <bdvlop@gmail.com> wrote:
I've patched MCF and started the job. The log (attached) doesn't contain any error messages, and the crawled documents are indexed in Solr correctly.

However, only four(!) documents are crawled and indexed, although the feed contains 58 items. Could this be a configuration issue, or might the patch have caused it?


2013/10/30 Karl Wright <daddywri@gmail.com>
I've attached a patch to the ticket, but haven't tried it yet (no access to outside network right now).  Can you try this and see if it works?


On Wed, Oct 30, 2013 at 7:41 AM, Benjamin Brandmeier <bdvlop@gmail.com> wrote:
Hi Karl,

the stack trace at the point where the NPE occurs is just as long as the one provided in the log.

I've captured a stack trace at the point where previousContext is null for the first time. After that, currentContext is set to null, which leads to the error described.
Maybe this helps:

Daemon Thread [Worker thread '42'] (Suspended (entry into method endElement in XMLParsingContext))
RSSConnector$OuterContextClass(XMLParsingContext).endElement(String, String, String) line: 109
XMLFuzzyHierarchicalParseState.noteEndTagEx(String, String, String) line: 110
XMLFuzzyHierarchicalParseState(XMLFuzzyParseState).noteEndTag(String) line: 131
XMLFuzzyHierarchicalParseState(TagParseState).dealWithCharacter(char) line: 755
XMLFuzzyHierarchicalParseState(SingleCharacterReceiver).dealWithCharacters(Reader) line: 51
DecodingByteReceiver.dealWithBytes(InputStream) line: 48
BOMEncodingDetector.dealWithRemainder(byte[], int, int, InputStream) line: 248
BOMEncodingDetector(SingleByteReceiver).dealWithBytes(InputStream) line: 52
Parser.parseWithCharsetDetection(String, InputStream, CharacterReceiver) line: 82
RSSConnector.handleRSSFeedSAX(String, IProcessActivity, RSSConnector$Filter) line: 3481
RSSConnector.processDocuments(String[], String[], IProcessActivity, DocumentSpecification, boolean[], int) line: 1256
WorkerThread.run() line: 559

I've tested this with MCF 1.3 AND 1.4 (from tag). The same error occurs with both versions.


2013/10/30 Karl Wright <daddywri@gmail.com>
Hi Benjamin,

It may be malformed XML that we don't treat properly.  If the log has a full stack trace, that would be very helpful.  If not, can you get one?




From: Benjamin Brandmeier
Sent: 10/30/2013 6:51 AM
To: user@manifoldcf.apache.org
Subject: RSS Crawl -> NullPointerException

Hi everyone,


I'm facing a problem with the RSS connector. The feed I'm crawling is --> http://blog.fme.de/feed

An NPE occurs at processing time. After some debugging I've found out the following:


Variable previousContext is null in method --> public final void endElement(String namespace, String localName, String qName)

Parameter qName is content:encoded, but there are many tags like this in the feed, so I'm not sure at which point the error occurs.

The variable previousContext (= null) is written to currentContext. As the stack trace shows, the error happens at org.apache.manifoldcf.core.fuzzyml.XMLFuzzyHierarchicalParseState.cleanup(XMLFuzzyHierarchicalParseState.java:86); at this point currentContext.cleanup(); is called with currentContext = null.
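
To make the failure mode described above concrete, here is a minimal sketch in Java. It is not the ManifoldCF code: the classes Context and ParseState and all method names are hypothetical stand-ins. It also assumes, as one possible trigger only, that the parser sees one more end tag than start tags. The guarded variant is just an illustration of a defensive check, not the fix that was applied to the project.

```java
public class ContextStackSketch {

    /** Hypothetical stand-in for a parsing context; not the ManifoldCF class. */
    static class Context {
        final String name;
        final Context previous; // null for the outermost context

        Context(String name, Context previous) {
            this.name = name;
            this.previous = previous;
        }

        void cleanup() {
            // A real context would release its resources here.
        }
    }

    /** Hypothetical stand-in for the hierarchical parse state. */
    static class ParseState {
        Context current = new Context("outer", null);

        void startElement(String qName) {
            current = new Context(qName, current);
        }

        // Unguarded: popping the outermost context writes null into current.
        void endElementUnguarded() {
            current = current.previous;
        }

        // Guarded: never pop past the outermost context.
        void endElementGuarded() {
            if (current.previous != null) {
                current = current.previous;
            }
        }

        void cleanup() {
            // Throws NullPointerException if current was set to null earlier.
            current.cleanup();
        }
    }

    /** Feeds one start tag and two end tags, then reports whether cleanup() throws an NPE. */
    static boolean failsWithNpe(boolean guarded) {
        ParseState state = new ParseState();
        state.startElement("content:encoded");
        for (int i = 0; i < 2; i++) {
            if (guarded) {
                state.endElementGuarded();
            } else {
                state.endElementUnguarded();
            }
        }
        try {
            state.cleanup();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded fails with NPE: " + failsWithNpe(false));
        System.out.println("guarded fails with NPE:   " + failsWithNpe(true));
    }
}
```

In this sketch the second end tag pops the outermost context, so current becomes null and the later cleanup() call throws, which matches the symptom I see in the debugger. Whether the real feed triggers it the same way is still open.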


manifoldcf.log is attached.


Any thoughts on this? I tried different settings regarding dechromed content.