manifoldcf-user mailing list archives

From Benjamin Brandmeier <bdv...@gmail.com>
Subject Re: RSS Crawl -> NullPointerException
Date Wed, 30 Oct 2013 12:57:25 GMT
I've patched MCF and started the job. The log (attached) doesn't contain
any error messages, and the crawled documents are indexed in Solr correctly.

However, only four (!) documents are crawled and indexed, although the feed
contains 58 items. Could this be a configuration issue, or might the patch
have caused it?

Thanks!
Benjamin


2013/10/30 Karl Wright <daddywri@gmail.com>

> I've attached a patch to the ticket, but haven't tried it yet (no access
> to outside network right now).  Can you try this and see if it works?
>
> Thanks,
> Karl
>
>
>
> On Wed, Oct 30, 2013 at 7:41 AM, Benjamin Brandmeier <bdvlop@gmail.com> wrote:
>
>> Hi Karl,
>>
>> the stack trace at the point where the NPE occurs is just as long as the
>> one provided in the log.
>>
>> I've captured a stack trace at the point where previousContext is null for
>> the first time. After that, currentContext is set to null, which leads to
>> the error described.
>> Maybe this helps:
>>
>> Daemon Thread [Worker thread '42'] (Suspended (entry into method endElement in XMLParsingContext))
>>     RSSConnector$OuterContextClass(XMLParsingContext).endElement(String, String, String) line: 109
>>     XMLFuzzyHierarchicalParseState.noteEndTagEx(String, String, String) line: 110
>>     XMLFuzzyHierarchicalParseState(XMLFuzzyParseState).noteEndTag(String) line: 131
>>     XMLFuzzyHierarchicalParseState(TagParseState).dealWithCharacter(char) line: 755
>>     XMLFuzzyHierarchicalParseState(SingleCharacterReceiver).dealWithCharacters(Reader) line: 51
>>     DecodingByteReceiver.dealWithBytes(InputStream) line: 48
>>     BOMEncodingDetector.dealWithRemainder(byte[], int, int, InputStream) line: 248
>>     BOMEncodingDetector(SingleByteReceiver).dealWithBytes(InputStream) line: 52
>>     Parser.parseWithCharsetDetection(String, InputStream, CharacterReceiver) line: 82
>>     RSSConnector.handleRSSFeedSAX(String, IProcessActivity, RSSConnector$Filter) line: 3481
>>     RSSConnector.processDocuments(String[], String[], IProcessActivity, DocumentSpecification, boolean[], int) line: 1256
>>     WorkerThread.run() line: 559
>>
>>
>> I've tested this with MCF 1.3 and 1.4 (from tag). The same error occurs
>> with both versions.
>>
>> Benjamin
>>
>>
>> 2013/10/30 Karl Wright <daddywri@gmail.com>
>>
>>> Hi Benjamin,
>>>
>>> It may be malformed XML that we don't treat properly.  If the log has a
>>> full stack trace that would be very helpful.  If not can you get one?
>>>
>>> Thanks!
>>>
>>> Karl
>>>
>>> Sent from my Windows Phone
>>> ------------------------------
>>> From: Benjamin Brandmeier
>>> Sent: 10/30/2013 6:51 AM
>>> To: user@manifoldcf.apache.org
>>> Subject: RSS Crawl -> NullPointerException
>>>
>>>  Hi everyone,
>>>
>>>
>>>
>>> I'm facing a problem with the RSS connector. The feed I'm crawling is
>>> --> http://blog.fme.de/feed
>>>
>>> An NPE occurs at processing time. After some debugging, I found the
>>> following:
>>>
>>>
>>>
>>> Variable previousContext is null in method --> public final void
>>> endElement(String namespace, String localName, String qName)
>>>
>>> Parameter qName is content:encoded, but there are many tags like this in
>>> the feed, so I'm not sure at which point the error occurs.
>>>
>>> The variable previousContext (= null) is assigned to currentContext. As
>>> the stack trace shows, the error happens at
>>> org.apache.manifoldcf.core.fuzzyml.XMLFuzzyHierarchicalParseState.cleanup(XMLFuzzyHierarchicalParseState.java:86),
>>> where currentContext.cleanup(); is called with currentContext = null.
>>>
>>>
>>>
>>> manifoldcf.log is attached.
>>>
>>>
>>>
>>> Any thoughts on this? I tried different settings regarding dechromed
>>> content.
>>>
>>>
>>>
>>> Benjamin
>>>
>>
>>
>
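
The failure mode described in the original report (a null previousContext
being assigned to currentContext, followed by a cleanup() call on the null
reference) can be illustrated with a minimal sketch. Everything below is a
hypothetical stand-in: the class and method names (ContextStackSketch,
Context, ParseState, endElementGuarded, and so on) are invented for
illustration, this is not the ManifoldCF code and not the patch attached to
the ticket, and the trigger (an end tag with no matching start tag) is only
an assumption consistent with the "malformed XML" guess in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for illustration only; these are NOT
// the ManifoldCF classes and this is NOT the patch attached to the ticket.
class ContextStackSketch {

    /** A parsing context that remembers the context it was pushed on top of. */
    static class Context {
        final String name;
        final Context previous; // null only for the outermost context

        Context(String name, Context previous) {
            this.name = name;
            this.previous = previous;
        }

        void cleanup() {
            // Nothing to release in this sketch.
        }
    }

    /** Tracks the current context while start and end tags are reported. */
    static class ParseState {
        private Context current = new Context("root", null);
        final List<String> warnings = new ArrayList<>();

        void startElement(String qName) {
            current = new Context(qName, current);
        }

        // Unguarded pop: an end tag seen at the outermost context sets
        // current to null, so a later cleanup() dereferences null.
        void endElementUnguarded(String qName) {
            current = current.previous;
        }

        // Guarded pop (assumed mitigation): never pop past the outermost
        // context; record the unmatched end tag and carry on instead.
        void endElementGuarded(String qName) {
            if (current.previous == null) {
                warnings.add("Ignoring unmatched end tag: " + qName);
                return;
            }
            current = current.previous;
        }

        String currentName() {
            return current.name;
        }

        void cleanup() {
            current.cleanup(); // throws NullPointerException if current is null
        }
    }

    public static void main(String[] args) {
        ParseState unguarded = new ParseState();
        unguarded.endElementUnguarded("content:encoded"); // stray end tag
        try {
            unguarded.cleanup();
        } catch (NullPointerException e) {
            System.out.println("Unguarded pop: cleanup() hit a null context");
        }

        ParseState guarded = new ParseState();
        guarded.endElementGuarded("content:encoded"); // same stray end tag
        guarded.cleanup();
        System.out.println("Guarded pop: still in context " + guarded.currentName());
    }
}
```

Whether ignoring the stray tag is the right recovery for a real feed is a
design choice: it keeps the parse alive, but content near the malformed
markup may still be skipped.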
