manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: RSS Crawl -> NullPointerException
Date Wed, 30 Oct 2013 21:51:58 GMT
I've checked in a fix to trunk for this issue, and included a second patch
in the ticket.

Karl



On Wed, Oct 30, 2013 at 9:02 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Benjamin,
>
> I will have to look at the feed itself to see why only four of the links
> are extracted.  It is not likely to be due to the patch, but rather the
> feed format.  As you know, RSS standards are fluid at best and feed
> publishers often do things in unique ways.
>
> I can't look at this in detail though until this evening.
>
> Karl
>
>
>
> On Wed, Oct 30, 2013 at 8:57 AM, Benjamin Brandmeier <bdvlop@gmail.com> wrote:
>
>> I've patched mcf and started the job. The log (attached) doesn't contain
>> any error messages and the documents crawled are indexed in Solr correctly.
>>
>> However, only four(!) documents are crawled/indexed, but 58 items exist
>> in the feed. Could this be a configuration issue or might the patch have
>> led to that?
>>
>> Thanks!
>> Benjamin
>>
>>
>> 2013/10/30 Karl Wright <daddywri@gmail.com>
>>
>>> I've attached a patch to the ticket, but haven't tried it yet (no access
>>> to outside network right now).  Can you try this and see if it works?
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>> On Wed, Oct 30, 2013 at 7:41 AM, Benjamin Brandmeier <bdvlop@gmail.com> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> the stack trace at the point where the NPE occurs is just as long as
>>>> the one provided in the log.
>>>>
>>>> I've fetched a stack trace at the point where previousContext is null
>>>> for the first time. After that, the currentContext will be set to null and
>>>> this leads to the error described.
>>>> Maybe this helps:
>>>>
>>>> Daemon Thread [Worker thread '42'] (Suspended (entry into method endElement in XMLParsingContext))
>>>>   RSSConnector$OuterContextClass(XMLParsingContext).endElement(String, String, String) line: 109
>>>>   XMLFuzzyHierarchicalParseState.noteEndTagEx(String, String, String) line: 110
>>>>   XMLFuzzyHierarchicalParseState(XMLFuzzyParseState).noteEndTag(String) line: 131
>>>>   XMLFuzzyHierarchicalParseState(TagParseState).dealWithCharacter(char) line: 755
>>>>   XMLFuzzyHierarchicalParseState(SingleCharacterReceiver).dealWithCharacters(Reader) line: 51
>>>>   DecodingByteReceiver.dealWithBytes(InputStream) line: 48
>>>>   BOMEncodingDetector.dealWithRemainder(byte[], int, int, InputStream) line: 248
>>>>   BOMEncodingDetector(SingleByteReceiver).dealWithBytes(InputStream) line: 52
>>>>   Parser.parseWithCharsetDetection(String, InputStream, CharacterReceiver) line: 82
>>>>   RSSConnector.handleRSSFeedSAX(String, IProcessActivity, RSSConnector$Filter) line: 3481
>>>>   RSSConnector.processDocuments(String[], String[], IProcessActivity, DocumentSpecification, boolean[], int) line: 1256
>>>>   WorkerThread.run() line: 559
>>>>
>>>>
>>>> I've tested this with MCF 1.3 AND 1.4 (from tag). The same error occurs
>>>> with both versions.
>>>>
>>>> Benjamin
>>>>
>>>>
>>>> 2013/10/30 Karl Wright <daddywri@gmail.com>
>>>>
>>>>> Hi Benjamin,
>>>>>
>>>>> It may be malformed XML that we don't treat properly.  If the log has
>>>>> a full stack trace, that would be very helpful.  If not, can you get one?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Karl
>>>>>
>>>>> Sent from my Windows Phone
>>>>> ------------------------------
>>>>> From: Benjamin Brandmeier
>>>>> Sent: 10/30/2013 6:51 AM
>>>>> To: user@manifoldcf.apache.org
>>>>> Subject: RSS Crawl -> NullPointerException
>>>>>
>>>>>  Hi everyone,
>>>>>
>>>>>
>>>>>
>>>>> I'm facing a problem with the RSS connector. The feed I'm crawling is
>>>>> --> http://blog.fme.de/feed
>>>>>
>>>>> An NPE occurs at processing time. After some debugging I've found out
>>>>> the following:
>>>>>
>>>>>
>>>>>
>>>>> Variable previousContext is null in method --> public final void
>>>>> endElement(String namespace, String localName, String qName)
>>>>>
>>>>> Parameter qName is content:encoded, but there are many tags like this
>>>>> in the feed, so I'm not sure at which point the error occurs.
>>>>>
>>>>> The variable previousContext (= null) is written to currentContext. As
>>>>> the stack trace shows, the error happens at
>>>>> org.apache.manifoldcf.core.fuzzyml.XMLFuzzyHierarchicalParseState.cleanup(XMLFuzzyHierarchicalParseState.java:86),
>>>>> where currentContext.cleanup(); is called with currentContext = null.
>>>>>
>>>>>
>>>>>
>>>>> manifoldcf.log is attached.
>>>>>
>>>>>
>>>>>
>>>>> Any thoughts on this? I tried different settings regarding dechromed
>>>>> content.
>>>>>
>>>>>
>>>>>
>>>>> Benjamin
>>>>>
>>>>
>>>>
>>>
>>
>
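
For readers following the thread, the failure mode Benjamin describes (an end
tag arriving when there is no previous context, so the current context becomes
null and a later cleanup() call throws) can be sketched as below. This is a
minimal illustration under stated assumptions, not the ManifoldCF code and not
necessarily the fix that was checked in: the class names ContextStackSketch,
Context and ParseState, the "guarded" flag, and the choice to ignore a stray
end tag are all hypothetical.

```java
/**
 * Hypothetical sketch of a context chain in a tolerant XML parser.
 * None of these classes are the real ManifoldCF classes.
 */
final class ContextStackSketch {

    /** One parsing context; "previous" is null for the outermost context. */
    static final class Context {
        final String name;
        final Context previous;

        Context(String name, Context previous) {
            this.name = name;
            this.previous = previous;
        }

        void cleanup() {
            // A real context would release buffers or temporary files here.
        }
    }

    /** Tracks the current context as start and end tags are seen. */
    static final class ParseState {
        private Context current = new Context("outer", null);
        private final boolean guarded; // assumption: toggles the null check

        ParseState(boolean guarded) {
            this.guarded = guarded;
        }

        void startElement(String qName) {
            current = new Context(qName, current);
        }

        void endElement(String qName) {
            Context previous = current.previous;
            if (guarded && previous == null) {
                // Stray end tag at the outermost level: ignore it so that
                // "current" never becomes null.
                return;
            }
            current.cleanup();
            current = previous; // unguarded: this can store null
        }

        void cleanup() {
            // Throws NullPointerException if "current" was set to null above.
            current.cleanup();
        }
    }

    /** Feeds one matched element plus one stray end tag, then cleans up. */
    static boolean failsOnStrayEndTag(boolean guarded) {
        ParseState state = new ParseState(guarded);
        state.startElement("item");
        state.endElement("item");
        state.endElement("content:encoded"); // no matching start tag
        try {
            state.cleanup();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded fails: " + failsOnStrayEndTag(false));
        System.out.println("guarded fails: " + failsOnStrayEndTag(true));
    }
}
```

Ignoring the stray end tag is only one possible policy for a tolerant parser;
whether the actual patch does this or something else is not stated in the
thread.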
