manifoldcf-user mailing list archives

From: Karl Wright <daddy...@gmail.com>
Subject: Re: RSS Crawl -> NullPointerException
Date: Wed, 30 Oct 2013 13:02:45 GMT

Hi Benjamin,

I will have to look at the feed itself to see why only four of the links
are extracted.  It is not likely to be due to the patch, but rather the
feed format.  As you know, RSS standards are fluid at best and feed
publishers often do things in unique ways.

I can't look at this in detail though until this evening.

Karl



On Wed, Oct 30, 2013 at 8:57 AM, Benjamin Brandmeier <bdvlop@gmail.com> wrote:

> I've patched mcf and started the job. The log (attached) doesn't contain
> any error messages and the documents crawled are indexed in Solr correctly.
>
> However, only four(!) documents are crawled/indexed, but 58 items exist in
> the feed. Could this be a configuration issue or might the patch have led
> to that?
>
> Thanks!
> Benjamin
>
>
> 2013/10/30 Karl Wright <daddywri@gmail.com>
>
>> I've attached a patch to the ticket, but haven't tried it yet (no access
>> to outside network right now).  Can you try this and see if it works?
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Wed, Oct 30, 2013 at 7:41 AM, Benjamin Brandmeier <bdvlop@gmail.com> wrote:
>>
>>> Hi Karl,
>>>
>>> the stack trace at the point where the NPE occurs is just as long as the
>>> one provided in the log.
>>>
>>> I've fetched a stack trace at the point where previousContext is null
>>> for the first time. After that, the currentContext will be set to null and
>>> this leads to the error described.
>>> Maybe this helps:
>>>
>>> Daemon Thread [Worker thread '42'] (Suspended (entry into method endElement in XMLParsingContext))
>>>   RSSConnector$OuterContextClass(XMLParsingContext).endElement(String, String, String) line: 109
>>>   XMLFuzzyHierarchicalParseState.noteEndTagEx(String, String, String) line: 110
>>>   XMLFuzzyHierarchicalParseState(XMLFuzzyParseState).noteEndTag(String) line: 131
>>>   XMLFuzzyHierarchicalParseState(TagParseState).dealWithCharacter(char) line: 755
>>>   XMLFuzzyHierarchicalParseState(SingleCharacterReceiver).dealWithCharacters(Reader) line: 51
>>>   DecodingByteReceiver.dealWithBytes(InputStream) line: 48
>>>   BOMEncodingDetector.dealWithRemainder(byte[], int, int, InputStream) line: 248
>>>   BOMEncodingDetector(SingleByteReceiver).dealWithBytes(InputStream) line: 52
>>>   Parser.parseWithCharsetDetection(String, InputStream, CharacterReceiver) line: 82
>>>   RSSConnector.handleRSSFeedSAX(String, IProcessActivity, RSSConnector$Filter) line: 3481
>>>   RSSConnector.processDocuments(String[], String[], IProcessActivity, DocumentSpecification, boolean[], int) line: 1256
>>>   WorkerThread.run() line: 559
>>>
>>>
>>> I've tested this with MCF 1.3 AND 1.4 (from tag). The same error occurs
>>> with both versions.
>>>
>>> Benjamin
>>>
>>>
>>> 2013/10/30 Karl Wright <daddywri@gmail.com>
>>>
>>>> Hi Benjamin,
>>>>
>>>> It may be malformed XML that we don't treat properly.  If the log has a
>>>> full stack trace that would be very helpful.  If not can you get one?
>>>>
>>>> Thanks!
>>>>
>>>> Karl
>>>>
>>>> Sent from my Windows Phone
>>>> ------------------------------
>>>> From: Benjamin Brandmeier
>>>> Sent: 10/30/2013 6:51 AM
>>>> To: user@manifoldcf.apache.org
>>>> Subject: RSS Crawl -> NullPointerException
>>>>
>>>>  Hi everyone,
>>>>
>>>>
>>>>
>>>> I'm facing a problem with the RSS connector. The feed I'm crawling is
>>>> --> http://blog.fme.de/feed
>>>>
>>>> A NPE occurs at processing time. After some debugging I've found out
>>>> the following:
>>>>
>>>>
>>>>
>>>> Variable previousContext is null in method --> public final void
>>>> endElement(String namespace, String localName, String qName)
>>>>
>>>> Parameter qName is content:encoded, but there are many tags like this
>>>> in the feed, so I'm not sure at which point the error occurs.
>>>>
>>>> The variable previousContext(=null) is written to currentContext. As
>>>> the stack trace shows, the error happens at
>>>> org.apache.manifoldcf.core.fuzzyml.XMLFuzzyHierarchicalParseState.cleanup(XMLFuzzyHierarchicalParseState.java:86),
>>>>
>>>> at this point currentContext.cleanup(); is called with currentContext =
>>>> null.
>>>>
>>>>
>>>>
>>>> manifoldcf.log is attached.
>>>>
>>>>
>>>>
>>>> Any thoughts on this? I tried different settings regarding dechromed
>>>> content.
>>>>
>>>>
>>>>
>>>> Benjamin
>>>>
>>>
>>>
>>
>
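
The failure mode described in the thread (an end tag that sets
currentContext to a null previousContext, followed by a cleanup() call on
that null reference) can be illustrated with a small standalone sketch.
This is not ManifoldCF code and not the patch attached to the ticket: the
class and method names below (ContextStackSketch, NaiveState, GuardedState)
are hypothetical, and the guard shown is only one plausible way to tolerate
an unbalanced end tag, under the assumption that the feed contains an end
tag with no matching open context.

```java
public class ContextStackSketch {
    /** One parsing context; 'previous' is the enclosing context (null at the root). */
    static final class Context {
        final String tag;
        final Context previous;
        Context(String tag, Context previous) { this.tag = tag; this.previous = previous; }
    }

    /** Unguarded: every end tag pops, even when only the root context is left. */
    static final class NaiveState {
        Context current = new Context("<root>", null);
        void startTag(String tag) { current = new Context(tag, current); }
        void endTag(String tag) { current = current.previous; } // becomes null at the root
        String cleanup() { return current.tag; }                // NPE if current is null
    }

    /** Guarded: an end tag seen at the root is counted and ignored instead of popped. */
    static final class GuardedState {
        Context current = new Context("<root>", null);
        int ignoredEndTags = 0;
        void startTag(String tag) { current = new Context(tag, current); }
        void endTag(String tag) {
            if (current.previous == null) { ignoredEndTags++; return; }
            current = current.previous;
        }
        String cleanup() { return current.tag; }
    }

    /** Feeds one balanced element plus one stray end tag to the unguarded state. */
    static boolean naiveCleanupThrows() {
        NaiveState s = new NaiveState();
        s.startTag("item");
        s.endTag("item");
        s.endTag("content:encoded"); // stray end tag: nothing left to pop
        try { s.cleanup(); return false; } catch (NullPointerException e) { return true; }
    }

    /** Feeds the same events to the guarded state and returns the surviving context tag. */
    static String guardedCleanup() {
        GuardedState s = new GuardedState();
        s.startTag("item");
        s.endTag("item");
        s.endTag("content:encoded"); // ignored by the guard
        return s.cleanup();
    }

    public static void main(String[] args) {
        System.out.println("naive cleanup throws NPE: " + naiveCleanupThrows());
        System.out.println("guarded cleanup context: " + guardedCleanup());
    }
}
```

Ignoring the stray end tag keeps the parser alive, but it does not by itself
explain why fewer items than expected are extracted from a feed; that depends
on how the connector interprets the feed's structure.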
