manifoldcf-user mailing list archives

From Benjamin Brandmeier <bdv...@gmail.com>
Subject Re: RSS Crawl -> NullPointerException
Date Thu, 31 Oct 2013 09:42:01 GMT
Great Job!
This fixed the issue and crawling works as expected. Thanks for being super
responsive!

Benjamin


2013/10/30 Karl Wright <daddywri@gmail.com>

> I've checked in a fix to trunk for this issue, and included a second patch
> in the ticket.
>
> Karl
>
>
>
> On Wed, Oct 30, 2013 at 9:02 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Benjamin,
>>
>> I will have to look at the feed itself to see why only four of the links
>> are extracted.  It is not likely to be due to the patch, but rather the
>> feed format.  As you know, RSS standards are fluid at best and feed
>> publishers often do things in unique ways.
>>
>> I can't look at this in detail though until this evening.
>>
>> Karl
>>
>>
>>
>> On Wed, Oct 30, 2013 at 8:57 AM, Benjamin Brandmeier <bdvlop@gmail.com> wrote:
>>
>>> I've patched mcf and started the job. The log (attached) doesn't contain
>>> any error messages and the documents crawled are indexed in Solr correctly.
>>>
>>> However, only four(!) documents are crawled/indexed, but 58 items exist
>>> in the feed. Could this be a configuration issue or might the patch have
>>> led to that?
>>>
>>> Thanks!
>>> Benjamin
>>>
>>>
>>> 2013/10/30 Karl Wright <daddywri@gmail.com>
>>>
>>>> I've attached a patch to the ticket, but haven't tried it yet (no
>>>> access to outside network right now).  Can you try this and see if it works?
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Wed, Oct 30, 2013 at 7:41 AM, Benjamin Brandmeier <bdvlop@gmail.com> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> the stack trace at the point where the NPE occurs is just as long as
>>>>> the one provided in the log.
>>>>>
>>>>> I've fetched a stack trace at the point where previousContext is null
>>>>> for the first time. After that, currentContext is set to null, and
>>>>> this leads to the error described.
>>>>> Maybe this helps:
>>>>>
>>>>> Daemon Thread [Worker thread '42'] (Suspended (entry into method endElement in XMLParsingContext))
>>>>>   RSSConnector$OuterContextClass(XMLParsingContext).endElement(String, String, String) line: 109
>>>>>   XMLFuzzyHierarchicalParseState.noteEndTagEx(String, String, String) line: 110
>>>>>   XMLFuzzyHierarchicalParseState(XMLFuzzyParseState).noteEndTag(String) line: 131
>>>>>   XMLFuzzyHierarchicalParseState(TagParseState).dealWithCharacter(char) line: 755
>>>>>   XMLFuzzyHierarchicalParseState(SingleCharacterReceiver).dealWithCharacters(Reader) line: 51
>>>>>   DecodingByteReceiver.dealWithBytes(InputStream) line: 48
>>>>>   BOMEncodingDetector.dealWithRemainder(byte[], int, int, InputStream) line: 248
>>>>>   BOMEncodingDetector(SingleByteReceiver).dealWithBytes(InputStream) line: 52
>>>>>   Parser.parseWithCharsetDetection(String, InputStream, CharacterReceiver) line: 82
>>>>>   RSSConnector.handleRSSFeedSAX(String, IProcessActivity, RSSConnector$Filter) line: 3481
>>>>>   RSSConnector.processDocuments(String[], String[], IProcessActivity, DocumentSpecification, boolean[], int) line: 1256
>>>>>   WorkerThread.run() line: 559
>>>>>
>>>>>
>>>>> I've tested this with MCF 1.3 AND 1.4 (from tag). The same error
>>>>> occurs with both versions.
>>>>>
>>>>> Benjamin
>>>>>
>>>>>
>>>>> 2013/10/30 Karl Wright <daddywri@gmail.com>
>>>>>
>>>>>> Hi Benjamin,
>>>>>>
>>>>>> It may be malformed XML that we don't treat properly.  If the log has
>>>>>> a full stack trace, that would be very helpful.  If not, can you get one?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> Sent from my Windows Phone
>>>>>> ------------------------------
>>>>>> From: Benjamin Brandmeier
>>>>>> Sent: 10/30/2013 6:51 AM
>>>>>> To: user@manifoldcf.apache.org
>>>>>> Subject: RSS Crawl -> NullPointerException
>>>>>>
>>>>>>  Hi everyone,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm facing a problem with the RSS connector. The feed I'm crawling is
>>>>>> --> http://blog.fme.de/feed
>>>>>>
>>>>>> An NPE occurs at processing time. After some debugging I've found out
>>>>>> the following:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Variable previousContext is null in method --> public final void
>>>>>> endElement(String namespace, String localName, String qName)
>>>>>>
>>>>>> Parameter qName is content:encoded, but there are many tags like this
>>>>>> in the feed, so I'm not sure at which point the error occurs.
>>>>>>
>>>>>> The variable previousContext (= null) is written to currentContext. As
>>>>>> the stack trace shows, the error happens at
>>>>>> org.apache.manifoldcf.core.fuzzyml.XMLFuzzyHierarchicalParseState.cleanup(XMLFuzzyHierarchicalParseState.java:86).
>>>>>>
>>>>>> At this point currentContext.cleanup(); is called with currentContext
>>>>>> = null.
>>>>>>
>>>>>>
>>>>>>
>>>>>> manifoldcf.log is attached.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Any thoughts on this? I tried different settings regarding dechromed
>>>>>> content.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Benjamin
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
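
For readers following the diagnosis in the original report, here is a minimal Java sketch of the failure mode described there: a parser keeps a chain of contexts, an end tag replaces the current context with its predecessor, and a later cleanup call dereferences the current context. If an end tag arrives while the outermost context is current, the predecessor is null and the cleanup call throws. The sketch assumes that an unmatched end tag is one way to reach that state; the thread does not confirm what the feed actually contained. ParseContext and ContextStack are hypothetical names for illustration only. They are not ManifoldCF classes, and the guard shown is not necessarily the fix that was committed to trunk.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parsing context; NOT ManifoldCF's XMLParsingContext. */
class ParseContext {
    final String tagName;
    final ParseContext previous; // null only for the outermost context

    ParseContext(String tagName, ParseContext previous) {
        this.tagName = tagName;
        this.previous = previous;
    }
}

/** Hypothetical context stack showing a defensive guard against popping past the root. */
class ContextStack {
    private ParseContext current = new ParseContext("<root>", null);
    private final List<String> ignoredEndTags = new ArrayList<>();

    /** Pushes a new context for an opening tag. */
    void startElement(String qName) {
        current = new ParseContext(qName, current);
    }

    /** Pops one context, but never replaces the outermost context with null. */
    void endElement(String qName) {
        if (current.previous == null) {
            // Unmatched end tag: record it instead of setting current to null.
            ignoredEndTags.add(qName);
            return;
        }
        current = current.previous;
    }

    /** Unwinds any contexts still open; safe because current is never null. */
    int cleanup() {
        int unwound = 0;
        while (current.previous != null) {
            current = current.previous;
            unwound++;
        }
        return unwound;
    }

    String currentTag() {
        return current.tagName;
    }

    List<String> ignoredEndTags() {
        return ignoredEndTags;
    }
}
```

With this guard, a stray closing content:encoded tag arriving at the outermost level is recorded and skipped, and a later cleanup() call has a non-null context to work with. Whether skipping, logging, or aborting is the right reaction to such input is a design decision this sketch does not settle.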
