any23-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ANY23-314) Service fails to return extraction in case of extraction error
Date Tue, 12 Dec 2017 22:02:00 GMT

    [ https://issues.apache.org/jira/browse/ANY23-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288344#comment-16288344 ]

ASF GitHub Bot commented on ANY23-314:
--------------------------------------

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/49
  
    When a parse and/or extraction error occurs and the extraction is therefore unsuccessful, the servlet now returns the following result.
    As you can see, the partial extraction is now included at the bottom of the servlet response, which is much better, i.e. more forgiving, than a plain stack trace and error message.
    ```
    Failed to fully parse input. The extraction result, at the bottom of this response, if any, will contain extractions only up until the extraction error.
    ================================================================
    
    ------------ BEGIN Exception context ------------
    ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:http://any23.apache.org/)
    Errors {
    	ERROR: 	'The entity "copy" was referenced, but not declared.' 	(-1,-1)
    }
    ------------ END   Exception context ------------
    
    org.apache.any23.extractor.ExtractionException: Error while parsing RDF document.
    	at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:109)
    	at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:41)
    	at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:467)
    	at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:256)
    	at org.apache.any23.Any23.extract(Any23.java:300)
    	at org.apache.any23.Any23.extract(Any23.java:452)
    	at org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:117)
    	at org.apache.any23.servlet.Servlet.doGet(Servlet.java:82)
    	at javax.servlet.http.HttpServlet.service(HttpServlet.java:624)
    	at javax.servlet.http.HttpServlet.service(HttpServlet.java:731)
    	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
    	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    	at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:218)
    	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    	at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:442)
    	at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1083)
    	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:640)
    	at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    	at java.lang.Thread.run(Thread.java:748)
    Caused by: org.eclipse.rdf4j.rio.RDFParseException: org.xml.sax.SAXParseException; lineNumber: 306; columnNumber: 55; The entity "copy" was referenced, but not declared.
    	at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:111)
    	at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:95)
    	at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:105)
    	... 29 more
    Caused by: org.semarglproject.rdf.ParseException: org.xml.sax.SAXParseException; lineNumber: 306; columnNumber: 55; The entity "copy" was referenced, but not declared.
    	at org.semarglproject.rdf.rdfa.RdfaParser.processException(RdfaParser.java:1141)
    	at org.semarglproject.source.XmlSource.process(XmlSource.java:50)
    	at org.semarglproject.source.StreamProcessor.processInternal(StreamProcessor.java:87)
    	at org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:167)
    	at org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:154)
    	at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:109)
    	... 31 more
    Caused by: org.xml.sax.SAXParseException; lineNumber: 306; columnNumber: 55; The entity "copy" was referenced, but not declared.
    	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    	at org.semarglproject.source.XmlSource.process(XmlSource.java:48)
    	... 35 more
    ================================================================
    <data>
    <![CDATA[
    @prefix sindice: <http://vocab.sindice.net/> .
    
    <http://any23.apache.org/> <http://vocab.sindice.net/any23#Date-Revision-yyyymmdd> "20171101"@en ;
    	<http://vocab.sindice.net/any23#Content-Language> "en"@en ;
    	<http://vocab.sindice.net/any23#viewport> "width=device-width, initial-scale=1.0"@en ;
    	<http://vocab.sindice.net/any23#author> "The Apache Software Foundation"@en .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    
    <http://any23.apache.org/> dcterms:title "Apache Any23 – Apache Any23 - Introduction"@en .
    ]]>
    </data>
    
    ```
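
The response layout shown above (a notice, the stack trace between separator lines, then the partial result inside a CDATA section) could be assembled roughly as follows. This is only a minimal sketch using the JDK alone: the class `PartialExtractionResponse` and its `format` method are hypothetical names and are not taken from the Any23 servlet code, and the splitting of a literal `]]>` in the payload is an added precaution that the pull request may or may not apply.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical helper for illustration; not part of the Any23 code base. */
final class PartialExtractionResponse {

    private static final String SEPARATOR =
            "================================================================";

    /**
     * Builds a plain-text body: a notice, the stack trace, and then whatever
     * output was produced before the failure, wrapped in a CDATA section.
     */
    static String format(Throwable error, String partialOutput) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        out.println("Failed to fully parse input. The extraction result, at the bottom of "
                + "this response, if any, will contain extractions only up until the extraction error.");
        out.println(SEPARATOR);
        error.printStackTrace(out);
        out.println(SEPARATOR);
        if (partialOutput != null && !partialOutput.isEmpty()) {
            out.println("<data>");
            out.println("<![CDATA[");
            // A literal "]]>" in the payload would close the CDATA section early, so split it.
            out.println(partialOutput.replace("]]>", "]]]]><![CDATA[>"));
            out.println("]]>");
            out.println("</data>");
        }
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: one triple that was emitted before the parser failed.
        String partial = "<http://any23.apache.org/> <http://purl.org/dc/terms/title> \"Apache Any23\"@en .";
        Exception failure = new IllegalStateException("Error while parsing RDF document.");
        System.out.print(format(failure, partial));
    }
}
```

The point of the layout is that a client can still locate the `<data>` element and use the partial result, while a human reader sees the failure first.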


> Service fails to return extraction in case of extraction error
> --------------------------------------------------------------
>
>                 Key: ANY23-314
>                 URL: https://issues.apache.org/jira/browse/ANY23-314
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: service
>    Affects Versions: 2.1
>         Environment: Any23 2.2-SNAPSHOT
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 2.2
>
>         Attachments: extraction.json, output.log
>
>
> See the following command line extraction
> {code}
> lmcgibbn@LMC-056430 /usr/local/any23(master) $ ./cli/target/appassembler/bin/any23 rover -l output.log -o extraction.json https://www.jobcluster.de
> ------------------------------------------------------------------------
> Apache Any23 :: rover
> ------------------------------------------------------------------------
> 0    [main] WARN  org.apache.tika.parser.image.ImageParser  - JBIG2ImageReader not loaded. jbig2 files will be ignored
> 128  [main] INFO  org.apache.any23.rdf.PopularPrefixes  - Loading prefixes from /org/apache/any23/prefixes/prefixes.properties
> 1388 [main] WARN  org.apache.commons.httpclient.HttpMethodBase  - Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.
> 4790 [main] INFO  org.apache.any23.extractor.SingleDocumentExtraction  - Processing https://www.jobcluster.de/
> [Fatal Error] :12:46: The entity name must immediately follow the '&' in the entity reference.
> ------------------------------------------------------------------------
> Apache Any23 FAILURE
> Execution terminated with errors: Error while parsing RDF document.
> Total time: 5s
> Finished at: Tue Dec 12 08:01:14 PST 2017
> Final Memory: 31M/184M
> ------------------------------------------------------------------------
> {code}
> This results in the attached extraction result (extraction.json) and associated log (output.log).
> If I attempt to run the same extraction using the service at [any23.org|http://any23.org/any23/?format=json&uri=https%3A%2F%2Fwww.jobcluster.de%2F&validation-mode=none], the (partial) extraction result should be returned regardless of whether the entire extraction was successful or not.
> The service servlet seems to be returning the extraction Exception as opposed to the preferred extraction result. This issue will fix that.
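
The idea of returning whatever was extracted before the failure can be illustrated with a streaming parser from the JDK. The sketch below is not Any23 code: `PartialParseDemo`, `Outcome`, and `parse` are hypothetical names, and the input is an assumed stand-in for a page that uses an HTML entity (`&copy;`) that an XML parser does not know. It records each element as the SAX parser reports it and keeps that list even when the parse is aborted by an error.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class; it only uses the JDK's built-in SAX parser. */
final class PartialParseDemo {

    /** What was seen before the parse ended, plus the error that ended it (null if none). */
    static final class Outcome {
        final List<String> elements = new ArrayList<>();
        SAXException error;
    }

    static Outcome parse(String xml) throws Exception {
        Outcome outcome = new Outcome();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Events arrive as the document is scanned, so this list fills up incrementally.
                outcome.elements.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Keep the partial result instead of discarding it; just remember why parsing stopped.
            outcome.error = e;
        }
        return outcome;
    }

    public static void main(String[] args) throws Exception {
        Outcome outcome = parse("<html><body><p>ok</p><p>&copy; 2017</p></body></html>");
        System.out.println("elements seen: " + outcome.elements);
        System.out.println("error: " + (outcome.error == null ? "none" : outcome.error.getMessage()));
    }
}
```

A caller can then decide to hand back `outcome.elements` together with the error, which mirrors the behaviour requested for the service.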



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
