any23-user mailing list archives

From Frank Apap <fsa...@gmail.com>
Subject Re: Processing Recipes
Date Thu, 17 Dec 2015 18:40:42 GMT
I made some progress but am still having issues.

The URL I'm testing is:

http://www.yummly.com/recipe/Whole-Grain-Breakfast-Pitas-1022573

My function is:


public static void parseMicroData(RawItemInfo info, TagSoupParser tagSoupParser)
        throws Exception {
    try {
        ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
        // printDocument(tagSoupParser.getDOM(), System.out);
        MicrodataParser.getMicrodataAsJSON(
                tagSoupParser.getValidatedDOM(true).getDocument(),
                new PrintStream(byteArrayOutput));
        String result = byteArrayOutput.toString("UTF-8");
        logger.info(result);
        parseSchemaRecipe(info, result);
    } catch (Exception e) {
        logger.error("Error with processing " + info.getId(), e);
    }
}

This mostly works; however, not all elements of the nested NutritionInfo
structure are coming back.  Any tips on how to debug?  The site's microdata
looks valid.
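
One way to narrow this down is to check whether the missing properties are
already absent from the validated DOM, before MicrodataParser ever sees them.
The following is only a minimal sketch using the JDK's org.w3c.dom API: the
class name ItempropDump, its methods, and the sample markup are hypothetical,
and it ignores itemref, so it is not a full microdata implementation. In the
function above, the Document passed in would be
tagSoupParser.getValidatedDOM(true).getDocument().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class ItempropDump {

    /** Returns one line per element carrying itemprop, indented by itemscope nesting depth. */
    public static List<String> listItemprops(Document doc) {
        List<String> lines = new ArrayList<>();
        walk(doc.getDocumentElement(), 0, lines);
        return lines;
    }

    private static void walk(Element el, int depth, List<String> out) {
        if (el.hasAttribute("itemprop")) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                sb.append("  ");
            }
            sb.append(el.getAttribute("itemprop"));
            if (el.hasAttribute("itemscope")) {
                // A nested item: record its declared type next to the property name.
                sb.append(" [").append(el.getAttribute("itemtype")).append("]");
            }
            out.add(sb.toString());
        }
        // Children of an itemscope element belong to that item, so indent them one level.
        int childDepth = el.hasAttribute("itemscope") ? depth + 1 : depth;
        for (Node n = el.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                walk((Element) n, childDepth, out);
            }
        }
    }

    /** Helper for the demo only: parses well-formed XML into a DOM Document. */
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not taken from the yummly page.
        String html = "<div itemscope='itemscope' itemtype='http://schema.org/Recipe'>"
                + "<span itemprop='name'>Pitas</span>"
                + "<div itemprop='nutrition' itemscope='itemscope'"
                + " itemtype='http://schema.org/NutritionInformation'>"
                + "<span itemprop='calories'>200</span>"
                + "<span itemprop='fatContent'>5 g</span>"
                + "</div></div>";
        for (String line : listItemprops(parse(html))) {
            System.out.println(line);
        }
    }
}
```

If a property shows up in this listing but not in the JSON, the problem is
more likely in the microdata extraction step; if it is missing here too, the
DOM validation/fix step or the page markup itself is the place to look.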

On Tue, Dec 8, 2015 at 12:49 PM, Frank Apap <fsa317@gmail.com> wrote:

> This URL is causing me problems as well:
>
>
> http://www.yummly.com/recipe/Gabis-Low-Carb-Yeast-Bread-1073667?columns=4&position=1%2F74
>
> It appears to define a schema.org recipe but fails.
>
>
>
> On Tue, Dec 8, 2015 at 9:49 AM, Frank Apap <fsa317@gmail.com> wrote:
>
>> I tried -
>> http://www.food.com/recipe/crock-pot-chicken-with-black-beans-cream-cheese-89204
>> but I get the same error on the site as well.  Not sure what sites will
>> work properly.
>>
>> On Mon, Dec 7, 2015 at 10:52 PM, Lewis John Mcgibbney <
>> lewis.mcgibbney@gmail.com> wrote:
>>
>>> Hi Frank,
>>>
>>> Answer below
>>>
>>> On Mon, Dec 7, 2015 at 3:50 PM, <user-digest-help@any23.apache.org>
>>> wrote:
>>>
>>>>
>>>> Hi, I'm trying to process recipes that are marked up, one example of
>>>> such a recipe is:
>>>>
>>>> http://allrecipes.com/recipe/203229/moms-buttermilk-pancakes/
>>>>
>>>> This page can be processed by Google Rich Snippets, but when I try the
>>>> following it doesn't return results:
>>>>
>>>> any23 rover -e html-mf-hrecipe
>>>> http://allrecipes.com/recipe/203229/moms-buttermilk-pancakes/
>>>>
>>>> Using the following I get json results but they are generic (not recipe
>>>> specific):
>>>>
>>>> sudo any23 rover -e html-microdata
>>>> http://allrecipes.com/recipe/203229/moms-buttermilk-pancakes/
>>>>
>>>> Am I missing something?  My ultimate goal is to get the recipe into a
>>>> Java object. What would be the best way to do that?
>>>>
>>>>
>>> When I try this with the Any23 service at any23.org (running off
>>> Any23 trunk), I get the following error. Do you have another page we can
>>> try?
>>> Thanks
>>>
>>> org.apache.any23.extractor.ExtractionException: Error while parsing RDF document.
>>> 	at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:109)
>>> 	at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:41)
>>> 	at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:463)
>>> 	at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:255)
>>> 	at org.apache.any23.Any23.extract(Any23.java:298)
>>> 	at org.apache.any23.Any23.extract(Any23.java:450)
>>> 	at org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:114)
>>> 	at org.apache.any23.servlet.Servlet.doGet(Servlet.java:79)
>>> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:618)
>>> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:725)
>>> 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:301)
>>> 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>> 	at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
>>> 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239)
>>> 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>> 	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
>>> 	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106)
>>> 	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:503)
>>> 	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:136)
>>> 	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:74)
>>> 	at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:610)
>>> 	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
>>> 	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:526)
>>> 	at org.apache.coyote.ajp.AbstractAjpProcessor.process(AbstractAjpProcessor.java:794)
>>> 	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:652)
>>> 	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1575)
>>> 	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1533)
>>> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> 	at java.lang.Thread.run(Thread.java:745)
>>> Caused by: org.openrdf.rio.RDFParseException: org.xml.sax.SAXParseException; lineNumber: 11; columnNumber: 788; Element type "n.length" must be followed by either attribute specifications, ">" or "/>".
>>> 	at org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:111)
>>> 	at org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:95)
>>> 	at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:105)
>>> 	... 29 more
>>> Caused by: org.semarglproject.rdf.ParseException: org.xml.sax.SAXParseException; lineNumber: 11; columnNumber: 788; Element type "n.length" must be followed by either attribute specifications, ">" or "/>".
>>> 	at org.semarglproject.rdf.rdfa.RdfaParser.processException(RdfaParser.java:1130)
>>> 	at org.semarglproject.source.XmlSource.process(XmlSource.java:50)
>>> 	at org.semarglproject.source.StreamProcessor.processInternal(StreamProcessor.java:87)
>>> 	at org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:167)
>>> 	at org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:154)
>>> 	at org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:109)
>>> 	... 31 more
>>> Caused by: org.xml.sax.SAXParseException; lineNumber: 11; columnNumber: 788; Element type "n.length" must be followed by either attribute specifications, ">" or "/>".
>>> 	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>>> 	at org.semarglproject.source.XmlSource.process(XmlSource.java:48)
>>> 	... 35 more
>>>
>>>
>>>
>>
>
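
A note on the SAXParseException in the quoted trace: a message of the form
Element type "n.length" must be followed by either attribute specifications
is what a strict XML parser typically reports when it meets an unescaped "<"
inside inline JavaScript, for example a loop condition such as i<n.length.
That this is what happens on the allrecipes page is an assumption; the thread
does not confirm it. The sketch below reproduces the message with the JDK's
own SAX parser on a hypothetical snippet (the class and method names are
illustrative only).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictXmlOnHtml {

    /** Parses the markup as strict XML; returns the parse error message, or null if it is well-formed. */
    public static String strictParseError(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page fragment: valid HTML, but "<n.length" looks like a start tag to an XML parser.
        String html = "<html><head><script>"
                + "for (var i = 0; i<n.length; i++) {}"
                + "</script></head></html>";
        System.out.println(strictParseError(html));
    }
}
```

If that is the cause, the failure comes from the RDFa extractor reading the
page as XML, not from the recipe microdata itself.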
