any23-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ANY23-154) Not able to extract microdata in few test cases
Date Thu, 28 Mar 2013 22:13:15 GMT

     [ https://issues.apache.org/jira/browse/ANY23-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated ANY23-154:
---------------------------------------

    Attachment: XOYRVIbK.part
                neeraj.nowfloats.com.htm

I attach the source HTML and a report from any23.org (any23-0.7.0-incubating) showing that,
in this case, no microdata extractors are called for the markup.
This is still an open issue: we need to determine why the microdata extractors do not recognize
the embedded structure and are never called to parse it.
                
> Not able to extract microdata in few test cases
> -----------------------------------------------
>
>                 Key: ANY23-154
>                 URL: https://issues.apache.org/jira/browse/ANY23-154
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.7.0
>         Environment: Windows 7 32bit
> JDK 1.6.0_38
> Intel Core 2 duo and 4GB RAM
>            Reporter: Kunal P
>             Fix For: 0.9.0
>
>         Attachments: neeraj.nowfloats.com.htm, XOYRVIbK.part
>
>
> We are using the Apache Any23 API to extract microdata from a given web page as part
> of an internal project.
> We have some test cases where the API is not able to parse the microdata, for example
> www.neeraj.nowfloats.com (the web page does not strictly follow schema.org standards).
> Here is a snippet of the HTML code:
> <div id="someid" itemprop="offer" itemscope itemtype="http://schema.org/Offer">
>   <div ... ></div>
> </div>
> Because the tag carries both itemscope and itemprop, it is marked up as a property of
> some parent microdata item. However, the given <div id="someid"> tag has no parent
> microdata item.
> The method used for extracting ItemScopes is as follows:
> import org.w3c.dom.Document;
> import org.apache.any23.extractor.microdata.ItemScope;
> import org.apache.any23.extractor.microdata.MicrodataParser;
> import org.apache.any23.extractor.microdata.MicrodataParserReport;
> // getDomDocument(String html) is our own helper that parses the HTML into a DOM
> Document dom = getDomDocument(html);
> MicrodataParserReport report = MicrodataParser.getMicrodata(dom);
> ItemScope[] items = report.getDetectedItemScopes();
> Here, items does not contain any ItemScope for the test case above.
> In such a scenario, how can we extract microdata from the page using the Any23 API?
> Is there any way to relax the criterion that itemprop and itemscope do not appear in the
> same tag, so that we get the data from the web page?
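
For triage, it may help to list exactly which elements in a page match the pattern described above: both itemprop and itemscope on the same tag, with no enclosing itemscope element. Under the HTML microdata rules an element carrying itemprop is not a top-level item, which would be consistent with such elements being skipped. The following is only a sketch using the JDK's DOM API, not Any23 code. The class and method names (OrphanItemFinder, findOrphans, hasItemScopeAncestor) are hypothetical, and it assumes the input is well-formed XHTML, so the boolean attribute is written as itemscope="".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical diagnostic helper; not part of the Any23 API.
public class OrphanItemFinder {

    // True if any ancestor element of el carries an itemscope attribute.
    static boolean hasItemScopeAncestor(Element el) {
        for (Node p = el.getParentNode(); p instanceof Element; p = p.getParentNode()) {
            if (((Element) p).hasAttribute("itemscope")) {
                return true;
            }
        }
        return false;
    }

    // Returns the id attribute of every element that has both itemprop and
    // itemscope but no enclosing itemscope element.
    // Assumption: the input is well-formed XHTML.
    static List<String> findOrphans(String xhtml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        List<String> ids = new ArrayList<>();
        NodeList all = dom.getElementsByTagName("*"); // all elements, document order
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            if (el.hasAttribute("itemprop")
                    && el.hasAttribute("itemscope")
                    && !hasItemScopeAncestor(el)) {
                ids.add(el.getAttribute("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        String page = "<body>"
                + "<div id=\"someid\" itemprop=\"offer\" itemscope=\"\""
                + " itemtype=\"http://schema.org/Offer\"><div></div></div>"
                + "</body>";
        System.out.println(findOrphans(page)); // prints [someid]
    }
}
```

Elements reported this way could then be inspected by hand or pre-processed before the DOM is passed to the parser. Whether such pre-processing is appropriate is a design decision for the project, not something this sketch settles.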

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
