any23-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <>
Subject [jira] [Commented] (ANY23-154) Not able to extract microdata in few test cases
Date Thu, 28 Mar 2013 22:21:15 GMT


Lewis John McGibbney commented on ANY23-154:

Can you please check out the related open Microdata issues, namely ANY23-131, ANY23-132 and ANY23-16?

> Not able to extract microdata in few test cases
> -----------------------------------------------
>                 Key: ANY23-154
>                 URL:
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.7.0
>         Environment: Windows 7 32bit
> JDK 1.6.0_38
> Intel Core 2 duo and 4GB RAM
>            Reporter: Kunal P
>             Fix For: 0.9.0
>         Attachments:, XOYRVIbK.part
> We are using the Apache Any23 API to extract microdata from a given web page as part of an internal project.
> We have some test cases where the API is not able to parse the microdata
> (the web page does not follow the standards strictly).
> Here is a snippet of the HTML code:
> <div id="someid" itemprop="offer" itemscope itemtype="">
>   <div ... ></div>
> </div>
> Because the tag carries both itemscope and itemprop, the markup declares this item to be a property of some parent microdata item.
> However, the <div id="someid"> tag has no parent microdata item.
> The method used for extracting ItemScopes is as follows:
> import org.apache.any23.extractor.microdata.ItemScope;
> import org.apache.any23.extractor.microdata.MicrodataParser;
> import org.apache.any23.extractor.microdata.MicrodataParserReport;
> import org.w3c.dom.Document;
> // getDomDocument is our own helper; it parses the HTML string into a DOM.
> Document dom = getDomDocument(html);
> MicrodataParserReport report = MicrodataParser.getMicrodata(dom);
> ItemScope[] items = report.getDetectedItemScopes();
> Here, items does not contain any ItemScope for the test case above.
> In this scenario, how can we extract microdata from the page using the Any23 API?
> Is there any way to relax the criterion that itemprop and itemscope may not appear in the same tag, so that we get the data from the web page?
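
For context, the HTML microdata specification treats an element as a top-level item only when it has itemscope and no itemprop, which would explain why the <div id="someid"> above is not reported. One possible workaround, offered only as a sketch and not as anything Any23 itself provides, is to pre-process the DOM before handing it to the parser and drop itemprop from any itemscope element that has no itemscope ancestor. The class and method names below are hypothetical, and the code assumes the DOM uses lowercase attribute names and that the page does not rely on itemref to attach the item to a parent elsewhere in the document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Any23: relaxes non-standard markup
// before the DOM is passed to a microdata parser.
public class OrphanItempropStripper {

    /** Returns true if any ancestor element of the given element carries itemscope. */
    static boolean hasItemscopeAncestor(Element element) {
        for (Node p = element.getParentNode(); p != null; p = p.getParentNode()) {
            if (p instanceof Element && ((Element) p).hasAttribute("itemscope")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Removes itemprop from every element that has both itemscope and itemprop
     * but no itemscope ancestor, so it reads as a top-level item.
     * Note: itemref is ignored here, which is an assumption of this sketch.
     *
     * @return the number of elements that were modified
     */
    static int stripOrphanItemprops(Document dom) {
        NodeList all = dom.getElementsByTagName("*");
        int stripped = 0;
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("itemscope")
                    && e.hasAttribute("itemprop")
                    && !hasItemscopeAncestor(e)) {
                e.removeAttribute("itemprop");
                stripped++;
            }
        }
        return stripped;
    }

    /** Parses well-formed markup into a DOM; used here only to build sample input. */
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse sample markup", ex);
        }
    }
}
```

With this in place, one would call OrphanItempropStripper.stripOrphanItemprops(dom) on the Document returned by getDomDocument before calling MicrodataParser.getMicrodata(dom). Whether the relaxed markup then yields the expected ItemScope would still need to be checked against the Any23 version in use.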

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators