any23-dev mailing list archives

From "Kunal P (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ANY23-154) Not able to extract microdata in few test cases
Date Thu, 02 May 2013 13:14:15 GMT

    [ https://issues.apache.org/jira/browse/ANY23-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647504#comment-13647504 ]

Kunal P commented on ANY23-154:
-------------------------------

In MicrodataParser.java, at line 166, we found the following snippet. It seems to be
what excludes the various itemscopes in the attached file (neeraj.nowfloats.com.htm)
from the detected top-level items.
Is this check really necessary?

if (!isItemProp(itemScope)) {
	topLevelItemScopes.add(itemScope);
}
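
To illustrate the effect of that check, here is a minimal sketch that uses only the JDK DOM
API, not Any23 itself. The class TopLevelItemScopes and its methods are hypothetical names
made up for this example, and the "relaxed" rule is only one possible way to loosen the
check, not something Any23 provides. The strict mode mirrors the snippet above (any
itemscope that also has itemprop is skipped); the relaxed mode skips such an element only
when it actually has an itemscope ancestor. Note that the input must be well-formed XML
here, so itemscope is written as itemscope="".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; this is NOT the Any23 implementation.
final class TopLevelItemScopes {

    /** Returns true if any ancestor element of e carries an itemscope attribute. */
    public static boolean hasItemScopeAncestor(Element e) {
        for (Node p = e.getParentNode(); p != null; p = p.getParentNode()) {
            if (p instanceof Element && ((Element) p).hasAttribute("itemscope")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Collects the id attributes of elements treated as top-level item scopes.
     * strict (relaxed == false): skip every itemscope element that also has itemprop.
     * relaxed (relaxed == true): skip it only if it also has an itemscope ancestor.
     */
    public static List<String> topLevelIds(Document dom, boolean relaxed) {
        List<String> ids = new ArrayList<String>();
        NodeList all = dom.getElementsByTagName("*"); // all elements, document order
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (!e.hasAttribute("itemscope")) {
                continue;
            }
            boolean isItemProp = e.hasAttribute("itemprop");
            boolean skipped = relaxed ? (isItemProp && hasItemScopeAncestor(e)) : isItemProp;
            if (!skipped) {
                ids.add(e.getAttribute("id"));
            }
        }
        return ids;
    }

    /** Parses a well-formed XML/XHTML string into a DOM document. */
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div id=\"someid\" itemprop=\"offer\" itemscope=\"\""
                + " itemtype=\"http://schema.org/Offer\"><div/></div>"
                + "</body></html>";
        Document dom = parse(xhtml);
        System.out.println("strict:  " + topLevelIds(dom, false)); // prints "strict:  []"
        System.out.println("relaxed: " + topLevelIds(dom, true));  // prints "relaxed: [someid]"
    }
}
```

With the strict rule the orphaned <div id="someid"> is dropped, which matches what we see
with the attached file; with the relaxed rule it would be reported as a top-level item.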

                
> Not able to extract microdata in few test cases
> -----------------------------------------------
>
>                 Key: ANY23-154
>                 URL: https://issues.apache.org/jira/browse/ANY23-154
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.7.0
>         Environment: Windows 7 32bit
> JDK 1.6.0_38
> Intel Core 2 duo and 4GB RAM
>            Reporter: Kunal P
>             Fix For: 0.9.0
>
>         Attachments: neeraj.nowfloats.com.htm, XOYRVIbK.part
>
>
> We are using the Apache Any23 API to extract microdata from web pages as part of an
> internal project.
> In some test cases the API is not able to parse the microdata, for example:
> www.neeraj.nowfloats.com (the web page does not follow schema.org standards strictly).
> Here is a snippet of the HTML code:
> <div id="someid" itemprop="offer" itemscope itemtype="http://schema.org/Offer">
>   <div ... ></div>
> </div>
> Because the tag carries itemscope as well as itemprop, it is marked up as a child of
> some parent microdata item. However, the given <div id="someid"> tag has no parent
> microdata item.
> The code used for extracting ItemScopes is as follows:
> import org.apache.any23.extractor.microdata.ItemScope;
> import org.apache.any23.extractor.microdata.MicrodataParser;
> import org.apache.any23.extractor.microdata.MicrodataParserReport;
> import org.w3c.dom.Document;
> Document dom = getDomDocument(html);
> MicrodataParserReport report = MicrodataParser.getMicrodata(dom);
> ItemScope[] items = report.getDetectedItemScopes();
> Here, items does not contain any ItemScope for the test case above.
> In such a scenario, how can we extract microdata from the page using the Any23 API?
> Is there any way to relax the criterion that itemprop and itemscope must not appear in
> the same tag, so that we get the data from the web page?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
