any23-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ANY23-137) RDFa parser implementation proposal
Date Tue, 29 Jan 2013 22:07:13 GMT

    [ https://issues.apache.org/jira/browse/ANY23-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565867#comment-13565867
] 

Lewis John McGibbney commented on ANY23-137:
--------------------------------------------

Hi Lev,
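I've come across another issue with the existing html-rdfa11 Extractor implementation
and have attached the file.
For reference, here are the log report and the output for http://stanford.edu/. The
html-rdfa11 extractor reports three "Cannot map prefix 'width'" warnings.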
{code}
<response>
<extractors>
  <extractor>html-head-title</extractor>
  <extractor>html-mf-hcard</extractor>
  <extractor>html-mf-adr</extractor>
  <extractor>html-rdfa11</extractor>
</extractors>
<report>
  <message/>
  <error/>
  <issueReport>
    <extractorIssues extractor="html-rdfa11">
      <issue level="Warning" row="202" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[1]/P[1]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue>
      <issue level="Warning" row="204" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[1]/P[2]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue>
      <issue level="Warning" row="208" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[2]/P[1]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue>
    </extractorIssues>
  </issueReport>
  <validationReport>
    <errors>
    </errors>
    <ruleActivations>
    </ruleActivations>
    <issues>
    </issues>
  </validationReport>
</report>
<data>
# OUTPUT FORMAT: Turtle (mimeTypes=text/turtle, application/x-turtle; ext=ttl)
# BEGIN: ExtractionContext(urn:x-any23:html-mf-adr:root-extraction-result-id:http://stanford.edu/)
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
# BEGIN: ExtractionContext(urn:x-any23:html-mf-adr:1:http://stanford.edu/)
# BEGIN: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:http://stanford.edu/)
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://stanford.edu/> dcterms:title "Stanford University"@en .

_:noded01df813432682e65b842257f3757e9 a vcard:Address ;
	vcard:locality "450 Serra Mall, Stanford" ;
	vcard:region "CA" ;
	vcard:postal-code "94305" .
# BEGIN: ExtractionContext(urn:x-any23:html-mf-hcard:root-extraction-result-id:http://stanford.edu/)

_:node68324ba1f68fb1712ae267fe33274 vcard:fn "Stanford University" ;
	vcard:n _:node17eprgndbx338343 .

_:node17eprgndbx338343 a vcard:Name ;
	vcard:given-name "Stanford" ;
	vcard:family-name "University" .

_:node68324ba1f68fb1712ae267fe33274 vcard:org _:node17eprgndbx338344 .

_:node17eprgndbx338344 a vcard:Organization ;
	vcard:organization-name "Stanford University" .

_:node68324ba1f68fb1712ae267fe33274 vcard:adr _:noded01df813432682e65b842257f3757e9 ;
	vcard:tel <tel:(650)%20723-2300> .
# BEGIN: ExtractionContext(urn:x-any23:html-mf-hcard:1:http://stanford.edu/)

_:node68324ba1f68fb1712ae267fe33274 a vcard:VCard .
# BEGIN: ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:http://stanford.edu/)

<http://stanford.edu/> <http://stanford.edu/alternate> <http://news.stanford.edu/rss/index.xml> .

<http://stanford.edu/css/layout.css?v=3.0> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .

<http://stanford.edu/css/homepage.css?v=3.1> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .

<http://stanford.edu/css/jquery.fancybox.css?v=2.0.5> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .

<http://stanford.edu/css/mobile.css> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .

<https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,600,700> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .

<https://fonts.googleapis.com/css?family=Crimson+Text:400,600,700> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
# END: ExtractionContext(urn:x-any23:html-mf-adr:1:http://stanford.edu/)
# END: ExtractionContext(urn:x-any23:html-mf-adr:root-extraction-result-id:http://stanford.edu/)
# END: ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:http://stanford.edu/)
# END: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:http://stanford.edu/)
# END: ExtractionContext(urn:x-any23:html-mf-hcard:root-extraction-result-id:http://stanford.edu/)
# END: ExtractionContext(urn:x-any23:html-mf-hcard:1:http://stanford.edu/)
</data></response>
{code}


                
> RDFa parser implementation proposal
> -----------------------------------
>
>                 Key: ANY23-137
>                 URL: https://issues.apache.org/jira/browse/ANY23-137
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 0.8.0
>            Reporter: Lev Khomich
>            Priority: Minor
>             Fix For: 0.8.0
>
>         Attachments: rdfa-extractor-proposal.patch
>
>
> As a follow up to discussion [1].
> I've implemented another RDFa extractor for Any23 (0.7.1).
> Proposed code depends on semargl project [2]. It isn't published in maven
> central, therefore I didn't change any poms.
> Still not quite sure about class name (because related ones are already taken),
> feel free to rename it. See attachments for patch with extractor and tests.
> [1] http://mail-archives.apache.org/mod_mbox/any23-dev/201212.mbox/browser
> [2] http://semarglproject.org

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
