any23-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <>
Subject [jira] [Commented] (ANY23-168) RDFa properties in <meta> elements not picked up
Date Wed, 26 Mar 2014 23:48:17 GMT


Lewis John McGibbney commented on ANY23-168:

I've been trying to set the property 'any23.extraction.head.meta' to true, as advised
in our documentation[0], as follows:
    Any23 runner = new Any23();
    DocumentSource source = runner.createDocumentSource("file:" + fileURIString);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    TripleHandler handler = new JSONWriter(baos);
    final ExtractionParameters extractionParameters = ExtractionParameters.newDefault();
    extractionParameters.setFlag("any23.extraction.head.meta", true);
    try {
      runner.extract(extractionParameters, source, handler);
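    } catch (ExtractionException e) {
      // exception handling not shown in this message
    } finally {
      // remainder of the snippet not shown in this message
    }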
but so far, when I am debugging the code, it seems that the SingleDocumentExtraction class
is NOT registering HTMLMetaExtractor for potential extraction.

If you get any further with this, please update this thread; it would be excellent to
get this issue sorted out.

> RDFa properties in <meta> elements not picked up
> ------------------------------------------------
>                 Key: ANY23-168
>                 URL:
>             Project: Apache Any23
>          Issue Type: Bug
>            Reporter: Ruben Verborgh
>              Labels: meta-tags, rdfa
>             Fix For: 1.0.0
> RDFa annotations in <meta> elements are not picked up:
> The Structured Data Testing Tool finds them:
> Additionally, I wonder whether it's a good idea to drop the dcterms:title property extracted
> from <title> if an actual dc:title property is present. This allows for more meaningful
> titles, for instance:
>     <title>HTML Title – Website Name</title>
>     <meta property="dc:title" content="DC Title"/>
> This would make it possible to overcome the common situation in which the HTML <title> also
> contains the website name etc., and so is not suited for a "clean" dc:title. I would thus say
> that an actual dc:title has precedence over an implied dc:title from <title>.
> Furthermore, I'm confused by the double appearance of
> <> dcterms:title "HTML Title – Website Name" .
> <> <> _:nodecfcd208495d565ef66e7dff9f98764da ;
> 	dcterms:title "HTML Title – Website Name" .
> Should the page itself AND some blank node have this dcterms:title? (And what happens
> if the <meta> tags are parsed?)
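
The precedence rule proposed in the quoted issue (an explicit dc:title from a <meta> element
wins over the title implied by <title>) could be sketched roughly as below. This is only an
illustration of the rule, not Any23 code: the class TitlePrecedence and the method chooseTitle
are hypothetical names, and the two input strings are assumed to have already been extracted
from the page by some other means.

```java
import java.util.Optional;

public class TitlePrecedence {

    /**
     * Hypothetical helper (not part of Any23): returns the value that would be
     * emitted as the page title. An explicit dc:title takes precedence; the
     * content of the HTML <title> element is only used as a fallback.
     */
    static Optional<String> chooseTitle(String htmlTitle, String metaDcTitle) {
        if (metaDcTitle != null && !metaDcTitle.trim().isEmpty()) {
            return Optional.of(metaDcTitle.trim());
        }
        if (htmlTitle != null && !htmlTitle.trim().isEmpty()) {
            return Optional.of(htmlTitle.trim());
        }
        // Neither source provided a usable title.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Both present: the explicit dc:title is chosen.
        System.out.println(chooseTitle("HTML Title - Website Name", "DC Title").orElse("(none)"));
        // Only <title> present: it is used as the implied title.
        System.out.println(chooseTitle("HTML Title - Website Name", null).orElse("(none)"));
    }
}
```

With a rule like this, only one title triple would be produced per page, which would also
sidestep the duplicated title statement mentioned in the issue.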

This message was sent by Atlassian JIRA
