any23-user mailing list archives

From Bianca Pereira <>
Subject Re: Extracting Blank Nodes instead of IRIs
Date Mon, 14 Jul 2014 09:55:10 GMT

2014-07-10 21:08 GMT+01:00 Lewis John Mcgibbney <>:

> Hi Bianca,
> I cannot reproduce this... The output I get from the webpage serialized as
> JSON for reading purposes is as follows:
> As you can see there are no blank nodes
> being included as the subject relationship.
> This being said, I DO know what you mean, as I've encountered this before
> and find the information about a blank node quite irrelevant, if I am honest.

 In order to reproduce this specific case I used the following commands:


  ./apache-any23-core-1.0/bin/rover -f ntriples -o index.html?ref_=fn_al_tt_4.nt index.html?ref_=fn_al_tt_4

>> It seems that in this specific case I could use the content from the
>> property */Person/url* as the unique identifier (*IRI*) for the entity.
>> I suppose it is not a problem of the extractor but on how the page was
>> created. But as many people are using I was wondering if
>> there is any solution for this case. I would be very glad if someone has
>> any idea of a solution.
I tried to look into another website (Rotten Tomatoes) and I found the same

 Again, IMHO, the url could be used as the subject of the triples. I am not
sure whether this holds for all triples on all websites, but in these examples
it seems to work fine. Here is one example from the webpage:

_:nodecfcd208495d565ef66e7dff9f98764da <> "Sex Tape (2014)"@en .
_:nodecfcd208495d565ef66e7dff9f98764da <> "R"@en .
_:nodecfcd208495d565ef66e7dff9f98764da <> "Jul 18, 2014 Wide"@en .
_:nodecfcd208495d565ef66e7dff9f98764da <> <> .
_:nodef4501543ed78d92c8615458a688986 <> <>
*_:nodef4501543ed78d92c8615458a688986 < <>> "Cameron Diaz"@en .*
*_:nodef4501543ed78d92c8615458a688986 < <>> .*
*_:nodef4501543ed78d92c8615458a688986 < <file:./sex_tape_2014//celebrity/cameron_diaz/> .*
_:nodecfcd208495d565ef66e7dff9f98764da <> _:nodef4501543ed78d92c8615458a688986 .
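
The idea of using the url as the subject can be sketched as a post-processing
step over the extracted triples. This is only a rough sketch under stated
assumptions, not something Any23 does: the function name
promote_url_to_subject is made up, the triples are plain Python tuples rather
than real parser output, and the predicate IRIs (including URL_PREDICATE) are
placeholders, because the actual IRIs are not visible in the example above.

```python
# Placeholder predicate IRI (an assumption): the real one is not shown above.
URL_PREDICATE = "<http://example.org/vocab/url>"


def promote_url_to_subject(triples, url_predicate=URL_PREDICATE):
    """Replace each blank node that has a url property with that IRI."""
    # Map blank node label -> IRI from its url property (first one wins).
    mapping = {}
    for s, p, o in triples:
        if s.startswith("_:") and p == url_predicate and o.startswith("<"):
            mapping.setdefault(s, o)

    # Rewrite subject and object positions; nodes without a url stay blank.
    return [(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in triples]


# Toy data loosely modelled on the example; all IRIs are placeholders.
triples = [
    ("_:movie", "<http://example.org/vocab/name>", '"Sex Tape (2014)"@en'),
    ("_:actor", "<http://example.org/vocab/name>", '"Cameron Diaz"@en'),
    ("_:actor", URL_PREDICATE, "<http://example.org/celebrity/cameron_diaz/>"),
    ("_:movie", "<http://example.org/vocab/actors>", "_:actor"),
]

for s, p, o in promote_url_to_subject(triples):
    print(s, p, o, ".")
```

In this sketch a blank node that has no url property (the movie node here)
stays blank, so only entities that actually declare a url are renamed.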

>  Correct, this is NOT a problem with the extractor at all.
> What I think you are suggesting is possibly a *better* way for us to have a
> fallback value for blank nodes, like the one you provided in your example.
> Is this a fair statement for me to make?

I don't know if it is a better way or not. Actually I was hoping that
someone could tell me whether it is a reasonable idea =) This is the
first time I have really worked with data that is not already in triple format.

> If this is true then it would be a case of adding functionality to the
> existing html-rdfa11 or html-head-title extractor (whichever one was used
> in this particular case). I would ask you to log a Jira issue and possibly
> explain what it is that you intend to add... we can certainly work towards
> addressing it and I will help you on this no reservations.

Sorry for my ignorance, but I don't know which extractor was used =/ I just ran
rover and asked for the output in ntriples format. How can I find out which
extractor was used?

> BTW, as I write I am thinking... is this fallback value kind of
> *falsifying* the node relationships? I mean the page is what it is... if we
> use the fallback value then I feel we are kind of manipulating the
> relationships within the page! Does this make sense? Is this a valid point
> I am making?
> Thanks
> Lewis

