any23-user mailing list archives

From Bianca Pereira <aivykar...@gmail.com>
Subject Re: Extracting Blank Nodes instead of IRIs
Date Mon, 14 Jul 2014 09:55:10 GMT
Hi,

2014-07-10 21:08 GMT+01:00 Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>:

> Hi Bianca,
>
> I cannot reproduce this. The output I get from the webpage, serialized as
> JSON for readability, is here: http://paste.apache.org/hhim
> As you can see, no blank nodes appear in the subject position.
> That said, I DO know what you mean, as I've encountered this before,
> and I find the information about a blank node quite irrelevant, if I am honest.
>
>

 In order to reproduce this specific case I used the following commands:

  wget http://www.imdb.com/title/tt0286560/?ref_=fn_al_tt_4

  ./apache-any23-core-1.0/bin/rover -f ntriples -o index.html?ref_=fn_al_tt_4.nt index.html?ref_=fn_al_tt_4


>
>> It seems that in this specific case I could use the content from the
>> property */Person/url* as the unique identifier (*IRI*) for the entity.
>> I suppose it is not a problem with the extractor but with how the page was
>> created. But as many people are using schema.org, I was wondering if
>> there is any solution for this case. I would be very glad if someone has
>> any idea of a solution.
>>
>>
>>
I looked at another website (Rotten Tomatoes) and found the same
pattern.

Again, IMHO, the url could be used as the subject of the triples. I am not
sure whether this holds for all triples on all websites, but in these examples
it seems to work fine. Here is one example from the webpage
http://www.rottentomatoes.com/m/sex_tape_2014/

_:nodecfcd208495d565ef66e7dff9f98764da <http://www.schema.org/Movie/name> "Sex Tape (2014)"@en .
_:nodecfcd208495d565ef66e7dff9f98764da <http://www.schema.org/Movie/contentRating> "R"@en .
_:nodecfcd208495d565ef66e7dff9f98764da <http://www.schema.org/Movie/datePublished> "Jul 18, 2014 Wide"@en .
_:nodecfcd208495d565ef66e7dff9f98764da <http://www.schema.org/Movie/image> <http://content9.flixster.com/movie/11/17/70/11177027_det.jpg> .
_:nodef4501543ed78d92c8615458a688986 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:nodef4501543ed78d92c8615458a688986 <http://schema.org/Person/name> "Cameron Diaz"@en .
_:nodef4501543ed78d92c8615458a688986 <http://schema.org/Person/image> <http://content9.flixster.com/rtactor/42/17/42179_tmb.jpg> .
_:nodef4501543ed78d92c8615458a688986 <http://schema.org/Person/url> <file:./sex_tape_2014//celebrity/cameron_diaz/> .
_:nodecfcd208495d565ef66e7dff9f98764da <http://www.schema.org/Movie/actors> _:nodef4501543ed78d92c8615458a688986 .
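
To make the idea concrete, here is a minimal sketch of a post-processing step over the N-Triples output. It is not part of Any23; the function name `promote_url_to_subject` and the `url_suffix` parameter are hypothetical, and the regex handles only simple one-line statements rather than the full N-Triples grammar. It assumes that a blank node with a predicate ending in `/url` and an IRI object can be identified by that IRI, which may not hold for every page.

```python
import re

# Matches one simple N-Triples statement: subject, predicate, object, final dot.
# This is a simplification; lines that do not match are skipped.
TRIPLE = re.compile(r'^\s*(\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')


def promote_url_to_subject(lines, url_suffix="/url"):
    """Replace blank nodes that carry a `.../url` IRI with that IRI.

    Hypothetical helper: blank nodes without such a property are left as-is.
    """
    triples = []
    for line in lines:
        match = TRIPLE.match(line)
        if match:
            triples.append(match.groups())

    # First pass: map each blank node label to the IRI of its url property.
    mapping = {}
    for s, p, o in triples:
        if s.startswith("_:") and p[1:-1].endswith(url_suffix) and o.startswith("<"):
            mapping.setdefault(s, o)  # keep the first url if several exist

    # Second pass: substitute the IRI in both subject and object position.
    return [
        f"{mapping.get(s, s)} {p} {mapping.get(o, o)} ."
        for s, p, o in triples
    ]
```

Applied to the triples above, the Person node would become `<file:./sex_tape_2014//celebrity/cameron_diaz/>` in both the subject position and as the object of `Movie/actors`, while the Movie node, which has no url property here, would stay blank. The sketch does not resolve the `file:` IRI against the original page address; that would need to be handled separately.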


>  Correct, this is NOT a problem with the extractor at all.
> What I think you are suggesting is possibly a *better* way for us to have a
> fallback value for blank nodes, like the one you provided in your example.
> Is this a fair statement for me to make?
>

I don't know if it is a better way or not. Actually, I was hoping that
someone could tell me whether it is a reasonable idea or not =) This is the
first time I have really worked with data that is not already in triple format.


> If this is true then it would be a case of adding functionality to the
> existing html-rdfa11 or html-head-title extractor (whichever one was used
> in this particular case). I would ask you to log a Jira issue and, if possible,
> explain what it is that you intend to add... we can certainly work towards
> addressing it and I will help you with this, no reservations.
>

Sorry for my ignorance, but I don't know which extractor was used =/ I just ran
rover and asked for the output in ntriples format. How can I find out which
extractor was used?


>
> BTW, as I write I am thinking... is this fallback value kind of
> *falsifying* the node relationships? I mean the page is what it is... if we
> use the fallback value then I feel we are kind of manipulating the
> relationships within the page! Does this make sense? Is this a valid point
> I am making?
> Thanks
> Lewis
>

Regards,
Bianca
