any23-user mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Extracting Blank Nodes instead of IRIs
Date Mon, 14 Jul 2014 21:24:43 GMT
Hi Bianca,

On Mon, Jul 14, 2014 at 5:55 AM, <user-digest-help@any23.apache.org> wrote:

>  In order to reproduce this specific case I used the following commands:
>
>   wget http://www.imdb.com/title/tt0286560/?ref_=fn_al_tt_4
>
>   ./apache-any23-core-1.0/bin/rover  -f ntriples -o
> index.html?ref_=fn_al_tt_4.nt  index.html?ref_=fn_al_tt_4
>

OK, I will look into this later today. Thanks for the example this time around.


> I tried to look into another website (Rotten Tomatoes) and I found the
> same pattern.
>
>  Again, IMHO, the url could be used as the subject of the triples. I am
> not sure if it is valid for all triples in all websites but in those
> examples it seems to work fine. Here goes one example from the webpage
> http://www.rottentomatoes.com/m/sex_tape_2014/
>

OK, I doubt I would be able to navigate to that URL whilst on my work
laptop ;)
However, I wonder if you have discovered the XPath extractor?
http://any23.apache.org/apidocs/org/apache/any23/extractor/xpath/XPathExtractor.html
It is marked as experimental, but it might do the trick if this is
something that needs to be addressed for your ongoing work.


>
>
> I don't know if it is a better way or not. Actually I was hoping that
> someone could tell me if it is a reasonable idea or not =) As it is the
> first time I really work with data which is not already in triples format.
>

Yeah, unfortunately this is the real-life scenario, and we need to accept
that not all data is going to be in the form we want or need. Sometimes
preprocessing is required before the data meets your target requirements.
I am therefore interested in hearing how we can address this one.
For me, a subject value referenced as a blank node, e.g.
node_0974638e093e23 (or something similar), is difficult to both interpret
and relate to unless it can be visualized within the web page.
We 'can' do this type of thing with Any23, but from what I can see you are
using the command line tools and not the Java API directly.
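
To make the preprocessing idea concrete, here is a rough sketch of a
post-processing step that runs on rover's N-Triples output, outside Any23
itself. It is not an Any23 feature. The function name, the regular
expression and the example IRI are all assumptions made for illustration,
and the regex is a simplification rather than a full N-Triples parser.

```python
import re

# Simplified N-Triples line: subject, predicate, object, terminating dot.
# Assumption: one triple per line, as in rover's "-f ntriples" output.
TRIPLE_RE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.*?)\s*\.\s*$')


def rename_blank_node(lines, label, iri):
    """Replace one blank node label with an IRI in subject/object position.

    Hypothetical helper: `label` is e.g. "_:node1", `iri` is the page URL.
    Lines that do not look like triples are passed through unchanged.
    """
    result = []
    for line in lines:
        match = TRIPLE_RE.match(line)
        if not match:
            result.append(line.rstrip('\n'))
            continue
        subject, predicate, obj = match.groups()
        if subject == label:
            subject = '<' + iri + '>'
        # Only an object that *is* the blank node is rewritten; literals
        # that merely contain the label text are left alone.
        if obj == label:
            obj = '<' + iri + '>'
        result.append(subject + ' ' + predicate + ' ' + obj + ' .')
    return result


sample = [
    '_:node1 <http://schema.org/name> "Movie" .',
    '<http://example.org/a> <http://schema.org/about> _:node1 .',
    '_:node2 <http://schema.org/name> "has _:node1 inside" .',
]
rewritten = rename_blank_node(sample, '_:node1', 'http://example.org/page')
print('\n'.join(rewritten))
```

Note that this renames a single, explicitly chosen blank node. Mapping
every blank node on a page to the page URL would merge distinct resources
into one, so whether the URL is the right subject needs checking per site.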


>
>
> Sorry my ignorance but I don't know which extractor was used =/ I just
> used the rover asking the format to be given in ntriples. How can I know
> which extractor was used?
>

Well, a quick and easy way to do this would be to navigate to
http://any23.org and run an extraction with validation and fixing set to
true; this will generate a report with the extractors which have been used.
Although this doesn't report the exact extractor, it will save you time in
narrowing down which one was used. You will most likely need to step
through the code in a debugger to see which extractor extracted which triple.

Thanks
Lewis
