any23-user mailing list archives

From Michele Mostarda <michele.mosta...@gmail.com>
Subject Re: Too many tuples!!
Date Thu, 05 Apr 2012 13:08:56 GMT
Hi Tim,

     another good source for vocab usage / coverage statistics about the
Semantic Web is

       http://sindice.com/stats/basic-stats/

Best.
Mic

2012/4/5 Michele Mostarda <michele.mostarda@gmail.com>

> Hi Tim,
>
>    sorry for the delay.
>
> First of all: did you see this initiative [0]? It looks similar to
> your task.
>
> I tried to reproduce your issue with the latest Any23 trunk version
> but did not get any nesting triples (I am still investigating).
>
> The triples with predicate "http://vocab.sindice.net/any23#nesting" are
> generated by a post-processing phase that adds meta triples describing
> how the HTML markup elements that produce triples are nested together.
>
> This is mostly used when you have a page containing nested Microformats
> (like the mf-geo <span class="geo-default">) and you want to keep the
> meaning of the metadata expressed by the Microformats.
>
> For your purpose you don't need the consolidation triples, so you can
> skip producing them by setting the flag
> "any23.extraction.metadata.nesting" to "off". Specific instructions on
> how to use flags can be found here [1].
>
> Another flag that can reduce the number of generated meta triples is
> "any23.extraction.metadata.domain.per.entity". Setting it to "off"
> prevents the generation of domain triples such as:
>
>   _:noded11095f6ff16d9464e5e63653734bb <http://vocab.sindice.net/any23#domain> "en.wikipedia.org" .
>
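If changing the flags is not convenient, a similar reduction can be approximated after extraction by dropping triples by predicate. The following is only a minimal sketch, assuming the output is already available as N-Triples lines; it is a post-processing stand-in, not the Any23 flags themselves. The function name `drop_meta_triples` and the second and third sample lines are made up for illustration.

```python
import re

# Predicates named in the message above. Dropping them here is a
# post-processing stand-in for the Any23 flags, not the flags themselves.
META_PREDICATES = {
    "http://vocab.sindice.net/any23#nesting",
    "http://vocab.sindice.net/any23#domain",
}

# Minimal N-Triples pattern: subject (IRI or blank node), then predicate IRI.
_TRIPLE = re.compile(r"^\s*(?:<[^>]*>|_:\S+)\s+<([^>]*)>\s")


def drop_meta_triples(lines, unwanted=META_PREDICATES):
    """Yield the N-Triples lines whose predicate is not in `unwanted`."""
    for line in lines:
        match = _TRIPLE.match(line)
        if match and match.group(1) in unwanted:
            continue  # skip meta triples
        yield line


# The first line is the domain triple quoted above; the other two are
# hypothetical sample data.
sample = [
    '_:noded11095f6ff16d9464e5e63653734bb <http://vocab.sindice.net/any23#domain> "en.wikipedia.org" .',
    '_:n1 <http://vocab.sindice.net/any23#nesting> _:n2 .',
    '_:n1 <http://www.w3.org/2006/vcard/ns#fn> "LA-04-LS" .',
]
kept = list(drop_meta_triples(sample))
```

Lines that do not look like a triple (comments, blank lines) pass through unchanged, so the filter only removes what it positively identifies as a meta triple.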
> The quantity of triples produced for the HTML code snippet you pasted is
> 'normal'; the RDF data format tends to be a little 'verbose' :)
>
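To see which predicates account for the volume, a quick tally per predicate can help. This is a rough sketch under the assumption that the extraction output has been saved as N-Triples; `predicate_counts` and the sample lines are hypothetical and not part of Any23.

```python
import re
from collections import Counter

# Minimal N-Triples pattern: subject (IRI or blank node), then predicate IRI.
_PREDICATE = re.compile(r"^\s*(?:<[^>]*>|_:\S+)\s+<([^>]*)>\s")


def predicate_counts(lines):
    """Count how many triples use each predicate IRI."""
    counts = Counter()
    for line in lines:
        match = _PREDICATE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample data, for illustration only.
sample = [
    '_:a <http://vocab.sindice.net/any23#nesting> _:b .',
    '_:a <http://vocab.sindice.net/any23#nesting> _:c .',
    '_:a <http://vocab.sindice.net/any23#domain> "en.wikipedia.org" .',
]
counts = predicate_counts(sample)
for predicate, n in counts.most_common():
    print(n, predicate)
```

On real output, any iterable of lines works, for example an open file handle for a saved N-Triples file.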
> You can apply filters to avoid extracting triples generated by CSS
> declarations. To do it from the command line, use:
>
>      bin/any23 rover --notrivial  'http://url/to/page'
>
> To do it programmatically, take a look at the TripleHandler
> implementation class:
>
>     org.apache.any23.filter.IgnoreAccidentalRDFa
>
> If you notice any unexpected or strange behavior please feel free to
> report an issue at [2].
>
> Hope it helps.
>
> The best.
>
> Mic
>
> [0] http://webdatacommons.org/
> [1] http://incubator.apache.org/any23/configuration.html
> [2] https://issues.apache.org/jira/browse/ANY23
>
>
> 2012/4/5 Tim Potter <tep@yahoo-inc.com>
>
>> Hi Lewis,
>>    Maybe the page has been modified slightly since I copied that
>> snippet.  If you search for '118°09′03″W' in the page source you should
>> find the entry.  I guess the easiest way to reproduce the problem is to
>> run:
>>
>>  'any23tools Rover
>> http://en.wikipedia.org/wiki/List_of_Nike_missile_locations'
>>
>> It returns somewhere in the order of a million tuples.
>>
>> I found switching off the nested triple production ('any23tools Rover -n
>> http….') returns a lot less. Like a few thousand.
>>
>> Like I said, I don't have enough experience with RDF to know if what
>> Any23 is extracting is correct.  Just seems like a lot of tuples..
>>
>> Thanks for your help.
>>
>> Regards,
>>   Tim P.
>>
>>
>>
>> From: Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>
>> Reply-To: "any23-user@incubator.apache.org" <
>> any23-user@incubator.apache.org>
>> Date: Wed, 4 Apr 2012 23:19:11 +0100
>> To: "any23-user@incubator.apache.org" <any23-user@incubator.apache.org>
>> Subject: Re: Too many tuples!!
>>
>> Hi Tim,
>>
>> I've just picked this up, it got lost in my filters.
>>
>> 2012/3/30 Tim Potter <tep@yahoo-inc.com>
>>
>>>
>>> http://en.wikipedia.org/wiki/List_of_Nike_missile_locations
>>>
>>
>> With regard to the URL above and the source below, I can't find the
>> snippet below in that page! Can you please check and confirm for me?
>>
>>>
>>> Given the HTML Snippet:
>>>
>>> <a href="
>>> http://toolserver.org/~geohack/geohack.php?pagename=List_of_Nike_missile_locations&amp;params=34_22_41_N_118_09_03_W_&amp;title=LA-04-LS
>>> " class="external text" rel="nofollow" style="white-space: normal;">
>>>
>>> <span class="geo-default">
>>>
>>> <span title="Maps, aerial photos, and other data for this location"
>>> class="geo-dms">
>>>
>>> <span class="latitude">34°22′41″N</span>
>>>
>>> <span class="longitude">118°09′03″W</span>
>>>
>>> </span>
>>>
>>> </span>
>>>
>>> <span class="geo-multi-punct">&#65279; / &#65279;</span>
>>>
>>> <span class="geo-nondefault">
>>>
>>> <span class="vcard">
>>>
>>> <span title="Maps, aerial photos, and other data for this location"
>>> class="geo-dec">34.37806°N 118.15083°W</span>
>>>
>>> <span style="display: none">
>>>
>>> &#65279; /
>>>
>>> <span class="geo">34.37806; -118.15083</span>
>>>
>>> </span>
>>>
>>> <span style="display: none">
>>>
>>> &#65279; (
>>>
>>> <span class="fn org">LA-04-LS</span>
>>>
>>> )
>>>
>>>  </span>
>>>
>>>  </span>
>>>
>>>  </span>
>>>
>>> </a>
>>>
>>>
>>
>
>
> --
> Michele Mostarda
> Senior Software Engineer
> skype: michele.mostarda
> twitter: micmos
> mail: me@michelemostarda.com
> site : http://www.michelemostarda.com
>
>


