any23-dev mailing list archives

From "McBennett, Pat" <McBenne...@DNB.com.INVALID>
Subject How to use Rover from Java code?
Date Wed, 29 Apr 2015 19:17:40 GMT
Hi,

I've just started trying to use Any23 programmatically from Java, and it looks great.

The documentation has sample code [1], but that code doesn't seem to work properly for me
(it also has a small typo on line 1: 'Apache' appears twice for some reason, and the duplicate
needs to be removed). Hitting the example webpage
(http://www.rentalinrome.com/semanticloft/semanticloft.htm) from the command line using Rover
works fine [2], but using the sample code gives me no triples [3].

So my questions are:

1.      Any idea why the sample code isn't outputting any triples for me?

2.      The sample code won't crawl from the webpage I provide. It just scans that one page,
right? So I guess I need to use Rover somehow from my Java code. Is there a code sample
for crawling a website given just the entry point (e.g. 'http://schema.org'), ideally with code
showing how to configure MaxPages and MaxDepth, too?

3.      [BONUS QUESTION!] How come when I use Rover to hit 'obvious' markup websites, like
'google.com' (5 triples), 'Schema.org' (2 triples) or 'bbc.co.uk' (8 triples), I get so very
few descriptive triples? I was expecting lots of triples, with lots of links to further information,
etc. Shouldn't these sites be prime examples of structured markup...?!

4.      [FINAL QUESTION] Why does Any23 report [Fatal Error] so often, but then seem to continue
fine? See my Rover output below at [4] for both Google and the BBC.
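
A possible lead on question 4: the '[Fatal Error] :line:column: message' format looks like what
the JDK's bundled XML parser prints to stderr by default when it hits input that is not well-formed
and no ErrorHandler has been installed. The parse call then throws, so a caller that catches
the exception and falls back to something more lenient would "continue fine". Whether Any23
does exactly this is an assumption here, not something confirmed from its source. A minimal
sketch using only the JDK (the class name FatalErrorDemo and the sample markup are made up
for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class FatalErrorDemo {

    /**
     * Parses the text as XML. Returns null when it is well-formed,
     * otherwise the parser's error message.
     */
    public static String parse(String text, boolean quiet) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            if (quiet) {
                // Replace the default handler (which prints to stderr)
                // with one that only rethrows.
                builder.setErrorHandler(new ErrorHandler() {
                    public void warning(SAXParseException e) { }
                    public void error(SAXParseException e) throws SAXException { throw e; }
                    public void fatalError(SAXParseException e) throws SAXException { throw e; }
                });
            }
            builder.parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXException e) {
            // Non-well-formed input ends up here; the caller can carry on.
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A bare '&' is tolerated by browsers but is fatal for an XML parser.
        String html = "<p>Tom & Jerry</p>";
        // With the default handler, a [Fatal Error] line also goes to stderr.
        System.out.println("default handler: " + parse(html, false));
        System.out.println("quiet handler:   " + parse(html, true));
        System.out.println("well-formed:     " + parse("<p>Tom &amp; Jerry</p>", true));
    }
}
```

If that is what is going on, the messages would be noise from an XML parsing attempt on ordinary
HTML rather than a failure of the whole extraction.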

Thanks,

Pat.

[1] - http://any23.apache.org/dev-data-extraction.html

[2]
C:\Installs\Apache\Any23\v1.1\bin>any23 rover -f ntriples http://www.rentalinrome.com/semanticloft/semanticloft.htm
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

------------------------------------------------------------------------
Apache Any23 :: rover
------------------------------------------------------------------------

[Fatal Error] :80:36: The content of elements must consist of well-formed character data or
markup.
<http://www.rentalinrome.com/trastevereapartments.htm> <http://purl.org/dc/terms/title>
"Rome apartments  Trastevere Area" .
<http://www.rentalinrome.com/trastevereapartments.htm> <http://www.w3.org/1999/xhtml/vocab#icon>
<http://www.rentalinrome.com/favicon.ico> .
<http://www.rentalinrome.com/trastevereapartments.htm> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://www.rentalinrome.com/css/style.css?tmp=635659384205330021> .
<http://www.rentalinrome.com/trastevereapartments.htm> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://www.rentalinrome.com/css/common.css?tmp=635659384205330021> .
<http://www.rentalinrome.com/trastevereapartments.htm> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://www.rentalinrome.com/jquery/plugin/shadowbox/shadowbox.css> .
<http://www.rentalinrome.com/trastevereapartments.htm> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://www.rentalinrome.com/jquery/css/ui-all.css?tmp=635659384205330021> .

------------------------------------------------------------------------
Apache Any23 SUCCESS
Total time: 4s
Finished at: Wed Apr 29 20:06:57 BST 2015
Final Memory: 58M/480M
------------------------------------------------------------------------

[3] - Console output when run from Java (first line is from my code!):
Attempting to extract from [http://www.rentalinrome.com/semanticloft/semanticloft.htm]...
[main] INFO org.apache.any23.extractor.SingleDocumentExtraction - Processing http://www.rentalinrome.com/trastevereapartments.htm
[main] WARN org.openrdf.rio.RDFParserRegistry - New service class org.openrdf.rio.nquads.NQuadsParserFactory
replaces existing service class org.apache.any23.io.nquads.NQuadsParserFactory
[Fatal Error] :80:36: The content of elements must consist of well-formed character data or
markup.


[4] - Output from Rover on 'exemplary' structured markup sites:

C:\Installs\Apache\Any23\v1.1\bin>any23 rover -f ntriples http://schema.org
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

------------------------------------------------------------------------
Apache Any23 :: rover
------------------------------------------------------------------------

<http://schema.org/> <http://purl.org/dc/terms/title> "Home - schema.org"@en .
<http://schema.org/> <http://www.w3.org/1999/xhtml/vocab#stylesheet> <http://schema.org/search_files/schemaorg.css>
.

------------------------------------------------------------------------
Apache Any23 SUCCESS
Total time: 3s
Finished at: Wed Apr 29 20:12:18 BST 2015
Final Memory: 50M/480M
------------------------------------------------------------------------

C:\Installs\Apache\Any23\v1.1\bin>any23 rover -f ntriples http://google.com
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

------------------------------------------------------------------------
Apache Any23 :: rover
------------------------------------------------------------------------

[Fatal Error] :1:3: The markup in the document preceding the root element must be well-formed.
<http://www.google.ie/?gws_rd=cr&ei=IS1BVY7RE4LaUemGgfgB> <http://purl.org/dc/terms/title>
"Google" .
_:node3065d7a7d82e252f24e326642bd43c3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://schema.org/WebPage> .
_:node3065d7a7d82e252f24e326642bd43c3 <http://schema.org/WebPage/image> "/images/google_favicon_128.png"@en-ie
.
<http://www.google.ie/?gws_rd=cr&ei=IS1BVY7RE4LaUemGgfgB> <http://www.w3.org/1999/xhtml/microdata#item>
_:node3065d7a7d82e252f24e326642bd43c3 .
<http://www.google.ie/?gws_rd=cr&ei=IS1BVY7RE4LaUemGgfgB> <http://purl.org/dc/terms/title>
"Google"@en-ie .

------------------------------------------------------------------------
Apache Any23 SUCCESS
Total time: 3s
Finished at: Wed Apr 29 20:12:35 BST 2015
Final Memory: 53M/480M
------------------------------------------------------------------------

C:\Installs\Apache\Any23\v1.1\bin>any23 rover -f ntriples http://bbc.co.uk
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

------------------------------------------------------------------------
Apache Any23 :: rover
------------------------------------------------------------------------

[Fatal Error] :18:438: The entity name must immediately follow the '&' in the entity reference.
<http://www.bbc.co.uk/> <http://purl.org/dc/terms/title> "BBC - Homepage" .
<http://www.bbc.co.uk/> <http://www.w3.org/1999/xhtml/vocab#stylesheet> <http://static.bbci.co.uk/frameworks/barlesque/2.83.4/orb/4/style/orb-fixed.css>
.
<http://static.bbci.co.uk/frameworks/barlesque/2.83.4/orb/4/style/orb-fixed.css> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://static.bbci.co.uk/h4clock/0.70.3/style/h4clock.css> .
<http://static.bbci.co.uk/frameworks/barlesque/2.83.4/orb/4/style/orb-fixed.css> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://static.bbci.co.uk/locator/0.119.7/style/locator.css> .
<http://static.bbci.co.uk/frameworks/barlesque/2.83.4/orb/4/style/orb-fixed.css> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://static.bbci.co.uk/h4weather/0.82.2/style/h4weather.css> .
<http://static.bbci.co.uk/frameworks/barlesque/2.83.4/orb/4/style/orb-fixed.css> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://static.bbci.co.uk/h4discoveryzone/0.235.3/style/h4discoveryzo
.css> .
<http://static.bbci.co.uk/frameworks/barlesque/2.83.4/orb/4/style/orb-fixed.css> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://static.bbci.co.uk/h4base/0.211.0/style/h4base.css> .
<http://static.bbci.co.uk/frameworks/barlesque/2.83.4/orb/4/style/orb-fixed.css> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://static.bbci.co.uk/h4domestic/0.66.0/style/h4domestic.css> .

------------------------------------------------------------------------
Apache Any23 SUCCESS
Total time: 4s
Finished at: Wed Apr 29 20:12:52 BST 2015
Final Memory: 58M/480M
------------------------------------------------------------------------
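
The triple counts quoted in question 3 can be checked against the wrapped output above with
a small heuristic: count the lines that are a lone '.' or that end in ' .', since each N-Triples
statement ends that way even when the mail client has wrapped it. This is only a sketch (the
class name TripleCount is made up), and it would miscount if a wrap happened to fall right
after ' .' inside a literal:

```java
import java.util.Arrays;
import java.util.List;

public class TripleCount {

    /** Counts statement terminators in N-Triples text whose long lines were wrapped. */
    public static long countStatements(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                // A statement ends on a line that is "." alone or ends with " ."
                .filter(line -> line.equals(".") || line.endsWith(" ."))
                .count();
    }

    public static void main(String[] args) {
        // The schema.org output from [4], wrapped exactly as it appears in this message.
        List<String> schemaOrg = Arrays.asList(
                "<http://schema.org/> <http://purl.org/dc/terms/title> \"Home - schema.org\"@en .",
                "<http://schema.org/> <http://www.w3.org/1999/xhtml/vocab#stylesheet> <http://schema.org/search_files/schemaorg.css>",
                ".");
        System.out.println(countStatements(schemaOrg)); // prints 2
    }
}
```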


Pat McBennett
Architect
The Chase Building, 5th Floor
Carmanhall Road, Sandyford,
Dublin 18, Ireland
Direct +353 1
Mobile +353 8

http://www.dnb.co.uk/


