any23-user mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Extract a Open Graph value from a web page
Date Fri, 23 Jan 2015 05:09:02 GMT
Hi Meraj,

Running the page you've provided through any23-vm.apache.org results in
the following output:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix doac: <http://ramonantonio.net/doac/0.1/#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://www.macmall.com/p/Apple-Mac-Mini/product~dpno~13312163~pdp.ijcehjd>
dcterms:title "MacMall | Apple Mac mini dual-core Intel Core i5 1.4GHz
(Turbo Boost up to 2.7GHz), 4GB RAM, 500GB Hard Drive, Intel HD
Graphics 5000, Mac OS X Yosemite MGEM2LL/A"@en .

_:node7ab4123bafd8f45a207e47585841b13 a <http://schema.org/Product> ;
	<http://schema.org/Product/description> "Mac mini Dual-Core Intel
Core i5 1.4GHz, 4GB DDR3 memory, 500GB SATA hard drive, Intel HD
Graphics 5000 processor, 802.11ac Wi-Fi, Bluetooth, Gigabit Ethernet,
HDMI, SDXC card slot, Two Thunderbolt 2 Ports, Audio in/out, IR
receiver"@en ;
	<http://schema.org/Product/name> """
				Apple Mac mini dual-core Intel Core i5 1.4GHz (Turbo Boost up to
2.7GHz), 4GB RAM, 500GB Hard Drive, Intel HD Graphics 5000, Mac OS X
Yosemite (MGEM2LL/A)
				"""@en .

_:node9d41b06013eb3d847ae58af99799bbb a <http://schema.org/Offer> ;
	<http://schema.org/Offer/price> "$479.00"@en ;
	<http://schema.org/Offer/availability> <http://schema.org/InStock> .

_:node7ab4123bafd8f45a207e47585841b13
<http://schema.org/Product/offers>
_:node9d41b06013eb3d847ae58af99799bbb .

_:nodeaa9a2f42eeabd0b7dfbbd7dfcf6f9 a <http://schema.org/AggregateRating> ;
	<http://schema.org/AggregateRating/bestRating> "Null"@en ;
	<http://schema.org/AggregateRating/ratingValue> "Null"@en ;
	<http://schema.org/AggregateRating/reviewCount> "2"@en .

_:node7ab4123bafd8f45a207e47585841b13
<http://schema.org/Product/aggregateRating>
_:nodeaa9a2f42eeabd0b7dfbbd7dfcf6f9 .

<http://www.macmall.com/p/Apple-Mac-Mini/product~dpno~13312163~pdp.ijcehjd>
<http://www.w3.org/1999/xhtml/microdata#item>
_:node7ab4123bafd8f45a207e47585841b13 ;
	dcterms:title "MacMall | Apple Mac mini dual-core Intel Core i5
1.4GHz (Turbo Boost up to 2.7GHz), 4GB RAM, 500GB Hard Drive, Intel HD
Graphics 5000, Mac OS X Yosemite MGEM2LL/A"@en ;
	<http://www.w3.org/1999/xhtml/vocab#nofollow>
<http://www.facebook.com/share.php?u=<;url\>> ;
	<http://www.w3.org/1999/xhtml/vocab#ALTERNATE-STYLESHEET>
<http://www.macmall.com/p/Apple-Mac-Mini/product~dpno~13312163~pdp.ijcehjd//mall/stylesheet/wbd.css>
;
	<http://www.w3.org/1999/xhtml/vocab#canonical>
<http://www.macmall.com/p/Apple-Mac-Mini/product~dpno~13312163~pdp.ijcehjd>
;
	<http://www.w3.org/1999/xhtml/vocab#ALTERNATE-STYLESHEET>
<http://www.macmall.com/p/Apple-Mac-Mini/product~dpno~13312163~pdp.ijcehjd///i2.cc-inc.com/sprite/css/mainMenuExtended02.css>
, <http://www.macmall.com/p/Apple-Mac-Mini/product~dpno~13312163~pdp.ijcehjd//css/search/typeahead/reset.css?typeahead-widget-1.1.1>
, <http://www.macmall.com/p/Apple-Mac-Mini/product~dpno~13312163~pdp.ijcehjd//css/reset.css?ver=1>
;
	<http://www.w3.org/1999/xhtml/vocab#generator> "ToolTwist"@en ;
	<http://www.w3.org/1999/xhtml/vocab#description> "Apple Mac mini
dual-core Intel Core i5 1.4GHz (Turbo Boost up to 2.7GHz), 4GB RAM,
500GB Hard Drive, Intel HD Graphics 5000, Mac OS X Yosemite MGEM2LL/A
for $479.00 at macmall.com. Systems - Mac Mini - Mac Mini w/ Intel
Core i5 Duo Processor - 1.4 GHz Mac Mini Computers from
macmall.com."@en ;
	<http://www.w3.org/1999/xhtml/vocab#keywords> "Apple Mac mini
dual-core Intel Core i5 1.4GHz (Turbo Boost up to 2.7GHz), 4GB RAM,
500GB Hard Drive, Intel HD Graphics 5000, Mac OS X Yosemite, Apple Mac
Mini, Mac Mini w/ Intel Core i5 Duo Processor, 1.4 GHz Mac Mini
Computers, macmini, 3TED Systems"@en ;
	<http://www.w3.org/1999/xhtml/vocab#format-detection> "telephone=no"@en ;
	<http://www.w3.org/1999/xhtml/vocab#p:domain_verify>
"e02911354daa2202c515e76b11f9561b"@en ;
	<http://www.w3.org/1999/xhtml/vocab#robots> "noodp,noydir"@en .

I think we can improve upon this by supporting both the
xmlns:og="http://opengraphprotocol.org/schema/" and
xmlns:fb="http://www.facebook.com/2008/fbml" namespaces... right now it
would appear that we don't. In particular, the overwhelming majority of
triples from this page appear to come from the microdata parser, as they
are extracted from the microdata itemprop attributes.
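
To make the og: case concrete, here is a minimal sketch of the kind of
statement we would want to see for such a page. It is not Any23 code and
not how an Any23 extractor is implemented: the class name OgMetaSketch,
the sample HTML and the example.com image URL are all made up for
illustration, and a regex stands in for a real HTML parser. It only shows
an og:* meta property being expanded against the og namespace above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Any23.
class OgMetaSketch {
    // Namespace taken from the xmlns:og declaration mentioned above.
    static final String OG_NS = "http://opengraphprotocol.org/schema/";

    // Matches a whole <meta ...> tag, then individual double-quoted attributes.
    private static final Pattern META =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ATTR =
        Pattern.compile("([\\w:-]+)\\s*=\\s*\"([^\"]*)\"");

    /** Returns og:* properties keyed by their expanded IRI (first value wins). */
    static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher meta = META.matcher(html);
        while (meta.find()) {
            String property = null;
            String content = null;
            Matcher attr = ATTR.matcher(meta.group());
            while (attr.find()) {
                String name = attr.group(1).toLowerCase();
                if (name.equals("property")) {
                    property = attr.group(2);
                } else if (name.equals("content")) {
                    content = attr.group(2);
                }
            }
            // Keep only properties that use the og: prefix.
            if (property != null && content != null && property.startsWith("og:")) {
                result.putIfAbsent(OG_NS + property.substring(3), content);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample markup; the real page's markup may differ.
        String html = "<html><head>"
            + "<meta property=\"og:image\" content=\"http://example.com/mac-mini.jpg\"/>"
            + "<meta name=\"generator\" content=\"ToolTwist\"/>"
            + "</head></html>";
        extract(html).forEach((iri, value) ->
            System.out.println("<" + iri + "> \"" + value + "\""));
    }
}
```

For the sample markup this prints one predicate/object pair for og:image
and skips the plain name="generator" meta tag, which is the split we
would expect between an Open Graph extractor and the HTML meta extractor.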

One thing I've noticed is that although the HTML TitleExtractor [0] is
being called, the HTMLMetaExtractor [1] is not!
We need to investigate this further.

[0]
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/html/TitleExtractor.java
[1]
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/html/HTMLMetaExtractor.java

On Fri, Jan 16, 2015 at 9:07 AM, <user-digest-help@any23.apache.org> wrote:

>
> I am trying to retrieve an object value (as in subject-predicate-object)
> from a web page using the following code. The page I am extracting it
> from is
> http://www.macmall.com/p/Apple-Mac-Mini/product~dpno~13312163~pdp.ijcehjd
>
> As you can clearly see, it has og markup using RDFa; however, the code
> below fails to extract any og values. Can you please let me know what
> I might be doing wrong? The property name that I am trying to extract
> is OGP.imageUrl
>
> Thanks.
>
> private static String retrieveOGPProperty(String URL, String propertyName) {
>     logger.trace("Entering the method retrieveOGPProperty ");
>     String propertyValue = null;
>     OGP ogp = OGP.getInstance();
>     org.openrdf.sail.Sail store = new org.openrdf.sail.memory.MemoryStore();
>     try {
>         store.initialize();
>     } catch (SailException e) {
>         // TODO Auto-generated catch block
>         e.printStackTrace();
>     }
>     try {
>         org.openrdf.repository.RepositoryConnection conn =
>             new org.openrdf.repository.sail.SailRepository(store).getConnection();
>         RepositoryResult<org.openrdf.model.Statement> statements =
>             conn.getStatements(RDFUtils.uri(URL), ogp.imageURL, null, false);
>         if (statements.hasNext()) {
>             // get the first property
>             Value object = statements.next().getObject();
>             propertyValue = object.stringValue();
>         }
>     } catch (RepositoryException e) {
>         // Log the error and ignore it
>         logger.error("Error occurred while extracting the OGP property " + propertyName, e);
>     }
>     logger.trace("Exiting the method retrieveOGPProperty ");
>     return propertyValue;
> }
>
>
>


-- 
*Lewis*
