X-Received: by 10.236.76.105 with SMTP id a69mr23815500yhe.8.1407032311452; Sat, 02 Aug 2014 19:18:31 -0700 (PDT)
Received: by 10.170.42.141 with HTTP; Sat, 2 Aug 2014 19:18:31 -0700 (PDT)
In-Reply-To:
References:
Date: Sat, 2 Aug 2014 22:18:31 -0400
Message-ID:
Subject: Re: opengraph not being extracted
From: Stéphane Corlosquet
To: user@any23.apache.org
Content-Type: multipart/alternative; boundary=20cf303ea7121c9d0504ffb03d27
X-Virus-Checked: Checked by ClamAV on apache.org

--20cf303ea7121c9d0504ffb03d27
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Sat, Aug 2, 2014 at 10:17 PM, Stéphane Corlosquet <scorlosquet@gmail.com> wrote:

>
> On Thu, Jul 24, 2014 at 9:19 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> Hi Hadar,
>>
>> On Thu, Jul 24, 2014 at 3:27 AM, <user-digest-help@any23.apache.org>
>> wrote:
>>
>>>
>>> I'm trying to use any23 1.0 to extract opengraph data.
>>> I'm simply creating the Any23 class and running extract.
>>> It works fine on schema.org but it doesn't extract og tags.
>>> Does anything special need to be done?
>>>
>>
>> OK, I found the issue here. Basically, Any23 does recognize the og: markup
>> within the <meta> tags, as follows:
>>
>> <meta property="fb:app_id" content="192959324047861" />
>> <meta property="og:title" content="Led Zeppelin" />
>> <meta property="og:url" content="http://www.last.fm/music/Led+Zeppelin" />
>> <meta property="og:image" content="http://userserve-ak.last.fm/serve/126/378064.jpg" />
>>
>> However, there is an issue with the way that last.fm actually publishes
>> its data on the web.
>> For example, when I run my Any23 master branch code over the webpage, my
>> validation report shows the following:
>>
>> <validationReport>
>>   <errors></errors>
>>   <ruleActivations>
>>     <ruleActivation><ruleStr>missing-opengraph-namespace-rule</ruleStr></ruleActivation>
>>   </ruleActivations>
>>   <issues>
>>     <issue>
>>       <origin>[HTML: null]</origin>
>>       <message>Missing OpenGraph namespace declaration.</message>
>>     </issue>
>>   </issues>
>> </validationReport>
>>
>> Basically, no namespace is declared to accompany the og: markup...
>>
>> The question for Any23 is whether or not we should acknowledge the
>> absence of the namespace declaration and provide one anyway in an effort to
>> continue with extraction.
>>
>> Do you think this would be valuable? If it is, then I can write the
>> implementation and post a patch for you to try out.
>>
>
> No, I think this would be a bad idea because RDFa already provides such
> functionality. The RDFa Core Initial Context includes og, and therefore
> all parsers should recognize it. That means the prefix declaration for og
> can be omitted (that's why semargl and other RDFa parsers have no problem
> extracting data from that page). The problem doesn't come from the RDFa
> parser, but from the HTML parser. I want to make sure you've seen my
> comment in Jira, which includes more info:
>

here is the link: https://issues.apache.org/jira/browse/ANY23-227?focusedCommentId=14083838&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14083838

>
>> Thanks
>> Lewis
>>
>
> --
> Steph.
>

-- 
Steph.

--20cf303ea7121c9d0504ffb03d27
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable



On Sat, Aug 2, 2014 at 10:17 PM, Stéphane Corlosquet <scorlosquet@gmail.com> wrote:



On Thu, Jul 24, 2014 at 9:19 PM, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:
Hi Hadar,

On Thu, Jul 24, 2014 at 3:27 AM, <user-digest-help@any23.apache.org> wrote:

I'm trying to use any23 1.0 to extract opengraph data.
I'm simply creating the Any23 class and running extract.
It works fine on schema.org but it doesn't extract og tags.
Does anything special need to be done?


OK, I found the issue here. Basically, Any23 does recognize the og: markup within the <meta> tags, as follows:

<meta property="fb:app_id" content="192959324047861" />
<meta property="og:title" content="Led Zeppelin" />
<meta property="og:url" content="http://www.last.fm/music/Led+Zeppelin" />
<meta property="og:image" content="http://userserve-ak.last.fm/serve/126/378064.jpg" />

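
To make the markup above concrete, here is a minimal Python sketch that pulls the `property`/`content` pairs out of `<meta>` elements using only the standard library. It is not how Any23 performs extraction; the class and function names (`MetaPropertyCollector`, `collect_meta_properties`) are hypothetical, and the sample HTML simply reuses the tags quoted in this thread.

```python
from html.parser import HTMLParser


class MetaPropertyCollector(HTMLParser):
    """Collect (property, content) pairs from <meta> elements.

    Hypothetical helper for illustration; not part of Any23.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also reach this method.
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        if prop is not None:
            self.pairs.append((prop, attributes.get("content")))


def collect_meta_properties(html_text):
    """Return every (property, content) pair found in the document."""
    collector = MetaPropertyCollector()
    collector.feed(html_text)
    collector.close()
    return collector.pairs


# Sample input built from the tags quoted in this thread.
SAMPLE = """<html><head>
<meta property="fb:app_id" content="192959324047861" />
<meta property="og:title" content="Led Zeppelin" />
<meta property="og:url" content="http://www.last.fm/music/Led+Zeppelin" />
</head><body></body></html>"""

# Keep only the pairs whose property uses the og: prefix.
og_pairs = [(p, c) for p, c in collect_meta_properties(SAMPLE) if p.startswith("og:")]
```

Note that the sample page has no prefix declaration on the `<html>` element, which is the situation discussed in the rest of the thread.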
However, there is an issue with the way that last.fm actually publishes its data on the web.
For example, when I run my Any23 master branch code over the webpage, my validation report shows the following:

<validationReport>
  <errors></errors>
  <ruleActivations>
    <ruleActivation><ruleStr>missing-opengraph-namespace-rule</ruleStr></ruleActivation>
  </ruleActivations>
  <issues>
    <issue>
      <origin>[HTML: null]</origin>
      <message>Missing OpenGraph namespace declaration.</message>
    </issue>
  </issues>
</validationReport>

Basically, no namespace is declared to accompany the og: markup...
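
The condition being reported can be sketched roughly as follows. This is only an illustration of the idea (og: properties are used but no og prefix is declared on the `<html>` element) and is not the Any23 rule implementation; `OgDeclarationChecker` and `missing_og_namespace` are hypothetical names, and checking only the `<html>` element is a simplifying assumption.

```python
from html.parser import HTMLParser


class OgDeclarationChecker(HTMLParser):
    """Track whether og: is used in <meta> and declared on <html>.

    Hypothetical helper for illustration; not the Any23 rule itself.
    """

    def __init__(self):
        super().__init__()
        self.og_declared = False
        self.og_used = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html":
            # Two common declaration styles: xmlns:og="..." and
            # the RDFa 1.1 prefix="og: ..." attribute.
            prefix_attr = attributes.get("prefix") or ""
            if "xmlns:og" in attributes or "og:" in prefix_attr.split():
                self.og_declared = True
        elif tag == "meta":
            if (attributes.get("property") or "").startswith("og:"):
                self.og_used = True


def missing_og_namespace(html_text):
    """Return True when og: properties appear without a declaration."""
    checker = OgDeclarationChecker()
    checker.feed(html_text)
    checker.close()
    return checker.og_used and not checker.og_declared
```

For a page that uses `og:title` with a bare `<html>` element, `missing_og_namespace` returns `True`; adding `prefix="og: http://ogp.me/ns#"` to `<html>` makes it return `False`.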

The question for Any23 is whether or not we should acknowledge the absence of the namespace declaration and provide one anyway in an effort to continue with extraction.

Do you think this would be valuable? If it is, then I can write the implementation and post a patch for you to try out.

No, I think this would be a bad idea because RDFa already provides such functionality. The RDFa Core Initial Context includes og, and therefore all parsers should recognize it. That means the prefix declaration for og can be omitted (that's why semargl and other RDFa parsers have no problem extracting data from that page). The problem doesn't come from the RDFa parser, but from the HTML parser. I want to make sure you've seen my comment in Jira, which includes more info:
here is the link: https://issues.apache.org/jira/browse/ANY23-227?focusedCommentId=14083838&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14083838
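
The point about the initial context can be illustrated with a small Python sketch of prefix resolution. It assumes a lookup table containing only the og entry of the RDFa Core Initial Context (og maps to `http://ogp.me/ns#`); `INITIAL_CONTEXT` and `resolve_curie` are hypothetical names, and this is not the code of semargl, Any23, or any other parser.

```python
# Subset of the RDFa Core Initial Context; only the og entry matters here.
INITIAL_CONTEXT = {
    "og": "http://ogp.me/ns#",
}


def resolve_curie(curie, declared_prefixes=None):
    """Expand 'prefix:reference' to a full IRI, or return None if unknown.

    Hypothetical helper for illustration only.
    """
    prefix, sep, reference = curie.partition(":")
    if not sep:
        return None
    # Prefixes declared in the document take precedence over the defaults.
    prefixes = {**INITIAL_CONTEXT, **(declared_prefixes or {})}
    namespace = prefixes.get(prefix)
    if namespace is None:
        return None
    return namespace + reference
```

With this fallback, `resolve_curie("og:title")` expands without any declaration in the page, while a prefix that is neither declared nor in the table (for example `fb:app_id` with no mapping supplied) stays unresolved.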


Thanks
Lewis

--
Steph.



--
Steph. --20cf303ea7121c9d0504ffb03d27--