Date: Fri, 23 Mar 2012 11:16:12 +0000
Subject: Re: http://webdatacommons.org/
From: Szymon Danielczyk
To: any23-dev@incubator.apache.org

Hi,

A paragraph from their website:

"Our solution is to run (Java) regular expressions against each webpages prior to extraction, which detect the presence of a microformat in a HTML page, and then only run the Any23 extractor when the regular expression find potentional matches."

Are we using any techniques like that to decide whether there is anything to parse in a document?

Maybe we could build in such a feature, for example a method or filter that lets users who need to parse a huge number of documents detect whether a document is worth parsing.

They have a table with the regex they used for each format.

Any opinions about this?

Szymon

On 23 March 2012 10:38, Davide Palmisano wrote:
> Thanks Michele,
>
> this is a great news.
>
> Should we have a section on the web site listing
> all the products/initiatives that are using Any23?
>
> On Fri, Mar 23, 2012 at 11:01 AM, Michele Mostarda
> wrote:
>> Hi Guys,
>>
>>   just a curiosity:
>>
>>    Any23 has been recently used to parse the entire corpus of Semantic
>> Web Data existing on the Web [0].
>>
>> The best.
>>
>> Mic
>>
>> [0] http://webdatacommons.org/
>>
>> --
>> Michele Mostarda
>> Senior Software Engineer
>> skype: michele.mostarda
>> twitter: micmos
>> mail: me@michelemostarda.com
>> site : http://www.michelemostarda.com
>
>
>
> --
> Davide Palmisano
>
> http://davidepalmisano.com
> http://twitter.com/dpalmisano
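
For reference, the pre-filter idea raised in this message could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name MarkupPrefilter, the method worthParsing, and the three patterns are all hypothetical. The patterns are rough guesses, not the table published on webdatacommons.org, and nothing here is part of the existing Any23 API. The point is just that a cheap regex scan over the raw HTML can decide whether the full extractor needs to run at all.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical pre-filter: cheap regex checks run before a full extraction.
// The patterns are illustrative guesses, NOT the webdatacommons.org table,
// and this class is not part of the Any23 API.
final class MarkupPrefilter {

    private static final Map<String, Pattern> HINTS = new LinkedHashMap<>();
    static {
        // hCard: a class attribute containing the token "vcard".
        HINTS.put("hcard", Pattern.compile(
            "class\\s*=\\s*[\"'][^\"']*\\bvcard\\b", Pattern.CASE_INSENSITIVE));
        // HTML5 microdata: an itemscope or itemtype attribute.
        HINTS.put("microdata", Pattern.compile(
            "\\bitem(scope|type)\\b", Pattern.CASE_INSENSITIVE));
        // RDFa: a typeof or property attribute.
        HINTS.put("rdfa", Pattern.compile(
            "\\b(typeof|property)\\s*=", Pattern.CASE_INSENSITIVE));
    }

    /** Returns true if any hint pattern occurs somewhere in the raw HTML. */
    static boolean worthParsing(String html) {
        for (Pattern p : HINTS.values()) {
            if (p.matcher(html).find()) {
                return true;
            }
        }
        return false;
    }
}
```

A caller would check MarkupPrefilter.worthParsing(html) first and hand the document to the extractor only when it returns true. A filter like this is meant to be cheap and can afford false positives (the extractor simply finds nothing), but a pattern that is too narrow would silently skip documents that do contain structured data.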