From: Michael Wechner
Date: Tue, 06 Feb 2007 16:44:57 +0100
To: nutch-dev@lucene.apache.org
Subject: Getting a semantic version of an "HTML page"
Message-ID: <45C8A279.6000008@wyona.com>

Hi

Is there any standardized way for Nutch to get a semantic version of a web page, e.g.
the HTML page is as follows

blablabla ...

and the semantic XML (index-semantic.xml) would be something more useful than the HTML itself, or alternatively some RDF or whatever.

Any pointers are very welcome.

Thanks

Michi

-- 
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com
http://lenya.apache.org
michael.wechner@wyona.com
michi@apache.org
+41 44 272 91 61