From: "Wunderlich, Tobias"
To: "connectors-user@incubator.apache.org"
Subject: AW: Indexing Wikipedia/MediaWiki
Date: Mon, 19 Sep 2011 10:07:02 +0000

> (1) How do you form a URL that would take a user to a document? Does it use the title, or does it use the page ID?

I guess one way would be to append the title to the main URL, like http://en.wikipedia.org/wiki/. However, I have not yet found out how to create a URL to the document via the page ID.

> (2) If the URL includes the page ID, is there any way to get metadata information about the document using the page ID directly? It probably wouldn't be the query feature that would do this, btw.

It is possible to get the metadata of a document using the page ID (instead of the title) directly:

Title  -> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment|content
PageID -> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=27697087&rvprop=timestamp|user|comment|content

Tobias

-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Monday, 19 September 2011 11:35
To: connectors-user@incubator.apache.org
Subject: Re: Indexing Wikipedia/MediaWiki

The API seems to be built around using titles as document keys, and yet there is a page ID also, which would probably be better for looking up page data. So I have some new questions:

(1) How do you form a URL that would take a user to a document? Does it use the title, or does it use the page ID?
(2) If the URL includes the page ID, is there any way to get metadata information about the document using the page ID directly? It probably wouldn't be the query feature that would do this, btw.

Thanks,
Karl

On Mon, Sep 19, 2011 at 5:09 AM, Wunderlich, Tobias <tobias.wunderlich@igd-r.fraunhofer.de> wrote:
> Hey Karl,
>
> I did some research and the MediaWiki API looks promising:
>
> - There needs to be some notion of an overall list of pages:
>     - http://www.mediawiki.org/wiki/API:Allpages
>     - Example: http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Kre&aplimit=5
>
> - Metadata information (author and pub date) also needs to be separated out in some way:
>     - http://www.mediawiki.org/wiki/API:Properties#Revisions:_Example
>     - Example: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API|Main%20Page&rvprop=timestamp|user|comment|content
>
> What do you think?
>
> Tobias
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: Friday, 16 September 2011 16:11
> To: Sumana Harihareswara
> Cc: Wunderlich, Tobias
> Subject: Re: MediaWiki & Lucene development
>
> The lucene-search extension may or may not be appropriate for Tobias.
> But my interest would extend towards getting wiki content into whatever target a ManifoldCF sets up, not just Solr/Lucene.
> In order to do this, the following needs to be addressed:
>
> - There needs to be some notion of an overall list of pages, preferably queryable by date and time of last change;
> - We'd need access, per page, to authorization information
> - Metadata information (author and pub date) also needs to be separated out in some way
>
> The plugin that Tobias mentioned seems to do the last item fine, but not the first two. Do you have a solution for those?
>
> Thanks,
> Karl
>
> On Fri, Sep 16, 2011 at 9:40 AM, Sumana Harihareswara <sumanah@wikimedia.org> wrote:
>> Hi. I happened to see you both discussing MediaWiki and search/indexing in a mailing list recently.
>>
>> You might be interested in asking your question to the MediaWiki/Wikimedia developers' list
>>
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>> and I'd also welcome any assistance in improving our Lucene search extension, which is used on Wikipedia:
>>
>> http://www.mediawiki.org/wiki/Extension:Lucene-search
>>
>> Thanks!
>>
>> --
>> Sumana Harihareswara
>> Volunteer Development Coordinator
>> Wikimedia Foundation
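
For reference, the query URLs discussed in this thread (the allpages listing and the revisions lookup by title or by page ID) could be assembled along the following lines. This is only a rough sketch in Python: the function names allpages_url and revisions_url and the constants are made up for illustration, and only the parameters that appear in the example URLs above are used. It builds URL strings and does not contact the server.

```python
from urllib.parse import quote, urlencode

# Endpoint and rvprop value taken from the example URLs in this thread.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"
RVPROP = "timestamp|user|comment|content"


def _build(params):
    # Keep "|" literal (it separates multiple values) and encode spaces as %20.
    return API_ENDPOINT + "?" + urlencode(params, safe="|", quote_via=quote)


def allpages_url(apfrom, aplimit):
    """URL listing up to `aplimit` pages, starting at title `apfrom`."""
    return _build({
        "action": "query",
        "list": "allpages",
        "apfrom": apfrom,
        "aplimit": aplimit,
    })


def revisions_url(titles=None, pageids=None):
    """URL for revision metadata, keyed by titles or by page IDs (not both)."""
    if (titles is None) == (pageids is None):
        raise ValueError("pass exactly one of titles or pageids")
    if titles is not None:
        key, values = "titles", titles
    else:
        key, values = "pageids", pageids
    return _build({
        "action": "query",
        "prop": "revisions",
        key: "|".join(str(v) for v in values),
        "rvprop": RVPROP,
    })


if __name__ == "__main__":
    print(allpages_url("Kre", 5))
    print(revisions_url(titles=["API", "Main Page"]))
    print(revisions_url(pageids=[27697087]))
```

Keeping titles and pageids behind one function makes it easy to switch the document key later, which is the open question in the messages above.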