From: "Wunderlich, Tobias"
To: "connectors-user@incubator.apache.org"
Subject: Re: Indexing Wikipedia/MediaWiki
Date: Fri, 16 Sep 2011 11:54:55 +0000

Hey Karl,

Thanks for your quick reply. Modifying the RSSConnector seems like a valid approach for crawling sitemaps.

Unfortunately, the wiki I have to index does not have a sitemap extension at the moment. Because there is no static link that yields a list of available pages, I need to crawl a seed URL with a hop count of at least 2. So I guess modifying the WebConnector for my personal needs will be my next step?!

On another note, the release date of MCF 0.3 was yesterday, but the main page says that it is still being reviewed by the developer community. The svn repository has an rc0 and an rc1 version ... are there more to come, or is rc1 good to go?
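For reference, a sitemap in the sitemaps.org format is just a flat XML list of URLs, structurally quite close to an RSS feed, which is presumably why adapting the RSSConnector looks feasible. A minimal sketch (the wiki URL below is a made-up placeholder, and only <loc> is required per entry):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <!-- the page address; this is the only mandatory element -->
        <loc>http://wiki.example.org/index.php/Main_Page</loc>
        <!-- optional hints a crawler may use for scheduling -->
        <lastmod>2011-09-16</lastmod>
        <changefreq>daily</changefreq>
      </url>
    </urlset>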
Tobias

-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, 16 September 2011 11:32
To: connectors-user@incubator.apache.org
Subject: Re: Indexing Wikipedia/MediaWiki

It might be worth exploring sitemaps. http://en.wikipedia.org/wiki/Site_map

It may be possible to create a connector, much like the RSS connector, that you can point at a site map and it would just pick up the pages. In fact, I think it would be straightforward to modify the RSS connector to understand the sitemap format.

If you can do a little research to figure out whether this might work for you, I'd be willing to do some work and try to implement it.

Karl

On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias wrote:
> Hey folks,
>
> I am currently working on a project to create a basic search platform
> using Solr and ManifoldCF. One of the content repositories I need to
> index is a wiki (MediaWiki), and that's where I ran into a wall. I
> tried using the web connector, but simply crawling the pages resulted
> in a lot of content I don't need (navigation links, ...), and not all
> the information I wanted was gathered (author, last modified, ...).
> The only metadata I got was what is included in head/meta, which
> wasn't relevant.
>
> Is there another way to get the wiki's data, and more importantly, is
> there a way to get the right data into the right field? I know that
> there is a way to export the wiki pages as XML with wiki syntax, but
> I don't know how that would help me. I could simply use Solr's
> DataImportHandler to index a complete wiki dump, but it would be nice
> to use the same framework for every repository, especially since
> ManifoldCF manages all the recrawling.
>
> Does anybody have some experience in this direction, or any idea for
> a solution?
>
> Thanks in advance,
>
> Tobias
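For the dump route mentioned above, here is a rough, untested sketch of a Solr DataImportHandler data-config.xml that reads a MediaWiki XML export with the XPathEntityProcessor. The file path, field column names, and the exact export structure are assumptions to be checked against the actual dump:

    <dataConfig>
      <!-- read the export from disk rather than over HTTP -->
      <dataSource type="FileDataSource" encoding="UTF-8" />
      <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/path/to/wiki-export.xml">
          <!-- numeric page id and title from the export -->
          <field column="id"        xpath="/mediawiki/page/id" />
          <field column="title"     xpath="/mediawiki/page/title" />
          <!-- author and last-modified live on the revision element,
               which is exactly the metadata missing from head/meta -->
          <field column="author"    xpath="/mediawiki/page/revision/contributor/username" />
          <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" />
          <!-- raw wiki markup; would still need cleanup before indexing -->
          <field column="text"      xpath="/mediawiki/page/revision/text" />
        </entity>
      </document>
    </dataConfig>

Note the trade-off raised in the question: this gets author and last-modified into dedicated fields, but it bypasses ManifoldCF, so recrawl scheduling would have to be handled separately.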