Subject: Re: Indexing Wikipedia/MediaWiki
From: Karl Wright
To: connectors-user@incubator.apache.org
Date: Fri, 16 Sep 2011 07:56:04 -0400

This looked easy enough that I just went ahead and implemented it.  If
you check out trunk and add site map document URLs to the "Feed URLs"
tab for an RSS job, it should locate the documents the sitemap points
at.  Furthermore, it should not chase links within those documents
unless the documents are themselves site map documents or RSS feeds.

Karl

On Fri, Sep 16, 2011 at 5:31 AM, Karl Wright wrote:
> It might be worth exploring sitemaps.
>
> http://en.wikipedia.org/wiki/Site_map
>
> It may be possible to create a connector, much like the RSS connector,
> that you can point at a site map and it would just pick up the pages.
> In fact, I think it would be straightforward to modify the RSS
> connector to understand sitemap format.
>
> If you can do a little research to figure out if this might work for
> you, I'd be willing to do some work and try to implement it.
>
> Karl
>
> On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias wrote:
>> Hey folks,
>>
>> I am currently working on a project to create a basic search platform
>> using Solr and ManifoldCF. One of the content repositories I need to
>> index is a wiki (MediaWiki), and that's where I ran into a wall. I
>> tried using the web connector, but simply crawling the sites resulted
>> in a lot of content I don't need (navigation links, …), and not all
>> the information I wanted (author, last modified, …) was gathered. The
>> only metadata I got was what was included in head/meta, which wasn't
>> relevant.
>>
>> Is there another way to get the wiki's data, and more importantly, is
>> there a way to get the right data into the right field? I know that
>> there is a way to export the wiki pages as XML with wiki syntax, but I
>> don't know how that would help me. I could simply use Solr's
>> DataImportHandler to index a complete wiki dump, but it would be nice
>> to use the same framework for every repository, especially since
>> Manifold manages all the recrawling.
>>
>> Does anybody have some experience in this direction, or any idea for
>> a solution?
>>
>> Thanks in advance,
>>
>> Tobias
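[Archive note] The sitemap format Karl refers to is the small XML schema defined by the sitemaps.org protocol: a `<urlset>` of `<url>` entries, each with a `<loc>` and optional metadata such as `<lastmod>`. The actual connector change lives in ManifoldCF's Java code; the following is only an illustrative Python sketch of what "locate the documents the sitemap points at" amounts to, using a made-up two-entry sitemap.

```python
import xml.etree.ElementTree as ET

# A minimal sitemap document (format per the sitemaps.org protocol).
# The two page URLs are placeholders for illustration.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://en.wikipedia.org/wiki/Site_map</loc>
    <lastmod>2011-09-16</lastmod>
  </url>
  <url>
    <loc>http://en.wikipedia.org/wiki/MediaWiki</loc>
  </url>
</urlset>"""

# Sitemap elements live in this XML namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return the list of <loc> URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

print(sitemap_urls(SITEMAP))
# ['http://en.wikipedia.org/wiki/Site_map', 'http://en.wikipedia.org/wiki/MediaWiki']
```

Each extracted URL would then be fetched as an ordinary document rather than followed for further links, which matches the behavior Karl describes for the modified RSS connector.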
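[Archive note] On Tobias's metadata question: MediaWiki also exposes per-page metadata (last editor, revision timestamp) through its `api.php` web API via `action=query` with `prop=revisions`, which the web connector cannot see in the rendered HTML. A minimal sketch of building such a query URL, assuming a standard MediaWiki install with `api.php` enabled (the base URL and page title here are placeholders):

```python
from urllib.parse import urlencode

def metadata_query_url(api_base, title):
    """Build an api.php URL asking for the last revision's user and timestamp."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "user|timestamp",  # author and last-modified metadata
        "titles": title,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)

print(metadata_query_url("http://en.wikipedia.org/w/api.php", "Site_map"))
```

Fetching and mapping the JSON response into Solr fields would still be custom work, but it is one way to get "the right data into the right field" without parsing rendered pages.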