Subject: Re: Indexing Wikipedia/MediaWiki
From: Karl Wright
To: connectors-user@incubator.apache.org
Date: Fri, 16 Sep 2011 07:56:04 -0400

This looked easy enough that I just went ahead and implemented it.  If
you check out trunk and add site map document URLs to the "Feed URLs"
tab for an RSS job, it should locate the documents the sitemap points
at.  Furthermore, it should not chase links within those documents
unless the documents are themselves site map documents or RSS feeds.

Karl

On Fri, Sep 16, 2011 at 5:31 AM, Karl Wright wrote:
> It might be worth exploring sitemaps.
>
> http://en.wikipedia.org/wiki/Site_map
>
> It may be possible to create a connector, much like the RSS connector,
> that you can point at a site map and it would just pick up the pages.
> In fact, I think it would be straightforward to modify the RSS
> connector to understand sitemap format.
>
> If you can do a little research to figure out if this might work for
> you, I'd be willing to do some work and try to implement it.
>
> Karl
>
> On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias wrote:
>> Hey folks,
>>
>> I am currently working on a project to create a basic search platform
>> using Solr and ManifoldCF. One of the content repositories I need to
>> index is a wiki (MediaWiki), and that's where I ran into a wall. I
>> tried using the web connector, but simply crawling the sites resulted
>> in a lot of content I don't need (navigation links, …), and not all
>> the information I wanted (author, last modified, …) was gathered. The
>> only metadata I got was what was included in head/meta, which wasn't
>> relevant.
>>
>> Is there another way to get the wiki's data, and more importantly, is
>> there a way to get the right data into the right field? I know that
>> there is a way to export the wiki pages as XML with wiki syntax, but I
>> don't know how that would help me. I could simply use Solr's
>> DataImportHandler to index a complete wiki dump, but it would be nice
>> to use the same framework for every repository, especially since
>> Manifold manages all the recrawling.
>>
>> Does anybody have some experience in this direction, or any idea for
>> a solution?
>>
>> Thanks in advance,
>>
>> Tobias
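[Archive note] The sitemap format Karl refers to is the small XML schema defined by the sitemaps.org protocol: a `<urlset>` of `<url>` entries, each with a `<loc>` and optional metadata such as `<lastmod>`. The actual connector change lives in ManifoldCF's Java code; the following is only an illustrative Python sketch of what "locate the documents the sitemap points at" amounts to, using a made-up two-entry sitemap.

```python
import xml.etree.ElementTree as ET

# A minimal sitemap document (format per the sitemaps.org protocol).
# The two page URLs are placeholders for illustration.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://en.wikipedia.org/wiki/Site_map</loc>
    <lastmod>2011-09-16</lastmod>
  </url>
  <url>
    <loc>http://en.wikipedia.org/wiki/MediaWiki</loc>
  </url>
</urlset>"""

# Sitemap elements live in this XML namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return the list of <loc> URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

print(sitemap_urls(SITEMAP))
# ['http://en.wikipedia.org/wiki/Site_map', 'http://en.wikipedia.org/wiki/MediaWiki']
```

Each extracted URL would then be fetched as an ordinary document rather than followed for further links, which matches the behavior Karl describes for the modified RSS connector.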
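[Archive note] On Tobias's metadata question: MediaWiki also exposes per-page metadata (last editor, revision timestamp) through its `api.php` web API via `action=query` with `prop=revisions`, which the web connector cannot see in the rendered HTML. A minimal sketch of building such a query URL, assuming a standard MediaWiki install with `api.php` enabled (the base URL and page title here are placeholders):

```python
from urllib.parse import urlencode

def metadata_query_url(api_base, title):
    """Build an api.php URL asking for the last revision's user and timestamp."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "user|timestamp",  # author and last-modified metadata
        "titles": title,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)

print(metadata_query_url("http://en.wikipedia.org/w/api.php", "Site_map"))
```

Fetching and mapping the JSON response into Solr fields would still be custom work, but it is one way to get "the right data into the right field" without parsing rendered pages.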