From: "Wunderlich, Tobias"
To: "connectors-user@incubator.apache.org"
Subject: Re: Indexing Wikipedia/MediaWiki
Date: Fri, 16 Sep 2011 11:54:55 +0000

Hey Karl,

Thanks for your quick reply. Modifying the RSSConnector seems like a valid approach for crawling sitemaps.

Unfortunately, the wiki I have to index does not have a sitemap extension at the moment. Because there is no static link that yields a list of available pages, I need to crawl a seed URL with a hop count of at least 2. So I guess modifying the WebConnector for my personal needs will be my next step?!

On another note, the release date of MCF 0.3 was yesterday, but the main page says that it is still being reviewed by the developer community. The svn repository has an rc0 and an rc1 version ... are there more to come, or is rc1 good to go?
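For reference, a sitemap in the sitemaps.org format is just a flat XML list of URLs, structurally quite close to an RSS feed, which is presumably why adapting the RSSConnector looks feasible. A minimal sketch (the wiki URL below is a made-up placeholder, and only <loc> is required per entry):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <!-- the page address; this is the only mandatory element -->
        <loc>http://wiki.example.org/index.php/Main_Page</loc>
        <!-- optional hints a crawler may use for scheduling -->
        <lastmod>2011-09-16</lastmod>
        <changefreq>daily</changefreq>
      </url>
    </urlset>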
Tobias

-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, 16 September 2011 11:32
To: connectors-user@incubator.apache.org
Subject: Re: Indexing Wikipedia/MediaWiki

It might be worth exploring sitemaps. http://en.wikipedia.org/wiki/Site_map

It may be possible to create a connector, much like the RSS connector, that you can point at a site map and it would just pick up the pages. In fact, I think it would be straightforward to modify the RSS connector to understand the sitemap format.

If you can do a little research to figure out whether this might work for you, I'd be willing to do some work and try to implement it.

Karl

On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias wrote:
> Hey folks,
>
> I am currently working on a project to create a basic search platform
> using Solr and ManifoldCF. One of the content repositories I need to
> index is a wiki (MediaWiki), and that's where I ran into a wall. I
> tried using the web connector, but simply crawling the pages resulted
> in a lot of content I don't need (navigation links, ...), and not all
> the information I wanted was gathered (author, last modified, ...).
> The only metadata I got was what is included in head/meta, which
> wasn't relevant.
>
> Is there another way to get the wiki's data, and more importantly, is
> there a way to get the right data into the right field? I know that
> there is a way to export the wiki pages as XML with wiki syntax, but
> I don't know how that would help me. I could simply use Solr's
> DataImportHandler to index a complete wiki dump, but it would be nice
> to use the same framework for every repository, especially since
> ManifoldCF manages all the recrawling.
>
> Does anybody have some experience in this direction, or any idea for
> a solution?
>
> Thanks in advance,
>
> Tobias
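For the dump route mentioned above, here is a rough, untested sketch of a Solr DataImportHandler data-config.xml that reads a MediaWiki XML export with the XPathEntityProcessor. The file path, field column names, and the exact export structure are assumptions to be checked against the actual dump:

    <dataConfig>
      <!-- read the export from disk rather than over HTTP -->
      <dataSource type="FileDataSource" encoding="UTF-8" />
      <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/path/to/wiki-export.xml">
          <!-- numeric page id and title from the export -->
          <field column="id"        xpath="/mediawiki/page/id" />
          <field column="title"     xpath="/mediawiki/page/title" />
          <!-- author and last-modified live on the revision element,
               which is exactly the metadata missing from head/meta -->
          <field column="author"    xpath="/mediawiki/page/revision/contributor/username" />
          <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" />
          <!-- raw wiki markup; would still need cleanup before indexing -->
          <field column="text"      xpath="/mediawiki/page/revision/text" />
        </entity>
      </document>
    </dataConfig>

Note the trade-off raised in the question: this gets author and last-modified into dedicated fields, but it bypasses ManifoldCF, so recrawl scheduling would have to be handled separately.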