incubator-connectors-user mailing list archives

From: Karl Wright <daddy...@gmail.com>
Subject: Re: Indexing Wikipedia/MediaWiki
Date: Fri, 16 Sep 2011 09:31:45 GMT
It might be worth exploring sitemaps.

http://en.wikipedia.org/wiki/Site_map

It may be possible to create a connector, much like the RSS connector,
that you could point at a sitemap and that would just pick up the pages
it lists. In fact, I think it would be straightforward to modify the
RSS connector to understand the sitemap format.
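
For reference, a sitemap is just an XML file listing page URLs, so pulling
the candidate documents out of one is a short exercise. Here is a minimal
sketch in Java (the sitemap URL is hypothetical, and this ignores sitemap
index files and gzipped sitemaps, which a real connector would have to
handle):

import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SitemapReader {
  public static void main(String[] args) throws Exception {
    // Hypothetical sitemap location; point this at a real sitemap.xml.
    InputStream in = new URL("http://example.com/sitemap.xml").openStream();
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(in);
    // Every <url> entry in a sitemap carries a <loc> element with the page URL.
    NodeList locs = doc.getElementsByTagName("loc");
    for (int i = 0; i < locs.getLength(); i++)
      System.out.println(locs.item(i).getTextContent());
    in.close();
  }
}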

If you can do a little research to figure out if this might work for
you, I'd be willing to do some work and try to implement it.

Karl

On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias
<tobias.wunderlich@igd-r.fraunhofer.de> wrote:
> Hey folks,
>
> I am currently working on a project to create a basic search platform using
> Solr and ManifoldCF. One of the content repositories I need to index is a
> wiki (MediaWiki), and that’s where I ran into a wall. I tried using the
> web connector, but simply crawling the pages brought in a lot of content I
> don’t need (navigation links, …), and not all the information I wanted was
> captured (author, last modified, …). The only metadata I got was what is
> included in head/meta, which wasn’t relevant.
>
> Is there another way to get the wiki’s data, and, more importantly, is
> there a way to get the right data into the right field? I know that the
> wiki pages can be exported as XML with wiki syntax, but I don’t know how
> that would help me. I could simply use Solr’s DataImportHandler to index a
> complete wiki dump, but it would be nice to use the same framework for
> every repository, especially since ManifoldCF manages all the recrawling.
>
> Does anybody have some experience in this direction, or any idea for a
> solution?
>
> Thanks in advance,
>
> Tobias
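
On the metadata question above: MediaWiki also exposes a query API
(api.php) that returns the last editor and timestamp of a page directly,
without scraping the rendered HTML. A minimal sketch (the wiki URL and
page title are only placeholders, and a real connector would parse the
XML response rather than print it):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class WikiMetadataFetch {
  public static void main(String[] args) throws Exception {
    // Ask the MediaWiki API for the last revision's author and timestamp.
    // en.wikipedia.org and the title are illustrative; use your wiki's api.php.
    String title = URLEncoder.encode("Main Page", "UTF-8");
    URL url = new URL("http://en.wikipedia.org/w/api.php"
        + "?action=query&prop=revisions&rvprop=user%7Ctimestamp"
        + "&format=xml&titles=" + title);
    BufferedReader r = new BufferedReader(
        new InputStreamReader(url.openStream(), "UTF-8"));
    String line;
    while ((line = r.readLine()) != null)
      System.out.println(line); // XML containing <rev user="..." timestamp="..."/>
    r.close();
  }
}

The same API can be paged through with list=allpages to enumerate every
document in the wiki, which would sidestep the navigation-link noise of a
plain web crawl.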
