incubator-connectors-user mailing list archives

From "Wunderlich, Tobias" <tobias.wunderl...@igd-r.fraunhofer.de>
Subject Re: Indexing Wikipedia/MediaWiki
Date Fri, 16 Sep 2011 11:54:55 GMT
Hey Karl,

Thanks for your quick reply. Modifying the RSSConnector seems like a valid approach for crawling
sitemaps.

Unfortunately, the wiki I have to index does not have a sitemap extension at the moment. Since
there is no static link that lists the available pages, I need to crawl a seed URL with a
hop count of at least 2. So I guess modifying the WebConnector for my needs will
be my next step?!

On another note, the release date for ManifoldCF 0.3 was yesterday, but the main page says that it
is still being reviewed by the developer community. The SVN repository has an rc0 and an rc1 version
... are there more to come, or is rc1 good to go?

Tobias

-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, 16 September 2011 11:32
To: connectors-user@incubator.apache.org
Subject: Re: Indexing Wikipedia/MediaWiki

It might be worth exploring sitemaps.

http://en.wikipedia.org/wiki/Site_map

It may be possible to create a connector, much like the RSS connector, that you could point
at a sitemap and it would just pick up the pages.
In fact, I think it would be straightforward to modify the RSS connector to understand the sitemap
format.
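
For reference, the sitemap protocol is plain XML, so pulling the page URLs out of one is simple. A minimal sketch in Python; the sample document and the example.org URLs are made up for illustration, and a real connector would of course fetch the sitemap over HTTP instead:

```python
# Sketch: extracting page URLs from a sitemaps.org-style sitemap.
# The <urlset>/<url>/<loc> structure and namespace follow the public
# sitemap protocol; the sample XML below is invented for illustration.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/wiki/Page_One</loc></url>
  <url><loc>http://example.org/wiki/Page_Two</loc></url>
</urlset>"""

def sitemap_urls(xml_text):
    """Return the list of <loc> URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.iter(SITEMAP_NS + "loc")
            if loc.text]

print(sitemap_urls(sample))
```

A connector would feed each extracted URL into the crawl queue, much as the RSS connector does with feed items.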

If you can do a little research to figure out if this might work for you, I'd be willing to
do some work and try to implement it.

Karl

On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias <tobias.wunderlich@igd-r.fraunhofer.de>
wrote:
> Hey folks,
>
>
>
> I am currently working on a project to create a basic search platform 
> using Solr and ManifoldCF. One of the content-repositories I need to 
> index is a wiki (MediaWiki) and that's where I ran into a wall. I 
> tried using the web-connector, but simply crawling the sites resulted
> in a lot of content I don't need (navigation links, etc.), and not all the
> information I wanted was gathered (author, last modified, etc.). The only
> metadata I got was what is included in head/meta, which wasn't relevant.
>
>
>
> Is there another way to get the wiki's data, and more importantly, is
> there a way to get the right data into the right field? I know that
> there is a way to export the wiki pages as XML with wiki syntax, but I
> don't know how that would help me. I could simply use Solr's
> DataImportHandler to index a complete wiki dump, but it would be nice
> to use the same framework for every repository, especially since
> ManifoldCF manages all the recrawling.
>
>
>
> Does anybody have some experience in this direction, or any idea for a 
> solution?
>
>
>
> Thanks in advance,
>
> Tobias
>
>
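
On the metadata question above: MediaWiki exposes the last editor and modification time of a page through its web API (api.php, action=query with prop=revisions), which avoids scraping head/meta entirely. A minimal sketch, assuming a standard api.php endpoint; the endpoint URL, page title, and JSON response below are invented for illustration:

```python
# Sketch: asking the MediaWiki API for a page's latest-revision metadata.
# action=query / prop=revisions / rvprop are standard MediaWiki API
# parameters; the sample response is made up to illustrate the parsing.
import json
from urllib.parse import urlencode

def revision_query_url(api_base, title):
    """Build the api.php URL requesting the newest revision's metadata."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "user|timestamp",
        "titles": title,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)

sample_response = json.loads("""{
  "query": {"pages": {"42": {"title": "Example_Page",
    "revisions": [{"user": "SomeEditor",
                   "timestamp": "2011-09-16T09:00:00Z"}]}}}
}""")

def last_revision(response):
    """Extract (user, timestamp) of the newest revision from a response."""
    page = next(iter(response["query"]["pages"].values()))
    rev = page["revisions"][0]
    return rev["user"], rev["timestamp"]

print(revision_query_url("http://example.org/w/api.php", "Example_Page"))
print(last_revision(sample_response))
```

A connector built on this could map the user and timestamp fields directly to Solr metadata fields, which is exactly the "right data into the right field" part that plain web crawling misses.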
