incubator-connectors-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Indexing Wikipedia/MediaWiki
Date Fri, 16 Sep 2011 11:56:04 GMT
This looked easy enough that I just went ahead and implemented it.

If you check out trunk and add sitemap document URLs to the "Feed
URLs" tab for an RSS job, it should locate the documents the sitemap
points at.  Furthermore, it should not chase links within those
documents unless the documents are also sitemap documents or RSS
feeds in their own right.
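
For readers unfamiliar with the format, the behaviour described above can
be sketched roughly as follows. This is only an illustration in Python,
not the ManifoldCF code: the function name parse_sitemap and the sample
URLs are hypothetical, and the only thing taken from the sitemaps.org
protocol is the XML namespace and the urlset / sitemapindex element names.
A urlset lists pages to fetch; a sitemapindex lists further sitemaps,
which are the only links worth following.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sample documents, for illustration only.
EXAMPLE_URLSET = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://wiki.example.org/Main_Page</loc></url>
  <url><loc>http://wiki.example.org/Help</loc></url>
</urlset>
"""

EXAMPLE_INDEX = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://wiki.example.org/sitemap-1.xml</loc></sitemap>
</sitemapindex>
"""


def parse_sitemap(xml_text):
    """Return (page_urls, nested_sitemap_urls) for one sitemap document.

    Pages are documents to index; nested sitemaps are the only
    references that should be followed further.
    """
    root = ET.fromstring(xml_text.strip())
    pages, nested = [], []
    if root.tag == NS + "urlset":
        # Each <url><loc> names a page to fetch and index.
        for loc in root.findall(NS + "url/" + NS + "loc"):
            pages.append(loc.text.strip())
    elif root.tag == NS + "sitemapindex":
        # Each <sitemap><loc> names another sitemap to parse in turn.
        for loc in root.findall(NS + "sitemap/" + NS + "loc"):
            nested.append(loc.text.strip())
    return pages, nested


if __name__ == "__main__":
    print(parse_sitemap(EXAMPLE_URLSET))
    print(parse_sitemap(EXAMPLE_INDEX))
```

Links found inside the fetched pages themselves are deliberately ignored
in this sketch, which mirrors the "do not chase links" behaviour above.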

Karl

On Fri, Sep 16, 2011 at 5:31 AM, Karl Wright <daddywri@gmail.com> wrote:
> It might be worth exploring sitemaps.
>
> http://en.wikipedia.org/wiki/Site_map
>
> It may be possible to create a connector, much like the RSS connector,
> that you can point at a site map and it would just pick up the pages.
> In fact, I think it would be straightforward to modify the RSS
> connector to understand sitemap format.
>
> If you can do a little research to figure out if this might work for
> you, I'd be willing to do some work and try to implement it.
>
> Karl
>
> On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias
> <tobias.wunderlich@igd-r.fraunhofer.de> wrote:
>> Hey folks,
>>
>> I am currently working on a project to create a basic search platform using
>> Solr and ManifoldCF. One of the content repositories I need to index is a
>> wiki (MediaWiki), and that’s where I ran into a wall. I tried using the
>> web connector, but simply crawling the site returned a lot of content I
>> don’t need (navigation links, …) and missed some of the information I
>> wanted (author, last modified, …). The only metadata I got was what is
>> included in head/meta, which wasn’t relevant.
>>
>> Is there another way to get the wiki’s data and, more importantly, is there a
>> way to get the right data into the right field? I know that there is a way
>> to export the wiki pages as XML with wiki syntax, but I don’t know how that
>> would help me. I could simply use Solr’s DataImportHandler to index a
>> complete wiki dump, but it would be nice to use the same framework for every
>> repository, especially since ManifoldCF manages all the recrawling.
>>
>> Does anybody have some experience in this direction, or any idea for a
>> solution?
>>
>> Thanks in advance,
>>
>> Tobias
>>
>>
>