incubator-connectors-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Indexing Wikipedia/MediaWiki
Date Fri, 16 Sep 2011 11:59:39 GMT
0.3-incubator RC1 has been successfully voted for release by the
developer community, but because it is an incubator project, the
incubator also needs to vote, and that is still pending.  That can
take quite a while, so I would feel comfortable with you going ahead
and taking the artifact from http://people.apache.org/~kwright and
trying it out.

As far as your particular crawling problem is concerned, it would help
if you could provide more information as to what you wind up crawling
that you don't want when you just do the naive web crawl.

Karl


On Fri, Sep 16, 2011 at 7:54 AM, Wunderlich, Tobias
<tobias.wunderlich@igd-r.fraunhofer.de> wrote:
> Hey Karl,
>
> Thanks for your quick reply. Modifying the RSSConnector seems like a valid
> approach for crawling sitemaps.
>
> Unfortunately, the wiki I have to index does not have a sitemap extension at the
> moment. Because there is no static link to get a list of available pages, I need
> to crawl a seed URL with a hop count of at least 2. So I guess modifying the
> WebConnector for my personal needs will be my next step?!
>
> On another note, the release date of MCF 0.3 was yesterday, but the main page
> says that it is still being reviewed by the developer community. The SVN
> repository has an rc0 and an rc1 version ... are there more to come, or is
> rc1 good to go?
>
> Tobias
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: Friday, 16 September 2011 11:32
> To: connectors-user@incubator.apache.org
> Subject: Re: Indexing Wikipedia/MediaWiki
>
> It might be worth exploring sitemaps.
>
> http://en.wikipedia.org/wiki/Site_map
>
> It may be possible to create a connector, much like the RSS connector, that
> you can point at a site map and it would just pick up the pages. In fact, I
> think it would be straightforward to modify the RSS connector to understand
> sitemap format.
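>
> For reference, the sitemap protocol is a very small XML format; a minimal
> file looks like this (the URL is a placeholder):
>
>   <?xml version="1.0" encoding="UTF-8"?>
>   <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
>     <url>
>       <loc>http://example.com/wiki/Some_Page</loc>
>       <lastmod>2011-09-16</lastmod>
>     </url>
>   </urlset>
>
> And here is a rough sketch of the extraction step such a connector would have
> to do, using plain JAXP (class and method names are just illustrative):
>
>   import java.io.InputStream;
>   import java.util.ArrayList;
>   import java.util.List;
>   import javax.xml.parsers.DocumentBuilderFactory;
>   import org.w3c.dom.Document;
>   import org.w3c.dom.NodeList;
>
>   public class SitemapParser {
>     private static final String NS = "http://www.sitemaps.org/schemas/sitemap/0.9";
>
>     // Parse a sitemap stream and return the page URLs it lists.
>     public static List<String> extractUrls(InputStream sitemap) throws Exception {
>       DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
>       f.setNamespaceAware(true);
>       Document doc = f.newDocumentBuilder().parse(sitemap);
>       NodeList locs = doc.getElementsByTagNameNS(NS, "loc");
>       List<String> urls = new ArrayList<String>();
>       for (int i = 0; i < locs.getLength(); i++)
>         urls.add(locs.item(i).getTextContent().trim());
>       return urls;
>     }
>   }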
>
> If you can do a little research to figure out if this might work for you,
> I'd be willing to do some work and try to implement it.
>
> Karl
>
> On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias
> <tobias.wunderlich@igd-r.fraunhofer.de> wrote:
>> Hey folks,
>>
>>
>>
>> I am currently working on a project to create a basic search platform
>> using Solr and ManifoldCF. One of the content repositories I need to
>> index is a wiki (MediaWiki), and that's where I ran into a wall. I
>> tried using the web connector, but simply crawling the sites resulted
>> in a lot of content I don't need (navigation links, …), and not all the
>> information I wanted was gathered (author, last modified, …). The only
>> metadata I got was what was included in head/meta, which wasn't relevant.
>>
>>
>>
>> Is there another way to get the wiki's data, and more importantly, is
>> there a way to get the right data into the right field? I know that
>> there is a way to export the wiki pages as XML with wiki syntax, but I
>> don't know how that would help me. I could simply use Solr's
>> DataImportHandler to index a complete wiki dump, but it would be nice
>> to use the same framework for every repository, especially since Manifold
>> manages all the recrawling.
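>>
>> A page in such a dump looks roughly like this (abridged):
>>
>>   <mediawiki>
>>     <page>
>>       <title>Some Page</title>
>>       <revision>
>>         <timestamp>2011-09-16T09:00:00Z</timestamp>
>>         <contributor><username>someuser</username></contributor>
>>         <text>wiki markup here ...</text>
>>       </revision>
>>     </page>
>>   </mediawiki>
>>
>> so a DataImportHandler config along these lines could map it to fields,
>> assuming Solr's XPathEntityProcessor (field and file names are just
>> illustrative):
>>
>>   <dataConfig>
>>     <dataSource type="FileDataSource" encoding="UTF-8" />
>>     <document>
>>       <entity name="page" processor="XPathEntityProcessor" stream="true"
>>               forEach="/mediawiki/page/" url="/path/to/wiki-dump.xml"
>>               transformer="DateFormatTransformer">
>>         <field column="title"         xpath="/mediawiki/page/title" />
>>         <field column="author"        xpath="/mediawiki/page/revision/contributor/username" />
>>         <field column="last_modified" xpath="/mediawiki/page/revision/timestamp"
>>                dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
>>         <field column="text"          xpath="/mediawiki/page/revision/text" />
>>       </entity>
>>     </document>
>>   </dataConfig>
>>
>> But that route would bypass ManifoldCF's recrawl scheduling entirely, which
>> is exactly what I'd like to keep.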
>>
>>
>>
>> Does anybody have some experience in this direction, or any idea for a
>> solution?
>>
>>
>>
>> Thanks in advance,
>>
>> Tobias
>>
>>
>
