incubator-connectors-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Indexing Wikipedia/MediaWiki
Date Fri, 16 Sep 2011 12:40:50 GMT
I'm going to need to think about this for a bit.  Wikipedia URL
conventions may well be idiosyncratic enough that it makes sense to
come up with a Wiki-specific connector that just knows how to crawl
wikipedias.  Normally I wouldn't go so far, but I've seen a number of
other people struggle with this crawling task, so trying to do it with
the generic tools available seems less and less worthwhile to me.

Karl


On Fri, Sep 16, 2011 at 8:33 AM, Wunderlich, Tobias
<tobias.wunderlich@igd-r.fraunhofer.de> wrote:
> Hey Karl,
>
> The main problem is that I don't get information about the author and the last-modified
> date, since they are not included in head/meta. They can be found in the footer (text), though.
>
> The wiki has an extension to export every single page into an XML format, like this:
> http://en.wikipedia.org/wiki/Special:Export/Coffee
>
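> For illustration, the export XML carries exactly the fields I'm missing. Assuming the standard
> MediaWiki export schema (<page>/<revision>/<timestamp>/<contributor>), something like the rough
> sketch below would pull them out; the class name is my own, and this is not ManifoldCF code:
>
>   import java.io.InputStream;
>   import java.net.URL;
>   import javax.xml.parsers.DocumentBuilderFactory;
>   import org.w3c.dom.Document;
>
>   public class ExportMetadata {
>     public static void main(String[] args) throws Exception {
>       InputStream in =
>         new URL("http://en.wikipedia.org/wiki/Special:Export/Coffee").openStream();
>       Document doc =
>         DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
>       in.close();
>       // Latest revision timestamp and contributor; anonymous edits carry an <ip>
>       // element instead of <username>, which this sketch does not handle.
>       System.out.println("last modified: "
>         + doc.getElementsByTagName("timestamp").item(0).getTextContent());
>       System.out.println("author: "
>         + doc.getElementsByTagName("username").item(0).getTextContent());
>     }
>   }
>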
> That seems to me like a possible way to get the information I need, but to use that extension
> I would need to call it for every page I want to index. Since there is no direct link to the
> export extension on any page, the crawler would need to create the export URLs from the
> original page URLs. I think you integrated a post-document-fetching filter into the
> WebConnector. Would it be possible to integrate a regex pattern-replace filter for the fetched
> page URLs to modify them?
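>
> Just to make the idea concrete, the rewrite I have in mind would look something like the sketch
> below. It is purely illustrative: the class name and regex are my own assumptions, not part of
> the ManifoldCF web connector, and the pattern assumes the usual /wiki/<title> URL layout.
>
>   import java.util.regex.Matcher;
>   import java.util.regex.Pattern;
>
>   public class ExportUrlRewriter {
>     // Matches e.g. http://en.wikipedia.org/wiki/Coffee, capturing host and page title,
>     // and skips URLs that already point at a Special: page.
>     private static final Pattern ARTICLE_URL =
>       Pattern.compile("^(https?://[^/]+)/wiki/(?!Special:)(.+)$");
>
>     public static String toExportUrl(String articleUrl) {
>       Matcher m = ARTICLE_URL.matcher(articleUrl);
>       if (m.matches())
>         return m.group(1) + "/wiki/Special:Export/" + m.group(2);
>       return articleUrl; // leave non-article URLs untouched
>     }
>   }
>
> So http://en.wikipedia.org/wiki/Coffee would become
> http://en.wikipedia.org/wiki/Special:Export/Coffee before fetching.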
>
> Tobias
>
>
>
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: Friday, September 16, 2011 14:00
> To: connectors-user@incubator.apache.org
> Subject: Re: Indexing Wikipedia/MediaWiki
>
> 0.3-incubator RC1 has been successfully voted for release by the developer community,
> but because it is an incubator project the incubator also needs to vote, and that is still
> pending.  That can take quite a while, so I would feel comfortable going ahead and taking
> the artifact from http://people.apache.org/~kwright and trying it out.
>
> As far as your particular crawling problem is concerned, it would help if you could provide
> more information as to what you wind up crawling that you don't want when you just do the
> naive web crawl.
>
> Karl
>
>
> On Fri, Sep 16, 2011 at 7:54 AM, Wunderlich, Tobias <tobias.wunderlich@igd-r.fraunhofer.de> wrote:
>> Hey Karl,
>>
>> Thanks for your quick reply. Modifying the RSSConnector seems like a valid approach
>> for crawling sitemaps.
>>
>> Unfortunately the wiki I have to index does not have a sitemap extension at the moment.
>> Because there is no static link to get a list of available pages, I need to crawl a seed URL
>> with a hop count of at least 2. So I guess modifying the WebConnector for my personal needs
>> will be my next step?!
>>
>> On another note, the release date of MCF 0.3 was yesterday, but the main page says that it
>> is still being reviewed by the developer community. The svn repository has an rc0 and an rc1
>> version ... are there more to come, or is rc1 good to go?
>>
>> Tobias
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:daddywri@gmail.com]
>> Sent: Friday, September 16, 2011 11:32
>> To: connectors-user@incubator.apache.org
>> Subject: Re: Indexing Wikipedia/MediaWiki
>>
>> It might be worth exploring sitemaps.
>>
>> http://en.wikipedia.org/wiki/Site_map
>>
>> It may be possible to create a connector, much like the RSS connector, that you can
>> point at a site map and it would just pick up the pages.  In fact, I think it would be
>> straightforward to modify the RSS connector to understand sitemap format.
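>>
>> Roughly, all such a connector would have to do is walk the <loc> entries in the sitemap. A
>> rough sketch of just that step (plain JAXP, not actual connector code, and the class name is
>> made up) would be:
>>
>>   import java.io.InputStream;
>>   import java.net.URL;
>>   import java.util.ArrayList;
>>   import java.util.List;
>>   import javax.xml.parsers.DocumentBuilderFactory;
>>   import org.w3c.dom.Document;
>>   import org.w3c.dom.NodeList;
>>
>>   public class SitemapReader {
>>     // Collect the page URLs listed in a sitemap.xml (<urlset>/<url>/<loc>).
>>     public static List<String> readLocs(String sitemapUrl) throws Exception {
>>       List<String> urls = new ArrayList<String>();
>>       InputStream in = new URL(sitemapUrl).openStream();
>>       Document doc =
>>         DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
>>       in.close();
>>       NodeList locs = doc.getElementsByTagName("loc");
>>       for (int i = 0; i < locs.getLength(); i++)
>>         urls.add(locs.item(i).getTextContent().trim());
>>       return urls;
>>     }
>>   }
>>
>> Sitemap index files and gzipped sitemaps would need a little more handling, but the format
>> itself is that simple.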
>>
>> If you can do a little research to figure out if this might work for you, I'd be
>> willing to do some work and try to implement it.
>>
>> Karl
>>
>> On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias <tobias.wunderlich@igd-r.fraunhofer.de> wrote:
>>> Hey folks,
>>>
>>>
>>>
>>> I am currently working on a project to create a basic search platform
>>> using Solr and ManifoldCF. One of the content repositories I need to
>>> index is a wiki (MediaWiki), and that's where I ran into a wall. I
>>> tried using the web connector, but simply crawling the pages resulted
>>> in a lot of content I don't need (navigation links, etc.), and not all
>>> the information I wanted was gathered (author, last modified, etc.). The
>>> only metadata I got was what is included in head/meta, which wasn't relevant.
>>>
>>>
>>>
>>> Is there another way to get the wiki's data, and more importantly, is
>>> there a way to get the right data into the right field? I know that
>>> there is a way to export the wiki pages as XML with wiki syntax, but
>>> I don't know how that would help me. I could simply use Solr's
>>> DataImportHandler to index a complete wiki dump, but it would be nice
>>> to use the same framework for every repository, especially since ManifoldCF
>>> manages all the recrawling.
>>>
>>>
>>>
>>> Does anybody have some experience in this direction, or any idea for
>>> a solution?
>>>
>>>
>>>
>>> Thanks in advance,
>>>
>>> Tobias
>>>
>>>
>>
>
