incubator-connectors-user mailing list archives

From "Wunderlich, Tobias" <tobias.wunderl...@igd-r.fraunhofer.de>
Subject Re: Indexing Wikipedia/MediaWiki
Date Fri, 16 Sep 2011 12:33:06 GMT
Hey Karl,

The main problem is that I don't get information about the author or the last-modified date,
since they are not included in head/meta. They can be found in the footer text, though.


The wiki has an extension to export every single page in an XML format, like this:
http://en.wikipedia.org/wiki/Special:Export/Coffee

That seems like a possible way to get the information I need, but I would have to use the
export extension for every page I want to index. Since there is no direct link to the export
version on any page, the crawler would need to construct the export URLs from the original
URLs. I think you integrated a post-document-fetching filter into the WebConnector. Would it
be possible to add a regex pattern-replace filter that modifies the fetched page URLs
accordingly?
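
To illustrate, something like this is what I have in mind (just a sketch, assuming the
standard /wiki/<Title> URL layout; the class and method names are made up):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExportUrlRewriter {

  // Matches a standard MediaWiki article URL, e.g. http://host/wiki/Coffee,
  // capturing the prefix up to /wiki/ and the page title.
  private static final Pattern ARTICLE_URL =
      Pattern.compile("^(https?://[^/]+/wiki/)([^?#]+)$");

  // Rewrites an article URL to its Special:Export equivalent,
  // or returns null if the URL doesn't match the expected layout.
  public static String toExportUrl(String articleUrl) {
    Matcher m = ARTICLE_URL.matcher(articleUrl);
    if (!m.matches() || m.group(2).startsWith("Special:"))
      return null;
    return m.group(1) + "Special:Export/" + m.group(2);
  }

  public static void main(String[] args) {
    // Prints: http://en.wikipedia.org/wiki/Special:Export/Coffee
    System.out.println(toExportUrl("http://en.wikipedia.org/wiki/Coffee"));
  }
}

The crawler would then fetch the rewritten export URL instead of (or in addition to) the
original page.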

Tobias

-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, September 16, 2011 2:00 PM
To: connectors-user@incubator.apache.org
Subject: Re: Indexing Wikipedia/MediaWiki

0.3-incubator RC1 has been successfully voted for release by the developer community, but
because it is an incubator project the Incubator also needs to vote, and that vote is still
pending. That can take quite a while, so I would feel comfortable going ahead and taking the
artifact from http://people.apache.org/~kwright and trying it out.

As far as your particular crawling problem is concerned, it would help if you could provide
more information as to what you wind up crawling that you don't want when you just do the
naive web crawl.

Karl


On Fri, Sep 16, 2011 at 7:54 AM, Wunderlich, Tobias <tobias.wunderlich@igd-r.fraunhofer.de>
wrote:
> Hey Karl,
>
> Thanks for your quick reply. Modifying the RSSConnector seems like a valid approach
> for crawling sitemaps.
>
> Unfortunately, the wiki I have to index does not have a sitemap extension at the moment.
> Because there is no static link that lists all available pages, I need to crawl a seed URL
> with a hop count of at least 2. So I guess modifying the WebConnector for my needs will be
> my next step?!
>
> On another note, the release date of MCF 0.3 was yesterday, but the main page says that
> it is still being reviewed by the developer community. The SVN repository has an rc0 and
> an rc1 version ... are there more to come, or is rc1 good to go?
>
> Tobias
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: Friday, September 16, 2011 11:32 AM
> To: connectors-user@incubator.apache.org
> Subject: Re: Indexing Wikipedia/MediaWiki
>
> It might be worth exploring sitemaps.
>
> http://en.wikipedia.org/wiki/Site_map
>
> It may be possible to create a connector, much like the RSS connector, that you can point
> at a site map and it would just pick up the pages.
> In fact, I think it would be straightforward to modify the RSS connector to understand
> sitemap format.
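>
> For reference, a minimal sitemap file looks like this (the URL and date are made up):
>
> <?xml version="1.0" encoding="UTF-8"?>
> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
>   <url>
>     <loc>http://www.example.com/wiki/Coffee</loc>
>     <lastmod>2011-09-01</lastmod>
>   </url>
> </urlset>
>
> The connector would fetch that file, walk the <url> entries, and queue each <loc> for
> crawling, much as the RSS connector does with feed items.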
>
> If you can do a little research to figure out if this might work for you, I'd be willing
> to do some work and try to implement it.
>
> Karl
>
> On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias <tobias.wunderlich@igd-r.fraunhofer.de>
> wrote:
>> Hey folks,
>>
>>
>>
>> I am currently working on a project to create a basic search platform
>> using Solr and ManifoldCF. One of the content repositories I need to
>> index is a wiki (MediaWiki), and that's where I ran into a wall. I
>> tried using the web connector, but simply crawling the pages resulted
>> in a lot of content I don't need (navigation links, ...), and not all
>> the information I wanted was gathered (author, last modified, ...). The
>> only metadata I got was what is included in head/meta, which wasn't relevant.
>>
>>
>>
>> Is there another way to get the wiki's data, and more importantly, is
>> there a way to get the right data into the right field? I know that
>> there is a way to export the wiki pages as XML with wiki-syntax, but
>> I don't know how that would help me. I could simply use Solr's
>> DataImportHandler to index a complete wiki dump, but it would be nice
>> to use the same framework for every repository, especially since ManifoldCF
>> manages all the recrawling.
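>>
>> For reference, the export XML looks roughly like this (values made up), with the
>> author and the last-modified date available as structured fields:
>>
>> <mediawiki xml:lang="en">
>>   <page>
>>     <title>Coffee</title>
>>     <revision>
>>       <timestamp>2011-09-10T12:00:00Z</timestamp>
>>       <contributor>
>>         <username>SomeUser</username>
>>       </contributor>
>>       <text>...article body in wiki-syntax...</text>
>>     </revision>
>>   </page>
>> </mediawiki>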
>>
>>
>>
>> Does anybody have experience in this direction, or any ideas for a
>> solution?
>>
>>
>>
>> Thanks in advance,
>>
>> Tobias
>>
>>
>
