incubator-connectors-user mailing list archives

From "Wunderlich, Tobias" <tobias.wunderl...@igd-r.fraunhofer.de>
Subject Re: Indexing Wikipedia/MediaWiki
Date Mon, 19 Sep 2011 12:21:17 GMT
I don't know whether it is possible to link to a wiki document using just the page ID, but it is
possible to get the URL for a given page ID via the API:
http://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=27697087&inprop=url
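
A minimal sketch of how that lookup could be built and read, assuming JSON output is requested with the format parameter. The function names are hypothetical, and the response layout (query -> pages -> page ID -> fullurl) is an assumption to check against a live response:

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def page_info_url(page_id: int, fmt: str = "json") -> str:
    """Build the prop=info query that asks for the URL of a page ID."""
    params = {
        "action": "query",
        "prop": "info",
        "pageids": page_id,
        "inprop": "url",
        "format": fmt,  # assumption: JSON output is wanted
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def extract_full_url(response: dict, page_id: int) -> str:
    """Pull the page URL out of an already-decoded JSON response.

    Assumption: pages are keyed by the page ID as a string and the
    URL is stored under "fullurl".
    """
    return response["query"]["pages"][str(page_id)]["fullurl"]
```

The request itself is left out on purpose; any HTTP client can fetch page_info_url(27697087) and pass the decoded JSON to extract_full_url.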

A new connector for crawling wikis sounds great. Could you create a new ticket? I'm not registered
at JIRA yet ...

Tobias
 

-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Monday, 19 September 2011 12:39
To: connectors-user@incubator.apache.org
Subject: Re: Indexing Wikipedia/MediaWiki

The only thing that concerns me about using a document's title as its document identifier
in ManifoldCF is the possibility of it being renamed.  For that reason the Page ID is preferable.
 But it doesn't sound like bad things would happen either way.
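
A rough sketch of keying metadata lookups on the page ID rather than the title, reusing the revisions query quoted further down in this thread. The function name is hypothetical, and nothing here is ManifoldCF code; it only shows that the same query accepts either key:

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs quoted in this thread.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def revisions_url(*, page_id=None, title=None) -> str:
    """Build a prop=revisions query keyed by page ID or by title."""
    if (page_id is None) == (title is None):
        raise ValueError("pass exactly one of page_id or title")
    params = {"action": "query", "prop": "revisions"}
    if page_id is not None:
        params["pageids"] = page_id  # stable even if the page is renamed
    else:
        params["titles"] = title  # changes when the page is renamed
    params["rvprop"] = "timestamp|user|comment|content"
    # Keep "|" unescaped so the URL matches the form used in the thread.
    return f"{API_ENDPOINT}?{urlencode(params, safe='|')}"
```

Storing the page ID as the document identifier and calling revisions_url(page_id=...) on each crawl would keep the lookup valid across renames.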

I'd like to suggest creating a JIRA ticket to describe a new connector for crawling wikis.
Then I may create a branch in which to work on this.  We're coming into conference season
so it may be some weeks before there's a connector to try, though.

Karl

On Mon, Sep 19, 2011 at 6:07 AM, Wunderlich, Tobias <tobias.wunderlich@igd-r.fraunhofer.de>
wrote:
>  (1) How do you form a URL that would take a user to a document?  Does it use the title,
> or does it use the page ID?
> I guess one way would be to just append the title to the base URL, like http://en.wikipedia.org/wiki/<title>.
> However, I have not yet found out how to create a URL to the document via the page ID.
>
>
>  (2) If the URL includes the page ID, is there any way to get metadata information about
> the document using the page ID directly?  It probably wouldn't be the query feature that
> would do this, btw.
>
> It is possible to get the metadata of a document using the page ID (instead of the title)
> directly:
> Title ->
> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment|content
> PageID ->
> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=27697087&rvprop=timestamp|user|comment|content
>
>
> Tobias
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: Monday, 19 September 2011 11:35
> To: connectors-user@incubator.apache.org
> Subject: Re: Indexing Wikipedia/MediaWiki
>
> The API seems to be built around using Titles as document keys, and yet there is a page
> ID also, which would probably be better at looking up page data.  So I have some new questions:
>
> (1) How do you form a URL that would take a user to a document?  Does it use the title,
> or does it use the page ID?
> (2) If the URL includes the page ID, is there any way to get metadata information about
> the document using the page ID directly?  It probably wouldn't be the query feature that
> would do this, btw.
>
> Thanks,
> Karl
>
>
> On Mon, Sep 19, 2011 at 5:09 AM, Wunderlich, Tobias <tobias.wunderlich@igd-r.fraunhofer.de> wrote:
>> Hey Karl,
>>
>> I did some research and the MediaWiki API looks promising:
>>
>> - There needs to be some notion of an overall list of pages:
>>        - http://www.mediawiki.org/wiki/API:Allpages
>>        - Example:
>> http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Kre&aplimit=5
>>
>> - Metadata information (author and pub date) also needs to be separated out in some way:
>>        - http://www.mediawiki.org/wiki/API:Properties#Revisions:_Example
>>        - Example:
>> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API|Main%20Page&rvprop=timestamp|user|comment|content
>>
>> What do you think?
>>
>> Tobias
>>
>>
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:daddywri@gmail.com]
>> Sent: Friday, 16 September 2011 16:11
>> To: Sumana Harihareswara
>> Cc: Wunderlich, Tobias
>> Subject: Re: MediaWiki & Lucene development
>>
>> The lucene-search extension may or may not be appropriate for Tobias.
>> But my interest would extend towards getting wiki content into whatever target
>> ManifoldCF sets up, not just Solr/Lucene.  In order to do this the following needs to be
>> addressed:
>>
>> - There needs to be some notion of an overall list of pages, 
>> preferably queryable by date and time of last change;
>> - We'd need access, per page, to authorization information
>> - Metadata information (author and pub date) also needs to be 
>> separated out in some way
>>
>> The plugin that Tobias mentioned seems to do the last item fine, but not the first
>> two.  Do you have a solution for those?
>>
>> Thanks,
>> Karl
>>
>> On Fri, Sep 16, 2011 at 9:40 AM, Sumana Harihareswara <sumanah@wikimedia.org> wrote:
>>> Hi.  I happened to see you both discussing MediaWiki and 
>>> search/indexing in a mailing list recently.
>>>
>>> You might be interested in asking your question to the 
>>> MediaWiki/Wikimedia developers' list
>>>
>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>
>>> and I'd also welcome any assistance in improving our Lucene search 
>>> extension, which is used on Wikipedia:
>>>
>>> http://www.mediawiki.org/wiki/Extension:Lucene-search
>>>
>>> Thanks!
>>>
>>> --
>>> Sumana Harihareswara
>>> Volunteer Development Coordinator
>>> Wikimedia Foundation
>>>
>>
>
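
The list=allpages query quoted earlier in the thread could seed a crawl along these lines. This is only a sketch: the function names are hypothetical, the response layout (query -> allpages, each entry carrying pageid and title) is an assumption for JSON output, and paging through the full list is not shown:

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs quoted in this thread.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def allpages_url(start_from: str, limit: int = 5) -> str:
    """Build the list=allpages query starting at a given title prefix."""
    params = {
        "action": "query",
        "list": "allpages",
        "apfrom": start_from,
        "aplimit": limit,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def page_ids_and_titles(response: dict) -> list:
    """Collect (page ID, title) pairs from an already-decoded JSON response.

    Assumption: entries live under query -> allpages.
    """
    return [(p["pageid"], p["title"]) for p in response["query"]["allpages"]]
```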
