lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Petite Abeille <petite_abei...@mac.com>
Subject Re: Indexing Wikipedia dumps
Date Thu, 13 Dec 2007 17:56:02 GMT

On Dec 13, 2007, at 8:39 AM, Dawid Weiss wrote:

> Just incidentally -- do you know of something that would parse the  
> wikipedia markup (to plain text, for example)?

If you find out, let us know :)

You may want to check the partial ANTLR grammar for Wikitext:

http://www.mediawiki.org/wiki/User:Stevage/ANTLR
http://lists.wikimedia.org/pipermail/wikitext-l/2007-December/000117.html

This also might be of interest:

http://www.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html

"the nice people over at woc.fslab.de have created a standalone wiki- 
markup parser which is ready for use"
http://fslab.de/svn/wpofflineclient/trunk/mediawiki_sa
There is also Text::MediawikiFormat:
http://search.cpan.org/~dprice/Text-MediawikiFormat-0.05/lib/Text/MediawikiFormat.pm
Perhaps you will be better off processing the Wikipedia static HTML  
dump, instead of the XML one:
http://static.wikipedia.org/
Not a piece of cake one way or another :(
Cheers,
PA.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message