lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject Re: Indexing Wikipedia dumps
Date Fri, 14 Dec 2007 08:08:32 GMT

Good pointers, thanks. I asked because I did have a problem like this a few 
months ago -- none of the existing parsers solved it for me (back then).

D.

Petite Abeille wrote:
> 
> On Dec 13, 2007, at 8:39 AM, Dawid Weiss wrote:
> 
>> Just incidentally -- do you know of something that would parse the 
>> wikipedia markup (to plain text, for example)?
> 
> If you find out, let us know :)
> 
> You may want to check the partial ANTLR grammar for Wikitext:
> 
> http://www.mediawiki.org/wiki/User:Stevage/ANTLR
> http://lists.wikimedia.org/pipermail/wikitext-l/2007-December/000117.html
> 
> This also might be of interest:
> 
> http://www.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html
> 
> "the nice people over at woc.fslab.de have created a standalone 
> wiki-markup parser which is ready for use"
> http://fslab.de/svn/wpofflineclient/trunk/mediawiki_sa
> There is also Text::MediawikiFormat:
> http://search.cpan.org/~dprice/Text-MediawikiFormat-0.05/lib/Text/MediawikiFormat.pm

> 
> Perhaps you will be better off processing the Wikipedia static HTML 
> dump, instead of the XML one:
> http://static.wikipedia.org/
> Not a piece of cake one way or another :(
> Cheers,
> PA.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message