lucene-java-user mailing list archives

From "Marcelo Ochoa" <marcelo.oc...@gmail.com>
Subject Re: Indexing Wikipedia dumps
Date Tue, 18 Dec 2007 18:31:20 GMT
Hi All:
  Just to add a simple hack: I posted an entry on my blog named
"Uploading WikiPedia Dumps to Oracle databases":
http://marceloochoa.blogspot.com/2007_12_01_archive.html
  It has instructions for uploading WikiPedia dumps to Oracle XMLDB,
which means transforming the XML file into object-relational storage.
  Finally, I added instructions for indexing it with Lucene Domain Index.
  Best regards, Marcelo.
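
For readers who want to stay outside the database, the first step (getting
pages out of the XML dump) can be sketched with the JDK's StAX parser. This is
only an illustration, not the method from the blog entry: the class name
WikiDumpSketch and the Page holder are made up for this example, and it assumes
the usual dump layout of <page> elements containing a <title> and a
<revision>/<text>. The text handed to the callback is still raw wiki markup;
converting it to plain text is the open problem discussed in the quoted thread.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Lucene or of the blog entry above.
final class WikiDumpSketch {

    /** One page of the dump: its title and the raw (unparsed) wiki markup. */
    static final class Page {
        final String title;
        final String text;

        Page(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /**
     * Streams the dump and passes each page to the sink as soon as its
     * closing tag is seen, so the whole file is never held in memory.
     * Assumes <title> and <text> occur only inside <page> elements.
     */
    static void readPages(Reader in, Consumer<Page> sink) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The dump has no DTD; switching DTD support off is a precaution.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader xml = factory.createXMLStreamReader(in);
            String title = null;
            String text = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if ("page".equals(name)) {
                        title = null;
                        text = null;
                    } else if ("title".equals(name)) {
                        title = xml.getElementText();
                    } else if ("text".equals(name)) {
                        text = xml.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "page".equals(xml.getLocalName())) {
                    sink.accept(new Page(title == null ? "" : title,
                                         text == null ? "" : text));
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            // Rethrown unchecked so the method is easy to call from a lambda.
            throw new IllegalArgumentException("Malformed dump XML", e);
        }
    }

    public static void main(String[] args) {
        // Tiny stand-in for a real dump file.
        String sample = "<mediawiki><page><title>Example</title>"
                + "<revision><text>'''Example''' is a [[page]].</text></revision>"
                + "</page></mediawiki>";
        readPages(new StringReader(sample),
                page -> System.out.println(page.title + ": " + page.text));
    }
}
```

In a real indexer the callback is where each page would be turned into a
Lucene document; for a full dump, the StringReader would be replaced by a
reader over the (decompressed) dump file.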

On Dec 14, 2007 5:08 AM, Dawid Weiss <dawid.weiss@cs.put.poznan.pl> wrote:
>
> Good pointers, thanks. I asked because I did have a problem like this a few
> months ago -- none of the existing parsers solved it for me (back then).
>
> D.
>
>
> Petite Abeille wrote:
> >
> > On Dec 13, 2007, at 8:39 AM, Dawid Weiss wrote:
> >
> >> Just incidentally -- do you know of something that would parse the
> >> wikipedia markup (to plain text, for example)?
> >
> > If you find out, let us know :)
> >
> > You may want to check the partial ANTLR grammar for Wikitext:
> >
> > http://www.mediawiki.org/wiki/User:Stevage/ANTLR
> > http://lists.wikimedia.org/pipermail/wikitext-l/2007-December/000117.html
> >
> > This also might be of interest:
> >
> > http://www.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html
> >
> > "the nice people over at woc.fslab.de have created a standalone
> > wiki-markup parser which is ready for use"
> > http://fslab.de/svn/wpofflineclient/trunk/mediawiki_sa
> > There is also Text::MediawikiFormat:
> > http://search.cpan.org/~dprice/Text-MediawikiFormat-0.05/lib/Text/MediawikiFormat.pm
> >
> > Perhaps you will be better off processing the Wikipedia static HTML
> > dump, instead of the XML one:
> > http://static.wikipedia.org/
> > Not a piece of cake one way or another :(
> > Cheers,
> > PA.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
>
>



-- 
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/


