lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Indexing Wikipedia dumps
Date Fri, 28 Dec 2007 17:19:31 GMT
See https://issues.apache.org/jira/browse/LUCENE-1103


On Dec 18, 2007, at 1:31 PM, Marcelo Ochoa wrote:

> Hi All:
>  Just to add simple hack, I had posted at my Blog an entry named
> "Uploading WikiPedia Dumps to Oracle databases":
> http://marceloochoa.blogspot.com/2007_12_01_archive.html
>  with instructions to upload WikiPedia Dumps to Oracle XMLDB, it
> means transforming an XML file to an object-relational storage.
>  Finally, I added instructions to index it with Lucene Domain Index.
>  Best regards, Marcelo.
>
> On Dec 14, 2007 5:08 AM, Dawid Weiss <dawid.weiss@cs.put.poznan.pl>  
> wrote:
>>
>> Good pointers, thanks. I asked because I did have a problem like  
>> this a few
>> months ago -- none of the existing parsers solved it for me (back  
>> then).
>>
>> D.
>>
>>
>> Petite Abeille wrote:
>>>
>>> On Dec 13, 2007, at 8:39 AM, Dawid Weiss wrote:
>>>
>>>> Just incidentally -- do you know of something that would parse the
>>>> wikipedia markup (to plain text, for example)?
>>>
>>> If you find out, let us know :)
>>>
>>> You may want to check the partial ANTLR grammar for Wikitext:
>>>
>>> http://www.mediawiki.org/wiki/User:Stevage/ANTLR
>>> http://lists.wikimedia.org/pipermail/wikitext-l/2007-December/000117.html
>>>
>>> This also might be of interest:
>>>
>>> http://www.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html
>>>
>>> "the nice people over at woc.fslab.de have created a standalone
>>> wiki-markup parser which is ready for use"
>>> http://fslab.de/svn/wpofflineclient/trunk/mediawiki_sa
>>> There is also Text::MediawikiFormat:
>>> http://search.cpan.org/~dprice/Text-MediawikiFormat-0.05/lib/Text/MediawikiFormat.pm
>>>
>>> Perhaps you will be better off processing the Wikipedia static HTML
>>> dump, instead of the XML one:
>>> http://static.wikipedia.org/
>>> Not a piece of cake one way or another :(
>>> Cheers,
>>> PA.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>
> -- 
> Marcelo F. Ochoa
> http://marceloochoa.blogspot.com/
> http://marcelo.ochoa.googlepages.com/home
> ______________
> Do you Know DBPrism? Look @ DB Prism's Web Site
> http://www.dbprism.com.ar/index.html
> More info?
> Chapter 17 of the book "Programming the Oracle Database using Java &
> Web Services"
> http://www.amazon.com/gp/product/1555583296/
> Chapter 21 of the book "Professional XML Databases" - Wrox Press
> http://www.amazon.com/gp/product/1861003587/
> Chapter 8 of the book "Oracle & Open Source" - O'Reilly
> http://www.oreilly.com/catalog/oracleopen/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message