lucene-java-user mailing list archives

From Matt Kangas <kan...@gmail.com>
Subject Re: Indexing Wikipedia dumps
Date Wed, 12 Dec 2007 06:19:44 GMT
Otis, if you're willing to use some non-Java code for your task...

1) Wikipedia uses Lucene for its full-text search, and the module
is part of MediaWiki. You could use this as follows:
- Install MediaWiki
- Load your Wikipedia dump into MediaWiki (and MySQL)
- Build a search index for the Lucene Search extension:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/README.txt?revision=8535&view=markup

2) Alternatively, use MediaWiki's native import parser (in PHP) to
feed Solr, etc. The code is a bit hairy, though.
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/SpecialImport.php?revision=27686&view=markup
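
If the PHP importer proves too hairy, the same idea (stream the dump, pull out each page, hand it to an indexer) can be approximated outside MediaWiki. What follows is only a minimal sketch in Python, not the MediaWiki importer's logic. It assumes the dump is the usual <mediawiki> document made of <page> elements, each with a <title> and a <revision>/<text>. The names iter_pages, local_name and SAMPLE are made up for illustration, and the namespace URI in the sample is an assumption; the code ignores namespaces, so the exact export version should not matter.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield one {'title', 'text'} dict per <page> element, streaming.

    `source` is a filename or a binary file object holding the XML dump.
    If a page carries several revisions, the last <text> seen wins.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        doc = {"title": "", "text": ""}
        for child in elem.iter():
            name = local_name(child.tag)
            if name in doc:
                doc[name] = child.text or ""
        yield doc
        elem.clear()  # release this page's children once it has been yielded


# Tiny hand-written stand-in for a real (non-English) dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Berlin</title>
    <revision><text>Berlin ist die Hauptstadt.</text></revision>
  </page>
  <page>
    <title>Wien</title>
    <revision><text>Wien ist eine Stadt.</text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for page in iter_pages(io.BytesIO(SAMPLE)):
        print(page["title"], "->", len(page["text"]), "chars")
```

Each yielded dict could then be posted to Solr or turned into a Lucene Document with title and text fields; that step is left out because it depends on your schema. For a real dump, pass a file object opened with bz2.open(path, "rb") in place of the BytesIO.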

--Matt

On Dec 12, 2007, at 12:35 AM, Otis Gospodnetic wrote:

> Hi,
>
> I need to index a Wikipedia dump.  I know there is code in
> contrib/benchmark for indexing *English* Wikipedia for benchmarking
> purposes.  However, I'd like to index a non-English dump, and I
> actually don't need it for benchmarking, I just want to end up with
> a Lucene index.
>
> Any suggestions where I should start?  That is, can anything in  
> contrib/benchmark already do this, or is there anything there that I  
> should use as a starting point?  As opposed to writing my own  
> Wikipedia XML dump parser+indexer.
>
> Thanks,
> Otis

--
Matt Kangas / kangas@gmail.com




