lucene-java-user mailing list archives

From Michael Sokolov
Subject Re: Using Lucene to index Wikipedia
Date Sun, 23 Oct 2011 20:49:13 GMT
Daniel, since no one more knowledgeable has answered, I'll take a stab. 
There are a number of ant targets you can run, most of which incorporate 
some indexing step(s).  Basically you can run:

ant -Dtask.alg=<alg file>

It looks as if the ant build.xml is set up to run 
conf/micro-standard.alg by default, but there are a bunch of other alg 
files in the conf folder, each of which is set up to run some different 

The only "documentation" I found is the build.xml file.
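
For what it's worth, here is a rough sketch of how that invocation could be 
wrapped in a small shell helper. It assumes you are in the benchmark directory 
(the one holding build.xml and conf/) and that your build.xml has a run-task 
target that reads the task.alg property; check your own build.xml, since I am 
going from memory. The names run_alg and DRY_RUN are made up for this example.

```shell
# Hypothetical helper (not part of Lucene): run one benchmark .alg file via ant.
# Assumption: build.xml defines a "run-task" target that honors -Dtask.alg.
run_alg() {
  alg="$1"

  # Refuse to start ant if the algorithm file is not there.
  if [ ! -f "$alg" ]; then
    echo "no such alg file: $alg" >&2
    return 1
  fi

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only show the command that would be executed.
    echo "ant run-task -Dtask.alg=$alg"
  else
    ant run-task -Dtask.alg="$alg"
  fi
}

# Typical use from the benchmark directory:
#   ls conf/*.alg                      # see which algorithm files exist
#   run_alg conf/micro-standard.alg    # run the one you picked
```

Setting DRY_RUN=1 first lets you confirm which alg file would be picked up 
before kicking off a long indexing run.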

On 10/20/2011 12:30 PM, Daniel Quach wrote:
> How do I use the Lucene Benchmark to index a wikipedia dump? I want to 
> be able to execute phrase queries on the latest english wikipedia page 
> dump. I'm trying to look for example use cases but I haven't found any.
> I downloaded the latest english dump, named: 
> enwiki-latest-pages-articles.xml.bz2
> Then I ran the command in the terminal:
> java org.apache.lucene.benchmark.utils.ExtractWikipedia -i 
> ~/enwiki-latest-pages-articles.xml.bz2
> which I believe extracted the pages into a directory labeled "enwiki"
> Now is there something else in benchmarks that I need to run in order 
> to index the wiki? The README.enwiki does not really give me a clear 
> set of instructions, in fact I'm not even sure if I was supposed to 
> run the ExtractWikipedia class or not.
