lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steven Parkes" <steven_par...@esseff.org>
Subject RE: Indexing Wikipedia dumps
Date Wed, 12 Dec 2007 13:35:36 GMT
Probably want a combination of extractWikipedia.alg and wikipedia.alg?

You want the EnwikiDocMaker from extractWikipedia.alg which reads the
uncompressed xml file but rather than using WriteLineDoc, you want to go
ahead and index as wikipedia.alg does. (Ditch the query part.)

You'll need an acceptable analyzer, which StandardAnalyzer might not be.

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Wednesday, December 12, 2007 2:29 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing Wikipedia dumps


I haven't actually tried it, but I think very likely the current code  
in contrib/benchmark might be able to extract non-English Wikipedia  
dump as well?

Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think  
if you just change the docs.file to reference your downloaded XML  
file it could just work?

Mike

Otis Gospodnetic wrote:

> Hi,
>
> I need to index a Wikipedia dump.  I know there is code in contrib/ 
> benchmark for indexing *English* Wikipedia for benchmarking  
> purposes.  However, I'd like to index a non-English dump, and I  
> actually don't need it for benchmarking, I just want to end up with  
> a Lucene index.
>
> Any suggestions where I should start?  That is, can anything in  
> contrib/benchmark already do this, or is there anything there that  
> I should use as a starting point?  As opposed to writing my own  
> Wikipedia XML dump parser+indexer.
>
> Thanks,
> Otis
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message