lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Indexing Wikipedia dumps
Date Wed, 12 Dec 2007 14:49:16 GMT

On 12 Dec 2007, at 06:35, Otis Gospodnetic wrote:

> I need to index a Wikipedia dump.  I know there is code in
> contrib/benchmark for indexing *English* Wikipedia for benchmarking
> purposes.  However, I'd like to index a non-English dump, and I  
> actually don't need it for benchmarking, I just want to end up with  
> a Lucene index.
>
> Any suggestions where I should start?  That is, can anything in  
> contrib/benchmark already do this, or is there anything there that I  
> should use as a starting point?  As opposed to writing my own  
> Wikipedia XML dump parser+indexer.


Here is one more alternative, which is how I did it a while back.

Get the tarballs containing the rendered HTML. Using NekoHTML (or a
similar HTML parser), find the DOM node that contains the text content
and take its text. And there you go, plain text.
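
A minimal sketch of that extraction step, written in Python with the
standard-library html.parser swapped in for NekoHTML so that it runs
without extra dependencies. The element id "content" and the names
ContentTextExtractor and extract_text are assumptions made for
illustration; check the actual dump's markup for the right id. Unlike
NekoHTML, this parser does not balance tags, so the sketch assumes the
rendered pages close their elements properly.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentTextExtractor(HTMLParser):
    """Collect the text inside the element whose id equals content_id."""

    def __init__(self, content_id="content"):  # "content" is an assumed id
        super().__init__(convert_charrefs=True)
        self.content_id = content_id
        self.depth = 0       # 0 means we are outside the content element
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth == 0:
            # Still looking for the element that wraps the article text.
            if dict(attrs).get("id") == self.content_id:
                self.depth = 1
            return
        self.depth += 1
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1
        self.depth -= 1  # reaching 0 means the content element has closed

    def handle_data(self, data):
        if self.depth > 0 and self.skip_depth == 0:
            self.chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self.chunks).split())


def extract_text(html, content_id="content"):
    """Return the plain text found under the element with the given id."""
    parser = ContentTextExtractor(content_id)
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling extract_text(page_html) on each HTML file from the tarball
would give the string to put into the body field of a Lucene document.
Navigation, footer, and script content outside or inside the content
element are dropped.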


-- 
karl


