lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Indexing Wikipedia dumps
Date Wed, 12 Dec 2007 13:11:32 GMT
Note that the current code doesn't actually do anything with the wiki  
syntax, but I would think as long as the other language is in the same  
format you should be fine.

-Grant

On Dec 12, 2007, at 5:28 AM, Michael McCandless wrote:

>
> I haven't actually tried it, but I think very likely the current  
> code in contrib/benchmark might be able to extract non-English  
> Wikipedia dump as well?
>
> Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think  
> if you just change the docs.file to reference your downloaded XML  
> file it could just work?
>
> Mike
>
> Otis Gospodnetic wrote:
>
>> Hi,
>>
>> I need to index a Wikipedia dump.  I know there is code in contrib/ 
>> benchmark for indexing *English* Wikipedia for benchmarking  
>> purposes.  However, I'd like to index a non-English dump, and I  
>> actually don't need it for benchmarking, I just want to end up with  
>> a Lucene index.
>>
>> Any suggestions where I should start?  That is, can anything in  
>> contrib/benchmark already do this, or is there anything there that  
>> I should use as a starting point?  As opposed to writing my own  
>> Wikipedia XML dump parser+indexer.
>>
>> Thanks,
>> Otis
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message