lucene-java-user mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Indexing Wikipedia dumps
Date Wed, 12 Dec 2007 10:01:29 GMT
Otis, I've used this to index Wikipedia from XML before:

http://schmidt.devlib.org/software/lucene-wikipedia.html

Cheers
Mark

----- Original Message ----
From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
To: java-user@lucene.apache.org
Sent: Wednesday, 12 December, 2007 8:18:49 AM
Subject: Re: Indexing Wikipedia dumps

Database?  I imagine I can avoid that: Wiki dump.gz -> gunzip ->
parse -> index, no?

Otis
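
For illustration, the gunzip -> parse half of that pipeline might be sketched as follows with only the JDK (GZIPInputStream plus the StAX XMLStreamReader). The names WikiDumpReader, readPages and gzip are hypothetical, and the sketch assumes the usual MediaWiki export layout of <page> elements containing a <title> and a <revision>/<text>. The index step is left to the caller-supplied sink, since the Lucene calls depend on the Lucene version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.BiConsumer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams pages out of a gzipped MediaWiki XML dump.
class WikiDumpReader {

    // Decompresses and parses the stream, handing each page's (title, text)
    // to the sink. Returns the number of pages seen. If a page carries
    // several revisions, the text of the last one read is passed on.
    static int readPages(InputStream gzipped, BiConsumer<String, String> sink)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        int count = 0;
        try (InputStream in = new GZIPInputStream(gzipped)) {
            XMLStreamReader xml = factory.createXMLStreamReader(in);
            String title = null;
            String text = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if (name.equals("page")) {
                        title = null;
                        text = null;
                    } else if (name.equals("title")) {
                        title = xml.getElementText();
                    } else if (name.equals("text")) {
                        text = xml.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && xml.getLocalName().equals("page")) {
                    // This is where a Lucene document would be built and added.
                    sink.accept(title, text == null ? "" : text);
                    count++;
                }
            }
            xml.close();
        }
        return count;
    }

    // Gzips a string in memory; only used to build sample input for the demo.
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<mediawiki><page><title>Beispiel</title>"
                + "<revision><text>Ein kurzer Text.</text></revision></page></mediawiki>";
        int n = readPages(new ByteArrayInputStream(gzip(sample)),
                (title, text) -> System.out.println(title + ": " + text.length() + " chars"));
        System.out.println(n + " page(s) read");
    }
}
```

For a real dump, the in-memory sample would be swapped for a FileInputStream on the downloaded file, and the sink would add one document per page to an IndexWriter. Because the parser streams, memory use should stay flat regardless of dump size.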


----- Original Message ----
From: Chris Lu <chris.lu@gmail.com>
To: java-user@lucene.apache.org
Sent: Wednesday, December 12, 2007 1:55:02 AM
Subject: Re: Indexing Wikipedia dumps

For a quick Java approach, give yourself 3 minutes and try to use
DBSight to access the database. You can simply use "select * from
mw_searchindex" as a starting point. It'll build the index for you.
However, you may need to plug in your custom analyzer for MediaWiki's
format (or maybe not).

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer (remain anonymous per request) got 2.6 Million Euro funding!


On Dec 11, 2007 9:35 PM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> Hi,
>
> I need to index a Wikipedia dump.  I know there is code in
> contrib/benchmark for indexing *English* Wikipedia for benchmarking
> purposes.  However, I'd like to index a non-English dump, and I
> actually don't need it for benchmarking, I just want to end up with a
> Lucene index.
>
> Any suggestions where I should start?  That is, can anything in
> contrib/benchmark already do this, or is there anything there that I
> should use as a starting point?  As opposed to writing my own
> Wikipedia XML dump parser+indexer.
>
> Thanks,
> Otis
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
