Date: Wed, 12 Dec 2007 10:01:29 +0000 (GMT)
From: mark harwood
Subject: Re: Indexing Wikipedia dumps
To: java-user@lucene.apache.org
Message-ID: <84352.26112.qm@web26007.mail.ukl.yahoo.com>

Otis, I've used this to index Wikipedia from XML before now:

http://schmidt.devlib.org/software/lucene-wikipedia.html

Cheers
Mark

----- Original Message ----
From: Otis Gospodnetic
To: java-user@lucene.apache.org
Sent: Wednesday, 12 December, 2007 8:18:49 AM
Subject: Re: Indexing Wikipedia dumps

Database? I imagine I can avoid that.... Wiki dump.gz -> gunzip ->
parse -> index, no?

Otis


----- Original Message ----
From: Chris Lu
To: java-user@lucene.apache.org
Sent: Wednesday, December 12, 2007 1:55:02 AM
Subject: Re: Indexing Wikipedia dumps

For a quick Java approach, give yourself 3 minutes and try to use
DBSight to access the database. You can simply use "select * from
mw_searchindex" as a starting point. It'll build the index for you.
However, you may need to plug in your custom analyzer for MediaWiki's
format (or maybe not).

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes


On Dec 11, 2007 9:35 PM, Otis Gospodnetic wrote:
> Hi,
>
> I need to index a Wikipedia dump. I know there is code in
> contrib/benchmark for indexing *English* Wikipedia for benchmarking
> purposes. However, I'd like to index a non-English dump, and I
> actually don't need it for benchmarking, I just want to end up with
> a Lucene index.
>
> Any suggestions where I should start? That is, can anything in
> contrib/benchmark already do this, or is there anything there that I
> should use as a starting point? As opposed to writing my own
> Wikipedia XML dump parser+indexer.
>
> Thanks,
> Otis

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
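
As a rough illustration of the "dump.gz -> gunzip -> parse -> index" pipeline
mentioned in the thread, here is a minimal sketch that uses only JDK classes
(GZIPInputStream and the StAX XMLStreamReader). It is not the code behind any
of the tools linked above. It assumes the usual MediaWiki export layout, with
a <title> and a <text> element somewhere inside each <page>. The class name
WikiDumpReader and the sink callback are hypothetical; the actual Lucene
indexing step is deliberately left to the callback and not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.function.BiConsumer;
import java.util.zip.GZIPInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams pages out of a gzipped MediaWiki XML dump.
final class WikiDumpReader {

    /**
     * Reads the gzipped dump and calls sink.accept(title, text) once per
     * <page> element. Returns the number of pages seen. The sink is where
     * an indexer (for example, code that builds a Lucene document) would go.
     */
    static int readPages(InputStream gzippedDump, BiConsumer<String, String> sink)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Deliver each text node as one chunk, and ignore DTDs.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);

        int count = 0;
        try (InputStream in = new GZIPInputStream(gzippedDump)) {   // gunzip
            XMLStreamReader xml = factory.createXMLStreamReader(in); // parse
            String title = null;
            String text = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if (name.equals("page")) {
                        // New page: forget values from the previous one.
                        title = null;
                        text = null;
                    } else if (name.equals("title")) {
                        title = xml.getElementText();
                    } else if (name.equals("text")) {
                        text = xml.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && xml.getLocalName().equals("page")) {
                    sink.accept(title, text);                        // index
                    count++;
                }
            }
            xml.close();
        }
        return count;
    }
}
```

To try it, pass a FileInputStream opened on the downloaded dump file together
with a callback that receives each page's title and wikitext. Because the
parser streams the file, memory use should not grow with the size of the
dump, and nothing in the sketch is specific to the English Wikipedia.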