Message-ID: <55b2c6b90712120927v5d7dbb8bxc67e24c9319eba64@mail.gmail.com>
Date: Wed, 12 Dec 2007 09:27:26 -0800
From: "Andy Goodell"
Sender: goodell@gmail.com
To: java-user@lucene.apache.org
Subject: Re: Indexing Wikipedia dumps
In-Reply-To: <547045.76613.qm@web50306.mail.re2.yahoo.com>

My firm uses a parser based on javax.xml.stream.XMLStreamReader to break (English and non-English) Wikipedia XML dumps into Lucene-style "documents and fields." We use Wikipedia to test our language-specific code, so we've probably indexed 20 Wikipedia dumps.

- andy g

On Dec 11, 2007 9:35 PM, Otis Gospodnetic wrote:
> Hi,
>
> I need to index a Wikipedia dump. I know there is code in contrib/benchmark for indexing *English* Wikipedia for benchmarking purposes. However, I'd like to index a non-English dump, and I actually don't need it for benchmarking, I just want to end up with a Lucene index.
>
> Any suggestions where I should start? That is, can anything in contrib/benchmark already do this, or is there anything there that I should use as a starting point? As opposed to writing my own Wikipedia XML dump parser+indexer.
>
> Thanks,
> Otis
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
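
A minimal sketch of the XMLStreamReader approach described in the reply above, for anyone looking for a starting point. It is not the parser referred to there. It assumes the usual MediaWiki export layout (<page> elements containing <title> and <revision>/<text>), and the class name WikiDumpSketch, the parse method, and the "title"/"text" field names are made up for illustration. It collects plain field maps instead of Lucene Document objects so that it runs with nothing but the JDK.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams a MediaWiki-style dump into "documents and fields".
public class WikiDumpSketch {

    /** Returns one field map (title, text) per <page> element in the stream. */
    public static List<Map<String, String>> parse(Reader in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent character events so each text run arrives in one piece.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader xml = factory.createXMLStreamReader(in);

        List<Map<String, String>> docs = new ArrayList<>();
        Map<String, String> current = null;      // fields of the page being read
        StringBuilder buf = new StringBuilder(); // text of the innermost element

        while (xml.hasNext()) {
            switch (xml.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    // getLocalName() ignores the export namespace, if one is declared.
                    if ("page".equals(xml.getLocalName())) {
                        current = new LinkedHashMap<>();
                    }
                    buf.setLength(0);
                    break;
                }
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA: {
                    buf.append(xml.getText());
                    break;
                }
                case XMLStreamConstants.END_ELEMENT: {
                    String name = xml.getLocalName();
                    if (current != null) {
                        if ("title".equals(name) || "text".equals(name)) {
                            current.put(name, buf.toString());
                        } else if ("page".equals(name)) {
                            docs.add(current);   // page complete: emit the document
                            current = null;
                        }
                    }
                    buf.setLength(0);
                    break;
                }
                default:
                    break;
            }
        }
        xml.close();
        return docs;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny inline sample; a real dump would be read from a file or stream.
        String sample =
            "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.3/\">"
            + "<page><title>Lucene</title><revision><id>1</id>"
            + "<text xml:space=\"preserve\">A search library.</text></revision></page>"
            + "</mediawiki>";
        for (Map<String, String> doc : parse(new StringReader(sample))) {
            System.out.println(doc);
        }
    }
}
```

Each map could then be turned into a Lucene Document by adding one Field per entry; the exact Field constructor and analyzer choice depend on the Lucene version and the language of the dump, so that step is left out here. For a full-size dump it would make more sense to hand each page to the IndexWriter as soon as it is complete than to collect everything in a list first.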