Subject: Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
From: Doron Cohen
To: java-dev@lucene.apache.org
Date: Wed, 28 Mar 2007 09:54:17 -0800
In-Reply-To: <3F3C7FB7-865B-4CD7-AA5F-CA22657E0CE1@gmail.com>

Grant Ingersoll wrote on 28/03/2007 10:44:08:

> On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:
>
> > Question (for Doron and anyone else): the file is xml and it's big,
> > so DOM isn't going to work. I could still use something SAX based
> > but since the format is so tightly controlled, I'm thinking regular
> > expressions would be sufficient and have less dependences. Anyone
> > have opinions on this?
>
> Personally, I think SAX is the way to go, as you'll get handling of
> escape sequences, etc. out of the box. And seems like it is easier
> to read/maintain????

TrecDocMaker relies on the strict structure of its input data: the read() method there "eats" the input stream until it reaches points of interest, optionally collecting (lines of) text along the way. Depending on the format here, you may be able to use a variation of this. If the input here is not that strictly defined, SAX would be better.
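
For comparison, here is a rough sketch of what the SAX route could look like. It is only an illustration of the trade-off discussed above, not code from the benchmark package: the class name WikiSaxSketch, the handler, and the element names (page, title, text) are assumptions about the dump layout. The point it shows is that the parser streams the input (so file size is not a problem the way it is with DOM) and decodes escape sequences such as &amp;amp; without any extra work.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: class, method, and element names are assumptions,
// not part of Lucene's benchmark code.
class WikiSaxSketch {

    /** Collects one {title, text} pair per <page> element. */
    static class PageHandler extends DefaultHandler {
        final List<String[]> pages = new ArrayList<>();
        private final StringBuilder buf = new StringBuilder();
        private String title;
        private String text;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("page")) {
                title = null;
                text = null;
            }
            // Drop whatever was buffered before this element opened.
            buf.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign.
            buf.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("title")) {
                title = buf.toString();
            } else if (qName.equals("text")) {
                text = buf.toString();
            } else if (qName.equals("page")) {
                pages.add(new String[] { title, text });
            }
        }
    }

    /** Parses the given XML and returns the collected {title, text} pairs. */
    static List<String[]> parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            PageHandler handler = new PageHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.pages;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>A &amp; B</title>"
                + "<revision><text>x &lt; y</text></revision></page></mediawiki>";
        for (String[] page : parse(xml)) {
            System.out.println(page[0] + " -> " + page[1]);
        }
    }
}
```

For a real dump one would hand the parser an InputStream over the file instead of a String, and emit each page as it is completed rather than collecting them all in a list. A TrecDocMaker-style reader would avoid the XML dependency but would have to decode entities itself.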