lucene-dev mailing list archives

From: Doron Cohen
Subject: Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
Date: Wed, 28 Mar 2007 17:54:17 GMT

Grant Ingersoll wrote on 28/03/2007 10:44:08:

> On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:
> > Question (for Doron and anyone else): the file is xml and it's big,
> > so DOM isn't going to work. I could still use something SAX based
> > but since the format is so tightly controlled, I'm thinking regular
> > expressions would be sufficient and have less dependences. Anyone
> > have opinions on this?
> Personally, I think SAX is the way to go, as you'll get handling of
> escape sequences, etc. out of the box.  And seems like it is easier
> to read/maintain????

TrecDocMaker relies on the strict structure of its input data: the
read() method there "eats" the input stream until it reaches a point of
interest, optionally collecting (lines of) text along the way, depending
on the format. You may be able to use a variation of this here. If the
input here is not that strictly defined, SAX would be better.
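
To make the trade-off concrete, here is a rough sketch of both approaches
side by side. It is not the TrecDocMaker code; the class and method names
are hypothetical, and the <page>/<title> element names and the "one title
per line" layout are assumptions about the dump format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
public class TitleExtractors {

    // Line-oriented approach: "eat" lines until one holds <title>...</title>.
    // Assumption: a title never spans more than one line.
    static List<String> titlesByLineScan(BufferedReader in) throws IOException {
        List<String> titles = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            int start = line.indexOf("<title>");
            if (start < 0) continue;              // not a point of interest
            int end = line.indexOf("</title>", start);
            if (end < 0) continue;                // layout assumption violated, skip
            titles.add(line.substring(start + "<title>".length(), end));
        }
        return titles;
    }

    // SAX approach: the parser streams events and decodes escape sequences.
    static List<String> titlesBySax(BufferedReader in) throws Exception {
        final List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text;           // non-null only inside <title>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("title".equals(qName)) text = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one element, so accumulate.
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("title".equals(qName)) {
                    titles.add(text.toString());
                    text = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(in), handler);
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<page>\n  <title>AT&amp;T</title>\n</page>\n";
        System.out.println(titlesByLineScan(new BufferedReader(new StringReader(xml)))); // [AT&amp;T]
        System.out.println(titlesBySax(new BufferedReader(new StringReader(xml))));      // [AT&T]
    }
}
```

The line scan has no extra dependencies but returns the raw text, so
escapes such as &amp; would still need decoding by hand. The SAX version
handles that out of the box and does not care where the line breaks fall.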
