lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Soboroff <>
Subject Re: indexing TREC
Date Mon, 09 May 2005 02:39:30 GMT
Quoting Erik Hatcher <>:

> On May 7, 2005, at 10:24 PM, wrote:
> > Hi -
> >
> > Is there any TREC parser (indexing many documents that are in the  
> > same file)for Lucene avaiable?
> Not to my knowledge.  I've been working on indexing some TREC data  
> myself, with some (not written by me) code to do simplistic grabbing  
> of a couple of fields from the SGML.
> If there is something handy to parse these files, I'd love to give it  
> a spin though.

We at NIST have code to do this, integrated with Lucene.  We have a simple
SGML/XML parser and a simple reader which walks the documents... that's most of
what you need.  You could probably hack it together with NekoHTML in a day or
two, actually.  It can index TREC newswire corpora in a couple hours on a pretty
stock Dell workstation, and the TREC "Terabyte" collection (actually 480GB, 25
million web pages) in a weekend on a handful of workstations indexing in parallel.

It's not quite ready for prime-time release yet, though.  I hope to have some
time this summer to clean it up enough that we're not embarrassed to give it out
;-).  If your need is desperate, drop me an email and I'll see what I can do.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message