lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Soboroff <ian.sobor...@nist.gov>
Subject Re: indexing TREC
Date Mon, 09 May 2005 02:39:30 GMT
Quoting Erik Hatcher <erik@ehatchersolutions.com>:

> 
> On May 7, 2005, at 10:24 PM, xx28@drexel.edu wrote:
> 
> > Hi -
> >
> > Is there any TREC parser (indexing many documents that are in the  
> > same file)for Lucene avaiable?
> 
> Not to my knowledge.  I've been working on indexing some TREC data  
> myself, with some (not written by me) code to do simplistic grabbing  
> of a couple of fields from the SGML.
> 
> If there is something handy to parse these files, I'd love to give it  
> a spin though.

We at NIST have code to do this, integrated with Lucene.  We have a simple
SGML/XML parser and a simple reader which walks the documents... that's most of
what you need.  You could probably hack it together with NekoHTML in a day or
two, actually.  It can index TREC newswire corpora in a couple hours on a pretty
stock Dell workstation, and the TREC "Terabyte" collection (actually 480GB, 25
million web pages) in a weekend on a handful of workstations indexing in parallel.

It's not quite ready for prime-time release yet, though.  I hope to have some
time this summer to clean it up enough that we're not embarrassed to give it out
;-).  If your need is desperate, drop me an email and I'll see what I can do.

Ian



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message