lucene-java-user mailing list archives

From "Dr. Hany Azzam" <h...@eecs.qmul.ac.uk>
Subject Re: Indexing TREC GOV2 data in Lucene
Date Thu, 12 Apr 2012 09:07:57 GMT
Hi,

I am not sure whether there is anything in contrib specifically for GOV2, but
it really depends on what you want to parse. If you are only interested in
full-text search, then it should be similar to parsing a regular document,
while being aware of the TREC-specific delimiters such as <DOC>.
However, if you want to perform structured search and maintain indexes over
separate fields such as titles, then this will require some customisation.
Likewise, if you want to store the anchor text separately and perform some
sort of link resolution and page ranking, you will need to customise your
parsing.
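
As a rough illustration of the delimiter handling, here is a minimal sketch
in plain Java (no Lucene calls). It assumes each record is wrapped in
<DOC>...</DOC>, carries its id in <DOCNO>...</DOCNO>, and may contain an HTML
<title>. The class name TrecDocSplitter, the TrecDoc holder, and the sample
record are all made up for this example, and the regex approach is only
reasonable for one already-decompressed bundle file at a time, not for the
whole collection in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Lucene or its contrib modules):
// splits TREC-style text into records using the <DOC> delimiters.
final class TrecDocSplitter {

    // One parsed record: document id, optional HTML title, and the raw body.
    static final class TrecDoc {
        final String docNo;
        final String title;
        final String body;

        TrecDoc(String docNo, String title, String body) {
            this.docNo = docNo;
            this.title = title;
            this.body = body;
        }
    }

    // Assumed delimiters; adjust if your copy of the collection differs.
    private static final Pattern DOC =
            Pattern.compile("<DOC>(.*?)</DOC>", Pattern.DOTALL);
    private static final Pattern DOCNO =
            Pattern.compile("<DOCNO>\\s*(.*?)\\s*</DOCNO>", Pattern.DOTALL);
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>\\s*(.*?)\\s*</title>",
                    Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns one TrecDoc per <DOC>...</DOC> block found in the text.
    static List<TrecDoc> split(String text) {
        List<TrecDoc> docs = new ArrayList<>();
        Matcher m = DOC.matcher(text);
        while (m.find()) {
            String raw = m.group(1);
            docs.add(new TrecDoc(first(DOCNO, raw), first(TITLE, raw), raw.trim()));
        }
        return docs;
    }

    // First capture group of the pattern, or an empty string if absent.
    private static String first(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Made-up sample record, for illustration only.
        String sample = "<DOC>\n<DOCNO>GX000-00-0000000</DOCNO>\n"
                + "<html><title>Example page</title><body>Hello</body></html>\n"
                + "</DOC>\n";
        for (TrecDoc d : split(sample)) {
            System.out.println(d.docNo + " | " + d.title);
        }
    }
}
```

For plain full-text search you would index only the body of each record; for
structured search, docNo, title and body would each go into their own index
field, and anchor text would need a further extraction pass.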

h.

> Hi All,
>
> I am working on a project on static index pruning and I am using the TREC
> GOV2 collection. I have seen that TREC data can be parsed and that the
> necessary Java files are present in the contrib package, but has any user
> used Lucene to index the GOV2 dataset, or is there source code available
> for the same?
>
> Regards
> Jake Dsouza
>


