lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andy Roberts <...@andy-roberts.net>
Subject Obtaining the contexts of hits
Date Wed, 09 Mar 2005 20:04:08 GMT
Hi,

I've been using Lucene for a few months now, although not in a typical 
"building a search engine" kind of way*. Basically, I have some large 
documents. I would like a system whereby I search for a term, and then I 
receive a hit for each match, with its context, e.g., ten words either side 
of the match.

I've been looking through the API, and there's a SpanNearQuery (or something 
similar) which looked at first glance to be what I want, but after second 
thoughts, it appears it's searching for a set of words within a given 
proximity. 

I know Lucene stores TermPositions: I know how to get the position of a match. 
Is there a way of doing a reverse lookup, i.e., given a position, return the 
term. Because if that were the case, I could easily build a context up by 
finding pos, then looping to get all terms +- a specified window.

Either way, I can't see a way forward. Simply being able to find which 
documents the terms are in isn't very helpful, because let's say I find the 
docs that match, I then have to open up each one, tokenise and re-search for 
my term,etc, etc. All this info is there, somewhere in the index, I just want 
to get to it so that I can benefit from the many speed benefits of Lucene.

I don't expect sample code or anything, just a pointer to the right direction. 
I do own the LIA book, but haven't read it all yet - so if there's anything 
in there which could be relevant, please let me know :)

Much help appreciated,
Andrew Roberts

* for those who are interested, I'm a computer scientist doing research which 
is basically a cross between computational linguistics and machine learning. 
I work with large text corpora, gather information about how words behave 
relative to each and try to infer word class and grammatical structure from 
it. I'm using Lucene to try and speed up text processing overheads, by having 
my corpora indexed.

-- 
Computer Vision and Language Reseearch Group
School of Computing
Univeristy of Leeds
Leeds, UK
LS2 9JT
http://www.comp.leeds.ac.uk/andyr

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message