lucene-java-user mailing list archives

From Dmitry Savenko
Subject How to get hit offsets?
Date Mon, 12 Sep 2011 11:08:18 GMT
Hello, everyone!

Could anyone please explain how to get offsets for hits? That is, I have a big text file and
want to find a string in it. As the result of this operation, I need an array of offsets (in
characters) from the beginning of the file, one for each occurrence of the string.

As an example, suppose the file is "The quick brown fox jumps over the lazy dog" and the
search string is "quick brown". I expect the result of the search to be 4.
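To make the expected result concrete, here is a minimal sketch in plain Java (no Lucene involved) of the output I am after. The class and method names are made up for illustration; it simply scans the text with String.indexOf and collects every start offset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: brute-force substring search.
class SubstringOffsets {
    /** Returns the character offset of every occurrence of needle in text. */
    static List<Integer> find(String text, String needle) {
        List<Integer> offsets = new ArrayList<>();
        if (needle.isEmpty()) {
            return offsets; // an empty needle has no meaningful offsets
        }
        int from = 0;
        while (true) {
            int idx = text.indexOf(needle, from);
            if (idx < 0) {
                break;
            }
            offsets.add(idx);
            from = idx + 1; // step by one so overlapping matches are found too
        }
        return offsets;
    }
}
```

Calling SubstringOffsets.find("The quick brown fox jumps over the lazy dog", "quick brown") returns a list containing the single offset 4. This linear scan is only a statement of the desired result; it does not scale to many big files, which is why I am looking at an index.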

I spent a while trying to achieve this, but failed. I tried to create a document with a single
field ("content") and use TermPositionVector to get term offsets. It works when the query consists
of a single term. I just get all occurrences of this term in the "content" field, and that's
it. But what about more complex queries? I think I could do it by iterating query terms, getting
their offsets, then doing some magic to sort them and link particular occurrences of different
terms together, etc. But this looks like a lot of work for such a simple task. I feel like
there should be a better way.
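The "magic" I have in mind would look roughly like the sketch below. It assumes that for each term of the phrase I already have two parallel arrays, the token positions and the start character offsets of its occurrences, as I read them from the term vector. The class, method, and parameter names are hypothetical and the code calls nothing from Lucene; it only links occurrences whose positions are consecutive.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: links per-term occurrences into phrase matches.
class PhraseOffsets {
    /**
     * positions[i][k] is the token position of the k-th occurrence of the
     * i-th phrase term; startOffsets[i][k] is its start character offset.
     * Returns the start offset of every place where the terms occur at
     * consecutive positions, in the order of the first term's occurrences.
     */
    static List<Integer> phraseStarts(int[][] positions, int[][] startOffsets) {
        List<Integer> result = new ArrayList<>();
        if (positions.length == 0) {
            return result;
        }
        // Index the positions of each term for constant-time lookup.
        List<Set<Integer>> positionSets = new ArrayList<>();
        for (int[] termPositions : positions) {
            Set<Integer> set = new HashSet<>();
            for (int p : termPositions) {
                set.add(p);
            }
            positionSets.add(set);
        }
        // Each occurrence of the first term is a candidate phrase start.
        for (int k = 0; k < positions[0].length; k++) {
            int start = positions[0][k];
            boolean matches = true;
            for (int i = 1; i < positions.length && matches; i++) {
                // The i-th term must sit exactly i positions after the first.
                matches = positionSets.get(i).contains(start + i);
            }
            if (matches) {
                result.add(startOffsets[0][k]);
            }
        }
        return result;
    }
}
```

For the example sentence, "quick" is at position 1 (offset 4) and "brown" at position 2 (offset 10), so phraseStarts(new int[][]{{1}, {2}}, new int[][]{{4}, {10}}) yields the single offset 4. Note that this matches whole tokens only, so it is phrase search rather than true substring search, and whatever the analyzer drops between tokens is ignored.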

I understand that for some more complex queries it may not be clear how to define what an
"offset" is. But I don't really need sophisticated queries. I just need simple substring search.
Maybe Lucene is not supposed to be used that way. But I also need to manage a number of
big files, search in multiple files at once, and produce results quickly, which are things
Lucene does well (as far as I know).

Best regards,
