lucene-java-user mailing list archives

From Anshum <ansh...@gmail.com>
Subject Re: asking about index verification tools
Date Tue, 16 Nov 2010 06:22:45 GMT
Hi,

One way to do that would be to iterate over the terms in the index and either
reconstruct each document or simply check for its terms one after the other.
Luke also reconstructs documents, so you could reuse its reconstruction logic
and compare the result against the original. Note, though, that a
reconstruction from the inverted index is not guaranteed to be faithful,
since the index holds the analyzed tokens rather than the original text.

I am assuming you'd only want to verify the index on a sampled set of
documents and not the entire corpus, since a full check would take a lot of
time.

Alternatively, you could take the document corpus again, tokenize it, and
then search for each of the resulting terms in the index. This would also
give you a fair idea of the index state.
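
The re-tokenize-and-search check, restricted to a random sample of
documents, might look roughly like the sketch below. It is only an
illustration under stated assumptions: the index is modelled as a plain
in-memory map from term to document ids rather than a real Lucene index, and
the class name IndexSpotCheck, the methods tokenize, buildIndex and
findMissingTerms, and the simple lowercase-and-split tokenizer are all
hypothetical. Against a real index the map lookup would be replaced by a
term lookup through Lucene's own reader or searcher classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch only: the "index" is an in-memory map standing in for a Lucene index. */
final class IndexSpotCheck {

    /** Hypothetical analyzer: lowercase, then split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Builds the stand-in inverted index: term -> ids of the documents containing it. */
    static Map<String, Set<Integer>> buildIndex(Map<Integer, String> corpus) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (Map.Entry<Integer, String> doc : corpus.entrySet()) {
            for (String term : tokenize(doc.getValue())) {
                index.computeIfAbsent(term, k -> new HashSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    /**
     * Re-tokenizes a random sample of documents and looks each term up in the index.
     * Returns, per sampled document id, the terms the index does not list for it.
     * Documents with nothing missing are omitted from the result.
     */
    static Map<Integer, Set<String>> findMissingTerms(Map<Integer, String> corpus,
                                                      Map<String, Set<Integer>> index,
                                                      int sampleSize,
                                                      long seed) {
        List<Integer> ids = new ArrayList<>(corpus.keySet());
        Collections.sort(ids);                      // fixed starting order, so the seed decides the sample
        Collections.shuffle(ids, new Random(seed));
        int n = Math.min(sampleSize, ids.size());

        Map<Integer, Set<String>> missing = new TreeMap<>();
        for (Integer docId : ids.subList(0, n)) {
            for (String term : tokenize(corpus.get(docId))) {
                Set<Integer> postings = index.getOrDefault(term, Collections.<Integer>emptySet());
                if (!postings.contains(docId)) {
                    missing.computeIfAbsent(docId, k -> new TreeSet<>()).add(term);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<Integer, String> corpus = new HashMap<>();
        corpus.put(1, "Lucene indexes every term");
        corpus.put(2, "Luke inspects the index");

        Map<String, Set<Integer>> index = buildIndex(corpus);
        index.remove("term");                       // simulate a term that went missing

        System.out.println(findMissingTerms(corpus, index, 2, 42L));
    }
}
```

For the result to mean anything, the check has to tokenize the text the same
way the indexing program did. If the indexing analyzer drops stop words or
stems terms and the check does not, perfectly healthy documents will be
reported as having missing terms.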

--
Anshum Gupta
http://ai-cafe.blogspot.com


On Tue, Nov 16, 2010 at 11:36 AM, Yakob <jacobian@opensuse-id.org> wrote:

> Hello all,
> I would like to ask about Lucene indexes. I wrote a simple
> program that creates Lucene indexes and stores them in a folder. I
> have also used a diagnostic tool named Luke to look inside a Lucene
> index and inspect its contents. I know that Lucene is a standard
> framework when it comes to building a search engine, but I just want
> to be sure that Lucene indexes every term that exists in a file.
>
> Is there a way, or some tool out there, to verify that
> the contents of a Lucene index are dependable, and that not a single
> term went missing?
>
> I know that this is a subjective question, but I just wanted to hear your
> two cents.
> Thanks. :-)
>
> tl;dr: how can we know that the index in lucene is correct?
>
> --
> http://jacobian.web.id
