lucene-java-user mailing list archives

From: Andrzej Bialecki <...@getopt.org>
Subject: Re: Not entire document being indexed?
Date: Fri, 25 Feb 2005 23:19:38 GMT
amigo@max3d.com wrote:
> Does anyone else have any ideas why the whole documents wouldn't be 
> indexed, as described below?
> 
> Or perhaps someone can enlighten me on how to use Luke to find out 
> whether the whole document was indexed or not.
> I have not used Luke in that capacity before, so I am not sure what to 
> do or what to look for.

Well, you could try the "Reconstruct & Edit" function - this will give 
you an idea of which tokens ended up in the index, and which one was 
the last. In Luke 0.6, if the field is stored you will see two tabs - 
one for the stored content, the other displaying the reconstructed 
tokenized content with tokens separated by commas. If the field was 
un-stored, the only tab you will get is the reconstructed content. In 
either case, just scroll down and check what the last tokens are.

You could also look for presence of some special terms that occur only 
at the end of that document, and check if they are present in the index.
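As a minimal sketch of that check against the Lucene API of that era: 
the index path, field name, and the "tail" term below are hypothetical 
placeholders - use a word that actually occurs only near the end of 
your document, in its analyzed (e.g. lowercased) form.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class CheckTailTerm {
        public static void main(String[] args) throws Exception {
            // Hypothetical index location and field name.
            IndexReader reader = IndexReader.open("/path/to/index");
            try {
                // A term that appears only near the end of the document.
                Term tailTerm = new Term("contents", "epilogue");
                // docFreq() > 0 means some document contains this term,
                // i.e. the tail of the text made it into the index.
                int df = reader.docFreq(tailTerm);
                System.out.println(tailTerm + " found in " + df + " document(s)");
            } finally {
                reader.close();
            }
        }
    }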

There are really only a few reasons why this might be happening:

* your extractor has a bug, or
* the max token limit is set too low (see the sketch below), or
* the indexing process doesn't close the IndexWriter properly.
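Here is a minimal sketch covering the last two points, assuming the 
Lucene 1.4-style API (public maxFieldLength field, Field.Text()); the 
index path, field name, and limit value are placeholders.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexWholeDocument {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/path/to/index",
                    new StandardAnalyzer(), true);

            // By default only the first 10,000 tokens of a field are
            // indexed; raise the limit for longer documents.
            writer.maxFieldLength = 1000000;

            Document doc = new Document();
            // Field.Text() stores, indexes and tokenizes the value.
            doc.add(Field.Text("contents", "... full extracted text ..."));
            writer.addDocument(doc);

            // Closing the writer flushes buffered documents; skipping
            // this can leave the last documents out of the index.
            writer.close();
        }
    }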


-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

