lucene-java-user mailing list archives

From Harald Kirsch <kir...@ebi.ac.uk>
Subject multiple documents per file, seek and character encoding
Date Fri, 02 Jul 2004 09:32:52 GMT
Hello,

when indexing files which contain several tens of thousands of individual
documents, I want to keep for each document the name of the file where
it comes from and its byte position. In a query, I want to seek to the
byte position to then read the document. I cannot store all the
documents in the index. The whole corpus is about 50GB.
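
The lookup side of this scheme can be sketched as follows. This is a minimal illustration, not code from the original message: the class name DocumentFetcher and its parameters are hypothetical, and it assumes that the byte length of each document is stored in the index next to the file name and byte position, which the message does not mention.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;

/** Hypothetical helper: reads one document back from its source file. */
final class DocumentFetcher {

    /**
     * Seeks to the stored byte offset and decodes `length` bytes.
     * Assumes path, offset and length were stored with the index entry.
     */
    static String fetch(String path, long offset, int length, Charset charset)
            throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);              // jump straight to the document
            byte[] buf = new byte[length];
            raf.readFully(buf);            // fails with EOFException if the file is too short
            return new String(buf, charset);
        }
    }
}
```

Storing the length as well as the offset avoids having to scan for the end of the document at query time; without it, the reader would have to parse forward from the offset until it recognizes the document's end.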

Question: For indexing I read through the file and add a Document to
lucene every time I find it. Is there an easy way to keep track of
byte positions while reading characters from the file? Or do I have to
run the CharsetDecoder myself on top of reading bytes?
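
One possible way to keep byte positions without driving a CharsetDecoder by hand is to split the input at the byte level and decode each piece afterwards. The sketch below is an assumption-laden illustration, not an established Lucene facility: the class OffsetLineReader is hypothetical, it assumes documents can be delimited by looking at whole lines, and it assumes an encoding such as UTF-8 or ISO-8859-1 in which the byte 0x0A only ever means a newline. It would not work for UTF-16.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

/** Hypothetical reader that returns lines and remembers where each one started. */
final class OffsetLineReader {
    private final InputStream in;   // wrap in a BufferedInputStream for real files
    private final Charset charset;
    private long position = 0;      // number of bytes consumed so far
    private long lineStart = 0;     // byte offset of the line returned last

    OffsetLineReader(InputStream in, Charset charset) {
        this.in = in;
        this.charset = charset;
    }

    /** Returns the next line without its '\n', or null at end of input. */
    String readLine() throws IOException {
        lineStart = position;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            position++;
            if (b == '\n') {
                return new String(buf.toByteArray(), charset);
            }
            buf.write(b);
        }
        // Last line without a trailing newline, or nothing left at all.
        return buf.size() > 0 ? new String(buf.toByteArray(), charset) : null;
    }

    /** Byte offset at which the most recently returned line began. */
    long lineStart() {
        return lineStart;
    }
}
```

While indexing, the caller would note lineStart() whenever a line opens a new document and store that value with the document. A '\r' from CRLF line endings is left in the returned string, so the byte count stays exact.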

  Harald.

-- 
------------------------------------------------------------------------
Harald Kirsch | kirsch@ebi.ac.uk | +44 (0) 1223/49-2593

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

