lucene-java-user mailing list archives

From Eyal <>
Subject RE: Adding line count to a document
Date Wed, 01 Mar 2006 22:54:11 GMT
I think my question wasn't clear.

Let's say I'm doing something like this (C# code, but that's not the point):

TextReader reader = new StreamReader(@"C:\FileToIndex.txt");
// CountLines reads the entire file and counts the number of lines
int lineCount = CountLines(@"C:\FileToIndex.txt");

Document doc = new Document();

In the above example, I'm reading the entire file twice, and this could be a
100 MB file.

Now, let's say I have a class LineCountingTextReader that counts the lines
as the file is being read. If I use it as the reader for the document, then
only after I call IndexWriter.AddDocument will I actually have the line
count (since only then will the file have been read entirely).
I don't want to read the entire file into memory and use it for both line
counting and analyzing, since it may be a very big file. So I'm wondering
what others are doing.
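
For illustration, a counting reader of the kind described above might look roughly like this in Java. This is only a sketch under assumptions: the class name LineCountingReader, the rule that an unterminated last line still counts as a line, and the demo input are all hypothetical and not taken from this message. It uses only java.io, with a plain read loop standing in for whatever consumer (such as an analyzer) drains the reader.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper: counts lines as characters pass through to the consumer.
class LineCountingReader extends FilterReader {
    private int newlines = 0; // number of '\n' characters seen so far
    private int last = -1;    // last character seen, or -1 if nothing was read yet

    LineCountingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) {
            observe((char) c);
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, in which case the loop body never runs
        for (int i = 0; i < n; i++) {
            observe(buf[off + i]);
        }
        return n;
    }

    private void observe(char c) {
        if (c == '\n') {
            newlines++;
        }
        last = c;
    }

    // Only complete once the consumer has read to the end of the stream.
    int getLineCount() {
        boolean unterminatedLastLine = last != -1 && last != '\n';
        return newlines + (unterminatedLastLine ? 1 : 0);
    }
}

class LineCountDemo {
    public static void main(String[] args) throws IOException {
        LineCountingReader reader =
                new LineCountingReader(new StringReader("one\ntwo\nthree"));
        System.out.println(reader.getLineCount()); // prints 0: nothing read yet

        // Stand-in for the consumer that drains the reader.
        char[] buf = new char[4];
        while (reader.read(buf, 0, buf.length) != -1) { }

        System.out.println(reader.getLineCount()); // prints 3
    }
}
```

The two printed values show the timing issue: the count is 0 before the reader is consumed and only correct after it has been drained, which is too late if the count must be added as a field before the document is handed over.
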
This is also a problem when you need to get several pieces of information
from one file into different fields (e.g. analyzing an HTML file while also
extracting its links and adding them to a different field).
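
As a hedged sketch of the "several pieces of information in one pass" case, the Java below forwards every character the consumer reads to a list of observers, so one pass can feed both a newline counter and a link collector. All names here (ObservingReader, HrefCollector, OnePassDemo) are hypothetical, and the href scan is a deliberately naive character matcher, not a real HTML parser. As with any such wrapper, the collected values are only complete after the consumer has drained the reader.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

// Hypothetical helper: forwards every character it delivers to a set of observers.
class ObservingReader extends FilterReader {
    private final List<IntConsumer> observers;

    ObservingReader(Reader in, List<IntConsumer> observers) {
        super(in);
        this.observers = observers;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) {
            notifyObservers(c);
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; i++) {
            notifyObservers(buf[off + i]);
        }
        return n;
    }

    private void notifyObservers(int c) {
        for (IntConsumer o : observers) {
            o.accept(c);
        }
    }
}

// Naive scanner for href="..." values (an assumption for illustration only).
class HrefCollector implements IntConsumer {
    private static final String MARKER = "href=\"";
    private int matched = 0;              // how many MARKER characters matched so far
    private StringBuilder current = null; // non-null while inside an attribute value
    final List<String> links = new ArrayList<>();

    @Override
    public void accept(int c) {
        if (current != null) {
            if (c == '"') {               // closing quote ends the value
                links.add(current.toString());
                current = null;
            } else {
                current.append((char) c);
            }
        } else if (c == MARKER.charAt(matched)) {
            matched++;
            if (matched == MARKER.length()) {
                current = new StringBuilder();
                matched = 0;
            }
        } else {
            // 'h' occurs only at the start of MARKER, so this restart is sufficient
            matched = (c == MARKER.charAt(0)) ? 1 : 0;
        }
    }
}

class OnePassDemo {
    public static void main(String[] args) throws IOException {
        String html = "<a href=\"a.html\">A</a>\n<a href=\"b.html\">B</a>\n";
        HrefCollector hrefs = new HrefCollector();
        int[] newlines = {0};
        List<IntConsumer> observers =
                List.of(hrefs, c -> { if (c == '\n') newlines[0]++; });

        Reader reader = new ObservingReader(new StringReader(html), observers);
        char[] buf = new char[8];
        while (reader.read(buf, 0, buf.length) != -1) { } // stand-in for the consumer

        System.out.println(hrefs.links);  // prints [a.html, b.html]
        System.out.println(newlines[0]);  // prints 2
    }
}
```

The observer list keeps the reader itself generic; each extra piece of information becomes one more small observer rather than another pass over the file.
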

Thanks in advance,

> -----Original Message-----
> From: Eyal Post [] 
> Sent: Wednesday, March 01, 2006 8:24 AM
> To:
> Subject: Adding line count to a document
> I'd like to add a line count field to my indexed document. 
> The obvious way is to read my file twice, once to tokenize it 
> and add its content to a field in the document and once to 
> count the number of lines in it and add it to another field. 
> Any idea how I can optimize this and read the file once? 
> Regards,
> Eyal 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
