lucene-dev mailing list archives

From Paul Smith <>
Subject Indexing speed
Date Sun, 30 Jan 2005 21:21:45 GMT
This relates to a previous post of mine regarding Context of 'lines' of 
text (log4j events in my case):

I'm going through the process of writing quick and dirty 
test-case/test-bed classes to validate whether my ideas are going to 
work or not. 

For my first test, I thought I would write a quick indexer that indexed 
a traditional log file by lines, with each line being a Document, so 
that I could then search for matching lines and then do a context 
search.   Yes this is exactly what 'grep' does and does very well, but I 
thought if one was doing a lot of analysis of a log file (typical when 
mentally analysing log files) it might be best to index it once, and 
then search quickly many times.

Turns out that even using JUST a RAMDirectory (which surprised me), 
writing a Document for every line of text is significantly slower than I 
hoped.  I played around with the mergeFactor settings etc., but nothing 
really made much difference to the indexing speed, other than NOT adding 
the Document to the index at all....  I have tried this out on my Mac 
laptop, as well as on a test Linux server, with no noticeable 
difference.  (In both scenarios the log file being read and the new 
index are on the same physical drive, which I know is not the _best_ 
setup, but still.)

This could well be my own stupidity, so here's what I'm doing.

Statistics on the Log File

The log file is 28meg, consisting of 409566 lines, of the form:

[2004-12-21 00:00:00,935 INFO 
][ommand.ProcessFaxCmd][http-80-Processor9][][]  Finished 
processing [mail box=stagingfax][MsgCount=0]
[2004-12-21 00:00:00,986 INFO 
][ommand.ProcessFaxCmd][http-80-Processor9][][]  Finished 
processing [mail box=aconexnz9000][MsgCount=0]
[2004-12-21 00:00:01,126 INFO ][             
monitor][http-80-Processor9][][] Controller duration: 212ms 
url=/Fax, fowardDuration=-1, total=212
[2004-12-21 00:00:03,668 ERROR][essFaxDeliveryAction][Thread-157][][] 
Could not connect to mail server! 
javax.mail.AuthenticationFailedException: Login failed: authentication 
        at com.sun.mail.imap.IMAPStore.protocolConnect(
        at javax.mail.Service.connect(
        at javax.mail.Service.connect(
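For what it's worth, lines in this format split fairly cleanly into fields with a regex. The pattern below is only my own guess at the layout (timestamp, level, category, thread, then the message) and isn't part of the test-bed:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParse {
    // Hypothetical reading of the layout:
    // [timestamp level][category][thread][][] message
    static final Pattern LINE = Pattern.compile(
        "\\[(\\S+ \\S+) +(\\w+) *\\]\\[\\s*([^\\]]*)\\]"
        + "\\[([^\\]]*)\\]\\[[^\\]]*\\]\\[[^\\]]*\\]\\s*(.*)");

    public static void main(String[] args) {
        String line = "[2004-12-21 00:00:03,668 ERROR]"
                + "[essFaxDeliveryAction][Thread-157][][] "
                + "Could not connect to mail server!";
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            System.out.println("ts=" + m.group(1) + " level=" + m.group(2)
                    + " thread=" + m.group(4));
        }
    }
}
```

Splitting a line into fields like this would also let each Document carry a timestamp and level rather than one opaque "Line" field.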

Source code for test-bed:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.text.NumberFormat;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class TestBed1 {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) throw new IllegalArgumentException("not enough args");
        String filename = args[0];
        File file = new File(filename);
        Analyzer a = new SimpleAnalyzer();
        String indexLoc = "/tmp/testbed1/";
        //IndexWriter writer = new IndexWriter(indexLoc, a, true);
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, a, true);
        long length = file.length();
        BufferedReader fileReader = new BufferedReader(new FileReader(file));
        String line;
        double processed = 0;
        NumberFormat nf = NumberFormat.getPercentInstance();
        String percent;
        String lastPercent = " ";
        long lines = 0;
        while ((line = fileReader.readLine()) != null) {
            Document doc = new Document();
            doc.add(Field.UnStored("Line", line));
            ramWriter.addDocument(doc);
            lines++;
            processed += line.length();
            percent = nf.format(processed / length);
            if (!percent.equals(lastPercent)) {
                lastPercent = percent;
                System.out.println(percent + "(lines=" + lines + ")");
            }
        }
        fileReader.close();
        ramWriter.close();
    }
}

I did another simple test timing how long it takes Java to just read the 
lines of the file, and that is extremely quick in comparison.  It's 
actually the "ramWriter.addDocument(doc)" line that seems to do the bulk 
of the work, and probably for good reason.  I had originally tried 
Field.Text(...) to keep the line in the index for Context later on, but 
even UnStored doesn't really make that much difference from a stopwatch 
point of view (it creates a bigger index, of course).
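For reference, this is the sort of read-only baseline I timed it against. It's self-contained (it writes a throwaway file of roughly the same shape first), so the numbers are only indicative:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class ReadTiming {
    public static void main(String[] args) throws IOException {
        // Build a throwaway log-sized file so the test is self-contained.
        File f = File.createTempFile("testbed", ".log");
        f.deleteOnExit();
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(f)));
        for (int i = 0; i < 400000; i++) {
            out.println("[2004-12-21 00:00:00,935 INFO ]"
                    + "[cmd][http-80-Processor9][][] line " + i);
        }
        out.close();

        // Time a plain readLine() pass, with no Lucene work per line.
        long start = System.currentTimeMillis();
        BufferedReader in = new BufferedReader(new FileReader(f));
        long lines = 0;
        while (in.readLine() != null) lines++;
        in.close();
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(lines + " lines read in " + elapsed + "ms");
    }
}
```

The gap between this baseline and the indexing run is the per-Document cost of analysis plus addDocument.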

I might set up a profiler and work out where the time is going, but you 
guys probably already know the answer.

I'm going to need much higher throughput for my utility to be useful. 
Maybe that's just not achievable.



Paul Smith
