lucene-dev mailing list archives

From Paul Smith <psm...@aconex.com>
Subject Re: Indexing speed
Date Mon, 31 Jan 2005 05:30:16 GMT
Thanks Otis, I tried Field.Keyword, but that didn't seem to make any 
appreciable difference.

I'll have a hunt around with a profiler and see what I can find.  I 
guess my use case is unusual: I need to create a LOT of very small 
documents.

cheers,

Paul

Otis Gospodnetic wrote:

>I believe most of the time is being spent in the Analyzer.  It should
>be easy to test this claim empirically by using Field.Keyword instead
>of Field.Text (Field.Keyword fields are not analyzed).  If that turns
>out to be correct, then you could try writing a custom, more
>streamlined Analyzer.
>
>Otis
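[Editor's note: Otis's suggested experiment could look like the sketch below, written against the Lucene 1.x API discussed in this thread. The LineAnalyzer class is a hypothetical name, not part of Lucene; it behaves essentially like Lucene's own WhitespaceAnalyzer. Swapping it in for SimpleAnalyzer (or using Field.Keyword, which bypasses analysis entirely) isolates the Analyzer's share of the indexing cost.]

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Hypothetical minimal Analyzer: whitespace tokenizing only, with none of
// SimpleAnalyzer's per-character lowercasing.  Comparing indexing time with
// this, with SimpleAnalyzer, and with un-analyzed Field.Keyword fields shows
// how much of the cost is analysis versus index maintenance.
public class LineAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new WhitespaceTokenizer(reader);
    }
}
```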
>
>--- Paul Smith <psmith@aconex.com> wrote:
>
>>This relates to a previous post of mine regarding Context of 'lines'
>>of text (log4j events in my case):
>>
>>http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg11869.html
>>
>>I'm going through the process of writing quick and dirty 
>>test-case/test-bed classes to validate whether my ideas are going to 
>>work or not. 
>>
>>For my first test, I thought I would write a quick indexer that
>>indexed a traditional log file by lines, with each line being a
>>Document, so that I could then search for matching lines and then do
>>a context search.  Yes, this is exactly what 'grep' does, and does
>>very well, but I thought that if one was doing a lot of analysis of a
>>log file (typical when mentally analysing log files) it might be best
>>to index it once and then search quickly many times.
>>
>>It turns out that even using JUST a RAMDirectory (which surprised
>>me), writing a Document for every line of text isn't as fast as I was
>>hoping; it is taking significantly longer than I expected.  I played
>>around with the mergeFactor settings etc., but nothing really made
>>much difference to the indexing speed, other than NOT adding the
>>Document to the index....  I have tried this out on my Mac laptop, as
>>well as a test Linux server, with no noticeable difference.  (Both
>>scenarios have the log file being read and the new index on the same
>>physical drive, which I know is not the _best_ setup, but still.)
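[Editor's note: for reference, the knobs Paul mentions are public fields on IndexWriter in the Lucene 1.x line. A tuning sketch follows; the class name is made up and the values are purely illustrative, not recommendations from the thread.]

```java
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class TunedWriter {
    public static IndexWriter open(RAMDirectory dir) throws Exception {
        IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
        // Public int fields in Lucene 1.x; values here are illustrative.
        // mergeFactor: how many segments accumulate before a merge;
        // minMergeDocs: how many documents are buffered in memory first.
        writer.mergeFactor = 50;
        writer.minMergeDocs = 1000;
        return writer;
    }
}
```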
>>
>>This could well be my own stupidity, so here's what I'm doing.
>>
>>Statistics on the Log File
>>==========================
>>
>>The log file is 28 MB, consisting of 409,566 lines, of the form:
>>
>>[2004-12-21 00:00:00,935 INFO ][ommand.ProcessFaxCmd][http-80-Processor9][192.168.0.220][] Finished processing [mail box=stagingfax][MsgCount=0]
>>[2004-12-21 00:00:00,986 INFO ][ommand.ProcessFaxCmd][http-80-Processor9][192.168.0.220][] Finished processing [mail box=aconexnz9000][MsgCount=0]
>>[2004-12-21 00:00:01,126 INFO ][             monitor][http-80-Processor9][192.168.0.220][] Controller duration: 212ms url=/Fax, fowardDuration=-1, total=212
>>[2004-12-21 00:00:03,668 ERROR][essFaxDeliveryAction][Thread-157][][] Could not connect to mail server! [host=test.aconex.com][username=outboundstagingfax][password=d3vf@x]
>>javax.mail.AuthenticationFailedException: Login failed: authentication failure
>>        at com.sun.mail.imap.IMAPStore.protocolConnect(IMAPStore.java:330)
>>        at javax.mail.Service.connect(Service.java:233)
>>        at javax.mail.Service.connect(Service.java:134)
>>        at com.aconex.fax.action.ProcessFaxDeliveryAction.perform(ProcessFaxDeliveryAction.java:68)
>>        at com.aconex.scheduler.automatedTasks.FaxOutDeliveryMessageProcessorAT.run(FaxOutDeliveryMessageProcessorAT.java:62)
>>==================
>>Source code for test-bed:
>>==================
>>
>>import java.io.BufferedReader;
>>import java.io.File;
>>import java.io.FileReader;
>>import java.text.NumberFormat;
>>
>>import org.apache.lucene.analysis.Analyzer;
>>import org.apache.lucene.analysis.SimpleAnalyzer;
>>import org.apache.lucene.document.Document;
>>import org.apache.lucene.document.Field;
>>import org.apache.lucene.index.IndexWriter;
>>import org.apache.lucene.store.RAMDirectory;
>>
>>public class TestBed1 {
>>
>>    public static void main(String[] args) throws Exception {
>>
>>        if (args.length < 1) {
>>            throw new IllegalArgumentException("not enough args");
>>        }
>>        String filename = args[0];
>>
>>        File file = new File(filename);
>>        Analyzer a = new SimpleAnalyzer();
>>
>>        //String indexLoc = "/tmp/testbed1/";
>>        //IndexWriter writer = new IndexWriter(indexLoc, a, true);
>>
>>        RAMDirectory ramDir = new RAMDirectory();
>>        IndexWriter ramWriter = new IndexWriter(ramDir, a, true);
>>
>>        long length = file.length();
>>
>>        BufferedReader fileReader = new BufferedReader(new FileReader(file));
>>
>>        String line;
>>        double processed = 0;
>>        NumberFormat nf = NumberFormat.getPercentInstance();
>>        nf.setMaximumFractionDigits(0);
>>
>>        String percent;
>>        String lastPercent = " ";
>>        long lines = 0;
>>        while ((line = fileReader.readLine()) != null) {
>>            Document doc = new Document();
>>            doc.add(Field.UnStored("Line", line));
>>            ramWriter.addDocument(doc);
>>            processed += line.length();
>>            lines++;
>>            percent = nf.format(processed / length);
>>            if (!percent.equals(lastPercent)) {
>>                lastPercent = percent;
>>                System.out.println(percent + " (lines=" + lines + ")");
>>            }
>>        }
>>        fileReader.close();
>>        ramWriter.optimize();
>>        ramWriter.close();
>>    }
>>}
>>
>>=======
>>
>>I did other simple tests measuring exactly how long it takes Java to
>>just read the lines of the file, and that is mega quick in
>>comparison.  It's actually the "ramWriter.addDocument(doc)" line that
>>seems to have the biggest amount of work to do, and probably for good
>>reason.  I had originally tried to use Field.Text(...) to keep the
>>line with the index for context later on, but even UnStored doesn't
>>really make that much difference from a stopwatch point of view (it
>>creates a bigger index, of course).
>>
>>I might set up a profiler and work through where it's taking the
>>time, but you guys probably already know the answer.
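[Editor's note: the read-only baseline Paul describes can be sketched as a tiny pure-Java harness. The class and method names below are made up for illustration: countLines approximates the "just read the file" baseline, while countTokens adds a crude whitespace split per line as a stand-in for per-line tokenization work; the gap between the two timings hints at where the cost lives.]

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical micro-harness for separating raw I/O cost from per-line work.
public class ReadBench {

    // Baseline: just consume the lines, doing nothing with them.
    public static long countLines(BufferedReader in) throws IOException {
        long lines = 0;
        while (in.readLine() != null) {
            lines++;
        }
        return lines;
    }

    // Crude stand-in for tokenization: split each line on whitespace.
    public static long countTokens(BufferedReader in) throws IOException {
        long tokens = 0;
        String line;
        while ((line = in.readLine()) != null) {
            tokens += line.split("\\s+").length;
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String sample = "a b c\nd e\nf";
        long t0 = System.currentTimeMillis();
        long lines = countLines(new BufferedReader(new StringReader(sample)));
        long t1 = System.currentTimeMillis();
        long tokens = countTokens(new BufferedReader(new StringReader(sample)));
        long t2 = System.currentTimeMillis();
        System.out.println(lines + " lines in " + (t1 - t0) + "ms, "
                + tokens + " tokens in " + (t2 - t1) + "ms");
    }
}
```

In a real run, the readers would wrap a FileReader over the 28 MB log file rather than an in-memory string.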
>>
>>I'm going to need much higher throughput for my utility to be useful.
>>
>>Maybe that's just not achievable.
>>
>>Thoughts?
>>
>>cheers,
>>
>>Paul Smith
>>
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
