lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kai Hu" <kai...@dusee.cn>
Subject 答复: Indexing correctly?
Date Mon, 13 Aug 2007 07:01:49 GMT
Hi, John
	I think you cost too much time in I/O,and if you use RAMDirectory first will better.see http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

kai

-----邮件原件-----
发件人: Erick Erickson [mailto:erickerickson@gmail.com] 
发送时间: 2007年8月13日 星期一 1:57
收件人: java-user@lucene.apache.org
主题: Re: Indexing correctly?

Where are your source files and index? If they're somewhere
out there on the network, you may be having some slowdown
because of network latency (the part about "/mount/....." leads
me to ask this one).

If this is the case, you might get an improvement if all the files are
local...

Best
Erick

On 8/11/07, John Paul Sondag <jsondag2@uiuc.edu> wrote:
>
> It takes roughly 6 hours for me to index a Gig of data.  The benchmarks
> take
> quite a bit less if I'm reading it correctly.  I'll try out the
> StringBuffer/Builder and let you know.  Thanks for the quick response and
> if
> you have any more suggestions please let me know.
>
> --JP
>
> On 8/11/07, karl wettin <karl.wettin@gmail.com> wrote:
> >
> > How much slower than anticipated is it?
> >
> > I would start by using a StringBuffer/Builder rather than appending
> > (immutable) strings to each other.
> >
> >
> > 11 aug 2007 kl. 19.05 skrev John Paul Sondag:
> >
> > > Hi,
> > >
> > > I was hoping that maybe you guys could see if I'm somehow indexing
> > > inefficiently.  I'm putting relevant parts of my code below.  I've
> > > looked at
> > > the "benchmarks" page on Lucene and my indexing time is taking a
> > > substantial
> > > amount of time more than what I see posted.  I'm not sure when I
> > > should call
> > > flush() ( I saw that I should be doing that on the
> > > ImproveIndexingSpeed
> > > page).  I'd really appreciate any advice.
> > >
> > > Here's my code:
> > >
> > >       File directory = new File( "/mounts/falcon5/disks/0/tcheng3/
> > > Dataset");
> > >       File[] theFiles = directory.listFiles();
> > >
> > >           //go through each file inside the directory and index it
> > >           for(int curFile = 0; curFile < theFiles.length; curFile++)
> > >           {
> > >               File fin=theFiles[curFile];
> > >
> > >               //open up the file
> > >               FileInputStream inf = new FileInputStream(fin);
> > >               InputStreamReader isr = new InputStreamReader(inf,
> > > "US-ASCII");
> > >               BufferedReader in = new BufferedReader(isr);
> > >               String text="";
> > >               String docid="";
> > >
> > >             while (true) {
> > >
> > >             //read in the file one line at a time, and act accordingly
> > >                 String line = in.readLine();
> > >                 if (line == null) { break;}
> > >
> > >                  if (line.startsWith("<DOC>") ) {
> > >                     //get docID
> > >                     line = in.readLine();
> > >                     String tempStr = line.substring(8,line.length());
> > >                     int pos = tempStr.indexOf(' ');
> > >                     docid = tempStr.substring(0,pos);
> > >                     }else if (line.startsWith("</DOC>")) {
> > >
> > >                     Document doc = new Document();
> > >
> > >                       doc.add(new Field("contents",text,
> > > Field.Store.NO,
> > > Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS ));
> > >                     doc.add(new Field("DocID",docid, Field.Store.YES,
> > > Field.Index.NO));
> > >                       writer.addDocument(doc);
> > >                     text="";
> > >                 } else {
> > >                     text = text + "\n" + line;
> > >                 }
> > >             }
> > >
> > >         }
> > >
> > >
> > >           int numIndexed = writer.docCount();
> > >
> > >           writer.optimize();
> > >           writer.close();
> > >
> > >
> > > Thanks,
> > >
> > > --JP
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message