lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Why is lucene so slow indexing in nfs file system ?
Date Wed, 09 Jan 2008 14:26:04 GMT
<<< would like to find out why my application has this big
delay to index>>>

Well, then you have to measure <G>. Tthe first thing I'd do
is pinpoint where the time was being spent. Until you have
that answered, you simply cannot take any meaningful action.

1> don't do any of the indexing. No new Documents, don't
add any fields, etc. This will just time the PDF parsing.
(I'd run this for set number of documents rather than the
whole 10G). This'll tell you whether the issue is indexing or
PDFBox.

2> Perhaps try the above with local files rather than files
on the nfs mount.

3> Put back some of the indexing and measure each
step. For instance, create the new documents but don't
add them to the index.

4>Then go ahead and add them to the index.

The numbers you get for these measurements will tell
you a lot. At that point, perhaps folks will have more useful
suggestions.

The reason I'm being so unhelpful is that without lots more
detail, there's really nothing we can help with since there
are so many variables that it's just impossible to say
which one is the problem. For instance, is it a single
10G document and you're swapping like crazy? Are you
CPU bound or IO bound? Have you tried profiling your
process at all to find the choke points?

Best
Erick


On Jan 9, 2008 8:50 AM, Ariel <isaacrc82@gmail.com> wrote:

> Hi:
> I have seen the post in
> http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12700.htmland
> I am implementing a similar application in a distributed enviroment, a
> cluster of nodes only 5 nodes. The operating system I use is Linux(Centos)
> so I am using nfs file system too to access the home directory where the
> documents to be indexed reside and I would like to know how much time an
> application spends to index a big amount of documents like 10 Gb ?
> I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb in
> every
> nodes, LAN: 1Gbits/s.
>
> The problem I have is that my application spends a lot of time to index
> all
> the documents, the delay to index 10 gb of pdf documents is about 2 days
> (to
> convert pdf to text I am using pdfbox) that is of course a lot of time,
> others applications based in lucene, for instance ibm omnifind only takes
> 5
> hours to index the same amount of pdfs documents. I would like to find out
> why my application has this big delay to index, any help is welcome.
> Dou you know others distributed architecture application that uses lucene
> to
> index big amounts of documents ? How long time it takes to index ?
> I hope yo can help me
> Greetings
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message