lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: Overall large size in Solr across collections
Date Wed, 20 Apr 2016 13:43:36 GMT
On 4/19/2016 10:12 PM, Zheng Lin Edwin Yeo wrote:
> Thanks for the information Shawn.
>
> I believe it could be due to the types of files being indexed.
> Currently, I'm indexing EML files in HTML format, which are richer in
> content (with inline images and full text), while previously the EML
> files were in plain-text format, with the images as attachments.
>
> Will this be the cause of the slow indexing speed which I'm facing now? It
> is more than 3 times slower than what I had previously.

I assume that you are using the Extracting Request Handler (ERH) for
this.  I know almost nothing about Tika, but I would imagine that
extracting data from rich text documents is not a fast process, and that
plain text documents would be a lot faster.  I could be wrong -- I've
never used the ERH myself.

If you want a setup like this to go faster, you probably need to make
your indexing process multi-threaded.  Ideally, such an application
would be written in Java and would incorporate Tika into the client-side
code.  Tika can be very unstable, so running it inside Solr (which is
what the Extracting Request Handler does) can make Solr itself unstable.
If Tika runs in the client instead, a crash there only affects the
indexing program and not Solr.
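
As a rough illustration of the multi-threaded approach, here is a minimal
sketch that uses only the JDK.  The class name ParallelIndexer, the method
indexAll, and the "extractor" and "sink" parameters are all hypothetical:
the extractor stands in for a client-side Tika parse and the sink stands
in for whatever sends the extracted text to Solr (for example through
SolrJ).  Neither library is actually called here, so treat this as an
outline of the threading pattern and not a working indexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

class ParallelIndexer {

    // Runs extractor + sink for each raw document on a fixed-size thread pool.
    // Returns how many documents were extracted and handed to the sink without error.
    static int indexAll(List<String> rawDocs,
                        Function<String, String> extractor, // placeholder for a Tika parse
                        Consumer<String> sink,              // placeholder for a Solr update
                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> pending = new ArrayList<>();
        int ok = 0;
        try {
            for (String raw : rawDocs) {
                pending.add(pool.submit(() -> sink.accept(extractor.apply(raw))));
            }
            for (Future<?> f : pending) {
                try {
                    f.get(); // wait for this document to finish
                    ok++;
                } catch (ExecutionException e) {
                    // A failed extraction loses only this one document
                    System.err.println("skipped document: " + e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore the flag and stop waiting
                    break;
                }
            }
        } finally {
            pool.shutdown();
        }
        return ok;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("<html><body>Hello</body></html>", "<p>World</p>");
        ConcurrentLinkedQueue<String> indexed = new ConcurrentLinkedQueue<>();
        // Toy extractor: strips tags with a regex; a real client would parse with Tika
        int n = indexAll(docs, raw -> raw.replaceAll("<[^>]+>", ""), indexed::add, 4);
        System.out.println(n + " documents sent");
    }
}
```

The sink is called from several threads at once, so whatever replaces it
must be safe for concurrent use.  Because each document is its own task,
an exception thrown during extraction skips that document without
stopping the rest of the run.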

Thanks,
Shawn
