lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: Slow indexing speed when collection size is large
Date Mon, 08 May 2017 02:03:36 GMT
Hi Shawn,

Are the two types of indexing (ERH with OCR, and indexing from a DB)
happening on the same Solr server?
A) Yes, they are happening on the same Solr server, but currently, only the
indexing from a DB is running.

Is Solr in a virtual machine?
A) No

Is the 384GB at the hypervisor level, or the virtual machine level?
A) The hypervisor level. The virtual machine for the Sybase is allocated
64GB of memory.

Is the 22GB heap the total heap memory, or is that per Solr instance?
A) Per Solr instance.

Only the Sybase database is running on a virtual machine, under Hyper-V.
Solr is running on the main server.
The main server is running on Windows 2012, while the virtual machine is
running on SUSE Linux 9. Both Solr instances are running on SSD drives,
while the virtual machine is running on a normal hard disk.

What is the best suggestion for the 5TB of indexes? The searching speed is
quite fast currently, even during indexing. It is the indexing speed that
is slow.

Regards,
Edwin



On 7 May 2017 at 21:14, Shawn Heisey <apache@elyograg.org> wrote:

> On 5/6/2017 6:49 PM, Zheng Lin Edwin Yeo wrote:
> > For my rich documentation handling, I'm using Extracting Request
> > Handler, and it requires OCR.
> >
> > However, currently, for the slow indexing speed which I'm experiencing,
> > the indexing is done directly from the Sybase database. I fetch about
> > 1000 records at a time from Sybase, and store them in a CacheRowSet to
> > be indexed. The query to the Sybase database is quite fast, and most of
> > the time is spent on processes in the CacheRowSet.
> <snip>
> > A) 384 GB
> <snip>
> > A) 22 GB
> <snip>
> > A) 5 TB
> <snip>
> > A) A virtual machine with Sybase database is running on the server
>
> The discussion about the drawbacks of the Extracting Request Handler has
> already taken place.  Tika should be running on separate hardware, not
> embedded in Solr.  Having high-impact Tika processing run on the Solr
> server is going to slow everything down.
>
> Are the two types of indexing (ERH with OCR, and indexing from a DB)
> happening on the same Solr server?
>
> As soon as you mention virtual machines, my mental picture of the setup
> becomes much less clear.  You'll need to fully describe the OS and
> hardware setup, at both the hypervisor and virtual machine level.  Then
> I will know what questions to ask for more detailed information.
>
> Is Solr in a virtual machine?
> Is the 384GB at the hypervisor level, or the virtual machine level?
> Is the 22GB heap the total heap memory, or is that per Solr instance?
>
> If the 5TB is Solr index data, then there's no way you're going to get
> fast performance.  Putting enough memory in one machine to effectively
> cache that much data is impractically expensive, and most server
> hardware doesn't have enough memory slots even if you do have the
> money.  384GB wouldn't be enough for 5TB of index, and that's not even
> taking into account the memory needed by your software, including Solr
> and Sybase.
>
> Thanks,
> Shawn
>
>
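
To make the memory point in Shawn's reply concrete, here is a rough
back-of-the-envelope sketch. The figures come from the thread; how they
combine is an assumption: two Solr instances at 22GB heap each, the 64GB
Sybase VM counted against the 384GB host, and OS and other overhead ignored.
The function name is hypothetical.

```python
# Figures taken from the thread; the way they are combined is an assumption.
TOTAL_RAM_GB = 384        # physical memory on the main server
SOLR_HEAP_GB = 22         # heap per Solr instance
SOLR_INSTANCES = 2        # assumed from "both Solr instances"
SYBASE_VM_GB = 64         # memory allocated to the Sybase VM
INDEX_SIZE_GB = 5 * 1024  # 5TB of index data


def page_cache_budget_gb(total, heap, instances, vm):
    """Memory left for the OS disk cache, ignoring OS and other overhead."""
    return total - heap * instances - vm


free_gb = page_cache_budget_gb(
    TOTAL_RAM_GB, SOLR_HEAP_GB, SOLR_INSTANCES, SYBASE_VM_GB
)
coverage = free_gb / INDEX_SIZE_GB

print(f"Memory left for disk cache: {free_gb} GB")
print(f"Fraction of index that fits in cache: {coverage:.1%}")
```

Under these assumptions only a small fraction of the index can sit in the
OS disk cache at any time, which is the constraint Shawn describes.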
