lucene-solr-user mailing list archives

From "Tang, Rebecca" <Rebecca.T...@ucsf.edu>
Subject solr server early EOF errors
Date Fri, 13 Jun 2014 18:06:27 GMT
Hi there,

I've been working on this issue for a while and I really don't know what the root cause
is.  Any insight would be great!

I have 14 million records in a mysql DB.  I grab 100,000 records from the DB at a time and
then use ConcurrentUpdateSolrServer (with queue size = 50 and thread count = 4 and using the
internally managed solr client) to write the documents to the solr index.
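
The indexing loop looks roughly like this (a simplified sketch: the URL, core name, and the fetchBatch() stand-in for the MySQL query are placeholders; queue size 50 and 4 threads match what we actually use):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;

public class MetadataIndexer {

    public static void main(String[] args) throws Exception {
        // Updates are buffered and sent by the client's internal threads,
        // so add() returns before the server has acknowledged anything.
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 50, 4);

        final int batchSize = 100000;
        final long totalRows = 14000000L;

        for (long offset = 0; offset < totalRows; offset += batchSize) {
            for (SolrInputDocument doc : fetchBatch(offset, batchSize)) {
                server.add(doc);
            }
        }

        server.blockUntilFinished();  // drain the queue before committing
        server.commit();
        server.shutdown();
    }

    // Stand-in for the "grab 100,000 records from MySQL" step.
    private static List<SolrInputDocument> fetchBatch(long offset, int limit) {
        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        for (long i = offset; i < offset + limit; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title_t", "record " + i);
            docs.add(doc);
        }
        return docs;
    }
}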

If I build metadata only (i.e., only from the DB to Solr), then the index build takes 4 hrs with
no errors.

But if I build metadata + OCR text (the OCR text is stored on the file system and can be very
large), then the index build takes 15 – 16 hrs and I often get a few early EOF errors
on the Solr server.
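
The OCR side of the build is roughly the following (again a sketch; the directory layout and the "ocr_text" field name here are made up, but each OCR file is read off the file system and added to the document as one big text field, so a single update request can get very large):

import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class OcrLoader {

    // Attach the OCR text (one plain-text file per record, named by id) to the document.
    public static void addOcrText(SolrInputDocument doc, String docId) throws IOException {
        Path ocrFile = Paths.get("/data/ocr", docId + ".txt");
        if (Files.exists(ocrFile)) {
            String ocrText = new String(Files.readAllBytes(ocrFile), StandardCharsets.UTF_8);
            doc.addField("ocr_text", ocrText);
        }
    }
}
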
From Solr.log:
INFO  - 2014-06-13 06:28:27.113; org.apache.solr.update.processor.LogUpdateProcessor; [ltdl3testperf]
webapp=/solr path=/update params={wt=javabin&version=2} {add=[trpy0136 (1470801743195406336),
nfhc0136 (1470801743199600640), sfhc0136 (1470801743205892096), kghc0136 (1470801743218475008),
zfhc0136 (1470801743220572160), jghc0136 (1470801743237349376), rghc0136 (1470801743268806656),
ffhc0136 (1470801743270903808), pghc0136 (1470801743285583872), sghc0136 (1470801743286632448),
... (14165 adds)]} 0 260102
ERROR - 2014-06-13 06:28:27.114; org.apache.solr.common.SolrException; java.lang.RuntimeException:
[was class org.eclipse.jetty.io.EofException] early EOF
        at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
…

We tried increasing the Solr server from 4 to 6 CPUs.  We moved the Solr server to a faster
disk.  I reduced the queue size for the ConcurrentUpdateSolrServer from 100 to 50.  But
we cannot consistently get a full index built without any EOF errors.
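
For what it's worth, ConcurrentUpdateSolrServer reports failed update requests through its handleError() callback rather than throwing back to the caller, which would explain records getting dropped without the build itself failing.  A sketch of overriding it so the build can at least count and log the failures (the counter here is just an illustration, not something we run today):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

import java.util.concurrent.atomic.AtomicInteger;

public class LoggingUpdateServer extends ConcurrentUpdateSolrServer {

    private final AtomicInteger failedRequests = new AtomicInteger();

    public LoggingUpdateServer(String solrUrl, int queueSize, int threadCount) {
        super(solrUrl, queueSize, threadCount);
    }

    @Override
    public void handleError(Throwable ex) {
        // The default implementation only logs; record the failure so the
        // build can report (or retry) the affected request afterwards.
        failedRequests.incrementAndGet();
        ex.printStackTrace();
    }

    public int getFailedRequestCount() {
        return failedRequests.get();
    }
}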

In my past three builds (I build them overnight):

  1.  The first one succeeded
  2.  The second one had one early EOF error and dropped 3 records out of 14 million
  3.  The third one had many early EOFs and dropped around 200,000 records

One cluster of the errors occurred at around 6:28 AM.  I looked at the CPU and file I/O stats
around that time and didn't see anything out of the ordinary.

> sar
                CPU     %user     %nice   %system   %iowait    %steal     %idle
06:00:01 AM     all     42.13      0.00      1.54      2.13      0.00     54.20
06:10:01 AM     all     43.30      0.00      1.68      2.77      0.00     52.24
06:20:01 AM     all     47.73      0.00      1.83      2.43      0.00     48.01
06:30:01 AM     all     47.71      0.00      1.76      3.15      0.00     47.38
06:40:01 AM     all     47.01      0.00      1.68      2.55      0.00     48.76

> sar -d
06:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
06:20:01 AM    dev8-0      1.84      2.35    370.95    203.01      0.05     27.60      9.58      1.76
06:20:01 AM   dev8-16     83.05    464.90  44384.81    540.05     13.25    160.17      2.53     21.03
06:20:01 AM   dev8-32      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:20:01 AM  dev253-0      1.41      1.71     10.90      8.95      0.01     10.16      3.03      0.43
06:20:01 AM  dev253-1     45.09      0.64    360.06      8.00      2.46     54.66      0.30      1.37
06:20:01 AM  dev253-2   5513.98    464.90  44092.00      8.08   1623.60    295.54      0.04     21.04
06:30:01 AM    dev8-0      2.52    100.62     83.64     72.99      0.03     10.42      6.59      1.66
06:30:01 AM   dev8-16     52.56   1502.75  18736.64    385.06      5.67    107.95      2.17     11.42
06:30:01 AM   dev8-32     42.55      0.01  38923.71    914.83     15.33    360.27      3.84     16.35
06:30:01 AM  dev253-0      3.03     98.24     13.55     36.93      0.03      9.44      2.99      0.90
06:30:01 AM  dev253-1      9.06      2.38     70.09      8.00      0.26     29.19      0.84      0.77
06:30:01 AM  dev253-2   7216.35   1502.76  57660.35      8.20   2599.49    360.22      0.04     26.58


Does anyone have any suggestions on where I can dig for the root cause?


Thanks!
Rebecca Tang
Applications Developer, UCSF CKM
Legacy Tobacco Document Library <legacy.library.ucsf.edu/>
E: rebecca.tang@ucsf.edu
