manifoldcf-dev mailing list archives

From Frédéric Olier <FOl...@wooxo.fr>
Subject RE: [Solr] Error on documents makes ManifoldCF
Date Thu, 29 Oct 2015 15:56:56 GMT
Hi Karl,

I managed to get round my 'out of memory issue' with Solr by tweaking the Solr configuration.

Now, I have documents that can take ages to be indexed by Solr.
I set a reasonable value for the socket timeout of the Solr connector (1200 sec).

Even so, I still get timeouts.
When a timeout occurs, the MCF crawl stops.
If I restart it, the file that timed out gets submitted for indexing again, and so on.

What would you recommend in this situation?

Many thanks,


-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, October 22, 2015 18:23
To: dev
Subject: Re: [Solr] Error on documents makes ManifoldCF

Hi Fred,

When a Java process runs out of memory in one thread, *all* threads are likely impacted.
That's why if you are seeing memory issues you really just have to fix them; you can't just
ignore the exception and hope for the best.

Karl


On Thu, Oct 22, 2015 at 12:20 PM, Frédéric Olier <FOlier@wooxo.fr> wrote:

> Hi Karl,
>
> Indeed, I have this in my logs:
>
> MCF:
>
> Exception tossed: Repeated service interruptions - failure processing
> document: Read timed out
>
>
> Solr
>
> Error for /datafari-solr/FileShare/update/extract
> java.lang.OutOfMemoryError: Java heap space
>
>
> The file is not that big (7M).
>
> Although ignoring the file might not be the 'nicest' solution, is that
> possible?
>
> I'll investigate on the Solr/Tika side to see if I can deactivate the
> recursive parsing of archive files.
>
> Thanks anyway,
> Fred.
>
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: Thursday, October 22, 2015 18:16
> To: dev
> Subject: Re: [Solr] Error on documents makes ManifoldCF
>
> Hi Fred,
>
> I suspect that you are getting an out-of-memory or out-of-disk error 
> on the Solr side.  That's really bad and you don't just want to make 
> ManifoldCF ignore it.
>
> What you can do is limit the maximum size of the files sent to Solr.  That's a
> far better fix.
>
> Karl
>
>
> On Thu, Oct 22, 2015 at 12:07 PM, Frédéric Olier <FOlier@wooxo.fr> wrote:
>
> > Hi,
> >
> > I managed to progress on my issues.
> >
> > The document (docx) is now skipped as expected when it fails.
> >
> > However, I now have another issue.
> > I have a tar.gz file that itself contains 100+ tar.gz files.
> >
> > ManifoldCF gets a 500 error from Solr, which causes the crawl to abort.
> > I looked at the Solr configuration and, given the hardware used, I
> > won't be able to tune the JVM and so on any further.
> >
> > Therefore I'd like to know whether ManifoldCF can be configured to
> > skip files for which it gets such an error instead of aborting?
> >
> > Fred.
> >
> >
> > -----Original Message-----
> > From: Frédéric Olier [mailto:FOlier@wooxo.fr]
> > Sent: Wednesday, October 21, 2015 17:51
> > To: dev@manifoldcf.apache.org
> > Subject: RE: [Solr] Error on documents makes ManifoldCF
> >
> > Hi Karl,
> >
> > Many thanks.
> >
> > I found the configuration to use here:
> >
> > http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/
> >
> > Search for "ignoreTikaException"
> >
> > I'll test it and see if it fixes my issue.
> >
> > Fred
> >
> >
> > -----Original Message-----
> > From: Karl Wright [mailto:daddywri@gmail.com]
> > Sent: Wednesday, October 21, 2015 17:23
> > To: dev
> > Subject: Re: [Solr] Error on documents makes ManifoldCF
> >
> > Standard google searching finds it.
> >
> > See:
> >
> >
> > http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201503.mbox/%3C55127866020000250008FD2A@slesmail.veritablelp.com%3E
> >
> > Karl
> >
> >
> > On Wed, Oct 21, 2015 at 11:14 AM, Frédéric Olier <FOlier@wooxo.fr> wrote:
> >
> > > Hi,
> > >
> > > Thanks for your reply.
> > >
> > > I looked here:
> > > http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/
> > >
> > > But there is no 'search' option...
> > >
> > > Any idea where I can search for what I'm looking for more efficiently?
> > >
> > > Thanks
> > >
> > >
> > > -----Original Message-----
> > > From: Karl Wright [mailto:daddywri@gmail.com]
> > > Sent: Wednesday, October 21, 2015 16:47
> > > To: dev
> > > Subject: Re: [Solr] Error on documents makes ManifoldCF
> > >
> > > Hi Frédéric,
> > >
> > > There's a flag in the Solr configuration you can set that will
> > > cause exceptions from Solr Cell (Tika) to cause the document to be
> > > skipped rather than causing ManifoldCF to retry the document.  I
> > > don't remember what it is but others have noted it and you can
> > > search the mail archive to find it.
> > >
> > > Thanks,
> > > Karl
> > >
> > >
> > > On Wed, Oct 21, 2015 at 10:29 AM, Frédéric Olier <FOlier@wooxo.fr> wrote:
> > >
> > > > Hi,
> > > >
> > > > We integrated Solr with ManifoldCF.
> > > > We configured Solr to use the OCR engine.
> > > >
> > > > When we crawl documents, MCF reads the docs fine and submits them
> > > > to Solr.
> > > >
> > > > On large files (PDFs, images), the OCR sometimes takes too long,
> > > > which causes the MCF request to fail.
> > > >
> > > > The annoying thing is that MCF does not ignore the file.
> > > > On the next crawl, the file keeps failing.
> > > >
> > > > How can I tell ManifoldCF to skip a file that fails?
> > > >
> > > > Thanks for your reply.
> > > >
> > > >
> > > >
> > > > *Frédéric OLIER** | Head of Strategic Planning*
> > > >
> > > > * 33 442 016 891 33 662 635 031*
> > > >
> > > > *WOOXO*
> > > > Tel: 0811 140 160
> > > > Fax: 0811 481 507
> > > > Immeuble Le Forum - Bât A - 3ème étage
> > > > 515 av. de la Tramontane
> > > > ZAC Athélia IV
> > > > 13600 LA CIOTAT
> > > > FRANCE
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>