Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAN4YXvfZo4ASeRDeYKYAXjrGTWNUrdkaCjnfsvxM15759Mmjmw@mail.gmail.com>
References: 
 <CA+Y5QKQ9L-70LJTny4HXHtUnRv9ZV1DQWwvKv8aRp9=YTDgUvg@mail.gmail.com>
	<CAN4YXvfZo4ASeRDeYKYAXjrGTWNUrdkaCjnfsvxM15759Mmjmw@mail.gmail.com>
Date: Tue, 30 Aug 2011 20:22:08 +0200
Message-ID: 
 <CA+Y5QKQEMofp1ViNFEPZ1kRyx0Ruc9XQekbDBG9aj+6S4KeBrw@mail.gmail.com>
Subject: Re: Stream still in memory after tika exception? Possible memoryleak?
From: Marc Jacobs <jacobsmv@gmail.com>
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=0015174780feebe0b604abbd1631

--0015174780feebe0b604abbd1631
Content-Type: text/plain; charset=ISO-8859-1

Hi Erick,

I am using Solr 3.3.0, but I see the same problems with 1.4.1.
The connector is a homemade program written in C# that posts via HTTP
remote streaming (e.g.
http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1
).
I'm using Tika to extract the content (it comes with Solr Cell).
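
To make the request shape concrete, here is a minimal sketch of such a
remote-streaming call in Python (the real connector is C#, so this is only
an illustration). The function names build_extract_url and index_file are
hypothetical, the sketch sends a plain GET through the standard library, and
it assumes remote streaming is enabled on the Solr side. Failures are
returned rather than raised so that a batch run can log them and continue.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
from urllib.error import HTTPError

# Endpoint taken from the example URL in this mail.
SOLR_EXTRACT_URL = "http://localhost:8080/solr/update/extract"


def build_extract_url(file_path, doc_id, base_url=SOLR_EXTRACT_URL):
    """Build a remote-streaming URL for a file that is local to the Solr server."""
    params = {"stream.file": file_path, "literal.id": doc_id}
    # Leave "/" unescaped so the result reads like the example URL above.
    return base_url + "?" + urlencode(params, safe="/")


def index_file(file_path, doc_id, timeout=60):
    """Send one extract request; return (ok, detail) instead of raising."""
    url = build_extract_url(file_path, doc_id)
    try:
        with urlopen(url, timeout=timeout) as response:
            return True, str(response.status)
    except HTTPError as err:
        # Solr answered with an error status, e.g. after an extraction failure.
        return False, "HTTP %d" % err.code
    except OSError as err:
        # Connection problems and timeouts (URLError is a subclass of OSError).
        return False, str(err)
```

Calling index_file("/path/to/file.doc", "1") would request the URL shown
above and report whether Solr accepted the document.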

A possible problem is that the file stream needs to be closed after
extraction by the client application, but it seems that something goes
wrong when a Tika exception occurs: the stream never leaves memory. At
least that is my assumption.
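
The pattern I suspect can be sketched as follows. This is not Solr or Tika
code, and I have not confirmed that this is what happens internally; it only
illustrates the assumption. All names (failing_extractor, extract_leaky,
extract_safely) are hypothetical, and an in-memory stream stands in for the
file stream.

```python
import io


def failing_extractor(stream):
    """Hypothetical stand-in for a parser that fails partway through a file."""
    stream.read(4)
    raise ValueError("simulated extraction failure")


def extract_leaky(stream, extractor):
    """Closes the stream only on success; an exception skips the close."""
    text = extractor(stream)
    stream.close()  # never reached when extractor raises
    return text


def extract_safely(stream, extractor):
    """Closes the stream whether or not the extractor raises."""
    try:
        return extractor(stream)
    finally:
        stream.close()


if __name__ == "__main__":
    for extract in (extract_leaky, extract_safely):
        stream = io.BytesIO(b"not a real document")
        try:
            extract(stream, failing_extractor)
        except ValueError:
            pass
        print(extract.__name__, "closed:", stream.closed)
```

If something like the first variant is in play, every failed document would
leave its stream open, which would match memory that grows with the number
of exceptions.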

What is the common way to extract content from office files (pdf, doc, rtf,
xls etc.) and index them? Do people write a content extractor / validator
themselves? Or is it possible to do this with Solr Cell without huge
memory consumption? Please let me know. Thanks in advance.

Marc

2011/8/30 Erick Erickson <erickerickson@gmail.com>

> What version of Solr are you using, and how are you indexing?
> DIH? SolrJ?
>
> I'm guessing you're using Tika, but how?
>
> Best
> Erick
>
> On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs <jacobsmv@gmail.com> wrote:
> > Hi all,
> >
> > Currently I'm testing Solr's indexing performance, but unfortunately I'm
> > running into memory problems.
> > It looks like Solr is not closing the filestream after an exception, but
> > I'm not really sure.
> >
> > The current system I'm using has 150GB of memory, and while I'm indexing
> > the memory consumption keeps growing (eventually to more than 50GB).
> > For the attached graph I indexed about 70k office documents (pdf, doc, xls
> > etc.), and between 1 and 2 percent of them throw an exception.
> > The commits are after 64MB, 60 seconds or after a job (there are 6 evenly
> > divided jobs).
> >
> > After indexing, the memory consumption isn't dropping. Even after an
> > optimize command it's still there.
> > What am I doing wrong? I can't imagine I'm the only one with this
> > problem.
> > Thanks in advance!
> >
> > Kind regards,
> >
> > Marc
> >
>

--0015174780feebe0b604abbd1631--