Date: Tue, 30 Aug 2011 20:22:08 +0200
Subject: Re: Stream still in memory after tika exception? Possible memoryleak?
From: Marc Jacobs
To: solr-user@lucene.apache.org

Hi Erick,

I am using Solr 3.3.0, but I see the same problems with 1.4.1. The connector is a homemade C# program that posts via HTTP remote streaming (i.e. http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1 ). I'm using Tika to extract the content (it comes with Solr Cell).

A possible problem is that the file stream needs to be closed by the client application after extraction, but something seems to go wrong when a Tika exception occurs: the stream never leaves memory. At least that is my assumption.

What is the common way to extract content from office files (pdf, doc, rtf, xls etc.) and index them? Do you write a content extractor / validator yourself? Or is it possible to do this with Solr Cell without huge memory consumption?

Please let me know. Thanks in advance.

Marc

2011/8/30 Erick Erickson

> What version of Solr are you using, and how are you indexing?
> DIH? SolrJ?
>
> I'm guessing you're using Tika, but how?
>
> Best
> Erick
>
> On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs wrote:
> > Hi all,
> >
> > Currently I'm testing Solr's indexing performance, but unfortunately I'm
> > running into memory problems.
> > It looks like Solr is not closing the filestream after an exception, but
> > I'm not really sure.
> >
> > The current system I'm using has 150GB of memory and while I'm indexing
> > the memory consumption is growing and growing (eventually more than 50GB).
> > In the attached graph I indexed about 70k office documents (pdf, doc, xls
> > etc.) and between 1 and 2 percent throw an exception.
> > The commits are after 64MB, 60 seconds or after a job (there are 6 evenly
> > divided jobs).
> >
> > After indexing the memory consumption isn't dropping. Even after an
> > optimize command it's still there.
> > What am I doing wrong? I can't imagine I'm the only one with this problem.
> > Thanks in advance!
> >
> > Kind regards,
> >
> > Marc
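
For reference, the remote-streaming request described in this message can be sketched as follows. This is a minimal illustration in Python, not the C# connector itself: the helper names `build_extract_url` and `post_extract` are hypothetical, and the base URL and file path are the placeholders from the message. It assumes remote streaming is enabled on the Solr side. The sketch only shows the client closing its own HTTP response on both the success and the error path; it says nothing about what Solr or Tika holds in memory on the server.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode


def build_extract_url(solr_base, file_path, doc_id):
    """Build an /update/extract URL using the stream.file and literal.id
    parameters shown in the message. Values are percent-encoded."""
    params = [("stream.file", file_path), ("literal.id", doc_id)]
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def post_extract(url, timeout=60):
    """Send the request and always release the client-side response.

    Returns (ok, detail). An HTTP error status (for example, when
    extraction fails on the server) is reported instead of raised, so a
    batch job can count failures and move on to the next document.
    """
    try:
        # The 'with' block closes the response even if read() fails.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            return True, "HTTP %d" % resp.status
    except urllib.error.HTTPError as err:
        # HTTPError is itself a response object; close it explicitly.
        err.close()
        return False, "HTTP %d" % err.code
```

A connector would call `post_extract(build_extract_url("http://localhost:8080/solr", "/path/to/file.doc", 1))` once per document and log the failures, which makes it easier to see whether memory growth on the server tracks the documents that raised exceptions.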