Date: Fri, 22 May 2009 23:34:45 +0530
Message-ID: <2e3c5ba70905221104u50b747e6rc56764d1a15839f4@mail.gmail.com>
Subject: Re: Parsing large xml files
From: prasanna pradhan
To: java-user@lucene.apache.org

We had a similar problem where we had to parse 1 GB XML files. It is better to transform the XML into an array-like structure such as JSON and write a custom search API using Lucene.

On Thu, May 21, 2009 at 8:12 PM, Sudarsan, Sithu D. <Sithu.Sudarsan@fda.hhs.gov> wrote:
>
> Hi,
>
> While trying to parse xml documents of about 50MB size, we run into
> OutOfMemoryError due to java heap space. Increasing JVM to use close 2GB
> (that is the max), does not help. Is there any API that could be used to
> handle such large single xml files?
>
> If Lucene is not the right place, please let me know alternate places to
> look for,
>
> Thanks in advance,
> Sithu D Sudarsan
> sithu.sudarsan@fda.hhs.gov
> sdsudarsan@ualr.edu

--
Thanks,
Prasanna
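
A rough sketch of what "transform to array like json" could look like in practice, under some assumptions that are not stated in this thread: the XML is a long list of repeated, flat record elements, and a streaming parser is acceptable. It uses the JDK's StAX API (javax.xml.stream), which reads the file event by event instead of building the whole tree in memory. The class name XmlRecordStreamer, the method stream, and the recordTag parameter are made up for illustration.

```java
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Lucene or of the original poster's code.
public class XmlRecordStreamer {

    // Reads XML from `in` and calls `sink` once per <recordTag> element,
    // passing that element's child elements as a flat name -> text map.
    // Assumes records are flat: children hold text only, no deeper nesting.
    public static void stream(Reader in, String recordTag,
                              Consumer<Map<String, String>> sink)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or resolve external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        // Deliver adjacent text and CDATA as a single CHARACTERS event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);

        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            Map<String, String> record = null; // non-null while inside a record
            String field = null;               // non-null while inside a child
            StringBuilder text = new StringBuilder();

            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if (record == null) {
                            if (reader.getLocalName().equals(recordTag)) {
                                record = new LinkedHashMap<>();
                            }
                        } else {
                            field = reader.getLocalName();
                            text.setLength(0);
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        if (field != null) {
                            text.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (record == null) {
                            break;
                        }
                        if (field != null) {
                            // End of a child element: store its text.
                            record.put(field, text.toString().trim());
                            field = null;
                        } else if (reader.getLocalName().equals(recordTag)) {
                            // End of the record: hand it off and forget it.
                            sink.accept(record);
                            record = null;
                        }
                        break;
                    default:
                        break;
                }
            }
        } finally {
            reader.close();
        }
    }
}
```

Only one record is held in memory at a time, so heap use should depend on the size of a record rather than the size of the file. Each map handed to the sink could then be written out as a JSON object or turned into a Lucene document for indexing; that step is left out here because it depends on the Lucene version in use.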