lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From crack...@comcast.net
Subject Re: Parsing large xml files
Date Fri, 22 May 2009 18:33:42 GMT
once you get comfortable with vtd-xml, few people will ever get back to DOM and SAX... 
----- Original Message ----- 
From: "Sithu D. Sudarsan" <Sithu.Sudarsan@fda.hhs.gov> 
To: java-user@lucene.apache.org 
Sent: Friday, May 22, 2009 6:39:33 AM GMT -08:00 US/Canada Pacific 
Subject: RE: Parsing large xml files 

Thanks everyone for your useful suggestions/links. 

Lucene uses DOM and we tried with SAX. 

XML Pull & vtd-xml as well as Piccolo seem good. 

However, for now, we've broken the file into smaller chunks and then 
parsing it. 

When we get some time, we'ld like to refactor with the suggested ones. 

Erick: We do use Eclipse. But running from CLI gives the same error! May 
be there is a way to address the memory issues, but the current idea of 
breaking into smaller chunks have worked for now... 


Sincerely, 
Sithu D Sudarsan 

-----Original Message----- 
From: Michael Wechner [mailto:michael.wechner@wyona.com] 
Sent: Friday, May 22, 2009 4:48 AM 
To: java-user@lucene.apache.org 
Subject: Re: Parsing large xml files 

crackeur@comcast.net schrieb: 
> http://vtd-xml.sf.net 
> 
> 
> ----- Original Message ----- 
> From: "Sithu D. Sudarsan" <Sithu.Sudarsan@fda.hhs.gov> 
> To: java-user@lucene.apache.org 
> Sent: Thursday, May 21, 2009 7:42:59 AM GMT -08:00 US/Canada Pacific 
> Subject: Parsing large xml files 
> 
> 
> Hi, 
> 
> While trying to parse xml documents of about 50MB size, we run into 
> OutOfMemoryError due to java heap space. Increasing JVM to use close 
2GB 
> (that is the max), does not help. Is there any API that could be used 
to 
> handle such large single xml files? 
>   

I am not familiar with that particular code of Lucene, but is it 
possible that Lucene is using DOM for this parsing? 
If so, one could try to replace it by SAX, and hence get rid of the 
OutOfMemory issue. 

Cheers 

Michael 
> If Lucene is not the right place, please let me know alternate places 
to 
> look for, 
> 
> Thanks in advance, 
> Sithu D Sudarsan 
> sithu.sudarsan@fda.hhs.gov 
> sdsudarsan@ualr.edu 
> 
> 
> 
> 
>   


--------------------------------------------------------------------- 
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org 
For additional commands, e-mail: java-user-help@lucene.apache.org 


--------------------------------------------------------------------- 
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org 
For additional commands, e-mail: java-user-help@lucene.apache.org 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message