lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Wechner <michael.wech...@wyona.com>
Subject Re: Parsing large xml files
Date Fri, 22 May 2009 18:41:51 GMT
crackeur@comcast.net schrieb:
> once you get comfortable with vtd-xml, few people will ever get back to DOM and SAX...

>   

maybe you want to consider to contribute a vtd-xml based parsing 
implementation to Lucene ;-)

Thanks

Michael
> ----- Original Message ----- 
> From: "Sithu D. Sudarsan" <Sithu.Sudarsan@fda.hhs.gov> 
> To: java-user@lucene.apache.org 
> Sent: Friday, May 22, 2009 6:39:33 AM GMT -08:00 US/Canada Pacific 
> Subject: RE: Parsing large xml files 
>
> Thanks everyone for your useful suggestions/links. 
>
> Lucene uses DOM and we tried with SAX. 
>
> XML Pull & vtd-xml as well as Piccolo seem good. 
>
> However, for now, we've broken the file into smaller chunks and then 
> parsing it. 
>
> When we get some time, we'ld like to refactor with the suggested ones. 
>
> Erick: We do use Eclipse. But running from CLI gives the same error! May 
> be there is a way to address the memory issues, but the current idea of 
> breaking into smaller chunks have worked for now... 
>
>
> Sincerely, 
> Sithu D Sudarsan 
>
> -----Original Message----- 
> From: Michael Wechner [mailto:michael.wechner@wyona.com] 
> Sent: Friday, May 22, 2009 4:48 AM 
> To: java-user@lucene.apache.org 
> Subject: Re: Parsing large xml files 
>
> crackeur@comcast.net schrieb: 
>   
>> http://vtd-xml.sf.net 
>>
>>
>> ----- Original Message ----- 
>> From: "Sithu D. Sudarsan" <Sithu.Sudarsan@fda.hhs.gov> 
>> To: java-user@lucene.apache.org 
>> Sent: Thursday, May 21, 2009 7:42:59 AM GMT -08:00 US/Canada Pacific 
>> Subject: Parsing large xml files 
>>
>>
>> Hi, 
>>
>> While trying to parse xml documents of about 50MB size, we run into 
>> OutOfMemoryError due to java heap space. Increasing JVM to use close 
>>     
> 2GB 
>   
>> (that is the max), does not help. Is there any API that could be used 
>>     
> to 
>   
>> handle such large single xml files? 
>>   
>>     
>
> I am not familiar with that particular code of Lucene, but is it 
> possible that Lucene is using DOM for this parsing? 
> If so, one could try to replace it by SAX, and hence get rid of the 
> OutOfMemory issue. 
>
> Cheers 
>
> Michael 
>   
>> If Lucene is not the right place, please let me know alternate places 
>>     
> to 
>   
>> look for, 
>>
>> Thanks in advance, 
>> Sithu D Sudarsan 
>> sithu.sudarsan@fda.hhs.gov 
>> sdsudarsan@ualr.edu 
>>
>>
>>
>>
>>   
>>     
>
>
> --------------------------------------------------------------------- 
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org 
> For additional commands, e-mail: java-user-help@lucene.apache.org 
>
>
> --------------------------------------------------------------------- 
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org 
> For additional commands, e-mail: java-user-help@lucene.apache.org 
>
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message