camel-users mailing list archives

From Claus Ibsen <claus.ib...@gmail.com>
Subject Re: handling large files
Date Wed, 24 Mar 2010 05:26:35 GMT
On Tue, Mar 23, 2010 at 8:24 PM, Justinson <justinson@googlemail.com> wrote:
>
> Unfortunately I'm getting an OutOfMemoryError using XPath splitting the way
> you showed. I'm parsing a file with about 500,000 XML messages.
>

You could pre-process the big file and split it into X smaller files,
perhaps using java.util.Scanner to identify "good places" at which to
split the big file.
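
As a rough illustration of the Scanner idea, here is a minimal sketch that
reads one record at a time by using the closing tag as the delimiter. The
class name, the method name and the <person> tags are assumptions made for
the example (the tag echoes the </person> suggestion quoted further down),
and it assumes the opening tag carries no attributes. The handler could
just as well write each record to its own smaller file.

```java
import java.io.Reader;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Camel.
class ScannerRecordSplitter {
    // Assumed record tags; replace them with the tags of the real messages.
    static final String OPEN_TAG = "<person>";
    static final String CLOSE_TAG = "</person>";

    /**
     * Streams through the input and hands each complete record to the handler.
     * Only the chunk currently being read is held in memory.
     * Returns the number of records found.
     */
    static int forEachRecord(Reader in, Consumer<String> handler) {
        int count = 0;
        try (Scanner scanner = new Scanner(in)) {
            // Split on the literal closing tag; the tag itself is consumed.
            scanner.useDelimiter(Pattern.quote(CLOSE_TAG));
            while (scanner.hasNext()) {
                String chunk = scanner.next();
                int start = chunk.indexOf(OPEN_TAG);
                if (start < 0) {
                    continue; // trailing text, e.g. the closing root tag
                }
                // Drop anything before the record and restore the closing tag.
                handler.accept(chunk.substring(start) + CLOSE_TAG);
                count++;
            }
        }
        return count;
    }
}
```

This is plain text clipping, not XML parsing, so it only works when the
closing tag cannot appear inside comments or CDATA sections.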

Or you could try SAX-based XML parsing when splitting, to reduce
the memory overhead.
Just use a bean for that. Something like this:

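public java.util.Iterator splitBigFile(java.io.File file) {
  // SAX-parse the big file here and return an iterator (or something
  // similar) that can walk the XML messages you are interested in
  throw new UnsupportedOperationException("not implemented yet");
}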

Then use the bean with the Camel Split EIP.
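
To make the iterator idea more concrete, here is a minimal sketch. Note
that it uses the StAX pull parser (javax.xml.stream) rather than SAX,
because pull parsing maps more naturally onto an Iterator; a SAX version
would need extra buffering or a second thread. The class name and the
record element name passed to the constructor are assumptions for the
example.

```java
import java.io.Reader;
import java.io.StringWriter;
import java.util.Iterator;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical splitter: yields each <recordName> element as an XML string.
class StreamingXmlSplitter implements Iterator<String> {
    private final XMLEventReader events;
    private final String recordName;
    private String next; // look-ahead record; null once the input is exhausted

    StreamingXmlSplitter(Reader in, String recordName) {
        try {
            this.events = XMLInputFactory.newInstance().createXMLEventReader(in);
            this.recordName = recordName;
            this.next = readRecord();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Advances to the next record element and copies its events into a string.
    private String readRecord() throws XMLStreamException {
        while (events.hasNext()) {
            XMLEvent event = events.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(recordName)) {
                StringWriter out = new StringWriter();
                XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
                writer.add(event);
                int depth = 1; // nesting depth inside the record element
                while (depth > 0) {
                    XMLEvent inner = events.nextEvent();
                    if (inner.isStartElement()) depth++;
                    else if (inner.isEndElement()) depth--;
                    writer.add(inner);
                }
                writer.flush();
                writer.close();
                return out.toString();
            }
        }
        return null;
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) throw new NoSuchElementException();
        String current = next;
        try {
            next = readRecord();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return current;
    }
}
```

A bean method such as splitBigFile above could open the file and return
new StreamingXmlSplitter(reader, "person"). In the route, something along
the lines of split().method("mySplitter", "splitBigFile").streaming()
should then hand the records over one at a time; the bean name is a
placeholder, and the exact DSL may differ between Camel versions.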


> How can we use Apache Digester instead?
>
>
> Claus Ibsen-2 wrote:
>>
>> Hi
>>
>> This is as far as I got with the XPath expression for splitting:
>> http://svn.apache.org/viewvc?rev=825156&view=rev
>>
>>
>>
>> On Wed, Oct 14, 2009 at 4:40 PM, Claus Ibsen <claus.ibsen@gmail.com>
>> wrote:
>>> On Wed, Oct 14, 2009 at 4:21 PM, Claus Ibsen <claus.ibsen@gmail.com>
>>> wrote:
>>>> Hi
>>>>
>>>> On Wed, Oct 14, 2009 at 4:16 PM, mcarson <mcarson@amsa.com> wrote:
>>>>>
>>>>> It looks like the scanner might provide me with the capabilities I was
>>>>> looking for regarding reading in a file in delimited chunks.  I'm
>>>>> assuming I would implement this as a bean... can the bean component be
>>>>> used as a "from" in a camel route?  I'm new to Camel, and I have never
>>>>> seen that done.  Is there an example bean (that is a consumer of some
>>>>> sort) that I could use to model my code after?
>>>>>
>>>>
>>>> Since you use XPath, I took a dive into how to split big
>>>> files.
>>>> Using InputSource seems to do the trick, as it allows XPath to use SAX
>>>> events, which fits with streaming.
>>>>
>>>> I will work a bit to get it supported nicely out of the box, and provide
>>>> details on how to do it in 2.0.
>>>>
>>>
>>> Ah yeah, the XPath evaluation will still hold at least the entire result in memory.
>>>
>>> You can only get a result of one of the types listed here:
>>> http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/xpath/XPathConstants.html
>>>
>>> And none of them is stream based.
>>>
>>> So even with SAX parsing the big XML file, the XPath expression
>>> evaluation will result in all the data being loaded into memory, or at
>>> least the NodeList which contains all the split entries.
>>>
>>> So maybe Scanner is the better option if you can do some custom clipping. I
>>> believe it's regexp based, so you may be able to find a good regexp that
>>> can split on </person> or something similar.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> Claus Ibsen-2 wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> How do you want to split the file?
>>>>>> Is there a special character that denotes a new "record"?
>>>>>>
>>>>>> Using java.util.Scanner is great as it can do streaming. Camel can
>>>>>> do this as well, for example if you want to split by newline.
>>>>>>
>>>>>> --
>>>>>> Claus Ibsen
>>>>>> Apache Camel Committer
>>>>>>
>>>>>> Open Source Integration: http://fusesource.com
>>>>>> Blog: http://davsclaus.blogspot.com/
>>>>>> Twitter: http://twitter.com/davsclaus
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://www.nabble.com/handling-large-files-tp25826380p25891924.html
>>>>> Sent from the Camel - Users mailing list archive at Nabble.com.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
> --
> View this message in context: http://old.nabble.com/handling-large-files-tp25826380p28005868.html
> Sent from the Camel - Users mailing list archive at Nabble.com.
>
>



-- 
Claus Ibsen
Apache Camel Committer

Author of Camel in Action: http://www.manning.com/ibsen/
Open Source Integration: http://fusesource.com
Blog: http://davsclaus.blogspot.com/
Twitter: http://twitter.com/davsclaus
