From: "Claus Ibsen"
To: camel-user@activemq.apache.org
Subject: RE: [SPAM] RE: Splitter for big files
Date: Wed, 3 Sep 2008 16:04:47 +0200
Message-ID: <4C1FB9C00D24A140906239533638C4D20536FE49@EXVS04.exserver.dk>
In-Reply-To: <19289425.post@talk.nabble.com>
Mailing-List: camel-user@activemq.apache.org (run by ezmlm)
Hi

With or without these improvements, the transaction issue is still the same. The patches just improve the memory usage so the entire file is not loaded into memory before splitting.

The transactional issue should be handled by external transaction managers such as Spring, JTA in a J2EE container, or others. Notice this usually only works with JMS and JDBC.

So if you, for instance, want to read a big file, split it into lines, process each line, and store each line in a database, then you could put the exchanges on a JMS queue before they are stored in the database, to ensure a safe point. JMS can then redeliver until the database is updated.

from(file).split().to(jms);
from(jms).process().to(jdbc);

Med venlig hilsen

Claus Ibsen
......................................
Silverbullet
Skovsgårdsvænget 21
8362 Hørning
Tlf. +45 2962 7576
Web: www.silverbullet.dk

-----Original Message-----
From: cmoulliard [mailto:cmoulliard@gmail.com]
Sent: 3. september 2008 15:41
To: camel-user@activemq.apache.org
Subject: [SPAM] RE: Splitter for big files

If we implement what the different stakeholders propose, can we guarantee that, in case a problem occurs during the parsing of the file, a rollback of the messages created (by the batching or the tokenisation) will be done?

Kind regards,

Claus Ibsen wrote:
>
> Hi
>
> I have created 2 tickets to track this:
> CAMEL-875, CAMEL-876
>
> Med venlig hilsen
>
> Claus Ibsen
> ......................................
> Silverbullet
> Skovsgårdsvænget 21
> 8362 Hørning
> Tlf. +45 2962 7576
> Web: www.silverbullet.dk
>
> -----Original Message-----
> From: Claus Ibsen [mailto:ci@silverbullet.dk]
> Sent: 2. september 2008 21:44
> To: camel-user@activemq.apache.org
> Subject: RE: Splitter for big files
>
> Ah of course, well spotted. The tokenize is the memory hog. Good idea with
> the java.util.Scanner.
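The file → JMS → JDBC safe-point pattern in the routes above can be illustrated outside Camel. Below is a minimal plain-Java sketch of the idea (all class and method names here are made up for illustration; a real setup would rely on a JMS broker's redelivery and a transaction manager such as Spring or JTA, as discussed in the thread):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Plain-Java sketch of the "safe point" idea: lines go onto a queue
// first, and the consumer only removes a line from the queue after the
// store step succeeds, so a failure triggers redelivery instead of loss.
// None of these names are Camel API; this is an analogy only.
public class SafePointSketch {

    /** Stand-in for the JDBC store step; may fail and be retried. */
    public interface Store {
        void store(String line) throws Exception;
    }

    public static List<String> consumeWithRedelivery(Queue<String> queue, Store store, int maxAttempts) {
        List<String> stored = new ArrayList<>();
        while (!queue.isEmpty()) {
            String line = queue.peek();            // leave on queue until stored
            for (int attempt = 1; ; attempt++) {
                try {
                    store.store(line);
                    queue.poll();                  // "ack": remove only after success
                    stored.add(line);
                    break;
                } catch (Exception e) {
                    if (attempt >= maxAttempts) {
                        throw new RuntimeException("gave up on: " + line, e);
                    }
                    // otherwise: redeliver, i.e. loop and try the same line again
                }
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>(List.of("row1", "row2"));
        int[] calls = {0};
        // a store that fails on the very first call, then succeeds
        Store flaky = line -> { if (calls[0]++ == 0) throw new Exception("db down"); };
        List<String> stored = consumeWithRedelivery(queue, flaky, 5);
        System.out.println(stored); // [row1, row2]
    }
}
```

The point of the sketch is the ordering: the line is removed from the queue only after the store succeeds, which is exactly what putting a JMS queue between the splitter and the database buys you.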
>
> So combined with the batch stuff we should be able to operate on really
> big files without consuming too much memory ;)
>
> Med venlig hilsen
>
> Claus Ibsen
> ......................................
> Silverbullet
> Skovsgårdsvænget 21
> 8362 Hørning
> Tlf. +45 2962 7576
> Web: www.silverbullet.dk
>
> -----Original Message-----
> From: Gert Vanthienen [mailto:gert.vanthienen@skynet.be]
> Sent: 2. september 2008 21:28
> To: camel-user@activemq.apache.org
> Subject: Re: Splitter for big files
>
> L.S.,
>
> Just added my pair of eyes ;). One part of the problem is indeed the
> list of exchanges that is returned by the expression, but I think you're
> also reading the entire file into memory a first time for tokenizing
> it. ExpressionBuilder.tokenizeExpression() converts the type to String
> and then uses a StringTokenizer on that. I think we could add support
> there for tokenizing File, InputStreams and Readers directly using a
> Scanner.
>
> Regards,
>
> Gert
>
> Claus Ibsen wrote:
>> Hi
>>
>> Looking into the source code of the splitter, it looks like it creates the
>> list of split exchanges before they are processed. That is why
>> it consumes so much memory for big files.
>>
>> Maybe some kind of batch size option is needed, so you can for instance
>> set 20 as the batch size:
>>
>> .splitter(body(InputStream.class).tokenize("\r\n").batchSize(20))
>>
>> Could you create a JIRA ticket for this improvement?
>> Btw, how big are the files you use?
>>
>> The file component uses a File as the object.
>> So when you split using the input stream, Camel should use the type
>> converter from File -> InputStream, which doesn't read the entire content
>> into memory. That happens in the splitter, where it creates the entire
>> list of new exchanges to fire.
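Gert's suggestion above — tokenizing an InputStream directly with java.util.Scanner instead of converting the whole payload to a String first — can be sketched as a standalone JDK example. This is not the actual ExpressionBuilder patch (CAMEL-875/876 track that); it only shows the Scanner mechanics:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Tokenize an InputStream with java.util.Scanner, pulling one token at a
// time from the stream instead of first materializing the whole content
// as a String (which is what a StringTokenizer-based approach forces).
public class ScannerTokenize {

    public static List<String> tokenize(InputStream in, String delimiter) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(in, "UTF-8").useDelimiter(delimiter)) {
            while (scanner.hasNext()) {
                // In a real splitter you would hand each token off for
                // processing here rather than collect them into a list,
                // so memory use stays constant regardless of file size.
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream("row1\r\nrow2\r\nrow3".getBytes());
        System.out.println(tokenize(in, "\r\n")); // [row1, row2, row3]
    }
}
```

Collecting into a List here is only for demonstration; the streaming benefit comes from processing each `scanner.next()` result immediately.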
>>
>> At least that is what I can read from the source code after a long day's
>> work, so please read the code too, as 4 eyes are better than 2 ;)
>>
>>
>>
>> Med venlig hilsen
>>
>> Claus Ibsen
>> ......................................
>> Silverbullet
>> Skovsgårdsvænget 21
>> 8362 Hørning
>> Tlf. +45 2962 7576
>> Web: www.silverbullet.dk
>>
>> -----Original Message-----
>> From: Bart Frackiewicz [mailto:bart@open-medium.com]
>> Sent: 2. september 2008 17:40
>> To: camel-user@activemq.apache.org
>> Subject: Splitter for big files
>>
>> Hi,
>>
>> I am using this route for a couple of CSV file routes:
>>
>> from("file:/tmp/input/?delete=true")
>>     .splitter(body(InputStream.class).tokenize("\r\n"))
>>     .beanRef("myBean", "process")
>>     .to("file:/tmp/output/?append=true")
>>
>> This works fine for small CSV files, but for big files I noticed
>> that Camel uses a lot of memory; it seems that Camel is reading
>> the whole file into memory. What is the configuration to use a stream
>> in the splitter?
>>
>> I noticed the same behaviour in the XPath splitter:
>>
>> from("file:/tmp/input/?delete=true")
>>     .splitter(ns.xpath("//member"))
>>     ...
>>
>> BTW, I found a posting from March, where James suggests the following
>> implementation for a custom splitter:
>>
>> -- quote --
>>
>> from("file:///c:/temp?noop=true").
>>     splitter().method("myBean", "split").
>>     to("activemq:someQueue")
>>
>> Then register "myBean" with a split method...
>>
>> class SomeBean {
>>     public Iterator split(File file) {
>>         // figure out how to split this file into rows...
>>     }
>> }
>> -- quote --
>>
>> But this won't work for me (Camel 1.4).
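The SomeBean sketch quoted above leaves the split method unimplemented. One way to fill it in lazily, using only the JDK, is a BufferedReader-backed iterator that reads one row at a time (this body is an illustrative guess; whether Camel 1.4 accepts such a bean splitter is exactly Bart's open question):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// A lazy, line-at-a-time Iterator over a file, so the splitter never
// holds more than one row in memory. The class and method names follow
// the quoted "SomeBean" sketch; the implementation is hypothetical.
public class SomeBean {

    public Iterator<String> split(File file) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(file));
        return new Iterator<String>() {
            String next = reader.readLine(); // read-ahead of one line

            public boolean hasNext() {
                return next != null;
            }

            public String next() {
                if (next == null) throw new NoSuchElementException();
                String current = next;
                try {
                    next = reader.readLine();
                    if (next == null) reader.close(); // done: release the file handle
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return current;
            }
        };
    }
}
```

The one-line read-ahead lets hasNext() answer without consuming input, and the reader is closed as soon as the last line has been handed out.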
>>
>> Bart
>
>
-----
Enterprise Architect

Xpectis
12, route d'Esch
L-1470 Luxembourg

Phone: +352 25 10 70 470
Mobile: +352 621 45 36 22
e-mail: cmoulliard@xpectis.com
web site: www.xpectis.com
My Blog: http://cmoulliard.blogspot.com/

--
View this message in context: http://www.nabble.com/Splitter-for-big-files-tp19272583s22882p19289425.html
Sent from the Camel - Users mailing list archive at Nabble.com.
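The batchSize option proposed in the thread (tracked as CAMEL-875/876) amounts to grouping a lazy stream of rows into fixed-size chunks, so that only one batch of exchanges is materialized at a time. A plain-JDK sketch of that grouping (a hypothetical illustration, not the Camel implementation):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Wraps a lazy row iterator so callers receive fixed-size batches.
// Only one batch exists in memory at a time, which is the point of the
// proposed batchSize option: never build the full list of exchanges.
public class BatchingIterator<T> implements Iterator<List<T>> {

    private final Iterator<T> source;
    private final int batchSize;

    public BatchingIterator(Iterator<T> source, int batchSize) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        this.source = source;
        this.batchSize = batchSize;
    }

    public boolean hasNext() {
        return source.hasNext();
    }

    public List<T> next() {
        if (!source.hasNext()) throw new NoSuchElementException();
        List<T> batch = new ArrayList<>(batchSize);
        while (source.hasNext() && batch.size() < batchSize) {
            batch.add(source.next());
        }
        return batch;
    }

    public static void main(String[] args) {
        Iterator<List<Integer>> batches =
                new BatchingIterator<>(List.of(1, 2, 3, 4, 5).iterator(), 2);
        while (batches.hasNext()) {
            System.out.println(batches.next()); // prints [1, 2] then [3, 4] then [5]
        }
    }
}
```

Combined with a lazy per-line iterator (such as a Scanner over the file's InputStream), this keeps peak memory proportional to the batch size rather than the file size.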