camel-users mailing list archives

From Claus Ibsen <claus.ib...@gmail.com>
Subject Re: correct way to provide regex in TokenizerExpression?
Date Sat, 31 Oct 2015 17:24:49 GMT
So you want to split a file line by line, regardless of which line
terminators the file uses.

Camel uses java.util.Scanner to do the split, with the token you
provide as the delimiter. So if you can get your regex working with a
plain Scanner, then it should be supported.
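
A quick way to check a token outside Camel is to hand it to a plain
java.util.Scanner as the delimiter. The sketch below is only an
illustration: the class and method names (LineSplitSketch, splitLines)
are made up, and it does not go through Camel at all. It also shows why
the order of the alternatives matters. Regex alternation is tried left
to right, so with "\r|\r\n|\n" (the pattern from the quoted workaround)
a Windows "\r\n" is consumed as two separate delimiters and the Scanner
returns an empty token between them. Listing "\r\n" first avoids that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class LineSplitSketch {

    // "\r\n" comes first so a Windows terminator is matched as one delimiter.
    public static final Pattern CRLF_FIRST = Pattern.compile("\r\n|\r|\n");

    // Same alternatives in the order used by the quoted workaround.
    public static final Pattern CR_FIRST = Pattern.compile("\r|\r\n|\n");

    // Splits the text with a Scanner, using the given pattern as the delimiter.
    public static List<String> splitLines(String text, Pattern delimiter) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(delimiter);
            while (scanner.hasNext()) {
                lines.add(scanner.next());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Mixed terminators: Windows, Unix, old Mac.
        System.out.println(splitLines("a\r\nb\nc\rd", CRLF_FIRST)); // prints [a, b, c, d]

        // With "\r" first, "\r\n" yields an extra empty token.
        System.out.println(splitLines("a\r\nb", CR_FIRST).size());  // prints 3
    }
}
```

If the Scanner output looks right here, the same token string with
setRegex(true) is the thing to try in the route.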

As this may be a bit difficult to get right, maybe we need a DSL syntax
that offers an expression to split this nicely, where you can choose the
line terminators as: platform, windows, unix, or both. Or something like
that, and you can set it to both in your use case.
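
To make the idea a bit more concrete, here is a rough sketch of what
such a choice could map to. None of this exists in Camel; the enum
LineTerminator, its constants, and the split helper are hypothetical
names used only for illustration, and "both" is assumed to mean Windows
and Unix terminators.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LineTerminatorSketch {

    /** Hypothetical options mirroring the list above; not a Camel API. */
    public enum LineTerminator {
        PLATFORM(Pattern.quote(System.lineSeparator())), // whatever the JVM's platform uses
        WINDOWS("\r\n"),
        UNIX("\n"),
        BOTH("\r\n|\n"); // Windows first so "\r\n" is matched as one terminator

        private final Pattern pattern;

        LineTerminator(String regex) {
            this.pattern = Pattern.compile(regex);
        }

        public Pattern pattern() {
            return pattern;
        }
    }

    // Splits the text on the chosen kind of line terminator.
    public static List<String> split(String text, LineTerminator terminator) {
        return Arrays.asList(terminator.pattern().split(text));
    }

    public static void main(String[] args) {
        String mixed = "a\r\nb\nc";
        System.out.println(split(mixed, LineTerminator.BOTH));           // prints [a, b, c]
        System.out.println(split(mixed, LineTerminator.WINDOWS).size()); // prints 2
    }
}
```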





On Thu, Oct 29, 2015 at 11:38 PM, furchess123 <constv@hotmail.com> wrote:
> Ok, here's the workaround I have implemented to get past the above issue...
>
> Some MyConstants.java file:
>
>     public static final String SYSTEM_AGNOSTIC_NEWLINE_REGEX = "\r|\r\n|\n";
>
> Splitter route configuration in a RouteBuilder implementation:
>
>         TokenizerExpression tokenizerExpression = new TokenizerExpression();
>         // tokenize by line separators
>         tokenizerExpression.setToken(MyConstants.SYSTEM_AGNOSTIC_NEWLINE_REGEX);
>         // group so many lines into one exchange
>         tokenizerExpression.setGroup(readerConfig.getLinesPerChunk());
>         // indicate that it is a regular expression, not simple string match
>         tokenizerExpression.setRegex(true);
>
>         from(FILE_SPLITTER_ENDPOINT).routeId("fileSplitterRoute").
>             split(tokenizerExpression).
>                 // enable streaming vs. reading all into memory
>                 streaming().
>                 // on/off concurrent processing of multiple chunks
>                 parallelProcessing(readerConfig.isParallelProcessing()).
>                 // stop processing file if system exception occurs
>                 // (handled by onException clause)
>                 stopOnException().
>                 // cleans junk chars inserted by Camel's tokenizer due to bug(?)
>                 bean(new TokenizerCharRemover()).
>                 // unmarshal each chunk to Java (list of String lists)
>                 // using Camel's CSV component
>                 unmarshal().csv().
>                 // hand each unmarshalled list of lines/fields to bean that
>                 // parses and validates line content
>                 bean(csvHandler).
>                 // process codes for import (depending on operational mode
>                 // and errors in exchange)
>                 bean(importProcessor).
>                 // delegate to nested route to update error report
>                 to(AGGREGATE_ERRORS_ENDPOINT).
>             end();
>
>
> TokenizerCharRemover.java:
>
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> import org.apache.camel.Exchange;
> import org.apache.camel.Handler;
>
> public class TokenizerCharRemover
> {
>     /**
>      * Pre-compiled regex pattern to match the instances of the character
>      * sequence of the regular expression inserted by Camel's splitter's
>      * tokenizer between the file lines in the body of the exchange. The
>      * input string that specifies the pattern is treated as a sequence of
>      * literal characters thanks to the {@link Pattern#LITERAL} flag.
>      */
>     private static final Pattern REPLACE_JUNK_PATTERN =
>         Pattern.compile(MyConstants.SYSTEM_AGNOSTIC_NEWLINE_REGEX, Pattern.LITERAL);
>
>     /**
>      * Replaces every instance of the
>      * {@link MyConstants#SYSTEM_AGNOSTIC_NEWLINE_REGEX} character sequence
>      * in the exchange body with a simple '\n' line separator.
>      */
>     @SuppressWarnings("MethodMayBeStatic")
>     @Handler
>     public void cleanupLineSeparators(Exchange exchange)
>     {
>         String body = exchange.getIn().getBody(String.class);
>         String newBody = REPLACE_JUNK_PATTERN.matcher(body)
>             .replaceAll(Matcher.quoteReplacement("\n"));
>         exchange.getIn().setBody(newBody);
>     }
> }
>
> If there is a better solution, or if I have missed some obvious simple way
> to use the tokenizer that does not replace the matching line separators with
> the regex character sequence itself, please let me know! I'd very much
> appreciate that.
>
>
>
> --
> View this message in context: http://camel.465427.n5.nabble.com/correct-way-to-provide-regex-in-TokenizerExpression-tp5773192p5773221.html
> Sent from the Camel - Users mailing list archive at Nabble.com.



-- 
Claus Ibsen
-----------------
http://davsclaus.com @davsclaus
Camel in Action 2nd edition:
https://www.manning.com/books/camel-in-action-second-edition
