hadoop-common-user mailing list archives

From Stefan Podkowinski <spo...@gmail.com>
Subject Re: Best practices on splitting an input line?
Date Thu, 12 Feb 2009 09:47:49 GMT
I'm currently using OpenCSV, which can be found at
http://opencsv.sourceforge.net/, but I haven't done any performance
tests on it yet. In my case simply splitting strings would not work
anyway, since I need to handle quotes and separators within quoted
values, e.g. "a","a,b","c".
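
For illustration, a minimal sketch of that kind of quote-aware parsing,
assuming the au.com.bytecode.opencsv.CSVReader API shipped with opencsv
at the time (later releases moved to a different package name):

import java.io.IOException;
import java.io.StringReader;

import au.com.bytecode.opencsv.CSVReader;

public class CsvLineSplit {
    public static void main(String[] args) throws IOException {
        // A line with a quoted value that contains the separator.
        String line = "\"a\",\"a,b\",\"c\"";

        // CSVReader honors the quote character, so the comma inside
        // "a,b" is not treated as a field separator.
        CSVReader reader = new CSVReader(new StringReader(line), ',', '"');
        String[] fields = reader.readNext();   // ["a", "a,b", "c"]
        reader.close();

        for (String field : fields) {
            System.out.println(field);
        }
    }
}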

On Tue, Feb 10, 2009 at 9:18 PM, Andy Sautins
<andy.sautins@returnpath.net> wrote:
>   I have question.  I've dabbled with different ways of tokenizing an
> input file line for processing.  I've noticed in my somewhat limited
> tests that there seem to be some pretty reasonable performance
> differences between different tokenizing methods.  For example, to split
> a line into tokens ( tab delimited in my case ), Scanner roughly seems
> to be the slowest, followed by String.split, with StringTokenizer being
> the fastest.  StringTokenizer, for my application, has the
> unfortunate characteristic of not returning blank tokens ( i.e., parsing
> "a,b,c,,d" would return "a","b","c","d" instead of "a","b","c","","d").
> The WordCount example uses StringTokenizer which makes sense to me,
> except I'm currently getting hung up on not returning blank tokens.  I
> did run across the com.Ostermiller.util StringTokenizer replacement that
> handles null/blank tokens
> (http://ostermiller.org/utils/StringTokenizer.html ) which seems
> possible to use, but it sure seems like someone else has solved this
> problem already better than I have.
>   So, my question is, is there a "best practice" for splitting an input
> line especially when NULL tokens are expected ( i.e., two consecutive
> delimiter characters )?
>   Any thoughts would be appreciated
>   Thanks
>   Andy
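
On the question quoted above, the difference between StringTokenizer and
String.split on consecutive delimiters can be shown with a short
standalone snippet (plain JDK classes only, nothing Hadoop-specific
assumed):

import java.util.Arrays;
import java.util.StringTokenizer;

public class EmptyTokenDemo {
    public static void main(String[] args) {
        String line = "a,b,c,,d";

        // StringTokenizer treats consecutive delimiters as one,
        // so the empty field between "c" and "d" is silently dropped.
        StringTokenizer tok = new StringTokenizer(line, ",");
        while (tok.hasMoreTokens()) {
            System.out.print(tok.nextToken() + " ");   // prints: a b c d
        }
        System.out.println();

        // String.split keeps interior empty tokens; a negative limit
        // also preserves any trailing empty tokens.
        String[] fields = line.split(",", -1);
        System.out.println(Arrays.toString(fields));   // [a, b, c, , d]
    }
}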
