cassandra-commits mailing list archives

From " Brian Hess (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9048) Delimited File Bulk Loader
Date Thu, 26 Mar 2015 19:15:53 GMT


 Brian Hess commented on CASSANDRA-9048:

I have created a version of this as a Java program that writes via executeAsync().  Some testing has shown
that for bulk writing to Cassandra, if you are starting with delimited files (not SSTables),
Java's executeAsync() is more efficient than creating SSTables and then calling

This implementation provides for the options above, as well as a way to specify the parallelism
of the asynchronous writing (the number of futures "in flight").  In addition to the Java
implementation, I created a command-line utility a la cassandra-stress called cassandra-loader
to invoke the Java classes with the appropriate CLASSPATH.  As such, I also modified build.xml
and tools/bin/ as appropriate.
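
As an illustration of the "futures in flight" idea, here is a minimal sketch, not the code from the patch. It assumes the asynchronous write can be represented as a function returning a CompletableFuture (a stand-in for whatever executeAsync() returns), and the names BoundedAsyncWriter and writeAll are hypothetical. A semaphore caps the number of outstanding writes, playing the role of -numFutures.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper, not part of the attached patch.
final class BoundedAsyncWriter {
    /**
     * Submits every row through asyncWrite while keeping at most maxInFlight
     * writes outstanding. Returns the number of writes that completed without error.
     */
    static <T> int writeAll(Iterable<T> rows, int maxInFlight,
                            Function<T, CompletableFuture<Void>> asyncWrite) {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger succeeded = new AtomicInteger();
        for (T row : rows) {
            // Blocks the reader once maxInFlight writes are pending.
            permits.acquireUninterruptibly();
            CompletableFuture<Void> pending;
            try {
                pending = asyncWrite.apply(row);
            } catch (RuntimeException e) {
                permits.release();   // submission itself failed; free the slot
                continue;
            }
            pending.whenComplete((ignored, error) -> {
                if (error == null) {
                    succeeded.incrementAndGet();
                }
                permits.release();   // free the slot whether the write succeeded or failed
            });
        }
        // Wait for the tail: all permits are available only when nothing is pending.
        permits.acquireUninterruptibly(maxInFlight);
        permits.release(maxInFlight);
        return succeeded.get();
    }
}
```

A caller would pass the parsed rows, the desired parallelism, and a function that issues one asynchronous write per row. A larger limit keeps more requests outstanding, at the cost of more pressure on the cluster.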

The patch is attached for review.

The command-line usage statement is:

{{Usage: -f <filename> -host <ipaddress> -schema <schema> [OPTIONS]
  -delim <delimiter>             Delimiter to use [,]
  -delmInQuotes true             Set to 'true' if delimiter can be inside quoted fields [false]
  -dateFormat <dateFormatString> Date format [default for Locale.ENGLISH]
  -nullString <nullString>       String that signifies NULL [none]
  -skipRows <skipRows>           Number of rows to skip [0]
  -maxRows <maxRows>             Maximum number of rows to read (-1 means all) [-1]
  -maxErrors <maxErrors>         Maximum errors to endure [10]
  -badFile <badFilename>         Filename for where to place badly parsed rows. [none]
  -port <portNumber>             CQL Port Number [9042]
  -numFutures <numFutures>       Number of CQL futures to keep in flight [1000]
  -decimalDelim <decimalDelim>   Decimal delimiter [.] Other option is ','
  -boolStyle <boolStyleString>   Style for booleans [TRUE_FALSE] }}
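
To make a few of these options concrete, the following is a rough sketch of how -delim, -delmInQuotes, -nullString, -decimalDelim and -boolStyle could be applied to a single input line. It is not taken from the patch: the class DelimitedFieldParser and its methods are hypothetical, the "1_0" boolean style is an assumed example alongside the documented TRUE_FALSE default, and escaped quotes inside quoted fields are not handled.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the attached patch.
final class DelimitedFieldParser {
    /** Splits on delim; if delimInQuotes is true, delimiters inside double quotes are kept. */
    static List<String> split(String line, char delim, boolean delimInQuotes) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (delimInQuotes && c == '"') {
                inQuotes = !inQuotes;            // the quote characters themselves are dropped
            } else if (c == delim && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /** Returns null when the raw field equals the configured null string. */
    static String nullable(String raw, String nullString) {
        return (nullString != null && nullString.equals(raw)) ? null : raw;
    }

    /** Parses a decimal whose decimal delimiter is '.' or ','. */
    static BigDecimal parseDecimal(String raw, char decimalDelim) {
        return new BigDecimal(decimalDelim == '.' ? raw : raw.replace(decimalDelim, '.'));
    }

    /** Parses a boolean given a style of the form TRUEVALUE_FALSEVALUE, e.g. TRUE_FALSE. */
    static boolean parseBoolean(String raw, String boolStyle) {
        String[] tokens = boolStyle.split("_", 2);
        if (tokens.length != 2) {
            throw new IllegalArgumentException("Unsupported boolean style: " + boolStyle);
        }
        if (raw.equalsIgnoreCase(tokens[0])) return true;
        if (raw.equalsIgnoreCase(tokens[1])) return false;
        throw new IllegalArgumentException("Not a boolean in style " + boolStyle + ": " + raw);
    }
}
```

A row that fails any of these conversions would count toward -maxErrors and, if -badFile is set, be written to that file.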

> Delimited File Bulk Loader
> --------------------------
>                 Key: CASSANDRA-9048
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter:  Brian Hess
>         Attachments: CASSANDRA-9048.patch
> There is a strong need for bulk loading data from delimited files into Cassandra.  Starting
with delimited files means that the data is not currently in the SSTable format, and therefore
cannot immediately leverage Cassandra's bulk loading tool, sstableloader, directly.
> Source data is in delimited files far more often than in the SSTable format, so a tool
that loads directly from delimited files is very useful.
> In order for this bulk loader to be more generally useful to customers, it should handle
a number of options at a minimum:
> - support specifying the input file or reading the data from stdin (so other command-line
programs can pipe into the loader)
> - supply the CQL schema for the input data
> - support all data types other than collections (collections are a stretch goal/need)
> - an option to specify the delimiter
> - an option to specify comma as the decimal delimiter (for international use cases)
> - an option to specify how NULL values are specified in the file (e.g., the empty string
or the string NULL)
> - an option to specify how BOOLEAN values are specified in the file (e.g., TRUE/FALSE
or 0/1)
> - an option to specify the Date and Time format
> - an option to skip some number of rows at the beginning of the file
> - an option to only read in some number of rows from the file
> - an option to indicate how many parse errors to tolerate
> - an option to specify a file that will contain all the lines that did not parse correctly
(up to the maximum number of parse errors)
> - an option to specify the CQL port to connect to (with 9042 as the default).
> Additional options would be useful, but this set of options/features is a start.
> A word on COPY.  COPY comes via CQLSH which requires the client to be the same version
as the server (e.g., 2.0 CQLSH does not work with 2.1 Cassandra, etc).  This tool should be
able to connect to any version of Cassandra (within reason).  For example, it should be able
to handle 2.0.x and 2.1.x.  Moreover, CQLSH's COPY command does not support a number of the
options above.  Lastly, the performance of COPY in 2.0.x is not high enough to be considered
a bulk ingest tool.
