Date: Thu, 1 Feb 2018 15:46:00 +0000 (UTC)
From: "Kirill Shirokov (JIRA)"
To: issues@ignite.apache.org
Subject: [jira] [Comment Edited] (IGNITE-6917) SQL: implement COPY command for efficient data loading

    [ https://issues.apache.org/jira/browse/IGNITE-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16348592#comment-16348592 ]

Kirill Shirokov edited comment on IGNITE-6917 at 2/1/18 3:45 PM:
-----------------------------------------------------------------

> 20. SqlParserCopySelfTest - no single test for positive cases, only negative stuff. We need to cover positive cases extensively as well.
A basic set of positive tests is covered in JdbcThinBulkLoadAbstractSelfTest. Do you think we need to duplicate some of them here?

> 21. JdbcBulkLoadContext - probably we don't need this class and it would be better to keep update counter in BulkLoadContext as the notion of counter is used not just in JDBC driver.

FIXED (removed).

> 22. JdbcThinStatement -
> 22.1 let's catch and wrap any exception in sendFile

FIXED. Ok.

> 22.2 Odd empty line in the beginning of sendFile

FIXED.

> 23. CsvLineProcessorBlock - we don't handle cases when quoted strings in CSV line contain commas, do we? This clearly does not seem right and I don't see any tests for that either. Shouldn't we handle escaped quotes too?

Yes, Vladimir asked to make a very simple implementation for now, so my plan was to implement this properly in IGNITE-7537 (see the sketch at the end of this comment).

> 24. All *PipelineBlock classes - odd empty line after class header.

FIXED

> 25. StrListAppenderBlock - do we need list of lists here? Probably just list of arrays could do?

Afterwards each record is processed (via dataConverter) in UpdatePlan.processRow(), which requires a List. Converting it to an array would require us to change the UpdatePlan implementation.

> 26. BulkLoadContextCursor - it's unclear to me why we return context in fake "row" while all other parameters should be retrieved via getters. Can we unify this? (E.g. add getters for everything AND return everything in fake row - that would be consistent.)

> 27. Please add more tests for behavior in case of failure - server or client disconnect during file processing, file open error on client, etc.

Vladimir said that we can postpone such tests. Filed IGNITE-7602.

> I think this is it for now. Looking forward to the fixes. Thanks a lot!
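
For context, here is a minimal illustrative sketch of the RFC 4180-style field splitting (commas inside quotes, doubled-quote escapes) that the follow-up in IGNITE-7537 would need. This is not the current CsvLineProcessorBlock code, just a sketch of the technique:

{noformat}
// Illustrative only: split one CSV line, honouring commas inside quotes and "" escapes.
static java.util.List<String> splitCsvLine(String line) {
    java.util.List<String> fields = new java.util.ArrayList<>();
    StringBuilder cur = new StringBuilder();
    boolean inQuotes = false;

    for (int i = 0; i < line.length(); i++) {
        char c = line.charAt(i);

        if (inQuotes) {
            if (c == '"') {
                if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    cur.append('"'); // "" inside a quoted field is an escaped quote.
                    i++;
                }
                else
                    inQuotes = false; // Closing quote.
            }
            else
                cur.append(c); // Commas inside quotes are ordinary characters.
        }
        else if (c == '"')
            inQuotes = true;
        else if (c == ',') {
            fields.add(cur.toString());
            cur.setLength(0);
        }
        else
            cur.append(c);
    }

    fields.add(cur.toString());

    return fields;
}
{noformat}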

> SQL: implement COPY command for efficient data loading
> ------------------------------------------------------
>
>                 Key: IGNITE-6917
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6917
>             Project: Ignite
>          Issue Type: New Feature
>          Components: sql
>            Reporter: Vladimir Ozerov
>            Assignee: Kirill Shirokov
>            Priority: Major
>              Labels: iep-1
>
> Inspired by Postgres [1]
> Common use case - bulk data load through JDBC/ODBC interface. Currently it is only possible to execute single commands one by one. We already can batch them to improve performance, but there is still big room for improvement.
> We should think of a completely new command - {{COPY}}. It will accept a file (or an input stream in the general case) on the client side, then transfer data to the cluster, and then execute the update inside the cluster, e.g. through a streamer.
> First of all we need to create a quick and dirty prototype to assess the potential performance improvement. If the speedup is confirmed, we should build a base implementation which will accept only files. But at the same time we should understand how it will evolve in future: multiple file formats (probably including Hadoop formats, e.g. Parquet), escape characters, input streams, etc.
> [1] [https://www.postgresql.org/docs/9.6/static/sql-copy.html]
> h1. Proposed syntax
> Current implementation:
> {noformat}
> COPY
>     FROM "file.name"
>     INTO <schema>.<table>
>     [(col-name, ...)]
>     FORMAT <format-name>   -- Only CSV format is supported in the current release
>     [BATCH_SIZE <batch-size-in-bytes>]
> {noformat}
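> For illustration, a minimal sketch of how a client could drive the current form of the command through the JDBC thin driver. The host, table, column and file names below are made up; only the CSV format is assumed:
> {noformat}
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.Statement;
>
> public class CopyExample {
>     public static void main(String[] args) throws Exception {
>         // Connect through the JDBC thin driver (address is an example).
>         try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
>              Statement stmt = conn.createStatement()) {
>             // The file is opened on the client side; its contents are sent
>             // to the cluster in batches and inserted there.
>             int rows = stmt.executeUpdate(
>                 "COPY FROM \"people.csv\" INTO PUBLIC.PERSON (ID, NAME, AGE) FORMAT CSV");
>
>             System.out.println("Loaded " + rows + " rows");
>         }
>     }
> }
> {noformat}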
> We may want to gradually add features to this command in future to have something like this:
> {noformat}
> COPY
>     FROM "file.name" [CHARSET "<charset-name>"]
>     INTO <schema>.<table>
>     [CREATE [IF NOT EXISTS]]
>     [(col-name [<data-type>] [NULLABLE] [ESCAPES], ...) [MATCH HEADER]]
>     FORMAT (csv|tsv|...)
>     -- CSV format options:
>     [FIELDSEP='column-separators-regexp']
>     [LINESEP='row-separators-regexp']
>     [QUOTE='quote-chars']
>     [ESCAPE='escape-char']
>     [NULL='null-sequence']
>     [COMMENT='single-line-comment-start-char']
>     [TRIM_LINES]
>     [IMPORT_EMPTY_LINES]
>     [CHARSET "<charset-name>"]
>     [ROWS <first-row>-<last-row>]
>     --or--
>     [SKIP ROWS <row-count>] [MAX ROWS <row-count>]
>     [COLS <first-col>-<last-col>]
>     --or--
>     [SKIP COLS <col-count>] [MAX COLS <col-count>]
>     [(MATCH | SKIP) HEADER]
>     [(REPLACE|IGNORE|ABORT ON [<max-error-count>]) DUPLICATE KEYS]
>     [BATCH SIZE (<n> ROWS | <n>[K|M|G|T|P])]
>     [COMPRESS "codec-name" [codec options]]
>     [LOCK (TABLE|ROWS)]
>     [NOLOGGING]
>     [BACKEND (DIRECT | STREAMER)]
> {noformat}
> h1. Implementation decisions and notes
> h2. Parsing
> * We support the CSV format described in RFC 4180.
> * Custom row and column separators and quoting characters are currently hardcoded
> * Escape sequences and line comment characters are currently not supported
> * We may want to support fixed-length formats (via format descriptors) in future
> * We may want to strip comments from lines (for example, starting with '#')
> * We may want to allow the user to either ignore empty lines or treat them as a special case of a record having all default values
> * We may allow the user to enable whitespace trimming from the beginning and end of a line
> * We may want to allow the user to specify an error handling strategy: e.g., only one quote character is present or an escape sequence is invalid.
> h2. File handling
> * File character set to be supported in future
> * Skipped/imported row number (or first/last line or skip header option), skipped/imported column number (or first/last column): to be supported in future
> * Line start pattern (as in MySQL): no support planned
> * We currently support only client-side import. No server-side file import.
> * We may want to support client-side stdin import in future.
> * We do not handle importing multiple files from a single command
> * We don't benefit from any kind of pre-sorting/pre-partitioning of data on the client side.
> * We don't include any metadata, such as line number, from the client side.
> h3. Transferring data
> * We send file data via batches. In future we will support batch size (specified as rows per batch or data block size per batch).
> * We may want to implement data compression in future.
> * We connect to a single node in the JDBC driver (no multi-node connections).
> h3. Cache/tables/column handling
> * We don't create the table in the bulk load command
> * We may want to have an option for reading a header row, which contains column names to match columns against
> * In future we may wish to support COLUMNS (col1, _, col2, _, col3) syntax, where the '_' marker means a skipped column (MySQL uses '@dummy' for this)
> h3. Data types
> * Data types are converted as if they were supplied to the INSERT SQL command.
> * We may want type conversion (automatic, custom using an SQL function, custom via Java code, string auto-trimming) in future.
> * We will support an optional null sequence ("\N") later
> * We may want to allow the user to specify what to do if the same record exists (e.g., ignore the record, replace it, report an error with a max. error counter before failing the command)
> * We don't currently support any generated/autoincremented row IDs or any custom generators.
> * We don't support any filtering/conditional expressions
> * We don't support any files/recordsets/tables with multiple conversion errors generated during import.
> h3. Backend / Transactional / MVCC / other
> * We may want an option to select how we insert the data into the cache: using cache.putAll(...), for example, or via the data streamer interface (see the BACKEND option and the sketch after this list)
> * We don't use transactions
> * We don't create locks on rows or tables.
> * We don't try to minimize any indexing overhead (it's up to the user)
> * We may want to minimize WAL impact in future via the NOLOGGING option.
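> For illustration, a rough sketch of the two insertion paths the BACKEND option would select between. The cache name and types are made up; this is not the actual loader code, just the shape of the two APIs:
> {noformat}
> import java.util.Map;
> import org.apache.ignite.Ignite;
> import org.apache.ignite.IgniteCache;
> import org.apache.ignite.IgniteDataStreamer;
>
> class BulkLoadBackends {
>     /** DIRECT: plain cache putAll() of a parsed batch. */
>     static void loadDirect(Ignite ignite, Map<Long, Object> batch) {
>         IgniteCache<Long, Object> cache = ignite.cache("personCache");
>
>         cache.putAll(batch);
>     }
>
>     /** STREAMER: IgniteDataStreamer buffers entries and routes them to primary nodes. */
>     static void loadViaStreamer(Ignite ignite, Map<Long, Object> batch) {
>         try (IgniteDataStreamer<Long, Object> streamer = ignite.dataStreamer("personCache")) {
>             streamer.addData(batch);
>         } // close() flushes any remaining buffered data.
>     }
> }
> {noformat}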
> h3. Miscellanea
> * We don't supply a utility to load data
> * We don't currently supply any Java loaders (as in PG and MSSQL) that stream data (not necessarily from a file)
> * Security-related questions are out of scope of this JIRA
> * We don't have triggers and constraints in Apache Ignite
> h1. Implementations from other vendors
> h2. PostgreSQL
> {noformat}
> COPY table_name [ ( column_name [, ...] ) ]
>     FROM { 'filename' | STDIN }
>     [ [ WITH ]
>           [ BINARY ]
>           [ OIDS ]
>           [ DELIMITER [ AS ] 'delimiter' ]
>           [ NULL [ AS ] 'null string' ]
>           [ CSV [ HEADER ]
>                 [ QUOTE [ AS ] 'quote' ]
>                 [ ESCAPE [ AS ] 'escape' ]
>                 [ FORCE NOT NULL column_name [, ...] ] ] ]
> {noformat}
> ([https://www.postgresql.org/docs/9.2/static/sql-copy.html])
> h3. Notes
> * Server-side file import
> * Client-side: only from STDIN
> * Protocol implementation: via a special command in the protocol
> * Special bulk data loaders are implemented as part of the JDBC driver package: org.postgresql.copy.CopyManager
> ([https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/copy/CopyManager.html])
> * Custom loaders available (e.g., [https://github.com/bytefish/PgBulkInsert.git])
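> A minimal client-side usage sketch of that loader (connection URL, table and file name are made up; the API is the standard pgjdbc CopyManager):
> {noformat}
> import java.io.FileReader;
> import java.sql.Connection;
> import java.sql.DriverManager;
> import org.postgresql.PGConnection;
> import org.postgresql.copy.CopyManager;
>
> public class PgCopyExample {
>     public static void main(String[] args) throws Exception {
>         try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/test", "user", "secret");
>              FileReader reader = new FileReader("people.csv")) {
>             CopyManager copyMgr = conn.unwrap(PGConnection.class).getCopyAPI();
>
>             // Streams the client-side file to the server-side table over the COPY sub-protocol.
>             long rows = copyMgr.copyIn("COPY person FROM STDIN WITH CSV", reader);
>
>             System.out.println("Copied " + rows + " rows");
>         }
>     }
> }
> {noformat}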
> h2. MySQL
> {noformat}
> LOAD DATA [LOW_PRIORITY | CONCURRENT] [LOCAL] INFILE 'file_name'
>     [REPLACE | IGNORE]
>     INTO TABLE tbl_name
>     [PARTITION (partition_name [, partition_name] ...)]
>     [CHARACTER SET charset_name]
>     [{FIELDS | COLUMNS}
>         [TERMINATED BY 'string']
>         [[OPTIONALLY] ENCLOSED BY 'char']
>         [ESCAPED BY 'char']
>     ]
>     [LINES
>         [STARTING BY 'string']
>         [TERMINATED BY 'string']
>     ]
>     [IGNORE number {LINES | ROWS}]
>     [(col_name_or_user_var
>         [, col_name_or_user_var] ...)]
>     [SET col_name={expr | DEFAULT}
>         [, col_name={expr | DEFAULT}] ...]
> {noformat}
> ([https://dev.mysql.com/doc/refman/5.7/en/load-data.html])
> h3. Notes
> * Both client- and server-side import
> * Protocol implementation via a hack: if a result set is returned with column count == -1, read the file name from the server and send the file immediately.
> h2. Microsoft SQL Server
> {noformat}
> BULK INSERT
>     [ database_name . [ schema_name ] . | schema_name . ] [ table_name | view_name ]
>     FROM 'data_file'
>     [ WITH
>         (
>         [ [ , ] BATCHSIZE = batch_size ]
>         [ [ , ] CHECK_CONSTRAINTS ]
>         [ [ , ] CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
>         [ [ , ] DATAFILETYPE =
>             { 'char' | 'native' | 'widechar' | 'widenative' } ]
>         [ [ , ] DATASOURCE = 'data_source_name' ]
>         [ [ , ] ERRORFILE = 'file_name' ]
>         [ [ , ] ERRORFILE_DATASOURCE = 'data_source_name' ]
>         [ [ , ] FIRSTROW = first_row ]
>         [ [ , ] FIRE_TRIGGERS ]
>         [ [ , ] FORMATFILE_DATASOURCE = 'data_source_name' ]
>         [ [ , ] KEEPIDENTITY ]
>         [ [ , ] KEEPNULLS ]
>         [ [ , ] KILOBYTES_PER_BATCH = kilobytes_per_batch ]
>         [ [ , ] LASTROW = last_row ]
>         [ [ , ] MAXERRORS = max_errors ]
>         [ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ]
>         [ [ , ] ROWS_PER_BATCH = rows_per_batch ]
>         [ [ , ] ROWTERMINATOR = 'row_terminator' ]
>         [ [ , ] TABLOCK ]
>         -- input file format options
>         [ [ , ] FORMAT = 'CSV' ]
>         [ [ , ] FIELDQUOTE = 'quote_characters']
>         [ [ , ] FORMATFILE = 'format_file_path' ]
>         [ [ , ] FIELDTERMINATOR = 'field_terminator' ]
>         [ [ , ] ROWTERMINATOR = 'row_terminator' ]
>         )]
> {noformat}
> ([https://docs.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql])
> h3. Notes
> * Server-side import
> * CLI utility to import from the client side
> * Protocol implementation: special packet types for column definition and row
> * Custom bulk data loader supplied in the JDBC driver package: com.microsoft.sqlserver.jdbc.SQLServerBulkCopy
> h2. Oracle
> There is no bulk load SQL command. Bulk loading of external data can be achieved via:
> * Oracle External Tables
> ([https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sutil/oracle-external-tables-concepts.html])
> * SQL*Loader
> ([https://docs.oracle.com/database/121/SUTIL/GUID-8D037494-07FA-4226-B507-E1B2ED10C144.htm#SUTIL3311])
> * There is a separate utility for the Oracle TimesTen in-memory database:
> [https://docs.oracle.com/database/121/TTREF/util.htm#TTREF324]
> h2. Vertica DB
> {noformat}
> COPY [ [db-name.]schema-name.]target-table
> ... [ ( { column-as-expression | column }
> ...... [ DELIMITER [ AS ] 'char' ]
> ...... [ ENCLOSED [ BY ] 'char' ]
> ...... [ ENFORCELENGTH ]
> ...... [ ESCAPE [ AS ] 'char' | NO ESCAPE ]
> ...... [ FILLER datatype]
> ...... [ FORMAT 'format' ]
> ...... [ NULL [ AS ] 'string' ]
> ...... [ TRIM 'byte' ]
> ... [, ... ] ) ]
> ... [ COLUMN OPTION ( column
> ...... [ DELIMITER [ AS ] 'char' ]
> ...... [ ENCLOSED [ BY ] 'char' ]
> ...... [ ENFORCELENGTH ]
> ...... [ ESCAPE [ AS ] 'char' | NO ESCAPE ]
> ...... [ FORMAT 'format' ]
> ...... [ NULL [ AS ] 'string' ]
> ...... [ TRIM 'byte' ]
> ... [, ... ] ) ]
> [ FROM {
> ...STDIN
> ...... [ BZIP | GZIP | LZO | UNCOMPRESSED ]
> ...| 'pathToData' [ ON nodename | ON ANY NODE ]
> ...... [ BZIP | GZIP | LZO | UNCOMPRESSED ] [, ...]
> ...| LOCAL {STDIN | 'pathToData'}
> ...... [ BZIP | GZIP | LZO | UNCOMPRESSED ] [, ...]
> ...| VERTICA source_database.[source_schema.]source_table[(source_column [,...]) ]
> } ]
> ...[ NATIVE
> .....| NATIVE VARCHAR
> .....| FIXEDWIDTH COLSIZES (integer [,...])
> .....| ORC
> .....| PARQUET
> ...]
> ...[ WITH ]
> ......[ SOURCE source([arg=value [,...] ]) ]
> ......[ FILTER filter([arg=value [,...] ]) ]
> ......[ PARSER parser([arg=value [,...] ]) ]
> ...[ DELIMITER [ AS ] 'char' ]
> ...[ TRAILING NULLCOLS ]
> ...[ NULL [ AS ] 'string' ]
> ...[ ESCAPE [ AS ] 'char' | NO ESCAPE ]
> ...[ ENCLOSED [ BY ] 'char' ]
> ...[ RECORD TERMINATOR 'string' ]
> ...[ SKIP records ]
> ...[ SKIP BYTES integer ]
> ...[ TRIM 'byte' ]
> ...[ REJECTMAX integer ]
> ...[ REJECTED DATA {'path' [ ON nodename ] [, ...] | AS TABLE reject-table} ]
> ...[ EXCEPTIONS 'path' [ ON nodename ] [, ...] ]
> ...[ ENFORCELENGTH ]
> ...[ ERROR TOLERANCE ]
> ...[ ABORT ON ERROR ]
> ...[ [ STORAGE ] load-method ]
> ...[ STREAM NAME 'streamName']
> ...[ NO COMMIT ]
> {noformat}
> ([https://my.vertica.com/docs/7.2.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/COPY/COPY.htm])
> h2. Various solutions from vendors not mentioned above
> * Apache Hive:
>   [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables]
> * Apache HBase:
>   [http://dwgeek.com/apache-hbase-bulk-load-csv-examples.html]
> * SAP IQ:
>   [https://help.sap.com/viewer/a898a5b484f21015a377cd1ccb6ee9b5/16.0.11.12/en-US/a6209de484f21015bcb2d858c21ab35e.html]
> * VoltDB:
>   [https://docs.voltdb.com/UsingVoltDB/clicsvloader.php]
> * MemSQL:
>   [https://docs.memsql.com/sql-reference/v6.0/load-data/]
>   (creating pipelines and connecting them to the LOAD DATA statement is also a notable feature)
> * Sybase:
>   [http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.1/dbreference/input-statement.html]
> * IBM DB2:
>   [https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.5.0/com.ibm.db2.luw.admin.cmd.doc/doc/r0008305.html]
> * IBM Informix:
>   [https://www.ibm.com/support/knowledgecenter/en/SSGU8G_12.1.0/com.ibm.sqls.doc/ids_sqs_0884.htm]
> * Apache Derby (AKA Java DB, Apache DB):
>   [https://db.apache.org/derby/docs/10.7/tools/ttoolsimporting.html]
> * Google Cloud Spanner:
>   [https://cloud.google.com/spanner/docs/bulk-loading]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)