From: zuyu
To: dev@quickstep.incubator.apache.org
Reply-To: dev@quickstep.incubator.apache.org
Subject: [GitHub] incubator-quickstep pull request #19: Improve text scan operator
Message-Id: <20160609145821.1011FDFC61@git1-us-west.apache.org>
Date: Thu, 9 Jun 2016 14:58:21 +0000 (UTC)

Github user zuyu commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66457297

--- Diff: relational_operators/TextScanOperator.cpp ---
@@ -155,116 +63,50 @@ bool TextScanOperator::getAllWorkOrders(
       InsertDestination *output_destination =
           query_context->getInsertDestination(output_destination_index_);
 
-  if (parallelize_load_) {
-    // Parallel implementation: Split work orders are generated for each file
-    // being bulk-loaded. (More than one file can be loaded, because we support
-    // glob() semantics in file name.) These work orders read the input file,
-    // and split them in the blobs that can be parsed independently.
-    if (blocking_dependencies_met_) {
-      if (!work_generated_) {
-        // First, generate text-split work orders.
-        for (const auto &file : files) {
-          container->addNormalWorkOrder(
-              new TextSplitWorkOrder(query_id_,
-                                     file,
-                                     process_escape_sequences_,
-                                     storage_manager,
-                                     op_index_,
-                                     scheduler_client_id,
-                                     bus),
-              op_index_);
-          ++num_split_work_orders_;
-        }
-        work_generated_ = true;
-        return false;
-      } else {
-        // Check if there are blobs to parse.
-        while (!text_blob_queue_.empty()) {
-          const TextBlob blob_work = text_blob_queue_.popOne();
-          container->addNormalWorkOrder(
-              new TextScanWorkOrder(query_id_,
-                                    blob_work.blob_id,
-                                    blob_work.size,
-                                    field_terminator_,
-                                    process_escape_sequences_,
-                                    output_destination,
-                                    storage_manager),
-              op_index_);
-        }
-        // Done if all split work orders are completed, and no blobs are left to
-        // process.
-        return num_done_split_work_orders_.load(std::memory_order_acquire) == num_split_work_orders_ &&
-               text_blob_queue_.empty();
-      }
-    }
-    return false;
-  } else {
-    // Serial implementation.
-    if (blocking_dependencies_met_ && !work_generated_) {
-      for (const auto &file : files) {
+  // Text segment size set to 256KB.
+  constexpr std::size_t kTextSegmentSize = 0x40000u;
+
+  if (blocking_dependencies_met_ && !work_generated_) {
+    for (const std::string &file : files) {
+      // Use standard C libary to retrieve the file size.
+      FILE *fp = std::fopen(file.c_str(), "rb");
+      std::fseek(fp, 0, SEEK_END);
+      const std::size_t file_size = std::ftell(fp);
+      std::fclose(fp);
+
+      std::size_t text_offset = 0;
+      while (text_offset < file_size) {
         container->addNormalWorkOrder(
             new TextScanWorkOrder(query_id_,
                                   file,
+                                  text_offset,
+                                  std::min(kTextSegmentSize, file_size - text_offset),
                                   field_terminator_,
                                   process_escape_sequences_,
                                   output_destination,
                                   storage_manager),
             op_index_);
+        text_offset += kTextSegmentSize;
--- End diff --

This won't become a bug, but I think what we really mean is the following:

```
const size_t text_actual_segment_size =
    std::min(kTextSegmentSize, file_size - text_offset);
text_offset += text_actual_segment_size;
```
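
To illustrate why the fixed stride is not a bug, here is a minimal standalone sketch of the segmentation loop. `SplitIntoSegments`, `SegmentPlan`, and the `advance_by_actual_size` flag are hypothetical names introduced only for this example; they are not part of Quickstep. Both ways of advancing the offset should yield the same (offset, length) pairs, and only the final value of `text_offset` differs: with the suggested change it ends exactly at the file size instead of overshooting it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical result type for this sketch: the segments that would be handed
// to the scan work orders, plus the offset at which the loop stopped.
struct SegmentPlan {
  std::vector<std::pair<std::size_t, std::size_t>> segments;  // (offset, length)
  std::size_t final_offset;
};

// Hypothetical helper mirroring the while-loop in the diff.
// advance_by_actual_size == false: text_offset += segment_size (as in the diff).
// advance_by_actual_size == true:  text_offset += actual segment size
//                                  (as suggested in the comment).
SegmentPlan SplitIntoSegments(const std::size_t file_size,
                              const std::size_t segment_size,
                              const bool advance_by_actual_size) {
  assert(segment_size > 0);  // A zero stride would never terminate.
  SegmentPlan plan;
  std::size_t text_offset = 0;
  while (text_offset < file_size) {
    const std::size_t text_actual_segment_size =
        std::min(segment_size, file_size - text_offset);
    plan.segments.emplace_back(text_offset, text_actual_segment_size);
    text_offset += advance_by_actual_size ? text_actual_segment_size
                                          : segment_size;
  }
  plan.final_offset = text_offset;
  return plan;
}
```

For a 600000-byte file and the 256KB segment size from the diff, both variants produce three segments, the last one 75712 bytes long. The fixed stride leaves the offset at 786432, while the suggested form leaves it at 600000.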