MIME-Version: 1.0
From: AJAY GUPTA
Date: Wed, 2 Nov 2016 12:45:18 +0530
Subject: Re: [jira] [Commented] (APEXMALHAR-2303) S3 Line By
Line Module
To: dev@apex.apache.org
Cc: dev@apex.incubator.apache.org
Content-Type: multipart/alternative; boundary=94eb2c186b8818f6a705404c34fa
archived-at: Wed, 02 Nov 2016 07:15:25 -0000

--94eb2c186b8818f6a705404c34fa
Content-Type: text/plain; charset=UTF-8

Hi Apex Dev Community,

For the fixed-width S3 record reader, the input is the block metadata,
which contains the block offset and the block length. The block length may
not be a multiple of the record length (e.g., the block length can be 1 MB
and the record length 23 bytes). Hence, the first byte in the block may
belong to a record that starts in the previous block. Similarly, the last
record may not have all its bytes in this block and may spill over into the
next block.

Since the records are fixed width, we can optimize the way data is fetched
from S3. We can adjust the start offset and end offset of each fetch so
that the fetched range is aligned to record boundaries and no record is
split across two fetches.

While retrieving the block, we will retrieve bytes X through Y, where
X is the start byte of the first record whose first byte is in the current block
Y is the end byte of the last record that starts in the current block

startOffset = block.startOffset + (recordLength - block.startOffset % recordLength) % recordLength
endOffset = block.endOffset + (recordLength - block.endOffset % recordLength) % recordLength - 1

This ensures that no record needs multiple GET requests to be fetched in
full, and that no extra bytes are read from S3.

Kindly let me know your views or alternative approaches for the same.

Regards,
Ajay

On Tue, Nov 1, 2016 at 2:16 PM, ASF GitHub Bot (JIRA) wrote:

>
> [ https://issues.apache.org/jira/browse/APEXMALHAR-2303?
> page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&
> focusedCommentId=15624810#comment-15624810 ]
>
> ASF GitHub Bot commented on APEXMALHAR-2303:
> --------------------------------------------
>
> GitHub user ajaygit158 opened a pull request:
>
>     https://github.com/apache/apex-malhar/pull/478
>
>     APEXMALHAR-2303 Added S3RecordReaderModule for reading records line by line
>
>     @chaithu14 @yogidevendra Kindly review
>
> You can merge this pull request into a Git repository by running:
>
>     $ git pull https://github.com/ajaygit158/apex-malhar APEXMALHAR-2303
>
> Alternatively you can review and apply these changes as the patch at:
>
>     https://github.com/apache/apex-malhar/pull/478.patch
>
> To close this pull request, make a commit to your master/trunk branch
> with (at least) the following in the commit message:
>
>     This closes #478
>
> ----
> commit b999cbd044b370a271ea8265f2b3e4b7be3935bc
> Author: Ajay
> Date: 2016-10-27T12:57:28Z
>
>     Added S3 Record Reader module
>
> commit 426f8f6efc838ca754ad6070c3d0110537b1f222
> Author: Ajay
> Date: 2016-10-28T13:42:51Z
>
>     Changes to ensure compilation with jdk 1.7
>
> commit a2e7d9892e00784b881c53e2d44cff12ceb6abb1
> Author: Ajay
> Date: 2016-11-01T08:42:27Z
>
>     Few corrections in S3RecordReader
>
> ----
>
>
> > S3 Line By Line Module
> > ----------------------
> >
> > Key: APEXMALHAR-2303
> > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2303
> > Project: Apache Apex Malhar
> > Issue Type: Bug
> > Reporter: Ajay Gupta
> > Assignee: Ajay Gupta
> > Original Estimate: 336h
> > Remaining Estimate: 336h
> >
> > This is a new module which will consist of 2 operators
> > 1) File Splitter -- Already existing in Malhar library
> > 2) S3RecordReader -- Read a file from S3 and output the records (delimited or fixed width)
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
--94eb2c186b8818f6a705404c34fa--
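
For reference, the offset alignment proposed for the fixed-width reader can be sketched roughly as below. This is only an illustration in Python, not the S3RecordReader code from the pull request. The function name aligned_range is hypothetical, and the sketch assumes that block.endOffset is exclusive (block offset + block length) and that the returned endOffset is the inclusive last byte to request. It also returns None when no record starts inside the block, a case the formulas alone do not cover.

```python
from typing import Optional, Tuple


def aligned_range(block_start: int, block_end: int,
                  record_length: int) -> Optional[Tuple[int, int]]:
    """Return the inclusive (start, end) byte range covering every record
    whose first byte lies in [block_start, block_end).

    Assumption: block_end is exclusive (block_start + block length).
    Returns None when no record starts inside the block.
    """
    if record_length <= 0:
        raise ValueError("record_length must be positive")

    # Round the block start up to the nearest record boundary.
    start = block_start + (record_length - block_start % record_length) % record_length

    # Round the block end up to the nearest record boundary, then step back
    # one byte to get the last byte of the last record starting in the block.
    end = block_end + (record_length - block_end % record_length) % record_length - 1

    if start > end:
        # The block is smaller than a record and contains no record start.
        return None
    return start, end


if __name__ == "__main__":
    one_mb = 1024 * 1024
    # Two consecutive 1 MB blocks with 23-byte records.
    print(aligned_range(0, one_mb, 23))
    print(aligned_range(one_mb, 2 * one_mb, 23))
```

With consecutive blocks, the range computed for one block ends exactly one byte before the range for the next block begins, so each record is fetched once and by a single request.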