Subject: Re: Question about Apache Flink Use Case
From: Kostas Kloudas
Date: Tue, 26 Jul 2016 12:22:00 +0200
To: dev@flink.apache.org
Cc: Suma Cherukuri

Hi Suma Cherukuri,

From what I understand, you have many small files and you want to
aggregate them into bigger ones containing the logs of the last 24 hours.
As Max said, the RollingSink will give you exactly-once semantics when
writing your aggregated results to your file system.

As far as reading your input is concerned, Flink recently integrated
functionality to periodically monitor a directory, e.g. your log
directory, and process only the new files as they appear. This will be
part of the 1.1 release, coming possibly this week or next, but you can
always find it on the master branch.

The method you need is:

  readFile(FileInputFormat inputFormat, String filePath,
           FileProcessingMode watchType, long interval,
           FilePathFilter filter, TypeInformation typeInformation)

which allows you to specify the FileProcessingMode (which you should set
to FileProcessingMode.PROCESS_CONTINUOUSLY) and the "interval" at which
Flink is going to monitor the directory (path) for new files.
In addition, you can find some helper methods in the
StreamExecutionEnvironment class that let you avoid specifying some of
these parameters.

I believe that with the above two features (the RollingSink and the
continuous-monitoring source) Flink can be the tool for your job, as
both of them also provide exactly-once guarantees.

I hope this helps.
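A minimal sketch of how the monitoring source described above could be wired up (assuming the 1.1-era API with the FilePathFilter overload; the directory path and 60-second scan interval are illustrative, not from the thread):

```java
// Sketch: continuously monitor a log directory and stream lines from new files.
// The directory path and scan interval below are illustrative assumptions.
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FilePathFilter;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class MonitorLogDir {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        final String logDir = "/var/log/myapp";  // hypothetical log directory
        final TextInputFormat format = new TextInputFormat(new Path(logDir));

        // PROCESS_CONTINUOUSLY re-scans the path every `interval` ms and
        // picks up files that appeared since the last scan.
        DataStream<String> lines = env.readFile(
                format,
                logDir,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                60000L,                                // scan interval (ms)
                FilePathFilter.createDefaultFilter()); // skip hidden/in-flight files

        lines.print();
        env.execute("Continuous log-directory monitoring");
    }
}
```

This needs a Flink runtime on the classpath to execute, so treat it as a starting point rather than a drop-in program.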
Let us know what you think,
Kostas

> On Jul 26, 2016, at 11:51 AM, Maximilian Michels wrote:
>
> Hi Suma Cherukuri,
>
> Apache Flink can certainly serve your use case very well. Here's why:
>
> 1) Apache Flink has connectors for Kafka and Elasticsearch. It
> supports reading from and writing to the S3 file system.
>
> 2) Apache Flink includes a RollingSink which splits up data into files
> with a configurable maximum file size. The RollingSink includes a
> "Bucketer" which lets you control when and how to create new
> directories or files.
>
> 3) Apache Flink's streaming API and runtime for event processing are
> among the most advanced out there (support for event time, windowing,
> and exactly-once semantics).
>
> These are just first pointers. Please don't hesitate to ask more
> questions. I think we would need a few more details about your use
> case to understand how exactly you would use Apache Flink.
>
> Best,
> Max
>
> On Sun, Jul 24, 2016 at 1:58 AM, Suneel Marthi wrote:
>> From the use case description, it seems like you are looking to aggregate
>> files based on either a threshold size or a threshold time and ship them
>> to S3. Correct?
>>
>> Flink might be overkill here, and you could look at frameworks like Apache
>> NiFi that have pre-built (and configurable) processors to do just what you
>> are describing.
>>
>> On Fri, Jul 22, 2016 at 3:00 PM, Suma Cherukuri wrote:
>>
>>> Hi,
>>>
>>> Good afternoon!
>>>
>>> I work as an engineer at Symantec. My team works on a multi-tenant event
>>> processing system. As high-level background, our customers write data to
>>> Kafka brokers through agents like Logstash, and we process the events and
>>> save the log data in Elasticsearch and S3.
>>>
>>> Use case: We have a use case where we write batches of events to S3 when
>>> a file size limit of 1 MB (specific to our case) or a certain time
>>> threshold is reached.
>>> We are planning on merging the files in a given folder into one single
>>> file based on a time limit, such as every 24 hours.
>>>
>>> We were considering various options available today and would like to
>>> know if Apache Flink can be used to serve this purpose.
>>>
>>> Looking forward to hearing from you.
>>>
>>> Thank you,
>>> Suma Cherukuri
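For completeness, the RollingSink that Max and Kostas mention could be attached roughly like this (a sketch against the 1.1-era flink-connector-filesystem API; the S3 path, date pattern, and batch size are illustrative assumptions):

```java
// Sketch: write a stream to S3 as rolling part files, bucketed by day.
// The path, date pattern, and batch size below are illustrative assumptions.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;

public class RollingSinkExample {
    public static void attachSink(DataStream<String> events) {
        RollingSink<String> sink =
                new RollingSink<>("s3://my-bucket/aggregated-logs");
        // One bucket (directory) per day, so each day's logs land together,
        // which matches the "merge every 24 hrs" goal in the thread.
        sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd"));
        // Roll over to a new part file once the current one reaches ~128 MB.
        sink.setBatchSize(128L * 1024L * 1024L);
        events.addSink(sink);
    }
}
```

Like the source example, this assumes the Flink filesystem connector is on the classpath; the Bucketer and batch-size knobs are the configurable pieces Max refers to.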