From: Thomas Weise
Date: Mon, 23 Jan 2017 15:33:25 -0800
Subject: Re: One-time Initialization of in-memory data using a data file
To: users@apex.apache.org

Roger,

An Apex operator typically holds state that it uses for processing, and
often that state is mutable.

For large state: "Managed state" in Malhar (and its predecessor HDHT) was
designed for large state that can be mutated efficiently under a specific
write pattern (semi-ordered keys). However, there is no benefit in using
these for immutable data that is already in HDFS. In such a case it would
be best to store the data (during migration/ingest) in HDFS in a file
format that allows for fast random reads (block-structured files like HFile
or TFile, or any other indexed structure, provide that).

Also, depending on how the data, once in memory, would be used, an Apex
operator may or may not be the right home. If the goal is only to look up
data without further processing, with a synchronous request/response
pattern, then an IMDG or similar system may be a more appropriate solution.

Here are pointers for managed state:

https://ci.apache.org/projects/apex-malhar/apex-malhar-javadoc-release-3.6/index.html
https://github.com/apache/apex-malhar/blob/master/benchmark/src/main/java/com/datatorrent/benchmark/state/ManagedStateBenchmarkApp.java

Thanks,
Thomas

On Sun, Jan 22, 2017 at 11:43 PM, Ashwin Chandra Putta wrote:
> Roger,
>
> Depending on the requirements for expected latency, size of data, etc.,
> the operator's design will change.
>
> If latency needs to be the lowest possible, meaning completely in-memory
> and not hitting the disk for read I/O, there are two scenarios:
> 1. If the lookup data size is small --> just load it to memory in the
> setup call, and switch off checkpointing to get rid of checkpoint I/O
> latency in between. In case of operator restarts, the data should be
> reloaded in setup.
> 2. If the lookup data is large --> have many partitions of this operator
> to minimize the footprint of each partition. Still switch off
> checkpointing and reload in setup in case of operator restart. Having
> many partitions will ensure that the setup load is fast. The incoming
> query needs to be partitioned based on the lookup key.
>
> You can use the PojoEnricher with FSLoader for the above design.
>
> Code:
> https://github.com/apache/apex-malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/enrich/POJOEnricher.java
> Example:
> https://github.com/DataTorrent/examples/tree/master/tutorials/enricher
>
> If the lookup dataset is large and the latency caused by disk read I/O is
> acceptable, then use HDHT or managed state as a backup mechanism for the
> in-memory data to decrease the checkpoint footprint. I could not find an
> example for managed state, but here are the links for HDHT.
>
> Code:
> https://github.com/DataTorrent/Megh/tree/master/contrib/src/main/java/com/datatorrent/contrib/hdht
> Example:
> https://github.com/DataTorrent/examples/blob/master/tutorials/hdht/src/test/java/com/example/HDHTAppTest.java
>
> Regards,
> Ashwin.
>
> On Sun, Jan 22, 2017 at 10:45 PM, Sanjay Pujare wrote:
>>
>> You may want to take a look at com.datatorrent.lib.fileaccess.DTFileReader
>> in the malhar-library – not sure whether it supports reading the whole
>> file into memory.
>>
>> Also there is a library called Megh at https://github.com/DataTorrent/Megh
>> where you might find some useful operators like
>> com.datatorrent.contrib.hdht.hfile.HFileImpl .
>>
>> From: Roger F
>> Reply-To:
>> Date: Sunday, January 22, 2017 at 9:32 PM
>> To:
>> Subject: One-time Initialization of in-memory data using a data file
>>
>> Hi,
>>
>> I have a use case where application business data needs to be migrated
>> from a legacy system (such as mainframe) into HDFS and then loaded for
>> use by an Apex application.
>>
>> To get this done, an approach being considered is to perform a one-time
>> initialization of the data from HDFS into application memory. This data
>> will then be queried for various business logic functions of the
>> application.
>>
>> Once the data is loaded, this operator/module (?) should no longer perform
>> any further function except for acting as a master of this data and
>> supporting operations to query the data (via a key).
>>
>> Any pointers on how this can be done? I was looking for an operator or
>> any other entity which can load this data at startup (Activation or Setup)
>> and then allow queries to be submitted to it via an input port.
>>
>> -R
>
> --
>
> Regards,
> Ashwin.
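
For the two in-memory scenarios Ashwin describes in the thread (load the
lookup data in the setup call, reload it after a restart, and partition
incoming queries by lookup key), a minimal plain-Java sketch could look like
the following. It deliberately does not use the Apex API: the class
LookupTableSketch and its methods setup, lookup and partitionFor are
hypothetical names for illustration only, and the "key,value" line format of
the data file is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for a lookup operator; none of these names come from the Apex API.
class LookupTableSketch {
    // Marked transient on the assumption that the table is kept out of any
    // checkpointed state and is instead rebuilt by setup() on every (re)start.
    private transient Map<String, String> table = new HashMap<>();

    // Hypothetical counterpart of the setup call: read "key,value" lines into memory.
    void setup(Reader source) {
        Map<String, String> loaded = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    continue; // skip blank or malformed lines
                }
                String key = line.substring(0, comma).trim();
                String value = line.substring(comma + 1).trim();
                loaded.put(key, value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        table = loaded; // swap in the fully loaded table
    }

    // Query by key; returns null when the key is not present.
    String lookup(String key) {
        return table.get(key);
    }

    // Route a query to one of partitionCount partitions based on the lookup key,
    // so each partition only needs to hold its own share of the data.
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }
}
```

In a real application the Reader would wrap a stream opened on the HDFS file,
and each partition would keep only the keys that partitionFor maps to it.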