apex-users mailing list archives

From Thomas Weise <...@apache.org>
Subject Re: One-time Initialization of in-memory data using a data file
Date Mon, 23 Jan 2017 23:33:25 GMT
Roger,

An Apex operator typically holds state that it uses for processing, and
often that state is mutable. For large state, "managed state" in
Malhar (and its predecessor HDHT) was designed for state that
can be mutated efficiently under a specific write pattern (semi-ordered
keys). However, there is no benefit in using these for
immutable data that is already in HDFS.

In that case it would be best to store the data (during migration/ingest)
in HDFS in a file format that allows for fast random reads (block-structured
files like HFile or TFile, or any other indexed structure,
provide that).
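
To illustrate the idea of an indexed file that supports fast random
reads, here is a minimal Java sketch. It is not HFile or TFile and does
not use HDFS: the class name IndexedRecordFile, the local temp file, and
the per-record (rather than per-block) index are all assumptions made
for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical illustration only; not part of Apex, Malhar, or HBase.
final class IndexedRecordFile {
    private final Path file;
    // key -> {byte offset of the value, length of the value in bytes}
    private final Map<String, long[]> index = new HashMap<>();

    // Writes the values in key order to a local temp file (a stand-in for
    // a file in HDFS) and records where each value starts.
    IndexedRecordFile(SortedMap<String, String> records) {
        try {
            file = Files.createTempFile("lookup", ".dat");
            file.toFile().deleteOnExit();
            try (RandomAccessFile out = new RandomAccessFile(file.toFile(), "rw")) {
                for (Map.Entry<String, String> e : records.entrySet()) {
                    byte[] value = e.getValue().getBytes(StandardCharsets.UTF_8);
                    index.put(e.getKey(), new long[] {out.getFilePointer(), value.length});
                    out.write(value);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Random read: seek straight to the value instead of scanning the file.
    // Returns null when the key is not in the index.
    String get(String key) {
        long[] entry = index.get(key);
        if (entry == null) {
            return null;
        }
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(entry[0]);
            byte[] value = new byte[(int) entry[1]];
            in.readFully(value);
            return new String(value, StandardCharsets.UTF_8);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Real block-structured formats index blocks of sorted keys rather than
individual records, which keeps the index small for large data sets.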

Also, depending on how the data, once in memory, would be used, an
Apex operator may or may not be the right home. If the goal is only to
look up data, without further processing, using a synchronous
request/response pattern, then an IMDG (in-memory data grid) or similar
system may be a more appropriate solution.

Here are pointers for managed state:

https://ci.apache.org/projects/apex-malhar/apex-malhar-javadoc-release-3.6/index.html
https://github.com/apache/apex-malhar/blob/master/benchmark/src/main/java/com/datatorrent/benchmark/state/ManagedStateBenchmarkApp.java

Thanks,
Thomas


On Sun, Jan 22, 2017 at 11:43 PM, Ashwin Chandra Putta
<ashwinchandrap@gmail.com> wrote:
> Roger,
>
> Depending on the requirements for expected latency, size of data, etc.,
> the operator's design will change.
>
> If latency needs to be the lowest possible, meaning completely in-memory and not
> hitting the disk for read I/O, there are two scenarios:
> 1. If the lookup data size is small --> just load it into memory in the setup
> call and switch off checkpointing to get rid of checkpoint I/O latency in
> between. In case of operator restarts, the data should be reloaded in setup.
> 2. If the lookup data is large --> have many partitions of this operator to
> minimize the footprint of each partition. Still switch off checkpointing and
> reload in setup in case of operator restart. Having many partitions will
> ensure that the setup load is fast. The incoming queries need to be
> partitioned based on the lookup key.
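
As a rough illustration of the two scenarios above, the sketch below
loads "key,value" lines into a map once and keeps only the keys that
hash to its own partition. It is plain Java rather than Apex API:
InMemoryLookup, load, and partitionFor are hypothetical names, and the
"key,value" line format is an assumption. In a real operator, load would
presumably be called from setup() with a reader over the HDFS file, so
that a restarted operator reloads the data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of Apex or Malhar.
final class InMemoryLookup {
    private final Map<String, String> table = new HashMap<>();
    private final int partitionId;
    private final int partitionCount;

    InMemoryLookup(int partitionId, int partitionCount) {
        this.partitionId = partitionId;
        this.partitionCount = partitionCount;
    }

    // Maps a lookup key to a partition. Queries must be routed with the
    // same function, or they will reach a partition that lacks the key.
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // One-time load: reads "key,value" lines and keeps only this
    // partition's keys. Lines without a comma are skipped.
    void load(Reader source) {
        table.clear();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    continue;
                }
                String key = line.substring(0, comma);
                if (partitionFor(key, partitionCount) == partitionId) {
                    table.put(key, line.substring(comma + 1));
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // In-memory lookup; returns null if this partition does not hold the key.
    String get(String key) {
        return table.get(key);
    }

    int size() {
        return table.size();
    }
}
```

With a single partition (partitionCount of 1) this reduces to scenario 1,
where the whole file is held by one operator instance.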
>
> You can use the POJOEnricher with FSLoader for the above design.
>
> Code:
> https://github.com/apache/apex-malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/enrich/POJOEnricher.java
> Example:
> https://github.com/DataTorrent/examples/tree/master/tutorials/enricher
>
> If the lookup dataset is large and the latency caused by disk read I/O is
> acceptable, then use HDHT or managed state as a backup mechanism for the
> in-memory data to decrease the checkpoint footprint. I could not find an
> example for managed state, but here are the links for HDHT:
>
> Code:
> https://github.com/DataTorrent/Megh/tree/master/contrib/src/main/java/com/datatorrent/contrib/hdht
> Example:
> https://github.com/DataTorrent/examples/blob/master/tutorials/hdht/src/test/java/com/example/HDHTAppTest.java
>
> Regards,
> Ashwin.
>
> On Sun, Jan 22, 2017 at 10:45 PM, Sanjay Pujare <sanjay@datatorrent.com>
> wrote:
>>
>> You may want to take a look at com.datatorrent.lib.fileaccess.DTFileReader
>> in the malhar-library; I am not sure whether it supports reading the whole
>> file into memory.
>>
>>
>>
>> Also, there is a library called Megh at https://github.com/DataTorrent/Megh
>> where you might find some useful classes like
>> com.datatorrent.contrib.hdht.hfile.HFileImpl.
>>
>>
>>
>> From: Roger F <rf301623@gmail.com>
>> Reply-To: <users@apex.apache.org>
>> Date: Sunday, January 22, 2017 at 9:32 PM
>> To: <users@apex.apache.org>
>> Subject: One-time Initialization of in-memory data using a data file
>>
>>
>>
>> Hi,
>>
>> I have a use case where application business data needs to be migrated from
>> a legacy system (such as a mainframe) into HDFS and then loaded for use by an
>> Apex application.
>>
>> To get this done, an approach being considered is to perform a one-time
>> initialization of the data from HDFS into application memory. This data
>> will then be queried for various business logic functions of the
>> application.
>>
>> Once the data is loaded, this operator/module (?) should no longer perform
>> any further function except for acting as a master of this data and then
>> supporting operations to query the data (via a key).
>>
>> Any pointers on how this can be done? I was looking for an operator or
>> any other entity that can load this data at startup (Activation or Setup)
>> and then allow queries to be submitted to it via an input port.
>>
>>
>>
>> -R
>
>
>
>
> --
>
> Regards,
> Ashwin.
