nifi-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (NIFI-988) PutDistributedMapCache processor
Date Wed, 23 Sep 2015 16:28:04 GMT


ASF GitHub Bot commented on NIFI-988:

Github user joemeszaros commented on the pull request:
    I have several tracking event files, containing user interactions, e.g. user.x liked item.y
in the following format:
    |UserId  | Action | ItemId |
    | ------------- | ------------- | ------------- |
    | user.x | like  | item.y |
    | user.xx | like  | item.z |
    I need to enrich these event files e.g. with the title of the associated item from a separate
item file, containing the item metadata:
    |ItemId  | Title |
    | ------------- | ------------- |
    | item.y | Title for item.y  |
    | item.z | Title for item.z  |
    and the enriched event file should look like this:
    |UserId  | Action | ItemId | Title |
    | ------------- | ------------- | ------------- | ------------- |
    | user.x | like  | item.y | Title for item.y|
    | user.xx | like  | item.z | Title for item.z|
    My idea was to cache the item file in a distributed cache, since caching is typical controller
service functionality, and to use the same cache to extend the event files one by one, looking
up the title by ItemId. That way I need to read the item file only once. I created
a workflow that grabs the item file and creates a flow file for each item (each line), with
the ItemId added as a custom flow file attribute, and then puts those flow files into the distributed
cache using the PutDistributedMapCache processor. The cache key is the custom ItemId attribute,
and the metadata is the cache value. During event file enrichment I use this item catalogue
cache to look up an ItemId and get, e.g., the title.
    (My actual workflow is not quite this simple, because I also use JSON conversion and additional
processors.)
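    The caching and lookup idea described above can be sketched in plain Java. This is only an
illustration under stated assumptions: an in-memory `HashMap` stands in for the distributed map
cache, the class and method names (`EventEnricher`, `loadItemCache`, `enrich`) are hypothetical,
the files are assumed to be simple pipe-delimited lines, and none of it is NiFi API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a plain Map stands in for the distributed map cache.
class EventEnricher {

    // Read the item file once: each line is assumed to be "ItemId|Title".
    // The ItemId becomes the cache key, the metadata (title) the cache value.
    static Map<String, String> loadItemCache(List<String> itemLines) {
        Map<String, String> cache = new HashMap<>();
        for (String line : itemLines) {
            String[] parts = line.split("\\|", 2);
            if (parts.length == 2) {
                cache.put(parts[0].trim(), parts[1].trim());
            }
        }
        return cache;
    }

    // Enrich each event line, assumed to be "UserId|Action|ItemId",
    // by appending the title found in the cache (empty if the item is unknown).
    static List<String> enrich(List<String> eventLines, Map<String, String> cache) {
        List<String> enriched = new ArrayList<>();
        for (String line : eventLines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split("\\|");
            String itemId = parts[parts.length - 1].trim();
            String title = cache.getOrDefault(itemId, "");
            enriched.add(line.trim() + "|" + title);
        }
        return enriched;
    }

    public static void main(String[] args) {
        Map<String, String> cache = loadItemCache(Arrays.asList(
                "item.y|Title for item.y",
                "item.z|Title for item.z"));
        List<String> events = Arrays.asList(
                "user.x|like|item.y",
                "user.xx|like|item.z");
        for (String line : enrich(events, cache)) {
            System.out.println(line);
        }
    }
}
```

    The point of the sketch is that the item file is parsed once, while any number of event files
can then be enriched with cheap key lookups.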
    DetectDuplicate was not an appropriate processor for me, because (as its name suggests)
it is used for duplicate detection and caches a custom flow file attribute, not the flow file.
    I hope I was able to explain my rationale for this new processor  :-)

> PutDistributedMapCache processor
> --------------------------------
>                 Key: NIFI-988
>                 URL:
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Joe Mészáros
>            Priority: Minor
>              Labels: cache, distributed, feature, new, put
> There is a standard controller service, called DistributedMapCacheServer, which provides
a distributed cache, and an associated DistributedMapCacheClientService to interact with the
cache. But there is no standard processor that puts data into the cache and helps
users leverage the distributed cache capabilities.
> The purpose of PutDistributedMapCache is very similar to that of the egress processors: it takes
the content of a FlowFile and puts it into a distributed map cache, using a cache key computed
from FlowFile attributes. If the cache already contains the entry and the cache update strategy
is 'keep original', the entry is not replaced.
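
The 'keep original' behaviour described in the issue can be illustrated with a small Java sketch.
It is not the processor's implementation: a local `Map` stands in for the distributed cache, and
the names `CachePutSketch`, `UpdateStrategy`, and the `REPLACE` counterpart strategy are assumptions
made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a cache put with an update strategy.
class CachePutSketch {

    // KEEP_ORIGINAL comes from the issue text; REPLACE is an assumed counterpart.
    enum UpdateStrategy { REPLACE, KEEP_ORIGINAL }

    // Stores the content under the key and returns true if the cache was updated.
    // With KEEP_ORIGINAL, an existing entry is left untouched and false is returned.
    static boolean put(Map<String, String> cache, String key, String content,
                       UpdateStrategy strategy) {
        if (strategy == UpdateStrategy.KEEP_ORIGINAL) {
            return cache.putIfAbsent(key, content) == null;
        }
        cache.put(key, content);
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new ConcurrentHashMap<>();
        System.out.println(put(cache, "item.y", "first", UpdateStrategy.KEEP_ORIGINAL));
        System.out.println(put(cache, "item.y", "second", UpdateStrategy.KEEP_ORIGINAL));
        System.out.println(cache.get("item.y"));
    }
}
```

Here the second put with 'keep original' leaves the first value in place, which matches the rule
that an existing entry is not replaced.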

This message was sent by Atlassian JIRA
