nifi-users mailing list archives

From Mike Thomsen <mikerthom...@gmail.com>
Subject Re: Bulk inserting into HBase with NiFi
Date Wed, 07 Jun 2017 19:56:45 GMT
Yeah, it's really getting hammered by the small files. I took a look at the
new record APIs and they looked really promising. In fact, I'm taking a
shot at creating a variant of PutHBaseJSON that uses the record API. It
looks fairly straightforward so far. My strategy is roughly like this:

GetFile -> SplitText -> ExecuteScript -> RouteOnAttribute -> PutHBaseJSONRecord

ExecuteScript generates a larger flowfile that contains a structure like
this now:

[
  { "key": "XYZ", "value": "ABC" }
]
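As a standalone sketch of what that ExecuteScript step does (plain Python, no NiFi session objects; the comma-separated line format and the key derivation are assumptions for illustration):

```python
import json

def lines_to_records(text):
    """Turn split text lines into the [{"key": ..., "value": ...}] structure.

    Assumes each input line is "key,value"; the real script's parsing
    logic may differ.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(",")
        records.append({"key": key, "value": value})
    return json.dumps(records)

print(lines_to_records("XYZ,ABC\nDEF,123"))
# → [{"key": "XYZ", "value": "ABC"}, {"key": "DEF", "value": "123"}]
```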


My intention is to have a JsonPathReader take that bigger flowfile, which is
a JSON array, and iterate over it as a bunch of records to turn into Puts
with the new HBase processor. I'm borrowing some code for wiring in the
reader from the QueryRecord processor.

So my only question now is, what is the best way to serialize the Record
objects to JSON? The PutHBaseJson processor already has a Jackson setup
internally. Any suggestions on doing this in a way that doesn't tie me at
the hip to a particular reader implementation?
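Conceptually, the decoupling I'm after looks like this (sketched in Python rather than Java; `to_plain_map` is a made-up stand-in for whatever map view the Record API exposes, and `CsvRecord` is a hypothetical reader-specific type):

```python
import json

class CsvRecord:
    """Stand-in for a record produced by one particular reader implementation."""
    def __init__(self, key, value):
        self.key = key
        self.value = value

    def to_plain_map(self):
        # Hypothetical analogue of a Record-to-map conversion: reduce the
        # record to a plain dict so serialization never sees reader types.
        return {"key": self.key, "value": self.value}

def serialize(record):
    """Serialize via the plain-map boundary: any reader whose records can
    produce a dict goes through the same JSON path."""
    return json.dumps(record.to_plain_map())

print(serialize(CsvRecord("XYZ", "ABC")))
# → {"key": "XYZ", "value": "ABC"}
```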

Thanks,

Mike


On Wed, Jun 7, 2017 at 6:12 PM, Bryan Bende <bbende@gmail.com> wrote:

> Mike,
>
> Just following up on this...
>
> I created this JIRA to track the idea of record-based HBase processors:
> https://issues.apache.org/jira/browse/NIFI-4034
>
> Also wanted to mention that with the existing processors, the main way
> to scale up would be to increase the concurrent tasks on PutHBaseJson
> and also to increase the Batch Size property which defaults to 25. The
> Batch Size controls the maximum number of flow files that a concurrent
> task will attempt to pull from the queue and send to HBase in one put
> operation.
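Roughly, each concurrent task does something like this (a Python sketch of the batching behavior only, not the actual processor code; names are illustrative):

```python
from collections import deque

def drain_batch(queue, batch_size=25):
    """Pull up to batch_size flow files from the queue for one put operation.

    Mirrors the Batch Size behavior described above: one task, one call
    to HBase, at most batch_size items.
    """
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(queue.popleft())
    return batch

queue = deque(range(60))
print(len(drain_batch(queue)))                  # → 25
print(len(drain_batch(queue, batch_size=50)))   # → 35 (only 35 left)
```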
>
> Even with those tweaks your flow may still be getting hammered with
> lots of small flow files, but I thought I would mention it to see if it
> helps at all.
>
> -Bryan
>
>
> On Tue, Jun 6, 2017 at 7:40 PM, Bryan Bende <bbende@gmail.com> wrote:
> > Mike,
> >
> > With the record-oriented processors that have come out recently, a
> > good solution would be to implement a PutHBaseRecord processor that would
> > have a Record Reader configured. This way the processor could read in a
> > large CSV without having to convert to individual JSON documents.
> >
> > One thing to consider is how many records/puts to send in a single call to
> > HBase. Assuming multi-GB CSV files you'll want to send portions at a time to
> > avoid having the whole content in memory (some kind of record batch size
> > property), but then you also have to deal with what happens when things fail
> > half way through. If the puts are idempotent then it may be fine to route
> > the whole batch to failure and try again even if some data was already
> > inserted.
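The idempotency point can be sketched with a toy model (Python, a dict standing in for the table): a put keyed by row overwrites with the same value on replay, so retrying the whole batch after a partial failure converges to the same state.

```python
def apply_puts(table, puts):
    """Apply key/value puts to a dict standing in for an HBase table.

    Re-applying the same put overwrites the row with the same value,
    so the operation is idempotent.
    """
    for key, value in puts:
        table[key] = value

table = {}
batch = [("row1", "A"), ("row2", "B")]

apply_puts(table, batch[:1])   # simulate failing half way through
apply_puts(table, batch)       # retry the whole batch
print(table)                   # → {'row1': 'A', 'row2': 'B'}
```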
> >
> > Feel free to create a JIRA for hbase record processors, or I can do it
> > later.
> >
> > Hope that helps.
> >
> > -Bryan
> >
> >
> > On Tue, Jun 6, 2017 at 7:21 PM Mike Thomsen <mikerthomsen@gmail.com> wrote:
> >>
> >> We have a very large body of CSV files (well over 1TB) that need to be
> >> imported into HBase. For a single 20GB segment, we are looking at having to
> >> push easily 100M flowfiles into HBase, and most of the JSON files generated
> >> are rather small (like 20-250 bytes).
> >>
> >> It's going very slowly, and I assume that is because we're taxing the disk
> >> very heavily because of the content and provenance repositories coming into
> >> play. So I'm wondering if anyone has a suggestion on a good NiFiesque way of
> >> solving this. Right now, I'm considering two options:
> >>
> >> 1. Looking for a way to inject the HBase controller service into an
> >> ExecuteScript processor so I can handle the data in large chunks (splitting
> >> text and generating a List<Put> inside the processor myself and doing one
> >> huge Put)
> >>
> >> 2. Creating a library that lets me generate HFiles from within an
> >> ExecuteScript processor.
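The chunking in option 1 can be sketched independently of NiFi (Python generator; the batch size and line format are assumptions, and each yielded batch would become one List<Put> and one call to HBase):

```python
import io

def batched_lines(stream, batch_size=10000):
    """Yield lists of up to batch_size lines from a large text stream,
    so the whole 20GB segment never sits in memory at once."""
    batch = []
    for line in stream:
        batch.append(line.rstrip("\n"))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

stream = io.StringIO("\n".join(f"row{i},v{i}" for i in range(7)))
sizes = [len(b) for b in batched_lines(stream, batch_size=3)]
print(sizes)  # → [3, 3, 1]
```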
> >>
> >> What I really need is something fast within NiFi that would let me
> >> generate huge blocks of updates for HBase and push them out. Any ideas?
> >>
> >> Thanks,
> >>
> >> Mike
> >
> > --
> > Sent from Gmail Mobile
>
