nifi-dev mailing list archives

From Dmitry Goldenberg <dgoldenb...@hexastax.com>
Subject Re: Re: Filtering large CSV files
Date Tue, 05 Apr 2016 19:21:17 GMT
Hi Uwe,

Yes, that is what I was thinking of using for the CSV processor.  Will you
be committing your version?

- Dmitry
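[Editor's illustration] The parse-time column filtering proposed in this thread can be sketched as follows. This is only a sketch: Python's stdlib csv module stands in for a Java library such as opencsv, and the function and parameter names (`filter_csv_columns`, `keep_indices`) are made up for illustration, not part of any NiFi processor.

```python
# Sketch of parse-time column filtering, as discussed for the proposed
# GetCSV/SplitCSV processors. Python's csv module stands in for a real
# Java CSV library (e.g. opencsv); all names here are hypothetical.
import csv
import io

def filter_csv_columns(text, keep_indices):
    """Stream CSV records, keeping only the columns at keep_indices."""
    reader = csv.reader(io.StringIO(text))
    for row in reader:
        # Filtering happens while parsing, not as a separate dataflow step.
        yield [row[i] for i in keep_indices if i < len(row)]

sample = "id,name,city\n1,Alice,Boston\n2,Bob,Paris\n"
rows = list(filter_csv_columns(sample, keep_indices=[0, 2]))
# rows == [["id", "city"], ["1", "Boston"], ["2", "Paris"]]
```

Because the unwanted columns are dropped as each record is read, the full-width rows never need to be materialized downstream, which is the efficiency argument made in this thread.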

On Tue, Apr 5, 2016 at 1:39 PM, Uwe Geercken <uwe.geercken@web.de> wrote:

> Dmitry,
>
> I was working on a processor for CSV files and one remark came up that we
> might want to use the opencsv library for parsing the file.
>
> Here is the link: http://opencsv.sourceforge.net/
>
> Greetings,
>
> Uwe
>
> > Sent: Tuesday, 05 April 2016 at 13:00
> > From: "Dmitry Goldenberg" <dgoldenberg@hexastax.com>
> > To: dev@nifi.apache.org
> > Subject: Re: Filtering large CSV files
> >
> > Hi Eric,
> >
> > Thinking about exactly these use-cases, I filed the following JIRA ticket:
> > NIFI-1716 <https://issues.apache.org/jira/browse/NIFI-1716>. It asks for a
> > SplitCSV processor, and actually for a GetCSV ingress which would address
> > the issue of reading out of a large CSV, treating it as a "data source". I
> > was thinking of actually implementing both and committing them.
> >
> > NIFI-1280 <https://issues.apache.org/jira/browse/NIFI-1280> asks for a
> > way to filter the CSV columns. I believe this is best achieved while the
> > CSV is being parsed, in other words, in GetCSV/SplitCSV, and not as a
> > separate step.
> >
> > I'm not sure that SplitText is the best way to process CSV data to begin
> > with, because with CSV there's a chance that a given cell may spill over
> > into multiple lines. Such would be the case with embedded newlines within
> > a single, quoted cell. I don't think SplitText addresses that, and that
> > would be one reason to implement GetCSV/SplitCSV with proper CSV parsing
> > semantics, the other reason being efficiency of reading.
> >
> > As far as the limit on capturing groups goes, that seems arbitrary. I
> > think that in GetCSV/SplitCSV, a way to identify the filtered-out columns
> > by their number (index) would go a long way; perhaps a regex is also a
> > good option. I know it may seem that filtering should be a separate step
> > in a given dataflow, but from the point of view of efficiency, I believe
> > it belongs right in the GetCSV/SplitCSV processors, as the CSV records are
> > being read and processed.
> >
> > - Dmitry
> >
> >
> >
> >
> > On Tue, Apr 5, 2016 at 6:36 AM, Eric FALK <eric.falk@uni.lu> wrote:
> >
> > > Dear all,
> > >
> > > I need to filter large CSV files in a data flow. By filtering I mean:
> > > scaling down the file in terms of columns, and looking for a particular
> > > value that matches a parameter. I looked into the CSV-to-JSON example.
> > > I do have a couple of questions:
> > >
> > > - First, I use a SplitText processor to get each line of the file. It
> > > makes things slow, as it seems to generate a flow file for each line.
> > > Do I have to proceed this way, or is there an alternative? My CSV files
> > > are really large and can have millions of lines.
> > >
> > > - In a second step I am extracting the values with the (.+),(.+),….,(.+)
> > > technique, before using a processor to check for a match on ${csv.146},
> > > for instance. Now I have a problem: my CSV has 233 fields, so I am
> > > getting the message: “ReGex is required to have between 1 and 40
> > > capturing groups but has 233”. Again, is there another way to proceed,
> > > or am I missing something?
> > >
> > > Best regards,
> > > Eric
> >
>
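[Editor's illustration] The embedded-newline problem raised in this thread can be shown in a few lines. This is a sketch only: Python's stdlib csv module stands in for a proper CSV parser, and the sample data is made up.

```python
# Why line-based splitting (as SplitText does) breaks on CSV cells that
# contain embedded newlines, while a real CSV parser keeps the record whole.
import csv
import io

data = 'id,comment\n1,"line one\nline two"\n2,plain\n'

# Naive line splitting tears the quoted cell into two bogus "records".
naive = data.strip().split("\n")
# naive == ['id,comment', '1,"line one', 'line two"', '2,plain']

# Proper CSV parsing keeps the embedded newline inside a single record.
rows = list(csv.reader(io.StringIO(data)))
# rows == [['id', 'comment'], ['1', 'line one\nline two'], ['2', 'plain']]
```

Three logical records come out of the parser, versus four fragments from the line split, which is why a GetCSV/SplitCSV processor with real CSV semantics (rather than SplitText) is being proposed here.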
