flink-user mailing list archives

From Michele Bertoni <michele1.bert...@mail.polimi.it>
Subject Re: open multiple file from list of uri
Date Wed, 15 Jul 2015 22:03:12 GMT
Hmm, it doesn’t seem to work: it calls the configure() method, which checks whether filePath is
null and throws an exception.
Actually, I set that field only in createInputSplits(), which runs some steps later.
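
One possible workaround, sketched under the assumption that configure() only needs filePath to be non-null by the time it runs, is to set a placeholder path in the constructor of the custom format. The class name PlaceholderPathFormat, the paths parameter, and the String record type are hypothetical; the createInputSplits() override discussed in the thread is left out for brevity.

```scala
import org.apache.flink.api.common.io.DelimitedInputFormat
import org.apache.flink.configuration.Configuration

// Hypothetical format: names and the String record type are illustrative only.
class PlaceholderPathFormat(paths: Seq[String]) extends DelimitedInputFormat[String] {

  // Runs at construction time, so filePath is already set when configure() is called.
  // Assumption: the first path is good enough as a placeholder.
  setFilePath(paths.head)

  // Decode each delimited record as a UTF-8 string.
  override def readRecord(reuse: String, bytes: Array[Byte], offset: Int, numBytes: Int): String =
    new String(bytes, offset, numBytes, "UTF-8")
}

object PlaceholderPathCheck {
  def main(args: Array[String]): Unit = {
    // Hypothetical local paths; the files do not need to exist for this check.
    val format = new PlaceholderPathFormat(Seq("file:///tmp/a.txt", "file:///tmp/b.txt"))
    // With the placeholder in place, the null check in configure() should pass.
    format.configure(new Configuration())
  }
}
```

Whether the placeholder is ever read later depends on the createInputSplits() override; in this sketch it only exists to satisfy the check.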

On 15 Jul 2015, at 13:16, Stephan Ewen <sewen@apache.org> wrote:

If you want to work without the placeholder, simply do: "env.createInput(new myDelimitedInputFormat(parser)(paths))"

The "createInputSplits()" method looks good.


On Tue, Jul 14, 2015 at 11:42 PM, Michele Bertoni <michele1.bertoni@mail.polimi.it> wrote:
Ok, thank you, now I solved it!

The problem was in env.readFile(myInputFormat, path).

Now that the path is actually a list of paths, what should I pass to it?

I solved it this way:

env.readFile(new myDelimitedInputFormat(parser)(paths), paths.head)

where paths.head gives readFile a URL that is just a “placeholder” and seems
never to be used, while the custom input format takes care of creating the splits from the
list of directories.

I tried it and it works.
Is this the correct way to do it? :)

FYI, createInputSplits() is implemented this way:

override def createInputSplits(minNumSplits : Int) = {
    files.flatMap((f) => {

where paths is a parameter of the input format constructor (just like the custom parser, as
shown above)

Do you think it would be useful if I opened a Stack Overflow post about it (maybe with the custom parser


On 14 Jul 2015, at 18:50, Stephan Ewen <sewen@apache.org> wrote:

For the approach that I outlined, you need to subclass the file input format.

In that subclass, you store the list of URIs (in a new variable), and override the "createInputSplits()"


On Tue, Jul 14, 2015 at 6:42 PM, Michele Bertoni <michele1.bertoni@mail.polimi.it> wrote:
Hi Stephan, I started working on this today, but I am having a problem.

Can you describe the procedure in a little more detail?
Actually, I don’t understand how to give the input format the list of URIs, since it will
try to put it in a Path variable.

createInputSplits() does not receive the path as an argument but takes it from that variable.


On 26 Jun 2015, at 12:28, Michele Bertoni <michele1.bertoni@mail.polimi.it> wrote:

Later I will post the question and quote your answer with the solution :)

On 26 Jun 2015, at 12:27, Stephan Ewen <sewen@apache.org> wrote:

Seems like a good idea to collect these questions.

Stack Overflow is also a good place for "useful tricks"...

On Fri, Jun 26, 2015 at 12:25 PM, Michele Bertoni <michele1.bertoni@mail.polimi.it> wrote:
Got it!
I will try, thanks! :)

What about writing a section about it in the programming guide?
I found a couple of topics about the readers in the mailing list, so it seems it may be helpful.

On 26 Jun 2015, at 12:21, Stephan Ewen <sewen@apache.org> wrote:

Sure, just override the "createInputSplits()" method. Call "super.createInputSplits()" for each
of your file paths and then combine the results into one array that you return.

That should do it...
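
The advice above can be sketched roughly as follows. This is a minimal illustration, not the code from the thread: the class name MultiPathLineFormat, the paths constructor parameter, and the String record type are hypothetical, and renumbering the combined splits is an added precaution rather than something the thread asks for.

```scala
import org.apache.flink.api.common.io.DelimitedInputFormat
import org.apache.flink.core.fs.FileInputSplit

// Hypothetical subclass: the class name, the `paths` parameter, and the String
// record type are illustrative and not taken from the thread.
class MultiPathLineFormat(paths: Seq[String]) extends DelimitedInputFormat[String] {

  override def createInputSplits(minNumSplits: Int): Array[FileInputSplit] = {
    // Point the inherited format at each path in turn and let the parent
    // class compute the splits for that path.
    val perPath: Seq[FileInputSplit] = paths.flatMap { p =>
      setFilePath(p)
      super.createInputSplits(minNumSplits).toSeq
    }
    // Assumption: renumber the combined splits so that split numbers stay
    // unique, because each super call starts counting from zero again.
    perPath.zipWithIndex.map { case (s, i) =>
      new FileInputSplit(i, s.getPath, s.getStart, s.getLength, s.getHostnames)
    }.toArray
  }

  // Decode each delimited record as a UTF-8 string.
  override def readRecord(reuse: String, bytes: Array[Byte], offset: Int, numBytes: Int): String =
    new String(bytes, offset, numBytes, "UTF-8")
}
```

Each returned split carries its own file path, so the readers can open the right file regardless of which path was set last on the format.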

On Fri, Jun 26, 2015 at 12:19 PM, Michele Bertoni <michele1.bertoni@mail.polimi.it> wrote:
Hi Stephan, thanks for answering.
Right now I am using an extension of DelimitedInputFormat. Is there a way to combine it
with option 2?

On 26 Jun 2015, at 12:17, Stephan Ewen <sewen@apache.org> wrote:

There are two ways you can realize that:

1) Create multiple sources and union them. This is easy, but probably a bit less efficient.

2) Override the FileInputFormat's createInputSplits method to take a union of the paths to
create a list of all files and file splits that will be read.
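
Option 1 could look roughly like the sketch below, assuming plain text files read through the Scala DataSet API. The object name UnionOfSources and the helper readAll are hypothetical, and the URIs are the placeholders from the question.

```scala
import org.apache.flink.api.scala._

object UnionOfSources {

  // Builds one text source per path and unions them into a single DataSet.
  // Note: reduce fails on an empty list, so `paths` must contain at least one entry.
  def readAll(env: ExecutionEnvironment, paths: Seq[String]): DataSet[String] =
    paths.map(p => env.readTextFile(p)).reduce(_ union _)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Placeholder URIs in the style of the question; replace them with real files.
    val paths = Seq("hdfs://file1", "hdfs://file2")
    readAll(env, paths).print()
  }
}
```

This creates one source operator per file, which is probably why this option is described as easy but a bit less efficient.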


On Fri, Jun 26, 2015 at 12:12 PM, Michele Bertoni <michele1.bertoni@mail.polimi.it> wrote:
Hi everybody,
is there a way to specify a list of URIs (“hdfs://file1”, “hdfs://file2”, …) and open
them as different files?
I know I can open the entire directory, but I want to be able to select a subset of files
in the directory.

