flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: open multiple file from list of uri
Date Wed, 15 Jul 2015 11:16:35 GMT
If you want to work without the placeholder, simply do:
"env.createInput(new myDelimitedInputFormat(parser)(paths))"

The "createInputSplits()" method looks good.
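
For anyone finding this thread later, here is a minimal sketch of the whole approach put together. It is not Michele's actual code: the class name MultiPathLineFormat, the object MultiPathExample and the String record type are made up for illustration, and the custom parser is left out. The call to setFilePath in the constructor is an assumption that the base class wants some path to be configured before the splits are created.

```scala
import org.apache.flink.api.common.io.DelimitedInputFormat
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileInputSplit

// Hypothetical input format: reads each delimited record as a String
// from an explicit list of paths instead of a single file or directory.
class MultiPathLineFormat(paths: Seq[String]) extends DelimitedInputFormat[String] {

  // Assumption: the base class expects some file path to be configured,
  // so the first path of the list is set up front.
  setFilePath(paths.head)

  // Create the splits of every path in turn and return them as one array.
  override def createInputSplits(minNumSplits: Int): Array[FileInputSplit] =
    paths.flatMap { p =>
      setFilePath(p)
      super.createInputSplits(minNumSplits).toSeq
    }.toArray

  // Turn the bytes of one delimited record into a String.
  override def readRecord(reuse: String, bytes: Array[Byte], offset: Int, numBytes: Int): String =
    new String(bytes, offset, numBytes, "UTF-8")
}

object MultiPathExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Placeholder URIs, taken from the original question.
    val paths = Seq("hdfs://file1", "hdfs://file2")
    // No placeholder path is needed here: the format creates its own splits.
    val lines: DataSet[String] = env.createInput(new MultiPathLineFormat(paths))
    lines.print()
  }
}
```

Each split carries its own file path, so the readers open the right file no matter which path was set last on the format.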

Greetings,
Stephan


On Tue, Jul 14, 2015 at 11:42 PM, Michele Bertoni <
michele1.bertoni@mail.polimi.it> wrote:

>  Ok, thank you, I have solved it now!
>
>
>  The problem was in env.readFile(myInputFormat, path):
>
>  now that the path is actually a list of paths, what should I pass to it?
>
>
>
>  I solved it this way:
>
>  env.readFile(new myDelimitedInputFormat(parser)(paths), paths.head)
>
>  where paths.head gives readFile a URL that is just a
> “placeholder” and seems never to be used, while the custom input format takes
> care of creating the splits from the list of paths.
>
>  I tried it and it works.
> Is this the correct way to do it? :)
>
>
>
>  FYI, createInputSplits is implemented this way:
>
>  override def createInputSplits(minNumSplits: Int): Array[FileInputSplit] = {
>     // Create the splits of each path in turn and combine them into one array.
>     paths.flatMap { f =>
>       super.setFilePath(f)
>       super.createInputSplits(minNumSplits)
>     }.toArray
>   }
>
>  where paths is a parameter of the input format constructor (just like
> the custom parser shown above).
>
>  Do you think it would be useful if I opened a Stack Overflow post about it (maybe
> with the custom parser too)?
>
>
>
>
>  cheers
> michele
>
>
>  On 14 Jul 2015, at 18:50, Stephan Ewen <sewen@apache.org> wrote:
>
>  For the approach that I outlined, you need to subclass the file input
> format.
>
>  In that subclass, you store the list of URIs (in a new variable), and
> override the "createInputSplits()" method.
>
>  Stephan
>
> On Tue, Jul 14, 2015 at 6:42 PM, Michele Bertoni <
> michele1.bertoni@mail.polimi.it> wrote:
>
>> Hi Stephan, I started working on this today, but I am having a problem.
>>
>>  Can you describe the procedure in a little more detail?
>> I don’t understand how to give the input format the list of
>> URIs, since it will try to put it in a Path variable.
>>
>>  createInputSplits() does not receive the path as an argument but takes it from that
>> variable.
>>
>>
>>  Thanks,
>> Michele
>>
>>
>>  On 26 Jun 2015, at 12:28, Michele Bertoni <
>> michele1.bertoni@mail.polimi.it> wrote:
>>
>>  Right!
>> Later I will post the question, quoting your answer with the solution :)
>>
>>  On 26 Jun 2015, at 12:27, Stephan Ewen <sewen@apache.org> wrote:
>>
>>  Seems like a good idea to collect these questions.
>>
>>  Stackoverflow is also a good place for "useful tricks"...
>>
>> On Fri, Jun 26, 2015 at 12:25 PM, Michele Bertoni <
>> michele1.bertoni@mail.polimi.it> wrote:
>>
>>> Got it!
>>> I will try it, thanks! :)
>>>
>>>  What about writing a section on it in the programming guide?
>>> I found a couple of topics about the readers on the mailing list, so it
>>> seems it might be helpful.
>>>
>>>
>>>
>>>  On 26 Jun 2015, at 12:21, Stephan Ewen <sewen@apache.org> wrote:
>>>
>>>  Sure, just override the "createInputSplits()" method. Call
>>> "super.createInputSplits()" for each of your file paths and then combine the results
>>> into one array that you return.
>>>
>>>  That should do it...
>>>
>>> On Fri, Jun 26, 2015 at 12:19 PM, Michele Bertoni <
>>> michele1.bertoni@mail.polimi.it> wrote:
>>>
>>>> Hi Stephan, thanks for answering.
>>>> Right now I am using an extension of the DelimitedInputFormat; is there
>>>> a way to combine it with option 2?
>>>>
>>>>
>>>>
>>>>  On 26 Jun 2015, at 12:17, Stephan Ewen <sewen@apache.org> wrote:
>>>>
>>>>  There are two ways you can realize that:
>>>>
>>>>  1) Create multiple sources and union them. This is easy, but probably
>>>> a bit less efficient.
>>>>
>>>>  2) Override the FileInputFormat's createInputSplits method to take a
>>>> union of the paths to create a list of all files and file splits that will
>>>> be read.
>>>>
>>>>  Stephan
>>>>
>>>>
>>>> On Fri, Jun 26, 2015 at 12:12 PM, Michele Bertoni <
>>>> michele1.bertoni@mail.polimi.it> wrote:
>>>>
>>>>> Hi everybody,
>>>>> is there a way to specify a list of URIs (“hdfs://file1”, “hdfs://file2”, …)
>>>>> and open them as different files?
>>>>> I know I could open the entire directory, but I want to be able to
>>>>> select a subset of the files in the directory.
>>>>>
>>>>> thanks
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
