apex-users mailing list archives

From "Mukkamula, Suryavamshivardhan (CWM-NR)" <suryavamshivardhan.mukkam...@rbc.com>
Subject RE: Multiple directories
Date Wed, 15 Jun 2016 20:55:27 GMT
Hi Ram/Team,

I was able to create an operator that reads multiple directories, parses each file
according to its own configuration file, and writes the output files to different directories.

However, I have some questions regarding the design.


-> We have 120 directories to scan on HDFS. If we use parallel partitioning with about 250MB
of memory per operator partition, this operator might need around 30GB of RAM in total.
Are these figures going to create any problems in production?

-> Should I use a scheduler to run this as a batch job, or define a next scan time and keep
the DT job running continuously? If I run the DT job continuously, I assume the memory it holds
stays allocated and is not available to other jobs on the cluster. Please clarify.
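
For reference, the arithmetic behind the 30GB figure can be sketched as below. This assumes one partition per directory and 250MB per partition, as in the first question; the class and method names are illustrative only, and the estimate leaves out any other containers the application needs.

```java
// Back-of-the-envelope estimate; assumes one partition per directory.
final class MemoryEstimate {
    // Total memory in MB for the given number of partitions.
    static long totalMegabytes(int partitions, int megabytesPerPartition) {
        return (long) partitions * megabytesPerPartition;
    }

    public static void main(String[] args) {
        long totalMb = totalMegabytes(120, 250);
        // Report the total in MB and in decimal GB (1 GB = 1000 MB).
        System.out.println(totalMb + " MB = " + (totalMb / 1000.0) + " GB");
    }
}
```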

Regards,
Surya Vamshi

From: Munagala Ramanath [mailto:ram@datatorrent.com]
Sent: 2016, June, 05 10:24 PM
To: users@apex.apache.org
Subject: Re: Multiple directories

Some sample code to monitor multiple directories is now available at:
https://github.com/DataTorrent/examples/tree/master/tutorials/fileIO-multiDir

It shows how to use a custom implementation of definePartitions() to create
multiple partitions of the file input operator and group them
into "slices" where each slice monitors a single directory.

Ram

On Wed, May 25, 2016 at 9:55 AM, Munagala Ramanath <ram@datatorrent.com> wrote:
I'm hoping to have a sample sometime next week.

Ram

On Wed, May 25, 2016 at 9:30 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <suryavamshivardhan.mukkamula@rbc.com> wrote:
Thank you so much for your advice, Ram. Option (a) would be ideal for my requirement.

Do you have a usage sample for partitioning in which each partition has its own
configuration?

Regards,
Surya Vamshi

From: Munagala Ramanath [mailto:ram@datatorrent.com]
Sent: 2016, May, 25 12:11 PM
To: users@apex.apache.org
Subject: Re: Multiple directories

You have 2 options: (a) AbstractFileInputOperator (b) FileSplitter/BlockReader

For (a), each partition (i.e. replica of the operator) can scan only a single directory, so
if you have 100 directories, you can simply start with 100 partitions; since each partition
is scanning its own directory, you don't need to worry about which files the lines came from.
This approach, however, needs a custom definePartitions() implementation in your subclass to
assign the appropriate directory and XML parsing config file to each partition; it also needs
adequate cluster resources to be able to spin up the required number of partitions.
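
The assignment step described for option (a) can be sketched in plain Java, without the Apex API. The names PartitionPlanner, FeedAssignment and assign are hypothetical and the paths are made up; in a real operator this pairing would presumably happen inside the definePartitions() override, with one operator instance created per entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this is not the Apex Partitioner API.
final class PartitionPlanner {

    // Hypothetical holder for what a single partition needs to know.
    static final class FeedAssignment {
        final int partitionIndex;
        final String directory;
        final String configFile;

        FeedAssignment(int partitionIndex, String directory, String configFile) {
            this.partitionIndex = partitionIndex;
            this.directory = directory;
            this.configFile = configFile;
        }
    }

    // Pairs the i-th directory with the i-th config file, one partition each.
    static List<FeedAssignment> assign(List<String> directories, List<String> configFiles) {
        if (directories.size() != configFiles.size()) {
            throw new IllegalArgumentException("need exactly one config file per directory");
        }
        List<FeedAssignment> plan = new ArrayList<>();
        for (int i = 0; i < directories.size(); i++) {
            plan.add(new FeedAssignment(i, directories.get(i), configFiles.get(i)));
        }
        return plan;
    }

    public static void main(String[] args) {
        // Made-up example paths.
        List<FeedAssignment> plan = assign(
            Arrays.asList("/feeds/a", "/feeds/b"),
            Arrays.asList("/conf/a.xml", "/conf/b.xml"));
        for (FeedAssignment a : plan) {
            System.out.println(a.partitionIndex + ": " + a.directory + " -> " + a.configFile);
        }
    }
}
```

Because each partition carries its own directory and config file, no per-line bookkeeping about the source file is needed.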

For (b), there is some documentation in the Operators section at http://docs.datatorrent.com/
including sample code. These operators support scanning multiple directories out of the box
but have more elaborate configuration options. Check this out and see if it works in your
use case.

Ram

On Wed, May 25, 2016 at 8:17 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <suryavamshivardhan.mukkamula@rbc.com> wrote:
Hello Ram/Team,

My requirement is to read input feeds from different locations on HDFS and parse those files
using XML configuration files (each input feed has a configuration file that defines the
fields inside the feed).

My approach: I would like to define a mapping file that contains, for each feed, its identifier,
its location, and the location of its configuration file. I would read this mapping file at
initial load within the setup() method and define my DirectoryScan.acceptFiles. My challenge is
that when I read the files, I need to parse each line according to the configuration file for
its feed. How do I know which file a line came from? If I know this, I can read the
corresponding configuration file before parsing the line.
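
As an illustration of the mapping-file idea, here is a minimal sketch in plain Java. The file layout (one feed per line as feedId,feedLocation,configLocation, with # comments) is an assumption, as are the names FeedMapping, parse and configFor and all example paths; the lookup also assumes absolute paths and that each feed lives in its own directory.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumed mapping format: feedId,feedLocation,configLocation (one feed per line).
final class FeedMapping {

    // Builds a lookup from feed directory to the config file describing its fields.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> configByDirectory = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = trimmed.split(",");
            if (parts.length != 3) {
                throw new IllegalArgumentException("expected 3 fields: " + line);
            }
            // parts[0] is the feed identifier; it is not needed for this lookup.
            configByDirectory.put(parts[1].trim(), parts[2].trim());
        }
        return configByDirectory;
    }

    // Finds the config for a file via its parent directory (absolute paths assumed).
    static String configFor(Map<String, String> configByDirectory, String filePath) {
        int slash = filePath.lastIndexOf('/');
        String parent = slash > 0 ? filePath.substring(0, slash) : "/";
        return configByDirectory.get(parent);
    }

    public static void main(String[] args) {
        // Made-up mapping entries.
        Map<String, String> mapping = parse(Arrays.asList(
            "# id,location,config",
            "feedA,/feeds/a,/conf/a.xml",
            "feedB,/feeds/b,/conf/b.xml"));
        System.out.println(configFor(mapping, "/feeds/b/part-0001.txt"));
    }
}
```

With such a lookup, knowing the path of the file currently being read is enough to pick the right configuration before parsing its lines.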

Please let me know how to handle this.

Regards,
Surya Vamshi

From: Munagala Ramanath [mailto:ram@datatorrent.com]
Sent: 2016, May, 24 5:49 PM
To: Mukkamula, Suryavamshivardhan (CWM-NR)
Subject: Multiple directories

One way of addressing the issue is to use some sort of external tool (like a script) to
copy all the input files to a common directory (making sure that the file names are
unique to prevent one file from overwriting another) before the Apex application starts.

The Apex application then starts and processes files from this directory.
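
The copy step could be sketched as follows. This is a minimal sketch that works on a local filesystem with java.nio; copying within HDFS would need Hadoop tooling instead. CollectFiles and collect are hypothetical names, and uniqueness relies on the assumption that the source directories have distinct names.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Local-filesystem sketch; not an HDFS client.
final class CollectFiles {

    // Copies every regular file from each source directory into target,
    // prefixing the source directory name so that file names stay unique.
    static int collect(Iterable<Path> sources, Path target) throws IOException {
        Files.createDirectories(target);
        int copied = 0;
        for (Path source : sources) {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(source)) {
                for (Path file : files) {
                    if (!Files.isRegularFile(file)) {
                        continue; // ignore subdirectories
                    }
                    String unique = source.getFileName() + "_" + file.getFileName();
                    // Without REPLACE_EXISTING, copy fails rather than overwrite a file.
                    Files.copy(file, target.resolve(unique));
                    copied++;
                }
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: CollectFiles <target> <source>...");
            return;
        }
        List<Path> sources = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            sources.add(Paths.get(args[i]));
        }
        System.out.println(collect(sources, Paths.get(args[0])) + " files copied");
    }
}
```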

If you set the partition count of the file input operator to N, it will create N partitions
and the files will be automatically distributed among the partitions. The partitions will
work in parallel.
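
The message does not say how files are divided among the N partitions. One simple scheme, shown here purely as an illustration and not as a statement of what the operator does internally, is to hash each file path and take it modulo N, so every file maps to exactly one partition. FileDistribution and partitionFor are hypothetical names.

```java
// Illustrative hash-based split; an assumption, not the operator's documented behavior.
final class FileDistribution {

    // Returns a partition index in [0, partitionCount) for the given file path.
    static int partitionFor(String filePath, int partitionCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(filePath.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        String[] paths = {"/in/a.txt", "/in/b.txt", "/in/c.txt"};
        for (String path : paths) {
            System.out.println(path + " -> partition " + partitionFor(path, 4));
        }
    }
}
```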

Ram

_______________________________________________________________________

This [email] may be privileged and/or confidential, and the sender does not waive any related
rights and obligations. Any distribution, use or copying of this [email] or the information
it contains by other than an intended recipient is unauthorized. If you received this [email]
in error, please advise the sender (by return [email] or otherwise) immediately. You have
consented to receive the attached electronically at the above-noted address; please retain
a copy of this confirmation for future reference.

