apex-users mailing list archives

From "Mukkamula, Suryavamshivardhan (CWM-NR)" <suryavamshivardhan.mukkam...@rbc.com>
Subject RE: Reading Multiple Direcotries in sequence
Date Thu, 23 Jun 2016 14:35:33 GMT
Hi Ram,

In my case, I have 120 directories to read in one batch job per day. With your guidance
I have successfully implemented the parallel partition approach with a single logical
operator, and it is working as expected. I create the partitions based on the number of
sources listed in the properties file.

If I give 250 MB to each operator, I need around 12 containers of 4 GB RAM each (each container
can handle 10 parallel operators), which comes to around 50 GB of RAM to process the batch.
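
As a quick check of the arithmetic above, here is a minimal sketch. The class and method names are illustrative only and are not part of Apex or of my application; the numbers are the ones quoted in this message.

```java
public class ContainerEstimate {
    // Containers needed when each one holds at most perContainer operators (ceiling division).
    public static int containersNeeded(int operators, int perContainer) {
        return (operators + perContainer - 1) / perContainer;
    }

    public static void main(String[] args) {
        int directories = 120;   // one partition (operator) per directory
        int operatorMb = 250;    // memory given to each operator
        int perContainer = 10;   // parallel operators per container
        int containerMb = 4096;  // 4 GB per container

        int containers = containersNeeded(directories, perContainer);
        long operatorTotalMb = (long) directories * operatorMb;
        long containerTotalMb = (long) containers * containerMb;

        System.out.println("containers = " + containers);
        System.out.println("operator memory (MB) = " + operatorTotalMb);
        System.out.println("container memory (MB) = " + containerTotalMb);
    }
}
```

With these inputs the operators themselves account for 30,000 MB, while 12 containers of 4 GB reserve 48 GB on the cluster, which is where the "around 50 GB" figure comes from.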

My concerns are below; please provide your suggestions:

- Is this memory utilization on the cluster OK?

- If not, I can run two applications sequentially (60 directories per application), but
I would have to schedule the batches at different times, maybe using Oozie or Spring Batch.
What do you suggest?

- As per your comments below, how do I make my partitions wait for a trigger from Kafka or
an entry in a database? Is it inside definePartitions? Do you have any sample code for this?
Currently I generate the partitions from a source property in the properties file, one per
directory. I process each file differently and write the output files to different
directories.

Regards,
Surya Vamshi

From: Munagala Ramanath [mailto:ram@datatorrent.com]
Sent: 2016, June, 23 9:54 AM
To: users@apex.apache.org
Subject: Re: Reading Multiple Direcotries in sequence

No, I don't have an example, but several approaches are possible depending on the
exact requirements, e.g.:
1. How large is the number of directories?
2. Is the desired sequence a total order or a partial order (i.e. a DAG, https://en.wikipedia.org/wiki/Partially_ordered_set)?

If the number of directories is small, you can use one operator per directory and link them
with ports in the desired sequence. Each operator sends a control tuple to the next when it
wants the next one to start. Each operator waits for this trigger and emits tuples in the
idle time handler, for example:

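import com.datatorrent.api.Operator;
import com.datatorrent.lib.io.fs.AbstractFileInputOperator;

// Sketch only: the trigger input port and the file-reading methods are omitted.
public abstract class DownStreamReceiver extends AbstractFileInputOperator<String>
    implements Operator.IdleTimeHandler
{
  // set to true only after receiving the trigger from the 1st reader
  protected transient boolean upstreamDoneReading;

  @Override
  public void handleIdleTime()
  {
    if (upstreamDoneReading) {
      emitTuples();
    }
  }
}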

If the number is large, you can explore the earlier partitioned approach but have each
partition look for a trigger from an external source, like a Kafka queue or an entry in a DB,
to start processing.
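
A rough, library-free sketch of that gating idea follows. It is not Apex code: GatedPartition, its methods and the BooleanSupplier standing in for the Kafka or DB check are all hypothetical names for illustration, and it assumes each running partition polls the trigger itself from something like the idle time handler shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

public class GatedPartition {
    private final String directory;
    // Hypothetical stand-in for "has my Kafka message / DB row arrived?"
    private final BooleanSupplier trigger;
    private boolean started;
    private final List<String> emitted = new ArrayList<>();

    public GatedPartition(String directory, BooleanSupplier trigger) {
        this.directory = directory;
        this.trigger = trigger;
    }

    // Called repeatedly while the partition has nothing else to do.
    public void handleIdleTime() {
        if (!started && trigger.getAsBoolean()) {
            started = true; // latch: once triggered, keep processing
        }
        if (started) {
            emitTuples();
        }
    }

    // Placeholder for the real directory scan and tuple emission.
    private void emitTuples() {
        emitted.add("scan:" + directory);
    }

    public boolean isStarted() {
        return started;
    }

    public List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        boolean[] flag = {false};
        GatedPartition p = new GatedPartition("/data/in/dir1", () -> flag[0]);
        p.handleIdleTime();                     // trigger not set: nothing emitted
        System.out.println(p.emitted().size());
        flag[0] = true;                         // external trigger arrives
        p.handleIdleTime();                     // partition starts processing
        System.out.println(p.emitted().size());
    }
}
```

The flag is latched so that a trigger which is consumed or reset externally does not stop a partition that has already started; whether that is the right behaviour depends on the exact ordering requirements.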

Ram

On Thu, Jun 23, 2016 at 6:11 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <suryavamshivardhan.mukkamula@rbc.com> wrote:
Hi Ram,

Do you have sample DT application code for reading multiple directories in sequence?

Or could you throw some light on how I would achieve that with AbstractFileInputOperator?

Regards,
Surya Vamshi


_______________________________________________________________________

If you received this email in error, please advise the sender (by return email or otherwise)
immediately. You have consented to receive the attached electronically at the above-noted
email address; please retain a copy of this confirmation for future reference.

Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur immédiatement, par
retour de courriel ou par un autre moyen. Vous avez accepté de recevoir le(s) document(s)
ci-joint(s) par voie électronique à l'adresse courriel indiquée ci-dessus; veuillez conserver
une copie de cette confirmation pour les fins de reference future.
