avro-dev mailing list archives

From "Deepak Kumar V (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1418) AvroMultipleOutputs should support sync-able writers
Date Fri, 20 Dec 2013 18:24:10 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854386#comment-13854386 ]

Deepak Kumar V commented on AVRO-1418:

Hi Doug,
You're correct, I did not think about the emails triggered by edits. Do you want me to remove
the solution from the description?

I was under the impression that we are not supposed to modify the existing AvroMultipleOutputs.
(And now I think, why not?) Maybe because I was developing the solution for my own use case, I
was fine with extending rather than changing the base class.

We could add sync() directly to AvroMultipleOutputs. That would avoid multiple subclasses (3
each for AvroKey* and AvroKeyValue*, plus a Markable interface).
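
To make the "sync() directly on the base class" idea concrete, here is a minimal sketch.
It does not use the Avro classes: SyncableWriter, InMemorySyncableWriter and
SketchMultipleOutputs are hypothetical stand-ins invented for illustration, and the
"<SYNC>" line only imitates a synchronization marker. The sync() signature follows the
one proposed in the issue description.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a record writer that can emit a sync marker.
interface SyncableWriter {
    void write(String record) throws IOException;
    void sync() throws IOException;
}

// In-memory writer for illustration only; this is not an Avro class.
class InMemorySyncableWriter implements SyncableWriter {
    private final StringBuilder buffer = new StringBuilder();
    private int syncCount = 0;

    @Override
    public void write(String record) {
        buffer.append(record).append('\n');
    }

    @Override
    public void sync() {
        // A real writer would end the current block and emit a marker here.
        syncCount++;
        buffer.append("<SYNC>\n");
    }

    int getSyncCount() { return syncCount; }
    String contents() { return buffer.toString(); }
}

// Sketch of a multiple-outputs class that offers sync() itself,
// so no Markable* subclasses are needed.
class SketchMultipleOutputs {
    // One writer per (namedOutput, baseOutputPath) pair, created lazily.
    private final Map<String, SyncableWriter> writers = new HashMap<>();

    SyncableWriter writerFor(String namedOutput, String baseOutputPath) {
        String key = namedOutput + "/" + baseOutputPath;
        return writers.computeIfAbsent(key, k -> new InMemorySyncableWriter());
    }

    public void write(String namedOutput, String baseOutputPath, String record)
            throws IOException, InterruptedException {
        writerFor(namedOutput, baseOutputPath).write(record);
    }

    // Same shape as the signature proposed in the issue description.
    public void sync(String namedOutput, String baseOutputPath)
            throws IOException, InterruptedException {
        writerFor(namedOutput, baseOutputPath).sync();
    }
}
```

A caller would write a record, call sync("errors", "part-a") on the same named output,
and keep writing; only the writer behind that named output receives the marker.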

> AvroMultipleOutputs should support sync-able writers
> ----------------------------------------------------
>                 Key: AVRO-1418
>                 URL: https://issues.apache.org/jira/browse/AVRO-1418
>             Project: Avro
>          Issue Type: New Feature
>            Reporter: Deepak Kumar V
> DataFileWriter supports APIs like sync() (which emits synchronization markers) so that
> DataFileReader can later use sync() or seek() to move to a particular synchronization
> AvroMultipleOutputs does not support or provide a way to invoke sync on its individual
> writers. One could extend its behavior; however, its design is closed for extension (all state
> is private, and getRecordWriter() is private). Hence AvroMultipleOutputs must first be modified
> to support extension, and additional classes must be provided to support a sync-able
> Solution
> ======
> I) MarkableAvroMultipleOutputs: allows users to set synchronization points before/after
> writing key-value pairs with AvroMultipleOutputs.write().
> A public API to invoke sync on a named output.
> Ex: public void sync(String namedOutput, String baseOutputPath) throws IOException, InterruptedException
> To achieve the above, AvroMultipleOutputs should be modified to allow additional
> behavior. The following must be marked protected instead of private:
> 1) private static void checkBaseOutputPath(String outputPath) {}
> 2) private static void checkNamedOutputName(JobContext job, String namedOutput, boolean alreadyDefined) {}
> 3) private TaskInputOutputContext<?, ?, ?, ?> context;
> 4) private Set<String> namedOutputs;
> 5) private synchronized RecordWriter getRecordWriter(TaskAttemptContext taskContext, String baseFileName)
> II) AvroKeyValueRecordWriter, which AvroMultipleOutputs uses as the writer for individual
> outputs, is also closed for extension. It must allow sync() to be invoked on its writer.
> To achieve that, the following private member must be marked protected:
> 1) private final DataFileWriter<GenericRecord> mAvroFileWriter;
> A MarkableAvroKeyValueRecordWriter must be provided that exposes a public API to invoke
> sync on its writer.
> public void sync() throws IOException {}
> III) A MarkableAvroKeyValueOutputFormat that extends AvroKeyValueOutputFormat and uses
> Include similar support for AvroKeyOutputFormat & AvroKeyRecordWriter.
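
The extension-based approach from part II of the quoted description can be sketched as
follows. This is only an illustration under stated assumptions: FakeFileWriter stands in
for the real file writer, and BaseKeyValueRecordWriter, Markable and
MarkableKeyValueRecordWriter are hypothetical names modeled on the ones in the issue, not
the actual Avro classes.

```java
import java.io.IOException;

// Hypothetical stand-in for the underlying file writer; it only counts calls.
class FakeFileWriter {
    private int records = 0;
    private int markers = 0;

    void append(String datum) throws IOException { records++; }
    void sync() throws IOException { markers++; }

    int recordCount() { return records; }
    int markerCount() { return markers; }
}

// Interface for writers that can emit a sync marker on request.
interface Markable {
    void sync() throws IOException;
}

// Base record writer: the field is protected rather than private,
// which is the change the issue asks for so that subclasses can reach it.
class BaseKeyValueRecordWriter {
    protected final FakeFileWriter fileWriter;

    BaseKeyValueRecordWriter(FakeFileWriter fileWriter) {
        this.fileWriter = fileWriter;
    }

    public void write(String key, String value) throws IOException {
        fileWriter.append(key + "=" + value);
    }
}

// Subclass that exposes sync() publicly by delegating to the inherited writer.
class MarkableKeyValueRecordWriter extends BaseKeyValueRecordWriter implements Markable {
    MarkableKeyValueRecordWriter(FakeFileWriter fileWriter) {
        super(fileWriter);
    }

    @Override
    public void sync() throws IOException {
        fileWriter.sync();
    }
}
```

The cost of this design is visible even in the sketch: every writer type needs its own
Markable subclass, whereas a sync() method on the base class would need none.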

This message was sent by Atlassian JIRA
