avro-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Is it possible to append to an already existing avro file
Date Wed, 06 Feb 2013 18:03:33 GMT

On Feb 5, 2013, at 7:30pm, Michael Malak wrote:

> I don't believe a Hadoop FileSystem is a Java OutputStream?

The Hadoop FileSystem.append() method returns an FSDataOutputStream, which is a subclass
of java.io.OutputStream, so it can be passed as the OutputStream argument to appendTo().
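
A rough sketch of how the pieces could fit together, assuming the avro, avro-mapred and Hadoop client jars are on the classpath and that the target file system actually supports append. The class and method names (HdfsAvroAppend, appendRecords) are made up for illustration; FsInput is the SeekableInput implementation from the avro-mapred module. Treat this as a starting point rather than a verified recipe:

```java
import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of Avro or Hadoop.
final class HdfsAvroAppend {

    /** Appends records to an existing Avro data file on a Hadoop FileSystem. */
    static void appendRecords(Configuration conf, Path avroFilePath,
                              Iterable<GenericRecord> records) throws IOException {
        FileSystem fs = avroFilePath.getFileSystem(conf);
        try (
            // SeekableInput over the existing file, so the writer can read
            // the schema and sync marker from the header.
            FsInput in = new FsInput(avroFilePath, conf);
            // FSDataOutputStream positioned at the end of the file; this
            // fails if the file system does not support append.
            OutputStream out = fs.append(avroFilePath);
            DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>())
                    .appendTo(in, out)
        ) {
            for (GenericRecord record : records) {
                // Records must match the schema already stored in the file.
                writer.append(record);
            }
        }
    }
}
```

Whether this works in practice depends on the HDFS version and configuration, as Doug notes below.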

-- Ken

> 
> --- On Tue, 2/5/13, Doug Cutting <cutting@apache.org> wrote:
> 
>> From: Doug Cutting <cutting@apache.org>
>> Subject: Re: Is it possible to append to an already existing avro file
>> To: user@avro.apache.org
>> Date: Tuesday, February 5, 2013, 5:27 PM
>> It will work on an OutputStream that supports append.
>> 
>> http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(org.apache.avro.file.SeekableInput, java.io.OutputStream)
>> 
>> So it depends on how well HDFS implements FileSystem#append(), not on
>> any changes in Avro.
>> 
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#append(org.apache.hadoop.fs.Path)
>> 
>> I have no recent personal experience with append in HDFS.  Does anyone
>> else here?
>> 
>> Doug
>> 
>> On Tue, Feb 5, 2013 at 4:10 PM, Michael Malak <michaelmalak@yahoo.com> wrote:
>>> My understanding is that will append to a file on the local filesystem,
>>> but not to a file on HDFS.
>>> 
>>> --- On Tue, 2/5/13, Doug Cutting <cutting@apache.org> wrote:
>>> 
>>>> From: Doug Cutting <cutting@apache.org>
>>>> Subject: Re: Is it possible to append to an already existing avro file
>>>> To: user@avro.apache.org
>>>> Date: Tuesday, February 5, 2013, 5:08 PM
>>>> The Jira is:
>>>> 
>>>> https://issues.apache.org/jira/browse/AVRO-1035
>>>> 
>>>> It is possible to append to an existing Avro file:
>>>> 
>>>> http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(java.io.File)
>>>> 
>>>> Should we close that issue as "fixed"?
>>>> 
>>>> Doug
>>>> 
>>>> On Fri, Feb 1, 2013 at 11:32 AM, Michael Malak <michaelmalak@yahoo.com> wrote:
>>>>> Was a JIRA ticket ever created regarding appending to an existing Avro
>>>>> file on HDFS?
>>>>> 
>>>>> What is the status of such a capability, a year out from when the issue
>>>>> below was raised?
>>>>> 
>>>>> On Wed, 22 Feb 2012 10:57:48 +0100, "Vyacheslav Zholudev"
>>>>> <vyacheslav.zholudev@gmail.com> wrote:
>>>>> 
>>>>>> Thanks for your reply, I suspected this.
>>>>>> 
>>>>>> I will create a JIRA ticket.
>>>>>> 
>>>>>> Vyacheslav
>>>>>> 
>>>>>> On Feb 21, 2012, at 6:02 PM, Scott Carey wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On 2/21/12 7:29 AM, "Vyacheslav Zholudev" <vyacheslav.zholudev@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Yep, I saw that method as well as the stackoverflow post. However, I'm
>>>>>>>> interested in how to append to a file on an arbitrary file system, not
>>>>>>>> only on the local one.
>>>>>>>> 
>>>>>>>> I want to get an OutputStream based on the Path and the FileSystem
>>>>>>>> implementation and then pass it for appending to avro methods.
>>>>>>>> 
>>>>>>>> Is that possible?
>>>>>>> 
>>>>>>> It is not possible without modifying DataFileWriter. Please open a JIRA
>>>>>>> ticket.
>>>>>>> 
>>>>>>> It could not simply append to an OutputStream, since it must either:
>>>>>>> * Seek to the start to validate the schemas match and find the sync
>>>>>>> marker, or
>>>>>>> * Trust that the schemas match and find the sync marker from the last
>>>>>>> block
>>>>>>> 
>>>>>>> DataFileWriter cannot refer to Hadoop classes such as FileSystem, but we
>>>>>>> could add something to the mapred module that takes a Path and
>>>>>>> FileSystem and returns something that implements an interface that
>>>>>>> DataFileWriter can append to.  This would be something that is both a
>>>>>>> http://avro.apache.org/docs/1.6.2/api/java/org/apache/avro/file/SeekableInput.html
>>>>>>> and an OutputStream, or has both an InputStream from the start of the
>>>>>>> existing file and an OutputStream at the end.
>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Vyacheslav
>>>>>>>> 
>>>>>>>> On Feb 21, 2012, at 5:29 AM, Harsh J wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> Use the appendTo feature of the DataFileWriter. See
>>>>>>>>> 
>>>>>>>>> http://avro.apache.org/docs/1.6.2/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(java.io.File)
>>>>>>>>> 
>>>>>>>>> For a quick setup example, read also:
>>>>>>>>> 
>>>>>>>>> http://stackoverflow.com/questions/8806689/can-you-append-data-to-an-existing-avro-data-file
>>>>>>>>> 
>>>>>>>>> On Tue, Feb 21, 2012 at 3:15 AM, Vyacheslav Zholudev
>>>>>>>>> <vyacheslav.zholudev@gmail.com> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> is it possible to append to an already existing avro file when it was
>>>>>>>>>> written and closed before?
>>>>>>>>>> 
>>>>>>>>>> If I use
>>>>>>>>>> outputStream = fs.append(avroFilePath);
>>>>>>>>>> 
>>>>>>>>>> then later on I get: java.io.IOException: Invalid sync!
>>>>>>>>>> 
>>>>>>>>>> Probably because the schema is written twice and some other issues.
>>>>>>>>>> 
>>>>>>>>>> If I use outputStream = fs.create(avroFilePath); then the avro file
>>>>>>>>>> gets overwritten.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Vyacheslav
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Harsh J
>>>>>>>>> Customer Ops. Engineer
>>>>>>>>> Cloudera | http://tiny.cloudera.com/about
>>>>> 
>>>> 
>> 
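
For anyone wondering why the original fs.append() attempt ended in "Invalid sync!" and why appendTo() needs to read the existing file first, here is a toy model. It is deliberately NOT Avro's real encoding: the ToySyncFile class, its four-byte magic and its one-byte length prefix are all invented for illustration. The only thing it borrows from Avro is the idea that the header carries a 16-byte sync marker and every block must end with that same marker.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy container, NOT Avro's format: header = magic + sync; block = length byte + payload + sync. */
final class ToySyncFile {
    static final byte[] MAGIC = {'T', 'O', 'Y', 1};
    static final int SYNC_LEN = 16;

    static byte[] concat(byte[]... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }

    static byte[] header(byte[] sync) {
        return concat(MAGIC, sync);
    }

    static byte[] block(String payload, byte[] sync) {
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        return concat(new byte[] {(byte) data.length}, data, sync);
    }

    /** Reads the sync marker back out of an existing file's header. */
    static byte[] readSync(byte[] file) {
        return Arrays.copyOfRange(file, MAGIC.length, MAGIC.length + SYNC_LEN);
    }

    /** What a fresh writer on an append stream does: writes a second header with a new marker. */
    static byte[] naiveAppend(byte[] file, String payload, byte[] freshSync) {
        return concat(file, header(freshSync), block(payload, freshSync));
    }

    /** What an append-aware writer does: reuses the marker from the existing header. */
    static byte[] appendReusingSync(byte[] file, String payload) {
        return concat(file, block(payload, readSync(file)));
    }

    /** Reads every payload; fails when a block does not end in the header's marker. */
    static List<String> readAll(byte[] file) throws IOException {
        byte[] sync = readSync(file);
        List<String> payloads = new ArrayList<>();
        int pos = MAGIC.length + SYNC_LEN;
        while (pos < file.length) {
            int len = file[pos++] & 0xFF;
            int end = pos + len;
            // Either the block overruns the file or its trailing marker differs.
            if (end + SYNC_LEN > file.length
                    || !Arrays.equals(sync, Arrays.copyOfRange(file, end, end + SYNC_LEN))) {
                throw new IOException("Invalid sync!");
            }
            payloads.add(new String(file, pos, len, StandardCharsets.UTF_8));
            pos = end + SYNC_LEN;
        }
        return payloads;
    }

    public static void main(String[] args) throws IOException {
        byte[] syncA = new byte[SYNC_LEN];
        Arrays.fill(syncA, (byte) 0xA1);
        byte[] syncB = new byte[SYNC_LEN];
        Arrays.fill(syncB, (byte) 0xB2);
        byte[] original = concat(header(syncA), block("a", syncA), block("b", syncA));

        // Append that reuses the existing marker stays readable.
        System.out.println(readAll(appendReusingSync(original, "c")));
        try {
            // Second header in mid-file: the reader loses sync.
            readAll(naiveAppend(original, "c", syncB));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the toy, the naive append trips the reader because the second header is parsed as if it were a block. That is the same class of failure as a second Avro header landing in the middle of a file, and it is why the append path has to see the start of the existing file before writing.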

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr