From: Martin Kleppmann
Reply-To: user@avro.apache.org
Subject: Re: Dynamic Schema
Date: Wed, 2 Apr 2014 21:01:08 +0000

Hi Amit,

The Avro data file format requires the writer to know the schema from the start, because all records in the file are then written with the same schema. So there probably isn't an alternative to what you're doing -- to buffer as much as you can in memory, write it out to file when the memory buffer is full, and then start a new file.

You can't change the schema of a data file once it has been written, but you can run a background process which merges several data files together, and writes the result to a new file. You can make the merged file's schema the union of all the input file schemas, or you can write some application-specific code which combines the schemas into one, and evolve all the records into that merged schema. This can be done by streaming through the files -- you don't need to keep all the data in memory.

Martin

On 1 Apr 2014, at 21:55, amit nanda wrote:

> I have very dynamic data that I want to write to an Avro file. The solution I have is to store all that data in memory, then calculate the schema, and then start writing.
>
> This causes the files to be smaller in size, because of the memory limitations.
>
> What I am looking for is that I will start writing data as and when it is collected, but how should I compute the schema in this case? Can I change the schema for an Avro file?
>
> Thanks
> Amit
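
For anyone reading this thread in the archive, here is a rough sketch of the merge step described in the reply above: combine the record schemas of several input files into one, then stream the records through and evolve each into the merged schema. It is only an illustration under stated assumptions. It uses plain Python dicts in place of real Avro files and does not call any Avro library, and the names `merge_record_schemas` and `evolve`, the `Event` schemas and the "make every field nullable" merge policy are all made up for the example. It also assumes flat records whose fields have simple types.

```python
from itertools import chain
from typing import Iterable, Iterator


def merge_record_schemas(name: str, schemas: Iterable[dict]) -> dict:
    """Combine Avro-style record schemas field by field (hypothetical policy).

    Every field becomes a union starting with "null" and defaulting to null,
    so a record written without that field can still be evolved.
    """
    branches = {}  # field name -> list of non-null types seen so far
    for schema in schemas:
        for field in schema["fields"]:
            ftype = field["type"]
            types = ftype if isinstance(ftype, list) else [ftype]
            seen = branches.setdefault(field["name"], [])
            for t in types:
                if t != "null" and t not in seen:
                    seen.append(t)
    return {
        "type": "record",
        "name": name,
        "fields": [
            {"name": fname, "type": ["null"] + types, "default": None}
            for fname, types in branches.items()
        ],
    }


def evolve(records: Iterable[dict], merged: dict) -> Iterator[dict]:
    """Lazily rewrite records to the merged schema, one record at a time."""
    names = [f["name"] for f in merged["fields"]]
    for record in records:
        # Fields missing from the input record take the null default.
        yield {n: record.get(n) for n in names}


# Stand-ins for two data files written with different schemas.
schema_a = {"type": "record", "name": "Event", "fields": [
    {"name": "id", "type": "long"},
    {"name": "user", "type": "string"},
]}
schema_b = {"type": "record", "name": "Event", "fields": [
    {"name": "id", "type": "long"},
    {"name": "url", "type": "string"},
]}
records_a = iter([{"id": 1, "user": "amit"}])
records_b = iter([{"id": 2, "url": "http://example.org/"}])

merged_schema = merge_record_schemas("Event", [schema_a, schema_b])

# A real merge job would write each evolved record straight to the new file;
# list() is used here only to make the result easy to inspect.
merged_records = list(evolve(chain(records_a, records_b), merged_schema))
```

Because `evolve` is a generator, only one record is held at a time, which matches the point that the merge can be done by streaming. With an actual Avro library, the input iterators would be data file readers and the output would go to a data file writer opened with the merged schema.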