Subject: Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db
From: "Sohan Jain"
To: "Ning Zhang", "Paul Yang"
Cc: "Sohan Jain", "hive"
Date: Fri, 05 Aug 2011 20:46:16 -0000
Reply-To: dev@hive.apache.org
Message-ID: <20110805204616.21250.27405@reviews.apache.org>
In-Reply-To: <20110725064604.22636.18719@reviews.apache.org>
References: <20110725064604.22636.18719@reviews.apache.org>
X-ReviewBoard-URL: https://reviews.apache.org
X-ReviewRequest-URL: https://reviews.apache.org/r/1183/
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="===============4571631146331239564=="

--===============4571631146331239564==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable


> On 2011-07-25 06:46:04, Ning Zhang wrote:
> > trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql, line 76
> >
> > is the CHARSET (latin1) the same as SDS?
> > This will require the user's comments to be in latin1, which prevents UTF chars.

Yes, this charset matches the one used in the official Hive schema for 0.7.0.


> On 2011-07-25 06:46:04, Ning Zhang wrote:
> > trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql, line 206
> >
> > can you also add migration script for derby? we support derby as a default metastore RDBMS as well.

Ok, will do. I will add it in the diff after the next one.


> On 2011-07-25 06:46:04, Ning Zhang wrote:
> > trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java, line 1752
> >
> > here do you check if the 'alter table' command changes the schema (columns definition)? If it just set a table property, then you don't need to create a new ColumnDescriptor right?
> >
> > Also if a table's schema got changed, a new CD will be created, but the old partition will still have the old CDs. When we query the old partition, do we use the old partition's CD or the table's CD?
> >
> > Also in the above case, when you run 'desc table partition ', do you return the old partition's CD or the table's CD?

Good point; I should check whether the table columns have changed. I already do this when altering partitions, and I added the same check for tables in the next diff.

If a table's schema changes, the existing partition CDs are not updated. If we ever grab the partition object after the schema change, it will refer to its old CD, not the table's CD. However, when querying tables on the CLI, we almost always use the table's set of columns. E.g., if you did:
> create table test (a string) partitioned by (p1 string, p2 string);
> alter table test add partition(p1=1, p2=1);
> # populate the p1=1, p2=1 partition with some data now
> alter table test add columns (b string);
> select * from test where p1 = 1 and p2 = 1;
then it'd use the table's latest schema; i.e., return the column 'a's values and the column 'b' as all NULL.
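
As a rough illustration of the query behavior described above, here is a toy Python model. It is not Hive code: the function name and the data are hypothetical, and matching columns by name is a simplifying assumption made only for this sketch. It shows rows written under a partition's older column list being read through the table's current column list, with columns the partition never stored coming back as NULL (None).

```python
# Toy model only: names are hypothetical and do not mirror Hive's classes.

def read_with_table_schema(table_columns, partition_columns, partition_rows):
    """Project rows stored under a partition's column list onto the table's columns.

    Columns that the partition never stored are filled with None (NULL).
    Assumption for this sketch: columns are matched by name.
    """
    index_of = {name: i for i, name in enumerate(partition_columns)}
    projected = []
    for row in partition_rows:
        projected.append(tuple(
            row[index_of[name]] if name in index_of else None
            for name in table_columns
        ))
    return projected


# The partition (p1=1, p2=1) was written when the table only had column 'a'.
partition_columns = ["a"]
partition_rows = [("x",), ("y",)]

# After 'alter table test add columns (b string)' the table has 'a' and 'b'.
table_columns = ["a", "b"]

result = read_with_table_schema(table_columns, partition_columns, partition_rows)
print(result)  # [('x', None), ('y', None)]
```

The partition's own column list is left untouched in this model, which mirrors the point that a partition object fetched after the schema change still refers to its old CD.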

- Sohan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/#review1176
-----------------------------------------------------------


On 2011-07-22 05:30:29, Sohan Jain wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/1183/
> -----------------------------------------------------------
>
> (Updated 2011-07-22 05:30:29)
>
>
> Review request for hive, Ning Zhang and Paul Yang.
>
>
> Summary
> -------
>
> This patch tries to make minimal changes to the API while keeping migration short and somewhat easy to revert.
>
> The new schema can be described as follows:
> - CDS is a table corresponding to Column Descriptor objects. Currently, it only stores a CD_ID.
> - COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns. A Column Descriptor holds a list of columns. COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.
> - SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID which describes its columns.
>
> During migration, we create Column Descriptors for tables in a straightforward manner: their columns are now just wrapped inside a column descriptor. The SDS of partitions use their parent table's column descriptor, since currently a partition and its table share the same list of columns.
>
> When altering or adding a partition, give it its parent table's column descriptor IF the columns they describe are the same. Otherwise, create a new column descriptor for its columns.
>
> When adding or altering a table, create a new column descriptor every time.
>
> Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check to see if the related column descriptor has any other references in the table.
> That is, check to see if any other storage descriptors point to that column descriptor. If none do, then delete that column descriptor. This check is in place so we don't have unreferenced column descriptors and columns hanging around after schema evolution for tables.
>
>
> This addresses bug HIVE-2246.
>     https://issues.apache.org/jira/browse/HIVE-2246
>
>
> Diffs
> -----
>
>   trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
>   trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1148945
>   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
>   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1148945
>   trunk/metastore/src/model/package.jdo 1148945
>
> Diff: https://reviews.apache.org/r/1183/diff
>
>
> Testing
> -------
>
> Passes facebook's regression testing and all existing test cases. In one instance, before migration, the overhead involved with storage descriptors and columns was ~11 GB. After migration, the overhead was ~1.5 GB.
>
>
> Thanks,
>
> Sohan
>

--===============4571631146331239564==--
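
The sharing and cleanup rules in the quoted summary can be sketched with a small in-memory model. This is a hedged illustration in Python, not the patch's Java code: the class `ToyMetastore`, its method names, and its dictionaries are all hypothetical stand-ins for the CDS and SDS tables. It assumes a partition reuses its table's column descriptor only when the column lists are equal, and that a column descriptor is deleted once no storage descriptor points to it.

```python
# Toy in-memory model; class and method names are hypothetical, not Hive's API.

class ToyMetastore:
    def __init__(self):
        self.cds = {}   # cd_id -> tuple of (column name, type); stands in for CDS/COLUMNS_V2
        self.sds = {}   # sd_id -> cd_id; stands in for the SDS foreign key to CD_ID
        self._next_cd = 1
        self._next_sd = 1

    def _new_cd(self, columns):
        cd_id = self._next_cd
        self._next_cd += 1
        self.cds[cd_id] = tuple(columns)
        return cd_id

    def _new_sd(self, cd_id):
        sd_id = self._next_sd
        self._next_sd += 1
        self.sds[sd_id] = cd_id
        return sd_id

    def add_table(self, columns):
        # Adding a table creates a new column descriptor every time.
        return self._new_sd(self._new_cd(columns))

    def add_partition(self, table_sd_id, columns):
        # Reuse the parent table's descriptor only if the columns are the same.
        table_cd = self.sds[table_sd_id]
        if self.cds[table_cd] == tuple(columns):
            return self._new_sd(table_cd)
        return self._new_sd(self._new_cd(columns))

    def drop_sd(self, sd_id):
        # Delete the column descriptor if no other storage descriptor uses it.
        cd_id = self.sds.pop(sd_id)
        if cd_id not in self.sds.values():
            del self.cds[cd_id]


store = ToyMetastore()
table = store.add_table([("a", "string")])
same = store.add_partition(table, [("a", "string")])
different = store.add_partition(table, [("a", "string"), ("b", "string")])

print(store.sds[same] == store.sds[table])       # True: descriptor is shared
print(store.sds[different] == store.sds[table])  # False: a new descriptor was created

store.drop_sd(different)
print(len(store.cds))  # 1: the unshared descriptor was removed with its partition
```

Scanning every storage descriptor on each drop keeps the sketch short; a real store would more plausibly run a count query against SDS for the given CD_ID.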