Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Content-Type: multipart/alternative;
	boundary="===============4571631146331239564=="
MIME-Version: 1.0
Subject: Re: Review Request: HIVE-2246: Dedupe tables' column schemas from
 partitions
 in the metastore db
From: "Sohan Jain" <sohanjain@fb.com>
To: "Ning Zhang" <nzhang@fb.com>, "Paul Yang" <pyang@fb.com>
Date: Fri, 05 Aug 2011 20:46:16 -0000
Message-ID: <20110805204616.21250.27405@reviews.apache.org>
Cc: "Sohan Jain" <sohanjain@fb.com>,"hive" <dev@hive.apache.org>
In-Reply-To: <20110725064604.22636.18719@reviews.apache.org>
References: <20110725064604.22636.18719@reviews.apache.org>

--===============4571631146331239564==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable


> On 2011-07-25 06:46:04, Ning Zhang wrote:
> > trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql, line 76
> > <https://reviews.apache.org/r/1183/diff/2/?file=26824#file26824line76>
> >
> >     is the CHARSET (latin1) the same as SDS? This will require the user's comments to be in latin1 which prevents UTF chars.

Yes, this charset matches the one used in the official Hive schema for 0.7.0.


> On 2011-07-25 06:46:04, Ning Zhang wrote:
> > trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql, line 206
> > <https://reviews.apache.org/r/1183/diff/2/?file=26824#file26824line206>
> >
> >     can you also add migration script for derby? we support derby as a default metastore RDBMS as well.

OK, will do.  I will add it in the diff after the next one.


> On 2011-07-25 06:46:04, Ning Zhang wrote:
> > trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java, line 1752
> > <https://reviews.apache.org/r/1183/diff/2/?file=26825#file26825line1752>
> >
> >     here do you check if the 'alter table' command changes the schema (columns definition)? If it just set a table property, then you don't need to create a new ColumnDescriptor right?
> >
> >     Also if a table's schema got changed, a new CD will be created, but the old partition will still have the old CDs. When we query the old partition, do we use the old partition's CD or the table's CD?
> >
> >     Also in the above case, when you run 'desc table partition <old_partition>', do you return the old partition's CD or the table's CD?

Good point.  I should check whether the table's columns have changed, as I already do when altering partitions.  I have added that check in the next diff.

If a table's schema changes, the existing partition CDs are not updated.  If we ever grab the partition object after the schema change, it will refer to its old CD, not the table's CD.  However, when querying tables on the CLI, we almost always use the table's set of columns.  E.g., if you ran:
> create table test (a string) partitioned by (p1 string, p2 string);
> alter table test add partition(p1=1, p2=1);
> # populate the p1=1, p2=1 partition with some data now
> alter table test add columns (b string);
> select * from test where p1 = 1 and p2 = 1;

it'd use the table's latest schema; i.e., return the column 'a's values and the column 'b' as all NULL.
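
To make the behaviour above concrete, here is a minimal in-memory sketch in Python.  It is only a toy model under stated assumptions: the class and method names (ColumnDescriptor, Partition, Table, add_columns, select_all) are hypothetical and do not correspond to the actual metastore classes or to how the CLI reads data.

```python
# Toy in-memory model of the behaviour described above. All class and
# method names are hypothetical; this is not the metastore code.
from dataclasses import dataclass, field


@dataclass
class ColumnDescriptor:
    columns: list  # column names, in order


@dataclass
class Partition:
    cd: ColumnDescriptor
    rows: list = field(default_factory=list)  # each row is a dict keyed by column name


@dataclass
class Table:
    cd: ColumnDescriptor
    partitions: dict = field(default_factory=dict)

    def add_partition(self, key):
        # A new partition shares its parent table's column descriptor.
        self.partitions[key] = Partition(cd=self.cd)
        return self.partitions[key]

    def add_columns(self, new_columns):
        # Altering the table creates a new descriptor; existing partitions
        # keep pointing at the old one.
        self.cd = ColumnDescriptor(self.cd.columns + list(new_columns))

    def select_all(self, key):
        # Reads project rows onto the table's current columns, so columns
        # missing from the partition's data come back as None (NULL).
        part = self.partitions[key]
        return [[row.get(c) for c in self.cd.columns] for row in part.rows]


test = Table(cd=ColumnDescriptor(["a"]))
p = test.add_partition(("1", "1"))
p.rows.append({"a": "x"})           # populate the p1=1, p2=1 partition
test.add_columns(["b"])             # alter table test add columns (b string)
result = test.select_all(("1", "1"))
```

In this model the partition object still refers to its old descriptor after the alter, while the select goes through the table's descriptor, which is the distinction drawn above.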


- Sohan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/#review1176
-----------------------------------------------------------


On 2011-07-22 05:30:29, Sohan Jain wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/1183/
> -----------------------------------------------------------
>
> (Updated 2011-07-22 05:30:29)
>
>
> Review request for hive, Ning Zhang and Paul Yang.
>
>
> Summary
> -------
>
> This patch tries to make minimal changes to the API while keeping migration short and somewhat easy to revert.
>
> The new schema can be described as follows:
> - CDS is a table corresponding to Column Descriptor objects.  Currently, it only stores a CD_ID.
> - COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.
> - SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID which describes its columns.
>
> During migration, we create Column Descriptors for tables in a straightforward manner: their columns are now just wrapped inside a column descriptor.  The SDS of partitions use their parent table's column descriptor, since currently a partition and its table share the same list of columns.
>
> When altering or adding a partition, give it its parent table's column descriptor IF the columns they describe are the same.  Otherwise, create a new column descriptor for its columns.
>
> When adding or altering a table, create a new column descriptor every time.
>
> Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check to see if the related column descriptor has any other references in the table.  That is, check to see if any other storage descriptors point to that column descriptor.  If none do, then delete that column descriptor.  This check is in place so we don't have unreferenced column descriptors and columns hanging around after schema evolution for tables.
>
>
> This addresses bug HIVE-2246.
>     https://issues.apache.org/jira/browse/HIVE-2246
>
>
> Diffs
> -----
>
>   trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
>   trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1148945
>   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
>   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1148945
>   trunk/metastore/src/model/package.jdo 1148945
>
> Diff: https://reviews.apache.org/r/1183/diff
>
>
> Testing
> -------
>
> Passes facebook's regression testing and all existing test cases.  In one instance, before migration, the overhead involved with storage descriptors and columns was ~11 GB.  After migration, the overhead was ~1.5 GB.
>
>
> Thanks,
>
> Sohan
>
>
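
The cleanup rule in the quoted summary (delete a column descriptor only when no remaining storage descriptor points at it) could be sketched roughly as follows.  This is an illustration under assumptions, not the ObjectStore implementation: the function name drop_storage_descriptor and the dict/set stand-ins for the SDS and CDS tables are hypothetical.

```python
# Hypothetical sketch of the cleanup rule. `sds` maps SD_ID -> CD_ID and
# `cds` is the set of live CD_IDs; neither mirrors the real ObjectStore API.
def drop_storage_descriptor(sds, cds, sd_id):
    cd_id = sds.pop(sd_id)
    # Delete the column descriptor only if no remaining SD points at it.
    if cd_id not in sds.values():
        cds.discard(cd_id)
    return cd_id


sds = {1: 10, 2: 10, 3: 11}            # SDs 1 and 2 share CD 10
cds = {10, 11}
drop_storage_descriptor(sds, cds, 1)   # CD 10 is still referenced by SD 2
after_first = set(cds)
drop_storage_descriptor(sds, cds, 2)   # last reference gone, CD 10 is removed
```

A real implementation would also remove the COLUMNS_V2 rows belonging to the deleted descriptor; that step is omitted here.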


--===============4571631146331239564==--