Subject: Re: Review Request 57649: Export API: ZIP File Size Optimization
From: Madhan Neethiraj
To: Madhan Neethiraj
Cc: Ashutosh Mestry, atlas
Date: Sun, 19 Mar 2017 02:39:14 -0000

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57649/#review169370
-----------------------------------------------------------

Ship it!

- Madhan Neethiraj


On March 17, 2017, 5:09 a.m., Ashutosh Mestry wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail.
> To reply, visit:
> https://reviews.apache.org/r/57649/
> -----------------------------------------------------------
>
> (Updated March 17, 2017, 5:09 a.m.)
>
>
> Review request for atlas and Madhan Neethiraj.
>
>
> Bugs: ATLAS-1665
>     https://issues.apache.org/jira/browse/ATLAS-1665
>
>
> Repository: atlas
>
>
> Description
> -------
>
> **Background**
> ==============
> The existing Export API implementation adds one *.json* file per entity when generating the ZIP file. This makes ZIP file creation inefficient: the ZIP files are 75% larger than they would be with fewer *.json* file entries.
>
> **Solution**
> ============
> The implementation uses the new v2 API *AtlasEntityWithExtInfo* representation instead of *AtlasEntity*. This format combines an entity with its related entities. E.g. a *hive_table* will contain all the *hive_columns* that it is made up of. (See the Examples section below.)
>
> This significantly reduces the number of generated *JSON* files, which in turn reduces the size of the generated *ZIP* file.
>
> **Implementation Details**
> ==========================
> *Export API*
> - Modified the *Gremlin* query used to fetch connected entities to return the *guid* along with a *boolean* indicating whether the entity is a process.
> - _ExportService_: Modified to fetch *AtlasEntityWithExtInfo* instead of *AtlasEntity*. Modified bookkeeping to save *process* (lineage) entities after all non-process entities are saved.
> - _ZipSink_: Minor modification to serialize *AtlasEntityWithExtInfo*.
>
> *Import API*
> - _ZipSource_: Modified to source *AtlasEntityWithExtInfo*.
> - _EntityImportStream_: Modified to source *AtlasEntityWithExtInfo*.
> - _AtlasEntityStreamForImport.getGuid_: Modified to look up requested entities first in the stored *AtlasEntityWithExtInfo* object, and to request them from the stream only if not found.
> - _AtlasEntityStoreV1.bulkImport_: Minor modification to use the changes to the stream.
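
The Background claim, that one *.json* entry per entity inflates the archive, can be reproduced in isolation with `java.util.zip`. What follows is a minimal sketch and not code from the patch: the class name `ZipEntryOverheadDemo`, the entry names, and the synthetic JSON fields are assumptions made up for illustration. It writes the same documents once as one ZIP entry each and once as a single combined entry, then reports both archive sizes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; it does not use or mirror any Atlas class.
final class ZipEntryOverheadDemo {

    /** Builds small synthetic JSON documents; the field names are invented for this example. */
    static List<String> syntheticEntities(int count) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            docs.add("{\"typeName\":\"hive_column\",\"guid\":\"guid-" + i
                    + "\",\"attributes\":{\"name\":\"col" + i + "\",\"type\":\"string\"}}");
        }
        return docs;
    }

    /** Writes each document as its own ZIP entry and returns the archive size in bytes. */
    static int sizeWithOneEntryPerDocument(List<String> docs) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < docs.size(); i++) {
                zip.putNextEntry(new ZipEntry("entity-" + i + ".json"));
                zip.write(docs.get(i).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.size();
    }

    /** Writes all documents as one JSON array in a single ZIP entry and returns the archive size. */
    static int sizeWithSingleCombinedEntry(List<String> docs) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("combined.json"));
            zip.write(("[" + String.join(",", docs) + "]").getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        List<String> docs = syntheticEntities(1000);
        System.out.println("one entry per entity:  " + sizeWithOneEntryPerDocument(docs) + " bytes");
        System.out.println("single combined entry: " + sizeWithSingleCombinedEntry(docs) + " bytes");
    }
}
```

Each ZIP entry carries its own local header and central-directory record and is compressed independently, so many tiny entries pay that cost repeatedly and compress poorly. The exact ratio depends on the data, so the numbers printed here say nothing about the sizes measured in the review request.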
>
>
> **Functional Areas Impacted**
> =============================
> *Export*
> - Full
> - Connected
> - HDFS path-based import.
>
> *Import*
> - Regular flow.
>
> **Examples**
> ============
> Case of *hive_db*: Within the GraphDB, the database has inward edges from the objects that refer to it (tables, in this case). So the *AtlasEntityWithExtInfo* for a database will not have any referred entities.
>
> Case of *hive_table*: Within the GraphDB, the table has outward edges pointing to the columns it is made up of. It also has edges pointing to its database and storage descriptor. Hence, the *AtlasEntityWithExtInfo* for a table will have the full representation of all its columns, plus references to the database and storage descriptor.
>
> **Metrics**
> ===========
>
> Date | File Size | No. of Entities | Export   | Import   |
>      |           |                 | Duration | Duration |
> -----|-----------|-----------------|----------|----------|
> 3/02 | 180 MB    | 202930          | 22 mins  | 1:38 hrs |
> 3/08 | 7 KB      | 3               | 5 secs   | 7 secs   |
> ---------------------------------------------------------|
> Improvement                                              |
> ---------------------------------------------------------|
> 3/14 | 38 MB     | 202930          | 20 mins  | 1:10 hrs |
> 3/14 | 5 KB      | 3               | 5 secs   | 7 secs   |
>
>
> **Summary**
> ===========
> With these changes, the file size reduction is ~65%.
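
For reference, the per-row size reduction implied by the Metrics table can be computed directly. This is a small arithmetic sketch, and the class and method names are hypothetical. With the figures from the table it gives roughly 78.9% for the 202930-entity export (180 MB to 38 MB) and roughly 28.6% for the 3-entity export (7 KB to 5 KB). The request does not state how the ~65% summary figure was aggregated, so this sketch does not try to reproduce it.

```java
import java.util.Locale;

// Hypothetical helper for the arithmetic only; not part of the patch.
final class SizeReduction {

    /** Percentage by which {@code after} is smaller than {@code before} (same unit for both). */
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Values copied from the Metrics table: 3/02 vs 3/14 and 3/08 vs 3/14.
        System.out.println(String.format(Locale.ROOT,
                "202930 entities: %.1f%% smaller", reductionPercent(180.0, 38.0)));
        System.out.println(String.format(Locale.ROOT,
                "3 entities: %.1f%% smaller", reductionPercent(7.0, 5.0)));
    }
}
```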
>
>
> Diffs
> -----
>
>   intg/src/main/java/org/apache/atlas/model/instance/AtlasEntity.java 4e3895d
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityGraphDiscoveryV1.java 6c88510
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStoreV1.java cce3fca
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStream.java 5d9a7d4
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStreamForImport.java 8cb36ac
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/EntityStream.java 4c43921
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/InMemoryMapEntityStream.java 241f6d0
>   repository/src/main/java/org/apache/atlas/util/AtlasGremlin2QueryProvider.java 4743b73
>   webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java e123ff7
>   webapp/src/main/java/org/apache/atlas/web/resources/ZipSink.java 37d9eb5
>   webapp/src/main/java/org/apache/atlas/web/resources/ZipSource.java a69f7fa
>
>
> Diff: https://reviews.apache.org/r/57649/diff/6/
>
>
> Testing
> -------
>
> Test data:
> - QuickStart_v1: 3 databases.
> - A *hive_db* with 922 tables.
> - Stocks *hive_db* with 1 database, table, process and 5 columns.
> - A *hive_db* with 522K entities.
>
> The changes impact all the flows in the Export & Import APIs.
> Unit testing: Manual.
> Integration testing: Manual.
> Accuracy testing: Manual. Verified using Export -> Import -> Export -> file compare.
>
>
> File Attachments
> ----------------
>
> Patch on 2.6-maint
>   https://reviews.apache.org/media/uploaded/files/2017/03/17/5fc7a466-9bac-4282-a9fd-659d0528b443__export-size-optimized.2.6-maint.2.patch
>
>
> Thanks,
>
> Ashutosh Mestry
>
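
The request does not say how the "file compare" step of the Export -> Import -> Export check was carried out. One possible approach, sketched here under that caveat, is to compare two ZIP archives entry by entry rather than byte for byte, since archive-level bytes can differ (for example in entry timestamps) even when the contents match. The class `ZipContentCompare` and its methods are hypothetical. The sketch also assumes the exported JSON is deterministic, which the request does not state.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper; not part of the patch or of Atlas.
final class ZipContentCompare {

    /** Reads every entry of a ZIP stream into a map from entry name to uncompressed content. */
    static Map<String, byte[]> readEntries(InputStream in) throws IOException {
        Map<String, byte[]> entries = new TreeMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] buffer = new byte[8192];
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                for (int n = zip.read(buffer); n != -1; n = zip.read(buffer)) {
                    content.write(buffer, 0, n);
                }
                entries.put(e.getName(), content.toByteArray());
            }
        }
        return entries;
    }

    /** Returns true when both archives hold the same entry names with identical content. */
    static boolean sameContent(InputStream first, InputStream second) throws IOException {
        Map<String, byte[]> a = readEntries(first);
        Map<String, byte[]> b = readEntries(second);
        if (!a.keySet().equals(b.keySet())) {
            return false;
        }
        for (Map.Entry<String, byte[]> e : a.entrySet()) {
            if (!Arrays.equals(e.getValue(), b.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ZipContentCompare <first.zip> <second.zip>");
            return;
        }
        try (InputStream first = new FileInputStream(args[0]);
             InputStream second = new FileInputStream(args[1])) {
            System.out.println(sameContent(first, second) ? "identical content" : "content differs");
        }
    }
}
```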