atlas-dev mailing list archives

From Ashutosh Mestry
Subject Review Request 57649: Export API: ZIP File Size Optimization
Date Wed, 15 Mar 2017 23:04:52 GMT

This is an automatically generated e-mail. To reply, visit:

Review request for atlas and Madhan Neethiraj.

Bugs: ATLAS-1503

Repository: atlas


The existing Export API implementation adds one *.json* file per entity when generating the ZIP
file. This makes ZIP file creation inefficient: the ZIP files are 75% larger than they could be
with fewer *.json* file entries.

The new implementation uses the v2 API *AtlasEntityWithExtInfo* representation instead of
*AtlasEntity*. This format bundles an entity together with its related entities. E.g. a
*hive_table* will contain all the *hive_column* entities that it is made up of. (See example
section below.)

This results in a significant reduction in the number of generated *JSON* files, which in turn
reduces the size of the generated *ZIP* file.
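
To illustrate why fewer entries help, here is a minimal sketch that uses only `java.util.zip`. It is
not the Atlas code: the class name, method names and sample JSON are hypothetical. It writes the
same documents once as one ZIP entry per document and once as a single combined entry, so the two
archive sizes can be compared.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration; none of these names come from Atlas.
class ZipEntryOverheadSketch {

    // Made-up JSON documents standing in for exported entities.
    static List<String> sampleDocs(int count) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            docs.add("{\"typeName\":\"hive_column\",\"guid\":\"guid-" + i
                    + "\",\"name\":\"col" + i + "\"}");
        }
        return docs;
    }

    // One ZIP entry per document: every entry carries its own headers
    // and is compressed independently of the others.
    static int sizeWithOneEntryPerDoc(List<String> docs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (int i = 0; i < docs.size(); i++) {
                zip.putNextEntry(new ZipEntry("entity-" + i + ".json"));
                zip.write(docs.get(i).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    // All documents in a single entry (as a JSON array): one set of headers,
    // and the compressor can exploit repetition across documents.
    static int sizeWithSingleEntry(List<String> docs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("entities.json"));
            String combined = "[" + String.join(",", docs) + "]";
            zip.write(combined.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }
}
```

Calling both methods on `sampleDocs(1000)` should show the single-entry archive to be the smaller
one; the exact ratio depends on the data and is not meant to reproduce the figures in this review.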

**Implementation Details**
*Export API*
- Modified the *Gremlin* query used to fetch connected entities so that it returns each *guid*
with a *boolean* indicating whether the entity is a process.
- _ExportService_ Modified the implementation to fetch *AtlasEntityWithExtInfo* instead of
*AtlasEntity*. Modified the bookkeeping to save *process* (lineage) entities after all
non-process entities are saved.
- _ZipSink_ Minor modification to serialize *AtlasEntityWithExtInfo*.
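
The bookkeeping change above can be sketched roughly as follows. This is an illustration under
assumptions, not the _ExportService_ code: the class and method names are hypothetical, and the
input map stands in for the (*guid*, *boolean*) pairs that the modified query is described as
returning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "save process entities after all non-process entities".
class ExportOrderSketch {

    // Input: guid -> true if the entity is a process (lineage) entity.
    // Output: guids in the order they would be saved.
    static List<String> exportOrder(Map<String, Boolean> guidToIsProcess) {
        List<String> nonProcess = new ArrayList<>();
        List<String> process = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : guidToIsProcess.entrySet()) {
            if (Boolean.TRUE.equals(e.getValue())) {
                process.add(e.getKey());
            } else {
                nonProcess.add(e.getKey());
            }
        }
        // Non-process entities first, process entities last.
        List<String> ordered = new ArrayList<>(nonProcess);
        ordered.addAll(process);
        return ordered;
    }
}
```

Within each group the sketch keeps the iteration order of the input map; whether the actual
implementation preserves any particular order within a group is not stated in this review.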

*Import API*
- _ZipSource_ Modified to source *AtlasEntityWithExtInfo*.
- _EntityImportStream_ Modified to source *AtlasEntityWithExtInfo*.
- _AtlasEntityStreamForImport.getGuid_ Modified to source requested entities first from the stored
*AtlasEntityWithExtInfo* object, requesting from the stream only if not found.
- _AtlasEntityStoreV1.bulkImport_ Minor modification to use the new changes to stream.
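
The lookup order described for _AtlasEntityStreamForImport.getGuid_ could look roughly like the
sketch below. All types here are simplified, hypothetical stand-ins (they are not the Atlas
classes), and the stream is modelled as a plain function from guid to entity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of "stored object first, stream only if not found".
class GuidLookupSketch {

    // Stand-in for an entity; only the fields needed for the example.
    static final class Entity {
        final String guid;
        final String typeName;

        Entity(String guid, String typeName) {
            this.guid = guid;
            this.typeName = typeName;
        }
    }

    // Stand-in for AtlasEntityWithExtInfo: one entity plus the entities it refers to.
    static final class EntityWithExtInfo {
        final Entity entity;
        final Map<String, Entity> referredEntities = new HashMap<>();

        EntityWithExtInfo(Entity entity) {
            this.entity = entity;
        }

        // Returns the main or a referred entity, or null if the guid is unknown here.
        Entity find(String guid) {
            if (entity.guid.equals(guid)) {
                return entity;
            }
            return referredEntities.get(guid);
        }
    }

    private final EntityWithExtInfo current;
    private final Function<String, Entity> streamLookup;
    int streamRequests = 0; // counts how often the fallback was needed

    GuidLookupSketch(EntityWithExtInfo current, Function<String, Entity> streamLookup) {
        this.current = current;
        this.streamLookup = streamLookup;
    }

    Entity getByGuid(String guid) {
        Entity local = current.find(guid);
        if (local != null) {
            return local; // served from the stored object, no stream access
        }
        streamRequests++;
        return streamLookup.apply(guid); // fall back to the stream
    }
}
```

For a *hive_table* whose columns are stored as referred entities, column lookups in this sketch
never reach the stream; only a guid that is absent from the stored object triggers the fallback.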

**Functional Areas Impacted**
- Full
- Connected
- HDFS path-based import.

- Regular flow.

Case of *hive_db*: Within the GraphDB the database has inward edges from the objects that refer
to it (tables, in this case). So the *AtlasEntityWithExtInfo* for the database will not have any
referred entities.

Case of *hive_table*: Within the GraphDB the table has outward edges pointing to the columns
it is made up of. It also has edges pointing to the database and storage descriptor. Hence, the
*AtlasEntityWithExtInfo* for the table will have the full representation of all the columns and
references to the database and storage descriptor.


Date | File Size | No. of Entities | Export Duration |
3/02 |    180 MB |          202930 |         22 mins |
3/08 |      7 KB |               3 |          5 secs |
                 After improvement                   |
3/14 |     38 MB |          202930 |         19 mins |
3/14 |      5 KB |               3 |          5 secs |

With these changes, the file size reduction is ~65%.
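
For reference, the per-row reduction implied by the table can be worked out with the small sketch
below. It is plain arithmetic on the figures listed above, assuming the 3/02 and 3/08 rows are the
"before" measurements and the 3/14 rows the "after" ones; the class and method names are
hypothetical.

```java
// Hypothetical helper; only does arithmetic on the sizes listed in the table.
class SizeReductionSketch {

    // Fractional reduction when going from 'before' to 'after' (same unit for both).
    static double reduction(double before, double after) {
        return (before - after) / before;
    }

    public static void main(String[] args) {
        // 202930 entities: 180 MB before, 38 MB after.
        System.out.printf("large export: %.1f%%%n", 100 * reduction(180, 38));
        // 3 entities: 7 KB before, 5 KB after.
        System.out.printf("small export: %.1f%%%n", 100 * reduction(7, 5));
    }
}
```

These two rows give roughly 79% for the large export and roughly 29% for the small one, so the
~65% figure is presumably an aggregate over a different or wider set of measurements.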


  intg/src/main/java/org/apache/atlas/model/impexp/ e6a967e 
  intg/src/main/java/org/apache/atlas/model/instance/ 4e3895d 
  repository/src/main/java/org/apache/atlas/util/ 4743b73 
  webapp/src/main/java/org/apache/atlas/web/resources/ 31a4cf9 
  webapp/src/main/java/org/apache/atlas/web/resources/ c1891e0 
  webapp/src/main/java/org/apache/atlas/web/resources/ 2e4cb01 
  webapp/src/main/java/org/apache/atlas/web/resources/ a69f7fa 



Test data:
- QuickStart_v1: 3 databases.
- A *hive_db* with 922 tables.
- Stocks *hive_db* with 1 database, table, process and 5 columns.
- A *hive_db* with 522K entities.

The changes impact all the flows in the Export & Import APIs.
Unit testing: Manual.
Integration testing: Manual.
Accuracy testing: Manual. Verified using Export -> Import -> Export -> file compare.
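
The Export -> Import -> Export -> file compare check could be automated along the lines of the
sketch below. It is an assumption about how such a comparison might be done, not the procedure
actually used: it compares two ZIP archives entry by entry (name and UTF-8 content), ignoring
entry order. All names are hypothetical, and `zipOf` exists only to build small in-memory archives
for trying it out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration of comparing two export archives.
class ExportCompareSketch {

    // Reads every entry of a ZIP archive into a sorted map: entry name -> content.
    static Map<String, String> readEntries(byte[] zipBytes) {
        Map<String, String> entries = new TreeMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] buffer = new byte[8192];
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                entries.put(entry.getName(),
                        new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    // True if both archives hold the same entry names with identical content.
    static boolean sameContent(byte[] firstExport, byte[] secondExport) {
        return readEntries(firstExport).equals(readEntries(secondExport));
    }

    // Builds an in-memory ZIP from name -> content pairs (for experimentation only).
    static byte[] zipOf(Map<String, String> entries) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Map.Entry<String, String> e : entries.entrySet()) {
                zip.putNextEntry(new ZipEntry(e.getKey()));
                zip.write(e.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

A textual comparison like this is strict: if the exported JSON contains fields that legitimately
differ between runs, those would need to be normalized before comparing.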


Ashutosh Mestry
