bahir-reviews mailing list archives

From mayya-sharipova <...@git.apache.org>
Subject [GitHub] bahir pull request #45: [WIP] [BAHIR-110] Implement _changes API for non-str...
Date Tue, 04 Jul 2017 19:41:12 GMT
Github user mayya-sharipova commented on a diff in the pull request:

    https://github.com/apache/bahir/pull/45#discussion_r125523606
  
    --- Diff: sql-cloudant/README.md ---
    @@ -52,39 +51,61 @@ Here each subsequent configuration overrides the previous one. Thus,
configurati
     
     
     ### Configuration in application.conf
    -Default values are defined in [here](cloudant-spark-sql/src/main/resources/application.conf).
    +Default values are defined in [here](src/main/resources/application.conf).
     
     ### Configuration on SparkConf
     
     Name | Default | Meaning
     --- |:---:| ---
    +cloudant.apiReceiver|"_all_docs"| API endpoint for RelationProvider when loading or saving
data from Cloudant to DataFrames or SQL temporary tables. Select between "_all_docs" or "_changes"
endpoint.
     cloudant.protocol|https|protocol to use to transfer data: http or https
    -cloudant.host||cloudant host url
    -cloudant.username||cloudant userid
    -cloudant.password||cloudant password
    +cloudant.host| |cloudant host url
    +cloudant.username| |cloudant userid
    +cloudant.password| |cloudant password
     cloudant.useQuery|false|By default, _all_docs endpoint is used if configuration 'view'
and 'index' (see below) are not set. When useQuery is enabled, _find endpoint will be used
in place of _all_docs when query condition is not on primary key field (_id), so that query
predicates may be driven into datastore. 
     cloudant.queryLimit|25|The maximum number of results returned when querying the _find
endpoint.
     jsonstore.rdd.partitions|10|the number of partitions intent used to drive JsonStoreRDD
loading query result in parallel. The actual number is calculated based on total rows returned
and satisfying maxInPartition and minInPartition
     jsonstore.rdd.maxInPartition|-1|the max rows in a partition. -1 means unlimited
     jsonstore.rdd.minInPartition|10|the min rows in a partition.
     jsonstore.rdd.requestTimeout|900000| the request timeout in milliseconds
     bulkSize|200| the bulk save size
    -schemaSampleSize| "-1" | the sample size for RDD schema discovery. 1 means we are using
only first document for schema discovery; -1 means all documents; 0 will be treated as 1;
any number N means min(N, total) docs 
    -createDBOnSave|"false"| whether to create a new database during save operation. If false,
a database should already exist. If true, a new database will be created. If true, and a database
with a provided name already exists, an error will be raised. 
    +schemaSampleSize|-1| the sample size for RDD schema discovery. 1 means we are using only
first document for schema discovery; -1 means all documents; 0 will be treated as 1; any number
N means min(N, total) docs 
    +createDBOnSave|false| whether to create a new database during save operation. If false,
a database should already exist. If true, a new database will be created. If true, and a database
with a provided name already exists, an error will be raised. 
    +
    +The `cloudant.apiReceiver` option allows for _changes or _all_docs API endpoint to be
called while loading Cloudant data into Spark DataFrames or SQL Tables,
    +or saving data from DataFrames or SQL Tables to a Cloudant database.  
    +
    +**Note:** When using `_changes` API, please consider: 
    +1. Results are partially ordered and may not be be presented in order in 
    +which documents were updated.
    +2. In case of shards' unavailability, you may see duplicate results (changes that have
been seen already)
    +3. Can use `selector` option to retrieve all revisions for docs
    +4. Only supports single threaded
    +
    +When using `_all_docs` API:
    +1. Supports parallel reads (using offset and range)
    +
    +Performance of `_changes` API is still better in most cases (even with single threaded
support). 
    +During several performance tests using 50 to 200 MB Cloudant databases, load time from
Cloudant to Spark using `_changes` 
    +feed was faster to complete every time compared to `_all_docs`.
    + 
    --- End diff --
    
    With `_changes`, you can specify the `selector` option in Cloudant Query (CQ) format, which can
significantly limit the number of documents loaded into Spark.
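
A minimal sketch of the idea in Python, assuming the option names from the diff above (`cloudant.apiReceiver`, `selector`). The helper `changes_load_options` and the example selector on a `type` field are hypothetical, and whether each option belongs on SparkConf or on the individual read is also an assumption here:

```python
import json


def changes_load_options(selector):
    """Build an options map that selects the _changes endpoint and attaches
    a selector serialized as a JSON string.

    Option names are taken from the README diff under review; this helper
    itself is hypothetical and not part of the connector.
    """
    return {
        "cloudant.apiReceiver": "_changes",
        "selector": json.dumps(selector),
    }


# Hypothetical selector: only documents whose "type" field equals "order".
options = changes_load_options({"type": {"$eq": "order"}})
print(options)
```

With a Spark session available, a map like this could be handed to a DataFrame reader through `options(**options)` so that only matching documents are pulled from the `_changes` feed.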


