From: Sandeep Joshi
To: dev@asterixdb.incubator.apache.org
Reply-To: dev@asterixdb.incubator.apache.org
Subject: Re: external data set support
Date: Wed, 17 Feb 2016 13:29:51 +0530
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a1143fdde75039a052bf2a202

--001a1143fdde75039a052bf2a202
Content-Type: text/plain; charset=UTF-8

Comments in text..

On Sun, Feb 14, 2016 at 1:14 PM, abdullah alamoudi wrote:

> Hi Sandeep,
> Here are the answers as per my understanding of the questions:
>
> 1) Schema catalog : One would have to implement IMetadataProvider,
> IDataSource, IDataSourceIndex and other related classes. Is there any
> functionality missing from the current schema implementation for external
> data sets ?
> Schema information for external data already exists and we use the
> AqlMetadataProvider for both external and internal datasets.
>
> One of the papers says that one should add comparators and hash functions
> for any new data types introduced by the external data set. Which
> interface does one have to implement for that ?
> I am not sure which paper you're referring to, but for adding new data types
> (regardless of whether they are used with internal or external datasets;
> there is really no distinction) here is what needs to be done:
> 1. For complex types, one can simply define a type using the create type
> statement.
> 2. For completely new types, one needs to implement at least {IAType,
> IBinaryComparatorFactory, and IBinaryComparator}. I am not sure if that is
> enough but that is a starting point.
>
> 2) Query optimization : There is no cost-based optimizer yet within
> Algebricks, therefore there is no API to support retrieval and use of table
> statistics from an external data source.
>
> Is something planned in this regard ?
> A cost-based optimizer for internal datasets is being worked on (@Ildar might
> add here). As for external data, unfortunately right now, we don't even
> employ some easy rule-based optimizations. For example, we can utilize the
> RC file structure to push projection into the data source operator but we
> don't do that yet. Another optimization that can be done is lazy
> deserialization of records but again we don't do that. There are plans to
> do all of these but we have a manpower shortage. You are welcome to give
> them a shot and we can assist.
>

I will get back on that...
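
For reference, a minimal sketch of the comparator piece from point 1. The two interfaces below are hypothetical stand-ins that only mirror the general factory/comparator split suggested by the names IBinaryComparatorFactory and IBinaryComparator; they are not the real Hyracks API, and their signatures are assumptions. The example type (a point stored as two big-endian 32-bit ints) is also made up for illustration.

```java
import java.io.Serializable;
import java.nio.ByteBuffer;

// Hypothetical stand-in, NOT the real Hyracks interface: compares two
// values in their serialized form, each given as (buffer, start, length).
interface BinaryComparator {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

// Hypothetical stand-in, NOT the real Hyracks interface: a serializable
// factory that creates comparator instances.
interface BinaryComparatorFactory extends Serializable {
    BinaryComparator createBinaryComparator();
}

// Assumed example type: a point serialized as two big-endian ints (x, then y).
final class PointComparatorFactory implements BinaryComparatorFactory {
    private static final long serialVersionUID = 1L;

    @Override
    public BinaryComparator createBinaryComparator() {
        // Order by x first, then by y, reading directly from the bytes.
        return (b1, s1, l1, b2, s2, l2) -> {
            int byX = Integer.compare(readInt(b1, s1), readInt(b2, s2));
            if (byX != 0) {
                return byX;
            }
            return Integer.compare(readInt(b1, s1 + 4), readInt(b2, s2 + 4));
        };
    }

    // Decode one big-endian 32-bit int starting at the given offset.
    static int readInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
                | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    // Helper for the demo: serialize a point into 8 bytes (big-endian).
    static byte[] encode(int x, int y) {
        return ByteBuffer.allocate(8).putInt(x).putInt(y).array();
    }
}

class ComparatorDemo {
    public static void main(String[] args) {
        BinaryComparator cmp = new PointComparatorFactory().createBinaryComparator();
        byte[] a = PointComparatorFactory.encode(1, 5);
        byte[] b = PointComparatorFactory.encode(1, 7);
        System.out.println(cmp.compare(a, 0, a.length, b, 0, b.length) < 0);
    }
}
```

The point of comparing on the serialized bytes is that records do not need to be deserialized into objects just to be sorted or joined; whether the real interfaces need anything beyond this would have to be checked against the codebase.
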
>
>
> 3) Data fetch and update : The VLDB'14 paper states that external data sets
> are read-only, static and without indices, but the current codebase has
> support for IExternalIndex and IIndexibleExternalDataSource, so presumably
> I can fetch records from an external data source (base table scan as well
> as index).
> Yes, we can access external data through indexes. Probably by the time the
> VLDB'14 paper was published, we didn't have this feature yet. You can check
> http://dl.acm.org/citation.cfm?id=2806428 which is about external data
> access and indexing.
>

Could you please add this paper to the Publications page ?
https://asterixdb.ics.uci.edu/publications.html
I was going by that information when I asked questions.

> Can I write to an external data source ?
> Right now, this is not supported because we can't provide the same
> transactional guarantees we can with internal datasets. This point probably
> needs to be discussed with Mike before doing anything about it. I believe
> we offer something else that can be utilized, which is writing query
> results into files, but I am not sure.
>
>
> 4) Hyracks runtime : For data retrieval, is it sufficient to implement the
> interfaces within asterix.external.api or does one also have to add some
> Hyracks operators which are constructed via contributeRuntimeOperator ?
>
> For data retrieval, one only needs to implement IExternalDataSourceFactory
> along with IRecordReader or IInputStreamProvider (depending on
> whether the source produces a stream or a set of records).
>
> For data parsing, one only needs to implement IDataParserFactory along
> with IRecordDataParser or IStreamDataParser (depending on whether the
> parsed data source produces a stream or a set of records).
>
> Let me know if I can provide more information.
> Cheers,
> Abdullah.
>
> P.S,
> Thanks for doing your work before asking. This is a great sign :)
>
> Amoudi, Abdullah.
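
To make the reader/parser split from point 4 concrete, here is a rough sketch for the record-oriented case. RecordReader and RecordParser below are hypothetical stand-ins, not the actual IRecordReader or IRecordDataParser interfaces from asterix.external.api, whose method signatures are not given in this thread; the in-memory line source and the "id,name" record format are likewise assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a record-oriented reader: hands out raw records.
interface RecordReader<T> {
    boolean hasNext();
    T next();
}

// Hypothetical stand-in for a record parser: turns one raw record into a value.
interface RecordParser<T, R> {
    R parse(T raw);
}

// Reader over in-memory lines; a real source would read files, HDFS blocks, etc.
final class LineRecordReader implements RecordReader<String> {
    private final Iterator<String> lines;

    LineRecordReader(List<String> lines) {
        this.lines = lines.iterator();
    }

    @Override
    public boolean hasNext() {
        return lines.hasNext();
    }

    @Override
    public String next() {
        return lines.next();
    }
}

// Parser for the assumed "id,name" format; yields a two-element array.
final class CsvPairParser implements RecordParser<String, String[]> {
    @Override
    public String[] parse(String raw) {
        String[] parts = raw.split(",", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected id,name but got: " + raw);
        }
        return new String[] { parts[0].trim(), parts[1].trim() };
    }
}

final class ExternalScan {
    // Drives the reader and feeds each raw record through the parser.
    static <T, R> List<R> scan(RecordReader<T> reader, RecordParser<T, R> parser) {
        List<R> out = new ArrayList<>();
        while (reader.hasNext()) {
            out.add(parser.parse(reader.next()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = scan(
                new LineRecordReader(Arrays.asList("1, alice", "2, bob")),
                new CsvPairParser());
        for (String[] row : rows) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```

Keeping retrieval and parsing behind separate interfaces means a new source can reuse an existing parser and vice versa, which seems to be the intent of the factory pairs listed above.
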
>
> On Sun, Feb 14, 2016 at 10:17 AM, Sandeep Joshi wrote:
>
> > Can someone describe the level of support for External data sets and the
> > future roadmap ?
> >
> > Let me divide the question into four broad issues:
> >
> > 1) Schema catalog : One would have to implement IMetadataProvider,
> > IDataSource, IDataSourceIndex and other related classes. Is there any
> > functionality missing from the current schema implementation for external
> > data sets ?
> >
> > One of the papers says that one should add comparators and hash functions
> > for any new data types introduced by the external data set. Which
> > interface does one have to implement for that ?
> >
> > 2) Query optimization : There is no cost-based optimizer yet within
> > Algebricks, therefore there is no API to support retrieval and use of
> > table statistics from an external data source.
> >
> > Is something planned in this regard ?
> >
> > 3) Data fetch and update : The VLDB'14 paper states that external data
> > sets are read-only, static and without indices, but the current codebase
> > has support for IExternalIndex and IIndexibleExternalDataSource, so
> > presumably I can fetch records from an external data source (base table
> > scan as well as index).
> >
> > Can I write to an external data source ?
> >
> > 4) Hyracks runtime : For data retrieval, is it sufficient to implement
> > the interfaces within asterix.external.api or does one also have to add
> > some Hyracks operators which are constructed via
> > contributeRuntimeOperator ?
> >
> > -Sandeep
> >
>

--001a1143fdde75039a052bf2a202--