Return-Path: X-Original-To: apmail-drill-commits-archive@www.apache.org Delivered-To: apmail-drill-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id F1BA517371 for ; Thu, 15 Jan 2015 05:11:48 +0000 (UTC) Received: (qmail 90294 invoked by uid 500); 15 Jan 2015 05:11:50 -0000 Delivered-To: apmail-drill-commits-archive@drill.apache.org Received: (qmail 90261 invoked by uid 500); 15 Jan 2015 05:11:50 -0000 Mailing-List: contact commits-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: commits@drill.apache.org Delivered-To: mailing list commits@drill.apache.org Received: (qmail 90245 invoked by uid 99); 15 Jan 2015 05:11:50 -0000 Received: from eris.apache.org (HELO hades.apache.org) (140.211.11.105) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 15 Jan 2015 05:11:50 +0000 Received: from hades.apache.org (localhost [127.0.0.1]) by hades.apache.org (ASF Mail Server at hades.apache.org) with ESMTP id 58B0DAC01D7; Thu, 15 Jan 2015 05:11:50 +0000 (UTC) Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: svn commit: r1651949 [2/13] - in /drill/site/trunk/content/drill: ./ blog/2014/11/19/sql-on-mongodb/ blog/2014/12/02/drill-top-level-project/ blog/2014/12/09/running-sql-queries-on-amazon-s3/ blog/2014/12/11/apache-drill-qa-panelist-spotlight/ blog/201... 
Date: Thu, 15 Jan 2015 05:11:48 -0000 To: commits@drill.apache.org From: tshiran@apache.org X-Mailer: svnmailer-1.0.9 Message-Id: <20150115051150.58B0DAC01D7@hades.apache.org> Added: drill/site/trunk/content/drill/docs/analyzing-yelp-json-data-with-apache-drill/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/analyzing-yelp-json-data-with-apache-drill/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/analyzing-yelp-json-data-with-apache-drill/index.html (added) +++ drill/site/trunk/content/drill/docs/analyzing-yelp-json-data-with-apache-drill/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,468 @@ + + + + + + + + +Analyzing Yelp JSON Data with Apache Drill - Apache Drill + + + + + + + + + + + + + + + + + + +
+ + + + + +
+

Analyzing Yelp JSON Data with Apache Drill

+ +
+ +

Apache Drill is one of the fastest growing open source projects, with the community making rapid progress with monthly releases. The key difference is Drill’s agility and flexibility. Along with meeting the table stakes for SQL-on-Hadoop, which is to achieve low latency performance at scale, Drill allows users to analyze the data without any ETL or up-front schema definitions. The data can be in any file format, such as text, JSON, or Parquet. Data can have simple types, such as strings, integers, and dates, or more complex multi-structured data, such as nested maps and arrays. Data can exist in any file system, local or distributed, such as HDFS, MapR FS, or S3. Drill has a “no schema” approach, which enables you to get value from your data in just a few minutes.

+ +

Let’s quickly walk through the steps required to install Drill and run it against the Yelp data set. The publicly available data set used for this example is downloadable from Yelp (business reviews) and is in JSON format.

+ +

Installing and Starting Drill

+ +

Step 1: Download Apache Drill onto your local machine

+ +

http://incubator.apache.org/drill/download/

+ +

You can also deploy Drill in clustered mode if you want to scale your environment.

+ +

Step 2: Extract the Drill tar file

+ +

tar -xvf apache-drill-0.6.0-incubating.tar

+ +

Step 3: Launch sqlline, a JDBC application that ships with Drill

+ +

bin/sqlline -u jdbc:drill:zk=local

+ +

That’s it! You are now ready to explore the data.

+ +

Let’s try out some SQL examples to understand how Drill makes raw data analysis extremely easy.

+ +

Note: You need to substitute your local path to the Yelp data set in the FROM clause of each query you run.

+ +

Querying Data with Drill

+ +

1. View the contents of the Yelp business data

+ +

0: jdbc:drill:zk=local> !set maxwidth 10000

+ +

0: jdbc:drill:zk=local> select * from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` limit 1;

+
+-------------+--------------+------------+------------+------------+------------+--------------+------------+------------+------------+------------+------------+------------+------------+---------------+
+| business_id | full_address |   hours    |     open    | categories |            city    | review_count |        name   | longitude  |   state  |   stars          |  latitude  | attributes |          type    | neighborhoods |
++-------------+--------------+------------+------------+------------+------------+--------------+------------+------------+------------+------------+------------+------------+------------+---------------+
+| vcNAWiLM4dR7D2nwwJ7nCA | 4840 E Indian School Rd
+Ste 101
+Phoenix, AZ 85018 | {"Tuesday":{"close":"17:00","open":"08:00"},"Friday":{"close":"17:00","open":"08:00"},"Monday":{"close":"17:00","open":"08:00"},"Wednesday":{"close":"17:00","open":"08:00"},"Thursday":{"close":"17:00","open":"08:00"},"Sunday":{},"Saturday":{}} | true              | ["Doctors","Health & Medical"] | Phoenix  | 7                   | Eric Goldberg, MD | -111.983758 | AZ       | 3.5                | 33.499313  | {"By Appointment Only":true,"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} | business   | []                  
 |
++-------------+--------------+------------+------------+------------+------------+--------------+------------+------------+------------+------------+------------+------------+------------+---------------+
+
+

Note: You can directly query self-describing files such as JSON, Parquet, and text. There is no need to create metadata definitions in the Hive metastore.

+ +

2. Explore the business data set further

+ +

Total reviews in the data set

+ +

0: jdbc:drill:zk=local> select sum(review_count) as totalreviews from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json`;

+
+--------------+
+| totalreviews |
++--------------+
+| 1236445      |
++--------------+
+
+

Top states and cities by number of businesses (count(*) counts business records, although the query aliases the column as totalreviews)

+ +

0: jdbc:drill:zk=local> select state, city, count(*) totalreviews from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` group by state, city order by count(*) desc limit 10;

+
+------------+------------+--------------+
+|   state    |    city    | totalreviews |
++------------+------------+--------------+
+| NV         | Las Vegas  | 12021        |
+| AZ         | Phoenix    | 7499         |
+| AZ         | Scottsdale | 3605         |
+| EDH        | Edinburgh  | 2804         |
+| AZ         | Mesa       | 2041         |
+| AZ         | Tempe      | 2025         |
+| NV         | Henderson  | 1914         |
+| AZ         | Chandler   | 1637         |
+| WI         | Madison    | 1630         |
+| AZ         | Glendale   | 1196         |
++------------+------------+--------------+
+
+

Average number of reviews per business star rating

+ +

0: jdbc:drill:zk=local> select stars,trunc(avg(review_count)) reviewsavg from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` group by stars order by stars desc;

+
+------------+------------+
+|   stars    | reviewsavg |
++------------+------------+
+| 5.0        | 8.0        |
+| 4.5        | 28.0       |
+| 4.0        | 48.0       |
+| 3.5        | 35.0       |
+| 3.0        | 26.0       |
+| 2.5        | 16.0       |
+| 2.0        | 11.0       |
+| 1.5        | 9.0        |
+| 1.0        | 4.0        |
++------------+------------+
+
+

Top businesses with high review counts (> 1000)

+ +

0: jdbc:drill:zk=local> select name, state, city, `review_count` from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` where review_count > 1000 order by `review_count` desc limit 10;

+
+------------+------------+------------+----------------------------+
+|    name                |   state     |    city     | review_count |
++------------+------------+------------+----------------------------+
+| Mon Ami Gabi           | NV          | Las Vegas  | 4084          |
+| Earl of Sandwich       | NV          | Las Vegas  | 3655          |
+| Wicked Spoon           | NV          | Las Vegas  | 3408          |
+| The Buffet             | NV          | Las Vegas  | 2791          |
+| Serendipity 3          | NV          | Las Vegas  | 2682          |
+| Bouchon                | NV          | Las Vegas  | 2419          |
+| The Buffet at Bellagio | NV          | Las Vegas  | 2404          |
+| Bacchanal Buffet       | NV          | Las Vegas  | 2369          |
+| The Cosmopolitan of Las Vegas | NV   | Las Vegas  | 2253          |
+| Aria Hotel & Casino    | NV          | Las Vegas  | 2224          |
++------------+------------+------------+----------------------------+
+
+

Saturday open and close times for a few businesses

+ +

0: jdbc:drill:zk=local> select b.name, b.hours.Saturday.`open`, b.hours.Saturday.`close` from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` b limit 10;

+
+------------+------------+----------------------------+
+|    name                    |   EXPR$1   |   EXPR$2   |
++------------+------------+----------------------------+
+| Eric Goldberg, MD          | 08:00      | 17:00      |
+| Pine Cone Restaurant       | null       | null       |
+| Deforest Family Restaurant | 06:00      | 22:00      |
+| Culver's                   | 10:30      | 22:00      |
+| Chang Jiang Chinese Kitchen| 11:00      | 22:00      |
+| Charter Communications     | null       | null       |
+| Air Quality Systems        | null       | null       |
+| McFarland Public Library   | 09:00      | 20:00      |
+| Green Lantern Restaurant   | 06:00      | 02:00      |
+| Spartan Animal Hospital    | 07:30      | 18:00      |
++------------+------------+----------------------------+
+
+

Note how Drill can traverse and reference multiple levels of nesting.
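To make the traversal concrete, here is a minimal Python sketch (plain Python, not Drill code) of what a path such as b.hours.Saturday.`open` does to a single record: it walks the nested maps one level at a time and yields a null value when a level is missing. The helper name get_path and the sample record are hypothetical and exist only for illustration.

```python
import json


def get_path(record, *keys):
    """Walk nested dicts key by key; return None if any level is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical record shaped like one line of the business JSON file.
sample = json.loads(
    '{"name": "Example Cafe",'
    ' "hours": {"Saturday": {"open": "09:00", "close": "17:00"}, "Sunday": {}}}'
)

saturday_open = get_path(sample, "hours", "Saturday", "open")
sunday_open = get_path(sample, "hours", "Sunday", "open")  # missing key
```

In this sketch a missing key plays the role of the null values in the query output above.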

+ +

3. Get the amenities of each business in the data set

+ +

Note that the attributes column in the Yelp business data set can hold a different set of elements in every row, reflecting that each business can list different amenities. Drill makes it easy to quickly access data sets with changing schemas.

+ +

First, change Drill to work in all text mode (so we can take a look at all of the data).

+
0: jdbc:drill:zk=local> alter system set `store.json.all_text_mode` = true;
++------------+-----------------------------------+
+|     ok     |  summary                          |
++------------+-----------------------------------+
+| true       | store.json.all_text_mode updated. |
++------------+-----------------------------------+
+
+

Then, query the attributes data.

+
0: jdbc:drill:zk=local> select attributes from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` limit 10;
++----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| attributes                                                                                                                                                                       |
++----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| {"By Appointment Only":"true","Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
+| {"Take-out":"true","Good For":{"dessert":"false","latenight":"false","lunch":"true","dinner":"false","breakfast":"false","brunch":"false"},"Caters":"false","Noise Level":"averag |
+| {"Take-out":"true","Good For":{"dessert":"false","latenight":"false","lunch":"false","dinner":"false","breakfast":"false","brunch":"true"},"Caters":"false","Noise Level":"quiet" |
+| {"Take-out":"true","Good For":{},"Takes Reservations":"false","Delivery":"false","Ambience":{},"Parking":{"garage":"false","street":"false","validated":"false","lot":"true","val |
+| {"Take-out":"true","Good For":{},"Ambience":{},"Parking":{},"Has TV":"false","Outdoor Seating":"false","Attire":"casual","Music":{},"Hair Types Specialized In":{},"Payment Types |
+| {"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
+| {"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
+| {"Good For":{},"Ambience":{},"Parking":{},"Wi-Fi":"free","Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
+| {"Take-out":"true","Good For":{"dessert":"false","latenight":"false","lunch":"false","dinner":"true","breakfast":"false","brunch":"false"},"Noise Level":"average","Takes Reserva |
+| {"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
++------------+
+
+

Turn off all text mode so we can continue to perform arithmetic operations on the data.

+
0: jdbc:drill:zk=local> alter system set `store.json.all_text_mode` = false;
++------------+------------+
+|     ok             |  summary   |
++------------+------------+
+| true              | store.json.all_text_mode updated. |
+
+

4. Explore the restaurant businesses in the data set

+ +

Number of restaurants in the data set

+
0: jdbc:drill:zk=local> select count(*) as TotalRestaurants from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants');
++------------------+
+| TotalRestaurants |
++------------------+
+| 14303            |
++------------------+
+
+
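As a rough mental model of the filter above, the following Python sketch counts the records whose categories array contains a given value. It assumes exact-match semantics on array elements, which is an assumption of this sketch and not a statement about how Drill implements repeated_contains. The function name count_with_category and the sample records are hypothetical.

```python
def count_with_category(records, keyword):
    """Count records whose 'categories' list contains keyword (exact match)."""
    return sum(1 for record in records if keyword in record.get("categories", []))


# Hypothetical records shaped like rows of the business data set.
businesses = [
    {"name": "A", "categories": ["Restaurants", "Bars"]},
    {"name": "B", "categories": ["Doctors", "Health & Medical"]},
    {"name": "C", "categories": ["Fast Food", "Restaurants"]},
    {"name": "D", "categories": []},
]

total_restaurants = count_with_category(businesses, "Restaurants")
```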

Top restaurants in number of reviews

+
0: jdbc:drill:zk=local> select name,state,city,`review_count` from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants') order by `review_count` desc limit 10
+. . . . . . . . . . . > ;
++------------+------------+------------+--------------+
+|    name         |   state    |    city     | review_count |
++------------+------------+------------+--------------+
+| Mon Ami Gabi | NV               | Las Vegas  | 4084         |
+| Earl of Sandwich | NV         | Las Vegas  | 3655         |
+| Wicked Spoon | NV             | Las Vegas  | 3408         |
+| The Buffet | NV       | Las Vegas  | 2791         |
+| Serendipity 3 | NV              | Las Vegas  | 2682         |
+| Bouchon       | NV         | Las Vegas  | 2419           |
+| The Buffet at Bellagio | NV             | Las Vegas  | 2404         |
+| Bacchanal Buffet | NV        | Las Vegas  | 2369         |
+| Hash House A Go Go | NV                | Las Vegas  | 2201         |
+| Mesa Grill | NV         | Las Vegas  | 2004         |
++------------+------------+------------+--------------+
+
+

Top restaurants in number of listed categories

+
0: jdbc:drill:zk=local> select name,repeated_count(categories) as categorycount, categories from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants') order by repeated_count(categories) desc limit 10;
++------------+---------------+------------+
+|    name         | categorycount | categories |
++------------+---------------+------------+
+| Binion's Hotel & Casino | 10           | ["Arts & Entertainment","Restaurants","Bars","Casinos","Event Planning & Services","Lounges","Nightlife","Hotels & Travel","American (N |
+| Stage Deli | 10        | ["Arts & Entertainment","Food","Hotels","Desserts","Delis","Casinos","Sandwiches","Hotels & Travel","Restaurants","Event Planning & Services"] |
+| Jillian's  | 9               | ["Arts & Entertainment","American (Traditional)","Music Venues","Bars","Dance Clubs","Nightlife","Bowling","Active Life","Restaurants"] |
+| Hotel Chocolat | 9               | ["Coffee & Tea","Food","Cafes","Chocolatiers & Shops","Specialty Food","Event Planning & Services","Hotels & Travel","Hotels","Restaurants"] |
+| Hotel du Vin & Bistro Edinburgh | 9           | ["Modern European","Bars","French","Wine Bars","Event Planning & Services","Nightlife","Hotels & Travel","Hotels","Restaurants" |
+| Elixir             | 9             | ["Arts & Entertainment","American (Traditional)","Music Venues","Bars","Cocktail Bars","Nightlife","American (New)","Local Flavor","Restaurants"] |
+| Tocasierra Spa and Fitness | 8                  | ["Beauty & Spas","Gyms","Medical Spas","Health & Medical","Fitness & Instruction","Active Life","Day Spas","Restaurants"] |
+| Costa Del Sol At Sunset Station | 8            | ["Steakhouses","Mexican","Seafood","Event Planning & Services","Hotels & Travel","Italian","Restaurants","Hotels"] |
+| Scottsdale Silverado Golf Club | 8              | ["Fashion","Shopping","Sporting Goods","Active Life","Golf","American (New)","Sports Wear","Restaurants"] |
+| House of Blues | 8               | ["Arts & Entertainment","Music Venues","Restaurants","Hotels","Event Planning & Services","Hotels & Travel","American (New)","Nightlife"] |
++------------+---------------+------------+
+
+

Top first-listed categories by number of businesses

+
0: jdbc:drill:zk=local> select categories[0], count(categories[0]) as categorycount from dfs.`/users/nrentachintala/Downloads/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json` group by categories[0] 
+order by count(categories[0]) desc limit 10;
++------------+---------------+
+|   EXPR$0   | categorycount |
++------------+---------------+
+| Food       | 4294          |
+| Shopping   | 1885          |
+| Active Life | 1676          |
+| Bars       | 1366          |
+| Local Services | 1351          |
+| Mexican    | 1284          |
+| Hotels & Travel | 1283          |
+| Fast Food  | 963           |
+| Arts & Entertainment | 906           |
+| Hair Salons | 901           |
++------------+---------------+
+
+
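The grouping on categories[0] can be pictured with a short Python sketch: take the first element of each record's array, skip records with an empty array, and count occurrences. This is an illustration in plain Python under those assumptions, not Drill's execution model; the function name first_category_counts and the sample records are hypothetical.

```python
from collections import Counter


def first_category_counts(records):
    """Count businesses by the first entry of their 'categories' list."""
    counts = Counter()
    for record in records:
        categories = record.get("categories") or []
        if categories:  # records with an empty list have no first category
            counts[categories[0]] += 1
    # most_common() sorts by count, highest first
    return counts.most_common()


# Hypothetical records for illustration only.
businesses = [
    {"name": "A", "categories": ["Food", "Bars"]},
    {"name": "B", "categories": ["Food"]},
    {"name": "C", "categories": []},
    {"name": "D", "categories": ["Shopping", "Food"]},
]

ranking = first_category_counts(businesses)
```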

5. Explore the Yelp reviews dataset and combine with the businesses.

+ +

Take a look at the contents of the Yelp reviews dataset.

+
0: jdbc:drill:zk=local> select * from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_review.json` limit 1;
++------------+------------+------------+------------+------------+------------+------------+-------------+
+|   votes          |  user_id   | review_id  |   stars    |            date    |    text           |          type    | business_id |
++------------+------------+------------+------------+------------+------------+------------+-------------+
+| {"funny":0,"useful":2,"cool":1} | Xqd0DzHaiyRqVH3WRG7hzg | 15SdjuK7DmYqUAj6rjGowg | 5            | 2007-05-17 | dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank. | review | vcNAWiLM4dR7D2nwwJ7nCA |
++------------+------------+------------+------------+------------+------------+------------+-------------+
+
+

Top businesses with cool rated reviews

+ +

Note that we are combining the Yelp business data set, which has the overall review_count, with the Yelp review data, which holds additional details on each of the reviews themselves.

+
0: jdbc:drill:zk=local> Select b.name from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` b where b.business_id in (SELECT r.business_id FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_review.json` r
+GROUP BY r.business_id having sum(r.votes.cool) > 2000 order by sum(r.votes.cool)  desc);
++------------+
+|    name         |
++------------+
+| Earl of Sandwich |
+| XS Nightclub |
+| The Cosmopolitan of Las Vegas |
+| Wicked Spoon |
++------------+
+
+

Create a view with the combined business and reviews data sets

+ +

Note that Drill views are lightweight, and can just be created in the local file system. Drill in standalone mode comes with a dfs.tmp workspace, which we can use to create views (or you can define your own workspaces on a local or distributed file system). If you want to persist the data physically instead of in a logical view, you can use CREATE TABLE AS SELECT syntax.

+
0: jdbc:drill:zk=local> create or replace view dfs.tmp.businessreviews as Select b.name,b.stars,b.state,b.city,r.votes.funny,r.votes.useful,r.votes.cool, r.`date` from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` b , dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_review.json` r where r.business_id=b.business_id;
++------------+------------+
+|     ok             |  summary   |
++------------+------------+
+| true              | View 'businessreviews' created successfully in 'dfs.tmp' schema |
++------------+------------+
+
+
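The view pairs each review with its business by matching business_id values. A minimal Python sketch of that inner join, written for illustration only and not reflecting how Drill plans or executes the join, could look like this. The function name join_reviews and the sample records are hypothetical.

```python
def join_reviews(businesses, reviews):
    """Inner-join reviews to businesses on business_id."""
    by_id = {b["business_id"]: b for b in businesses}
    joined = []
    for review in reviews:
        business = by_id.get(review["business_id"])
        if business is None:
            continue  # inner join: drop reviews without a matching business
        joined.append({
            "name": business["name"],
            "stars": business["stars"],
            "cool": review["votes"]["cool"],
            "date": review["date"],
        })
    return joined


# Hypothetical sample data shaped like the two JSON files.
businesses = [{"business_id": "b1", "name": "Example Cafe", "stars": 4.0}]
reviews = [
    {"business_id": "b1", "votes": {"funny": 0, "useful": 2, "cool": 1}, "date": "2007-05-17"},
    {"business_id": "b9", "votes": {"funny": 1, "useful": 0, "cool": 0}, "date": "2008-01-01"},
]

rows = join_reviews(businesses, reviews)
```

Each review produces at most one output row in this sketch, so the row count of the join tracks the number of reviews that have a matching business.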

Let’s get the total number of records from the view.

+
0: jdbc:drill:zk=local> select count(*) as Total from dfs.tmp.businessreviews;
++------------+
+|   Total   |
++------------+
+| 1125458       |
++------------+
+
+

In addition to these queries, you can get many deeper insights using Drill’s SQL functionality. If you are not comfortable with writing queries manually, you can use BI/analytics tools such as Tableau or MicroStrategy to query raw files, Hive or HBase data, or Drill-created views directly using the Drill ODBC/JDBC drivers.

+ +

The goal of Apache Drill is to provide freedom and flexibility in exploring data in ways we have never seen before with SQL technologies. The community is working on more exciting features around nested data and supporting data with changing schemas in upcoming releases.

+ +

As an example, a new FLATTEN function is in development (an upcoming feature in 0.7). This function can be used to dynamically rationalize semi-structured data so you can apply even deeper SQL functionality. Here is a sample query:

+ +

Get a flattened list of categories for each business

+
0: jdbc:drill:zk=local> select name, flatten(categories) as category from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json`  limit 20;
++------------+------------+
+|    name         |   category   |
++------------+------------+
+| Eric Goldberg, MD | Doctors          |
+| Eric Goldberg, MD | Health & Medical |
+| Pine Cone Restaurant | Restaurants |
+| Deforest Family Restaurant | American (Traditional) |
+| Deforest Family Restaurant | Restaurants |
+| Culver's   | Food       |
+| Culver's   | Ice Cream & Frozen Yogurt |
+| Culver's   | Fast Food  |
+| Culver's   | Restaurants |
+| Chang Jiang Chinese Kitchen | Chinese    |
+| Chang Jiang Chinese Kitchen | Restaurants |
+| Charter Communications | Television Stations |
+| Charter Communications | Mass Media |
+| Air Quality Systems | Home Services |
+| Air Quality Systems | Heating & Air Conditioning/HVAC |
+| McFarland Public Library | Libraries  |
+| McFarland Public Library | Public Services & Government |
+| Green Lantern Restaurant | American (Traditional) |
+| Green Lantern Restaurant | Restaurants |
+| Spartan Animal Hospital | Veterinarians |
++------------+------------+
+
+
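The effect shown in the output above, one row per array element with the other columns repeated, can be sketched in a few lines of Python. This is an illustration of the idea only, not Drill's implementation of FLATTEN; the function name flatten_categories is hypothetical, and the two sample records reuse names and categories from the output above.

```python
def flatten_categories(records):
    """Yield one (name, category) row per element of each 'categories' list."""
    for record in records:
        for category in record.get("categories", []):
            yield {"name": record["name"], "category": category}


# Two records modeled on the first rows of the output above.
businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
]

flat_rows = list(flatten_categories(businesses))
```

Once the array is expanded into rows like this, ordinary grouping and counting can be applied to the individual categories.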

Top categories used in business reviews

+
0: jdbc:drill:zk=local> select celltbl.catl, count(celltbl.catl) categorycnt from (select flatten(categories) catl from dfs.`/users/nrentachintala/Downloads/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json` )  celltbl group by celltbl.catl order by count(celltbl.catl) desc limit 10 ;
++------------+-------------+
+|    catl    | categorycnt |
++------------+-------------+
+| Restaurants | 14303       |
+| Shopping   | 6428        |
+| Food       | 5209        |
+| Beauty & Spas | 3421        |
+| Nightlife  | 2870        |
+| Bars       | 2378        |
+| Health & Medical | 2351        |
+| Automotive | 2241        |
+| Home Services | 1957        |
+| Fashion    | 1897        |
++------------+-------------+
+
+

Stay tuned for more features and upcoming activities in the Drill community.

+ +

To learn more about Drill, please refer to the following resources:

+ + +
+ + + + + + + + Added: drill/site/trunk/content/drill/docs/aol-search/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/aol-search/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/aol-search/index.html (added) +++ drill/site/trunk/content/drill/docs/aol-search/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,138 @@ + + + + + + + + +AOL Search - Apache Drill + + + + + + + + + + + + + + + + + + +
+ + + + + +
+

AOL Search

+ +
+ +

Quick Stats

+ +

The AOL Search dataset is a collection of real query log data that is based on real users.

+ +

The Data Source

+ +

The dataset consists of 20M Web queries from 650k users over a period of three months, 440MB in total and available for download. The format used in the dataset is:

+
AnonID, Query, QueryTime, ItemRank, ClickURL
+
+

... with:

+ +
    +
  • AnonID, an anonymous user ID number.
  • Query, the query issued by the user, case shifted with most punctuation removed.
  • QueryTime, the time at which the query was submitted for search.
  • ItemRank, if the user clicked on a search result, the rank of the item on which they clicked is listed.
  • ClickURL, if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.
+ +

Each line in the data represents one of two types of events:

+ +
    +
  • A query that was NOT followed by the user clicking on a result item.
  • A click through on an item in the result list returned from a query.
+ +

In the first case (query only), there is data in only the first three columns. In the second case (click through), there is data in all five columns. For click-through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be one line in the data for each click (for example, TWO lines to represent two click events).
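The two event types can be told apart by checking whether the last two columns are populated. The Python sketch below assumes the fields are tab-separated, which the description above does not state, so treat the delimiter as an assumption. The function name parse_line and the sample lines are hypothetical.

```python
FIELDS = ("AnonID", "Query", "QueryTime", "ItemRank", "ClickURL")


def parse_line(line, delimiter="\t"):
    """Split one log line into named fields and label the event type."""
    parts = line.rstrip("\n").split(delimiter)
    # Query-only lines carry three columns; pad so every record has five keys.
    parts += [""] * (len(FIELDS) - len(parts))
    record = dict(zip(FIELDS, parts))
    is_click = bool(record["ItemRank"]) and bool(record["ClickURL"])
    record["event"] = "click" if is_click else "query"
    return record


# Hypothetical sample lines, not taken from the real data set.
query_only = parse_line("142\tapache drill\t2006-03-01 10:00:00\n")
click_through = parse_line(
    "142\tapache drill\t2006-03-01 10:00:05\t1\thttp://example.org\n"
)
```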

+ +

The Queries

+ +

Interesting queries include, for example:

+ +
    +
  • Users querying for topic X
  • Users that click on the first (second, third) ranked item
  • Top 10 domains searched
  • Top 10 domains clicked
+
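As one way to picture the "top 10 domains clicked" question from the list above, the following Python sketch counts ClickURL values across click-through rows and returns the most frequent ones. It works on rows that are already split into the five columns described earlier; the function name top_clicked_domains and the sample rows are hypothetical.

```python
from collections import Counter


def top_clicked_domains(rows, limit=10):
    """Return the most frequently clicked domains as (domain, clicks) pairs."""
    counts = Counter(
        click_url
        for (_anon_id, _query, _query_time, _item_rank, click_url) in rows
        if click_url  # query-only rows have an empty ClickURL
    )
    return counts.most_common(limit)


# Hypothetical rows: (AnonID, Query, QueryTime, ItemRank, ClickURL).
rows = [
    ("1", "drill docs", "2006-03-01 10:00:00", "1", "http://example.org"),
    ("1", "drill docs", "2006-03-01 10:00:05", "2", "http://example.com"),
    ("2", "json", "2006-03-02 11:00:00", "", ""),
    ("3", "sql", "2006-03-03 12:00:00", "1", "http://example.org"),
]

top_domains = top_clicked_domains(rows)
```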
+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-0-4-0-release-notes/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-0-4-0-release-notes/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-0-4-0-release-notes/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-0-4-0-release-notes/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,127 @@ + + + + + + + + +Apache Drill 0.4.0 Release Notes - Apache Drill + + + + + + + + + + + + + + + + + + +
+ + + + + +
+

Apache Drill 0.4.0 Release Notes

+ +
+ +

The 0.4.0 release is a developer preview release, designed to help enthusiasts start to work with and experiment with Drill. It is the first Drill release that provides distributed query execution.

+ +

This release is built upon more than 800 JIRAs. It is a pre-beta release on the way towards Drill. As a developer snapshot, the release contains a large number of outstanding bugs that will make some use cases challenging. Feel free to consult outstanding issues targeted for the 0.5.0 release to see whether your use case is affected.

+ +

To read more about this release and new features introduced, please view the 0.4.0 announcement blog entry.

+ +

The release is available as both binary and source tarballs. In both cases, these are compiled against Apache Hadoop. Drill has also been tested against MapR, Cloudera and Hortonworks Hadoop distributions and there are associated build profiles or JIRAs that can help you run against your preferred distribution.

+ +

Some Key Notes & Limitations

+ +
    +
  • The current release supports in-memory and beyond-memory execution. However, users must disable memory-intensive hash aggregate and hash join operations to leverage this functionality.
  • In many cases, merge join operations return incorrect results.
  • Use of a local filter in a join “on” clause when using left, right or full outer joins may result in incorrect results.
  • Because of known memory leaks and memory overrun issues, you may need more memory and you may need to restart the system in some cases.
  • Some types of complex expressions, especially those involving empty arrays, may fail or return incorrect results.
  • While the Drill execution engine supports dynamic schema changes during the course of a query, some operators have yet to implement support for this behavior (such as Sort). Other operations (such as streaming aggregate) may have partial support that leads to unexpected results.
  • Protobuf, UDF, and query plan interfaces, along with all other interfaces, are subject to change in incompatible ways.
  • Multiplication of some types of DECIMAL(28+,*) returns incorrect results.
+
+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-0-5-0-release-notes/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-0-5-0-release-notes/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-0-5-0-release-notes/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-0-5-0-release-notes/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,112 @@ + + + + + + + + +Apache Drill 0.5.0 Release Notes - Apache Drill + + + + + + + + + + + + + + + + + + +
+ + + + + +
+

Apache Drill 0.5.0 Release Notes

+ +
+ +

Apache Drill 0.5.0, the first beta release for Drill, is designed to help enthusiasts start working and experimenting with Drill. It also continues the Drill monthly release cycle as we drive towards general availability.

+ +

The 0.5.0 release is primarily a bug fix release, with more than 100 JIRAs closed, but there are some notable features. For information about the features, see the Apache Drill Blog for the 0.5.0 release.

+ +

This release is available as binary and source tarballs that are compiled against Apache Hadoop. Drill has been tested against MapR, Cloudera, and Hortonworks Hadoop distributions. There are associated build profiles and JIRAs that can help you run Drill against your preferred distribution.

+ +

Apache Drill 0.5.0 Key Notes and Limitations

+ +
    +
  • The current release supports in-memory and beyond-memory execution. However, you must disable memory-intensive hash aggregate and hash join operations to leverage this functionality.
  • While the Drill execution engine supports dynamic schema changes during the course of a query, some operators have yet to implement support for this behavior, such as Sort. Other operations, such as streaming aggregate, may have partial support that leads to unexpected results.
  • There are known issues with joining text files without using an intervening view. See DRILL-1401 for more information.
+
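The first limitation above can be sketched in code. This is a minimal illustration, not an official procedure: it only builds the session statements that would turn off the hash-based operators. The option names `planner.enable_hashagg` and `planner.enable_hashjoin` are assumptions taken from Drill's planner options and should be checked against your build; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BeyondMemorySession {
    // Assumed option names; verify them against the options your Drill build exposes.
    static final String[] HASH_OPTIONS = {
        "planner.enable_hashagg",
        "planner.enable_hashjoin"
    };

    /** Builds one ALTER SESSION statement per hash-based operator to disable. */
    static List<String> disableHashOperators() {
        List<String> statements = new ArrayList<>();
        for (String option : HASH_OPTIONS) {
            statements.add("ALTER SESSION SET `" + option + "` = false");
        }
        return statements;
    }

    public static void main(String[] args) {
        // Print the statements so they can be pasted into SQLLine.
        for (String sql : disableHashOperators()) {
            System.out.println(sql + ";");
        }
    }
}
```

The printed statements would be run once per session before the memory-heavy query.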
+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-0-6-0-release-notes-apache-drill-alpha/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-0-6-0-release-notes-apache-drill-alpha/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-0-6-0-release-notes-apache-drill-alpha/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-0-6-0-release-notes-apache-drill-alpha/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,118 @@ + + + + + + + + +Apache Drill 0.6.0 Release Notes (Apache Drill Alpha) - Apache Drill + + + + + + + + + + + + + + + + + + +
+ + + + + +
+

Apache Drill 0.6.0 Release Notes (Apache Drill Alpha)

+ +
+ +

Apache Drill 0.6.0, the second beta release for Drill, is designed to help enthusiasts start working and experimenting with Drill. It also continues the Drill monthly release cycle as we drive towards general availability.

+ +

This release is available as binary and source tarballs that are compiled against Apache Hadoop. Drill has been tested against MapR, Cloudera, and Hortonworks Hadoop distributions. There are associated build profiles and JIRAs that can help you run Drill against your preferred distribution.

+ +

Apache Drill 0.6.0 Key Features

+ +

This release is primarily a bug fix release, with more than 30 JIRAs closed, but there are some notable features:

+ +
    +
  • Direct ANSI SQL access to MongoDB, using the latest MongoDB Plugin for Apache Drill
  • +
  • Filesystem query performance improvements with partition pruning
  • +
  • Ability to use the file system as a persistent store for query profiles and diagnostic information
  • +
  • Window function support (alpha)
  • +
+ +

Apache Drill 0.6.0 Key Notes and Limitations

+ +
    +
  • The current release supports in-memory and beyond-memory execution. However, you must disable memory-intensive hash aggregate and hash join operations to leverage this functionality.
  • +
  • While the Drill execution engine supports dynamic schema changes during the course of a query, some operators have yet to implement support for this behavior, such as Sort. Other operations, such as streaming aggregate, may have partial support that leads to unexpected results.
  • +
+
+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-0-7-0-release-notes-apache-drill-alpha/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-0-7-0-release-notes-apache-drill-alpha/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-0-7-0-release-notes-apache-drill-alpha/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-0-7-0-release-notes-apache-drill-alpha/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,132 @@ + + + + + + + + +Apache Drill 0.7.0 Release Notes (Apache Drill Alpha) - Apache Drill + + + + + + + + + + + + + + + + + + +
+ + + + + +
+

Apache Drill 0.7.0 Release Notes (Apache Drill Alpha)

+ +
+ +

Apache Drill 0.7.0, the third beta release for Drill, is designed to help enthusiasts start working and experimenting with Drill. It also continues the Drill monthly release cycle as we drive towards general availability.

+ +

This release is available as binary and source tarballs that are compiled against Apache Hadoop. Drill has been tested against MapR, Cloudera, and Hortonworks Hadoop distributions. There are associated build profiles and JIRAs that can help you run Drill against your preferred distribution.

+ +

Apache Drill 0.7.0 Key Features

+ +
    +
  • No more dependency on UDP multicast, which makes it possible for Drill to work well in the following scenarios:

    + +
      +
    • UDP multicast not enabled (as in EC2)
    • +
    • Cluster spans multiple subnets
    • +
    • Cluster has a multihomed configuration
    • +
  • +
  • New functions to natively work with nested data - KVGen and Flatten

  • +
  • Support for Hive 0.13 (Hive 0.12 is no longer supported with Drill)

  • +
  • Improved performance when querying Hive tables and the file system through partition pruning

  • +
  • Improved performance for HBase with LIKE operator pushdown

  • +
  • Improved memory management

  • +
  • Drill web UI monitoring and query profile improvements

  • +
  • Ability to parse files without explicit extensions by using a default storage format specification

  • +
  • Fixes for dealing with complex/nested data objects in Parquet/JSON

  • +
  • Fast schema return - Improved experience working with BI/query tools by returning metadata quickly

  • +
  • Several hang-related fixes

  • +
  • Parquet writer fixes for handling large datasets

  • +
  • Stability improvements in ODBC and JDBC drivers

  • +
+ +
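To make the KVGen and Flatten item above more concrete, here is a conceptual sketch in plain Java of what the two functions do to nested data. It is not Drill's implementation and uses no Drill API: the class `NestedDataSketch` and its methods are hypothetical, and rows are modeled as ordinary maps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NestedDataSketch {
    /** KVGen-like step: turn a map with arbitrary keys into a list of {key, value} pairs. */
    static List<Map<String, Object>> kvgen(Map<String, Object> map) {
        List<Map<String, Object>> pairs = new ArrayList<>();
        for (Map.Entry<String, Object> entry : map.entrySet()) {
            Map<String, Object> pair = new LinkedHashMap<>();
            pair.put("key", entry.getKey());
            pair.put("value", entry.getValue());
            pairs.add(pair);
        }
        return pairs;
    }

    /** Flatten-like step: emit one row per element of the repeated column, copying the other columns. */
    static List<Map<String, Object>> flatten(Map<String, Object> row, String column) {
        List<Map<String, Object>> rows = new ArrayList<>();
        List<?> elements = (List<?>) row.get(column);
        for (Object element : elements) {
            Map<String, Object> copy = new LinkedHashMap<>(row);
            copy.put(column, element); // replace the list with a single element
            rows.add(copy);
        }
        return rows;
    }
}
```

In Drill itself the same effect is requested in SQL, by applying the functions to a nested column in the select list.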

Apache Drill 0.7.0 Key Notes and Limitations

+ +
    +
  • The current release supports in-memory and beyond-memory execution. However, you must disable memory-intensive hash aggregate and hash join operations to leverage this functionality.
  • +
  • While the Drill execution engine supports dynamic schema changes during the course of a query, some operators have yet to implement support for this behavior, such as Sort. Other operations, such as streaming aggregate, may have partial support that leads to unexpected results.
  • +
+
+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-contribution-guidelines/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-contribution-guidelines/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-contribution-guidelines/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-contribution-guidelines/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,256 @@ + + + + + + + + +Apache Drill Contribution Guidelines - Apache Drill + + + + + + + + + + + + + + + + + + +
+ + + + + +
+

Apache Drill Contribution Guidelines

+ +
+ +
    +
  • Fixing JIRAs
  • +
  • SQL functions
  • +
  • Support for new file format readers/writers
  • +
  • Support for new data sources
  • +
  • New query language parsers
  • +
  • Application interfaces + +
      +
    • BI Tool testing
    • +
  • +
  • General CLI improvements
  • +
  • Ecosystem integrations + +
      +
    • MapReduce
    • +
    • Hive views
    • +
    • YARN
    • +
    • Spark
    • +
    • Hue
    • +
    • Phoenix
    • +
  • +
+ +

Fixing JIRAs

+ +

This is a good place to begin if you are new to Drill. Feel free to pick issues from the Drill JIRA list. When you pick an issue, assign it to yourself, inform the team, and start fixing it.

+ +

For any questions, seek help from the team by sending email to drill-dev@incubator.apache.org.

+ +

https://issues.apache.org/jira/browse/DRILL/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

+ +

SQL functions

+ +

A simple next place to start is implementing a DrillFunc. DrillFuncs are the way Drill expresses all scalar functions (UDF or system). First, file a JIRA for a DrillFunc that Drill does not yet have but should (referencing the capabilities of something like Postgres or SQL Server, or your own use case). Then try to implement it.

+ +

One example DrillFunc:
https://github.com/apache/incubator-drill/blob/103072a619741d5e228fdb181501ec2f82e111a3/sandbox/prototype/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/ComparisonFunctions.java
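For readers who want a feel for the shape of a scalar function before opening the linked file, the following is a self-contained sketch. It does not use Drill's classes: `IntHolder`, `BitHolder`, and `ScalarFunc` are hypothetical stand-ins defined here so the example compiles without Drill on the classpath. The assumption being illustrated is that inputs and outputs travel in small holder objects and that the function body runs once per row.

```java
final class DrillFuncShapeSketch {
    /** Hypothetical stand-in for a value holder carrying one int. */
    static final class IntHolder { int value; }

    /** Hypothetical stand-in for a holder carrying a boolean as 0 or 1. */
    static final class BitHolder { int value; }

    /** Hypothetical contract: setup() runs once, eval() runs once per row. */
    interface ScalarFunc {
        void setup();
        void eval();
    }

    /** An integer equality function written against the stand-in contract. */
    static final class IntEqual implements ScalarFunc {
        final IntHolder left = new IntHolder();
        final IntHolder right = new IntHolder();
        final BitHolder out = new BitHolder();

        @Override public void setup() { /* nothing to initialize */ }

        @Override public void eval() {
            out.value = (left.value == right.value) ? 1 : 0;
        }
    }

    /** Drives the function the way an engine might for a single row. */
    static boolean intEqual(int a, int b) {
        IntEqual fn = new IntEqual();
        fn.setup();
        fn.left.value = a;
        fn.right.value = b;
        fn.eval();
        return fn.out.value == 1;
    }

    public static void main(String[] args) {
        System.out.println(intEqual(3, 3));
        System.out.println(intEqual(3, 4));
    }
}
```

A real DrillFunc would implement Drill's own function interface and declare its inputs and output with Drill's annotations; consult the linked source for the exact types.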

+ +

Additional ideas on functions that can be added to Drill SQL support

+ +
    +
  • Madlib integration
  • +
  • Machine learning functions
  • +
  • Approximate aggregate functions (such as what is available in BlinkDB)
  • +
+ +

Support for new file format readers/writers

+ +

Currently, Drill natively supports the text, JSON, and Parquet file formats when interacting with the file system. More readers/writers can be introduced by implementing custom storage plugins. Example formats include the following:

+ +
    +
  • AVRO
  • +
  • Sequence
  • +
  • RC
  • +
  • ORC
  • +
  • Protobuf
  • +
  • XML
  • +
  • Thrift
  • +
  • ....
  • +
+ +

Support for new data sources

+ +

Implement custom storage plugins for the following non-Hadoop data sources:

+ +
    +
  • NoSQL databases (such as Mongo, Cassandra, Couch, etc.)
  • +
  • Search engines (such as Solr, Lucidworks, Elasticsearch, etc.)
  • +
  • SQL databases (MySQL, Postgres, etc.)
  • +
  • Generic JDBC/ODBC data sources
  • +
  • HTTP URL
  • +
  • ----
  • +
+ +

New query language parsers

+ +

Drill exposes strongly typed JSON APIs for logical and physical plans (plan syntax at https://docs.google.com/a/maprtech.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit#heading=h.n9gdb1ek71hf). Drill provides a SQL language parser today, but any language parser that can generate logical/physical plans can use Drill's power on the backend as the distributed, low-latency query execution engine, along with its support for self-describing data and complex/multi-structured data.

+ +
    +
  • Pig parser: Use Pig as the language to query data from Drill. Great for existing Pig users.
  • +
  • Hive parser: Use HiveQL as the language to query data from Drill. Great for existing Hive users.
  • +
+ +

Application interfaces

+ +

Drill currently provides JDBC/ODBC drivers for applications to interact with it, along with a basic version of a REST API and a C++ API. The following list provides a few possible application interface opportunities:

+ + + +

BI Tool testing

+ +

Drill provides JDBC/ODBC drivers to connect to BI tools. We need to make sure Drill works with all major BI tools. Running a quick sanity test with your favorite BI tool is a good way to learn Drill and to uncover any issues that prevent the two from working together.
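Before pointing a BI tool at Drill, it can help to confirm that the JDBC path works on its own. The sketch below is one possible smoke test using only `java.sql`. It assumes the Drill JDBC driver jar is on the classpath, that the connection URL takes the form `jdbc:drill:zk=<host>:<port>`, and that the sample table `cp.employee.json` is available; the class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class DrillJdbcSmokeTest {
    /** Builds a ZooKeeper-based Drill JDBC URL (assumed URL format). */
    static String zkUrl(String host, int port) {
        return "jdbc:drill:zk=" + host + ":" + port;
    }

    /** Runs a query and returns how many rows came back. */
    static int countRows(String url, String sql) throws SQLException {
        try (Connection connection = DriverManager.getConnection(url);
             Statement statement = connection.createStatement();
             ResultSet results = statement.executeQuery(sql)) {
            int rows = 0;
            while (results.next()) {
                rows++;
            }
            return rows;
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 2) {
            System.out.println("usage: DrillJdbcSmokeTest <zk-host> <zk-port>");
            return;
        }
        String url = zkUrl(args[0], Integer.parseInt(args[1]));
        // Assumed sample table; substitute any table your cluster can read.
        int rows = countRows(url, "SELECT * FROM cp.`employee.json` LIMIT 5");
        System.out.println(rows + " rows returned from " + url);
    }
}
```

If this succeeds but the BI tool fails, the problem is more likely in the tool's driver configuration than in Drill itself.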

+ +

General CLI improvements

+ +

Currently, Drill uses SQLLine as the CLI. The goal of this effort is to improve the CLI experience by adding functionality such as executing statements from a file, writing results to a file, displaying version information, and so on.
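As a starting point for the "statements from a file" idea, the sketch below shows the simplest possible script splitter. It is deliberately naive and purely illustrative: it splits on every semicolon, so it would mishandle semicolons inside string literals or comments. The class name is hypothetical and nothing here is part of SQLLine or Drill.

```java
import java.util.ArrayList;
import java.util.List;

final class SqlScriptSplitter {
    /** Splits script text into trimmed, non-empty statements on ';' (naive). */
    static List<String> split(String script) {
        List<String> statements = new ArrayList<>();
        for (String part : script.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                statements.add(trimmed);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        String script = "USE dfs.tmp;\n\nSHOW TABLES;\n";
        for (String statement : split(script)) {
            System.out.println(statement);
        }
    }
}
```

A production version would need a real tokenizer so that quoted text and comments are left intact.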

+ +

Ecosystem integrations

+ +

MapReduce

+ +

Allow the result set from Drill queries to be used as input to Hadoop/MapReduce jobs.

+ +

Hive views

+ +

Query data from existing Hive views using Drill queries. Drill needs to parse the HiveQL and translate it appropriately (into Drill's SQL or logical/physical plans) to execute the requests.

+ +

YARN

+ +

https://issues.apache.org/jira/browse/DRILL-1170

+ +

Spark

+ +

Provide the ability to invoke Drill queries as part of Apache Spark programs. This lets Spark developers and users leverage the richness of Drill's query layer, both for data source access and as a low-latency execution engine.

+ +

Hue

+ +

Hue is a GUI for users to interact with various Hadoop ecosystem components (such as Hive, Oozie, Pig, HBase, and Impala). The goal of this project is to expose Drill as an application inside Hue so users can explore Drill metadata and run SQL queries.

+ +

Phoenix

+ +

Phoenix provides a low-latency query layer on HBase for operational applications. The goal of this effort is to explore opportunities for integrating Phoenix with Drill.

+
+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-documentation/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-documentation/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-documentation/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-documentation/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,92 @@ + + + + + + + + +Apache Drill Documentation - Apache Drill + + + + + + + + + + + + + + + + + + +
+ + + + + +
+

Apache Drill Documentation

+ +
+ +

The Drill documentation covers how to install, configure, and use Apache Drill.

+
+ + + + + + + +