Reply-To: commits@drill.apache.org Delivered-To: mailing list commits@drill.apache.org Received: (qmail 90245 invoked by uid 99); 15 Jan 2015 05:11:50 -0000 Received: from eris.apache.org (HELO hades.apache.org) (140.211.11.105) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 15 Jan 2015 05:11:50 +0000 Received: from hades.apache.org (localhost [127.0.0.1]) by hades.apache.org (ASF Mail Server at hades.apache.org) with ESMTP id 58B0DAC01D7; Thu, 15 Jan 2015 05:11:50 +0000 (UTC) Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: svn commit: r1651949 [2/13] - in /drill/site/trunk/content/drill: ./ blog/2014/11/19/sql-on-mongodb/ blog/2014/12/02/drill-top-level-project/ blog/2014/12/09/running-sql-queries-on-amazon-s3/ blog/2014/12/11/apache-drill-qa-panelist-spotlight/ blog/201... Date: Thu, 15 Jan 2015 05:11:48 -0000 To: commits@drill.apache.org From: tshiran@apache.org X-Mailer: svnmailer-1.0.9 Message-Id: <20150115051150.58B0DAC01D7@hades.apache.org> Added: drill/site/trunk/content/drill/docs/analyzing-yelp-json-data-with-apache-drill/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/analyzing-yelp-json-data-with-apache-drill/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/analyzing-yelp-json-data-with-apache-drill/index.html (added) +++ drill/site/trunk/content/drill/docs/analyzing-yelp-json-data-with-apache-drill/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,468 @@ + + + + + + + + +Analyzing Yelp JSON Data with Apache Drill - Apache Drill + + + + + + + + + + + + + + + + + + +

+ +

+ + + +

Analyzing Yelp JSON Data with Apache Drill

+ +

Apache Drill is one of the fastest growing open source projects, with the community making rapid progress with monthly releases. What sets Drill apart is its agility and flexibility. Along with meeting the table stakes for SQL-on-Hadoop, which is to achieve low latency performance at scale, Drill allows users to analyze the data without any ETL or up-front schema definitions. The data can be in any file format such as text, JSON, or Parquet. Data can have simple types such as strings, integers, and dates, or more complex multi-structured data such as nested maps and arrays. Data can exist in any file system, local or distributed, such as HDFS, MapR FS, or S3. Drill has a "no schema" approach, which enables you to get value from your data in just a few minutes.

+ +

Let's quickly walk through the steps required to install Drill and run it against the Yelp data set. The publicly available data set used for this example is downloadable from Yelp (business reviews) and is in JSON format.

+ +

Installing and Starting Drill

+ +

Step 1: Download Apache Drill onto your local machine

+ +

http://incubator.apache.org/drill/download/

+ +

You can also deploy Drill in clustered mode if you want to scale your environment.

+ +

Step 2: Extract the Drill tar file

+ +

tar -xvf apache-drill-0.6.0-incubating.tar

+ +

Step 3: Launch sqlline, a JDBC application that ships with Drill

+ +

bin/sqlline -u jdbc:drill:zk=local

+ +

That's it! You are now ready to explore the data.

+ +

Let's try out some SQL examples to understand how Drill makes raw data analysis extremely easy.

+ +

Note: You need to substitute your local path to the Yelp data set in the FROM clause of each query you run.

+ +

Querying Data with Drill

+ +

1. View the contents of the Yelp business data

+ +

0: jdbc:drill:zk=local> !set maxwidth 10000

+ +

0: jdbc:drill:zk=local> select * from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` limit 1;

+-------------+--------------+------------+------------+------------+------------+--------------+------------+------------+------------+------------+------------+------------+------------+---------------+
+| business_id | full_address |   hours    |     open    | categories |            city    | review_count |        name   | longitude  |   state  |   stars          |  latitude  | attributes |          type    | neighborhoods |
++-------------+--------------+------------+------------+------------+------------+--------------+------------+------------+------------+------------+------------+------------+------------+---------------+
+| vcNAWiLM4dR7D2nwwJ7nCA | 4840 E Indian School Rd
+Ste 101
+Phoenix, AZ 85018 | {"Tuesday":{"close":"17:00","open":"08:00"},"Friday":{"close":"17:00","open":"08:00"},"Monday":{"close":"17:00","open":"08:00"},"Wednesday":{"close":"17:00","open":"08:00"},"Thursday":{"close":"17:00","open":"08:00"},"Sunday":{},"Saturday":{}} | true              | ["Doctors","Health & Medical"] | Phoenix  | 7                   | Eric Goldberg, MD | -111.983758 | AZ       | 3.5                | 33.499313  | {"By Appointment Only":true,"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} | business   | []                  
 |
++-------------+--------------+------------+------------+------------+------------+--------------+------------+------------+------------+------------+------------+------------+------------+---------------+
+

**Note:** You can directly query self-describing files such as JSON, Parquet, and text. There is no need to create metadata definitions in the Hive metastore.
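To make the idea of a self-describing file concrete, here is a minimal Python sketch (not Drill code) that discovers column names directly from the records. It assumes one JSON object per line; the two sample records and the helper names `read_records` and `discover_columns` are hypothetical and only illustrate the concept.

```python
import json

# Two hypothetical records, one JSON object per line (assumed file layout).
SAMPLE = """\
{"name": "Eric Goldberg, MD", "stars": 3.5, "categories": ["Doctors", "Health & Medical"]}
{"name": "Pine Cone Restaurant", "stars": 4.0, "open": true}
"""


def read_records(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def discover_columns(records):
    """Collect column names in first-seen order; nothing is declared up front."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    return columns


records = read_records(SAMPLE)
columns = discover_columns(records)
print(columns)
```

The second record has no `categories` field and adds an `open` field, yet both records are readable without any schema definition.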

+ +

2. Explore the business data set further

+ +

Total reviews in the data set

+ +

0: jdbc:drill:zk=local> select sum(review_count) as totalreviews from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json`;

+--------------+
+| totalreviews |
++--------------+
+| 1236445      |
++--------------+
+

Top states and cities in total number of reviews

+ +

0: jdbc:drill:zk=local> select state, city, count(*) totalreviews from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` group by state, city order by count(*) desc limit 10;

+------------+------------+--------------+
+|   state    |    city    | totalreviews |
++------------+------------+--------------+
+| NV         | Las Vegas  | 12021        |
+| AZ         | Phoenix    | 7499         |
+| AZ         | Scottsdale | 3605         |
+| EDH        | Edinburgh  | 2804         |
+| AZ         | Mesa       | 2041         |
+| AZ         | Tempe      | 2025         |
+| NV         | Henderson  | 1914         |
+| AZ         | Chandler   | 1637         |
+| WI         | Madison    | 1630         |
+| AZ         | Glendale   | 1196         |
++------------+------------+--------------+
+

Average number of reviews per business star rating

+ +

0: jdbc:drill:zk=local> select stars, trunc(avg(review_count)) reviewsavg from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` group by stars order by stars desc;

+------------+------------+
+|   stars    | reviewsavg |
++------------+------------+
+| 5.0        | 8.0        |
+| 4.5        | 28.0       |
+| 4.0        | 48.0       |
+| 3.5        | 35.0       |
+| 3.0        | 26.0       |
+| 2.5        | 16.0       |
+| 2.0        | 11.0       |
+| 1.5        | 9.0        |
+| 1.0        | 4.0        |
++------------+------------+
+

Top businesses with high review counts (> 1000)

+ +

0: jdbc:drill:zk=local> select name, state, city, `review_count` from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` where review_count > 1000 order by `review_count` desc limit 10;

+------------+------------+------------+----------------------------+
+|    name                |   state     |    city     | review_count |
++------------+------------+------------+----------------------------+
+| Mon Ami Gabi           | NV          | Las Vegas  | 4084          |
+| Earl of Sandwich       | NV          | Las Vegas  | 3655          |
+| Wicked Spoon           | NV          | Las Vegas  | 3408          |
+| The Buffet             | NV          | Las Vegas  | 2791          |
+| Serendipity 3          | NV          | Las Vegas  | 2682          |
+| Bouchon                | NV          | Las Vegas  | 2419          |
+| The Buffet at Bellagio | NV          | Las Vegas  | 2404          |
+| Bacchanal Buffet       | NV          | Las Vegas  | 2369          |
+| The Cosmopolitan of Las Vegas | NV   | Las Vegas  | 2253          |
+| Aria Hotel & Casino    | NV          | Las Vegas  | 2224          |
++------------+------------+------------+----------------------------+
+

Saturday open and close times for a few businesses

+ +

0: jdbc:drill:zk=local> select b.name, b.hours.Saturday.`open`, b.hours.Saturday.`close` from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` b limit 10;

+------------+------------+----------------------------+
+|    name                    |   EXPR$1   |   EXPR$2   |
++------------+------------+----------------------------+
+| Eric Goldberg, MD          | 08:00      | 17:00      |
+| Pine Cone Restaurant       | null       | null       |
+| Deforest Family Restaurant | 06:00      | 22:00      |
+| Culver's                   | 10:30      | 22:00      |
+| Chang Jiang Chinese Kitchen| 11:00      | 22:00      |
+| Charter Communications     | null       | null       |
+| Air Quality Systems        | null       | null       |
+| McFarland Public Library   | 09:00      | 20:00      |
+| Green Lantern Restaurant   | 06:00      | 02:00      |
+| Spartan Animal Hospital    | 07:30      | 18:00      |
++------------+------------+----------------------------+
+

Note how Drill can traverse and refer to fields through multiple levels of nesting.
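The nested reference `b.hours.Saturday.open` can be pictured as walking a dotted path through nested maps. The Python sketch below is only a conceptual model, not how Drill is implemented; the helper `get_path` and both sample records are hypothetical. A missing step yields `None`, which mirrors the `null` values shown for businesses with no Saturday hours.

```python
def get_path(record, path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical records shaped like the `hours` column in the business data.
with_hours = {
    "name": "Deforest Family Restaurant",
    "hours": {"Saturday": {"open": "06:00", "close": "22:00"}},
}
without_hours = {"name": "Pine Cone Restaurant", "hours": {}}

print(get_path(with_hours, "hours.Saturday.open"))
print(get_path(without_hours, "hours.Saturday.open"))
```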

+ +

3. Get the amenities of each business in the data set

+ +

Note that the attributes column in the Yelp business data set has a different element for every row, representing that businesses can have separate amenities. Drill makes it easy to quickly access data sets with changing schemas.

+ +

First, change Drill to work in all text mode (so we can take a look at all of the data).

0: jdbc:drill:zk=local> alter system set `store.json.all_text_mode` = true;
++------------+-----------------------------------+
+|     ok     |  summary                          |
++------------+-----------------------------------+
+| true       | store.json.all_text_mode updated. |
++------------+-----------------------------------+
+

Then, query the attributes data.

0: jdbc:drill:zk=local> select attributes from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` limit 10;
++----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| attributes                                                                                                                                                                       |
++----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| {"By Appointment Only":"true","Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
+| {"Take-out":"true","Good For":{"dessert":"false","latenight":"false","lunch":"true","dinner":"false","breakfast":"false","brunch":"false"},"Caters":"false","Noise Level":"averag |
+| {"Take-out":"true","Good For":{"dessert":"false","latenight":"false","lunch":"false","dinner":"false","breakfast":"false","brunch":"true"},"Caters":"false","Noise Level":"quiet" |
+| {"Take-out":"true","Good For":{},"Takes Reservations":"false","Delivery":"false","Ambience":{},"Parking":{"garage":"false","street":"false","validated":"false","lot":"true","val |
+| {"Take-out":"true","Good For":{},"Ambience":{},"Parking":{},"Has TV":"false","Outdoor Seating":"false","Attire":"casual","Music":{},"Hair Types Specialized In":{},"Payment Types |
+| {"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
+| {"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
+| {"Good For":{},"Ambience":{},"Parking":{},"Wi-Fi":"free","Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
+| {"Take-out":"true","Good For":{"dessert":"false","latenight":"false","lunch":"false","dinner":"true","breakfast":"false","brunch":"false"},"Noise Level":"average","Takes Reserva |
+| {"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
++------------+
+

Turn off the all text mode so we can continue to perform arithmetic operations on data.

0: jdbc:drill:zk=local> alter system set `store.json.all_text_mode` = false;
++------------+------------+
+|     ok             |  summary   |
++------------+------------+
+| true              | store.json.all_text_mode updated. |
+
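Why does all text mode get in the way of arithmetic? The sketch below models the effect in Python under the assumption, suggested by the output above where `true` becomes `"true"`, that every scalar is read as a string. The helper `to_all_text` is hypothetical and is not a Drill API.

```python
def to_all_text(value):
    """Recursively convert every scalar to a string, leaving maps and lists in place."""
    if isinstance(value, dict):
        return {key: to_all_text(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_all_text(item) for item in value]
    if isinstance(value, bool):
        # JSON spells booleans in lower case.
        return "true" if value else "false"
    if value is None:
        return None
    return str(value)


row = to_all_text({"By Appointment Only": True, "Good For": {}, "review_count": 7})
print(row)

# Arithmetic on the text value fails until it is read as a number again.
arithmetic_failed = False
try:
    row["review_count"] + 1
except TypeError:
    arithmetic_failed = True
print(arithmetic_failed)
```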

4. Explore the restaurant businesses in the data set

+ +

Number of restaurants in the data set

0: jdbc:drill:zk=local> select count(*) as TotalRestaurants from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants');
++------------------+
+| TotalRestaurants |
++------------------+
+| 14303            |
++------------------+
+

Top restaurants in number of reviews

0: jdbc:drill:zk=local> select name,state,city,`review_count` from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants') order by `review_count` desc limit 10
+. . . . . . . . . . . > ;
++------------+------------+------------+--------------+
+|    name         |   state    |    city     | review_count |
++------------+------------+------------+--------------+
+| Mon Ami Gabi | NV               | Las Vegas  | 4084         |
+| Earl of Sandwich | NV         | Las Vegas  | 3655         |
+| Wicked Spoon | NV             | Las Vegas  | 3408         |
+| The Buffet | NV       | Las Vegas  | 2791         |
+| Serendipity 3 | NV              | Las Vegas  | 2682         |
+| Bouchon       | NV         | Las Vegas  | 2419           |
+| The Buffet at Bellagio | NV             | Las Vegas  | 2404         |
+| Bacchanal Buffet | NV        | Las Vegas  | 2369         |
+| Hash House A Go Go | NV                | Las Vegas  | 2201         |
+| Mesa Grill | NV         | Las Vegas  | 2004         |
++------------+------------+------------+--------------+
+

Top restaurants in number of listed categories

0: jdbc:drill:zk=local> select name,repeated_count(categories) as categorycount, categories from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants') order by repeated_count(categories) desc limit 10;
++------------+---------------+------------+
+|    name         | categorycount | categories |
++------------+---------------+------------+
+| Binion's Hotel & Casino | 10           | ["Arts & Entertainment","Restaurants","Bars","Casinos","Event Planning & Services","Lounges","Nightlife","Hotels & Travel","American (N |
+| Stage Deli | 10        | ["Arts & Entertainment","Food","Hotels","Desserts","Delis","Casinos","Sandwiches","Hotels & Travel","Restaurants","Event Planning & Services"] |
+| Jillian's  | 9               | ["Arts & Entertainment","American (Traditional)","Music Venues","Bars","Dance Clubs","Nightlife","Bowling","Active Life","Restaurants"] |
+| Hotel Chocolat | 9               | ["Coffee & Tea","Food","Cafes","Chocolatiers & Shops","Specialty Food","Event Planning & Services","Hotels & Travel","Hotels","Restaurants"] |
+| Hotel du Vin & Bistro Edinburgh | 9           | ["Modern European","Bars","French","Wine Bars","Event Planning & Services","Nightlife","Hotels & Travel","Hotels","Restaurants" |
+| Elixir             | 9             | ["Arts & Entertainment","American (Traditional)","Music Venues","Bars","Cocktail Bars","Nightlife","American (New)","Local Flavor","Restaurants"] |
+| Tocasierra Spa and Fitness | 8                  | ["Beauty & Spas","Gyms","Medical Spas","Health & Medical","Fitness & Instruction","Active Life","Day Spas","Restaurants"] |
+| Costa Del Sol At Sunset Station | 8            | ["Steakhouses","Mexican","Seafood","Event Planning & Services","Hotels & Travel","Italian","Restaurants","Hotels"] |
+| Scottsdale Silverado Golf Club | 8              | ["Fashion","Shopping","Sporting Goods","Active Life","Golf","American (New)","Sports Wear","Restaurants"] |
+| House of Blues | 8               | ["Arts & Entertainment","Music Venues","Restaurants","Hotels","Event Planning & Services","Hotels & Travel","American (New)","Nightlife"] |
++------------+---------------+------------+
+
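The two array functions used above can be read as a membership test and a length. The Python sketch below models them that way on a few hypothetical records; treating `repeated_contains` as an exact-match membership test is an assumption made for illustration, not a statement of Drill's implementation.

```python
def repeated_contains(values, target):
    """Model of repeated_contains: True if the array holds the target value."""
    return target in values


def repeated_count(values):
    """Model of repeated_count: number of elements in the array."""
    return len(values)


# Hypothetical sample shaped like the `categories` column.
businesses = [
    {"name": "Stage Deli", "categories": ["Food", "Delis", "Sandwiches", "Restaurants"]},
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
]

# where true=repeated_contains(categories,'Restaurants')
restaurants = [b for b in businesses if repeated_contains(b["categories"], "Restaurants")]
total_restaurants = len(restaurants)

# order by repeated_count(categories) desc
ranked = sorted(restaurants, key=lambda b: repeated_count(b["categories"]), reverse=True)
top_names = [b["name"] for b in ranked]

print(total_restaurants, top_names)
```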

Top first categories in number of review counts

0: jdbc:drill:zk=local> select categories[0], count(categories[0]) as categorycount from dfs.`/users/nrentachintala/Downloads/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json` group by categories[0] 
+order by count(categories[0]) desc limit 10;
++------------+---------------+
+|   EXPR$0   | categorycount |
++------------+---------------+
+| Food       | 4294          |
+| Shopping   | 1885          |
+| Active Life | 1676          |
+| Bars       | 1366          |
+| Local Services | 1351          |
+| Mexican    | 1284          |
+| Hotels & Travel | 1283          |
+| Fast Food  | 963           |
+| Arts & Entertainment | 906           |
+| Hair Salons | 901           |
++------------+---------------+
+

5. Explore the Yelp reviews dataset and combine with the businesses.

+ +

Take a look at the contents of the Yelp reviews dataset.

0: jdbc:drill:zk=local> select * from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_review.json` limit 1;
++------------+------------+------------+------------+------------+------------+------------+-------------+
+|   votes          |  user_id   | review_id  |   stars    |            date    |    text           |          type    | business_id |
++------------+------------+------------+------------+------------+------------+------------+-------------+
+| {"funny":0,"useful":2,"cool":1} | Xqd0DzHaiyRqVH3WRG7hzg | 15SdjuK7DmYqUAj6rjGowg | 5            | 2007-05-17 | dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank. | review | vcNAWiLM4dR7D2nwwJ7nCA |
++------------+------------+------------+------------+------------+------------+------------+-------------+
+

Top businesses with cool rated reviews

+ +

Note that we are combining the Yelp business data set that has the overall review_count to the Yelp review data, which holds additional details on each of the reviews themselves.

0: jdbc:drill:zk=local> Select b.name from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` b where b.business_id in (SELECT r.business_id FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_review.json` r
+GROUP BY r.business_id having sum(r.votes.cool) > 2000 order by sum(r.votes.cool)  desc);
++------------+
+|    name         |
++------------+
+| Earl of Sandwich |
+| XS Nightclub |
+| The Cosmopolitan of Las Vegas |
+| Wicked Spoon |
++------------+
+
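The query above does two things: it sums the cool votes per business in the review data, then keeps the businesses whose total passes a threshold. Here is a minimal Python sketch of that logic on hypothetical in-memory records; the function name and the sample IDs are made up for illustration.

```python
from collections import defaultdict


def names_with_cool_votes_over(businesses, reviews, threshold):
    """Sum votes.cool per business_id, then keep businesses whose total exceeds the threshold."""
    cool_totals = defaultdict(int)
    for review in reviews:
        cool_totals[review["business_id"]] += review["votes"]["cool"]
    qualifying_ids = {bid for bid, total in cool_totals.items() if total > threshold}
    return [b["name"] for b in businesses if b["business_id"] in qualifying_ids]


# Hypothetical sample data; real IDs and vote counts differ.
businesses = [
    {"business_id": "b1", "name": "Earl of Sandwich"},
    {"business_id": "b2", "name": "Pine Cone Restaurant"},
]
reviews = [
    {"business_id": "b1", "votes": {"funny": 0, "useful": 2, "cool": 3}},
    {"business_id": "b1", "votes": {"funny": 1, "useful": 0, "cool": 2}},
    {"business_id": "b2", "votes": {"funny": 0, "useful": 1, "cool": 1}},
]

popular = names_with_cool_votes_over(businesses, reviews, threshold=4)
print(popular)
```

The set of qualifying IDs is unordered, so the names come back in the order of the business list rather than by vote total.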

Create a view with the combined business and reviews data sets

+ +

Note that Drill views are lightweight, and can just be created in the local file system. Drill in standalone mode comes with a dfs.tmp workspace, which we can use to create views (or you can define your own workspaces on a local or distributed file system). If you want to persist the data physically instead of in a logical view, you can use CREATE TABLE AS SELECT syntax.

0: jdbc:drill:zk=local> create or replace view dfs.tmp.businessreviews as Select b.name,b.stars,b.state,b.city,r.votes.funny,r.votes.useful,r.votes.cool, r.`date` from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json` b , dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_review.json` r where r.business_id=b.business_id;
++------------+------------+
+|     ok             |  summary   |
++------------+------------+
+| true              | View 'businessreviews' created successfully in 'dfs.tmp' schema |
++------------+------------+
+

Let's get the total number of records from the view.

0: jdbc:drill:zk=local> select count(*) as Total from dfs.tmp.businessreviews;
++------------+
+|   Total   |
++------------+
+| 1125458       |
++------------+
+

In addition to these queries, you can get many deeper insights using Drill's SQL functionality. If you are not comfortable writing queries manually, you can use BI/analytics tools such as Tableau or MicroStrategy to query raw files, Hive or HBase data, or Drill-created views directly using the Drill ODBC/JDBC drivers.

+ +

The goal of Apache Drill is to provide freedom and flexibility in exploring data in ways we have never seen before with SQL technologies. The community is working on more exciting features around nested data and supporting data with changing schemas in upcoming releases.

+ +

As an example, a new FLATTEN function is in development (an upcoming feature in 0.7). This function can be used to dynamically rationalize semi-structured data so you can apply even deeper SQL functionality. Here is a sample query:

+ +

Get a flattened list of categories for each business

0: jdbc:drill:zk=local> select name, flatten(categories) as category from dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_business.json`  limit 20;
++------------+------------+
+|    name         |   category   |
++------------+------------+
+| Eric Goldberg, MD | Doctors          |
+| Eric Goldberg, MD | Health & Medical |
+| Pine Cone Restaurant | Restaurants |
+| Deforest Family Restaurant | American (Traditional) |
+| Deforest Family Restaurant | Restaurants |
+| Culver's   | Food       |
+| Culver's   | Ice Cream & Frozen Yogurt |
+| Culver's   | Fast Food  |
+| Culver's   | Restaurants |
+| Chang Jiang Chinese Kitchen | Chinese    |
+| Chang Jiang Chinese Kitchen | Restaurants |
+| Charter Communications | Television Stations |
+| Charter Communications | Mass Media |
+| Air Quality Systems | Home Services |
+| Air Quality Systems | Heating & Air Conditioning/HVAC |
+| McFarland Public Library | Libraries  |
+| McFarland Public Library | Public Services & Government |
+| Green Lantern Restaurant | American (Traditional) |
+| Green Lantern Restaurant | Restaurants |
+| Spartan Animal Hospital | Veterinarians |
++------------+------------+
+

Top categories used in business reviews

0: jdbc:drill:zk=local> select celltbl.catl, count(celltbl.catl) categorycnt from (select flatten(categories) catl from dfs.`/users/nrentachintala/Downloads/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json` )  celltbl group by celltbl.catl order by count(celltbl.catl) desc limit 10 ;
++------------+-------------+
+|    catl    | categorycnt |
++------------+-------------+
+| Restaurants | 14303       |
+| Shopping   | 6428        |
+| Food       | 5209        |
+| Beauty & Spas | 3421        |
+| Nightlife  | 2870        |
+| Bars       | 2378        |
+| Health & Medical | 2351        |
+| Automotive | 2241        |
+| Home Services | 1957        |
+| Fashion    | 1897        |
++------------+-------------+
+
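To see what the two FLATTEN queries do, here is a small Python sketch of the same idea: emit one row per array element, then count the elements. It is a conceptual model only; the `flatten` helper is hypothetical, and the three sample records reuse names and categories from the output above.

```python
from collections import Counter


def flatten(records, column):
    """Yield one copy of each record per element of its list-valued column."""
    for record in records:
        for element in record[column]:
            row = dict(record)
            row[column] = element
            yield row


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Culver's", "categories": ["Food", "Ice Cream & Frozen Yogurt", "Fast Food", "Restaurants"]},
]

# select name, flatten(categories) as category ...
flat_rows = list(flatten(businesses, "categories"))

# group by category order by count desc limit 1
category_counts = Counter(row["categories"] for row in flat_rows)
top_category = category_counts.most_common(1)

print(len(flat_rows), top_category)
```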

Stay tuned for more features and upcoming activities in the Drill community.

+ +

To learn more about Drill, please refer to the following resources:

+ +

Download Drill here: http://incubator.apache.org/drill/download/
10 reasons we think Drill is cool: http://incubator.apache.org/drill/why-drill/
A simple 10-minute tutorial: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+in+10+Minutes
A more comprehensive tutorial: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Tutorial

+ + + + + + + + Added: drill/site/trunk/content/drill/docs/aol-search/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/aol-search/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/aol-search/index.html (added) +++ drill/site/trunk/content/drill/docs/aol-search/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,138 @@ + + + + + + + + +AOL Search - Apache Drill + + + + + + + + + + + + + + + + + + +

+ +

+ + + +

AOL Search

+ +

Quick Stats

+ +

The AOL Search dataset is a collection of real query log data that is based on real users.

+ +

The Data Source

+ +

The dataset consists of 20M Web queries from 650k users over a period of three months, 440MB in total and available for download. The format used in the dataset is:

AnonID, Query, QueryTime, ItemRank, ClickURL
+

... with:

+ +

AnonID, an anonymous user ID number.
Query, the query issued by the user, case shifted with most punctuation removed.
QueryTime, the time at which the query was submitted for search.
ItemRank, if the user clicked on a search result, the rank of the item on which they clicked is listed.
ClickURL, if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.

+ +

Each line in the data represents one of two types of events:

+ +

A query that was NOT followed by the user clicking on a result item.
A click through on an item in the result list returned from a query.

+ +

In the first case (query only), there is data in only the first three columns; in the second case (click through), there is data in all five columns. For click through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events.
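The two event types can be told apart by checking whether the last two columns are filled. The Python sketch below parses a line and classifies it. It assumes tab-separated fields, which the page does not state, and the names `SearchEvent`, `parse_line`, and `is_click_through` as well as the sample lines are hypothetical.

```python
from typing import NamedTuple, Optional


class SearchEvent(NamedTuple):
    anon_id: str
    query: str
    query_time: str
    item_rank: Optional[int]
    click_url: Optional[str]


def parse_line(line, delimiter="\t"):
    """Split one log line into the five documented fields (delimiter is an assumption)."""
    fields = line.rstrip("\n").split(delimiter)
    # Query-only lines may carry just three fields; pad so unpacking always works.
    fields += [""] * (5 - len(fields))
    anon_id, query, query_time, item_rank, click_url = fields[:5]
    return SearchEvent(
        anon_id,
        query,
        query_time,
        int(item_rank) if item_rank else None,
        click_url or None,
    )


def is_click_through(event):
    """A click through has data in all five columns."""
    return event.item_rank is not None and event.click_url is not None


# Hypothetical sample lines.
query_only = parse_line("1001\tapache drill\t2006-03-01 10:15:00")
click = parse_line("1001\tapache drill\t2006-03-01 10:15:00\t1\thttp://example.org")

print(is_click_through(query_only), is_click_through(click))
```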

+ +

The Queries

+ +

Interesting queries include, for example:

+ +

Users querying for topic X
Users that click on the first (second, third) ranked item
TOP 10 domains searched
TOP 10 domains clicked at
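As a rough illustration of two of the queries listed above, the following Python sketch counts clicked domains and finds users who clicked a result at a given rank. The events are hypothetical tuples in the documented column order, and the function names are made up for this example.

```python
from collections import Counter


def top_clicked_domains(events, n=10):
    """Count ClickURL values over click-through events and return the n most frequent."""
    clicks = Counter(url for (_, _, _, _, url) in events if url)
    return clicks.most_common(n)


def users_clicking_rank(events, rank):
    """Return the AnonIDs of users who clicked an item at the given ItemRank."""
    return {anon_id for (anon_id, _, _, item_rank, _) in events if item_rank == rank}


# Hypothetical events: (AnonID, Query, QueryTime, ItemRank, ClickURL).
events = [
    ("1001", "apache drill", "2006-03-01 10:15:00", 1, "http://example.org"),
    ("1001", "apache drill", "2006-03-01 10:15:00", 2, "http://example.com"),
    ("1002", "sql on hadoop", "2006-03-02 08:00:00", 1, "http://example.org"),
    ("1003", "json query", "2006-03-03 09:30:00", None, None),
]

print(top_clicked_domains(events, n=1))
print(sorted(users_clicking_rank(events, 1)))
```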

+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-0-4-0-release-notes/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-0-4-0-release-notes/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-0-4-0-release-notes/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-0-4-0-release-notes/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,127 @@ + + + + + + + + +Apache Drill 0.4.0 Release Notes - Apache Drill + + + + + + + + + + + + + + + + + + +

+ +

+ + + +

Apache Drill 0.4.0 Release Notes

+ +

The 0.4.0 release is a developer preview release, designed to help enthusiasts start to work with and experiment with Drill. It is the first Drill release that provides distributed query execution.

+ +

This release is built upon more than 800 JIRAs. It is a pre-beta release on the way towards Drill. As a developer snapshot, the release contains a large number of outstanding bugs that will make some use cases challenging. Feel free to consult outstanding issues targeted for the 0.5.0 release to see whether your use case is affected.

+ +

To read more about this release and new features introduced, please view the 0.4.0 announcement blog entry.

+ +

The release is available as both binary and source tarballs. In both cases, these are compiled against Apache Hadoop. Drill has also been tested against MapR, Cloudera, and Hortonworks Hadoop distributions, and there are associated build profiles or JIRAs that can help you run against your preferred distribution.

+ +

Some Key Notes & Limitations

+ +

The current release supports in-memory and beyond-memory execution. However, users must disable memory-intensive hash aggregate and hash join operations to leverage this functionality.
In many cases, merge join operations return incorrect results.
Use of a local filter in a join "on" clause when using left, right, or full outer joins may result in incorrect results.
Because of known memory leaks and memory overrun issues, you may need more memory and you may need to restart the system in some cases.
Some types of complex expressions, especially those involving empty arrays, may fail or return incorrect results.
While the Drill execution engine supports dynamic schema changes during the course of a query, some operators have yet to implement support for this behavior (such as Sort). Other operations (such as streaming aggregate) may have partial support that leads to unexpected results.
Protobuf, UDF, query plan interfaces, and all other interfaces are subject to change in incompatible ways.
Multiplication of some types of DECIMAL(28+,*) will return incorrect results.

+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-0-5-0-release-notes/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-0-5-0-release-notes/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-0-5-0-release-notes/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-0-5-0-release-notes/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,112 @@ + + + + + + + + +Apache Drill 0.5.0 Release Notes - Apache Drill + + + + + + + + + + + + + + + + + + +

+ +

+ + + +

Apache Drill 0.5.0 Release Notes

+ +

Apache Drill 0.5.0, the first beta release for Drill, is designed to help enthusiasts start working and experimenting with Drill. It also continues the Drill monthly release cycle as we drive towards general availability.

+ +

The 0.5.0 release is primarily a bug fix release, with more than 100 JIRAs closed, but there are some notable features. For information about the features, see the Apache Drill Blog for the 0.5.0 release.

+ +

This release is available as binary and source tarballs that are compiled against Apache Hadoop. Drill has been tested against MapR, Cloudera, and Hortonworks Hadoop distributions. There are associated build profiles and JIRAs that can help you run Drill against your preferred distribution.

+ +

Apache Drill 0.5.0 Key Notes and Limitations

+ +

The current release supports in-memory and beyond-memory execution. However, you must disable memory-intensive hash aggregate and hash join operations to leverage this functionality.
While the Drill execution engine supports dynamic schema changes during the course of a query, some operators have yet to implement support for this behavior, such as Sort. Other operations, such as streaming aggregate, may have partial support that leads to unexpected results.
There are known issues with joining text files without using an intervening view. See DRILL-1401 for more information.

+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-0-6-0-release-notes-apache-drill-alpha/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-0-6-0-release-notes-apache-drill-alpha/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-0-6-0-release-notes-apache-drill-alpha/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-0-6-0-release-notes-apache-drill-alpha/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,118 @@ + + + + + + + + +Apache Drill 0.6.0 Release Notes (Apache Drill Alpha) - Apache Drill + + + + + + + + + + + + + + + + + + +

+ +

+ + + +

Apache Drill 0.6.0 Release Notes (Apache Drill Alpha)

+ +

Apache Drill 0.6.0, the second beta release for Drill, is designed to help enthusiasts start working and experimenting with Drill. It also continues the Drill monthly release cycle as we drive towards general availability.

+ +

Apache Drill 0.6.0 Key Features

+ +

This release is primarily a bug fix release, with more than 30 JIRAs closed, but there are some notable features:

+ +

Direct ANSI SQL access to MongoDB, using the latest MongoDB Plugin for Apache Drill
Filesystem query performance improvements with partition pruning
Ability to use the file system as a persistent store for query profiles and diagnostic information
Window function support (alpha)

+ +

Apache Drill 0.6.0 Key Notes and Limitations

+ +

The current release supports in-memory and beyond-memory execution. However, you must disable memory-intensive hash aggregate and hash join operations to leverage this functionality.
While the Drill execution engine supports dynamic schema changes during the course of a query, some operators have yet to implement support for this behavior, such as Sort. Other operations, such as streaming aggregate, may have partial support that leads to unexpected results.

+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-0-7-0-release-notes-apache-drill-alpha/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-0-7-0-release-notes-apache-drill-alpha/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-0-7-0-release-notes-apache-drill-alpha/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-0-7-0-release-notes-apache-drill-alpha/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,132 @@ + + + + + + + + +Apache Drill 0.7.0 Release Notes (Apache Drill Alpha) - Apache Drill + + + + + + + + + + + + + + + + + + +

+ +

+ + + +

Apache Drill 0.7.0 Release Notes (Apache Drill Alpha)

+ +

Apache Drill 0.7.0, the third beta release for Drill, is designed to help enthusiasts start working and experimenting with Drill. It also continues the Drill monthly release cycle as we drive towards general availability.

+ +

This release is available as binary and source tarballs that are compiled against Apache Hadoop. Drill has been tested against MapR, Cloudera, and Hortonworks Hadoop distributions. There are associated build profiles and JIRAs that can help you run Drill against your preferred distribution.

+ +

Apache Drill 0.7.0 Key Features

+ +

No more dependency on UDP/Multicast - Making it possible for Drill to work well in the following scenarios:
+ +
- UDP multicast not enabled (as in EC2)
- Cluster spans multiple subnets
- Cluster has multihome configuration
New functions to natively work with nested data - KVGen and Flatten
Support for Hive 0.13 (Hive 0.12 is no longer supported with Drill)
Improved performance when querying Hive tables and the file system through partition pruning
Improved performance for HBase with LIKE operator pushdown
Improved memory management
Drill web UI monitoring and query profile improvements
Ability to parse files without explicit extensions using default storage format specification
Fixes for dealing with complex/nested data objects in Parquet/JSON
Fast schema return - Improved experience working with BI/query tools by returning metadata quickly
Several hang-related fixes
Parquet writer fixes for handling large datasets
Stability improvements in ODBC and JDBC drivers
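To make the KVGen and Flatten entry in the list above more concrete, here is a minimal Python sketch of the general idea behind the two functions: KVGen turns a map with arbitrary keys into a list of key/value pairs, and Flatten emits one output row per element of a repeated field. This is an illustration of the concept only, not Drill's implementation; the helper names `kvgen` and `flatten` and the sample record are hypothetical.

```python
def kvgen(mapping):
    """Turn a map into a list of {"key": ..., "value": ...} entries."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


def flatten(rows, column):
    """Emit one output row per element of the list stored in `column`."""
    flattened = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)        # copy the other columns unchanged
            new_row[column] = element  # replace the list with one element
            flattened.append(new_row)
    return flattened


if __name__ == "__main__":
    # Hypothetical record shaped like nested JSON data.
    business = {"name": "Cafe", "categories": ["Food", "Coffee"]}
    for row in flatten([business], "categories"):
        print(row)
    for pair in kvgen({"Monday": "9-5", "Tuesday": "9-5"}):
        print(pair)
```

Combining the two (generate key/value pairs, then flatten them) is what lets a map with unknown keys be queried as ordinary rows.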

+ +

Apache Drill 0.7.0 Key Notes and Limitations

+ +

The current release supports in-memory and beyond-memory execution. However, you must disable memory-intensive hash aggregate and hash join operations to leverage this functionality.
While the Drill execution engine supports dynamic schema changes during the course of a query, some operators have yet to implement support for this behavior, such as Sort. Other operations, such as streaming aggregate, may have partial support that leads to unexpected results.

+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-contribution-guidelines/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-contribution-guidelines/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-contribution-guidelines/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-contribution-guidelines/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,256 @@ + + + + + + + + +Apache Drill Contribution Guidelines - Apache Drill + + + + + + + + + + + + + + + + + + +

+ +

+ + + +

Apache Drill Contribution Guidelines

+ +

Fixing JIRAs
SQL functions
Support for new file format readers/writers
Support for new data sources
New query language parsers
Application interfaces
- BI Tool testing
General CLI improvements
Eco system integrations
- MapReduce
- Hive views
- YARN
- Spark
- Hue
- Phoenix

+ +

Fixing JIRAs

+ +

This is a good place to begin if you are new to Drill. Feel free to pick issues from the Drill JIRA list. When you pick an issue, assign it to yourself, inform the team, and start fixing it.

+ +

For any questions, seek help from the team by sending email to drill-dev@incubator.apache.org.

+ +

https://issues.apache.org/jira/browse/DRILL/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

+ +

SQL functions

+ +

One of the next simplest places to start is to implement a DrillFunc. DrillFuncs are the way that Drill expresses all scalar functions (UDF or system). First, you can put together a JIRA for one of the DrillFuncs we don't yet have but should (referencing the capabilities of something like Postgres or SQL Server, or your own use case). Then try to implement one.

+ +

One example DrillFunc:
https://github.com/apache/incubator-drill/blob/103072a619741d5e228fdb181501ec2f82e111a3/sandbox/prototype/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/ComparisonFunctions.java

+ +

Additional ideas on functions that can be added to Drill SQL support

+ +

Madlib integration
Machine learning functions
Approximate aggregate functions (such as what is available in BlinkDB)

+ +

Support for new file format readers/writers

+ +

Currently, Drill natively supports the text, JSON, and Parquet file formats when interacting with a file system. More readers/writers can be introduced by implementing custom storage plugins. Example formats include the following:

+ +

AVRO
Sequence
RC
ORC
Protobuf
XML
Thrift
....

+ +

Support for new data sources

+ +

Implement custom storage plugins for the following non-Hadoop data sources:

+ +

NoSQL databases (such as Mongo, Cassandra, Couch, etc.)
Search engines (such as Solr, Lucidworks, Elastic Search, etc.)
SQL databases (MySQL, PostgreSQL, etc.)
Generic JDBC/ODBC data sources
HTTP URL
----

+ +

New query language parsers

+ +

Drill exposes strongly typed JSON APIs for logical and physical plans (plan syntax at https://docs.google.com/a/maprtech.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit#heading=h.n9gdb1ek71hf ). Drill provides a SQL language parser today, but any language parser that can generate logical/physical plans can use Drill's power on the backend as the distributed low latency query execution engine along with its support for self-describing data and complex/multi-structured data.

+ +

Pig parser: Use Pig as the language to query data from Drill. Great for existing Pig users.
Hive parser: Use HiveQL as the language to query data from Drill. Great for existing Hive users.

+ +

Application interfaces

+ +

Drill currently provides JDBC/ODBC drivers for applications to interact with, along with a basic version of a REST API and a C++ API. The following list provides a few possible application interface opportunities:

+ +

Enhancements to REST APIs (https://issues.apache.org/jira/browse/DRILL-77)
Expose Drill tables/views as REST APIs
Language drivers for Drill (Python, etc.)
Thrift support
....
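As a starting point for the language-driver idea in the list above, the following Python sketch prepares an HTTP request that submits a SQL query over REST. It only builds the request and does not send it. The endpoint path `/query.json`, the port `8047`, and the `queryType`/`query` fields follow the REST interface documented for later Drill releases and are assumptions here; the basic REST API in this release may differ. The function name `build_query_request` is hypothetical.

```python
import json


def build_query_request(sql, host="localhost", port=8047):
    """Return (url, headers, body) for submitting a SQL query over REST.

    Assumption: the server accepts POST requests at /query.json with a
    JSON body containing "queryType" and "query".
    """
    url = "http://{}:{}/query.json".format(host, port)
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"queryType": "SQL", "query": sql})
    return url, headers, body


if __name__ == "__main__":
    url, headers, body = build_query_request("SELECT * FROM sys.options")
    print(url)
    print(headers)
    print(body)
```

Separating request construction from transport makes the helper easy to test and lets a driver plug in whichever HTTP client it prefers.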

+ +

BI Tool testing

+ +

Drill provides JDBC/ODBC drivers to connect to BI tools. We need to make sure Drill works with all major BI tools. Running a quick sanity test with your favorite BI tool is a good way to learn Drill and to uncover any issues with the integration.

+ +

General CLI improvements

+ +

Currently, Drill uses SQLLine as the CLI. The goal of this effort is to improve the CLI experience by adding functionality such as executing statements from a file, writing results to a file, displaying version information, and so on.

+ +

Eco system integrations

+ +

MapReduce

+ +

Allow using result sets from Drill queries as input to Hadoop/MapReduce jobs.

+ +

Hive views

+ +

Query data from existing Hive views using Drill queries. Drill needs to parse the HiveQL and translate it appropriately (into Drill's SQL or logical/physical plans) to execute the requests.

+ +

YARN

+ +

https://issues.apache.org/jira/browse/DRILL-1170

+ +

Spark

+ +

Provide the ability to invoke Drill queries as part of Apache Spark programs. This lets Spark developers and users leverage the richness of Drill's query layer, both for data source access and as a low latency execution engine.

+ +

Hue

+ +

Hue is a GUI for users to interact with various Hadoop eco system components (such as Hive, Oozie, Pig, HBase, Impala ...). The goal of this project is to expose Drill as an application inside Hue so users can explore Drill metadata and run SQL queries.

+ +

Phoenix

+ +

Phoenix provides a low latency query layer on HBase for operational applications. The goal of this effort is to explore opportunities for integrating Phoenix with Drill.

+ + + + + + + + Added: drill/site/trunk/content/drill/docs/apache-drill-documentation/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/apache-drill-documentation/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/apache-drill-documentation/index.html (added) +++ drill/site/trunk/content/drill/docs/apache-drill-documentation/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,92 @@ + + + + + + + + +Apache Drill Documentation - Apache Drill + + + + + + + + + + + + + + + + + + +

+ +

+ + + +

Apache Drill Documentation

+ +

The Drill documentation covers how to install, configure, and use Apache Drill.

+ + + + + + + +

Analyzing Yelp JSON Data with Apache Drill

Installing and Starting Drill

Step 1: Download Apache Drill onto your local machine

Step 2: Open the Drill tar file

Step 3: Launch sqlline, a JDBC application that ships with Drill

Querying Data with Drill

1. View the contents of the Yelp business data

2. Explore the business data set further

Total reviews in the data set

Top states and cities in total number of reviews

Average number of reviews per business star rating

Top businesses with high review counts (> 1000)

Saturday open and close times for a few businesses

3. Get the amenities of each business in the data set

Number of restaurants in the data set

Top restaurants in number of reviews

Top first categories in number of review counts

Take a look at the contents of the Yelp reviews dataset.

Top businesses with cool rated reviews

Get a flattened list of categories for each business

AOL Search

Quick Stats

The Data Source

The Queries

Apache Drill 0.4.0 Release Notes

Apache Drill 0.5.0 Release Notes

Apache Drill 0.6.0 Release Notes (Apache Drill Alpha)

Apache Drill 0.7.0 Release Notes (Apache Drill Alpha)

Apache Drill Contribution Guidelines

Fixing JIRAs

SQL functions

Support for new file format readers/writers

Support for new data sources

New query language parsers

Application interfaces

BI Tool testing

General CLI improvements

Eco system integrations

MapReduce

Hive views

YARN

Spark

Hue

Phoenix

Apache Drill Documentation
