drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Timothy Chen <tnac...@gmail.com>
Subject Re: Using Apache Drill - Sample Data and Sample Queries - Examples
Date Thu, 14 Aug 2014 19:57:40 GMT
Hi Jim,

The vimeo link gave me 404 though.

Tim

On Thu, Aug 14, 2014 at 12:31 PM, Jim Scott <jscott@maprtech.com> wrote:
> I recently gave a presentation on how to use Apache Drill with some
> demonstrations.
>
> The video for the presentation is available here:
> http://vimeo.com/chug/using-apache-drill
> The slides are available here:
> http://www.slideshare.net/ChicagoHUG/using-apache-drill-chug-august-2014-jim-scott
>
> *The data used in this presentation can be found and downloaded (1.3GB CSV
> file and a 100KB CSV file) *
> *here:*
> https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
> https://data.cityofchicago.org/Public-Safety/Sex-Offenders/vc9r-bqvy
>
> Once you start your drill session you can connect to it via:
> http://localhost:8047/
>
>
> *Queries (pay attention to back ticks) *
> To see a full stack trace when errors occur:
> alter session set `exec.errors.verbose`=true;
>
>
> *Looking through the information schema:*SELECT * from
> INFORMATION_SCHEMA.SCHEMATA;
> SELECT * from INFORMATION_SCHEMA.`TABLES`;
> SELECT * from INFORMATION_SCHEMA.`COLUMNS`;
> SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE
> TABLE_SCHEMA = 'sys' and TABLE_NAME = 'options';
>
>
> *How many of each type of incident:*select count(*) as incidents,
> columns[5] as category from
> dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv` group by
> columns[5] order by incidents desc limit 10;
>
>
> *How many of each type of incident and subtype:*select count(*) as
> incidents, columns[5] as type, columns[6] as subtype from
> dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv` group by
> columns[6], columns[5] order by incidents desc limit 10;
>
>
> *CREATE A TABLE in PARQUET format:*
>
> *Please note that I created a workspace "db" at my path of /opt/drill/db
> and I gave it a default storage format of parquet. *
>
> create table dfs.db.`crime_data` as select columns[0] as REFID, columns[1]
> as CaseNumber, columns[2] as DateTime, columns[3] as Block, columns[4] as
> IUCR, columns[5] as PrimaryType, columns[6] as Description, columns[7] as
> LocationDescription, columns[8] as Arrest, columns[9] as Domestic,
> columns[10] as Beat, columns[11] as District, columns[12] as Ward,
> columns[13] as CommunityArea, columns[14] as FBICode, columns[15] as
> XCoordinate, columns[16] as YCoordinate, columns[17] as YEAR_DATE,
> columns[18] as UpdatedOn, columns[19] as Latitude, columns[20] as
> Longitude, columns[21] as Location from
> dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv`;
> create table dfs.db.`sex_offenders` as select columns[0] as LAST,columns[1]
> as FIRST,columns[2] as BLOCK,columns[3] as GENDER,columns[4] as
> RACE,columns[5] as BIRTHDATE,columns[6] as AGE,columns[7] as
> HEIGHT,columns[8] as WEIGHT,columns[9] as VICTIM_MINOR from
> dfs.`/home/jscott/Downloads/chicago/Sex_Offenders.csv`;
>
>
> *Run the previous query on the newly created table:*select count(*) as
> incidents, primarytype, description from dfs.db.`crime_data` group by
> description, primarytype order by incidents desc limit 10;
>
>
> *How many crimes occurred on a block with a registered sex offender?*SELECT
> count(*) as incidents FROM dfs.db.`crime_data` crimes JOIN
> dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block ORDER BY
> incidents DESC;
>
>
> *What types of incidents occurred?*SELECT count(*) as incidents,
> crimes.primarytype as type, crimes.description as subtype FROM
> dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON
> crimes.block = offenders.block GROUP BY crimes.description,
> crimes.primarytype ORDER BY incidents DESC LIMIT 20;
>
>
> *What all is in the BATTERY category?*SELECT count(*) as incidents,
> crimes.description as subtype FROM dfs.db.`crime_data` crimes JOIN
> dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block WHERE
> crimes.primarytype = 'BATTERY' GROUP BY crimes.description ORDER BY
> incidents DESC;
>
>
> *How many total BATTERY incidents?*SELECT count(*) as incidents FROM
> dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON
> crimes.block = offenders.block WHERE crimes.primarytype = 'BATTERY';
>
>
> *How many "BATTERY" incidents occurred on those blocks by year (has it
> gotten better or worse over time)?*SELECT count(*) as incidents,
> extract(year from to_date(crimes.datetime, 'MM/DD/YYYY hh:mm:ss a')) as
> `year` FROM dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders`
> offenders ON crimes.block = offenders.block WHERE crimes.primarytype =
> 'BATTERY' and crimes.datetime not like '%Date%' GROUP BY extract(year from
> to_date(crimes.datetime, 'MM/DD/YYYY hh:mm:ss a')) ORDER BY `year` ASC;
>
>
> Do you have queries against this data set that you have written that you
> would like to share with others? Reply to the thread...
>
>
> --
> *Jim Scott*
> Director, Enterprise Strategy & Architecture
>
>  <http://www.mapr.com/>
> [image: MapR Technologies] <http://www.mapr.com>

Mime
View raw message