Query the COVID-19 data lake #

One of Starburst Galaxy’s many uses is as an analytics engine on your data lakehouse or data lake. Once data is landed in S3, GCP, or Azure, you can easily capitalize on the “separation of storage and compute” principle and use Starburst Galaxy as the engine behind your data lake analytics.

Tutorial architecture #

For this tutorial, analyze two different datasets from the public COVID-19 data lake on AWS.

  • The first dataset is the Global Coronavirus (COVID-19) Data provided by Enigma. This dataset tracks COVID case counts.
  • The second dataset shares information on US Hospital Beds, provided by Rearc. With this hospital information, you are able to see the capacity and occupancy of hospital beds for each state.

This guide walks through:

  • Connecting Starburst Galaxy to AWS
  • Running schema discovery
  • Querying the COVID-19 data lake

Login to Starburst Galaxy #

  1. Navigate to the Starburst Galaxy login page.
  2. If you don’t have an account, create a new one and verify the account with your email.
  3. Switch to an account admin level role. If this is the first time you have logged in, or you have never made any additional roles, you are automatically assigned to the accountadmin role.

Role selector

Connect Starburst Galaxy to AWS #

Create a catalog in Starburst Galaxy #

Catalogs contain the configuration and connection information needed to access a data source. To gain this access, configure a catalog and use it in a cluster.

This tutorial utilizes the public COVID-19 data lake that lives on AWS, which simplifies the setup process significantly. Instead of creating an S3 catalog, where you are required to have access to an AWS account, the data for this tutorial is accessible without charge and without AWS credentials by connecting to the AWS COVID-19 data lake dataset catalog.

  1. In the navigation menu, click Catalogs, then Configure a Catalog.
  2. Scroll down to Select a dataset.
    • Select AWS COVID-19 data lake.
    • Enter a relevant catalog name such as aws_covid.
    • Add a relevant description such as COVID data from AWS public dataset.
    • Add the US East (Ohio) region to the catalog as this is the only region where the dataset lives.
    • Select Connect catalog.

Image displaying the actions to create a covid-19 catalog

Select Save access controls on the Set permissions page to add the default access controls.

Create a cluster in Starburst Galaxy #

A Starburst Galaxy cluster provides the resources necessary to run queries against your catalogs. Use the Starburst Galaxy Cluster explorer to access the catalog data exposed by running clusters.

  1. On the Add to cluster page, select + Create a new cluster.

Image displaying the actions to create a covid-19 cluster

  • Enter cluster name: aws-covid-cluster.
  • Cluster size: Free.
  • Cluster type: Standard.
  • Catalogs: aws_covid.
  • Cloud provider region: US East (Ohio) aka us-east-2.
    1. Select Create cluster.
    2. Select Add to cluster.
    3. Select Query my data.

After configuring the catalog and cluster, you are directed to the query editor to continue the tutorial. Notice that the location drop-downs in the top right corner have already selected the proper cluster and catalog for querying.

Image displaying the example query navigation after connecting to   the covid cluster and catalog

Run schema discovery #

Schema discovery is the Starburst Galaxy feature that analyzes a root object in an object storage location, and returns the schema structure of any tables found. This tutorial uses schema discovery to analyze an individual table in the COVID-19 data lake and generate the SQL command which upon execution creates the corresponding table.

Navigate to your newly created catalog within the cluster explorer. Select the vertical ellipsis menu for the aws_covid catalog and select Run discovery.

Image displaying the run discovery preview

Enter the URI to be discovered. Use the Global Coronavirus (COVID-19) Data provided by Enigma.

s3://covid19-lake/enigma-jhu/json

Click Run Discovery.

Image displaying the URI input to schema discovery

Click Save SQL to save the results in a new tab of the query editor. You now have the schema discovery results which provide the SQL to create the first table, without spending any time investigating yourself.

Configure role-based access control #

The Starburst Galaxy access control system uses roles to provide users with privileges for clusters, catalogs, schemas, tables and other types of entities, such as object storage locations.

Configure the accountadmin role to access the COVID-19 data lake location. (But if you have already gone through these steps at the direction of a different Starburst Galaxy tutorial, there is no need to repeat them.)

  1. In the navigation menu, select Roles and privileges.
  2. Click on the accountadmin role name.
  3. Select the Privileges tab.
  4. Select Add privilege.
  5. Add privilege to accountadmin role.
    • Choose location as the modify privileges destination.
    • Add the storage location: s3://covid19-lake/*.
    • Select Create SQL.
    • Select Add privilege.

Image displaying the URI input to schema discovery

You now have access to all the data sources available in the COVID-19 data lake.

Create each table #

Navigate to the Query editor. Validate the location drop-downs in the top right hand corner match the cluster and catalog previously created.

Image displaying the example query navigation after connecting to   the covid cluster and catalog

Create a schema to contain your tables.

CREATE SCHEMA covid_tutorial;

Now, in the location drop-downs, select the covid_tutorial schema so that you can easily run the tutorial queries.

Create the enigma_jhu table #

The enigma_jhu dataset provides the Global Coronavirus Data and is sourced from John Hopkins and provided by Enigma. This data tracks confirmed COVID-19 cases in provinces, states, and countries across the world, while also providing a county level breakdown in the United States.

Edit your saved query to match the following SQL command to create the table.

CREATE TABLE enigma_jhu (
   fips VARCHAR,
   admin2 VARCHAR,
   province_state VARCHAR,
   country_region VARCHAR,
   last_update VARCHAR,
   latitude DOUBLE,
   longitude DOUBLE,
   confirmed INTEGER,
   deaths INTEGER,
   recovered INTEGER,
   active INTEGER,
   combined_key VARCHAR
)
WITH (
   format = 'json',
   EXTERNAL_LOCATION = 's3://covid19-lake/enigma-jhu/json/')
;

Run a select all command to view your results. You can also use the table’s vertical ellipsis menu to generate the command for you in the query editor.

SELECT * FROM enigma_jhu LIMIT 10;

Take notice that the admin2 column is actually the county. However, it has been improperly named. You can also see that the case information is aggregated for each previously updated timestamp. We account for this as you query.

Create the hospital_beds table #

The second dataset is the USA Hospital Beds data sourced by Definitive Healthcare and provided by Rearc. This data provides intelligence on the number of licensed beds, staffed beds, and ICU beds for the hospitals in the United States.

Run the following SQL command to create the table.

CREATE TABLE hospital_beds (
   objectid INTEGER,
   hospital_name VARCHAR,
   hospital_type VARCHAR,
   hq_address VARCHAR,
   hq_address1 VARCHAR,
   hq_city VARCHAR,
   hq_state VARCHAR,
   hq_zip_code VARCHAR,
   county_name VARCHAR,
   state_name VARCHAR,
   state_fips VARCHAR,
   cnty_fips VARCHAR,
   fips VARCHAR,
   num_licensed_beds INTEGER,
   num_staffed_beds INTEGER,
   num_icu_beds INTEGER,
   adult_icu_beds INTEGER,
   pedi_icu_beds INTEGER,
   bed_utilization DOUBLE,
   avg_ventilator_usage VARCHAR,
   potential_increase_in_bed_capac INTEGER,
   latitude DOUBLE,
   longtitude DOUBLE
)
WITH (
   format = 'json',
   EXTERNAL_LOCATION = 's3://covid19-lake/rearc-usa-hospital-beds/json/')
;

Run a select all command to view your results. You can also use the table’s vertical ellipsis menu to generate the command for you in the query editor.

SELECT * FROM hospital_beds LIMIT 10;

Query each table #

Now that the each table has been validated, query those tables for analytical insights.

Query the enigma_jhu table #

Now that the table is created, run the query that will filter out rows that are not within the US and do not have a value for any province or state. Since the case information is aggregated for each previously updated timestamp, only select the most recent timestamp on May 30, 2020.

SELECT
    fips,
    admin2,
    province_state,
    country_region,
    confirmed
FROM
    enigma_jhu
WHERE
    country_region = 'US'
    AND province_state not like ''
    AND last_update = '2020-05-30T02:32:48';

Run a query to aggregate the total confirmed cases for each state. Keep in mind that this dataset includes US territories as well as states.

SELECT
    province_state,
    country_region,
    SUM(confirmed) AS total_confirmed
FROM
    enigma_jhu
WHERE
    country_region = 'US'
    AND province_state not like ''
    AND last_update = '2020-05-30T02:32:48'
GROUP BY
    province_state,
    country_region;

Query the hospital_beds table #

In the original table, there are additional columns that you do not need for your analysis. Therefore, remove them from your query.

SELECT
   hospital_name,
   county_name,
   state_name,
   fips,
   num_licensed_beds,
   num_staffed_beds,
   num_icu_beds,
   potential_increase_in_bed_capac
FROM
    hospital_beds;

Run a query to aggregate the total hospital capacity for each state. Keep in mind that this dataset includes Puerto Rico and the District of Columbia.

SELECT
   state_name,
   SUM(num_licensed_beds) AS total_licensed_beds,
   SUM(num_staffed_beds) AS total_staffed_beds,
   SUM(num_icu_beds) AS total_icu_beds
FROM
    hospital_beds
WHERE
    state_name != ''
GROUP BY
    state_name;

Query the tables together #

Evaluate each state’s case count versus the number of licensed beds during May 30, 2020. You can see for each state if the state is running close to hospital capacity, or if they are managing with the current number of infections.

SELECT
 cases.fips,
 admin2 AS county,
 province_state,
 confirmed,
 growth_count,
 SUM(num_licensed_beds) AS num_licensed_beds,
 SUM(num_staffed_beds) AS num_staffed_beds,
 SUM(num_icu_beds) AS num_icu_beds
FROM
 hospital_beds beds,
 ( SELECT
     fips,
     admin2,
     province_state,
     confirmed,
     last_value(confirmed) OVER (PARTITION BY fips ORDER BY last_update) - first_value(confirmed) OVER (PARTITION BY fips ORDER BY last_update) AS growth_count,
     first_value(last_update) OVER (PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
     last_update
   FROM
     enigma_jhu cases
   WHERE
     from_iso8601_timestamp(last_update) > TIMESTAMP '2020-05-01 01:00' AND country_region = 'US') cases
WHERE
 beds.fips = cases.fips AND last_update = most_recent
GROUP BY cases.fips, confirmed, growth_count, admin2, province_state
ORDER BY growth_count DESC;

Next Steps #

Now that you have explored the data lake analytics capabilities of Starburst Galaxy, continue exploring with our Query multiple data sources tutorial or connect your own data.