Starburst Galaxy

  •  Get started

  •  Working with data

  •  Data engineering

  •  Developer tools

  •  Cluster administration

  •  Security and compliance

  •  Troubleshooting

  • Galaxy status

  •  Reference

  • Learn complex grouping with SQL #

    Welcome to the Learn complex grouping with SQL tutorial, where you explore useful ways to craft your SQL statements using the TPC-H dataset. You can perform multi-column analysis within each SQL statement using complex grouping operations.

    Requirements #

    To complete the tutorial using Starburst Galaxy, you must create a free trial or activate your user account.

    Once you login to Starburst Galaxy, the sample cluster and tpch catalog are usually pre-configured and ready for querying.

    The sample cluster and tpch catalog are pre-configured and ready for querying.

    • If you do not see the tpch catalog automatically available, create a TPC-H dataset catalog named tpch.
    • If you do not see the sample cluster automatically available, create a cluster named sample and add the tpch catalog to it.

    In the query editor, navigate to the Cluster explorer.

    1. In the navigation menu, expand the sample cluster to view its catalogs.
    2. Select the sample catalog.

        Query editor sample cluster explorer

    In the location drop-down menus, select the cluster sample and the catalog tpch in order to run the queries without having to specify the full table path location in each query.

      Query editor use tpch catalog button

    The namespace for a table is typically specified as catalog_name.schema_name.table_name. For the tpch dataset, this configuration is as follows:

    tpch.tiny.<table_name>
    

    If you choose to name your catalog differently, either adjust the queries accordingly or select the appropriately catalog in the location drop-down menus.

    If you choose to name your cluster differently, select the appropriately named cluster in the location drop-down menus.

    Complex grouping operations #

    Starburst Galaxy supports complex grouping operations using the ROLLUP, CUBE, and GROUPING SETS syntax. This syntax allows you to perform analysis that requires aggregation on multiple sets of columns in a single query.

    ROLLUP #

    The ROLLUP operator generates all possible subtotals for a given set of columns.

    Query the customers table and view the data available for analysis.

    SELECT
      *
    FROM
      tpch.tiny.customer;
    

    Group the customers by market segment to get the total count in each segment.

    SELECT
      mktsegment,
      count() AS customers
    FROM
      tpch.tiny.customer
    GROUP BY
      mktsegment
    ORDER BY
      mktsegment;
    

    Instead of using the GROUP BY clause to organize the market segments, use the ROLLUP operator. Also, reduce the market segments to focus only on three distinct categories: AUTOMOBILE, BUILDING, and MACHINERY.

    SELECT
      mktsegment,
      count() AS customers
    FROM
      tpch.tiny.customer
    WHERE
      mktsegment IN ('AUTOMOBILE', 'BUILDING', 'MACHINERY')
    GROUP BY
      ROLLUP(mktsegment)
    ORDER BY
      mktsegment;
    

    Add another column to the ROLLUP operator and include the nationkey grouping to generate multiple grouping sets.

    SELECT
      mktsegment,
      nationkey,
      count() AS orders
    FROM
      tpch.tiny.customer
    WHERE
      mktsegment IN ('AUTOMOBILE', 'BUILDING', 'MACHINERY')
      AND nationkey BETWEEN 1 AND 2
    GROUP BY
      ROLLUP(mktsegment, nationkey)
    ORDER BY
      mktsegment,
      nationkey;
    

    CUBE #

    The CUBE operator generates all the possible grouping sets for a given set of columns.

    Switch to the CUBE operator to obtain the order count breakdown of each nation key for the market segments not explicitly specified.

    SELECT
      mktsegment,
      nationkey,
      count() AS orders
    FROM
      tpch.tiny.customer
    WHERE
      mktsegment IN ('AUTOMOBILE', 'BUILDING', 'MACHINERY')
      AND nationkey BETWEEN 1 AND 2
    GROUP BY
      CUBE(mktsegment, nationkey)
    ORDER BY
      mktsegment,
      nationkey;
    

    GROUPING SETS #

    The result of the CUBE query can also be achieved using GROUPING SETS. Grouping sets allow you to specify multiple lists of columns to group on, so the manually specifying all the combination of columns can equate to the same output as the CUBE operator.

    SELECT
      mktsegment,
      nationkey,
      count() AS orders
    FROM
      tpch.tiny.customer
    WHERE
      mktsegment IN ('AUTOMOBILE', 'BUILDING', 'MACHINERY')
      AND nationkey BETWEEN 1 AND 2
    GROUP BY
      GROUPING SETS (
         (mktsegment),
         (nationkey),
         (mktsegment, nationkey),
         ()
      )
    ORDER BY
      mktsegment,
      nationkey;
    

    Next steps #

    Check out our other tutorials, or dive right into the SQL documentation and experiment with your own data.