Migrating your analytics to Starburst #

In many cases, migrating your analytics to Starburst is as easy as swapping out your client, such as a JDBC driver, and changing the sources in your FROM clauses. But because not every data source implements the SQL standard in the same way, there can be slight differences between your data sources’ native SQL and the SQL in Starburst products. The information on this page gets you started moving your tools and workflows to Starburst.
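
As a sketch of what a FROM clause change often looks like, the query below assumes a hypothetical PostgreSQL catalog named postgresql and a sales.orders table that you previously queried directly in the source database:

-- Before, in the source database: SELECT ... FROM sales.orders
-- After, in Starburst: qualify the table with the catalog name
SELECT order_id, order_total
FROM postgresql.sales.orders
WHERE order_date >= DATE '2024-01-01'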

Getting and installing Starburst drivers #

Starburst has JDBC and ODBC drivers to connect your clients to your data sources. You can download them and learn about connecting clients in our Clients documentation section.

Migrate queries from ANSI-standard SQL implementations #

For SQL implementations that follow the ANSI standard closely, it is likely that only minor changes to your queries are necessary. Rather than changing queries directly in production, we suggest using a SQL client to test your existing queries before rolling the changes out.

Our full SQL reference manuals (Galaxy / SEP) are available to help you resolve any small implementation differences as you migrate your queries to Starburst.

Reduce ETL #

With Starburst, there are many opportunities to reduce both the tech debt that accumulates from increasingly complex ETL pipelines and your use of intermediate storage for data centralization. These are complex problems, and it can be hard to know where to begin. The following sections offer some strategies that make the problem easier to approach.

Reduce or remove duplicative or similar processing #

If you have multiple pipelines that ultimately produce similar data at the same grain, start by combining their queries. For example, if you calculate budget for ads at the ad_id level, and someone else calculates invoices at the same ad_id grain, combine those queries into one. With this one change, you can significantly reduce compute and storage demands by no longer performing the same operations twice.
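
As a rough sketch of what a combined query can look like, assuming hypothetical ad_spend and ad_invoices tables in a catalog named lake, one query can produce both outputs at the ad_id grain:

-- Aggregate each source once, then join at the shared ad_id grain
WITH budget AS (
  SELECT ad_id, sum(spend) AS total_budget
  FROM lake.ads.ad_spend
  GROUP BY ad_id
),
invoices AS (
  SELECT ad_id, sum(invoice_amount) AS total_invoiced
  FROM lake.ads.ad_invoices
  GROUP BY ad_id
)
SELECT b.ad_id, b.total_budget, i.total_invoiced
FROM budget AS b
LEFT JOIN invoices AS i ON b.ad_id = i.ad_id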

Optimize around end usage #

As you do your due diligence for this migration, whether or not you choose to leave some pipelines in your current framework, take the opportunity to review each pipeline's design against its purpose:

  • For output consumed only by downstream ETL, optimize for writes.
  • For output consumed by dashboards and reports, optimize for reads (see the sketch after this list).
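
As one illustration of a read-optimized output, assuming a Hive-connector catalog named hive and a hypothetical ad_events table, a dashboard-facing table can be pre-aggregated to the grain the dashboard filters on and partitioned by its most common filter column:

-- Pre-aggregate to the dashboard's grain; with the Hive connector, partition columns come last
CREATE TABLE hive.reporting.daily_ad_spend
WITH (partitioned_by = ARRAY['ds'])
AS
SELECT ad_id, sum(spend) AS total_spend, ds
FROM hive.raw.ad_events
GROUP BY ad_id, ds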

For an in-depth introduction to query optimization, watch our training video on optimizing query performance.

Reduce or remove the need for temp tables and intermediate disk storage #

Data products that are derived from disparate sources often need to land in an intermediate schema where they can be combined locally. With Starburst, you can simplify this type of processing and remove the need for temp tables and schemas by taking advantage of its powerful query federation abilities. This also removes the need to manage temp table cleanup jobs.
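
As a minimal sketch, assuming a PostgreSQL catalog named postgresql and a Hive catalog named hive, a single federated query can combine both sources directly instead of landing them in an intermediate schema first:

-- Join operational and lake data in one query, with no temp tables or cleanup jobs
SELECT c.customer_id, c.segment, sum(e.event_count) AS events
FROM postgresql.crm.customers AS c
JOIN hive.analytics.daily_events AS e
  ON c.customer_id = e.customer_id
GROUP BY c.customer_id, c.segment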

Reduce or remove little-used queries and data products #

If you are not doing so already, start measuring the usage of dashboards and reports. Have any been abandoned? Has usage of any shrunk so much that there is no longer a justifiable ROI on their maintenance? If so, work with your stakeholders on a sunset plan, and remove those surfaces and pipelines.

Migrating queries from Hive #

In cases where a SQL implementation deviates significantly from the ANSI standard, such as Hive’s HiveQL, there are some things you’ll want to keep in mind. Starburst uses ANSI SQL syntax and semantics, whereas Hive uses a SQL-like language called HiveQL that is loosely modeled after MySQL, which itself has many differences from ANSI SQL.

Our documentation covers the syntactic and semantic differences between Starburst SQL and non-ANSI-standard implementations such as HiveQL, including:

  • Array syntax and handling
  • Syntax for strings and identifiers
  • CAST considerations
  • Differences in datediff()
  • Complex expressions and subqueries
  • INSERT and OVERWRITE operations

Use subscript for accessing a dynamic index of an array instead of a UDF #

The subscript operator in SQL supports full expressions, unlike Hive (which only supports constants). Therefore you can write queries like:

SELECT my_array[CARDINALITY(my_array)] AS last_element
FROM ...

Avoid out of bounds access of arrays #

Accessing an out-of-bounds element of an array results in an exception. You can avoid this with an IF expression as follows:

SELECT IF(CARDINALITY(my_array) >= 3, my_array[3], NULL)
FROM ...
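
Depending on your Starburst version, the element_at() function can be a simpler alternative: unlike the subscript operator, it returns NULL when the index is larger than the array length:

SELECT element_at(my_array, 3)
FROM ...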

Use ANSI SQL syntax for arrays #

Arrays are indexed starting from 1, not from 0:

SELECT my_array[1] AS first_element
FROM ...

Construct arrays with ANSI syntax:

SELECT ARRAY[1, 2, 3] AS my_array

Use ANSI SQL syntax for identifiers and strings #

Strings are delimited with single quotes and identifiers are quoted with double quotes, not backquotes:

SELECT name AS "User Name"
FROM "7day_active"
WHERE name = 'foo'

Quote identifiers that start with numbers #

Identifiers that start with numbers are not legal in ANSI SQL and must be quoted using double quotes:

SELECT *
FROM "7day_active"

Use the standard string concatenation operator #

Use the ANSI SQL string concatenation operator:

SELECT a || b || c
FROM ...

Use standard types for CAST targets #

The following standard types are supported for CAST targets:

SELECT
  CAST(x AS varchar),
  CAST(x AS bigint),
  CAST(x AS double),
  CAST(x AS boolean)
FROM ...

In particular, use VARCHAR instead of STRING.

Use CAST when dividing integers #

Starburst follows the standard behavior of performing integer division when dividing two integers. For example, dividing 7 by 2 results in 3, not 3.5. To perform floating-point division on two integers, cast one of them to a double:

SELECT CAST(7 AS DOUBLE) / 2

Use WITH for complex expressions or queries #

When you want to reuse a complex output expression as a filter, use either an inline subquery or factor it out using the WITH clause:

WITH a AS (
  SELECT substr(name, 1, 3) x
  FROM ...
)
SELECT *
FROM a
WHERE x = 'foo'

Use UNNEST to expand arrays and maps #

Starburst supports UNNEST for expanding arrays and maps. Use UNNEST instead of LATERAL VIEW explode().

Hive query:

SELECT student, score
FROM tests
LATERAL VIEW explode(scores) t AS score;

Starburst query:

SELECT student, score
FROM tests
CROSS JOIN UNNEST(scores) AS t (score);
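
UNNEST also expands maps into a key column and a value column. As a sketch, assuming a hypothetical scores_by_subject map column on the same table:

SELECT student, subject, score
FROM tests
CROSS JOIN UNNEST(scores_by_subject) AS t (subject, score);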

Use ANSI SQL syntax for date and time INTERVAL expressions #

Starburst supports ANSI SQL-style INTERVAL expressions, which differ from the implementation used in Hive.

  • The INTERVAL keyword is required, not optional.
  • Date and time units must be singular, for example day, not days.
  • Values must be quoted.

Hive query:

SELECT cast('2000-08-19' as date) + 14 days;

Equivalent Starburst query:

SELECT cast('2000-08-19' as date) + INTERVAL '14' day;

Caution using datediff() #

The Hive datediff() function returns the difference between the two dates in days and is declared as:

datediff(string enddate, string startdate)  -> integer

The equivalent Starburst date_diff() function reverses the order of the two date parameters and requires a unit. Take this into account when migrating:

Hive query:

datediff(enddate, startdate)

Starburst query:

date_diff('day', startdate, enddate)

Overwriting data on insert #

By default, INSERT queries are not allowed to overwrite existing data. You can use the catalog session property insert_existing_partitions_behavior to allow overwrites. Prepend the name of the catalog using the Hive connector, for example hdfs, and set the property in the session before you run the insert query:

SET SESSION hdfs.insert_existing_partitions_behavior = 'OVERWRITE';
INSERT INTO hdfs.schema.table ...

The resulting behavior is equivalent to using INSERT OVERWRITE in Hive.

The insert overwrite operation is not supported by Starburst when the table is stored on encrypted HDFS, when the table is unpartitioned, or when the table is transactional.

Next steps #

Once you have cleaned up your pipelines, start looking at where you can delight your customers with more aggressive data landing times. Upstream ETL and data availability continue to set a lower bound on your landing times, but reducing pipeline complexity, and with it the need to wait for slow compute and storage operations, has a positive downstream effect.