Migrating your analytics to Starburst #

In some cases, migrating your analytics to Starburst is as easy as swapping out your client, such as a JDBC driver, and changing the sources in your FROM clauses. But because not every data source implements the SQL standard the same way, there can be slight differences between your data sources’ native SQL and Starburst’s SQL. The information on this page will get you started moving your tools and workflows to Starburst.
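As a sketch of the simplest case, the only change a query may need is a fully qualified Starburst table name in the FROM clause. The catalog, schema, and table names below are hypothetical, stand-ins for whatever your deployment defines:

```sql
-- Before: query written against the warehouse's native SQL
SELECT order_id, total
FROM orders;

-- After: the same query in Starburst, using the
-- catalog.schema.table naming (names here are examples)
SELECT order_id, total
FROM mywarehouse.sales.orders;
```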

Getting and installing Starburst clients #

Starburst has JDBC and ODBC drivers to connect your favorite tools to your favorite data. We have an entire document section that covers downloading, installing and connecting clients.

Migrating queries from ANSI-standard SQL implementations #

For SQL implementations that follow the ANSI standard closely, only minor query edits are likely. Rather than changing these production queries on the fly, though, we suggest using your favorite SQL client or our own CLI to test your existing queries before making changes to production.

Our full Starburst SQL reference manual is available to help you resolve any small implementation differences as you migrate your queries to Starburst.

Migrating queries from Hive #

In other cases, where a SQL implementation deviates significantly from the ANSI standard, such as Hive’s HiveQL, there are some things you’ll want to keep in mind. Our documentation covers the syntactic and semantic differences between HiveQL and Starburst’s SQL, including:

  • Array syntax and handling
  • Syntax for strings and identifiers
  • CAST considerations
  • Differences in datediff
  • Complex expressions and subqueries
  • INSERT and OVERWRITE operations
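A couple of the differences above can be sketched in one query pair. This is illustrative only, with a hypothetical `events` table; consult the Hive migration page for the full list:

```sql
-- HiveQL: backtick-quoted identifiers, and datediff()
-- taking (end, start) with an implicit unit of days
SELECT `user`, datediff('2024-03-01', '2024-02-01')
FROM events;

-- Starburst SQL: double-quoted identifiers, typed DATE
-- literals, and date_diff() with an explicit unit,
-- taking (unit, start, end)
SELECT "user", date_diff('day', DATE '2024-02-01', DATE '2024-03-01')
FROM events;
```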

Read the Hive migration page in our reference documentation for detailed information on these topics.

Reducing ETL #

With Starburst, you have many opportunities to reduce your reliance on increasingly complex ETL pipelines and on the intermediate storage used to centralize data for transformation, and to clean up tech debt as you go. These are complex problems, and it can be hard to know where to begin. The following sections offer some strategies that make approaching the problem easier.

Reduce or remove duplicative or similar processing #

If you have multiple pipelines that ultimately produce similar data at the same grain, start by combining them. For example, if you calculate budget for ads at an ad_id level, and someone else calculates invoices at the same ad_id grain, combine those queries. Instead of scanning the data twice, moving it around twice, and writing it twice with just one column differing, you pay the compute and storage cost once.
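The consolidation above can be sketched as a single query. The `ads` table and its columns are hypothetical, standing in for the shared source both pipelines scan:

```sql
-- Instead of two pipelines each scanning the source:
--   SELECT ad_id, sum(spend)  AS budget  FROM ads GROUP BY ad_id;
--   SELECT ad_id, sum(billed) AS invoice FROM ads GROUP BY ad_id;

-- ...produce both columns with one scan and one write:
SELECT ad_id,
       sum(spend)  AS budget,
       sum(billed) AS invoice
FROM ads
GROUP BY ad_id;
```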

Optimize around end usage #

As you are doing your due diligence for this migration, whether you choose to leave some pipelines in your current framework or not, take the opportunity to review each pipeline’s design against its purpose:

  • For end usage purely for downstream ETL, optimize for writes
  • For end usage for dashboards and reports, optimize for reads
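As one read-optimized sketch, a table that dashboards filter by date might be partitioned on that column when it is written. The catalog, schema, and column names here are hypothetical, and the `partitioned_by` property shown is specific to the Hive connector; the available table properties depend on which connector backs your catalog:

```sql
-- Write once, read many: lay the table out for the common
-- dashboard filter (sale_date), so reads prune partitions.
-- Partition columns must come last in the column list.
CREATE TABLE hive.reporting.daily_sales
WITH (partitioned_by = ARRAY['sale_date'])
AS
SELECT region,
       sum(total) AS region_total,
       sale_date
FROM hive.sales.orders
GROUP BY region, sale_date;
```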

For in-depth information on optimizing queries, dive into our training video on optimizing query performance, and reference the query optimizer section of our reference documentation.

Reduce or remove the need for temp tables and intermediate disk storage #

Data products derived from disparate sources often need to land in an intermediate schema where they can be combined locally. With Starburst, you can simplify this type of processing and remove the need for temp tables and schemas by taking advantage of Starburst’s powerful query federation abilities. This has the additional benefit of removing the need to manage temp table cleanup jobs.
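A minimal sketch of this kind of federated query follows. The catalog, schema, table, and column names are hypothetical; the point is that both sources are read in place, with no intermediate landing table:

```sql
-- Join an operational database and a data lake table directly,
-- in one query, with no temp table in between
SELECT c.customer_id,
       c.segment,
       o.order_total
FROM postgresql.crm.customers AS c
JOIN hive.sales.orders AS o
  ON c.customer_id = o.customer_id;
```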

Reduce or remove little-used queries and data products #

If you are not already doing so, start measuring usage of your dashboards and reports. Have any been abandoned? Has usage of any shrunk enough that there is no longer a justifiable ROI on maintaining them? If so, work with your stakeholders on a sunset plan, and remove the surfaces and pipelines.

Next steps #

Once you have cleaned up your pipelines, start looking at where you can delight your customers with more aggressive data landing times. Upstream data ETL and availability will continue to set a lower bound on your landing times, but reducing pipeline complexity, and with it the need to wait for slow compute and storage operations, has a positive domino effect.