Introduction to UDFs#
A user-defined function (UDF) is a custom function authored by a user of Starburst Galaxy as part of a SQL query or client application. UDFs are scalar functions that return a single output value, similar to built-in functions.
UDF declaration#
Declare the UDF with the SQL FUNCTION keyword and the supported statements for SQL user-defined functions.
A UDF can be declared as an inline UDF to be used in the current query, or declared as a catalog UDF to be used in any future query in any catalog.
Inline UDFs#
An inline user-defined function (inline UDF) declares and uses the UDF within a
query processing context. The UDF is declared in a WITH
block before the
query:
WITH
FUNCTION doubleup(x integer)
RETURNS integer
RETURN x * 2
SELECT doubleup(21);
-- 42
Inline UDF names must follow SQL identifier naming conventions, and cannot contain period characters.
The UDF declaration is only valid within the context of the query. A separate later invocation of the UDF is not possible. If this is desired, use a catalog UDF.
Multiple inline UDF declarations are comma-separated, and can include UDFs calling each other, as long as a called UDF is declared before the first invocation.
WITH
FUNCTION doubleup(x integer)
RETURNS integer
RETURN x * 2,
FUNCTION doubleupplusone(x integer)
RETURNS integer
RETURN doubleup(x) + 1
SELECT doubleupplusone(21);
-- 43
Note that inline UDFs can mask and override the meaning of a built-in function:
WITH
FUNCTION abs(x integer)
RETURNS integer
RETURN x * 2
SELECT abs(-10); -- -20, not 10!
Catalog UDFs#
You can store a UDF in the context of a catalog. Starburst Galaxy stores all
UDFs in the the global catalog named galaxy
, which is automatically attached
to each cluster in your Galaxy account.
In this scenario, the following commands can be used:
CREATE FUNCTION to create and store a UDF.
DROP FUNCTION to remove a UDF.
SHOW FUNCTIONS to display a list of UDFs in a catalog.
Because catalog UDFs in Galaxy all reside in the same global catalog, they can
be referenced by function name alone. If you do need to reference the full path
of a function such as square
, that full path is galaxy.functions.square
.
When using the SHOW FUNCTIONS statement in Galaxy, specify the location of all UDFs as follows:
show functions from galaxy.functions;
SQL environment configuration for UDFs#
Because catalog UDFs are stored in the global catalog named galaxy
, there are
no environment considerations for creating or dropping catalog UDFs.
SQL UDFs in Galaxy can be defined while the current location is a particular
catalog.schema.table
location (such as catA.schemaB.tableC
) and then
immediately invoked when the SQL editor’s context changes to a different
location (such as catX.schemaY.tableZ
).
Recommendations#
Processing UDFs can potentially be resource intensive on the cluster in terms of memory and processing. Take the following considerations into account when writing and running UDFs:
Some checks for the runtime behavior of queries, and therefore UDF processing, are in place. For example, if a query takes longer to process than a hardcoded threshold, processing is automatically terminated.
Avoid creation of arrays in a looping construct. Each iteration creates a separate new array with all items and copies the data for each modification, leaving the prior array in memory for automated clean up later. Use a lambda expression instead of the loop.
Avoid concatenating strings in a looping construct. Each iteration creates a separate new string and copying the old string for each modification, leaving the prior string in memory for automated clean up later. Use a lambda expression instead of the loop.
Most UDFs should declare the
RETURNS NULL ON NULL INPUT
characteristics unless the code has some special handling for null values. You must declare this explicitly sinceCALLED ON NULL INPUT
is the default characteristic.