BigQuery connector#

The BigQuery connector lets you query data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read data from tables. See the BigQuery Storage API for more information.

SEP includes additional enterprise features that are built on top of the existing Trino connector functionality. For more information on key feature differences between Trino and SEP, see the connectors feature matrix.

Requirements#

To connect to BigQuery, you need:

To enable the BigQuery Storage Read API.
Network access from your SEP coordinator and workers to the Google Cloud API service endpoint. This endpoint uses HTTPS, or port 443.
To configure BigQuery so the SEP coordinator and workers have permissions in BigQuery.
A valid Starburst Enterprise license.
To set up authentication. Authentication options vary, depending on whether or not you are using Dataproc/Google Compute Engine (GCE). On Dataproc/GCE authentication is based on the machine’s role. If using Outside Dataproc/GCE you have the following options to choose from:
- Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described in the Google Cloud authentication getting started guide.
- Set bigquery.credentials-key in the catalog properties file. It should contain the contents of the JSON file, encoded using base64.
- Set bigquery.credentials-file in the catalog properties file. It should point to the location of the JSON file.

Configuration#

To configure the BigQuery connector, create a catalog properties file that specifies the BigQuery connector by setting the connector.name to bigquery.

For example, to access a database as example, create the file etc/catalog/example.properties. Replace the connection properties as appropriate for your setup:

connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project ID>

BigQuery configuration properties#
Property name	Description
`bigquery.project-prefix-for-schema.enabled`	Allow access to other projects in BigQuery by including the project ID with the schema name in queries: SELECT * FROM catalog."project-id.dataset".table The default is `false`.
`bigquery.project-cache-ttl`	Frequency of checks for project names as duration. The default is `30m`.

Multiple Google Cloud projects#

The BigQuery connector can only access a single Google Cloud project. If you have data in multiple Google Cloud projects, you must create several catalogs that each point to a different Google Cloud project.

For example, if you have two Google Cloud projects, one called sales and one called analytics, create two properties files in etc/catalog named sales.properties and analytics.properties. They both need to include the connector.name=bigqueryproperty but with different project-id.

Billing and data projects#

The BigQuery connector determines the project ID to use based on the configuration settings. This behavior provides users with flexibility in selecting both the project to query and the project to bill for BigQuery operations. The following table explains how project IDs are resolved in different scenarios:

Billing and data project ID resolution#
Configured properties	Billing project	Data project
Only `bigquery.credentials-key`	The project ID from the credentials key is used for billing.	The project ID from the credentials key is used for querying data.
`bigquery.credentials-key` and `bigquery.project-id`	The project ID from the credentials key is used for billing.	`bigquery.project-id` is used for querying data.
`bigquery.credentials-key` and `bigquery.parent-project-id`	`bigquery.parent-project-id` is used for billing.	The project ID from the credentials key is used for querying data.
`bigquery.credentials-key` and `bigquery.parent-project-id` and `bigquery.project-id`	`bigquery.parent-project-id` is used for billing.	`bigquery.project-id` is used for querying data.

Reading from views#

The connector supports reading from BigQuery views.

By default, reading from BigQuery views is disabled.

To enable reading from views, set the bigquery.views-enabled configuration property to true.

BigQuery views are not materialized by default. In addition, materialized views are created in the same project and dataset. This means that the connector needs to materialize them before it can read them which affects read performance. The materialization process can also incur additional costs to your BigQuery bill.

The optional configuration properties are bigquery.view-materialization-project and bigquery.view-materialization-dataset. The service account must have write permission to the project and dataset in order to materialize the view.

The connector creates a temporary table with the _pbc_ prefix for reading views from the Storage API and when using the query table function for improved performance. The temporary table expires based on the duration set by the bigquery.views-cache-ttl configuration property. Additionally, temporary tables get deleted after the duration set by the bigquery.view-expire-duration configuration property.

The value set in bigquery.view-expire-duration must be greater than the value of bigquery.views-cache-ttl with enough buffer time to ensure that in-progress queries using the materialized view can complete successfully before the view expires.

Arrow serialization support#

This is a feature which introduces support for using Apache Arrow as the serialization format when reading from BigQuery. Add the following required, additional JVM argument to the JVM config:

--add-opens=java.base/java.nio=ALL-UNNAMED
--sun-misc-unsafe-memory-access=allow

Configuration properties#

BigQuery configuration properties#
Property name	Description	Default
`bigquery.project-id`	The project ID of the Google Cloud account used to store the data, see also Billing and data projects	Taken from the service account or from `bigquery.parent-project-id`, if set
`bigquery.parent-project-id`	The project ID of the Google Cloud account used to store the data, see also Billing and data projects	Taken from the service account or from `bigquery.parent-project-id`, if set
`bigquery.views-enabled`	Enables the connector to read from views and not only tables. Read this section before enabling this feature.	`false`
`bigquery.view-expire-duration`	Expire duration for the materialized view.	`24h`
`bigquery.view-materialization-project`	The project where the materialized view is going to be created. See also, Reading from views.	The view’s project
`bigquery.view-materialization-dataset`	The dataset where the materialized view is going to be created. See also, Reading from views.	The view’s project
`bigquery.skip-view-materialization`	Use REST API to access views instead of Storage API. BigQuery `BIGNUMERIC` and `TIMESTAMP` types are unsupported.	`false`
`bigquery.view-materialization-with-filter`	Use filter conditions when materializing views.	`false`
`bigquery.views-cache-ttl`	Duration for which the materialization of a view will be cached and reused. Set to `0ms` to disable the cache.	`15m`
`bigquery.metadata.cache-ttl`	Duration for which metadata retrieved from BigQuery is cached and reused. Set to `0ms` to disable the cache.	`0ms`
`bigquery.max-read-rows-retries`	The number of retries in case of retryable server issues.	`3`
`bigquery.credentials-key`	The base64 encoded credentials key.	None. See the requirements section
`bigquery.credentials-file`	The path to the JSON credentials file.	None. See the requirements section
`bigquery.case-insensitive-name-matching`	Match dataset and table names case-insensitively.	`false`
`bigquery.case-insensitive-name-matching.cache-ttl`	Duration for which case insensitive schema and table names are cached. Set to `0ms` to disable the cache.	`0ms`
`bigquery.query-results-cache.enabled`	Enable query results cache.	`false`
`bigquery.arrow-serialization.enabled`	Enable using Apache Arrow serialization when reading data from BigQuery. Read this section before using this feature.	`true`
`bigquery.max-parallelism`	The max number of partitions to split the data into. Reduce this number if the default parallelism (number of workers x 3) is too high.
`bigquery.channel-pool.initial-size`	The initial size of the connection pool, also known as a channel pool, used for gRPC communication.	`1`
`bigquery.channel-pool.min-size`	The minimum number of connections in the connection pool, also known as a channel pool, used for gRPC communication.	`1`
`bigquery.channel-pool.max-size`	The maximum number of connections in the connection pool, also known as a channel pool, used for gRPC communication.	`1`
`bigquery.channel-pool.min-rpc-per-channel`	Threshold to start scaling down the channel pool. When the average of outstanding RPCs in a single minute drop below this threshold, channels are removed from the pool.	`0`
`bigquery.channel-pool.max-rpc-per-channel`	Threshold to start scaling up the channel pool. When the average of outstanding RPCs in a single minute surpass this threshold, channels are added to the pool.	`2147483647`
`bigquery.rpc-retries`	The maximum number of retry attempts to perform for the RPC calls. If this value is set to `0`, the value from `bigquery.rpc-timeout` is used. Retry is deactivated when both `bigquery.rpc-retries` and `bigquery.rpc-timeout` are `0`. If this value is positive, and the number of attempts exceeds `bigquery.rpc-retries` limit, retries stop even if the total retry time is still lower than `bigquery.rpc-timeout`.	`0`
`bigquery.rpc-timeout`	Timeout duration on when the retries for the RPC call should be given up completely. The higher the timeout, the more retries can be attempted. If this value is `0s`, then `bigquery.rpc-retries` is used to determine retries. Retry is deactivated when `bigquery.rpc-retries` and `bigquery.rpc-timeout` are both `0`. If this value is positive, and the retry duration has reached the timeout value, retries stop even if the number of attempts is lower than the `bigquery.rpc-retries` value.	`0s`
`bigquery.rpc-retry-delay`	The delay duration before the first retry attempt for RPC calls.	`0s`
`bigquery.rpc-retry-delay-multiplier`	Controls the change in delay before the next retry. The retry delay of the previous call is multiplied by the `bigquery.rpc-retry-delay-multiplier` to calculate the retry delay for the next RPC call.	`1.0`
`bigquery.rpc-proxy.enabled`	Use a proxy for communication with BigQuery.	`false`
`bigquery.rpc-proxy.uri`	Proxy URI to use, if connecting through a proxy.
`bigquery.rpc-proxy.username`	Proxy user name to use if connecting through a proxy.
`bigquery.rpc-proxy.password`	Proxy password to use if connecting through a proxy.
`bigquery.rpc-proxy.keystore-path`	Keystore containing client certificates to present to proxy, if connecting through a proxy. Only required if proxy uses mutual TLS.
`bigquery.rpc-proxy.keystore-password`	Password of the keystore specified by `bigquery.rpc-proxy.keystore-path`.
`bigquery.rpc-proxy.truststore-path`	Truststore containing certificates of the proxy server, if connecting through a proxy.
`bigquery.rpc-proxy.truststore-password`	Password of the truststore specified by `bigquery.rpc-proxy.truststore-path`.

Fault-tolerant execution support#

The connector supports Fault-tolerant execution of query processing. Read and write operations are both supported with any retry policy.

Type mapping#

Because Trino and BigQuery each support types that the other does not, this connector modifies some types when reading or writing data. Data types may not map the same way in both directions between Trino and the data source. Refer to the following sections for type mapping in each direction.

BigQuery type to Trino type mapping#

The connector maps BigQuery types to the corresponding Trino types following this table:

BigQuery type to Trino type mapping#
BigQuery type	Trino type	Notes
`BOOLEAN`	`BOOLEAN`
`INT64`	`BIGINT`	`INT`, `SMALLINT`, `INTEGER`, `BIGINT`, `TINYINT`, and `BYTEINT` are aliases for `INT64` in BigQuery.
`FLOAT64`	`DOUBLE`
`NUMERIC`	`DECIMAL(P,S)`	The default precision and scale of `NUMERIC` is `(38, 9)`.
`BIGNUMERIC`	`DECIMAL(P,S)`	Precision greater than 38 is not supported. The default precision and scale of `BIGNUMERIC` is `(77, 38)`.
`DATE`	`DATE`
`DATETIME`	`TIMESTAMP(6)`
`STRING`	`VARCHAR`
`BYTES`	`VARBINARY`
`TIME`	`TIME(6)`
`TIMESTAMP`	`TIMESTAMP(6) WITH TIME ZONE`	Time zone is UTC.
`GEOGRAPHY`	`VARCHAR`	In Well-known text (WKT) format.
`JSON`	`JSON`
`ARRAY`	`ARRAY`
`RECORD`	`ROW`

No other types are supported.

Trino type to BigQuery type mapping#

The connector maps Trino types to the corresponding BigQuery types following this table:

Trino type to BigQuery type mapping#
Trino type	BigQuery type	Notes
`BOOLEAN`	`BOOLEAN`
`VARBINARY`	`BYTES`
`DATE`	`DATE`
`DOUBLE`	`FLOAT`
`BIGINT`	`INT64`	`INT`, `SMALLINT`, `INTEGER`, `BIGINT`, `TINYINT`, and `BYTEINT` are aliases for `INT64` in BigQuery.
`DECIMAL(P,S)`	`NUMERIC`	The default precision and scale of `NUMERIC` is `(38, 9)`.
`VARCHAR`	`STRING`
`TIMESTAMP(6)`	`DATETIME`

No other types are supported.

System tables#

For each Trino table that maps to a BigQuery view, there exists a system table which exposes the corresponding BigQuery view definition. For example, if you have a view named example_view, run the following query to see the SQL that defines the view:

 SELECT * example_view$view_definition

Special columns#

In addition to the defined columns, the BigQuery connector exposes partition information in a number of hidden columns:

$partition_date: Equivalent to _PARTITIONDATE pseudo-column in BigQuery
$partition_time: Equivalent to _PARTITIONTIME pseudo-column in BigQuery

You can use these columns in your SQL statements like any other column. They can be selected directly or used in conditional statements.

Example

Inspect the partition date and time for each record:

SELECT *, "$partition_date", "$partition_time"
FROM example.web.page_views;

Retrieve all records stored in the partition _PARTITIONDATE = '2022-04-07':

SELECT *
FROM example.web.page_views
WHERE "$partition_date" = date '2022-04-07';

Note

Two special partitions __NULL__ and __UNPARTITIONED__ are not supported.

SQL support#

The connector provides read and write access to data and metadata in the BigQuery. In addition to the globally available and read operation statements, the connector supports the following features:

DELETE limitation#

If a WHERE clause is specified, the DELETE operation only works if the predicate in the clause can be fully pushed down to the data source.

ALTER TABLE EXECUTE#

This connector supports the following commands for use with ALTER TABLE EXECUTE:

collect_statistics#

The collect_statistics command is used with Managed statistics to collect statistics for a table and its columns.

The following statement collects statistics for the example_table table and all of its columns:

ALTER TABLE example_table EXECUTE collect_statistics;

Collecting statistics for all columns in a table may be unnecessarily performance-intensive, especially for wide tables. To only collect statistics for a subset of columns, you can include the columns parameter with an array of column names. For example:

ALTER TABLE example_table
    EXECUTE collect_statistics(columns => ARRAY['customer','line_item']);

Wildcard table#

The connector provides support to query multiple tables using a concise wildcard table notation.

SELECT *
FROM example.web."page_views_*";

Procedures#

`system.execute('query')`#

The execute procedure allows you to execute a query in the underlying data source directly. The query must use supported syntax of the connected data source. Use the procedure to access features which are not available in Trino or to execute queries that return no result set and therefore can not be used with the query or raw_query pass-through table function. Typical use cases are statements that create or alter objects, and require native feature such as constraints, default values, automatic identifier creation, or indexes. Queries can also invoke statements that insert, update, or delete data, and do not return any data as a result.

The query text is not parsed by Trino, only passed through, and therefore only subject to any security or access control of the underlying data source.

The following example sets the current database to the example_schema of the example catalog. Then it calls the procedure in that schema to drop the default value from your_column on your_table table using the standard SQL syntax in the parameter value assigned for query:

USE example.example_schema;
CALL system.execute(query => 'ALTER TABLE your_table ALTER COLUMN your_column DROP DEFAULT');

Verify that the specific database supports this syntax, and adapt as necessary based on the documentation for the specific connected database and database version.

Table functions#

The connector provides specific table functions to access BigQuery.

`query(varchar) -> table`#

The query function lets you query the underlying BigQuery database directly. It requires syntax native to BigQuery because the full query is pushed down and processed by BigQuery. This can be useful for accessing native features which are not implemented in SEP or for improving query performance in situations where running a query natively may be faster.

The native query passed to the underlying data source is required to return a table as a result set. Only the data source performs validation or security checks for these queries using its own configuration. Trino does not perform these tasks. Only use passthrough queries to read data.

For example, query the example catalog and group and concatenate all employee IDs by manager ID:

SELECT
  *
FROM
  TABLE(
    example.system.query(
      query => 'SELECT
        manager_id, STRING_AGG(employee_id)
      FROM
        company.employees
      GROUP BY
        manager_id'
    )
  );

Note

The query engine does not preserve the order of the results of this function. If the passed query contains an ORDER BY clause, the function result may not be ordered as expected.

Performance#

The connector includes a number of performance features, detailed in the following sections.

Table statistics#

BigQuery does not expose table statistics. Therefore, the connector cannot use cost based optimizations to improve query processing performance based on the actual data in the data source.

Managed statistics#

The connector supports Managed statistics which lets SEP collect and store table and column statistics that can then be used for performance optimizations in query planning.

Statistics must be collected manually using the built-in collect_statistics command, see ALTER TABLE EXECUTE for details and examples.

BigQuery Storage API#

The Storage API streams data in parallel directly from BigQuery via gRPC without using Google Cloud Storage as an intermediary. It has a number of advantages over using the previous export-based read flow and should improve read performance:

Direct Streaming: It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.
Column Filtering: The API allows column filtering to only read the data you are interested in. Backed by a columnar datastore, it can efficiently stream data without reading all columns.
Dynamic Sharding: The API rebalances records between readers until they all complete. This means that all Map phases will finish nearly concurrently. See this blog article on how dynamic sharding is similarly used in Google Cloud Dataflow.

For information about BigQuery pricing, see the BigQuery pricing documentation.

Pushdown#

The connector supports pushdown for a number of operations:

Dereference pushdown
Limit pushdown for access to tables and other objects when using the REST API to reduce CPU consumption in BigQuery and performance overall. Pushdown is not supported by the Storage API, used for the more common Trino-managed tables, and therefore not used for access with it.
Predicate pushdown on columns of the following SEP types:
- BOOLEAN
- BIGINT
- DOUBLE
- DECIMAL(P,S)
- VARCHAR
- VARBINARY
- DATE
- TIME(6)
- TIMESTAMP(6)
- TIMESTAMP(6) WITH TIME ZONE
Read the Type mapping section to see the complete type conversion.
Projection pushdown

Starburst Cached Views#

The connector supports table scan redirection to improve performance and reduce load on the data source.

Security#

The connector includes a number of security-related features, detailed in the following sections.

OAuth 2.0 token pass-through#

The BigQuery connector supports OAuth 2.0 token pass-through when SEP is configured to use Google Cloud Platform as an OAuth 2.0 identity provider.

To enable OAuth 2.0 passthrough, set the following catalog configuration property:

bigquery.authentication.type=TOKEN_PASS_THROUGH