--packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.13:1.2.1
BIGLAKE_ICEBERG_CATALOG_JAR
: the Cloud Storage URI of the Iceberg custom catalog plugin to install. Depending on your environment, select one of the following:
  Iceberg 1.5.1
  : gs://spark-lib/biglake/biglake-catalog-iceberg1.5.1-0.1.2-with-dependencies.jar
  Iceberg 1.5.0
  : gs://spark-lib/biglake/biglake-catalog-iceberg1.5.0-0.1.1-with-dependencies.jar
  Iceberg 1.2.0
  : gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.1-with-dependencies.jar
  Iceberg 0.14.0
  : gs://spark-lib/biglake/biglake-catalog-iceberg0.14.0-0.1.1-with-dependencies.jar
SPARK_CATALOG
: the catalog identifier for Spark. It is linked to a BigLake Metastore catalog.
PROJECT_ID
: the Google Cloud project ID of the BigLake Metastore catalog that the Spark catalog links with.
LOCATION
: the Google Cloud location of the BigLake Metastore catalog that the Spark catalog links with.
BLMS_CATALOG
: the BigLake Metastore catalog ID that the Spark catalog links with. The catalog does not need to exist, and it can be created in Spark.
GCS_DATA_WAREHOUSE_FOLDER
: the Cloud Storage folder where Spark creates all files. It starts with gs://.
HMS_DB
: (optional) the HMS database containing the table to copy from.
HMS_TABLE
: (optional) the HMS table to copy from.
HMS_URI
: (optional) the HMS Thrift endpoint.

Alternatively, you can submit a Dataproc job to a cluster. The following sample installs the appropriate Iceberg custom catalog.
To connect with a Dataproc cluster, submit a job with the following specifications:
CONFS="spark.sql.catalog.SPARK_CATALOG=org.apache.iceberg.spark.SparkCatalog,"CONFS+="spark.sql.catalog.SPARK_CATALOG.catalog-impl=org.apache.iceberg.gcp.biglake.BigLakeCatalog,"CONFS+="spark.sql.catalog.SPARK_CATALOG.gcp_project=PROJECT_ID,"CONFS+="spark.sql.catalog.SPARK_CATALOG.gcp_location=LOCATION,"CONFS+="spark.sql.catalog.SPARK_CATALOG.blms_catalog=BLMS_CATALOG,"CONFS+="spark.sql.catalog.SPARK_CATALOG.warehouse=GCS_DATA_WAREHOUSE_FOLDER,"CONFS+="spark.jars.packages=ICEBERG_SPARK_PACKAGE"gclouddataprocjobssubmitspark-sql--cluster=DATAPROC_CLUSTER \--project=DATAPROC_PROJECT_ID \--region=DATAPROC_LOCATION \--jars=BIGLAKE_ICEBERG_CATALOG_JAR \--properties="${CONFS}" \--file=QUERY_FILE_PATH
Replace the following:
DATAPROC_CLUSTER
: the Dataproc cluster to submit the job to.
DATAPROC_PROJECT_ID
: the project ID of the Dataproc cluster. This ID can be different from PROJECT_ID.
DATAPROC_LOCATION
: the location of the Dataproc cluster. This location can be different from LOCATION.
QUERY_FILE_PATH
: the path to the file containing queries to run.

Similarly, you can submit a batch workload to Dataproc Serverless. To do so, follow the batch workload instructions with the following additional flags:
--properties="${CONFS}"
--jars=BIGLAKE_ICEBERG_CATALOG_JAR
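For example, reusing the CONFS variable and placeholder values from the cluster example above, a Dataproc Serverless submission might look like the following sketch:

gcloud dataproc batches submit spark-sql QUERY_FILE_PATH \
  --project=DATAPROC_PROJECT_ID \
  --region=DATAPROC_LOCATION \
  --jars=BIGLAKE_ICEBERG_CATALOG_JAR \
  --properties="${CONFS}"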
You can use BigQuery stored procedures to run Dataproc Serverless jobs. The process is similar to running Dataproc Serverless jobs directly in Dataproc.
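As an illustration, a stored procedure for Apache Spark in BigQuery might look like the following sketch. The procedure name, dataset, and runtime version are hypothetical placeholders, and the exact option names should be verified against the BigQuery documentation for Spark stored procedures.

CREATE OR REPLACE PROCEDURE `PROJECT_ID.DATASET.create_iceberg_namespace`()
WITH CONNECTION `PROJECT_ID.LOCATION.CONNECTION_ID`
OPTIONS (
  engine = 'SPARK',
  runtime_version = '1.1',
  jar_uris = ['BIGLAKE_ICEBERG_CATALOG_JAR'],
  properties = [
    ('spark.sql.catalog.SPARK_CATALOG', 'org.apache.iceberg.spark.SparkCatalog'),
    ('spark.sql.catalog.SPARK_CATALOG.catalog-impl', 'org.apache.iceberg.gcp.biglake.BigLakeCatalog'),
    ('spark.sql.catalog.SPARK_CATALOG.gcp_project', 'PROJECT_ID'),
    ('spark.sql.catalog.SPARK_CATALOG.gcp_location', 'LOCATION'),
    ('spark.sql.catalog.SPARK_CATALOG.blms_catalog', 'BLMS_CATALOG'),
    ('spark.sql.catalog.SPARK_CATALOG.warehouse', 'GCS_DATA_WAREHOUSE_FOLDER')
  ]
)
LANGUAGE PYTHON AS R"""
# Runs in the Dataproc Serverless session that BigQuery creates for the procedure.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE NAMESPACE IF NOT EXISTS SPARK_CATALOG.BLMS_DB")
""";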
The following sections describe how to create resources in the metastore.
Catalog names have constraints; for more information, see Limitations. To create a catalog, select one of the following options:
Use the projects.locations.catalogs.create
method and specify the name of a catalog.
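For example, a REST call to this method might look like the following sketch. The biglake.googleapis.com endpoint shape and the catalogId query parameter follow the API reference pattern; verify them before use. The request body can be left empty because a catalog has no settable fields beyond its ID.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{}' \
  "https://biglake.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/catalogs?catalogId=BLMS_CATALOG"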
CREATE NAMESPACE SPARK_CATALOG;
This creates a BigLake Metastore catalog named 'my_catalog' in the US location. For more information, see the Terraform BigLake documentation.
resource"google_biglake_catalog""default"{name="my_catalog"location="US"}
Database names have constraints; for more information, see Limitations. To ensure that your database resource is compatible with data engines, we recommend creating databases using data engines instead of manually crafting the resource body. To create a database, select one of the following options:
Use the projects.locations.catalogs.databases.create
method and specify the name of a database.
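As a sketch, the equivalent REST call might look like the following. The hiveOptions field names mirror the Terraform example later in this section, and the location URI is an illustrative placeholder; check the API reference before relying on this shape.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "HIVE",
    "hiveOptions": {
      "locationUri": "GCS_DATA_WAREHOUSE_FOLDER/BLMS_DB.db"
    }
  }' \
  "https://biglake.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/catalogs/BLMS_CATALOG/databases?databaseId=BLMS_DB"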
CREATE NAMESPACE SPARK_CATALOG.BLMS_DB;
Replace the following:
BLMS_DB
: the BigLake Metastore database ID to create.

This creates a BigLake database named 'my_database' of type 'HIVE' in the catalog specified by the 'google_biglake_catalog.default.id' variable. For more information, see the Terraform BigLake documentation.
resource"google_biglake_database""default"{name="my_database"catalog=google_biglake_catalog.default.idtype="HIVE"hive_options{location_uri="gs://${google_storage_bucket.default.name}/${google_storage_bucket_object.metadata_directory.name}"parameters={"owner"="Alex"}}}
Table names have constraints. For more information, see Table naming. To create a table, select one of the following options:
Use the projects.locations.catalogs.databases.tables.create
method and specify the name of a table.
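If you call the API directly, a sketch of the request might look like the following. The field names mirror the Terraform example later in this section and the location URI is illustrative; in practice, creating the table from an engine such as Spark (as shown in the next option) is usually simpler.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "HIVE",
    "hiveOptions": {
      "tableType": "MANAGED_TABLE",
      "storageDescriptor": {
        "locationUri": "GCS_DATA_WAREHOUSE_FOLDER/BLMS_DB.db/BLMS_TABLE"
      }
    }
  }' \
  "https://biglake.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/catalogs/BLMS_CATALOG/databases/BLMS_DB/tables?tableId=BLMS_TABLE"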
CREATE TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE (id bigint, data string) USING iceberg;
Replace the following:
BLMS_TABLE
: the BigLake Metastore table ID to create.

This registers a BigLake Metastore table with the name "my_table" and type "HIVE" in the database specified by the "google_biglake_database.default.id" variable. Note that the table must exist prior to registration in the catalog, which can be accomplished by initializing the table from an engine such as Apache Spark. For more information, see the Terraform provider documentation for the BigLake Table resource.
resource"google_biglake_table""default"{name="my-table"database=google_biglake_database.default.idtype="HIVE"hive_options{table_type="MANAGED_TABLE"storage_descriptor{location_uri="gs://${google_storage_bucket.default.name}/${google_storage_bucket_object.data_directory.name}"input_format="org.apache.hadoop.mapred.SequenceFileInputFormat"output_format="org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat"}parameters={"spark.sql.create.version"="3.1.3""spark.sql.sources.schema.numParts"="1""transient_lastDdlTime"="1680894197""spark.sql.partitionProvider"="catalog""owner"="Alex""spark.sql.sources.schema.part.0"=jsonencode({"type":"struct","fields":[{"name":"id", "type" : "integer","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"age","type":"integer","nullable":true,"metadata":{}}]})"spark.sql.sources.provider"="iceberg""provider"="iceberg"}}}
This GitHub example provides a runnable end-to-end example that creates a BigLake Metastore catalog, database, and table. For more information on how to use this example, see Basic Terraform Commands.
To create an Iceberg table and copy a Hive Metastore table over to BigLake Metastore, use the following Spark SQL statement:
CREATE TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE (id bigint, data string) USING iceberg
TBLPROPERTIES(hms_table='HMS_DB.HMS_TABLE');
BigLake Metastore is the recommended metastore for querying BigLake external tables for Iceberg. When creating an Iceberg table in Spark, you can optionally create a linked BigLake Iceberg table at the same time.
To create an Iceberg table in Spark and automatically create a BigLake Iceberg table at the same time, use the following Spark SQL statement:
CREATE TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE (id bigint, data string) USING iceberg
TBLPROPERTIES(bq_table='BQ_TABLE_PATH', bq_connection='BQ_RESOURCE_CONNECTION');
Replace the following:
BQ_TABLE_PATH
: the path of the BigLake Iceberg table to create. Follow the BigQuery table path syntax. It uses the same project as the BigLake Metastore catalog if the project is unspecified.
BQ_RESOURCE_CONNECTION
: (optional) the format is project.location.connection-id. If specified, BigQuery queries use the Cloud Resource connection credentials to access BigLake Metastore. If not specified, BigQuery creates a regular external table instead of a BigLake table.
To manually create BigLake Iceberg table links with specified BigLake Metastore table URIs (blms://…), use the following BigQuery SQL statement:
CREATE EXTERNAL TABLE 'BQ_TABLE_PATH'
WITH CONNECTION `BQ_RESOURCE_CONNECTION`
OPTIONS (
  format = 'ICEBERG',
  uris = ['blms://projects/PROJECT_ID/locations/LOCATION/catalogs/BLMS_CATALOG/databases/BLMS_DB/tables/BLMS_TABLE']
)
The following sections describe how to view resources in BigLake Metastore.
To view a catalog, do the following:
To see all catalogs, use the projects.locations.catalogs.list
method.
To see information about a catalog, use the projects.locations.catalogs.get
method and specify the name of a catalog.
To view a database, do the following:
To see all databases in a catalog, use the projects.locations.catalogs.databases.list
method and specify the name of a catalog.
To see information about a database, use the projects.locations.catalogs.databases.get
method and specify the name of a database.
To see all databases in a catalog, use the following statement:
SHOW { DATABASES | NAMESPACES } IN SPARK_CATALOG;
To see information about a defined database, use the following statement:
DESCRIBE { DATABASE | NAMESPACE } [EXTENDED] SPARK_CATALOG.BLMS_DB;
To view all tables in a database or view a defined table, do the following:
To see all tables in a database, use the projects.locations.catalogs.databases.tables.list
method and specify the name of a database.
To see information about a table, use the projects.locations.catalogs.databases.tables.get
method and specify the name of a table.
To see all tables in a database, use the following statement:
SHOW TABLES IN SPARK_CATALOG.BLMS_DB;
To see information about a defined table, use the following statement:
DESCRIBE TABLE [EXTENDED] SPARK_CATALOG.BLMS_DB.BLMS_TABLE;
The following sections describe how to modify resources in the metastore.
To avoid conflicts when multiple jobs try to update the same table at the same time, BigLake Metastore uses optimistic locking. To use optimistic locking, you first need to get the current version of the table (called an etag) by using the GetTable
method. Then you can make changes to the table and use the UpdateTable
method, passing in the previously fetched etag. If another job updates the table after you fetch the etag, the UpdateTable
method fails. This measure ensures that only one job can update the table at a time, preventing conflicts.
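As an illustration of this flow, the following sketch fetches the table and then passes its etag back on the update. The endpoint shape, updateMask parameter, and field names follow the REST reference pattern and should be verified there before use.

TABLE_NAME="projects/PROJECT_ID/locations/LOCATION/catalogs/BLMS_CATALOG/databases/BLMS_DB/tables/BLMS_TABLE"

# Step 1: get the table and note the "etag" field in the response.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://biglake.googleapis.com/v1/${TABLE_NAME}"

# Step 2: patch the table, sending the etag back. If another job updated the
# table after step 1, this request fails instead of overwriting that change.
curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "etag": "ETAG_FROM_STEP_1",
    "hiveOptions": {"parameters": {"owner": "Alex"}}
  }' \
  "https://biglake.googleapis.com/v1/${TABLE_NAME}?updateMask=hiveOptions.parameters"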
To update a table, select one of the following options:
Use the projects.locations.catalogs.databases.tables.patch
method and specify the name of a table.
For table update options in SQL, see ALTER TABLE
.
To rename a table, select one of the following options:
Use the projects.locations.catalogs.databases.tables.rename
method and specify the name of a table and a newName
value.
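For example, a call to the rename method might look like the following sketch; whether newName takes a full resource name or only the table ID should be confirmed in the API reference.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "newName": "projects/PROJECT_ID/locations/LOCATION/catalogs/BLMS_CATALOG/databases/BLMS_DB/tables/NEW_BLMS_TABLE"
  }' \
  "https://biglake.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/catalogs/BLMS_CATALOG/databases/BLMS_DB/tables/BLMS_TABLE:rename"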
ALTER TABLE BLMS_TABLE RENAME TO NEW_BLMS_TABLE;
Replace the following:
NEW_BLMS_TABLE
: the new name for BLMS_TABLE. Must be in the same dataset as BLMS_TABLE.

The following sections describe how to delete resources in BigLake Metastore.
To delete a catalog, select one of the following options:
Use the projects.locations.catalogs.delete
method and specify the name of a catalog. This method does not delete the associated files on Google Cloud.
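For example, a sketch of the REST call (the endpoint follows the standard delete pattern for this API; verify against the reference):

curl -X DELETE \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://biglake.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/catalogs/BLMS_CATALOG"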
DROP NAMESPACE SPARK_CATALOG;
To delete a database, select one of the following options:
Use the projects.locations.catalogs.databases.delete
method and specify the name of a database. This method does not delete the associated files on Google Cloud.
DROP NAMESPACE SPARK_CATALOG.BLMS_DB;
To delete a table, select one of the following options:
Use the projects.locations.catalogs.databases.tables.delete
method and specify the name of a table. This method does not delete the associated files on Google Cloud.
To only drop the table, use the following statement:
DROP TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE;
To drop the table and delete the associated files on Google Cloud, use the following statement:
DROP TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE PURGE;