---
title: "Tutorial: Calculate Azure Blob Storage container statistics"
description: Description goes here
author: normesta
ms.service: azure-blob-storage
ms.topic: tutorial
ms.date: 01/13/2025
ms.author: normesta
---

# Tutorial: Calculate container statistics by using Databricks

This tutorial shows you how to gather statistics about your containers by using Azure Blob Storage inventory along with Azure Databricks.

In this tutorial, you learn how to:

> [!div class="checklist"]
> * Generate an inventory report
> * Create an Azure Databricks workspace and notebook
> * Read the blob inventory file
> * Get the number and total size of blobs, snapshots, and versions
> * Get the number of blobs by blob type and content type

## Prerequisites

## Generate an inventory report

Enable blob inventory reports for your storage account. See Enable Azure Storage blob inventory reports.

Use the following configuration settings:

| Setting | Value |
|---------|-------|
| Rule name | blobinventory |
| Container | `<name of your container>` |
| Object type to inventory | Blob |
| Blob types | Block blobs, Page blobs, and Append blobs |
| Subtypes | include blob versions, include snapshots, include deleted blobs |
| Blob inventory fields | All |
| Inventory frequency | Daily |
| Export format | CSV |

You might have to wait up to 24 hours after enabling inventory reports for your first report to be generated.
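
After the first run completes, the inventory output (including the CSV file that you reference later in this tutorial) appears under a date-and-time prefix in the destination container. You can find it by browsing the container in the Azure portal. If you prefer to locate it programmatically, the following is a minimal optional sketch that lists the CSV files in the destination container; it assumes the `azure-storage-blob` Python package is installed, and the placeholder values are ones you supply.

```python
# Optional sketch: list inventory output files in the destination container.
# Assumes `pip install azure-storage-blob`; replace the placeholders with your values.
from azure.storage.blob import ContainerClient

account_name = "<storage-account-name>"   # placeholder
account_key = "<storage-account-key>"     # placeholder
container_name = "<container-name>"       # placeholder: the inventory destination container

client = ContainerClient(
    account_url=f"https://{account_name}.blob.core.windows.net",
    container_name=container_name,
    credential=account_key,
)

# Inventory runs write their output under a date/time prefix. Print any CSV files
# so that you can copy the full path for use later in this tutorial.
for blob in client.list_blobs():
    if blob.name.endswith(".csv"):
        print(blob.name)
```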

## Configure Azure Databricks

In this section, you create an Azure Databricks workspace and notebook. Later in this tutorial, you paste code snippets into notebook cells, and then run them to gather container statistics.

  1. Create an Azure Databricks workspace. See Create an Azure Databricks workspace.

  2. Create a new notebook. See Create a notebook.

  3. Choose Python as the default language of the notebook.

## Read the blob inventory file

  1. Copy and paste the following code block into the first cell, but don't run this code yet.

    ```python
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    import pyspark.sql.functions as F

    storage_account_name = "<storage-account-name>"
    storage_account_key = "<storage-account-key>"
    container = "<container-name>"
    blob_inventory_file = "<blob-inventory-file-name>"
    hierarchical_namespace_enabled = False

    if hierarchical_namespace_enabled == False:
        spark.conf.set("fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name), storage_account_key)
        df = spark.read.csv("wasbs://{0}@{1}.blob.core.windows.net/{2}".format(container, storage_account_name, blob_inventory_file), header='true', inferSchema='true')
    else:
        spark.conf.set("fs.azure.account.key.{0}.dfs.core.windows.net".format(storage_account_name), storage_account_key)
        df = spark.read.csv("abfss://{0}@{1}.dfs.core.windows.net/{2}".format(container, storage_account_name, blob_inventory_file), header='true', inferSchema='true')
    ```
  2. In this code block, replace the following values:

    - Replace the `<storage-account-name>` placeholder value with the name of your storage account.

    - Replace the `<storage-account-key>` placeholder value with the account key of your storage account.

    - Replace the `<container-name>` placeholder value with the container that holds the inventory reports.

    - Replace the `<blob-inventory-file-name>` placeholder with the fully qualified name of the inventory file (for example: `2023/02/02/02-16-17/blobinventory/blobinventory_1000000_0.csv`).

    - If your account has a hierarchical namespace, set the `hierarchical_namespace_enabled` variable to `True`.

  3. Press the Run button to run the code in this cell.
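
Optionally, before you calculate any statistics, you can confirm that the inventory file loaded as expected. The following small sketch inspects the DataFrame that the previous cell created; the exact columns that appear depend on the inventory fields you selected.

```python
# Optional sanity check: inspect the columns and preview a few rows of the
# inventory DataFrame created in the previous cell.
df.printSchema()                                  # column names and inferred types
print("Rows read from the inventory file:", df.count())
display(df.limit(5))                              # preview the first rows in the notebook
```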

## Get blob count and size

  1. In a new cell, paste the following code:

    print("Number of blobs in the container:", df.count()) print("Number of bytes occupied by blobs in the container:", df.agg({'Content-Length': 'sum'}).first()['sum(Content-Length)'])
  2. Press the Run button to run the cell.

    The notebook displays the number of blobs in a container and the number of bytes occupied by blobs in the container.

    Screenshot of results that appear when you run the cell showing the number of blobs and the size of blobs in the container.
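
Because the byte total can be hard to read at a glance, you can optionally convert it to gibibytes. This small sketch reuses the same aggregation over the Content-Length column:

```python
# Optional: express the total size of blobs in GiB for readability.
total_bytes = df.agg({'Content-Length': 'sum'}).first()['sum(Content-Length)']
print("Total size of blobs in the container: {:.2f} GiB".format(total_bytes / (1024 ** 3)))
```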

## Get snapshot count and size

  1. In a new cell, paste the following code:

    ```python
    from pyspark.sql.functions import *

    print("Number of snapshots in the container:", df.where(~(col("Snapshot")).like("Null")).count())
    dfT = df.where(~(col("Snapshot")).like("Null"))
    print("Number of bytes occupied by snapshots in the container:", dfT.agg({'Content-Length': 'sum'}).first()['sum(Content-Length)'])
    ```
  2. Press the Run button to run the cell.

    The notebook displays the number of snapshots and total number of bytes occupied by blob snapshots.

    Screenshot of results that appear when you run the cell showing the number of snapshots and the total combined size of snapshots.
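
If you also want to see which blobs have the most snapshots, you can group the snapshot rows by the Name inventory field. The following optional sketch assumes that Name is included in your report (it is when you select All inventory fields); the same pattern works with the VersionId column if you want to do this for blob versions.

```python
# Optional: count snapshots per blob name and show the ten blobs with the most snapshots.
from pyspark.sql.functions import col

snapshots = df.where(~(col("Snapshot")).like("Null"))
display(snapshots.groupBy("Name").count().orderBy("count", ascending=False).limit(10))
```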

## Get version count and size

  1. In a new cell, paste the following code:

    ```python
    from pyspark.sql.functions import *

    print("Number of versions in the container:", df.where(~(col("VersionId")).like("Null")).count())
    dfT = df.where(~(col("VersionId")).like("Null"))
    print("Number of bytes occupied by versions in the container:", dfT.agg({'Content-Length': 'sum'}).first()['sum(Content-Length)'])
    ```
  2. Press SHIFT + ENTER to run the cell.

    The notebook displays the number of blob versions and total number of bytes occupied by blob versions.

    Screenshot of results that appear when you run the cell showing the number of versions and the total combined size of versions.

## Get blob count by blob type

  1. In a new cell, paste the following code:

    ```python
    display(df.groupBy('BlobType').count().withColumnRenamed("count", "Total number of blobs in the container by BlobType"))
    ```
  2. Press SHIFT + ENTER to run the cell.

    The notebook displays the number of blobs of each blob type.

    Screenshot of results that appear when you run the cell showing the number of blobs of each blob type.
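
You can apply the same grouping to size rather than count. The following optional sketch sums the Content-Length column per BlobType so that you can see how much capacity each blob type consumes; it uses only columns that already exist in the inventory DataFrame.

```python
# Optional: total number of bytes per blob type, using the same grouping.
display(df.groupBy('BlobType').sum('Content-Length')
          .withColumnRenamed("sum(Content-Length)", "Total size in bytes by BlobType"))
```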

## Get blob count by content type

  1. In a new cell, paste the following code:

    ```python
    display(df.groupBy('Content-Type').count().withColumnRenamed("count", "Total number of blobs in the container by Content-Type"))
    ```
  2. Press SHIFT + ENTER to run the cell.

    The notebook displays the number of blobs associated with each content type.

    Screenshot of results that appear when you run the cell showing the number of blobs by content-type.
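
To see which content types consume the most space, you can optionally combine the count with a size aggregation and sort the result. This sketch also relies only on columns that are already in the inventory DataFrame:

```python
# Optional: blob count and total bytes per content type, largest first.
import pyspark.sql.functions as F

display(df.groupBy('Content-Type')
          .agg(F.count('*').alias('blob_count'), F.sum('Content-Length').alias('total_bytes'))
          .orderBy(F.desc('total_bytes')))
```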

## Terminate the cluster

To avoid unnecessary billing, terminate your compute resource. See Terminate a compute.

## Next steps
