For some recent project work, I found myself building and rebuilding the same or similar Azure environments and resources specific to Azure data workloads (Azure Data Factory, Synapse, etc.). I’ve been interested in DevOps for some time and the combination of these things led me to dig into Infrastructure as Code.
I thought I’d share some of my recent experiences and learnings, specifically as they relate to automating and managing the deployment of Azure Data Factory environments and resources.
It’s a common buzzword these days, but what is Infrastructure as Code (IaC)? IaC is a software engineering methodology that involves managing and provisioning infrastructure resources using code and automation techniques. It treats infrastructure configuration, deployment, and management as software artifacts, enabling infrastructure to be versioned, tested, and deployed using the same practices as software development.
Traditionally, infrastructure provisioning and management involved manual processes, where system administrators would configure servers, networks, and other infrastructure components by hand. With the overwhelming move to the cloud, there is no longer the need (or even the ability) to physically touch or access servers, switches, or any hardware, so some form of automated, logical deployment is only natural. Treating infrastructure as software allows you to leverage many of the same DevOps methodologies and practices you may be using already. It allows you to manage your infrastructure and cloud environments as part of the same CI/CD processes.
Within IaC, infrastructure resources are defined in human-readable configuration files, or code, that describe the desired state of the infrastructure. IaC code is typically written in a domain-specific language with a declarative syntax that also supports programming constructs such as variables, loops, and conditionals. These code files can then be stored in a version control system, allowing teams to collaborate, review changes, and track the history of infrastructure configurations.
IaC tools interpret the code and interact with the underlying platforms to create, modify, or delete the required resources. They ensure that the infrastructure's actual state matches the desired state specified in the code, and any differences are automatically reconciled.
For my recent project, the IaC approach was particularly appealing because I needed to deploy the same, or similar, data engineering environments repeatedly. Based on situation-specific criteria, I needed to deploy various workload environments in specific, pre-defined ways. For this article, I’ll focus specifically on building an Azure Data Factory and related resources.
Clients often don’t have a lot of experience with the data engineering and analytics environments I’m building for them. So, in addition to configuring and deploying the actual working environment, the IaC process allows me to ‘embed’ some standard best practices, coding samples and other aids into the deployment process. Practices I typically suggest to customers include storing secrets in a Key Vault, authenticating with managed identities, and parameterizing datasets and pipelines so they can be reused.
In addition to consistently providing standard best practices in deployed environments, IaC gives me the opportunity to provide a basis for knowledge transfer and training on the solutions by deploying sample files and other training assets. As you’ll see below, I use an IaC approach to deploy not just a data factory, but linked services, datasets, pipelines and sample data files. All these assets are defined via IaC to work together and demonstrate recommended ways to use them.
For the project I was working on, my needs were specific to Microsoft Azure. Multi-cloud, on-prem or hybrid environments just didn’t come into play. There are many IaC languages and tools out there that address all these areas. They all have their pros and cons, and I don’t intend to get into recommendations or evaluations here. That said, Terraform does seem to be the de facto standard in this space.
Azure Resource Manager (ARM) is the deployment and management service for Azure. You can define and deploy Azure resources using ARM templates. But ARM templates are ‘lots of JSON’ and I find them hard to read and difficult to work with. As the Azure platform has evolved, Microsoft has provided Azure Bicep. Bicep is a declarative domain specific language used to define and deploy Azure resources. Bicep is really a “transparent abstraction” over ARM and ARM templates providing a much more concise and readable syntax, reliable type safety, and support for code reuse. Anything you can do or deploy in Azure or with ARM can be done in Bicep. All resource types, properties, API versions and new features available via ARM are available via Bicep. In my case, that meant some newly released and preview features I wanted to leverage were available right away.
The easiest way to get started with Bicep is to use the Azure CLI, which automatically installs the Bicep CLI command set. If you don’t already have the Azure CLI installed, you can find out how to do that here.
Verify that the Bicep command set is installed and check the installed version:
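If you want to confirm your setup, something like the following should work (output will vary by version; ‘az bicep install’ is only needed if the Bicep tooling hasn’t been pulled down yet):

```bash
# Install or update the Bicep tooling that ships with the Azure CLI
az bicep install
az bicep upgrade

# Confirm the Bicep command set is available and show the installed version
az bicep version
```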
Visual Studio Code offers a first-class authoring experience for your Bicep solutions via the Bicep extension for Visual Studio Code. If you plan to do anything with Bicep and don’t have this extension, get it now.
Bicep Extension for Visual Studio Code:
The Bicep extension provides all the language support, IntelliSense, autocompletion, etc. you expect with any first-class programming language.
The extension provides support for all resource types and API versions:
Extension Supports All Properties and Values for All Resources & Modules:
In my case, all my resources share the same life cycle and I want to manage them as a group. So the first thing to do is create a new resource group.
You can see the full code of the Azure Data Factory deployment in this repo bafridley/IaC (github.com). In addition to the main module, I have individual modules to deploy Key Vault, Storage, SQL Server and Azure Data Factory resources.
Create a new resource group for all new resources:
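The full version is in the repo; a minimal sketch of the idea looks something like this (the resource group name and location are just example values):

```bicep
// main.bicep - targets the subscription scope so the template can create the resource group itself
targetScope = 'subscription'

param location string = 'eastus'
param resourceGroupName string = 'rg-iac-adf'

resource rg 'Microsoft.Resources/resourceGroups@2022-09-01' = {
  name: resourceGroupName
  location: location
}
```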
With the simple Bicep code above saved to a file called ‘main.bicep’:
Execute the deployment from the Azure CLI:
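Because the resource group itself is created by the template, the deployment targets the subscription scope. Assuming the file and location above, the command looks something like:

```bash
az deployment sub create \
  --location eastus \
  --template-file main.bicep
```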
Bicep enables you to organize deployments into modules. A module is an individual Bicep file that is deployed from another Bicep file. Modules can improve code readability and allow a module to be reused in other deployments. In the example below, separate module files exist for the Key Vault (‘kv.bicep’) and Storage Account (‘stg.bicep’) deployments. These can all be referenced and deployed from the main Bicep file (‘main.bicep’).
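A sketch of what those module references might look like in ‘main.bicep’; the deployment names and parameters are illustrative, not the exact code from the repo:

```bicep
// main.bicep - reference the child modules and deploy them into the new resource group
module keyVault 'kv.bicep' = {
  name: 'keyVaultDeployment'
  scope: rg
  params: {
    location: location
  }
}

module storage 'stg.bicep' = {
  name: 'storageDeployment'
  scope: rg
  params: {
    location: location
  }
}
```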
Execute the deployment from the Azure CLI:
As mentioned earlier, a Key Vault is the recommended way to store, manage and consume secrets. I manage all the Key Vault resources, as well as some other ‘security and authorization’ resources, in the ‘kv.bicep’ module. Using a Key Vault with Azure Data Factory relies on a managed identity, either the system assigned managed identity or a user assigned managed identity. I use a user assigned managed identity for other reasons too, so it’s included in the ‘kv.bicep’ module as well.
Define a User Assigned Managed Identity:
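A minimal sketch of the identity resource; the name is illustrative:

```bicep
// kv.bicep - user assigned managed identity shared by the Key Vault and Data Factory resources
resource managedIdentity 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31' = {
  name: 'id-iac-adf'
  location: location
}
```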
Add the newly created Managed ID to the system-defined Contributor role:
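Something along these lines assigns the identity to the built-in Contributor role; the GUID below is Azure’s well-known role definition ID for Contributor, and the guid() seed is just one way to generate a stable assignment name:

```bicep
// Built-in Contributor role definition (well-known Azure GUID)
var contributorRoleId = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'b24988ac-6180-42a0-ab88-20f7382dd24c')

// Grant the managed identity Contributor rights on the resource group
resource contributorAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(resourceGroup().id, managedIdentity.id, contributorRoleId)
  properties: {
    roleDefinitionId: contributorRoleId
    principalId: managedIdentity.properties.principalId
    principalType: 'ServicePrincipal'
  }
}
```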
Create Key Vault, Policies and Secrets:
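A condensed sketch of the vault and a sample secret; the vault name, secret name and permission lists are illustrative and would come from parameters in the real module:

```bicep
// Key Vault with an access policy that lets the managed identity read secrets
resource keyVault 'Microsoft.KeyVault/vaults@2023-02-01' = {
  name: keyVaultName
  location: location
  properties: {
    sku: {
      family: 'A'
      name: 'standard'
    }
    tenantId: tenant().tenantId
    accessPolicies: [
      {
        tenantId: tenant().tenantId
        objectId: managedIdentity.properties.principalId
        permissions: {
          secrets: [ 'get', 'list' ]
        }
      }
    ]
  }
}

// Example secret - the SQL admin password consumed later by a Data Factory linked service
resource sqlPasswordSecret 'Microsoft.KeyVault/vaults/secrets@2023-02-01' = {
  parent: keyVault
  name: 'sqlAdminPassword'
  properties: {
    value: sqlAdminPassword
  }
}
```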
Having a low cost, LRS storage account with a blob container is a default in every Azure subscription I use. My Data Factory example uses a blob container called ‘data’ to hold sample text files and other data.
Create storage account and blob container:
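A sketch of the ‘stg.bicep’ pieces; the account name comes from a parameter:

```bicep
// stg.bicep - low cost, locally redundant storage account
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: storageAccountName
  location: location
  sku: {
    name: 'Standard_LRS'
  }
  kind: 'StorageV2'
}

// Blob service plus the 'data' container that holds the sample CSV files
resource blobService 'Microsoft.Storage/storageAccounts/blobServices@2023-01-01' = {
  parent: storageAccount
  name: 'default'
}

resource dataContainer 'Microsoft.Storage/storageAccounts/blobServices/containers@2023-01-01' = {
  parent: blobService
  name: 'data'
}
```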
Add Access Key and Connection String Secrets to KV:
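In the repo the vault and storage account live in separate modules, so the real code passes references between them; compressed into one sketch, storing the key and connection string looks roughly like this (note the listKeys() call, which matters in the next section):

```bicep
// Pull the primary access key from the storage account at deployment time
var storageAccessKey = storageAccount.listKeys().keys[0].value
var storageConnectionString = 'DefaultEndpointsProtocol=https;AccountName=${storageAccount.name};AccountKey=${storageAccessKey};EndpointSuffix=${environment().suffixes.storage}'

// Store both values as Key Vault secrets for later use by linked services
resource accessKeySecret 'Microsoft.KeyVault/vaults/secrets@2023-02-01' = {
  parent: keyVault
  name: 'storageAccessKey'
  properties: {
    value: storageAccessKey
  }
  // The Bicep extension flags this as redundant - the listKeys() reference above already implies it
  dependsOn: [
    storageAccount
  ]
}

resource connectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2023-02-01' = {
  parent: keyVault
  name: 'storageConnectionString'
  properties: {
    value: storageConnectionString
  }
}
```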
At deployment, Azure Resource Manager evaluates the dependencies between resources and deploys them in dependency order. To make deployments faster and more efficient, ARM may deploy resources in parallel when they aren't dependent on each other.
Notice the yellow wavy line under the ‘dependsOn’ property in the previous screen shot. That wavy line is the Bicep extension’s way of letting you know that it’s an unnecessary (redundant) declaration. While you can manually define dependencies using the ‘dependsOn’ property, the extension will also infer dependencies from the rest of the code. In this case, the ‘storageAccessKey’ variable is assigned the value of ‘storageAccount.listKeys().keys[0].value’. The ‘storageAccessKey’ variable is then assigned to the value property of the Key Vault secret resource, creating an implicit dependency and making it unnecessary to specify the dependency directly. In this way, Bicep understands and manages resource deployment to ensure dependent resources are deployed after those they depend on.
A DeploymentScript resource is a specialized Azure resource type that allows you to define and deploy custom scripts as part of your infrastructure deployments. A deployment script resource lets you execute code (either PowerShell or Azure CLI script) as part of your deployment. A DeploymentScript requires two supporting resources for script execution and troubleshooting: a storage account and a container instance. You can specify an existing storage account, otherwise the script service creates one for you.
A typical use case for a deployment script is to create or manipulate Azure AD objects. In my case, I want to upload sample CSV files to the blob container in my storage account. In addition to deploying Azure Data Factory and associated resources, I also want to build some samples and best practices into the Data Factory. The uploaded file will be used as part of a standard Data Factory CopyData pipeline to create a table in a SQL Server database.
Deployment Script to Upload Sample CSV file to Blob Storage:
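A trimmed sketch of what that resource can look like; the CSV content, CLI version and names are illustrative, and the script authenticates to storage with the account name and key passed in as environment variables:

```bicep
// Upload a sample CSV to the 'data' container using an Azure CLI deployment script
resource uploadSampleCsv 'Microsoft.Resources/deploymentScripts@2023-08-01' = {
  name: 'uploadSampleCsv'
  location: location
  kind: 'AzureCLI'
  properties: {
    azCliVersion: '2.52.0'
    retentionInterval: 'PT1H'
    environmentVariables: [
      {
        name: 'AZURE_STORAGE_ACCOUNT'
        value: storageAccount.name
      }
      {
        name: 'AZURE_STORAGE_KEY'
        secureValue: storageAccount.listKeys().keys[0].value
      }
    ]
    scriptContent: '''
      printf 'id,name,amount\n1,Widget,100\n2,Gadget,250\n' > sample.csv
      az storage blob upload --container-name data --name sample.csv --file sample.csv --overwrite
    '''
  }
}
```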
Like the storage account and blob container highlighted earlier, I deploy a SQL Server and database that will be used as a source in a Data Factory pipeline. While I don’t show that here, you can see the full code of the Azure Data Factory deployment in this repo bafridley/IaC (github.com).
The Azure Data Factory module (‘adf.bicep’) contains the bulk of the Bicep code in my deployment, but I won’t discuss it all here. The full code defines a data factory, linked services, datasets, pipelines, access policies, key vault secrets and more.
Define the Data Factory Resource:
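The factory resource itself is small; here is a sketch with both a system assigned and the user assigned identity attached (in the real module the identity comes in as a parameter or an existing reference):

```bicep
// adf.bicep - the data factory, with both a system assigned and the user assigned identity
resource dataFactory 'Microsoft.DataFactory/factories@2018-06-01' = {
  name: dataFactoryName
  location: location
  identity: {
    type: 'SystemAssigned,UserAssigned'
    userAssignedIdentities: {
      '${managedIdentity.id}': {}
    }
  }
}
```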
Add Key Vault Policy for New Data Factory:
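Adding an access policy so the new factory can read secrets might look roughly like this (‘keyVault’ would be an existing resource reference in the ADF module):

```bicep
// Grant the data factory's system assigned identity read access to Key Vault secrets
resource adfKeyVaultPolicy 'Microsoft.KeyVault/vaults/accessPolicies@2023-02-01' = {
  parent: keyVault
  name: 'add'
  properties: {
    accessPolicies: [
      {
        tenantId: tenant().tenantId
        objectId: dataFactory.identity.principalId
        permissions: {
          secrets: [ 'get', 'list' ]
        }
      }
    ]
  }
}
```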
Azure Blob Storage Linked Service:
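A sketch of the blob storage linked service, pulling its connection string from the vault via a Key Vault linked service; the names and secret name are illustrative:

```bicep
// Key Vault linked service, so other linked services can resolve secrets at runtime
resource keyVaultLinkedService 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: dataFactory
  name: 'ls_keyvault'
  properties: {
    type: 'AzureKeyVault'
    typeProperties: {
      baseUrl: keyVault.properties.vaultUri
    }
  }
}

// Blob storage linked service whose connection string is read from the vault
resource blobLinkedService 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: dataFactory
  name: 'ls_blob'
  properties: {
    type: 'AzureBlobStorage'
    typeProperties: {
      connectionString: {
        type: 'AzureKeyVaultSecret'
        store: {
          referenceName: keyVaultLinkedService.name
          type: 'LinkedServiceReference'
        }
        secretName: 'storageConnectionString'
      }
    }
  }
}
```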
CSV Dataset Resource:
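A sketch of a parameterized delimited text (CSV) dataset; the ‘fileName’ parameter is what makes it reusable for any file in the ‘data’ container:

```bicep
// CSV dataset, parameterized so it can point at any file in the 'data' container
resource csvDataset 'Microsoft.DataFactory/factories/datasets@2018-06-01' = {
  parent: dataFactory
  name: 'ds_csv'
  properties: {
    type: 'DelimitedText'
    linkedServiceName: {
      referenceName: blobLinkedService.name
      type: 'LinkedServiceReference'
    }
    parameters: {
      fileName: {
        type: 'String'
      }
    }
    typeProperties: {
      location: {
        type: 'AzureBlobStorageLocation'
        container: 'data'
        fileName: {
          value: '@dataset().fileName'
          type: 'Expression'
        }
      }
      columnDelimiter: ','
      firstRowAsHeader: true
    }
  }
}
```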
Data Factory Pipeline:
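And a trimmed sketch of the pipeline that copies the CSV into SQL Server; the pipeline parameter feeds the dataset parameter, and the SQL dataset name is a placeholder for the one defined elsewhere in the module:

```bicep
// Pipeline with a Copy activity that reads the parameterized CSV dataset and writes to SQL
resource copyPipeline 'Microsoft.DataFactory/factories/pipelines@2018-06-01' = {
  parent: dataFactory
  name: 'pl_copy_csv_to_sql'
  properties: {
    parameters: {
      sourceFileName: {
        type: 'String'
        defaultValue: 'sample.csv'
      }
    }
    activities: [
      {
        name: 'CopyCsvToSql'
        type: 'Copy'
        inputs: [
          {
            referenceName: csvDataset.name
            type: 'DatasetReference'
            parameters: {
              fileName: '@pipeline().parameters.sourceFileName'
            }
          }
        ]
        outputs: [
          {
            referenceName: 'ds_sql_table' // placeholder for the SQL dataset defined elsewhere
            type: 'DatasetReference'
          }
        ]
        typeProperties: {
          source: {
            type: 'DelimitedTextSource'
          }
          sink: {
            type: 'AzureSqlSink'
            tableOption: 'autoCreate'
          }
        }
      }
    ]
  }
}
```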
To focus on key concepts, I’ve rolled up several sections of the code. You can get the full code from the GitHub repository mentioned earlier.
For the first pass of this solution, I wanted to make it flexible and reusable. Providing different values for the parameters allows for deploying many different environments with the same basic code. While all the parameters make the solution flexible, it can be tedious to declare all those parameters and pass them to and from all the different modules. A parameter file helps with part of this problem. You can provide the values for all your parameters in a single file rather than typing them in at the command line every time you want to execute a deployment.
While the parameter file helps, it can still be tedious and error prone to declare and pass all those parameters between modules. In the future, I plan to convert all the individual parameters to properties of an object. I will still save and get the values from a parameter file, but instead of passing all those individual parameters around, I’ll pass a single object and reference the values off that object.
Bicep provides a lot of functionality with parameters, including providing default values, constraints on values and much more. For more information, see: Parameters in Bicep.
Parameters and Parameter File:
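One way the declarations and a parameter file can look; the names and constraints are illustrative, and I’m showing the newer ‘.bicepparam’ format here (a classic JSON parameters file works the same way):

```bicep
// main.bicep - parameter declarations with defaults, constraints and a secure value
@description('Azure region for all resources')
param location string = 'eastus'

@minLength(3)
@maxLength(11)
param namePrefix string

@secure()
param sqlAdminPassword string
```

And the matching parameter file:

```bicep
// main.bicepparam - supply values from a file instead of the command line
using 'main.bicep'

param location = 'eastus'
param namePrefix = 'iacadf'
param sqlAdminPassword = readEnvironmentVariable('SQL_ADMIN_PASSWORD')
```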
Execute Deployment via Azure CLI with Parameter File:
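With a recent Azure CLI, the deployment command just points at the parameter file (the same pattern works with a JSON parameters file):

```bash
az deployment sub create \
  --location eastus \
  --template-file main.bicep \
  --parameters main.bicepparam
```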
You can review the status of completed deployments or monitor in-progress deployments from the Azure portal. If the deployment is actively running, the status will be listed as Deploying. Completed deployments will have a status of Succeeded or Failed, depending on the results. If viewing in real time, the portal should automatically refresh, and you’ll see individual modules and resources within those modules as they are deployed.
Monitor the Deployment:
Recall that the first resource created in my deployment was the resource group (‘rg-iac-adf’, in this case). All resources share the same life cycle and scope, so all the objects are created in this resource group.
Review Deployed Objects:
As part of the Data Factory deployment, I create a pipeline that copies data from a source CSV file and creates a SQL Server table with the data. As a demonstration of best practices, I use parameters on various artifacts to make them reusable for any CSV file and SQL table. Similar resources, utilizing parameters and dynamic expressions, handle the outgoing SQL tables, but in the interest of space I’ll review only the resources created for the incoming CSV file.
Data Factory Linked Services:
Data Factory Datasets with Parameters:
CSV Dataset Parameter for File Path:
Data Factory Pipeline with Parameters:
Use Parameter in CopyData Activity:
There is a lot more to Bicep, and it’s a much more powerful tool than I’d originally thought. And while I only touched on the basics, this post still turned out to be longer than I originally anticipated. I also hope to have shown how you can use the concepts and tooling of IaC not just to deploy infrastructure, but to enforce standards and best practices while doing so.