Building an ETL using Microsoft Azure Synapse Analytics and SQL pool (2/2)

Starting with Azure Synapse Analytics — Data Flows style

In my last article about using Microsoft Azure Synapse Analytics, we covered how to navigate the platform and perform data analytics tasks using notebooks. For the benefit of those who are new here, let’s have a quick recap, shall we? The platform is a data science tool used to process and analyze big data. Since it is still in preview, documentation on how to use Synapse seamlessly is rather scarce.

Working with this tool requires you to ingest data from a source, transform the ingested data, and push the transformed data to a storage account of your choice from the available options. As the introduction suggests, notebooks are one way of working in Azure Synapse.

This article introduces the second way of manipulating data in Azure Synapse: using Data Flows.

Creating a workspace

Creating an Azure Synapse Workspace

A workspace can be created by searching for the application in the Microsoft Azure Portal.

Warning: Be sure not to select the Azure Synapse Analytics (formerly SQL DW) option.

ETL process:

An ETL process Extracts, Transforms, and Loads data: it transfers prepared data from one or many sources to other destinations. This Medium article explains the concept in more detail. To start with the extraction step, we need a source from which to get our data. To begin a new Data Flow, click the Develop tab on the left-hand panel in Synapse Analytics as shown below.

Starting a Data Flow
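Before clicking through the UI, it can help to see the extract–transform–load idea in miniature. The pandas sketch below is only a local stand-in for what a Data Flow does visually; the column names and the in-memory CSV are made up for illustration.

```python
import io
import pandas as pd

# Extract: read a CSV source (an in-memory stand-in for a file in Blob Storage)
raw_csv = io.StringIO("name,amount\nalice,10\nbob,20\nalice,5\n")
df = pd.read_csv(raw_csv)  # the first row is used as the header

# Transform: aggregate amounts per name
totals = df.groupby("name", as_index=False)["amount"].sum()

# Load: write the transformed data to a destination (a local stand-in for the sink)
out = io.StringIO()
totals.to_csv(out, index=False)
```

Each of the three steps above corresponds to a node you will wire up in the Data Flow canvas: a source, one or more transformations, and a sink.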

The image below is your typical working area when you click on Data flows.

Data Flows workspace

Source1 is where you ingest the data to be worked on. In the source settings, select the source type. In this article, we will import a CSV file from Blob Storage. If you select the Dataset option, click + New to choose where you will be pulling your data from.

Choosing an integration dataset

After selecting the Azure Blob Storage option in the pop-up window, select DelimitedText.

Selecting format of integration data

In the next window that opens, select your Blob Storage as the linked service, then browse the file path to select the CSV file to be used.

Important: Tick ‘First row as header’ so your column labels are read on import.

Setting properties for integration dataset

Ensure that you turn on Debug mode to validate your data flow. Once it is on, you will be able to preview your data after ingesting it.

For the transformation step of the ETL process, add transformations by clicking the + sign on the right corner of a placeholder. You have the option of joining different data sources, renaming columns, aggregating, and more.

ETL options
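For intuition, the common transformation options (join, rename, aggregate) map onto familiar DataFrame operations. The sketch below uses hypothetical column names, not anything from the Synapse UI.

```python
import pandas as pd

orders = pd.DataFrame({"cust_id": [1, 2, 1], "amount": [10, 20, 5]})
customers = pd.DataFrame({"cust_id": [1, 2], "cust_name": ["alice", "bob"]})

# Join: combine two source streams on a key column
joined = orders.merge(customers, on="cust_id")

# Rename: give a column a friendlier label
joined = joined.rename(columns={"cust_name": "customer"})

# Aggregate: total amount per customer
summary = joined.groupby("customer", as_index=False)["amount"].sum()
```

In a Data Flow, each of these would be its own node chained after the source, with the same inputs and outputs you see here.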

If you are a first-timer, Synapse provides step-by-step tips in the Data Flows working area.

Running Pipelines

More ETL options

After transforming your data, sink it to a destination dataset. Since we are interested in SQL pools here, that is the output stream where the transformed data will end up. Click the Azure Synapse Analytics (formerly SQL DW) option.

Choosing destination dataset

Set your properties in the pop-up window and pick an easily readable name.

Setting properties for destination dataset

In the properties window, you can either select an existing table or create a new one. When creating a new table, choose ‘dbo’ as the schema name and set your own table name. This step is similar to what we did in our notebook when writing to the SQL pool.
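Conceptually, the sink step creates a table under the chosen schema and inserts the transformed rows. The sqlite snippet below is only a local stand-in for the dedicated SQL pool (sqlite has no schemas, so the ‘dbo’ prefix is omitted; table and column names are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in for the SQL pool
cur = conn.cursor()

# In the SQL pool this table would live under the 'dbo' schema,
# e.g. dbo.customer_totals; sqlite uses a bare table name instead.
cur.execute("CREATE TABLE customer_totals (customer TEXT, amount INTEGER)")
rows = [("alice", 15), ("bob", 20)]  # transformed data from the previous step
cur.executemany("INSERT INTO customer_totals VALUES (?, ?)", rows)
conn.commit()

loaded = cur.execute(
    "SELECT customer, amount FROM customer_totals ORDER BY customer"
).fetchall()
```

The Data Flow sink does this plumbing for you once the linked service and table properties are configured.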

Errors you might encounter and how to correct them

Error message:

{"message":"Job failed due to reason: at Source 'alldatasource': Operation failed: \"Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.\", 403, HEAD, https://newdevstorage02.dfs.core.windows.net/blob/?upn=false&action=getAccessControl&timeout=90. Details:at Source 'alldatasource': Operation failed: \"Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.\", 403, HEAD, https://newdevstorage02.dfs.core.windows.net/blob/?upn=false&action=getAccessControl&timeout=90","failureType":"UserError","target":"Dataflow1","errorCode":"DFExecutorUserError"}

Workaround:

This error is associated with the authentication method. Set the Authentication method to ‘Account key’, choose the Connection string option, and then select ‘Enter manually’.
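When entering the connection string manually, it follows the standard Azure Storage format. The placeholders below are yours to fill in; the account key comes from the storage account’s Access keys blade.

```text
DefaultEndpointsProtocol=https;AccountName=<your-account-name>;AccountKey=<your-account-key>;EndpointSuffix=core.windows.net
```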

Helpful resource:

https://docs.microsoft.com/en-us/answers/questions/58750/data-flow-error-in-azure-synapse-analytics-workspa.html

Workaround using Managed Identity:

This error is associated with the authentication method. Alternatively, choose ‘Managed Identity’ as the authentication method; note that the workspace’s managed identity must have been granted access to the storage account (see the resource below).

Helpful resource:

https://techcommunity.microsoft.com/t5/azure-synapse-analytics/adf-to-synapse-pool-please-enable-managed-service-identity-and/ba-p/1817785

Error message:

Cannot connect to SQL Database: 'tcp:sponsoredtelefonica.sql.azuresynapse.net,1433', Database: 'dbo', User: ''. Check the linked service configuration is correct, and make sure the SQL Database firewall allows the integration runtime to access. Login failed for user ''., SqlErrorNumber=18456,Class=14,State=1, Activity ID: 950909f3-d217-4f75-b9fb-87360e42a4ba.

Workaround:

This may happen if your data source only allows secured connections. If that is the case, create a new workspace with the virtual network option enabled.

Fin. I’m always a post away!