Building an ETL using Microsoft Azure Synapse Analytics and SQL pool (1/2)

Starting with Azure Synapse Analytics — Notebook style

Today we are delving into Microsoft Azure Synapse Analytics. I am sharing this introductory article on how to use Synapse. This platform is a Data Science tool which aids in processing and analyzing Big Data. Since this platform workspace is in its preview state, there is not much information on it, especially when errors are encountered. Basically, to work with this analytics tool means ingesting data from a source, transforming data and pushing it back into a storage account. There are two ways of working with data in Synapse Analytics. This can be done through Notebooks or Data flows. In this article, we will highlight on what I know best — Notebooks. Ensure you have permission to access a working workspace.

Creating a workspace

A workspace can be created by searching for the application in the Microsoft Azure Portal.

Warning: Be sure not to select the Azure Synapse Analytics (formely SQL DW) option.

ETL process:

An ETL process is the process of Extracting, Transforming and Loading data. It is the process of transferring prepared data from one or many sources to other destinations. This medium article properly explains. To start with the extraction process, we need a source from which we get our data. It is vital to specify the storage account you will be working with, as well as the container. These configurations can be done in the Azure notebook under the workspace you create. To start a new notebook, click on the Develop tab on the left-hand panel in Synapse Analytics as shown below.

On the topmost tab in a new notebook, attach the notebook to a Spark pool. This Spark pool is created by you through the Manage tab also in the left hand-side pane of Synapse Analytics. You can specify the language with which you will be working with.

Currently there are four language options. These are: 1. PySpark (Python) 2. Spark(Scala) 3. .NET Spark (C#) 4. Spark SQL. In this article, we will use PySpark (Python), an option which allows you to use the pandas library.

Using pandas knowledge, import the csv file to be worked with. If working with a file in a storage account (e.g. container), you can retrieve the URL of the file from viewing its properties where the file is saved.

In transforming the data, it depends on the aim you are trying to achieve. Whatever final stage you want your data in, you decide which transformative processes to go through. It can be column renaming, aggregation of data or merging different columns to form a different table. This pandas resource can be helpful in learning the different required techniques for transforming data.

Finally transformed data can be loaded into a data store such as an SQL pool. Ensure that your notebook is connected to a linked service for this to work.

To load to an SQL pool means loading the prepared data or table in a form acceptable. The pandas dataframe must be converted into a pyspark dataframe, converted to Scala and then written into the SQL pool.

Steps to follow:

1. Replace NaN values with ‘None’ values to a form readable by Spark.

2. Datatypes should be float, integer or object type.

3. Create a pyspark dataframe:

creating a pyspark dataframe

4.

5. Create a spark variable to store the spark table. Use the %%spark magic command to allow you to write the spark language in the pySpark language notebook cell.

creating spark table

You might get the error below if you do not have an existing and running storage account being written to:

Error : org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: Status code: -1 error code: null error message: InvalidAbfsRestOperationExceptionjava.net.UnknownHostException:

This is corrected by ensuring that you are assigned the permission to be Blob storage container contributor and owner from the Azure Portal.

6. Do this to write to the SQL pool.

Writing to SQL pool

Important: As you work, Publish the notebook after each major edit. Running the codes alone will not save the Notebook file. Failure to do this will cause you to lose all your work. Also, pause the pools when not in use, to limit cost. Column names should be readable — avoid using special characters such as these “ {}-/ ”.

Spark tables use double datatypes for numeric values.

Errors likely to be faced

Solution: Check internet connectivity

Solution: In writing to a spark dataframe, you might get an error if you have different data types in one table trying to merge. Ensure that you convert different datatypes to one type. This can be done using ‘astype’ function.

Solution: Check that you have been permission as user to write to the database you’re attempting to access. Check also your role assignments to the storage accounts you are trying to access, in the Microsoft azure portal.

You can write an SQL Script in Synapse from the Develop tab to add user:

creating a new database user

Error : %%spark is not a magic line.

Solution: Ensure that %%spark is the very first line of code in your notebook cell where you are running a spark command.

Happy coding!

Supplementary articles:

Top 10 features in Azure Synapse Analytics Workspace