Databricks

Understanding SQL Warehouses and All-Purpose Compute

SQL Warehouses (Serverless)

SQL Warehouses (Serverless) in Databricks refers to using serverless SQL endpoints for running SQL queries.

Here's why it's recommended over all-purpose compute for certain tasks:

Attribute	Description
Cost-effectiveness	Serverless SQL endpoints allow you to pay only for the queries you execute, without the need to provision or manage dedicated infrastructure, making it more cost-effective for ad-hoc or sporadic queries.
Scalability	Serverless architectures automatically scale resources based on demand, ensuring optimal performance for varying workloads.
Simplified Management	With serverless SQL endpoints, you don't need to manage clusters or infrastructure, reducing operational overhead.
Minimum Requirements	The minimum requirements for using a serverless SQL warehouse typically include access to a Databricks workspace and appropriate permissions to create and run SQL queries.

All-purpose Compute

All-purpose compute in Databricks refers to clusters that are not optimized for specific tasks. While they offer flexibility, they may not provide the best performance or cost-effectiveness for certain workloads. Here's why they might not be recommended:

Attribute	Description
Slow Spin-up Time	All-purpose compute clusters may take longer to spin up compared to specialized clusters, resulting in delays before processing can begin.
Connection Timeouts	Because of the longer spin-up times, connections may time out, especially for applications or services that expect quick responses.

Node pool and its usage

A node pool in Databricks is a set of homogeneous virtual machines (VMs) within a cluster. It allows you to have a fixed set of instances dedicated to specific tasks, ensuring consistent performance and resource isolation. Here's how node pools are typically used:

Attribute	Description
Resource Isolation	Node pools provide resource isolation, allowing different workloads or applications to run without impacting each other's performance.
Optimized Performance	By dedicating specific nodes to particular tasks, you can optimize performance for those workloads.
Cost-effectiveness	Node pools can be more cost-effective than using all-purpose compute for certain workloads, as you can scale resources according to the specific requirements of each task.

Improving "All-purpose compute" with node pool and minimum requirements

To improve the performance of all-purpose compute using node pools, you can follow these steps:

Action	Description
Define Workload-Specific Node Pools	Identify the specific tasks or workloads that require optimized performance and create dedicated node pools for them.
Specify Minimum Requirements	Determine the minimum resources (such as CPU, memory, and disk) required for each workload and configure the node pools accordingly.
Monitor and Adjust	Continuously monitor the performance of your node pools and adjust resource allocations as needed to ensure optimal performance.

Node Pool minimum configuration

Screenshot

Attach the Compute with the Node Pool

Screenshot

Information on how to retrieve the connection details

This section explains how to retrieve the connection details that you need to connect to Databricks.

Credentials to connect with Qualytics

Host: <host-name>.cloud.databricks.com or <host-name>.azuredatabricks.net
Http Path: sql/protocolv1/o/xxxxx/xyz-xyz-xyz (cluster) or /sql/1.0/warehouses/xyzpto (SQL warehouse)
Catalog: Your available catalog in Databricks
Database: Your available schema in Databricks
Personal Access Token: Retrieved from User settings
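Before pasting these values into a form, it can help to sanity-check their shape. The following is a minimal sketch, not part of Qualytics or Databricks: the function names are hypothetical, and the patterns are inferred only from the example host and path formats listed above (cluster paths of the form `sql/protocolv1/o/...`, warehouse paths of the form `/sql/1.0/warehouses/...`). Workspaces on other domains may not match the suffix list.

```python
import re

# Host suffixes taken from the examples above. Other clouds or custom
# domains may differ, so treat this tuple as an assumption.
KNOWN_HOST_SUFFIXES = (".cloud.databricks.com", ".azuredatabricks.net")

# Patterns inferred from the two example paths (not an official grammar).
WAREHOUSE_PATH = re.compile(r"^/?sql/1\.0/warehouses/[^/]+$")
CLUSTER_PATH = re.compile(r"^/?sql/protocolv1/o/[^/]+/[^/]+$")


def classify_http_path(http_path: str) -> str:
    """Return 'warehouse', 'cluster', or 'unknown' for an HTTP path."""
    path = http_path.strip()
    if WAREHOUSE_PATH.match(path):
        return "warehouse"
    if CLUSTER_PATH.match(path):
        return "cluster"
    return "unknown"


def normalize_host(host: str) -> str:
    """Strip a URL scheme and trailing slash, leaving the bare hostname."""
    host = re.sub(r"^https?://", "", host.strip())
    return host.rstrip("/")


def looks_like_databricks_host(host: str) -> bool:
    """True if the hostname ends with one of the known suffixes."""
    return normalize_host(host).endswith(KNOWN_HOST_SUFFIXES)
```

For example, `classify_http_path("/sql/1.0/warehouses/xyzpto")` returns `"warehouse"`, which is a quick way to confirm you copied the path from the SQL warehouse rather than from a cluster.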

Get connection details for a SQL warehouse

Click SQL Warehouses in the sidebar.
Choose a warehouse to connect to.
Navigate to the Connection Details tab.
Copy the connection details.

Screenshot

Get connection details for a cluster

Click Compute in the sidebar.
Choose a cluster to connect to.
Navigate to Advanced Options.
Click on the JDBC/ODBC tab.
Copy the connection details.

Screenshot

Get the Access Token

Token generation is described in the Databricks documentation. The steps are:

1. In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop-down menu.

Screenshot

2. On the Settings page, select the Developer option in the User section.

Screenshot

3. On the Developer page, you will see Manage Access Tokens below the Developer divider.

Screenshot

4. On the Developer page, click Manage under Access Tokens.

Screenshot

5. On the Access Tokens page, click the Generate new token button.

6. A modal appears where you can add a description and a validity period (in days) for the token:

Screenshot

7. After filling in these fields, click Generate to display the token:

Screenshot

Warning

Once you click Done, the modal closes and the token is never shown again. Save the Personal Access Token in a secure place.

8. The new token appears on the Access Tokens page:

Screenshot

You can also revoke a token on the Access Tokens page by clicking the trash icon:

Screenshot
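Because the token is shown only once, any scripts you write around it should read it from a secure location rather than hardcoding it. The sketch below is one possible approach, assuming the token has been exported in an environment variable. The variable name `DATABRICKS_TOKEN` and both helper functions are illustrative assumptions, not something Qualytics requires.

```python
import os


def load_databricks_token(env_var: str = "DATABRICKS_TOKEN") -> str:
    """Read the token from an environment variable and fail loudly if unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to your Personal Access Token")
    return token


def mask_token(token: str, visible: int = 4) -> str:
    """Return a log-safe form that reveals only the last few characters."""
    if len(token) <= visible:
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]
```

Logging `mask_token(token)` instead of the raw value lets you tell which token is in use without exposing it.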

Steps to Set Up Databricks in Qualytics

Fill in the form with the credentials of your data source.

Screenshot

Once the form is completed, test the connection to verify that Qualytics can connect to your data source. A success message is shown:

Screenshot

Warning

Clicking the Finish button creates the Datastore and skips the configuration of an Enrichment Datastore.

To configure an Enrichment Datastore later, please refer to this section.

Note

It is important to associate an Enrichment Datastore with your new Datastore.

The Enrichment Datastore allows Qualytics to record enrichment data, copies of anomalous source data, and additional metadata for your Datastore.

Configuring an Enrichment Datastore

If you already have an Enrichment Datastore set up, you can link it by enabling the option to use an existing Enrichment Datastore and selecting it from the list.
If you don't have an Enrichment Datastore, you can create one on the same page:

Once the form is completed, test the connection. A success message is shown:

Screenshot

Warning

Clicking the Finish button creates the Datastore and links or creates the Enrichment Datastore.

Fields

`Name` `required`

The name of the datastore to be created in the Qualytics app.

`Server Hostname` `required`

The address of the server to connect to.

`Http Path` `required`

The HTTP path of the Databricks compute resource (SQL warehouse or cluster).

`Catalog` `optional`

The name of the catalog to access.
You can list the available catalogs by running:

    SHOW CATALOGS [ LIKE regex_pattern ]

`Database` `optional`

The name of the database (a schema in Databricks) to access.
You can list the available databases by running:

    SHOW SCHEMAS [ LIKE regex_pattern ]

`Personal Access Token` `required`

The personal access token used to access Databricks.
To generate one, see Get the Access Token.
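The required and optional fields above can be checked in a script before you fill in the form. This is a rough sketch under stated assumptions: the `DatabricksDatastoreFields` class and `check_connection` helper are hypothetical names, and the connection check assumes the open-source `databricks-sql-connector` package is installed. It does not represent how Qualytics connects internally.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatabricksDatastoreFields:
    """Mirrors the form fields above; the class itself is hypothetical."""
    name: str
    server_hostname: str
    http_path: str
    personal_access_token: str
    catalog: Optional[str] = None   # optional
    database: Optional[str] = None  # optional (a schema in Databricks)

    def missing_required(self) -> List[str]:
        """Return the labels of required fields that are empty or blank."""
        required = {
            "Name": self.name,
            "Server Hostname": self.server_hostname,
            "Http Path": self.http_path,
            "Personal Access Token": self.personal_access_token,
        }
        return [label for label, value in required.items()
                if not value or not value.strip()]


def check_connection(fields: DatabricksDatastoreFields) -> List[str]:
    """Connect and list catalogs (needs `pip install databricks-sql-connector`)."""
    # Imported lazily so the validation above works without the package.
    from databricks import sql

    with sql.connect(
        server_hostname=fields.server_hostname,
        http_path=fields.http_path,
        access_token=fields.personal_access_token,
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SHOW CATALOGS")
            return [row[0] for row in cursor.fetchall()]
```

Calling `missing_required()` first gives a readable list of what is still blank. `check_connection` then runs the same `SHOW CATALOGS` statement shown above, so a non-empty result suggests the host, path, and token work together.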

Last update: April 7, 2024
