Creating Databricks Connection on Azure

Applies to: Kyvos Enterprise Kyvos Cloud (SaaS on AWS) Kyvos AWS Marketplace

Kyvos Azure Marketplace Kyvos GCP Marketplace Kyvos Single Node Installation (Kyvos SNI)

Prerequisites

For using Databricks as a job/process connection:

Since Kyvos’ BI Server creates the cluster through REST API and will have to attach it with the instance pool as well; the access token to be used must have appropriate permissions to be able to do so .
1. Enable Access Control: Access Control functionality is available in Databricks premium tier and can be checked by clicking on Settings > Admin Console > Workspace settings.
2. Cluster Access Control: If the Cluster Access Control is enabled, the Kyvos user must have an unrestricted cluster creation entitlement. This can be done on the Databricks cluster by going to Settings > Admin Console > Users > Enable Allow unrestricted cluster creation for the required user.
3. Pool Access Control: If the Pool Access Control is enabled, the Kyvos user must have Can Attach To permission to the Instance Pool, provided on the Databricks process connection. This can be done by going to Instance Pool > Edit > Permissions > Select Can Attach To for the required user.

For using Databricks as a read connection with custom metastore:

The following permissions are required:
1. If you are using a storage account from a different subscription, then the Managed Identity attached to the BI Server should also be attached to the storage account of the different subscriptions (Role: Storage Blob Data Reader).
2. If you are using a different subscription’s storage account, then the Service principle attached to the deployed cluster should also be attached to the custom metastore container (Role: Storage Blob Data Reader).
Custom metastore API Implementation, as explained in the section below.

Custom Metastore API Implementation

To support custom metastore in Kyvos, you need to configure a custom metastore class containing the required implementation to fetch databases/tables listing and table metadata details. Also, it will need to have an implementation to load the table metadata from the catalog.

API Implementation

Database and Table Listing

The custom metastore class will need to implement an interface (mentioned below) and provide an implementation of methods to list schemas and tables.

Interface- com.kyvos.commons.callback.jdbc.IJDBCExecutor

API	Method	Input parameters	Output object	Method signature

API	Method	Input parameters	Output object	Method signature
Database listing	listSchemas(com.kyvos.commons.callback.jdbc.IQueryOperation)	IQueryOperation will have method ‘getSecurityCredentials()’ to fetch security credentials.	List of databases	List<String> listSchemas(IQueryOperation iQueryOperation) { return new ArrayList<>(); }
Tables listing	listTables(com.kyvos.commons.callback.jdbc.IQueryOperation)	IQueryOperation will have method ‘getSchemaName()’ to get database name, and a method ‘getSecurityCredentials()’ to fetch security credentials.	List of tables in database.	List<String> listTables(IQueryOperation iQueryOperation) { return new ArrayList<>(); }

As mentioned in the Input parameters column above, the user can fetch the required security parameters using the IQueryOperation.getSecurityCredentials() method and use them to provide secured data access.

Column metadata/Table data

To fetch the column metadata for a table or to preview the table data, you need to implement a custom catalog plugin that will override the spark’s default catalog behavior ( spark_catalog ) by providing a custom implementation for a few methods.

The user will need to extend the below-mentioned abstract class and provide an implementation of the method to load table metadata from the catalog.
Abstract Class - org.apache.spark.sql.connector.catalog.DelegatingCatalogExtension

Method - loadTable

public Table loadTable( Identifier ident) throws org.apache.spark.sql.catalyst.analysis.NoSuchTableException

The Identifier passed as an input will have the database and table name.

The return type of this method is a Table object, which can be an instance of ParquetTable or ORCTable based on the datasource type.

For example, to instantiate a ParquetTable object, the following details will be required:

Table name: It can be identified using ‘ ident.name() ’
Spark Session: It can be fetched using ‘SparkSession . active ’
Paths: It will require table location.
User-specified schema: Provide user-specified schema. If not provided, then schema will be identified using inferSchema method.
Fallback file format: The file format class details. We can provide ParquetFileFormat details here, as mentioned below:

^{def fallbackFileFormat : Class [_ <: FileFormat ] = classOf [ ParquetFileFormat ]}

You can also create a Databricks SQL Warehouse connection for raw data querying.

Creating Databricks Connection for Azure

To create a Databricks connection, perform the following steps.

From the Toolbox, click Setup, then choose Connections.
From the Actions menu ( ⋮ ) click Add Connection.
Enter a Name for the connection and provide information as:

Parameter	Description

Parameter	Description
Category	Select the process option.
Providers	Select the Databricks option.
Databricks Service Address	Enter the URL of the Databricks workspace.
Databricks Authentication Type	Personal Access Token: Select this option to use PAT token for authentication. You will need to provide the personal access token to access and connect to your Databricks workspace. Refer to the Databricks documentation to get your token. AAD Token Using Managed Identity: Select this option to authenticate Databricks using Azure Active Directory (AAD) token.
Databricks Cluster Id	Enter the ID of your Databricks cluster. To obtain this ID, click the Cluster Name on the Clusters page in Databricks. The page URL shows <https://<databricks-instance>/#/settings/clusters/<cluster-id>
Use as source	Select the checkbox to use this as a read connection. In this case, the connection will be used to read data (creating datasets) on which the semantic model will be created.
Is Data Process	By default, the checkbox is selected.
Metastore Type	Metastore type to be used for fetching databases and table listing or writing SQL queries to design register datasets. NOTE: This option is displayed only if you have selected the Use as a source checkbox. Select the CUSTOM METASTORE option. NOTE: You must first upload the third-party JAR files for Databricks custom metastore through the Kyvos Manager.
Custom Metastore API Class	Fully-qualified name of the custom catalog plugin class. NOTE: This option is displayed only when you select the CUSTOM METASTORE option.
SQL Engine	Select the SQL engine as Database SQL Warehouse . See the Provider parameters table for details. Enter Databricks SQL Warehouse JDBC URL. For more information, see Microsoft documentation. Enter Databricks SQL Personal Access Token for the Databricks SQL workspace. NOTE: It is optional in the absence of this PAT. The system will attempt to connect using the PAT provided for the Databricks connection.
Security Parameters (JSON string)	Security parameters to be used in custom catalog plugin class implementation for providing secured data access. NOTE: This option is displayed only when you select the CUSTOM METASTORE option.
Configure Job Cluster	Use this option to allow Kyvos to execute Spark jobs on Job Cluster to reduce the costs of the process jobs. This feature is helpful and recommended in limited scenarios. Please see the recommendations and best practices sections below for details.
Autoscaling	If needed, enable autoscaling and specify the minimum and maximum number of worker nodes for the cluster. NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.
Use same instance pool for worker and driver	Select the checkbox to use the same instance pool for worker and driver nodes. NOTE: Kyvos does not perform any heavy operation on the driver node; it is recommended to use a pool of cheaper nodes for the Spark driver, preferably Standard_DS3_v2. NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.
Instance Pool Id	Instance Pool Id to be used for worker nodes. If the Use same instance pool for worker and driver option is selected, this pool will also be used for driver nodes. NOTE: Runtime version of the instance pool must be the same as that of the Databricks Cluster ID. You can provide the ID of an existing Instance Pool. See Databricks documentation for details. NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.
Driver Instance Pool Id	Instance Pool Id to be used for driver nodes. NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.
Spark config	Enter your Spark configuration to fine-tune the Spark job performance. Provide space-separated key-value pair for a property. Multiple properties must also be separated by space. Learn more. NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.
Tags (JSON string)	You can add additional tags for the cluster by providing the tags in JSON format, both the cluster-level tags and those inherited from pools are applied. You cannot add a cluster-specific tag with the same key name as a custom tag inherited from a pool (that is, you cannot override a custom tag that is inherited from the pool). Example: {"key1": "val1","key2": "val2"} Learn more. NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.
Cluster log path (DBFS)	You can configure the DBFS location where the system should persist the Spark job logs. If you leave it blank, the system will persist the logs at the dbfs:/cluster-logs location. NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.
Catalog Enabled	Select the Catalog Enabled checkbox. It is mandatory to enabled it for Databricks SQL.

Click the Save button to validate the connection settings and save information.