
Creating Databricks Connection on AWS

Applies to: Kyvos Enterprise, Kyvos Cloud (SaaS on AWS), Kyvos AWS Marketplace, Kyvos Azure Marketplace, Kyvos GCP Marketplace, Kyvos Single Node Installation (Kyvos SNI)


Prerequisites

For using Databricks as a job/build connection:

  1. Since the Kyvos BI Server creates the cluster through the REST API and attaches it to the instance pool, the access token used must have the appropriate permissions to do so.

    1. Enable Access Control: Access Control functionality is available in the Databricks Premium tier. To verify it is enabled, click Settings > Admin Console > Workspace settings.

    2. Cluster Access Control: If Cluster Access Control is enabled, the Kyvos user must have the unrestricted cluster creation entitlement. To grant it, go to Settings > Admin Console > Users and enable Allow unrestricted cluster creation for the required user.

    3. Pool Access Control: If Pool Access Control is enabled, the Kyvos user must have the Can Attach To permission on the instance pool provided in the Databricks build connection. To grant it, go to Instance Pool > Edit > Permissions and select Can Attach To for the required user.
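The token checks above can be scripted against the Databricks REST API. The sketch below is a minimal illustration, assuming the standard `/api/2.0/clusters/list` and `/api/2.0/instance-pools/list` endpoints; the workspace URL and token shown are fabricated placeholders.

```python
import urllib.request

# Fabricated workspace URL and personal access token, for illustration only.
WORKSPACE_URL = "https://dbc-example.cloud.databricks.com"
TOKEN = "dapiXXXXXXXXXXXXXXXX"

def build_databricks_request(workspace_url: str, token: str, endpoint: str) -> urllib.request.Request:
    """Build an authenticated GET request for a Databricks REST API 2.0 endpoint."""
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.0/{endpoint}",
        headers={"Authorization": f"Bearer {token}"},
    )

# Endpoints the token must be able to call for Kyvos to create clusters
# and attach them to an instance pool:
cluster_req = build_databricks_request(WORKSPACE_URL, TOKEN, "clusters/list")
pool_req = build_databricks_request(WORKSPACE_URL, TOKEN, "instance-pools/list")

# Sending these requests (urllib.request.urlopen(cluster_req)) should return
# HTTP 200 if the token has the required permissions.
```

If either call returns 403, the token lacks the entitlement or permission described in the corresponding prerequisite above.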

Creating Databricks connection

To create a Databricks connection for AWS, perform the following steps.

  1. From the Toolbox, click Setup, then choose Connections.

  2. From the Actions menu ( ⋮ ), click Add Connection.

  3. Enter a Name for the connection and provide the following information:

Parameter

Description/Remarks


Category 

Select the Build option.

Providers

Select the Databricks option.

Databricks Service Address

Enter the URL of the Databricks workspace.

Databricks Personal Access Token

Provide the personal access token used to access and connect to your Databricks workspace. Refer to the Databricks documentation to learn how to generate a token.

Databricks Cluster Id

Enter the ID of your Databricks cluster.

 

To obtain this ID, click the cluster name on the Clusters page in Databricks. The page URL shows https://<databricks-instance>/#/settings/clusters/<cluster-id>
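If you need to look the ID up programmatically, it can be pulled from the page URL with a small helper; a sketch (the workspace instance and cluster ID below are fabricated examples):

```python
import re

def extract_cluster_id(page_url: str) -> str:
    """Extract <cluster-id> from a URL of the form
    https://<databricks-instance>/#/settings/clusters/<cluster-id>."""
    match = re.search(r"/clusters/([^/?#]+)", page_url)
    if match is None:
        raise ValueError(f"No cluster ID found in URL: {page_url}")
    return match.group(1)

url = "https://dbc-example.cloud.databricks.com/#/settings/clusters/0123-456789-abcde123"
print(extract_cluster_id(url))  # 0123-456789-abcde123
```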

Is Data Process

By default, the checkbox is selected.

Use as source 

Select the checkbox to use this as a read connection. In this case, the connection will be used to read data (creating registered datasets) on which the semantic model will be created.

Metastore Type

The metastore type used for fetching database and table listings, or for writing SQL queries to design registered datasets.

NOTE: This option is displayed only if you select the Use as source checkbox.

You can select from the DEFAULT or GLUE METASTORE options. The GLUE option is displayed and set by default only if your AWS cluster was deployed with Glue enabled.

Hive Server JDBC URL

The Databricks cluster JDBC URL, used to connect to the Databricks internal metastore through a JDBC connection.
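As a rough guide to what this URL looks like, the sketch below assembles one in the legacy Spark JDBC driver format. The exact prefix and parameters depend on the JDBC driver version in use, so verify against the URL shown on your cluster's JDBC/ODBC tab in Databricks; all identifiers below are fabricated.

```python
def build_cluster_jdbc_url(host: str, workspace_id: str, cluster_id: str) -> str:
    """Assemble a Databricks cluster JDBC URL (legacy Spark driver format).
    Newer Databricks JDBC drivers use a different prefix; check your
    cluster's JDBC/ODBC tab for the authoritative URL."""
    http_path = f"sql/protocolv1/o/{workspace_id}/{cluster_id}"
    return (
        f"jdbc:spark://{host}:443/default;"
        f"transportMode=http;ssl=1;httpPath={http_path}"
    )

# Fabricated example values:
print(build_cluster_jdbc_url(
    "dbc-example.cloud.databricks.com", "1234567890123456", "0123-456789-abcde123"
))
```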

SQL Engine

Select the HIVE or SPARK option from the list.

  • Is default SQL engine: Select the checkbox to set this connection as the default SQL engine for raw data querying.


Configure Job Cluster

Use this option to allow Kyvos to execute Spark jobs on a Job Cluster to reduce the cost of process jobs. This feature is recommended only in limited scenarios; see the recommendations and best practices sections below for details.

Autoscaling

If needed, enable autoscaling and specify the minimum and maximum number of worker nodes for the cluster. 

NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

Use same instance pool for worker and driver

Select the checkbox to use the same instance pool for worker and driver nodes.

NOTE: Kyvos does not perform any heavy operations on the driver node; it is recommended to use a pool of cheaper instance types for the Spark driver.
NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

Instance Pool Id

Instance Pool Id to be used for worker nodes. If the Use same instance pool for worker and driver option is selected, this pool will also be used for driver nodes.

NOTE: The runtime version of the instance pool must be the same as that of the cluster specified in Databricks Cluster Id. You can provide the ID of an existing instance pool. See the Databricks documentation for details.
NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

Driver Instance Pool Id

Instance Pool Id to be used for driver nodes.
NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

Spark config

Enter your Spark configuration to fine-tune Spark job performance. Provide each property as a space-separated key-value pair; separate multiple properties with spaces as well. Learn more.
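To make the expected format concrete, the sketch below parses such a space-separated string into properties; the property names used are ordinary Spark settings chosen for illustration.

```python
def parse_spark_config(config: str) -> dict:
    """Parse a space-separated Spark config string of the form
    'key1 value1 key2 value2 ...' into a dict of properties."""
    tokens = config.split()
    if len(tokens) % 2 != 0:
        raise ValueError("Every Spark property key needs a value")
    # Pair even-indexed tokens (keys) with odd-indexed tokens (values).
    return dict(zip(tokens[0::2], tokens[1::2]))

conf = parse_spark_config("spark.executor.memory 4g spark.sql.shuffle.partitions 200")
print(conf)  # {'spark.executor.memory': '4g', 'spark.sql.shuffle.partitions': '200'}
```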

NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

Tags (JSON string)

You can add additional tags for the cluster by providing the tags in JSON format. Both the cluster-level tags and those inherited from pools are applied. You cannot add a cluster-specific tag with the same key name as a custom tag inherited from a pool (that is, you cannot override a custom tag that is inherited from the pool). Example: {"key1": "val1","key2": "val2"} Learn more.

NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

Cluster log path (DBFS)

You can configure the DBFS location where the system should persist the Spark job logs. If you leave it blank, the system will persist the logs at the dbfs:/cluster-logs location.
NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.
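Taken together, the job-cluster fields above map roughly onto the cluster specification Kyvos would submit through the Databricks REST API. The sketch below is an illustration only, with fabricated pool IDs; field names follow the Databricks Clusters/Jobs REST API, so verify them against your API version before relying on them.

```python
from typing import Optional

def build_job_cluster_spec(
    instance_pool_id: str,
    driver_instance_pool_id: Optional[str] = None,
    min_workers: int = 2,
    max_workers: int = 8,
    spark_conf: Optional[dict] = None,
    custom_tags: Optional[dict] = None,
    log_path: str = "dbfs:/cluster-logs",
) -> dict:
    """Assemble a cluster spec from the connection fields above."""
    return {
        "instance_pool_id": instance_pool_id,
        # Same pool for driver and workers unless a driver pool is given,
        # mirroring the "Use same instance pool" checkbox.
        "driver_instance_pool_id": driver_instance_pool_id or instance_pool_id,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "spark_conf": spark_conf or {},
        "custom_tags": custom_tags or {},
        # Defaults to dbfs:/cluster-logs when the field is left blank.
        "cluster_log_conf": {"dbfs": {"destination": log_path}},
    }

spec = build_job_cluster_spec("pool-1234", custom_tags={"key1": "val1"})
```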

  4. Click the Save button to validate the connection settings and save the information.

 

Copyright Kyvos, Inc. All rights reserved.