Working with Databricks Connection

Applies to: Kyvos Enterprise, Kyvos Cloud (SaaS on AWS), Kyvos AWS Marketplace, Kyvos Azure Marketplace, Kyvos GCP Marketplace, Kyvos Single Node Installation (Kyvos SNI)

Important

Kyvos supports Databricks process and read connections on the AWS and Azure platforms only.

Kyvos supports the Databricks platform for reading and processing semantic models and data profiling jobs. Databricks provides the following cluster types to run Spark jobs:

Interactive Cluster | All-purpose cluster | Data Analytics
1. You use all-purpose clusters to analyze data collaboratively using interactive notebooks.
2. You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative, interactive analysis.
Automated Cluster | Job Cluster | Data Engineering
1. You use job clusters to run fast and robust automated jobs.
2. The Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
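
The difference between the two cluster types shows up in how a job is submitted. The following is a minimal sketch, assuming the Databricks Jobs API 2.1 request format, in which a task either points at a running all-purpose cluster through `existing_cluster_id` or asks the scheduler for a fresh job cluster through `new_cluster`. The helper names, task key, main class name, Spark version, and node type are hypothetical placeholders for illustration, not values that Kyvos uses.

```python
import json


def interactive_cluster_task(cluster_id: str) -> dict:
    """Task that runs on an already running all-purpose (interactive) cluster."""
    return {
        "task_key": "process_semantic_model",  # hypothetical task name
        "existing_cluster_id": cluster_id,  # shared cluster, reused across jobs
        "spark_jar_task": {"main_class_name": "com.example.ProcessJob"},  # hypothetical class
    }


def job_cluster_task(max_workers: int) -> dict:
    """Task that runs on a new job cluster, terminated when the job completes."""
    return {
        "task_key": "process_semantic_model",  # hypothetical task name
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
            "node_type_id": "Standard_DS3_v2",  # placeholder Azure VM type
            "autoscale": {"min_workers": 1, "max_workers": max_workers},
        },
        "spark_jar_task": {"main_class_name": "com.example.ProcessJob"},  # hypothetical class
    }


def job_payload(name: str, task: dict) -> dict:
    """Wrap a single task in a job definition."""
    return {"name": name, "tasks": [task]}


if __name__ == "__main__":
    # Print the job-cluster variant; no request is sent anywhere.
    print(json.dumps(job_payload("example-process-job", job_cluster_task(20)), indent=2))
```

The sketch only builds the request bodies. Sending them to a workspace, and the authentication that requires, is left out on purpose.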

To reduce the cost of process jobs by lowering the cost of Databricks services, Kyvos can execute Spark jobs on a Job cluster.

You can also use Databricks as a Read connection for reading data and processing semantic models from storage accounts in multiple Azure subscriptions, where Databricks and Kyvos are configured to use a Custom Metastore.

Recommendations and best practices for using Databricks process and read connections

Cost reduction using a Job Cluster is proportional to the usage of Azure Databricks. The higher the usage of Azure Databricks, the greater the overall cost benefit. Additionally, a system design should extract the maximum throughput from the virtual machines used by the cluster. Kyvos process jobs currently fall short of this for various technical reasons, which leads to the problem described below.

Underutilization of Virtual Machines – Cost, Unpredictability, and Failures

Interactive Cluster

All the Spark jobs are submitted to a common cluster. The virtual machines are shared across the jobs.
- Better control over the total number of virtual machines, because it is independent of the build jobs, the semantic model design, and so on.
- Lower hardware costs due to optimal usage of virtual machines.

Job Cluster

Each Spark job is submitted to a new, dedicated cluster. The virtual machines are not shared across the jobs.
- Many Kyvos jobs do not operate in a way that uses the full capacity of the virtual machines, so the system might underutilize the hardware and incur higher costs.
- When multiple jobs run in parallel, more virtual machines might be in use simultaneously, which makes the system more vulnerable to the following errors:
  - Maximum capacity errors on the instance pool
  - Usage + quota limit errors at the subscription level
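
The effect of parallel jobs on virtual machine usage can be estimated with simple arithmetic. The following is a rough sketch, assuming that every job cluster may scale to its configured maximum number of worker nodes and that driver nodes are ignored. The function names and the limit of 100 virtual machines are hypothetical.

```python
def peak_worker_vms(parallel_jobs: int, max_workers_per_cluster: int, use_job_cluster: bool) -> int:
    """Upper bound on worker VMs in use at the same time (driver nodes ignored)."""
    if use_job_cluster:
        # Every parallel job gets its own cluster, each of which can scale to the maximum.
        return parallel_jobs * max_workers_per_cluster
    # All jobs share one interactive cluster, so the bound is that cluster's maximum.
    return max_workers_per_cluster


def exceeds_limit(peak_vms: int, vm_limit: int) -> bool:
    """True when the peak could hit the pool capacity or the subscription quota."""
    return peak_vms > vm_limit


if __name__ == "__main__":
    vm_limit = 100  # hypothetical instance pool capacity or subscription quota
    for use_job_cluster in (False, True):
        peak = peak_worker_vms(10, 20, use_job_cluster)
        label = "job clusters" if use_job_cluster else "interactive cluster"
        print(f"{label}: up to {peak} worker VMs, exceeds limit: {exceeds_limit(peak, vm_limit)}")
```

With ten parallel jobs and a maximum of 20 worker nodes per cluster, the bound is 20 virtual machines on a shared interactive cluster and 200 on job clusters.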

Because Kyvos process jobs do not yet utilize virtual machines to their full capacity, and to keep costs down, Kyvos recommends using the Job Cluster only in the specific scenarios detailed below.

Semantic Model Scenario	Details	Recommended Cluster
Large fact data volumes (more than 300 million rows) must be processed during Full or Incremental semantic model processing, and the semantic model has no other fact datasets with low data volume.	Level jobs take three hours or more to process.	Job
Indexing takes up a small share of the semantic model process job.	Indexing jobs account for 20% or less of the semantic model process time.	Job
The semantic model has a large number of fact transformations.	The utilization of virtual machines or cores could exceed the instance pool maximum nodes or the subscription quota. For example, if there are ten fact transformations and the cluster is configured to use at most 20 worker nodes, the system might use up to 200 virtual machines simultaneously, resulting in Usage + quota limit errors.	Interactive
Process jobs run on low data volume.	Azure Databricks usage is low.	Interactive
The semantic model is wide, with a very large number of dimensions or attributes.	Too many process jobs underutilize the virtual machines.	Interactive
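
The scenarios in the table can be read as a set of rules. The following is an illustrative sketch of one possible reading, not a Kyvos feature or API. It assumes that the Interactive scenarios take precedence, that either Job scenario is sufficient on its own, and that Interactive is the default when no scenario matches. The class, field, and function names are hypothetical, and the low-volume and wide-model conditions are plain flags because the table gives no thresholds for them.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical summary of a semantic model's processing characteristics."""
    fact_rows: int
    has_low_volume_fact_datasets: bool
    level_job_hours: float
    indexing_share: float  # fraction of process time spent on indexing jobs
    fact_transformations: int
    max_workers_per_cluster: int
    vm_limit: int  # instance pool maximum or subscription quota
    low_data_volume: bool = False
    wide_model: bool = False


def recommend_cluster(p: ModelProfile) -> str:
    """Return "Job" or "Interactive" following the table's scenarios."""
    # Many fact transformations: parallel job clusters could exceed the VM limit.
    if p.fact_transformations * p.max_workers_per_cluster > p.vm_limit:
        return "Interactive"
    # Low data volume or a wide model: job clusters would be underutilized.
    if p.low_data_volume or p.wide_model:
        return "Interactive"
    # Large fact data with long-running level jobs.
    large_facts = p.fact_rows > 300_000_000 and not p.has_low_volume_fact_datasets
    if large_facts and p.level_job_hours >= 3:
        return "Job"
    # Indexing is 20% or less of the process time.
    if p.indexing_share <= 0.20:
        return "Job"
    # Assumed default: the Job cluster is recommended only in specific scenarios.
    return "Interactive"


if __name__ == "__main__":
    profile = ModelProfile(
        fact_rows=500_000_000,
        has_low_volume_fact_datasets=False,
        level_job_hours=4.0,
        indexing_share=0.5,
        fact_transformations=2,
        max_workers_per_cluster=20,
        vm_limit=100,
    )
    print(recommend_cluster(profile))
```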

Related topics