Working with Databricks Connection
Applies to: Kyvos Enterprise, Kyvos Cloud (SaaS on AWS), Kyvos AWS Marketplace, Kyvos Azure Marketplace, Kyvos GCP Marketplace, and Kyvos Single Node Installation (Kyvos SNI)
Important
Kyvos supports Databricks process and read connections for the AWS and Azure platforms only.
Kyvos supports the Databricks platform for reading data, processing semantic models, and running data profiling jobs. Databricks provides the following cluster types to run Spark jobs:
Interactive cluster (also known as an all-purpose or Data Analytics cluster)
You use all-purpose clusters to analyze data collaboratively using interactive notebooks.
You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative, interactive analysis.
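For illustration, here is a minimal sketch of creating an all-purpose cluster through the Databricks Clusters REST API (the endpoint and fields follow Databricks' public API documentation; the workspace URL, token, runtime version, and node type are placeholder values you would substitute for your environment):

```python
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder

# Create an all-purpose (interactive) cluster via the Clusters API 2.0.
response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "shared-analysis-cluster",
        "spark_version": "13.3.x-scala2.12",  # example runtime version
        "node_type_id": "i3.xlarge",          # example AWS node type
        "num_workers": 2,
    },
    timeout=60,
)
response.raise_for_status()
print("Created cluster:", response.json()["cluster_id"])
```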
Automated cluster (also known as a job or Data Engineering cluster)
You use job clusters to run fast and robust automated jobs.
The Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
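To make this lifecycle concrete, here is a hedged sketch of submitting a one-time run on a new job cluster through the Databricks Jobs API 2.1 (runs/submit): Databricks provisions the cluster for the run and terminates it when the run finishes. Kyvos manages job submission internally; this example only illustrates the Databricks-side behavior, and the workspace URL, token, notebook path, and cluster settings are placeholders:

```python
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder

# Submit a one-time run on a new job cluster (Jobs API 2.1).
# Databricks creates the cluster for this run and terminates it on completion.
response = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "run_name": "example-automated-run",
        "tasks": [
            {
                "task_key": "process_step",
                "notebook_task": {"notebook_path": "/Shared/example-notebook"},  # placeholder
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",  # example runtime version
                    "node_type_id": "i3.xlarge",          # example AWS node type
                    "num_workers": 4,
                },
            }
        ],
    },
    timeout=60,
)
response.raise_for_status()
print("Run ID:", response.json()["run_id"])
```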
To reduce the cost of process jobs by lowering Databricks service costs, Kyvos provides the capability to execute Spark jobs on the job cluster.
You can also use Databricks as a read connection to read data and process semantic models from storage accounts across multiple Azure subscriptions, where Databricks and Kyvos are configured to use a custom metastore.
Recommendations and best practices for using Databricks process and read connections
Cost reduction from using the job cluster is proportional to your Azure Databricks usage: the higher the usage, the greater the benefit in overall costs. Additionally, a system design should extract the maximum throughput from the virtual machines used by the cluster. Kyvos process jobs do not yet achieve this for various technical reasons, which leads to the problem described below.
Underutilization of Virtual Machines – Cost, Unpredictability, and Failures
| Interactive Cluster | Job Cluster |
|---|---|
| All Spark jobs are submitted to a common cluster. The virtual machines are shared across jobs. | Each Spark job is submitted to a new, dedicated cluster. The virtual machines are not shared across jobs. |
In the absence of such a product architecture, and to reduce costs, Kyvos recommends using the job cluster only in the specific scenarios detailed below.
| Semantic Model Scenario | Details | Recommended Cluster |
|---|---|---|
| Huge fact data processing (more than 300M rows) is required during full or incremental semantic model processing, and the semantic model has no other fact datasets with low data volume. | Level job processing time is three hours or more. | Job |
| The indexing time of the semantic model process job is low. | Indexing job time is 20% or less of the total semantic model process time. | Job |
| The semantic model has a large number of fact transformations. | The utilization of virtual machines/cores could exceed the instance pool's maximum nodes or the subscription quota. For example, with ten fact transformations and a cluster configured for a maximum of 20 worker nodes, the system might use up to 200 virtual machines simultaneously, resulting in usage and quota limit errors. | Interactive |
| Process jobs on low data volume. | Low usage of Azure Databricks. | Interactive |
| Wide semantic model with a huge number of dimensions or attributes. | Too many process jobs underutilize the virtual machines. | Interactive |