
Best practices for working in an Azure environment

Applies to: Kyvos Enterprise, Kyvos Cloud (SaaS on AWS), Kyvos AWS Marketplace, Kyvos Azure Marketplace, Kyvos GCP Marketplace, Kyvos Single Node Installation (Kyvos SNI)


  1. Configure Cluster and Query Engine Scheduling to save cost and use cloud resources only when needed. 

    You can create a schedule to:

    • Shut down the cluster for any time interval.

    • Start the cluster for any time interval.

    • Schedule Query Engines for any time interval.

  2. Configure auto scaling and an auto termination policy on the Databricks cluster to save cost and achieve high cluster utilization. 

    1. Auto Termination: You can also set auto termination for a cluster. During cluster creation, you can specify an inactivity period in minutes after which you want the cluster to terminate. If the difference between the current time and the last command run on the cluster is more than the inactivity period specified, Databricks automatically terminates that cluster. 
      A cluster is considered inactive when all commands on the cluster, including Spark jobs, Structured Streaming, and JDBC calls, have finished executing. This does not include commands run by SSH-ing into the cluster and running bash commands. 
      Standard clusters are configured to terminate automatically after 120 minutes. You can modify the default value as needed. 

    2. Auto Scaling: When you create a Databricks cluster, you can either provide a fixed number of workers for the cluster or provide a minimum and maximum number of workers for the cluster.  
      When you provide a fixed size cluster, Databricks ensures that your cluster has the specified number of workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling. 
      With autoscaling, Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds additional workers during these phases of your job (and removes them when they’re no longer needed).  
      Autoscaling makes it easier to achieve high cluster utilization because you don’t need to provision the cluster to match a workload. This applies especially to workloads whose requirements change over time, but it can also apply to a one-time shorter workload whose provisioning requirements are unknown.
      Autoscaling thus offers two advantages: 

  • Workloads can run faster compared to a constant-sized under-provisioned cluster. 

  • Autoscaling clusters can reduce overall costs compared to a statically sized cluster. 

You can modify the Auto Scaling configuration later as well. 
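
As a reference, the snippet below sketches how these auto scaling and auto termination settings map to the Databricks Clusters REST API (POST /api/2.0/clusters/create). The workspace URL, token, cluster name, node type, and runtime version are placeholders for illustration; adjust them to your environment.

```python
# Sketch: create a Databricks cluster with autoscaling and auto termination
# through the Clusters REST API. All identifiers below are placeholders.
import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                               # placeholder

payload = {
    "cluster_name": "kyvos-build-cluster",     # example name
    "spark_version": "11.3.x-scala2.12",       # choose a supported runtime
    "node_type_id": "Standard_DS3_v2",         # Azure VM type; size to your workload
    "autoscale": {                             # range of workers instead of a fixed size
        "min_workers": 2,
        "max_workers": 8,
    },
    "autotermination_minutes": 120,            # terminate after 120 minutes of inactivity
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print("Created cluster:", response.json().get("cluster_id"))
```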

  3. Enable logging to DBFS and provide a location to persist event logs, driver logs, and executor logs for later analysis.
    To enable logging, navigate to Cluster -> Advanced Options -> Logging, and then specify the DBFS path. 

    Event logs, driver logs, and executor logs are persisted at this location. 
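
Continuing the illustrative payload from the sketch above, a cluster_log_conf block with a DBFS destination delivers these logs automatically; the path shown is only an example.

```python
# Sketch: add a DBFS log destination to the cluster payload from the sketch above.
# Event, driver, and executor logs are delivered to this path (example only).
payload["cluster_log_conf"] = {
    "dbfs": {"destination": "dbfs:/cluster-logs/kyvos-build-cluster"}
}
```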

  4. Ganglia Metrics
    Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters. 

    You can navigate to the Metrics tab in the Databricks UI and analyze the live metrics or periodic historical snapshots for CPU, disk, I/O, memory, and several other parameters if you observe node slowness, node loss, or stuck jobs. You can also download the metrics snapshots for future reference. 

  5. Resume semantic model process in the event of a failure.
    You can use the Resume semantic model process feature, when a semantic model is resumable, to save cloud cost.  

    If you resume a failed process, the steps that were successfully completed are skipped, reducing the process time. There are several ways to resume a process after it fails to complete.  

    • From View Job Histories, right-click a failed job and choose Resume job. This option is available when the process failed after some of the steps were successfully completed.  

    • When you add a job and have selected a semantic model that failed to process on the previous attempt, you may see an option to resume from last failure. You are prompted to confirm that you want to resume the process.  

  6. Use Delta tables instead of Hive tables directly on Parquet.

Query performance 

As data grows exponentially in size, being able to get meaningful information out of your data becomes crucial. Using several techniques, Delta delivers query performance 10 to 100 times faster than Apache Spark on Parquet. 

  • Data Indexing – Delta creates and maintains indexes on the tables. 

  • Data Skipping – Delta maintains file statistics on the data subset so that only relevant portions of the data are read in a query. 

  • Compaction – Delta manages file sizes of the underlying Parquet files for the most efficient use. 

  • Data Caching – Delta automatically caches highly accessed data to improve run times for commonly run queries. 
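
For example, compaction and data skipping can be triggered explicitly with Delta Lake's OPTIMIZE command on Databricks. The table name and Z-order column below are placeholders, and spark refers to the session a Databricks notebook provides.

```python
# Sketch: compact small files and co-locate data to improve data skipping.
# Table and column names are placeholders; `spark` is the notebook-provided session.
spark.sql("OPTIMIZE sales_delta ZORDER BY (customer_id)")
```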

Data reliability 

The end users of the data must be able to rely on the accuracy of the data. Delta uses various techniques to achieve data reliability. 

  • ACID transactions – Delta employs an all or nothing approach for data consistency. 

  • Snapshot isolation – Ensures that multiple writers can write to a table simultaneously without interfering with jobs that are reading the table. 

  • Schema enforcement – Improves data integrity by enforcing the table schema on writes. 

  • Checkpoints – Ensure data is delivered and read only once, even when there are multiple incoming and outgoing streams. 

  • Upserts and deletes support – Handles late-arriving and changing records, as well as cases where records should be deleted. 

  • Data versioning – Allows organizations to roll back and reprocess data as necessary. 
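
As a minimal sketch of moving from plain Parquet to Delta and using the versioning capability, the snippet below converts an existing Parquet directory in place and reads back an earlier version (time travel). Paths are illustrative, and a Databricks or Delta-enabled Spark session (spark) is assumed.

```python
# Sketch: convert a Parquet dataset to Delta and read an earlier version.
# Paths are placeholders; assumes a Delta-enabled Spark session (e.g., Databricks).
from delta.tables import DeltaTable

# Convert the existing Parquet directory to Delta in place.
DeltaTable.convertToDelta(spark, "parquet.`/mnt/data/sales`")

# The table now has ACID guarantees, and earlier snapshots remain queryable.
df_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)   # time travel to the first version
    .load("/mnt/data/sales")
)
df_v0.show(5)
```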

  7. Ensure that the cuboid replication type is set to None for all semantic models that are not eligible for querying.

  8. Query Engines, the BI Server, and ADLS storage must be in the same region.

  9. Ensure that enough local disk space is available on the Query Engines to replicate the built semantic models.

  10. For environments that do not have sufficient local disk available (local disk smaller than the semantic model size), create a segment with a dedicated metadata folder, allocate the production semantic models to this segment, and allocate the rest of the semantic models to the default segment.

Copyright Kyvos, Inc. All rights reserved.