Applies to: Kyvos Enterprise, Kyvos Cloud (Managed Services on AWS), Kyvos Azure Marketplace, Kyvos AWS Marketplace, Kyvos Single Node Installation (Kyvos SNI), Kyvos Free (Limited offering for AWS)

Prerequisites

To use Databricks as a job/process connection, the following details are needed (see the sketch after this list for where each value fits):

  • Table name: It can be identified using ident.name().
  • Spark Session: It can be fetched using SparkSession.active.
  • Paths: The table location is required.
  • User-specified schema: Optionally provide a schema. If not provided, the schema is identified using the inferSchema method.
  • Fallback file format: The file format class details. You can provide the ParquetFileFormat details here, as shown below:
    def fallbackFileFormat: Class[_ <: FileFormat] = classOf[ParquetFileFormat]
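
These five values map onto Spark's DataSourceV2 file-table API. The Scala sketch below shows one way a custom catalog plugin could supply them; it is illustrative only. The class name ExampleCustomCatalog, the lookupTableLocation helper, and the read-only stubs are assumptions rather than the plugin Kyvos ships, and the sketch assumes the Spark 3.x TableCatalog interface with ParquetTable as the file table implementation.

    import java.util

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.connector.catalog._
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.execution.datasources.FileFormat
    import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
    import org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Illustrative custom catalog plugin: resolves tables from a metastore and
    // hands Spark every value listed in the prerequisites above.
    class ExampleCustomCatalog extends TableCatalog {

      private var catalogName: String = _

      // Security parameters configured on the connection arrive via `options`.
      override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {
        catalogName = name
      }

      override def name(): String = catalogName

      // Hypothetical helper: a real plugin would look up the table location
      // (the "Paths" prerequisite) in its own metastore.
      private def lookupTableLocation(ident: Identifier): String =
        s"dbfs:/data/${ident.namespace().mkString("/")}/${ident.name()}"

      override def loadTable(ident: Identifier): Table = {
        val spark     = SparkSession.active              // Spark Session
        val tableName = ident.name()                     // Table name
        val paths     = Seq(lookupTableLocation(ident))  // Paths (table location)
        val schema: Option[StructType] = None            // user-specified schema; None => inferSchema
        // Fallback file format: ParquetFileFormat, matching the snippet above.
        ParquetTable(tableName, spark, CaseInsensitiveStringMap.empty(), paths, schema,
          classOf[ParquetFileFormat])
      }

      // Remaining TableCatalog methods stubbed for a read-only catalog.
      override def listTables(namespace: Array[String]): Array[Identifier] = Array.empty
      override def createTable(ident: Identifier, schema: StructType,
          partitions: Array[Transform], properties: util.Map[String, String]): Table =
        throw new UnsupportedOperationException("read-only catalog")
      override def alterTable(ident: Identifier, changes: TableChange*): Table =
        throw new UnsupportedOperationException("read-only catalog")
      override def dropTable(ident: Identifier): Boolean = false
      override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit =
        throw new UnsupportedOperationException("read-only catalog")
    }

Once compiled, the class's JAR is uploaded through Kyvos Manager and its fully qualified name is entered in the connection's custom metastore fields, as described in the steps below.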

Creating Databricks Connection for Azure

To create a Databricks connection, perform the following steps.

  1. From the Toolbox, click Setup, then choose Connections.
  2. From the Actions menu, click Add Connection.
  3. Enter a Name for the connection and provide the information described below:

    Category: Select the Build option.

    Providers: Select the Databricks option.

    Databricks Service Address: Enter the URL of the Databricks workspace.

    Authentication: Choose one of the following options.
      • Personal Access Token: Select this option to use a PAT for authentication. You will need to provide the personal access token to access and connect to your Databricks workspace. Refer to the Databricks documentation to get your token.
      • AAD Token Using Managed Identity: Select this option to authenticate to Databricks using an Azure Active Directory (AAD) token.

    Databricks Cluster Id: Enter the ID of your Databricks cluster. To obtain this ID, click the cluster name on the Clusters page in Databricks; the page URL shows https://<databricks-instance>/#/settings/clusters/<cluster-id>

    Use as source: Select the checkbox to use this as a read connection. In this case, the connection is used to read data (create datasets) on which the semantic model is built.

    Is Data Process: By default, the checkbox is selected.

    Metastore Type: The metastore type to be used for fetching database and table listings or for writing SQL queries to design register datasets. Select the CUSTOM METASTORE option.
    NOTE: This option is displayed only if you have selected the Use as source checkbox.
    NOTE: You must first upload the third-party JAR files for the Databricks custom metastore through Kyvos Manager.

    Catalog plugin class: Fully qualified name of the custom catalog plugin class.
    NOTE: This option is displayed only when you select the CUSTOM METASTORE option.

    Security parameters: Security parameters to be used in the custom catalog plugin class implementation for providing secured data access.
    NOTE: This option is displayed only when you select the CUSTOM METASTORE option.

    Configure Job Cluster: Use this option to allow Kyvos to execute Spark jobs on a job cluster to reduce the cost of process jobs. This feature is helpful and recommended only in limited scenarios; see the recommendations and best practices sections below for details.

    Autoscaling: If needed, enable autoscaling and specify the minimum and maximum number of worker nodes for the cluster.
    NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

    Use same instance pool for worker and driver: Select the checkbox to use the same instance pool for worker and driver nodes.
    NOTE: Kyvos does not perform any heavy operation on the driver node, so it is recommended to use a pool of cheaper nodes for the Spark driver, preferably Standard_DS3_v2.
    NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

    Instance Pool Id: The ID of the instance pool to be used for worker nodes. If the Use same instance pool for worker and driver option is selected, this pool is also used for driver nodes. You can provide the ID of an existing instance pool; see the Databricks documentation for details.
    NOTE: The runtime version of the instance pool must be the same as that of the cluster specified in Databricks Cluster Id.
    NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

    Driver Instance Pool Id: The ID of the instance pool to be used for driver nodes.
    NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

    Spark Config: Enter your Spark configuration to fine-tune Spark job performance. Provide each property as a space-separated key-value pair, and separate multiple properties by spaces as well, for example: spark.executor.memory 4g spark.sql.shuffle.partitions 200. Learn more.
    NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

    Custom Tags: You can add additional tags for the cluster by providing them in JSON format. Both the cluster-level tags and those inherited from pools are applied. You cannot add a cluster-specific tag with the same key name as a custom tag inherited from a pool (that is, you cannot override a pool-inherited custom tag). Example: {"key1": "val1","key2": "val2"} Learn more.
    NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

    Cluster Logs Path: The DBFS location where the system should persist the Spark job logs. If left blank, the system persists the logs at the dbfs:/cluster-logs location.
    NOTE: This option is displayed only when you select the Configure Job Cluster checkbox.

  4. Click the Save button to validate the connection settings and save the information. An optional pre-save check of the connection values is sketched below.
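
Before clicking Save, you can optionally verify the three values the connection depends on (the Databricks Service Address, the access token, and the Databricks Cluster Id) against the public Databricks REST API. The Scala sketch below is a minimal check assuming PAT authentication; the workspace URL and cluster ID shown are placeholders.

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    // Minimal pre-save check using GET /api/2.0/clusters/get. A 200 response
    // means the workspace URL, token, and cluster ID are all valid; any error
    // response indicates which value to re-check before saving in Kyvos.
    object CheckDatabricksConnection {
      def main(args: Array[String]): Unit = {
        val workspaceUrl = "https://adb-1234567890123456.7.azuredatabricks.net" // placeholder
        val clusterId    = "0123-456789-abcde123"                               // placeholder
        val token        = sys.env("DATABRICKS_TOKEN")                          // your PAT

        val request = HttpRequest.newBuilder()
          .uri(URI.create(s"$workspaceUrl/api/2.0/clusters/get?cluster_id=$clusterId"))
          .header("Authorization", s"Bearer $token")
          .GET()
          .build()

        val response = HttpClient.newHttpClient()
          .send(request, HttpResponse.BodyHandlers.ofString())

        println(s"HTTP ${response.statusCode()}: ${response.body().take(200)}")
      }
    }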
