Creating Dataproc (compute cluster)
Applies to: Kyvos Enterprise, Kyvos Cloud (SaaS on AWS), Kyvos AWS Marketplace, Kyvos Azure Marketplace, Kyvos GCP Marketplace, and Kyvos Single Node Installation (Kyvos SNI)
To create a Dataproc cluster, perform the following steps.
Note
Download the GCP Installation Files folder and keep the files handy before proceeding.
On your Google Cloud console, click Dataproc > Create Cluster.
Provide the cluster Name.
Select the same Region and Zone as selected for the Kyvos instances.
Select Cluster type as Standard (1 master, N workers).
Optionally, select an Autoscaling policy. Selecting an autoscaling policy is not mandatory while creating the Dataproc cluster; you can also add it after the cluster is created.
To attach an autoscaling policy to your cluster after creation, follow the steps given in the Enabling Autoscaling on cluster section.
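If you prefer the command line, a policy can also be attached to an existing cluster with the gcloud CLI. A minimal sketch, assuming hypothetical cluster, policy, and region names (kyvos-dataproc, kyvos-autoscaling-policy, us-central1); substitute your own values:

# Attach an existing autoscaling policy to a running Dataproc cluster (all names are placeholders)
gcloud dataproc clusters update kyvos-dataproc \
    --region=us-central1 \
    --autoscaling-policy=kyvos-autoscaling-policy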
Kyvos recommends attaching the autoscaling policy to the Dataproc cluster.
Under Versioning, use the Change button to select the Image Type and Version for the nodes.
Optionally, select any of the Kyvos-supported versions.
Click Configure nodes on the left, and in the Master Node area, define the Machine configuration as:
Series: N2D
Machine Type: n2-highmem-4 (4 vCPU and 32 GB)
Provide the Primary Disk type and Disk size.
Configure the Worker Node with the following minimum recommended Machine configuration (a gcloud sketch of these settings appears after this list):
Series: N2D
Machine Type: n2-highmem-8 (8 vCPU and 64 GB)
Primary Disk type: Standard Persistent Disk, with a Disk size of 500 GB
Node Count: As needed
Local SSD: 0 (default)
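The same node configuration can also be expressed with the gcloud CLI. A minimal sketch, assuming hypothetical cluster, region, and zone names and an example worker count; adjust these and the image version to match your selections in the console:

# Create a Dataproc cluster with the recommended master and worker shapes (all names are placeholders)
gcloud dataproc clusters create kyvos-dataproc \
    --region=us-central1 \
    --zone=us-central1-a \
    --master-machine-type=n2-highmem-4 \
    --master-boot-disk-type=pd-standard \
    --worker-machine-type=n2-highmem-8 \
    --worker-boot-disk-type=pd-standard \
    --worker-boot-disk-size=500GB \
    --num-workers=2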
Click Customize Cluster on the left, and define the Network Configuration as:
NOTE: You must specify the same network details as used for the Kyvos instances.
Under Internal IP Only, select the Configure all instances to have only internal IP addresses checkbox.
In the Dataproc Metastore section, select the Metastore service from the list to use Dataproc Metastore as the Hive metastore.
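If you create the cluster with the gcloud CLI instead of the console, the corresponding network and metastore settings are flags on the same create command sketched earlier. A minimal sketch, assuming a hypothetical subnet, project, and Dataproc Metastore service name:

# Network and metastore flags; in practice these are added to the create command shown earlier (names are placeholders)
gcloud dataproc clusters create kyvos-dataproc \
    --region=us-central1 \
    --subnet=kyvos-subnet \
    --no-address \
    --dataproc-metastore=projects/my-project/locations/us-central1/services/kyvos-metastore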
Creating Initialization Actions
For an SSH-disabled Dataproc environment, provide the dataproc.sh script (included in the GCP Installation Files folder in gcp.tar) to ensure that your snapshot bundles are uploaded to the bucket. To do this, go to Initialization actions, click the Add initialization Action button, and use the Browse button to select the dataproc.sh script.
NOTE: The dataproc.sh script must be available in your bucket.
For a Livy Server-enabled Dataproc environment, provide the livyserver.sh script (included in the GCP Installation Files folder in gcp.tar) to ensure that the Livy Server is deployed with the Dataproc cluster. To do this, go to Initialization actions, click the Add initialization Action button, and use the Browse button to select the livyserver.sh script.
NOTE: The livyserver.sh script must be available in your bucket.
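Before creating the cluster, you can upload the scripts to the bucket and confirm they are present; they can then be referenced as initialization actions. A minimal sketch, assuming a hypothetical bucket name and that both scripts have been extracted from gcp.tar to the current directory:

# Upload the scripts to your bucket and verify they exist (bucket name is a placeholder)
gsutil cp dataproc.sh livyserver.sh gs://my-kyvos-bucket/
gsutil ls gs://my-kyvos-bucket/dataproc.sh gs://my-kyvos-bucket/livyserver.sh

# Equivalent create-time flag when using the gcloud CLI instead of the console
gcloud dataproc clusters create kyvos-dataproc \
    --region=us-central1 \
    --initialization-actions=gs://my-kyvos-bucket/dataproc.sh,gs://my-kyvos-bucket/livyserver.sh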
In the Cloud Storage staging bucket area, select the Cloud Storage bucket that you want to use.
Click the Create button on the left.
Once the cluster is created, stop the Dataproc VM instances, click the instances > Edit, scroll to the SSH Keys section, and provide the same public key that was used for the Kyvos instances.
NOTE: The Service Account attached to Dataproc must have the Dataproc Worker role assigned to it.
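Both of these post-creation steps can also be performed from the command line. A minimal sketch, assuming hypothetical instance, zone, project, and service account names, and a local file containing the public key entry:

# Add the same public key used for the Kyvos instances to a Dataproc VM (names are placeholders)
gcloud compute instances add-metadata dataproc-worker-0 \
    --zone=us-central1-a \
    --metadata-from-file ssh-keys=ssh_keys.txt   # file contains lines of the form: username:ssh-rsa AAAA...

# Grant the Dataproc Worker role to the service account attached to the cluster
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:kyvos-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/dataproc.worker"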
Connect to the Master node and create Kyvos directories on HDFS using the following commands.
hadoop fs -mkdir -p /user/kyvos/temp
hadoop fs -chmod -R 777 /user/kyvos
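To confirm that the directories were created with the expected permissions, you can list them; a minimal check:

# Verify the Kyvos HDFS directories and their permissions
hadoop fs -ls -d /user/kyvos /user/kyvos/temp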
Note
Once created, you can validate whether the resources meet the requirements for installing Kyvos on the Google Cloud Platform.
To deploy the Kyvos cluster using password-based authentication for service nodes, ensure that the permissions listed here are available on all the VM instances for the Linux user deploying the cluster.
To deploy the Kyvos cluster using custom hostnames for resources, ensure that the steps listed here are completed on all the resources created for use in the Kyvos cluster.
Next: Deploy the Kyvos GCP cluster through Kyvos Manager