For adding datasets
Applies to: Kyvos Enterprise, Kyvos Cloud (SaaS on AWS), Kyvos AWS Marketplace, Kyvos Azure Marketplace, and Kyvos Single Node Installation (Kyvos SNI)
Hide unused columns
Your files or Hive tables may contain columns used by your Kyvos semantic model as well as unused columns. The registration process allows you to select the columns that Kyvos should read. Similarly, if you are using SQL to register input data in Kyvos, it is always recommended to include only the required columns in the SELECT clause of the SQL.
Selecting only the required columns improves performance at every stage of the transformation and semantic model process by reducing the memory footprint and network bandwidth they require.
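As a minimal sketch, assuming a hypothetical sales_orders table with many columns of which the semantic model needs only four, an SQL-based register file would list just those columns rather than reading everything:

  -- Hypothetical example: only the columns the semantic model needs appear
  -- in the SELECT clause, so unused columns are never read or transferred.
  SELECT
    order_id,
    customer_id,
    order_date,
    sale_amount
  FROM sales_orders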
Specify input format for date columns
Date columns generally store dates as strings, and Kyvos reads them using the default format YYYY/MM/DD. If the date format of the source data does not match the default format, Kyvos throws an error while validating the register file. To improve performance and avoid errors, specify the format string for each date column.
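For SQL-based datasets, the date can also be parsed explicitly in the SQL itself. The sketch below is an assumption-laden illustration: it uses a hypothetical sales_orders table whose order_date is stored as a string in dd/MM/yyyy format and Spark SQL's to_date function:

  -- Hypothetical example: the source stores order_date as a string in
  -- dd/MM/yyyy, so it is parsed explicitly instead of relying on the default.
  SELECT
    order_id,
    to_date(order_date, 'dd/MM/yyyy') AS order_date
  FROM sales_orders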
Re-validate the format and save when the meta information of a dataset changes
The meta information of a dataset can change for many reasons: a change in the data type, a column name, or the location of the underlying file on HDFS. In any of these cases, re-validate the file format to avoid a metadata mismatch and subsequent failures in transformation or semantic model building.
However, for changes in an HCatalog table, especially when the dataset is imported through a .cab file, be very careful while overwriting metadata, as this may change the imported design.
After making changes to the underlying files, click the FORMAT button.
Hadoop Files vs. HCatalog Tables vs. SQL
A register file can be created in the following ways:
File-based register file: When a file on a distributed file system (HDFS, S3, Blob storage, etc.) is registered, the complete data, including subfolders, is read from the source and handled for processing.
Table-based register file: This includes Hive, Databricks, Glue, Snowflake, and HCatalog tables.
SQL-based register file.
The following scenarios show when to use each of these options to register the data.
Partitioned data: When data is partitioned, a Hadoop file-based dataset is not advisable because it does not provide schema control. When using a partitioned table (Hive, Databricks, Glue, Snowflake, HCatalog), you can apply partition filters in the SQL or in the Filter step of the dataset, so only the data in the specified partitions is passed for processing. This improves data read performance (see the sketch after this list).
Frequently changing filters: An SQL-based dataset is the best option when the filters change frequently.
The best option to choose: Users can use either a table-based dataset or an SQL-based dataset, depending on their business needs.
If only selected columns are needed for the semantic model design, an SQL-based register file saves the extra step of hiding unused columns.
File-based register file: There are times when HDFS should be used, such as for unstructured data.
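As a minimal sketch, assuming a hypothetical sales_orders table partitioned on an order_month column, a partition filter in the dataset SQL limits the read to the required partitions:

  -- Hypothetical example: sales_orders is partitioned by order_month, so only
  -- the listed partitions are scanned instead of the full table.
  SELECT
    order_id,
    customer_id,
    sale_amount
  FROM sales_orders
  WHERE order_month IN ('2024-01', '2024-02')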
Define data sub-types
Kyvos allows you to change the data type for columns while registering files. The Data Types supported are char, date, and number.
If it is a numeric field, then select Data Sub Type from:
Long: If the column contains integer values, use this option.
Float: Float is useful for large decimal numbers. It keeps roughly seven significant digits and rounds off digits beyond that, which reduces precision. This data type should never be used for precise values, such as currency.
Double: The double data type keeps roughly 15 significant digits; float is single precision and double is double precision. Values beyond this precision are also rounded or truncated, so double should not be used for precise values, such as currency, either (a short illustration follows the note below).
Note
If the raw data contains decimal values with high precision and you do not want any precision loss in querying after the semantic model process, set the Data Type of the respective field to Decimal with the correct precision; otherwise, precision will be lost. The semantic model process size may vary when the data type is Decimal.
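As a quick illustration of the float and double precision limits described above, the following Spark SQL casts the same arbitrary literal to both types:

  -- FLOAT keeps roughly 7 significant digits, so the value below is rounded;
  -- DOUBLE keeps roughly 15 to 16 digits and preserves it at the displayed precision.
  SELECT
    CAST(1234567.891234567 AS FLOAT)  AS float_value,
    CAST(1234567.891234567 AS DOUBLE) AS double_value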
Format option to match the Data type and data value format
The custom date format should match the format of the underlying data.
For example, if the date format of the underlying data source is yyyy-MM-dd HH:mm:ss.S, the dataset date format must be the same as the source date format. Any other format value, such as yyyy-MM-dd HH:mm:ss or yyyy-MM-dd, will cause the semantic model process to fail.
Decimal: Use the correct data type for the underlying data (register decimal data as decimal).
A decimal column in SQL or a table is read as double in Kyvos, and Kyvos shows a warning for such columns.
Change the data type of these columns to decimal with the proper precision and scale to avoid any data inaccuracy.
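As a minimal sketch, assuming a hypothetical sales_orders table whose amount column is read as double, an explicit cast to decimal with the required precision and scale avoids the loss described above:

  -- Hypothetical example: amount arrives as double; casting it to an explicit
  -- precision and scale preserves exact values for currency-style columns.
  SELECT
    order_id,
    CAST(amount AS DECIMAL(18, 4)) AS amount
  FROM sales_orders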
Refresh the dataset metadata when the SQL or source table changes
Refresh the dataset metadata whenever the underlying table's data format or data types change. Skipping this step can lead to semantic model process failures or data inconsistencies; Kyvos does not recommend reading data from a stale design.
Select required columns or hide unused ones
When writing SQL, select the required columns instead of using 'select *' (see the sketch under Hide unused columns above). If using a table or file as a source, hide the columns that are not required in the next steps of semantic model design.
For SQL-based register files, follow Spark SQL tuning techniques
Kyvos internally uses Spark SQL for SQL-based datasets. The following are a few techniques to consider while writing SQLs (a sketch combining several of them follows this list).
Join the two tables on columns of the same data type
Use GROUP BY in place of the DISTINCT clause
Avoid using hash functions
Avoid using RANK or ROW_NUMBER functions
Apply the salting technique to distribute records uniformly among tasks
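The following Spark SQL is a minimal sketch rather than a prescribed pattern, and all table and column names (fact_sales, dim_customer, customer_id, sale_amount, customer_name) are hypothetical. It combines three of the techniques above: the join keys are cast to the same type, aggregation uses GROUP BY rather than DISTINCT, and a salt column spreads a skewed join key uniformly across tasks:

  -- Hypothetical tables: fact_sales (skewed on customer_id) joined to dim_customer.
  WITH salted_fact AS (
    SELECT
      CAST(customer_id AS BIGINT) AS customer_id,          -- match the dimension key type
      CAST(FLOOR(RAND() * 10) AS INT) AS salt,              -- random salt 0-9
      sale_amount
    FROM fact_sales
  ),
  salted_dim AS (
    SELECT
      CAST(c.customer_id AS BIGINT) AS customer_id,
      s.salt,
      c.customer_name
    FROM dim_customer c
    CROSS JOIN (SELECT EXPLODE(SEQUENCE(0, 9)) AS salt) s   -- replicate each row per salt value
  )
  SELECT
    d.customer_name,
    SUM(f.sale_amount) AS total_sales                        -- GROUP BY instead of DISTINCT
  FROM salted_fact f
  JOIN salted_dim d
    ON f.customer_id = d.customer_id AND f.salt = d.salt
  GROUP BY d.customer_name

Salting the fact side and replicating the dimension side across the same salt values keeps the join result identical while splitting a hot key across ten tasks instead of one.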
Note
There is no limit on the number of characters or lines in the dataset SQL editor. SQLs with 440+ lines and approximately the same number of columns have been verified successfully in the Kyvos QA lab.
Use incremental identifiers when incoming data is in increasing order
This is helpful for scenarios such as new employees being added to the organization.
It is not advisable for slowly changing dimensions.
The dimension should have an incremental identifier column.
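Conceptually, and with hypothetical table, column, and identifier values (employees, employee_id, and the last processed value 1000245), an incremental identifier lets an incremental build read only the rows added since the previous build:

  -- Hypothetical example: employee_id only ever increases, so an incremental
  -- build can read just the rows added after the last processed identifier.
  SELECT
    employee_id,
    employee_name,
    department,
    hire_date
  FROM employees
  WHERE employee_id > 1000245   -- last identifier processed in the previous build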
Copyright Kyvos, Inc. All rights reserved.