For designing semantic models

Applies to: Kyvos Enterprise Kyvos Cloud (SaaS on AWS) Kyvos AWS Marketplace

Kyvos Azure Marketplace Kyvos GCP Marketplace Kyvos Single Node Installation (Kyvos SNI)

Identify dimensions, attributes, and measures from a logical understanding of your business’s use cases and schema

Understanding of use cases is the most important first step for creating an optimized semantic model. You should choose dimensions and measures that reflect the use cases for which you are building the semantic model. Careful planning lets you reduce the semantic model process time and improve query performance by deciding if columns should be taken as attributes, as levels, or if they are required at all.

Why: This helps in identifying the unnecessary dimensions, attributes, and measures in the semantic model that are usually added as they are available in the table/transformation. Adding such dimensions and attributes increase the combination of dimension for aggregation at the time of semantic model process. This results in increasing in the process time and semantic model size. This may also impact the query time.

Avoid designing dimensions, which are not used on reports/dashboards to create visualizations

For example, columns like ID, GUID should be avoided unless necessary. If absolutely required, design count measure on these columns. There might be columns/fields in data that are used to globally identify each row. These columns are generally not used in reports/dashboards as these are just identifiers. Hence, you can drop these from the semantic model design unless asked to include it. Remember, to save the semantic model process time and size, exclude dimensions that are not required by the use cases.

Why: Fields like ID and GUID which are not used in visualizations and reports and usually have very high cardinality. If we add such fields in the dimension of the semantic model we will increase the cardinality of the dimension. An increase in the cardinality of dimension increases the combination exponentially so adding such fields will result in both increase in the process time, semantic model size and query time.

Identify cardinalities of dimensions and attributes and design semantic model based on that

Cardinality refers to the number of distinct values available in a column. For example, the cardinality of states in the US is 50 whereas the cardinality of the number of customers in the US could be in millions.

Cardinality plays a very crucial role in optimizing semantic mdoel process time and size. These are pre-aggregated in the semantic model to improve query response time. Dimensions with low cardinality should be included as attributes, which are aggregated at run time.

Identify the cardinality of dimensions, attributes, and distinct count measures using any of the following methods.

If raw data is available in Hive tables, execute Hive queries to identify cardinality of dimensions, attributes, and distinct count measures.
Otherwise, you can use a Kyvos semantic model process to identify cardinality.
1. Star schema: Run a test semantic model process after logically designing a semantic model and view the job summary. The job summary shows the cardinality of each dimension in the semantic model.
2. Single file schema: Run a full job with a small volume of data (for example, one day’s data) and view the job summary. Then you can extrapolate the cardinality of complete data according to the job summary.
Check cardinalities by running a data profiling job before starting the design.

If the cardinality of the overall dimension is very high, then modify your semantic model design to distribute attributes in multiple dimensions. This helps in reducing the semantic model process time and size. If the cardinality of one dimension is high, divide it into multiple dimensions distributing the attributes to reduce the overall cardinality of one single dimension. It is always recommended to understand the business use case before making such changes in the semantic model design.

Design decisions that you can take, based on cardinality

Logical grouping of columns to avoid cardinality explosion. You can run some profiling queries on the database to check this.
Keep low cardinality dimension as Attribute.
Keep high cardinality dimension as Level.
Avoid adding a frequently used filter column as an attribute.
Avoid creating a lot of dimensions with a single level or attribute coming from the same node.
Combine dimensions with a fewer number of attributes (1-3) into one dimension to reduce the number of dimensions.
Keep the attributes in a single dimension which are likely to be browsed together.

Avoid using wildcard characters or special symbols when naming dimensions, measures, and semantic model. Many external visualization tools do not support special symbols; hence it is best to avoid these.

Use Approximate/Boundary Distinct Count whenever possible

There are two types of distinct count: Approximate and Accurate. Using the Approximate count improves performance; whereas the Accurate count increases the size and processing time of the semantic model.

Why: The approximate distinct counts are computationally less expensive as compared to accurate ones. The calculation uses a probability-based data science algorithm called HyperLogLog. There can be many business use cases that do not need accurate distinct counts, or it is acceptable to trade off accuracy over performance. Some businesses also want to keep the size of the semantic model optimal. In such cases, it is preferred to use Approximate distinct counts.

When using a sliding window in a semantic model, always use a time type dimension

A sliding window refers to how much historical data should be retained within the semantic model(in years, months, or days, etc.) For example, if a semantic model is incrementally built daily, then a sliding window of 10 days means that the semantic model will only retain data for the last 10 days.

To set up a sliding window, you first need to define partitions for the semantic model. A partition is a container for a portion of the semantic model data. Partitions are used to manage and store data and aggregations for measures in a semantic model. Every semantic model has at least one default partition; this partition is created when the semantic model is built.

Why: When you update or replace data in a semantic model partition, you must specify the full range of data for that partition. For example, if a semantic model partition contains a day’s worth of data, then you must specify the time range as a day.

Identify partitioning of raw data and user query patterns to decide the best semantic model partition strategy.

Configure your partitions to a higher level of aggregation (quarterly, monthly) rather than a more granular level (day).

Define subpartitions on columns that are frequently used filter columns in reports. For example, in 80% of the queries product filter will always be applied. In this case, you can add a subpartition of the product level.

Avoid adding high cardinality columns as subpartition. Columns with low cardinalities ( <100) should be considered as subpartitions.

If you want to add partitions from a hierarchy having multiple hierarchies, then design it as the first hierarchy.

Why: This allows for efficient processing of replacing partitions of semantic model whenever underlying data in corresponding Hive partitions are modified. This also results in faster query response as queries are run on only the required partitions.

Use Physical view wherever required

In case your business requirement is not covered in semantic model design best practices recommendations, then use the physical view while designing the semantic model for best practices and logical view for business requirements.

Perform test processs before doing a full process

It’s better to test and identify design-related issues on a small data set rather than on millions of rows, making it is easier to backtrack and debug. As processing semantic models on large volumes of data takes time, processing a semantic model on a smaller set of data will help identify design optimizations early.

It is recommended to perform process on 2 to 3 partitions of raw data. For example, if your data is partitioned on year and month, then you can process a semantic model for 3 months of data.

Execute your possible business use cases and take recommendations from Kyvos

Before scheduling a full job, you must first run a test job. On the test built semantic model, it is recommended to execute your most commonly used queries/business use cases.

Once this is done, go back to the semantic model designer, and take recommendations. See Cube Recommendations for details.

Resilient semantic model process with transformation (table metadata) changes over time

There are time when metadata changes in different datasets are requires. Following are the scenarios which might occur and possible solutions ensure the semantic model doesn't fail.

Column got added in table (at the end) - Dataset SQL has list of column names in "Select"
- No changes required and process will not have impact
Column got added in table (in-between) - Dataset SQL has list of column names in "Select"
- No changes required and process will not have impact
Column got added in table (at the end) - Dataset SQL has "Select *"
- No changes required and process will not have impact
Column got added in table (in-between) - Dataset SQL has "Select *"
- Refresh and save the dataset again. Data type or format changes will not be overwritten when saving the refreshed metadata.
Column got deleted in table
- Column used in semantic model
  - Refresh dataset and Relationship design and save. Remove that attribute from the semantic model design, and perform an incremental process. That deleted column will become irrelevant and wont be available for querying.
- Column not used in semantic model
  - Refresh dataset and save.
Dataset SQL has "" as IsAdjusted and processses are done
- Column got added in table (Same position as of dataset or different position) -
  Remove that column from Dataset SQL (tds-service incremental query needs to be updated), refresh and save dataset/Relationship design. semantic model rebuild won't be required.
If data type is changed of column in table used in semantic model
- For example, for an existing integer column data type is changed to string then dataset will start giving warning but process will only fail in case Kyvos receive string value in that column instead of integer otherwise process won't fail.