Cardinality
Cardinality refers to the number of distinct values available in a column. For example, the cardinality of states in the US is 50, whereas the cardinality of customers in the US could be in the millions. The lower the cardinality, the more duplicated values a column contains. Thus, a column with the lowest possible cardinality would have the same value for every row.
Cardinality plays a crucial role in optimizing semantic model process time and size. Dimension values are pre-aggregated in the semantic model to improve query response time. Dimensions with low cardinality should be included as attributes, which are aggregated at run time.
Identify the cardinality of dimensions, attributes, and distinct count measures using the following methods.
- If raw data is available in Hive tables, execute Hive queries to identify the cardinality of dimensions, attributes, and distinct count measures.
- Otherwise, use a Kyvos semantic model job summary to identify cardinality.
  - Star schema: Run a test semantic model job after logically defining the semantic model, and view the job summary. The job summary shows the cardinality of each dimension in the semantic model.
  - Single file schema: Run a full process with a small volume of data (for example, one day's data) and view the job summary. You can then extrapolate the cardinality of the complete data from the job summary.
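For the Hive approach described above, the query is usually a distinct count per column. The following is a minimal sketch, not a Kyvos-provided script: the table name `sales` and the columns `state` and `customer_id` are hypothetical, and Python's built-in `sqlite3` module stands in for a Hive connection so that the example runs on its own. The generated `COUNT(DISTINCT ...)` statement uses syntax that Hive also accepts.

```python
import sqlite3


def cardinality_query(table, column):
    """Build a distinct-count query for one column (names are hypothetical)."""
    return f"SELECT COUNT(DISTINCT {column}) FROM {table}"


# Assumption: an in-memory SQLite database stands in for the Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (state TEXT, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("CA", 1), ("CA", 2), ("NY", 3), ("NY", 1), ("TX", 4)],
)

# Run one distinct-count query per candidate dimension or attribute column.
cardinalities = {
    column: conn.execute(cardinality_query("sales", column)).fetchone()[0]
    for column in ("state", "customer_id")
}
print(cardinalities)  # {'state': 3, 'customer_id': 4}
```

On large Hive tables, exact distinct counts can be expensive, so it may be worth running such queries per column and on a representative partition first.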
If the overall cardinality of a dimension is very high, modify your semantic model design to distribute its attributes across multiple dimensions. This lowers the cardinality of any single dimension and helps reduce semantic model process time and size. It is always recommended to understand the business use case before making such changes in the semantic model design.
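The effect of splitting a dimension can be illustrated with a small calculation. This sketch assumes, for illustration only, that the overall cardinality of a dimension corresponds to the number of distinct combinations of its attribute values; the rows and the attribute names (city, product, channel) are hypothetical, and the way Kyvos computes cardinality internally may differ.

```python
# Hypothetical rows for one wide dimension: (city, product, channel).
rows = [
    ("Austin", "Laptop", "Web"),
    ("Austin", "Phone", "Store"),
    ("Boston", "Laptop", "Store"),
    ("Boston", "Phone", "Web"),
    ("Austin", "Laptop", "Store"),
    ("Boston", "Tablet", "Web"),
]


def cardinality(rows, columns):
    """Count distinct value combinations across the given column positions."""
    return len({tuple(row[i] for i in columns) for row in rows})


# One dimension holding all three attributes together.
combined = cardinality(rows, (0, 1, 2))

# The same attributes distributed across three separate dimensions.
split = {
    "city": cardinality(rows, (0,)),
    "product": cardinality(rows, (1,)),
    "channel": cardinality(rows, (2,)),
}

print(combined)  # 6
print(split)     # {'city': 2, 'product': 3, 'channel': 2}
```

In this toy data, the combined dimension has six distinct members, while no split dimension has more than three. Whether a split is appropriate still depends on how the business queries use those attributes together.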