🧮 The art of selecting columns for Insights

Identify the columns to include or exclude in your Insight generation

The final step for configuring Insights is selecting the columns to be excluded and included in the Insight generation. Including relevant columns ensures a comprehensive analysis, while excluding irrelevant ones removes noise, leading to clearer insights.

Consider the following configuration for generating a Key Driver Insight:

Included columns

These columns are analyzed to determine their impact on the target variable (in this case, "Category = Furniture"). Their data is used to find patterns and relationships that explain what drives the target.

Excluded columns

Excluded columns are left out of the analysis. A column may be excluded because it is the target of the analysis ("Category" in this case), because its data type is incompatible, or because it has been deemed irrelevant or redundant for this particular analysis.

The art of including and excluding columns

For the above configuration, the following screen will be displayed as the next step. Certain columns will be automatically excluded from the Insight generation.

Deciding which columns to include or exclude should be strategic. In the above example, "Order_Date" and "Ship_Date" are excluded due to their data type (date), which could be unsuitable for the specific driver analysis.

"Order_ID" can also be excluded: it is a unique identifier for each order, so it carries no signal for Insight generation. "Category" is excluded because it is the target column, the variable whose drivers you are trying to analyze.

On the other hand, we can include "City" and "State" which could reveal geographic trends affecting furniture sales.

Which columns are better to exclude?

High-cardinality columns (where most or all rows have a unique value), such as Order ID and Customer Name, often introduce more noise than useful information.
Date-type columns are excluded automatically because their primary function is to organize and filter the data chronologically rather than to serve as an independent variable that could drive change. Suppose you're analyzing the factors that affect monthly sales. If you included the "Sale_Date" column as a driver, every unique date would be treated as a potential explanatory factor for sales. However, "Sale_Date" is simply a timestamp, not a variable that influences sales.
Exclude the target column. Including it among the factors would create a circular reference in which the target erroneously appears to "drive" itself. Imagine you're analyzing what influences the number of furniture items sold. If you included "Number of furniture sold" as a factor, you would find that it is a perfect predictor of itself, which is a tautology and provides no new information.
Exclude columns that demand significant computational power and slow down the analysis without adding valuable insights.
Exclude columns that reflect the idiosyncrasies of individual data points rather than broader trends.
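
The guidelines above can be approximated with a short script when reviewing a dataset before configuring an Insight. The following is a minimal sketch in plain Python, not the product's actual exclusion logic: the function name suggest_exclusions, the 0.9 uniqueness threshold, the "Sales" column, and the sample rows are all assumptions made up for illustration.

```python
from datetime import date, datetime


def suggest_exclusions(rows, target, cardinality_threshold=0.9):
    """Return {column: reason} for columns the guidelines suggest excluding.

    rows is a list of dicts, one per record. The threshold is a hypothetical
    cut-off: a text column is flagged when at least this share of its
    values is unique.
    """
    reasons = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row[col] for row in rows if row[col] is not None]
        if col == target:
            # The target cannot be a driver of itself.
            reasons[col] = "target column"
        elif values and all(isinstance(v, (date, datetime)) for v in values):
            # Date columns organize the data rather than explain it.
            reasons[col] = "date type"
        elif (
            values
            and all(isinstance(v, str) for v in values)
            and len(set(values)) / len(values) >= cardinality_threshold
        ):
            # Nearly every row has its own value, e.g. an identifier.
            reasons[col] = "high cardinality"
    return reasons


# Made-up sample rows using column names from the example above.
rows = [
    {"Order_ID": "A-1", "Order_Date": date(2024, 1, 5), "Category": "Furniture", "State": "Texas", "Sales": 100.0},
    {"Order_ID": "A-2", "Order_Date": date(2024, 1, 6), "Category": "Technology", "State": "Texas", "Sales": 250.5},
    {"Order_ID": "A-3", "Order_Date": date(2024, 1, 6), "Category": "Furniture", "State": "Ohio", "Sales": 80.0},
    {"Order_ID": "A-4", "Order_Date": date(2024, 2, 1), "Category": "Office Supplies", "State": "Texas", "Sales": 40.0},
]

excluded = suggest_exclusions(rows, target="Category")
included = [col for col in rows[0] if col not in excluded]
print(excluded)
print(included)
```

The uniqueness check is limited to text columns on purpose: continuous numeric columns such as a sales amount are often unique per row yet can still be meaningful drivers. Treat the output as a starting point and review each suggestion against your own knowledge of the data.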
