Editing Prepare → Data
Performing preliminary transformations to your datasets
Under Prepare → Data, you can validate data accuracy, review columns, see row distribution, and perform preliminary transformations (in Edit mode).
Lets you access column-level statistics displaying summary metrics and a quick visualization of the column's distribution.
Below the column name, you can find a green bar indicating the column's recognized data type (e.g., date/time, numeric, or string). Hovering over the bar displays "Main type: string 100.00%", which tells you that every single row (100% of values) fits the text/string pattern. No exceptions were detected that might suggest a numeric or date/time type.
Count: Total number of rows inspected
Missing (NULL): Number of records with no value in this column
Invalid: Number of entries that do not conform to the column's data type
Unique Value: How many distinct values appear in the column
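The four metrics above can be approximated with a short sketch. It is illustrative only: the function names are hypothetical, and it assumes that Count includes missing rows and that invalid entries still count toward Unique Value, neither of which this page confirms.

```python
def column_stats(values, is_valid):
    """Summarize a column in the spirit of the statistics panel."""
    missing = sum(1 for v in values if v is None)
    present = [v for v in values if v is not None]
    invalid = sum(1 for v in present if not is_valid(v))
    return {
        "count": len(values),         # every inspected row, including NULLs (assumption)
        "missing": missing,           # rows with no value
        "invalid": invalid,           # non-missing values that do not fit the type
        "unique": len(set(present)),  # distinct non-missing values (assumption)
    }


def looks_numeric(value):
    """Hypothetical validity check for a numeric column."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


ages = ["34", "41", None, "n/a", "34"]
stats = column_stats(ages, looks_numeric)
print(stats)
```

Here `"n/a"` is the single invalid entry and `None` the single missing one, so a numeric column with these values would be flagged for cleanup before use.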
A quick visualization on the right shows how the data is distributed.
Helps you to instantly gauge whether data is uniformly distributed or if certain ranges cluster heavily.
Spot potential anomalies, e.g., a spike in certain months or a total gap in a given time range.
In the above example, each vertical blue bar represents a set of date/time values plotted on an X-axis. The X-axis labels can appear bunched if the dataset is large or if date values are extremely granular. Hovering over a bar may clarify the distribution.
Click on the burger menu icon above the chart to view the following menu. Here, you can:
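To make the idea of spikes and gaps concrete, the following sketch groups date values into monthly buckets, roughly what each bar in such a chart represents. The bucket size and the helper names are assumptions for illustration; the product may bin values differently.

```python
from collections import Counter
from datetime import date


def monthly_counts(dates):
    """Count values per (year, month) bucket, one bucket per chart bar."""
    return Counter((d.year, d.month) for d in dates)


def missing_months(counts):
    """List months between the first and last bucket that hold no values."""
    if not counts:
        return []
    (year, month), last = min(counts), max(counts)
    gaps = []
    while (year, month) < last:
        if (year, month) not in counts:
            gaps.append((year, month))
        # Advance by one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return gaps


order_dates = [
    date(2023, 1, 5), date(2023, 1, 20), date(2023, 1, 28),
    date(2023, 2, 3), date(2023, 4, 11),
]
counts = monthly_counts(order_dates)
gaps = missing_months(counts)
print(counts.most_common(1), gaps)
```

In this sample, January holds the most values and March has none, which is the kind of spike and gap the chart is meant to surface.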
View the chart in full screen
Print the chart
Download the image (as PNG, JPEG, PDF, or SVG)
Click on the Filter icon and the above image will be displayed. This filter does not modify the dataset pipeline or permanently remove rows. Instead, it is a quick filter for on-screen data inspection: you are hiding certain rows in the immediate view without altering the underlying dataset.
If you do want to permanently remove or transform rows in the actual pipeline, you can click "Transform data" to switch modes and the following window will be displayed.
Unlike the view-only filter, applying a filter here alters the dataset in your pipeline or script. Rows that do not meet the condition are permanently removed from the dataset version that is being prepared.
The "+" icon lets you add further filter clauses (e.g., "Column A > 10" AND "Column B = 'XYZ'").
The transformation is saved in the pipeline. If you publish these changes, the dataset reloads with rows excluded per your filter logic.
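The difference between the view-only filter and the pipeline filter can be sketched as follows. This is a conceptual model, not the product's implementation: the clause format, the `OPS` table, and the `pipeline` list are all hypothetical.

```python
import operator

# Hypothetical clause operators; the product's filter dialog may offer others.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def matches(row, clauses):
    """True when the row satisfies every clause (clauses are joined with AND)."""
    return all(OPS[op](row[col], value) for col, op, value in clauses)


def apply_filter(rows, clauses):
    """Return a new list of matching rows; the input list is left untouched."""
    return [row for row in rows if matches(row, clauses)]


def run_pipeline(rows, steps):
    """Apply each saved transformation step in order."""
    for step in steps:
        rows = step(rows)
    return rows


rows = [
    {"Column A": 5, "Column B": "XYZ"},
    {"Column A": 12, "Column B": "XYZ"},
    {"Column A": 30, "Column B": "ABC"},
]
clauses = [("Column A", ">", 10), ("Column B", "=", "XYZ")]

# Quick filter: only what is shown changes; nothing is saved.
visible = apply_filter(rows, clauses)

# Transform filter: the step is stored and re-applied whenever the dataset reloads.
pipeline = [lambda rs: apply_filter(rs, clauses)]
published = run_pipeline(rows, pipeline)
print(len(rows), len(visible), len(published))
```

Both calls select the same rows. What differs is that the transform filter is recorded as a pipeline step, so downstream consumers only ever see the reduced dataset.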
Click on the ⬆ icon to sort the column data in ascending order.
Click on the ⬇ icon to sort the column data in descending order.
Click on any required column name, and you can view the following menu. These transform tools allow you to refine and reshape columns in various ways, whether adjusting data types, altering text, or performing merges and splits.
This submenu lets you convert a column's data type. Here are the options:
String: Interprets the column as textual data (e.g., "ABC123").
Double: Interprets the column as floating-point numeric type (e.g., 3.14159). Use if you need decimal precision or have fractional values.
Date: Interprets the column as a date (YYYY-MM-DD) without a time component.
Integer: Interprets the column as whole numbers only (e.g., 42).
Timestamp: Includes both date and time details. Use if you have data like 2023-01-15 10:25:00 or an ISO-8601 string (2023-01-15T10:25:00Z).
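As a rough illustration of what each target type accepts, the sketch below converts text values using Python's built-in parsers. The `CONVERTERS` table and the choice to turn unconvertible values into `None` are assumptions; this page does not state how the product treats values that fail conversion.

```python
from datetime import date, datetime, timezone


def to_timestamp(text):
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


# Hypothetical mapping from the menu's type names to Python parsers.
CONVERTERS = {
    "String": str,
    "Double": float,
    "Integer": int,
    "Date": date.fromisoformat,    # expects YYYY-MM-DD
    "Timestamp": to_timestamp,
}


def convert_column(values, target_type):
    """Convert each value; values that do not fit the type become None (assumption)."""
    convert = CONVERTERS[target_type]
    result = []
    for value in values:
        try:
            result.append(convert(value))
        except (TypeError, ValueError, AttributeError):
            result.append(None)
    return result


as_int = convert_column(["42", "3.14", "abc"], "Integer")
as_double = convert_column(["42", "3.14"], "Double")
as_date = convert_column(["2023-01-15"], "Date")
as_ts = convert_column(["2023-01-15 10:25:00", "2023-01-15T10:25:00Z"], "Timestamp")
print(as_int, as_double, as_date, as_ts)
```

Note that `"3.14"` fails as an Integer but succeeds as a Double, which is why a column with fractional values should be converted to Double.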
This submenu is for general transformations (not strictly text-based). Options include:
Add Column: Creates a new empty column.
Rename Column: Changes the actual column name.
You can change the name of a column in your dataset, but doing so might break existing connections or processes that depend on the current column name. To avoid these risks, use "Display Name" as an alternative, which lets you show a different name without actually renaming the column itself.
Move Column: Reorder columns in the dataset (e.g., bring an important column to the front). This has sub-options like Before previous column, After next column, Before column, and After column.
Merge column: Combine two columns into one, often used to concatenate strings (e.g., FirstName + LastName) or unify numeric fields. Here, you specify another column to merge and provide a name for the newly merged column.
Find and Replace: Search for specific text or patterns in the selected column and replace them with something else.
Set as Target variable: Usually relevant for ML or predictive analytics tasks. Indicates that this column is the outcome variable (label) for training a model.
Split Rows: Splits each row of data if it contains multiple, line-delimited items. If a single cell has multiple lines or values separated by line breaks, this transforms them into multiple rows. The Delimiter field specifies the exact character or substring used to identify where to break a single row into multiple rows.
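Merge column and Split Rows are the two options in this list that reshape data, so a small sketch may help. It is an approximation under stated assumptions: the function names and the `separator` parameter are hypothetical, and it assumes the source columns are kept after a merge, which this page does not specify.

```python
def merge_columns(rows, left, right, new_name, separator=" "):
    """Add a column that joins two existing columns as text."""
    return [
        {**row, new_name: f"{row[left]}{separator}{row[right]}"}
        for row in rows
    ]


def split_rows(rows, column, delimiter="\n"):
    """Emit one output row per delimited item found in `column`."""
    result = []
    for row in rows:
        for item in str(row[column]).split(delimiter):
            # Every other column is copied unchanged into each new row.
            result.append({**row, column: item})
    return result


people = [{"FirstName": "Ada", "LastName": "Lovelace", "Tags": "math\npoetry"}]
merged = merge_columns(people, "FirstName", "LastName", "FullName")
split = split_rows(merged, "Tags")
print(merged[0]["FullName"], [row["Tags"] for row in split])
```

One input row with two line-separated tags becomes two output rows, each carrying the same name fields and a single tag.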
If your dataset contains textual columns you want to analyze, these transformations help standardize or clean the text for better search, NLP, or machine learning outcomes.
Upper case: Converts the entire column's text to upper case (e.g., abc → ABC).
Lower case: Converts all text to lower case (e.g., ABC → abc).
Remove stop words: Removes common filler words from text (e.g., "the", "and", "of"), often used in NLP or text analytics.
Stem: Applies a stemming algorithm (e.g., Porter stemmer) to reduce words to their base form (e.g., "running", "runs" → "run"). Often used to group word variants. Note that suffix-based stemmers such as Porter generally leave irregular forms like "ran" unchanged.
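The text transformations above can be chained, as in this minimal sketch. The stop-word list is a small illustrative subset, and `naive_stem` is a deliberately simple suffix stripper written for this example. It is not the Porter algorithm and not necessarily what the product uses.

```python
STOP_WORDS = {"the", "and", "of", "a", "an", "in", "to"}  # illustrative subset


def remove_stop_words(text):
    """Drop stop words, comparing case-insensitively."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def naive_stem(word):
    """Strip a couple of common English suffixes. NOT the Porter algorithm."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Collapse a doubled final consonant left behind, e.g. "runn" -> "run".
        if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


title = "The Running of the Bulls"
lowered = title.lower()
without_stops = remove_stop_words(lowered)
stemmed = " ".join(naive_stem(w) for w in without_stops.split())
variants = [naive_stem(w) for w in ["running", "runs", "ran"]]
print(lowered, "|", without_stops, "|", stemmed, "|", variants)
```

The order matters: lower-casing first lets the stop-word check and the stemmer work on consistent input. Irregular forms such as "ran" pass through unchanged because no suffix rule applies to them.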
Permanently removes the selected column from the dataset pipeline.
Same as the filter explained above.