# Page

Initial Metadata Configuration:

### Scripting your dataset

* Be cautious with Python or PySpark transformations on datasets exceeding \~1 million rows: the system may display a warning or fail due to resource constraints. Use smaller samples or rely on SQL pushdown if your dataset is extremely large.
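
One way to follow this advice is to cap the number of rows a Python transformation sees while you develop it. The sketch below is illustrative only: it works on plain Python lists rather than any Tellius API, `MAX_ROWS` and `sample_rows` are hypothetical names, and the threshold simply mirrors the approximate 1 million row guidance above.

```python
import random

# Assumed threshold, mirroring the ~1 million row guidance above.
MAX_ROWS = 1_000_000


def sample_rows(rows, max_rows=MAX_ROWS, seed=42):
    """Return rows unchanged if small enough, else a reproducible random sample."""
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same rows
    return rng.sample(rows, max_rows)


# Usage: develop the transformation on a sample, then move heavy logic to SQL.
rows = [{"id": i, "amount": i * 2} for i in range(10)]
subset = sample_rows(rows, max_rows=4)
print(len(subset))
```

A fixed seed keeps the sample stable between runs, which makes it easier to compare the output of successive versions of a transformation.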

### Merging your dataset with Data Fusion

**Where It Fits**:

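* Data Fusion is the simpler, UI-based alternative to Scripting: use it for a straightforward merge of exactly two datasets, typically after metadata has been configured.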

**Key Points**:

1. **Limitation—Two Datasets Only**:
   * Data Fusion merges only two datasets at a time. If you have a scenario requiring more than two, or advanced transformations, you’ll have to switch to **Scripting**.

2. **No SQL**:
   * Data Fusion doesn’t let you write custom SQL or do advanced aggregations before joining. You can pick columns or do basic matching, but you can’t easily script business rules or use subqueries.

3. **Use Cases**:
   * **Non-Technical Users**: A user who doesn’t know SQL can join two tables by specifying matching columns.
   * **Quick Merge**: If you only need a straightforward, single join, Data Fusion can be faster than building a script from scratch.

4. **Business Impact**:
   * For large enterprise data or complex requirements, Data Fusion might be too basic. But it’s great for a simple scenario: e.g., “Join my small reference table to my main dataset quickly.”

5. **Handling Large Data**:
   * Datasets can reach **millions of rows**. For transformations:
     * **SQL** nodes are typically more scalable than Python or PySpark for big data.
     * Live mode in Snowflake or other warehouses can bypass copying data, but you lose advanced data-prep options in Tellius.

6. **Scheduling Refreshes**:
   * Refresh can be set at the dataset level in **Prepare** → **Schedule** or at the **Connect** level.

7. **Metadata Tools**:
   * Some environments have **Kaiya** auto-generation for display names and synonyms. This can save time but requires manual review.

8. **Statistics on Columns**:
   * At the top of each column in the **Data** tab, you can see unique counts, min, max, or filter/sort the sample. This helps you quickly check data quality (e.g., ensuring a column is unique where expected).
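
The column checks described in item 8 can also be reproduced on an exported sample. The following is a minimal sketch in plain Python, not a Tellius API; `column_stats` and the sample rows are hypothetical and exist only to show what "unique where expected" means in practice.

```python
def column_stats(rows, column):
    """Compute count, unique count, min, and max for one column of a row sample."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "count": len(values),
        "unique": len(set(values)),
        "min": min(values) if values else None,
        "max": max(values) if values else None,
    }


sample = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": 35.5},
    {"order_id": 2, "amount": 12.0},  # duplicate key on purpose
]

stats = column_stats(sample, "order_id")
# A column expected to act as a key is unique only when unique == count.
is_unique = stats["unique"] == stats["count"]
print(stats, is_unique)
```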

***

#### Conclusion

1. **Metadata**: The first crucial step to ensure your dataset is well-labeled, typed correctly, and logically separated into dimensions/measures.
2. **Scripting**: The powerhouse for multi-dataset merges, advanced SQL, or Python transformations. Key for scenarios with more than two datasets or custom logic.
3. **Data Fusion**: A straightforward, UI-based method to merge exactly two datasets, requiring little technical know-how.
4. **Workflow**: Typically, you start with metadata, do any necessary pipeline edits, then use Scripting for complex merges. Data Fusion is a simpler alternative for small merges.
5. **Performance & Refresh**: Mindful scheduling and large-data considerations remain essential—especially deciding between live mode or in-memory caching.

By aligning data transformations with each tab’s capabilities, you can prepare data at scale, handle advanced merges, and maintain a user-friendly metadata layer in Tellius.
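
To make the Scripting case concrete, the sketch below joins three datasets in a single SQL statement, which is the scenario that Data Fusion's two-dataset limit rules out. It runs against an in-memory SQLite database from Python's standard library purely for illustration; the table and column names (`orders`, `customers`, `regions`) are hypothetical, and the SQL dialect available in a Tellius SQL node may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions   (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region_id INTEGER);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);

INSERT INTO regions   VALUES (1, 'East'), (2, 'West');
INSERT INTO customers VALUES (10, 'Acme', 1), (11, 'Globex', 2);
INSERT INTO orders    VALUES (100, 10, 50.0), (101, 10, 25.0), (102, 11, 40.0);
""")

# Three-way join with an aggregation, expressed as custom SQL.
query = """
SELECT r.region_name, SUM(o.amount) AS total_amount
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
JOIN regions   AS r ON r.region_id = c.region_id
GROUP BY r.region_name
ORDER BY r.region_name
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

If the requirement were only the `orders` to `customers` join with matching key columns and no aggregation, Data Fusion would likely be the quicker route.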

### Schedule

* Refresh can be set at the dataset level in **Prepare** → **Schedule** or at the **Connect** level.
* Large parallel refreshes can overload the system, so the recommended practice is to refresh key datasets individually or in a controlled manner.
* “Linked datasets” can auto-refresh if the parent dataset changes, but this is often turned off by default to avoid a chain reaction of massive refreshes.
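
A controlled refresh can be as simple as spacing start times so that no two large datasets begin refreshing at once. The sketch below only computes a staggered timetable; it does not call Tellius, and `stagger_refreshes`, the dataset names, and the 30-minute gap are assumptions made for illustration. The resulting times would still need to be entered in **Prepare** → **Schedule**.

```python
from datetime import datetime, timedelta


def stagger_refreshes(datasets, first_start, gap_minutes=30):
    """Assign each dataset its own start time, spaced gap_minutes apart."""
    return {
        name: first_start + timedelta(minutes=gap_minutes * i)
        for i, name in enumerate(datasets)
    }


plan = stagger_refreshes(
    ["sales_fact", "customer_dim", "inventory"],  # hypothetical dataset names
    first_start=datetime(2024, 1, 1, 2, 0),       # 02:00, an arbitrary off-peak time
)
for name, start in plan.items():
    print(name, start.strftime("%H:%M"))
```

The gap should be at least as long as the slowest refresh usually takes; otherwise consecutive refreshes will still overlap.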

{% file src="/files/qyoUuZfMUxZpTn3wBNOv" %}

{% file src="/files/9ZKNI5i8Yx3aV25vpnyD" %}


