Initial Metadata Configuration:

Scripting your dataset

  • Use caution when running Python or PySpark transformations on datasets exceeding ~1 million rows. The system may display a warning or fail due to resource constraints. If your dataset is extremely large, work on a smaller sample or rely on SQL pushdown.
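One way to work on a smaller sample is sketched below. It is a generic, standard-library Python illustration of reservoir sampling, which draws a fixed-size random sample from a row stream without loading every row into memory. The function name `reservoir_sample` and the stand-in row source are assumptions made for illustration; this is not a Tellius API, and Tellius may offer its own sampling controls.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Hypothetical usage: range() stands in for a large row iterator.
sample = reservoir_sample(range(100_000), k=1_000)
small = reservoir_sample(range(5), k=10)  # fewer rows than k: everything is kept
```

A script can be developed and checked against the sample first, then run against the full dataset once the logic is settled.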

Fusing your dataset

Where It Fits:

Key Points:

  1. Limitation—Two Datasets Only:

    • Data Fusion merges only two datasets at a time. If you have a scenario requiring more than two, or advanced transformations, you’ll have to switch to Scripting.

  2. No SQL:

    • Data Fusion doesn’t let you write custom SQL or do advanced aggregations before joining. You can pick columns or do basic matching, but you can’t easily script business rules or use subqueries.

  3. Use Cases:

    • Non-Technical Users: A user who doesn’t know SQL can join two tables by specifying matching columns.

    • Quick Merge: If you only need a straightforward, single join, Data Fusion can be faster than building a script from scratch.

  4. Business Impact:

    • For large enterprise data or complex requirements, Data Fusion might be too basic. But it’s great for a simple scenario: e.g., “Join my small reference table to my main dataset quickly.”
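To make the "No SQL" limitation concrete, the sketch below shows the kind of logic that would need Scripting instead: an aggregation performed in a subquery before the join. It uses Python's built-in `sqlite3` module purely as a stand-in SQL engine. The table names (`customers`, `orders`), the columns, and the helper `aggregate_then_join` are hypothetical, and the SQL dialect available in Tellius may differ.

```python
import sqlite3


def aggregate_then_join(conn: sqlite3.Connection) -> list:
    """Total the orders per customer, then join the totals to the reference table."""
    query = """
        SELECT c.customer_id, c.region, t.total_amount
        FROM customers AS c
        JOIN (
            -- Aggregation before the join: the step a plain two-table merge cannot express.
            SELECT customer_id, SUM(amount) AS total_amount
            FROM orders
            GROUP BY customer_id
        ) AS t ON t.customer_id = c.customer_id
        ORDER BY c.customer_id
    """
    return conn.execute(query).fetchall()


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);
""")
rows = aggregate_then_join(conn)
```

If all you need is the plain match on `customer_id` with no prior aggregation, that is the simple case Data Fusion is meant for.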

  1. Handling Large Data:

    • Datasets can reach millions of rows. For transformations:

      • SQL nodes are typically more scalable than Python or PySpark for big data.

      • Live mode on Snowflake or other warehouses queries the data in place instead of copying it, but you lose the advanced data-prep options in Tellius.

  2. Scheduling Refreshes:

    • Refresh can be set at the dataset level in Prepare → Schedule or at the Connect level.

  3. Metadata Tools:

    • Some environments have Kaiya auto-generation for display names and synonyms. This can save time but requires manual review.

  4. Statistics on Columns:

    • At the top of each column in the Data tab, you can see unique counts, min, max, or filter/sort the sample. This helps you quickly check data quality (e.g., ensuring a column is unique where expected).
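The column statistics described above can be reproduced in a script when you want to check the same properties programmatically. The following is a minimal standard-library sketch; `profile_column` and its output keys are hypothetical names, not part of Tellius, and the statistics are computed over whatever list of values you pass in.

```python
from typing import Any, Dict, Sequence


def profile_column(values: Sequence[Any]) -> Dict[str, Any]:
    """Compute basic quality statistics for one column, ignoring None when comparing."""
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "unique": len(distinct),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        # True only when no non-null value repeats, e.g. for an expected key column.
        "is_unique": len(distinct) == len(non_null),
    }


# Hypothetical column: the repeated 2 means it cannot serve as a unique key.
stats = profile_column([3, 1, 2, 2, None])
```

An `is_unique` value of `False` on a column you expected to be a key is a signal to look for duplicates before joining on it.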


Conclusion

  1. Metadata: The first crucial step to ensure your dataset is well-labeled, typed correctly, and logically separated into dimensions/measures.

  2. Scripting: The powerhouse for multi-dataset merges, advanced SQL, or Python transformations. Key for scenarios with more than two datasets or custom logic.

  3. Data Fusion: A straightforward, UI-based method to merge exactly two datasets, requiring little technical know-how.

  4. Workflow: Typically, you start with metadata, do any necessary pipeline edits, then use Scripting for complex merges. Data Fusion is a simpler alternative for small merges.

  5. Performance & Refresh: Mindful scheduling and large-data considerations remain essential, especially the choice between live mode and in-memory caching.

By aligning data transformations with each tab’s capabilities, you can prepare data at scale, handle advanced merges, and maintain a user-friendly metadata layer in Tellius.

Schedule

  • Refresh can be set at the dataset level in Prepare → Schedule or at the Connect level.

  • Large parallel refreshes can overload the system, so the recommended practice is to refresh key datasets individually or in a controlled manner.

  • “Linked datasets” can auto-refresh if the parent dataset changes, but this is often turned off by default to avoid a chain reaction of massive refreshes.
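If refreshes are triggered from an external script rather than from the Schedule tab, "in a controlled manner" could mean capping how many run at once. The sketch below shows that idea with Python's standard `concurrent.futures` module. The `refresh` callable is a placeholder for whatever actually triggers a refresh in your environment; `refresh_in_batches`, `fake_refresh`, and the dataset names are all hypothetical, and no Tellius API is shown or implied.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def refresh_in_batches(
    dataset_names: Iterable[str],
    refresh: Callable[[str], str],
    max_parallel: int = 1,
) -> List[str]:
    """Run refresh() for each dataset with at most max_parallel running at a time."""
    # max_parallel=1 refreshes datasets strictly one after another.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns results in the same order as the input names.
        return list(pool.map(refresh, dataset_names))


def fake_refresh(name: str) -> str:
    """Placeholder standing in for a real refresh trigger (assumption)."""
    return f"{name}: refreshed"


results = refresh_in_batches(["sales", "customers", "products"], fake_refresh, max_parallel=1)
```

Raising `max_parallel` trades total refresh time for load on the system, so a small value is the cautious starting point.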
