🚧 Understanding data partitioning

Learn how partitioning works in Tellius and which settings control it

What is data partitioning?

Data partitioning is the process of dividing a large dataset into multiple, more manageable subsets (partitions). Instead of treating millions or even billions of rows as a single block, partitioning breaks the data down into logical “chunks” based on a chosen numeric column and user-defined value ranges. This approach dramatically improves the speed and efficiency of data loading and querying, especially as your data scales.

By using partitioning, you reduce the time it takes to bring data into Tellius and enhance performance during analyses. Large datasets—such as several years of transactional data—can be spread across multiple partitions for parallel loading and faster overall processing.

How do you partition your data?

Partitioning is controlled by three settings:

  1. Partition column: A numeric column from your dataset used as the basis for partitioning. The values in this column determine how rows are divided across partitions.

  2. Number of partitions: How many segments or “buckets” you want to split your data into. More partitions can improve loading speed by allowing more operations to run in parallel, but too many partitions add overhead. As a rule of thumb, aim for roughly 1 to 2 million rows per partition.

  3. Lower bound & Upper bound: Approximate minimum (lower) and maximum (upper) numeric values in the partition column. These bounds define the range over which the data is split. Tellius uses them to decide which rows go into each partition.
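The rule of thumb for the number of partitions can be turned into a quick calculation. The sketch below is only an illustration: the function name `suggest_partition_count` and the default target of 1.5 million rows (the midpoint of the 1 to 2 million guideline) are assumptions made for this example, not Tellius settings.

```python
def suggest_partition_count(total_rows: int,
                            target_rows_per_partition: int = 1_500_000) -> int:
    """Estimate a partition count from the total row count.

    Hypothetical helper: the 1.5 million default is an assumed midpoint of
    the 1 to 2 million rows-per-partition guideline.
    """
    if total_rows < 0:
        raise ValueError("total_rows must not be negative")
    if target_rows_per_partition <= 0:
        raise ValueError("target_rows_per_partition must be positive")
    # Integer ceiling division, so a partial partition still counts as one.
    count = -(-total_rows // target_rows_per_partition)
    # Always use at least one partition, even for tiny datasets.
    return max(1, count)


if __name__ == "__main__":
    # 30 million rows at about 1.5 million rows per partition.
    print(suggest_partition_count(30_000_000))
```

Treat the result as a starting point and adjust it after you see how the data actually loads.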

How partitioning works: A practical example

Imagine you have a dataset containing records from the years 2010 to 2020, and you choose the “Year” column as your partition key:

  • Lower bound: 2010

  • Upper bound: 2020

  • Number of partitions: 12

Tellius will create approximately 12 partitions spanning the range, distributing data as follows:

  • Partition 1: Values less than 2010

  • Partition 2: 2010 – 2011

  • Partition 3: 2011 – 2012

  • ... and so forth until ...

  • Partition 11: 2019 – 2020

  • Partition 12: Values greater than 2020

When records are spread fairly evenly across the years, each partition handles a manageable slice of data, which speeds up the load. You can tune the number of partitions and the bounds further to suit the scale and distribution of your data.
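The layout in this example can be reproduced with a few lines of arithmetic. The following is a minimal sketch that mirrors the example only: one open-ended partition below the lower bound, one above the upper bound, and equal-width ranges in between. The function name `describe_partitions` is hypothetical, and Tellius's internal boundary handling (for instance, which partition receives a value that sits exactly on a boundary) is not specified here and may differ.

```python
from typing import List, Optional, Tuple

# (partition number, low, high); None marks an open-ended side.
PartitionRange = Tuple[int, Optional[float], Optional[float]]


def describe_partitions(lower: float, upper: float,
                        num_partitions: int) -> List[PartitionRange]:
    """List partition ranges in the style of the 2010 to 2020 example.

    Illustrative assumption: the first and last partitions are open-ended
    and the remaining partitions split [lower, upper] into equal widths.
    """
    if num_partitions < 3:
        raise ValueError("this sketch needs at least 3 partitions")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")

    interior = num_partitions - 2          # partitions between the bounds
    width = (upper - lower) / interior     # size of each interior range

    ranges: List[PartitionRange] = [(1, None, lower)]  # values below lower
    for i in range(interior):
        ranges.append((i + 2, lower + i * width, lower + (i + 1) * width))
    ranges.append((num_partitions, upper, None))       # values above upper
    return ranges


if __name__ == "__main__":
    for number, low, high in describe_partitions(2010, 2020, 12):
        print(number, low, high)
```

With a lower bound of 2010, an upper bound of 2020, and 12 partitions, the interior width works out to one year, which matches the ranges listed above.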

Best practices and tips

  1. Start with estimates: If you’re unsure of exact bounds, let Tellius determine them automatically first. Once you see the distribution, you can refine the partition settings.

  2. Monitor performance: After initial loads, review load times and partition sizes. Adjust the number of partitions or bounds as necessary to improve speed.

  3. Keep it simple: Avoid overly granular partitions if you have a small dataset. A handful of partitions can suffice. For very large datasets, ramp up the number of partitions to meet your performance targets.

  4. Follow the 1-2 million rows per partition guideline: This heuristic helps maintain a balance between performance and overhead.
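When reviewing partition sizes after a load, a small check against the guideline can show where to adjust. The sketch below assumes you already have a row count for each partition from your own inspection of the data; the function name `review_partition_sizes` and the way the counts are supplied are hypothetical and do not represent a Tellius API.

```python
from typing import Dict, Sequence


def review_partition_sizes(row_counts: Sequence[int],
                           min_rows: int = 1_000_000,
                           max_rows: int = 2_000_000) -> Dict[int, str]:
    """Flag partitions that fall outside the 1 to 2 million row guideline.

    Hypothetical helper: row_counts holds one count per partition, in
    partition order. Partitions within the guideline are left out.
    """
    report: Dict[int, str] = {}
    for number, rows in enumerate(row_counts, start=1):
        if rows > max_rows:
            report[number] = "oversized"
        elif rows < min_rows:
            report[number] = "undersized"
    return report


if __name__ == "__main__":
    # Assumed counts for three partitions, for illustration only.
    print(review_partition_sizes([1_200_000, 4_500_000, 300_000]))
```

Many oversized partitions suggest increasing the number of partitions, while many undersized ones suggest reducing it or revisiting the bounds.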

As your data grows, well-chosen partitions help load and query performance keep pace, so analytics in Tellius stay responsive even on very large datasets.
