🚧Understanding partitioning your data
Learn how partitioning works in Tellius
What is data partitioning?
Data partitioning is the process of dividing a large dataset into multiple, more manageable subsets (partitions). Instead of treating millions or even billions of rows as a single block, partitioning breaks the data down into logical “chunks” based on a chosen numeric column and user-defined value ranges. This approach dramatically improves the speed and efficiency of data loading and querying, especially as your data scales.
By using partitioning, you reduce the time it takes to bring data into Tellius and enhance performance during analyses. Large datasets—such as several years of transactional data—can be spread across multiple partitions for parallel loading and faster overall processing.
How to partition your data
Partition column: A numeric column from your dataset used as the basis for partitioning. The values in this column help define how data is divided across partitions.
Choose a column with a relatively uniform distribution, such as a timestamp or year column, to achieve balanced partitions. For example, a “year” column spanning a range of years is often a good candidate. A skewed distribution may make some partitions larger and slower to load than others, reducing the performance benefit.
Number of partitions: Indicates how many segments or “buckets” you want to split your data into. More partitions can improve loading speeds by allowing parallel operations, but too many partitions may become cumbersome. The general rule of thumb is to have partitions sized around 1 to 2 million rows each.
If you have approximately 16 million rows, consider 8 to 10 partitions. This helps ensure each partition has roughly 1-2 million rows, balancing load performance and manageability.
Lower bound & Upper bound: Approximate minimum (lower) and maximum (upper) numeric values in the partition column. These bounds define the range over which the data will be split. Tellius uses these values to determine how the data is distributed across each partition.
Tellius can estimate these bounds automatically, but providing explicit lower and upper bounds can improve efficiency and accuracy—especially if you have prior knowledge of your data’s range.
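The sizing guideline above can be turned into quick arithmetic. The sketch below is only an illustration: the function name `partition_count_range` and its parameters are hypothetical and not part of Tellius. It computes the fewest and the most partitions that keep each one between 1 and 2 million rows.

```python
import math


def partition_count_range(total_rows, min_rows=1_000_000, max_rows=2_000_000):
    """Return (fewest, most) partition counts for the 1-2 million row guideline.

    Hypothetical helper for illustration; not a Tellius API.
    """
    # Fewest partitions: each partition holds at most max_rows rows.
    fewest = max(1, math.ceil(total_rows / max_rows))
    # Most partitions: each partition still holds at least min_rows rows.
    most = max(fewest, total_rows // min_rows)
    return fewest, most


print(partition_count_range(16_000_000))  # (8, 16)
```

For 16 million rows this gives a range of 8 to 16 partitions. The 8 to 10 suggested above sits at the lower end of that range, at roughly 1.6 to 2 million rows per partition.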
How partitioning works: A practical example
Imagine you have a dataset containing records from the years 2010 to 2020, and you choose the “Year” column as your partition key:
Lower bound: 2010
Upper bound: 2020
Number of partitions: 12
Tellius will create approximately 12 partitions spanning the range, distributing data as follows:
Partition 1: Values less than 2010
Partition 2: 2010 – 2011
Partition 3: 2011 – 2012
... and so forth until ...
Partition 11: 2019 – 2020
Partition 12: Values greater than 2020
This balanced approach ensures that each partition handles a manageable slice of data, thus accelerating the load process. By refining the number of partitions and bounds, you can further optimize performance to suit your data scale and distribution.
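To make the layout above concrete, here is a minimal sketch that maps a value to a partition number. It is one possible reading of the example, not Tellius's actual implementation: the function `partition_index` is hypothetical, and it assumes that inner ranges include their lower edge and exclude their upper edge, with the upper bound itself falling in the last inner partition. How Tellius treats boundary values is not stated in this article.

```python
def partition_index(value, lower, upper, num_partitions):
    """Map a value to a 1-based partition number.

    Assumed layout (taken from the example above):
      partition 1                -> values below the lower bound
      partitions 2 .. n-1        -> equal-width ranges between the bounds
      partition n                -> values above the upper bound
    """
    if num_partitions < 3:
        raise ValueError("this layout needs at least 3 partitions")
    if upper <= lower:
        raise ValueError("upper bound must be greater than lower bound")
    if value < lower:
        return 1
    if value > upper:
        return num_partitions
    inner = num_partitions - 2             # partitions between the bounds
    width = (upper - lower) / inner        # width of each inner range
    offset = int((value - lower) // width)
    # Assumption: a value equal to the upper bound stays in the last inner range.
    return min(2 + offset, num_partitions - 1)


for year in (2009, 2010, 2015, 2020, 2021):
    print(year, partition_index(year, 2010, 2020, 12))
```

With bounds of 2010 and 2020 and 12 partitions, each inner range is one year wide, matching the list above.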
Best practices and tips
Start with estimates: If you’re unsure of exact bounds, let Tellius determine them automatically first. Once you see the distribution, you can refine the partition settings.
Monitor performance: After initial loads, review load times and partition sizes. Adjust the number of partitions or bounds as necessary to improve speed.
Keep it simple: Avoid overly granular partitions if you have a small dataset. A handful of partitions can suffice. For very large datasets, ramp up the number of partitions to meet your performance targets.
Follow the 1-2 million rows per partition guideline: This heuristic helps maintain a balance between performance and overhead.
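When reviewing partition sizes after a load, a small check like the following can highlight partitions that fall outside the guideline. This is a sketch under assumptions: `flag_unbalanced` is a hypothetical helper, and the row counts are made-up sample values that you would replace with counts gathered from your own data.

```python
def flag_unbalanced(partition_row_counts, min_rows=1_000_000, max_rows=2_000_000):
    """Return {partition_id: row_count} for partitions outside the target range.

    Hypothetical helper for illustration; not a Tellius API.
    """
    return {
        pid: rows
        for pid, rows in partition_row_counts.items()
        if rows < min_rows or rows > max_rows
    }


# Made-up sample counts: partition 3 is oversized, partition 1 is nearly empty.
sample_counts = {1: 20_000, 2: 1_500_000, 3: 4_200_000, 4: 1_800_000}
print(flag_unbalanced(sample_counts))  # {1: 20000, 3: 4200000}
```

An oversized partition suggests a skewed partition column or bounds that are too wide; a nearly empty one may simply be a partition that sits outside the bounds.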
With partitioning, as your data grows, your performance scales right alongside it—ensuring that even with massive datasets, you can maintain responsive, high-performance analytics in Tellius.