Data Size Estimation and Calculation

Updated by Radu Miclaus

This document describes the algorithm Tellius uses to estimate the size of a dataset, as per the Tellius specifications.

Data Size Calculation Algorithm

Tellius estimates dataset size based on a CSV-like representation, using a fixed number of bytes for each data type. The byte sizes per data type are:

  • Int: 2 bytes for the integer columns. The range of values is -2147483648 to 2147483647
  • Bigint: 4 bytes for the Bigint or Long columns. The range of values is -9223372036854775808 to 9223372036854775807
  • Float: 4 bytes for the Float columns. The range of values is 1.1754E-38 to 3.4028E+38
  • Double: 8 bytes for the Double columns. The range of values is 2.2250E-308 to 1.7976E+308
  • Boolean: 1 byte for the Boolean columns. Values are true/false
  • String: 15 bytes for the String columns.

The more accurate approach for strings is to base the size on the maximum string length in the column: a column like country code would be 2 bytes, while a column holding user comments might be 20 to 30 bytes. 15 bytes is an average starting point.

Based on the above sizes, we calculate the size of each row and then multiply by the number of rows to calculate the total size of the dataset.
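The calculation above can be sketched as follows. This is a minimal illustration, not Tellius's actual implementation; the function and dictionary names are made up for the example.

```python
# Approximate per-type byte sizes from the table above.
# "string" uses the 15-byte average; ideally it would be the
# max string length observed in each column.
TYPE_SIZES = {
    "int": 2,
    "bigint": 4,
    "float": 4,
    "double": 8,
    "boolean": 1,
    "string": 15,
}

def estimate_dataset_size(column_types, row_count):
    """Estimate dataset size in bytes: sum of column sizes times row count."""
    row_size = sum(TYPE_SIZES[t] for t in column_types)
    return row_size * row_count

# Example: 1M rows of (int, double, string, boolean)
# -> (2 + 8 + 15 + 1) * 1,000,000 = 26,000,000 bytes (~26 MB)
size = estimate_dataset_size(["int", "double", "string", "boolean"], 1_000_000)
```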

Data Size Threshold

Based on the data capacity of the instance and the total datasets loaded, Tellius checks if the new dataset being loaded can be accommodated within the available capacity.

How the dataset size is obtained for this step depends on the data source the dataset is loaded from.

  • CSV / XLSX/ URL / S3 / HDFS/ AzureBlob / JSON / XML / Unstructured Text / FTP

Datasets loaded from these file-based sources offer no way to determine the total number of rows before loading the data. So Tellius takes the size of the files or folders and compares it against the available data capacity to check whether the load can proceed.

This check is not fully reliable: a Parquet file can be roughly 10x smaller than the equivalent CSV, so a 1 GB Parquet file will be allowed to load when only 1 GB of capacity is available, yet the dataset may occupy 10 GB once loaded. Subsequent dataset loads would then be blocked.
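A rough sketch of this file-size check, assuming a simple used/available capacity model (the function name and signature are illustrative only):

```python
import os

def can_load_file_source(paths, used_bytes, capacity_bytes):
    """File-based sources: compare on-disk size against remaining capacity.

    Only an approximation -- a compressed format such as Parquet may
    expand roughly 10x once loaded, so this check can under-count.
    """
    incoming = sum(os.path.getsize(p) for p in paths)
    return used_bytes + incoming <= capacity_bytes
```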

  • Oracle / MemSQL / MySQL / Postgresql / Redshift / MS SQL / Teradata / Snowflake / JDBC / Exasol

Datasets loaded from these databases, which support JDBC connections, follow size estimation based on the data types and the total number of rows. Tellius pulls sample data to infer the schema of the dataset, queries the database for the total number of rows in the table, and estimates the table's size. This estimate is then checked against the instance limits.
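The database path can be illustrated with an in-memory SQLite table standing in for a JDBC source; the per-type sizes and function below are illustrative, not Tellius's API:

```python
import sqlite3

# Illustrative per-type byte sizes, matching the table above.
TYPE_SIZES = {"int": 2, "bigint": 4, "double": 8, "string": 15}

def estimate_table_size(conn, table, column_types):
    """Estimate table size: COUNT(*) rows times the estimated row size."""
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    row_size = sum(TYPE_SIZES[t] for t in column_types)
    return rows * row_size

# Demo: 1000 rows of (int, double, string)
# -> (2 + 8 + 15) * 1000 = 25,000 bytes
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, note TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i * 1.5, "ok") for i in range(1000)])
size = estimate_table_size(conn, "orders", ["int", "double", "string"])
```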

  • MongoDB / Cassandra / ES / Salesforce / Google Analytics / Impala / Hive / Big Query

Datasets loaded from these sources have no size estimation, so no check is performed when they are loaded. They are allowed to load unconditionally; after the dataset is loaded, its size is calculated and added to the total used capacity, which can block subsequent dataset loads.
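The load-first, charge-afterwards behavior can be sketched with a small capacity tracker (a hypothetical class, assuming the simple used/capacity model above):

```python
class CapacityTracker:
    """Unchecked sources are admitted unconditionally; their measured
    size is charged against capacity only after the load completes."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0

    def can_load_estimated(self, estimated_bytes):
        # Pre-load check used by sources that CAN be estimated.
        return self.used + estimated_bytes <= self.capacity

    def record_loaded(self, measured_bytes):
        # Post-load accounting; may push usage past capacity,
        # in which case later loads are blocked.
        self.used += measured_bytes

tracker = CapacityTracker(10)
tracker.record_loaded(12)        # unchecked source loads anyway
tracker.can_load_estimated(1)    # capacity now exhausted -> False
```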
