data_loaders

Modules

load module

load_database_config(config_file=None, section='timescaledb')[source]

Load TimescaleDB configuration from file or environment variables.

This function loads database configuration from either: 1. An INI configuration file (if provided) 2. Environment variables (as fallback)

Configuration Priority: 1. Function parameters (highest priority) 2. INI file configuration 3. Environment variables (lowest priority)

Parameters:
  • config_file (str, optional) – Path to INI configuration file. If None, only environment variables are used.

  • section (str, optional) – Section name in INI file. Defaults to “timescaledb”.

Returns:

Database configuration dictionary with keys:
  • host: Database host address

  • port: Database port number

  • name: Database name

  • user: Database username

  • password: Database password (None if not set)

Return type:

dict

Environment Variables:

The following environment variables can be used as fallbacks: - TIMESCALEDB_HOST: Database host (default: “localhost”) - TIMESCALEDB_PORT: Database port (default: 5432) - TIMESCALEDB_NAME: Database name (default: “postgres”) - TIMESCALEDB_USER: Database username (default: “postgres”) - TIMESCALEDB_PASSWORD: Database password (default: None)

Example INI file format:

[timescaledb] host = localhost port = 5432 name = postgres user = postgres password = mypassword

Example

>>> config = load_database_config("database.ini")
>>> print(config)
{'host': 'localhost', 'port': 5432, 'name': 'postgres', 'user': 'postgres', 'password': 'mypassword'}
load_from_database(building_name, sensor_name=None, sensor_uuid=None, step_size=None, start_time=None, end_time=None, resample=True, resample_method='linear', clip=True, cache=True, cache_root=None, tz='Europe/Copenhagen', preserve_order=True, config_file=None, section='timescaledb', db_host=None, db_port=None, db_name=None, db_user=None, db_password=None)[source]

Load time series data from TimescaleDB database for building sensor data.

This function connects to a TimescaleDB database and loads sensor data from tables with the naming convention data_{building_name}. The database schema should have columns: time (TIMESTAMPTZ), uuid (TEXT), name (TEXT), and value (FLOAT).

Mathematical Formulation

The database query is formulated as:

\[\begin{split}Q(t) = \begin{cases} \{y(t) : \text{name} = s_{name} \land \text{uuid} = s_{uuid}\} & \text{if } s_{name}, s_{uuid} \text{ specified} \\ \{y(t) : \text{name} = s_{name}\} & \text{if only } s_{name} \text{ specified} \\ \{y(t) : \text{uuid} = s_{uuid}\} & \text{if only } s_{uuid} \text{ specified} \\ \{y(t)\} & \text{otherwise (all sensors)} \end{cases}\end{split}\]
where:
  • \(Q(t)\) is the query result at time \(t\)

  • \(s_{name}\) is the sensor name filter

  • \(s_{uuid}\) is the sensor UUID filter

  • \(y(t)\) represents sensor values at time \(t\)

For multiple sensors, data is transformed from long to wide format:

\[\mathbf{Y}(t) = \begin{bmatrix} y_1(t) & y_2(t) & \cdots & y_n(t) \end{bmatrix}\]
where:
  • \(\mathbf{Y}(t)\) is the wide-format data matrix at time \(t\)

  • \(y_i(t)\) is the value of sensor \(i\) at time \(t\)

  • \(n\) is the number of sensors

The function supports the same resampling and timezone conversion as the spreadsheet loader, following the mathematical formulations defined in the module docstring.

Configuration

Database connection can be configured through multiple methods (in order of priority): 1. Function parameters (highest priority) 2. Configuration file (INI format) 3. Environment variables (lowest priority)

type building_name:

param building_name:

Name of the building (e.g., “bldg1”, “bldg10”). The function will query the table named data_{building_name}.

type building_name:

str

type sensor_name:

param sensor_name:

Name of the sensor to filter by (e.g., “temperature”, “humidity”, “CO2”). If provided, only data from sensors with this name will be returned. If None, data from all sensors will be returned.

type sensor_name:

str, optional

type sensor_uuid:

param sensor_uuid:

UUID of the sensor to filter by. If provided, only data from sensors with this UUID will be returned. Can be used alone or in combination with sensor_name.

type sensor_uuid:

str, optional

type step_size:

param step_size:

Time step size in seconds for resampling (e.g., 300 for 5-minute intervals). Required if resample=True. Ignored if resample=False.

type step_size:

int, optional

type start_time:

param start_time:

Start time for data extraction. If timezone-naive, will be localized to ‘tz’. If None, no lower time bound is applied.

type start_time:

datetime, optional

type end_time:

param end_time:

End time for data extraction (exclusive). If timezone-naive, will be localized to ‘tz’. If None, no upper time bound is applied.

type end_time:

datetime, optional

type resample:

param resample:

Whether to resample the data to regular time intervals. If True, requires step_size, start_time, and end_time to be provided. Defaults to True.

type resample:

bool, optional

type resample_method:

param resample_method:

Resampling method to use when resample=True: - “linear”: Linear interpolation between data points - “constant”: Forward-fill with backward-fill for gaps Defaults to “linear”.

type resample_method:

str, optional

type clip:

param clip:

Whether to clip data to the specified start_time and end_time range. If False, all available data within the time range will be returned. Defaults to True.

type clip:

bool, optional

type cache:

param cache:

Whether to cache the results in pickle files for faster subsequent loads. Cache files are stored in generated_files/cached_data/ directory. Defaults to True.

type cache:

bool, optional

type cache_root:

param cache_root:

Root directory for cache files. If None, uses the default Twin4Build cache location.

type cache_root:

str, optional

type tz:

param tz:

Timezone for data processing. Can be timezone name (e.g., “Europe/Copenhagen”), UTC offset (e.g., “UTC+2”, “GMT-8”), or “UTC”. Defaults to “Europe/Copenhagen”.

type tz:

str, optional

type config_file:

param config_file:

Path to INI configuration file for database settings. If provided, database connection parameters will be loaded from this file. See load_database_config() for file format details.

type config_file:

str, optional

type section:

param section:

Section of the INI file to use for database settings.

type section:

str, optional

type db_host:

param db_host:

Database host address. Overrides config file and environment variables.

type db_host:

str, optional

type db_port:

param db_port:

Database port number. Overrides config file and environment variables.

type db_port:

int, optional

type db_name:

param db_name:

Database name. Overrides config file and environment variables.

type db_name:

str, optional

type db_user:

param db_user:

Database username. Overrides config file and environment variables.

type db_user:

str, optional

type db_password:

param db_password:

Database password. Overrides config file and environment variables.

type db_password:

str, optional

returns:
DataFrame with time series data. The index is a DatetimeIndex with timezone

information. Columns represent different sensors (if multiple sensors are selected) or a single column named after the sensor (if single sensor).

For multiple sensors, the DataFrame will have a wide format with each sensor as a separate column. For a single sensor, the DataFrame will have a single column named after the sensor.

rtype:

pandas.DataFrame

raises Exception:

If the specified table does not exist in the database, if database connection fails, if invalid resample_method is provided, or if required parameters are missing for resampling.

Example

Basic usage with configuration file:

from datetime import datetime, timezone
start_time = datetime(2023, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
end_time = datetime(2023, 1, 2, 0, 0, 0, tzinfo=timezone.utc)
df = load_from_database("bldg1", start_time=start_time, end_time=end_time, config_file="database.ini")

Load specific sensor with explicit connection parameters:

df = load_from_database(
    building_name="bldg1",
    sensor_name="temperature",
    start_time=start_time,
    end_time=end_time,
    step_size=300,
    db_host="192.168.1.100",
    db_port=5433,
    db_user="myuser",
    db_password="mypassword"
)

Note

The function requires psycopg2 to be installed for PostgreSQL connectivity. Database tables should follow the naming convention: data_{building_name}. Database schema should have columns: time, uuid, name, value. Timezone handling follows the same logic as load_from_spreadsheet. Caching uses the same mechanism as load_from_spreadsheet for consistency. For large datasets, consider using sensor_name or sensor_uuid filters to reduce memory usage and improve performance.

load_from_spreadsheet(filename, datecolumn=0, valuecolumn=None, step_size=None, start_time=None, end_time=None, resample=True, clip=True, cache=True, cache_root=None, tz='Europe/Copenhagen', preserve_order=True)[source]

This function loads a spead either in .csv or .xlsx format. The datetime should in the first column - timezone-naive inputs are localized as “tz”, while timezone-aware inputs are converted to “tz”. All data except for datetime column is converted to numeric data.

tz: can be “UTC+2”, “GMT-8” (no trailing zeros) or timezone name “Europe/Copenhagen”

preserve_order: If True, the order of rows in the spreadsheet are important in order to resolve DST when timezone information is not available

PRINT THE FOLLOWING TO SEE AVAILABLE NAMES: from dateutil.zoneinfo import getzoneinfofile_stream, ZoneInfoFile print(ZoneInfoFile(getzoneinfofile_stream()).zones.keys())

parseDateStr(s)[source]
sample_from_df(df, datecolumn=0, valuecolumn=None, step_size=None, start_time=None, end_time=None, resample=True, resample_method='linear', clip=True, tz='Europe/Copenhagen', preserve_order=True)[source]

Sample and process time series data from a DataFrame with various resampling options.

This function processes time series data with support for resampling, timezone conversion, and data clipping. It handles both constant and linear resampling methods.

Mathematical Formulation

  1. Time Series Resampling: a) Constant Resampling:

    For each time step \(t\):

    \[y(t) = y(t_{last})\]

    where \(t_{last}\) is the last available data point before \(t\)

    1. Linear Resampling: For each time step \(t\):

      \[y(t) = y(t_1) + \frac{t - t_1}{t_2 - t_1} \cdot (y(t_2) - y(t_1))\]

      where: - \(t_1\) is the last available data point before \(t\) - \(t_2\) is the first available data point after \(t\)

  2. Time Zone Conversion: For a time \(t\) in timezone \(TZ_1\):

    \[t_{TZ_2} = t_{TZ_1} + \Delta TZ\]

    where: - \(t_{TZ_2}\) is the time in target timezone - \(\Delta TZ\) is the time difference between timezones

  3. Data Clipping: For a time series \(y(t)\):

    \[\begin{split}y_{clipped}(t) = \begin{cases} y(t) & \text{if } t_{start} \leq t < t_{end} \\ \text{undefined} & \text{otherwise} \end{cases}\end{split}\]
type df:

param df:

Input DataFrame with time series data

type df:

pandas.DataFrame

type datecolumn:

param datecolumn:

Column index containing datetime information

type datecolumn:

int

type valuecolumn:

param valuecolumn:

Column index containing values to process

type valuecolumn:

int, optional

type step_size:

param step_size:

Time step size in seconds for resampling

type step_size:

int, optional

type start_time:

param start_time:

Start time for data extraction

type start_time:

datetime, optional

type end_time:

param end_time:

End time for data extraction

type end_time:

datetime, optional

type resample:

param resample:

Whether to resample data to regular intervals

type resample:

bool

type resample_method:

param resample_method:

Resampling method (“linear” or “constant”)

type resample_method:

str

type clip:

param clip:

Whether to clip data to specified time range

type clip:

bool

type tz:

param tz:

Timezone for data processing

type tz:

str

type preserve_order:

param preserve_order:

Whether to preserve original data order

type preserve_order:

bool

returns:

Processed DataFrame with resampled time series data

rtype:

pandas.DataFrame