data_loaders
Modules
load module
- load_database_config(config_file=None, section='timescaledb')[source]
Load TimescaleDB configuration from file or environment variables.
This function loads database configuration from either: 1. An INI configuration file (if provided) 2. Environment variables (as fallback)
Configuration Priority: 1. Function parameters (highest priority) 2. INI file configuration 3. Environment variables (lowest priority)
- Parameters:
config_file (str, optional) – Path to INI configuration file. If None, only environment variables are used.
section (str, optional) – Section name in INI file. Defaults to “timescaledb”.
- Returns:
- Database configuration dictionary with keys:
host: Database host address
port: Database port number
name: Database name
user: Database username
password: Database password (None if not set)
- Return type:
dict
- Environment Variables:
The following environment variables can be used as fallbacks: - TIMESCALEDB_HOST: Database host (default: “localhost”) - TIMESCALEDB_PORT: Database port (default: 5432) - TIMESCALEDB_NAME: Database name (default: “postgres”) - TIMESCALEDB_USER: Database username (default: “postgres”) - TIMESCALEDB_PASSWORD: Database password (default: None)
- Example INI file format:
[timescaledb] host = localhost port = 5432 name = postgres user = postgres password = mypassword
Example
>>> config = load_database_config("database.ini") >>> print(config) {'host': 'localhost', 'port': 5432, 'name': 'postgres', 'user': 'postgres', 'password': 'mypassword'}
- load_from_database(building_name, sensor_name=None, sensor_uuid=None, step_size=None, start_time=None, end_time=None, resample=True, resample_method='linear', clip=True, cache=True, cache_root=None, tz='Europe/Copenhagen', preserve_order=True, config_file=None, section='timescaledb', db_host=None, db_port=None, db_name=None, db_user=None, db_password=None)[source]
Load time series data from TimescaleDB database for building sensor data.
This function connects to a TimescaleDB database and loads sensor data from tables with the naming convention data_{building_name}. The database schema should have columns: time (TIMESTAMPTZ), uuid (TEXT), name (TEXT), and value (FLOAT).
Mathematical Formulation
The database query is formulated as:
\[\begin{split}Q(t) = \begin{cases} \{y(t) : \text{name} = s_{name} \land \text{uuid} = s_{uuid}\} & \text{if } s_{name}, s_{uuid} \text{ specified} \\ \{y(t) : \text{name} = s_{name}\} & \text{if only } s_{name} \text{ specified} \\ \{y(t) : \text{uuid} = s_{uuid}\} & \text{if only } s_{uuid} \text{ specified} \\ \{y(t)\} & \text{otherwise (all sensors)} \end{cases}\end{split}\]- where:
\(Q(t)\) is the query result at time \(t\)
\(s_{name}\) is the sensor name filter
\(s_{uuid}\) is the sensor UUID filter
\(y(t)\) represents sensor values at time \(t\)
For multiple sensors, data is transformed from long to wide format:
\[\mathbf{Y}(t) = \begin{bmatrix} y_1(t) & y_2(t) & \cdots & y_n(t) \end{bmatrix}\]- where:
\(\mathbf{Y}(t)\) is the wide-format data matrix at time \(t\)
\(y_i(t)\) is the value of sensor \(i\) at time \(t\)
\(n\) is the number of sensors
The function supports the same resampling and timezone conversion as the spreadsheet loader, following the mathematical formulations defined in the module docstring.
Configuration
Database connection can be configured through multiple methods (in order of priority): 1. Function parameters (highest priority) 2. Configuration file (INI format) 3. Environment variables (lowest priority)
- type building_name:
- param building_name:
Name of the building (e.g., “bldg1”, “bldg10”). The function will query the table named data_{building_name}.
- type building_name:
str
- type sensor_name:
- param sensor_name:
Name of the sensor to filter by (e.g., “temperature”, “humidity”, “CO2”). If provided, only data from sensors with this name will be returned. If None, data from all sensors will be returned.
- type sensor_name:
str, optional
- type sensor_uuid:
- param sensor_uuid:
UUID of the sensor to filter by. If provided, only data from sensors with this UUID will be returned. Can be used alone or in combination with sensor_name.
- type sensor_uuid:
str, optional
- type step_size:
- param step_size:
Time step size in seconds for resampling (e.g., 300 for 5-minute intervals). Required if resample=True. Ignored if resample=False.
- type step_size:
int, optional
- type start_time:
- param start_time:
Start time for data extraction. If timezone-naive, will be localized to ‘tz’. If None, no lower time bound is applied.
- type start_time:
datetime, optional
- type end_time:
- param end_time:
End time for data extraction (exclusive). If timezone-naive, will be localized to ‘tz’. If None, no upper time bound is applied.
- type end_time:
datetime, optional
- type resample:
- param resample:
Whether to resample the data to regular time intervals. If True, requires step_size, start_time, and end_time to be provided. Defaults to True.
- type resample:
bool, optional
- type resample_method:
- param resample_method:
Resampling method to use when resample=True: - “linear”: Linear interpolation between data points - “constant”: Forward-fill with backward-fill for gaps Defaults to “linear”.
- type resample_method:
str, optional
- type clip:
- param clip:
Whether to clip data to the specified start_time and end_time range. If False, all available data within the time range will be returned. Defaults to True.
- type clip:
bool, optional
- type cache:
- param cache:
Whether to cache the results in pickle files for faster subsequent loads. Cache files are stored in generated_files/cached_data/ directory. Defaults to True.
- type cache:
bool, optional
- type cache_root:
- param cache_root:
Root directory for cache files. If None, uses the default Twin4Build cache location.
- type cache_root:
str, optional
- type tz:
- param tz:
Timezone for data processing. Can be timezone name (e.g., “Europe/Copenhagen”), UTC offset (e.g., “UTC+2”, “GMT-8”), or “UTC”. Defaults to “Europe/Copenhagen”.
- type tz:
str, optional
- type config_file:
- param config_file:
Path to INI configuration file for database settings. If provided, database connection parameters will be loaded from this file. See load_database_config() for file format details.
- type config_file:
str, optional
- type section:
- param section:
Section of the INI file to use for database settings.
- type section:
str, optional
- type db_host:
- param db_host:
Database host address. Overrides config file and environment variables.
- type db_host:
str, optional
- type db_port:
- param db_port:
Database port number. Overrides config file and environment variables.
- type db_port:
int, optional
- type db_name:
- param db_name:
Database name. Overrides config file and environment variables.
- type db_name:
str, optional
- type db_user:
- param db_user:
Database username. Overrides config file and environment variables.
- type db_user:
str, optional
- type db_password:
- param db_password:
Database password. Overrides config file and environment variables.
- type db_password:
str, optional
- returns:
- DataFrame with time series data. The index is a DatetimeIndex with timezone
information. Columns represent different sensors (if multiple sensors are selected) or a single column named after the sensor (if single sensor).
For multiple sensors, the DataFrame will have a wide format with each sensor as a separate column. For a single sensor, the DataFrame will have a single column named after the sensor.
- rtype:
pandas.DataFrame
- raises Exception:
If the specified table does not exist in the database, if database connection fails, if invalid resample_method is provided, or if required parameters are missing for resampling.
Example
Basic usage with configuration file:
from datetime import datetime, timezone start_time = datetime(2023, 1, 1, 0, 0, 0, tzinfo=timezone.utc) end_time = datetime(2023, 1, 2, 0, 0, 0, tzinfo=timezone.utc) df = load_from_database("bldg1", start_time=start_time, end_time=end_time, config_file="database.ini")
Load specific sensor with explicit connection parameters:
df = load_from_database( building_name="bldg1", sensor_name="temperature", start_time=start_time, end_time=end_time, step_size=300, db_host="192.168.1.100", db_port=5433, db_user="myuser", db_password="mypassword" )
Note
The function requires psycopg2 to be installed for PostgreSQL connectivity. Database tables should follow the naming convention: data_{building_name}. Database schema should have columns: time, uuid, name, value. Timezone handling follows the same logic as load_from_spreadsheet. Caching uses the same mechanism as load_from_spreadsheet for consistency. For large datasets, consider using sensor_name or sensor_uuid filters to reduce memory usage and improve performance.
- load_from_spreadsheet(filename, datecolumn=0, valuecolumn=None, step_size=None, start_time=None, end_time=None, resample=True, clip=True, cache=True, cache_root=None, tz='Europe/Copenhagen', preserve_order=True)[source]
This function loads a spead either in .csv or .xlsx format. The datetime should in the first column - timezone-naive inputs are localized as “tz”, while timezone-aware inputs are converted to “tz”. All data except for datetime column is converted to numeric data.
tz: can be “UTC+2”, “GMT-8” (no trailing zeros) or timezone name “Europe/Copenhagen”
preserve_order: If True, the order of rows in the spreadsheet are important in order to resolve DST when timezone information is not available
PRINT THE FOLLOWING TO SEE AVAILABLE NAMES: from dateutil.zoneinfo import getzoneinfofile_stream, ZoneInfoFile print(ZoneInfoFile(getzoneinfofile_stream()).zones.keys())
- sample_from_df(df, datecolumn=0, valuecolumn=None, step_size=None, start_time=None, end_time=None, resample=True, resample_method='linear', clip=True, tz='Europe/Copenhagen', preserve_order=True)[source]
Sample and process time series data from a DataFrame with various resampling options.
This function processes time series data with support for resampling, timezone conversion, and data clipping. It handles both constant and linear resampling methods.
Mathematical Formulation
Time Series Resampling: a) Constant Resampling:
For each time step \(t\):
\[y(t) = y(t_{last})\]where \(t_{last}\) is the last available data point before \(t\)
Linear Resampling: For each time step \(t\):
\[y(t) = y(t_1) + \frac{t - t_1}{t_2 - t_1} \cdot (y(t_2) - y(t_1))\]where: - \(t_1\) is the last available data point before \(t\) - \(t_2\) is the first available data point after \(t\)
Time Zone Conversion: For a time \(t\) in timezone \(TZ_1\):
\[t_{TZ_2} = t_{TZ_1} + \Delta TZ\]where: - \(t_{TZ_2}\) is the time in target timezone - \(\Delta TZ\) is the time difference between timezones
Data Clipping: For a time series \(y(t)\):
\[\begin{split}y_{clipped}(t) = \begin{cases} y(t) & \text{if } t_{start} \leq t < t_{end} \\ \text{undefined} & \text{otherwise} \end{cases}\end{split}\]
- type df:
- param df:
Input DataFrame with time series data
- type df:
pandas.DataFrame
- type datecolumn:
- param datecolumn:
Column index containing datetime information
- type datecolumn:
int
- type valuecolumn:
- param valuecolumn:
Column index containing values to process
- type valuecolumn:
int, optional
- type step_size:
- param step_size:
Time step size in seconds for resampling
- type step_size:
int, optional
- type start_time:
- param start_time:
Start time for data extraction
- type start_time:
datetime, optional
- type end_time:
- param end_time:
End time for data extraction
- type end_time:
datetime, optional
- type resample:
- param resample:
Whether to resample data to regular intervals
- type resample:
bool
- type resample_method:
- param resample_method:
Resampling method (“linear” or “constant”)
- type resample_method:
str
- type clip:
- param clip:
Whether to clip data to specified time range
- type clip:
bool
- type tz:
- param tz:
Timezone for data processing
- type tz:
str
- type preserve_order:
- param preserve_order:
Whether to preserve original data order
- type preserve_order:
bool
- returns:
Processed DataFrame with resampled time series data
- rtype:
pandas.DataFrame