Data Requirements

Dataset input requirements

In this example we go through the dataset input requirements of the core.NeuralForecast class. The core.NeuralForecast methods operate as global models that receive a set of time series rather than a single series. The class uses a cross-learning technique to fit flexible shared models such as neural networks, which improves generalization, as shown in the M4 international forecasting competition (Smyl 2019, Semenoglou 2021). While missing values are supported, the data must be uniformly sampled, that is, consecutive timesteps must be evenly spaced. You can run these experiments using GPU with Google Colab.
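The uniform-sampling requirement can be checked before training. The following is a minimal sketch, not part of the library: the helper name is_evenly_spaced, the toy dataframe, and the daily frequency are assumptions made for illustration. It compares each series' timestamps with the regular grid that pd.date_range produces for the same start, length, and frequency.

```python
import pandas as pd

def is_evenly_spaced(df, freq, id_col="unique_id", time_col="ds"):
    """Hypothetical helper: True for each series whose timestamps form a gap-free grid."""
    def _check(ts):
        ts = ts.sort_values()
        # Regular grid with the same start, length and frequency as the series.
        expected = pd.date_range(ts.iloc[0], periods=len(ts), freq=freq)
        return bool((ts.to_numpy() == expected.to_numpy()).all())
    return df.groupby(id_col)[time_col].apply(_check)

# Toy data (made up for this sketch): series B skips 2024-01-03.
toy = pd.DataFrame({
    "unique_id": ["A"] * 4 + ["B"] * 4,
    "ds": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04",  # A: daily, no gaps
        "2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05",  # B: one day absent
    ]),
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

spacing_ok = is_evenly_spaced(toy, freq="D")
print(spacing_ok)
```

Comparing against a generated grid, rather than checking for a constant time difference, also works for calendar frequencies such as month start, where the number of days between timestamps varies.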

Long format

Multiple time series

Store your time series in a pandas dataframe in long format, that is, each row represents an observation for a specific series and timestamp. If your series are stored separately, you can stack them into one dataframe, for example Y_df = pd.concat([series1, series2, ...]). Let’s see an example using the datasetsforecast library.

%%capture
!pip install datasetsforecast

import pandas as pd
from datasetsforecast.m3 import M3

Y_df, *_ = M3.load('./data', group='Yearly')

Y_df.groupby('unique_id').head(2)

	unique_id	ds	y
0	Y1	1975-12-31	940.66
1	Y1	1976-12-31	1084.86
20	Y10	1975-12-31	2160.04
21	Y10	1976-12-31	2553.48
40	Y100	1975-12-31	1424.70
…	…	…	…
18260	Y97	1976-12-31	1618.91
18279	Y98	1975-12-31	1164.97
18280	Y98	1976-12-31	1277.87
18299	Y99	1975-12-31	1870.00
18300	Y99	1976-12-31	1307.20

Y_df is a dataframe with three columns: unique_id with a unique identifier for each time series, a column ds with the datestamp and a column y with the values of the series.
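If your series are not yet in this layout, the three-column long format can be assembled with pd.concat. The sketch below uses made-up data; series1, series2, the identifiers "S1" and "S2", and the helper to_long are hypothetical names chosen for illustration.

```python
import pandas as pd

# Two hypothetical series stored separately, each indexed by timestamp.
# They do not need to have the same length.
series1 = pd.Series([1.0, 2.0, 3.0],
                    index=pd.date_range("2024-01-01", periods=3, freq="D"))
series2 = pd.Series([10.0, 20.0],
                    index=pd.date_range("2024-01-01", periods=2, freq="D"))

def to_long(series, uid):
    """Turn one timestamp-indexed series into unique_id / ds / y rows."""
    return pd.DataFrame({"unique_id": uid, "ds": series.index, "y": series.to_numpy()})

# Stack the per-series frames vertically: one row per (series, timestamp).
long_df = pd.concat([to_long(series1, "S1"), to_long(series2, "S2")],
                    ignore_index=True)
print(long_df)
```

Because each row carries its own unique_id, series of different lengths or start dates can share the same dataframe.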

Single time series

Even if you have only one time series, you still have to include the unique_id column. Consider, for example, the AirPassengers dataset.

Y_df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/air_passengers.csv')
Y_df

	timestamp	value
0	1949-01-01	112
1	1949-02-01	118
2	1949-03-01	132
3	1949-04-01	129
4	1949-05-01	121
…	…	…
139	1960-08-01	606
140	1960-09-01	508
141	1960-10-01	461
142	1960-11-01	390
143	1960-12-01	432

In this example Y_df contains only two columns, timestamp and value. To use NeuralForecast we have to add the unique_id column and rename the existing columns to ds and y.

Y_df['unique_id'] = 1. # Any constant works as the identifier of the only series (here the float 1.0)
Y_df = Y_df.rename(columns={'timestamp': 'ds', 'value': 'y'})
Y_df = Y_df[['unique_id', 'ds', 'y']]
Y_df

	unique_id	ds	y
0	1.0	1949-01-01	112
1	1.0	1949-02-01	118
2	1.0	1949-03-01	132
3	1.0	1949-04-01	129
4	1.0	1949-05-01	121
…	…	…	…
139	1.0	1960-08-01	606
140	1.0	1960-09-01	508
141	1.0	1960-10-01	461
142	1.0	1960-11-01	390
143	1.0	1960-12-01	432
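The renaming steps can be wrapped in a small reusable function. This is a sketch under stated assumptions: the helper to_single_series and the default identifier "series_1" are hypothetical, and the three rows are copied from the AirPassengers table above so the example runs offline. It also parses ds with pd.to_datetime, since pd.read_csv leaves date columns as strings unless told otherwise; whether that conversion is strictly required by NeuralForecast is not stated here, so treat it as a precaution.

```python
import pandas as pd

# First rows of the AirPassengers table shown above, as read_csv would return them
# (the timestamp column holds strings, not datetimes).
raw = pd.DataFrame({
    "timestamp": ["1949-01-01", "1949-02-01", "1949-03-01"],
    "value": [112, 118, 132],
})

def to_single_series(df, time_col, value_col, uid="series_1"):
    """Hypothetical helper: rename columns, parse dates and add a constant unique_id."""
    out = df.rename(columns={time_col: "ds", value_col: "y"})
    out["ds"] = pd.to_datetime(out["ds"])  # strings -> datetime64
    out.insert(0, "unique_id", uid)        # same identifier on every row
    return out[["unique_id", "ds", "y"]]

single = to_single_series(raw, time_col="timestamp", value_col="value")
print(single.dtypes)
```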

Missing values

Missing values are supported as long as you provide the available_mask column to signal whether each value is observed (1) or not (0). We recommend using a finite placeholder for unobserved values instead of NaN to avoid issues during training. Here’s an example of an input dataset with missing values for neuralforecast.

df = pd.DataFrame({
    "unique_id": ["A", "A", "A", "A", "A"],
    "ds": pd.date_range("2024-01-01", periods=5, freq="D"),
    "y": [10.0, 12.0, 0.0, 15.0, 16.0],
    "available_mask": [1, 1, 0, 1, 1],
})
df.head()

	unique_id	ds	y	available_mask
0	A	2024-01-01	10.0	1
1	A	2024-01-02	12.0	1
2	A	2024-01-03	0.0	0
3	A	2024-01-04	15.0	1
4	A	2024-01-05	16.0	1
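When the raw data marks missing observations with NaN, the mask and the placeholder can be derived in two steps. This is a minimal sketch: the placeholder 0.0 simply mirrors the example above and is an arbitrary choice, and the names raw and masked are illustrative.

```python
import numpy as np
import pandas as pd

# Same toy series as above, but with the unobserved value stored as NaN.
raw = pd.DataFrame({
    "unique_id": ["A"] * 5,
    "ds": pd.date_range("2024-01-01", periods=5, freq="D"),
    "y": [10.0, 12.0, np.nan, 15.0, 16.0],
})

masked = raw.copy()
# 1 where y is observed, 0 where it is missing. Build the mask BEFORE filling,
# otherwise the information about which rows were missing is lost.
masked["available_mask"] = masked["y"].notna().astype(int)
# Replace NaN with a finite placeholder (0.0 here, an arbitrary choice).
masked["y"] = masked["y"].fillna(0.0)
print(masked)
```

The result matches the table above: the third row keeps a finite y value while available_mask flags it as unobserved.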
