I am trying out the PatchTST time series transformer (paper, code) on a time series dataset that I have. The way PatchTST handles data is as follows:
Note that on lines 78-79, the repo does the following:
```python
self.data_x = data[border1:border2]
self.data_y = data[border1:border2]
```
So, both `data_x` and `data_y` are exactly the same, meaning they have the same rows and columns.
Then it does the following on lines 88-89:
```python
seq_x = self.data_x[s_begin:s_end]
seq_y = self.data_y[r_begin:r_end]
```
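For context, this is my understanding of how those indices are formed (a paraphrase of the repo's `__getitem__`; the standalone function and example sizes below are mine):

```python
import numpy as np

def make_window(data, index, seq_len, label_len, pred_len):
    # Input window: seq_len consecutive steps starting at the sample index.
    s_begin = index
    s_end = s_begin + seq_len
    # Target window: overlaps the last label_len input steps, then extends pred_len into the future.
    r_begin = s_end - label_len
    r_end = r_begin + label_len + pred_len
    seq_x = data[s_begin:s_end]   # becomes batch_x (all columns)
    seq_y = data[r_begin:r_end]   # becomes batch_y (same columns)
    return seq_x, seq_y

# toy example: 1000 time steps, 7 columns, typical PatchTST window sizes
x, y = make_window(np.random.randn(1000, 7), index=0, seq_len=336, label_len=48, pred_len=96)
```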
Finally, it passes `batch_x` as input to the model on line 149:
```python
outputs = self.model(batch_x)
```
The model forecasts all input time series (i.e. all columns) for the specified prediction window. The model has the following three modes:
- M: more than one feature time series which are also the target time series to be predicted
- S: single feature time series which is also the target time series to be predicted
- MS: more than one feature time series, but only a single target time series. In this case, we still pass `batch_x` to the model, which contains both the feature and the target time series, but we calculate the loss only against the last time series (column), treating it as the target (see the sketch after this list).
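To illustrate the MS case, here is roughly how I understand the loss computation in the training loop (a paraphrase written as a standalone function; the names are mine):

```python
import torch

def compute_loss(outputs, batch_y, pred_len, features, criterion=torch.nn.MSELoss()):
    # MS mode: keep only the last column (the single target); M/S: keep all columns.
    f_dim = -1 if features == 'MS' else 0
    outputs = outputs[:, -pred_len:, f_dim:]   # last pred_len steps of the prediction
    batch_y = batch_y[:, -pred_len:, f_dim:]   # last pred_len steps of the ground truth window
    return criterion(outputs, batch_y)

# toy shapes: model output [batch, pred_len, n_vars], batch_y [batch, label_len + pred_len, n_vars]
loss = compute_loss(torch.randn(8, 96, 7), torch.randn(8, 144, 7), pred_len=96, features='MS')
```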
My problem statement differs from the paper's in the following ways:
- I have multiple feature time series and multiple target time series, and they are different: there is NO common time series that is both a feature and a target.
- I intermittently lose the ground truth target time series when I deploy the model. In such a scenario, I still want my predictions to be good. (My training dataset has ground truth target values for all time steps, i.e. no gaps.)
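To make these two points concrete, here is a toy illustration of the data layout I have in mind (the column names are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical layout: feature columns and target columns are disjoint sets.
df = pd.DataFrame({
    "feat_1": np.random.randn(10),
    "feat_2": np.random.randn(10),
    "target_1": np.random.randn(10),
    "target_2": np.random.randn(10),
})

# At deployment, the target columns intermittently go missing while the features keep arriving.
df.loc[4:6, ["target_1", "target_2"]] = np.nan
```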
How can I accommodate these differences in the model?
I can brainstorm the following approaches:
1. Feed only the feature time series to the model and modify the architecture of `FlattenHead` so that it outputs the number of target time series (a rough sketch of such a head is given after this list). This approach comes with two further challenges:
   1.1. The official repo applies `StandardScaler` to the feature and target time series together and then creates mini batches from the scaled data. But this approach does not feed the target time series to the model, and fitting the scaler on the feature time series only sounds incorrect. Should I omit scaling completely?
   1.2. The official repo also applies scaling at the instance level to deal with distribution shift. This is called Reversible Instance Normalization (paper, code). It suffers from exactly the same challenge as the `StandardScaler` case: this approach does not input the target time series to the model. Should I completely forgo Reversible Instance Normalization (line 62-65, line 78-81)?
2. Train the model using only the ground truth target time series (along with the feature time series). During testing, use the ground truth target time series for some initial mini batches; in later mini batches, use past predicted target values to form the input window (see the inference sketch after this list).
3. Train the model with the ground truth target time series only in some initial mini batches; in later mini batches, use past predicted target values to form the input window. Do the same during testing.
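For approach 1, this is a rough sketch of the kind of head I have in mind as a replacement for the repo's `Flatten_Head` (the class name and the cross-channel flattening are my own choices, assuming the backbone output shape is `[batch, n_vars, d_model, patch_num]` as in the repo):

```python
import torch
from torch import nn

class CrossChannelHead(nn.Module):
    """Hypothetical replacement for the repo's Flatten_Head: maps encoded feature
    channels to a separate set of target channels (feature columns != target columns)."""
    def __init__(self, n_features, n_targets, d_model, patch_num, pred_len, dropout=0.0):
        super().__init__()
        self.n_targets, self.pred_len = n_targets, pred_len
        self.flatten = nn.Flatten(start_dim=1)                          # merge vars, d_model, patches
        self.proj = nn.Linear(n_features * d_model * patch_num, n_targets * pred_len)
        self.dropout = nn.Dropout(dropout)

    def forward(self, z):
        # z: [batch, n_features, d_model, patch_num] -- backbone output as in the repo
        out = self.proj(self.dropout(self.flatten(z)))                  # [batch, n_targets * pred_len]
        return out.view(-1, self.pred_len, self.n_targets)              # [batch, pred_len, n_targets]

# toy usage: 5 feature channels in, 2 target channels out
head = CrossChannelHead(n_features=5, n_targets=2, d_model=128, patch_num=42, pred_len=96)
y_hat = head(torch.randn(8, 5, 128, 42))                                # -> [8, 96, 2]
```

For challenge 1.1, instead of dropping scaling entirely, I could presumably fit one `StandardScaler` on the feature columns (the model input) and a separate one on the target columns, used only to scale the labels and to inverse-transform the predictions.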
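And for approaches 2 and 3, this is a minimal sketch of the inference-time feedback loop I mean (the function and model interface below are hypothetical, not from the repo; it assumes the first `seq_len` steps have ground truth targets and that the model predicts all columns in the same order as the input):

```python
import numpy as np
import torch

def rolling_forecast(model, features, targets, seq_len, pred_len, target_cols):
    # features: [T, n_features], targets: [T, n_targets], aligned on the same time index.
    # target_cols: column indices of the target series inside the concatenated array.
    history = np.concatenate([features, targets], axis=1).astype(np.float32)
    preds = []
    for t in range(seq_len, history.shape[0] - pred_len + 1, pred_len):
        window = history[t - seq_len:t]                     # last seq_len fully-filled steps
        with torch.no_grad():
            out = model(torch.from_numpy(window[None]))     # [1, pred_len, n_cols] in M mode
        out = out.squeeze(0).numpy()
        preds.append(out)
        # Wherever the ground-truth targets for the next pred_len steps are missing (NaN),
        # back-fill them with our own predictions so the next input window is complete.
        block = history[t:t + pred_len]
        gap = np.isnan(block[:, target_cols])
        block[:, target_cols] = np.where(gap, out[:, target_cols], block[:, target_cols])
    return preds
```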
I know some of the above approaches may not work at all, but I was just brainstorming. Have you faced such requirements in your problem statement? Which of the above approaches would you suggest? Please also suggest any other approach that you feel would work better than the three above.
PS: I tried inputting the time series data (both feature and target time series) as done by the paper / official repo. It works well, but it is not of use to me, since my problem statement may involve situations where the target ground truth is not available.