tstoolbox.tstoolbox.regression

tstoolbox.tstoolbox.regression(method, x_train_cols, y_train_col, x_pred_cols=None, input_ts='-', columns=None, start_date=None, end_date=None, dropna='no', clean=False, round_index=None, skiprows=None, index_type='datetime', names=None, print_input=False, por=False)

Regression of one or more time-series or indices to a time-series.

If the optional x_pred_cols is given, returns a time-series of the y predictions. Otherwise returns a dictionary of the regression equation and statistics about the fit.
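
Below is a minimal sketch of calling this function from Python, assuming a pandas DataFrame passed through the `input_ts` keyword as described under the parameters below. The DataFrame contents, the column numbers, and the choice of the "Linear" method are illustrative assumptions, not part of this documentation.

    import pandas as pd
    from tstoolbox import tstoolbox

    # Hypothetical daily data: data column 1 is the predictor "x" and
    # data column 2 is the target "y".
    index = pd.date_range("2000-01-01", periods=120, freq="D")
    df = pd.DataFrame(
        {"x": range(120), "y": [3.0 * v + 2.0 for v in range(120)]},
        index=index,
    )

    # Without x_pred_cols: returns the regression equation and fit statistics.
    fit = tstoolbox.regression(
        method="Linear",
        x_train_cols=[1],
        y_train_col=2,
        input_ts=df,
    )

    # With x_pred_cols: returns a time-series of y predictions based on
    # data column 1.
    pred = tstoolbox.regression(
        method="Linear",
        x_train_cols=[1],
        y_train_col=2,
        x_pred_cols=[1],
        input_ts=df,
    )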

Parameters:
  • method (str) –

    The method of regression. The chosen method will use x_train_cols as the independent data and y_train_col as the dependent data.

    ARD

    Requires lots of memory.

    Fit the weights of a regression model, using an ARD prior. The weights of the regression model are assumed to follow Gaussian distributions. Also estimates the parameters lambda (precisions of the distributions of the weights) and alpha (precision of the distribution of the noise). The estimation is done by an iterative procedure (Evidence Maximization).

    BayesianRidge

    Fit a Bayesian ridge model. See the Notes section for details on this implementation and the optimization of the regularization parameters lambda (precision of the weights) and alpha (precision of the noise).

    ElasticNetCV

    Elastic Net model with iterative fitting along a regularization path.

    ElasticNet

    Linear regression with combined L1 and L2 priors as regularizer.

    Huber

    Linear regression model that is robust to outliers.

    The Huber Regressor optimizes the squared loss for the samples where abs((y - X’w) / sigma) < epsilon and the absolute loss for the samples where abs((y - X’w) / sigma) > epsilon, where w and sigma are parameters to be optimized. The parameter sigma makes sure that if y is scaled up or down by a certain factor, one does not need to rescale epsilon to achieve the same robustness. Note that this does not take into account the fact that the different features of X may be of different scales.

    This makes sure that the loss function is not heavily influenced by the outliers while not completely ignoring their effect.

    LarsCV

    Cross-validated Least Angle Regression model.

    Lars

    Least Angle Regression model.

    LassoCV

    Lasso linear model with iterative fitting along a regularization path.

    LassoLarsCV

    Cross-validated Lasso, using the LARS algorithm.

    LassoLarsIC

    Lasso model fit with Lars using BIC or AIC for model selection.

    LassoLars

    Lasso model fit with Least Angle Regression a.k.a. Lars. It is a Linear Model trained with an L1 prior as regularizer.

    Lasso

    Linear Model trained with L1 prior as regularizer (aka the Lasso).

    Linear

    LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

    RANSAC

    RANSAC (RANdom SAmple Consensus) algorithm. RANSAC is an iterative algorithm for the robust estimation of parameters from a subset of inliers from the complete data set.

    RidgeCV

    Ridge regression with built-in cross-validation. By default, it performs Generalized Cross-Validation, which is a form of efficient Leave-One-Out cross-validation.

    Ridge

    Linear least squares with L2 regularization (ridge regression). Minimizes the residual sum of squares plus an L2 penalty on the size of the coefficients.

    SGD

    Input must be scaled by removing the mean and scaling to unit variance. Can use ‘tstoolbox normalization …’ to scale the input (a pandas-based scaling sketch is given after this parameter list).

    Linear model fitted by minimizing a regularized empirical loss with SGD. SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate).

    The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.

    TheilSen

    Theil-Sen Estimator: robust multivariate regression model.

    The algorithm calculates least-squares solutions on subsets with size n_subsamples of the samples in X. Any value of n_subsamples between the number of features and the number of samples leads to an estimator with a compromise between robustness and efficiency. Since the number of least-squares solutions is “n_samples choose n_subsamples”, it can be extremely large and can therefore be limited with max_subpopulation. If this limit is reached, the subsets are chosen randomly. In a final step, the spatial median (or L1 median) of all least-squares solutions is calculated.

  • x_train_cols (str or list) – List of column names/numbers that hold the x value datasets used to train the regression. Perform a multiple regression, if the method allows, by giving several x_train_cols. To include the index in the regression use column 0 or the index name.

  • y_train_col (str or list) –

    Column name or number of the y dataset used to train the regression.

    The y_train_col cannot be part of x_train_cols or x_pred_cols.

  • x_pred_cols (str or list) –

    [optional, if supplied will return a time-series of the y prediction based on x_pred_cols.]

    List of column names/numbers of x value datasets used to create the y prediction. Needs to be the same number of columns as x_train_cols. Can be identical columns to x_train_cols.

  • input_ts (str) –

    [optional though required if using within Python, default is ‘-’ (stdin)]

    Whether from a file or standard input, data requires a single line header of column names. The default header is the first line of the input, but this can be changed for CSV files using the ‘skiprows’ option.

    Most common date formats can be used, but the closer to ISO 8601 date/time standard the better.

    Comma-separated values (CSV) files or tab-separated values (TSV):

    File separators will be automatically detected.
    
    Columns can be selected by name or index, where the index for
    data columns starts at 1.
    

    Command line examples:

        Keyword Example                   Description
        --input_ts=fn.csv                 read all columns from 'fn.csv'
        --input_ts=fn.csv,2,1             read data columns 2 and 1 from
                                          'fn.csv'
        --input_ts=fn.csv,2,skiprows=2    read data column 2 from 'fn.csv',
                                          skipping the first 2 rows so the
                                          header is read from the third row
        --input_ts=fn.xlsx,2,Sheet21      read all data from the 2nd sheet,
                                          then all data from the sheet named
                                          'Sheet21' of 'fn.xlsx'
        --input_ts=fn.hdf5,Table12,T2     read all data from table 'Table12',
                                          then all data from table 'T2' of
                                          'fn.hdf5'
        --input_ts=fn.wdm,210,110         read DSNs 210, then 110 from
                                          'fn.wdm'
        --input_ts='-'                    read all columns from standard
                                          input (stdin)
        --input_ts='-' --columns=4,1      read columns 4 and 1 from standard
                                          input (stdin)

    If working with CSV or TSV files, you can use redirection rather than --input_ts=fname.csv. The following are identical:

    From a file:

    command subcmd --input_ts=fname.csv

    From standard input (since '--input_ts=-' is the default):

    command subcmd < fname.csv

    Can also combine commands by piping:

    command subcmd < filein.csv | command subcmd1 > fileout.csv

    Python library examples:

    You must use the `input_ts=...` option, where `input_ts` can be a
    pandas DataFrame, pandas Series, dict, tuple, list, StringIO, or
    file name.
    

  • columns

    [optional, defaults to all columns, input filter]

    Columns to select out of input. Can use column names from the first line header or column numbers. If using numbers, column number 1 is the first data column. To pick multiple columns, separate them by commas with no spaces. As used in the toolbox_utils ‘pick’ command.

    This solves a big problem: you don’t have to create a data set with a particular column order, since you can rearrange columns as the data is read in.

  • start_date (str) –

    [optional, defaults to first date in time-series, input filter]

    The start_date of the series in ISOdatetime format, or ‘None’ for beginning.

  • end_date (str) –

    [optional, defaults to last date in time-series, input filter]

    The end_date of the series in ISOdatetime format, or ‘None’ for end.

  • dropna (str) –

    [optional, default is ‘no’, input filter]

    Set dropna to ‘any’ to have records dropped that have NA value in any column, or ‘all’ to have records dropped that have NA in all columns. Set to ‘no’ to not drop any records. The default is ‘no’.

  • clean

    [optional, default is False, input filter]

    The ‘clean’ command will repair an input index by removing duplicate index values and sorting.

  • round_index

    [optional, default is None which will do nothing to the index, output format]

    Round the index to the nearest time point. This can significantly improve performance since it can cut down on memory and processing requirements; however, be cautious about rounding from a small interval to a very coarse one, since this could lead to duplicate values in the index.

  • skiprows (list-like or integer or callable) –

    [optional, default is None which will infer header from first line, input filter]

    Line numbers to skip (0-indexed) if a list or number of lines to skip at the start of the file if an integer.

    If used in Python, this can be a callable; the callable will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be

    lambda x: x in [0, 2].

  • index_type (str) –

    [optional, default is ‘datetime’, output format]

    Can be either ‘number’ or ‘datetime’. Use ‘number’ with index values that are Julian dates, or other epoch reference.

  • names (str) –

    [optional, default is None, transformation]

    If None, the column names are taken from the first row after ‘skiprows’ from the input dataset.

    MUST include a name for all columns in the input dataset, including the index column.

  • print_input

    [optional, default is False, output format]

    If set to True, the input columns will be included in the output table.

  • tablefmt (str) –

    [optional, default is ‘csv’, output format]

    The table format. Can be one of ‘csv’, ‘tsv’, ‘plain’, ‘simple’, ‘grid’, ‘pipe’, ‘orgtbl’, ‘rst’, ‘mediawiki’, ‘latex’, ‘latex_raw’ and ‘latex_booktabs’.

  • por

    [optional, default is False]

    The por keyword adjusts the operation of start_date and end_date.

    If False (the default), only the indices of the time-series between start_date and end_date are used. If True, and start_date or end_date is outside of the existing time-series, the time-series will be padded with missing values to include the exterior start_date or end_date.
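
As noted under the SGD method above, the input must be scaled to zero mean and unit variance before fitting. Below is a minimal sketch that performs the scaling with plain pandas before calling regression(); the file name and column names are hypothetical, and using pandas for the scaling (instead of ‘tstoolbox normalization’) is an illustrative choice.

    import pandas as pd
    from tstoolbox import tstoolbox

    # Hypothetical input file with a datetime index, a predictor column "x",
    # and a target column "y".
    df = pd.read_csv("fn.csv", index_col=0, parse_dates=True)

    # Standardize the predictor to zero mean and unit variance, as the SGD
    # method requires.  The target column is left unscaled here.
    scaled = df.copy()
    scaled["x"] = (df["x"] - df["x"].mean()) / df["x"].std()

    result = tstoolbox.regression(
        method="SGD",
        x_train_cols="x",
        y_train_col="y",
        input_ts=scaled,
    )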