tstoolbox.tstoolbox.fill

tstoolbox.tstoolbox.fill(input_ts='-', method='ffill', print_input=False, start_date=None, end_date=None, columns=None, clean=False, index_type='datetime', names=None, source_units=None, target_units=None, skiprows=None, from_columns=None, to_columns=None, limit=None, order=None, force_freq=None)

Fill missing values (NaN) with different methods.

Missing values can occur because of NaN, or because the time series is sparse.
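
The two sources of missing values can be sketched as follows. This is an illustration that uses pandas directly rather than tstoolbox itself; the series name `flow` and the sample values are made up, and the forward fill is only comparable in spirit to `method='ffill'`.

```python
import pandas as pd

# Hypothetical daily series: one explicit NaN (2000-01-02) and one
# day that is absent from the index altogether (2000-01-04).
idx = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05"])
ts = pd.Series([1.0, float("nan"), 3.0, 5.0], index=idx, name="flow")

# Conforming the sparse series to a regular daily index exposes the
# absent day as a NaN as well.
regular = ts.asfreq("D")

# Forward fill: each NaN takes the last good value before it.
filled = regular.ffill()
```

After the reindexing step both kinds of gap are ordinary NaN values, so a single fill method can treat them alike.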

Parameters:
  • method (str) –

    [optional, default is ‘ffill’]

    String contained in single quotes or a number that defines the method to use for filling.

    ======================================  ==========================================
    method=                                 fill missing values with…
    ======================================  ==========================================
    ffill                                   …the last good value
    bfill                                   …the next good value
    2.3                                     …with this number
    linear                                  …ignore index, values are equally spaced
    index                                   …linear interpolation with datetime index
    values                                  …linear interpolation with numerical index
    nearest                                 …nearest good value
    zero                                    …zeroth order spline
    slinear                                 …first order spline
    quadratic                               …second order spline
    cubic                                   …third order spline
    spline order=n                          …nth order spline
    polynomial order=n                      …nth order polynomial
    barycentric                             …barycentric
    mean                                    …with mean
    median                                  …with median
    max                                     …with maximum
    min                                     …with minimum
    from                                    …with good values from other columns
    time                                    …daily and higher resolution to interval
    krogh                                   …krogh algorithm
    piecewise_polynomial from_derivatives   …piecewise-polynomial algorithm
    pchip                                   …pchip algorithm
    akima                                   …akima algorithm
    ======================================  ==========================================

  • print_input

    [optional, default is False, output format]

    If set to ‘True’, the input columns are included in the output table.

  • input_ts (str) –

    [optional though required if using within Python, default is ‘-’ (stdin)]

    Whether from a file or standard input, data requires a single line header of column names. The default header is the first line of the input, but this can be changed for CSV files using the ‘skiprows’ option.

    Most common date formats can be used, but the closer to ISO 8601 date/time standard the better.

    Comma-separated values (CSV) files or tab-separated values (TSV):

    File separators will be automatically detected.
    
    Columns can be selected by name or index, where the index for
    data columns starts at 1.
    

    Command line examples:

    ==============================  ==========================================
    Keyword Example                 Description
    ==============================  ==========================================
    --input_ts=fn.csv               read all columns from ‘fn.csv’
    --input_ts=fn.csv,2,1           read data columns 2 and 1 from ‘fn.csv’
    --input_ts=fn.csv,2,skiprows=2  read data column 2 from ‘fn.csv’, skipping first 2 rows so header is read from third row
    --input_ts=fn.xlsx,2,Sheet21    read all data from the 2nd sheet, then all data from “Sheet21” of ‘fn.xlsx’
    --input_ts=fn.hdf5,Table12,T2   read all data from table “Table12” then all data from table “T2” of ‘fn.hdf5’
    --input_ts=fn.wdm,210,110       read DSNs 210, then 110 from ‘fn.wdm’
    --input_ts='-'                  read all columns from standard input (stdin)
    --input_ts='-' --columns=4,1    read column 4 and 1 from standard input (stdin)
    ==============================  ==========================================

    If working with CSV or TSV files you can use redirection rather than --input_ts=fname.csv. The following are equivalent:

    From a file:

    command subcmd --input_ts=fname.csv

    From standard input (since ‘--input_ts=-’ is the default):

    command subcmd < fname.csv

    You can also combine commands by piping:

    command subcmd < filein.csv | command subcmd1 > fileout.csv

    Python library examples:

    You must use the `input_ts=...` option where `input_ts` can be
    one of a [pandas DataFrame, pandas Series, dict, tuple, list,
    StringIO, or file name].
    

  • start_date (str) –

    [optional, defaults to first date in time-series, input filter]

    The start_date of the series in ISO 8601 date/time format, or ‘None’ for the beginning.

  • end_date (str) –

    [optional, defaults to last date in time-series, input filter]

    The end_date of the series in ISO 8601 date/time format, or ‘None’ for the end.

  • clean

    [optional, default is False, input filter]

    The ‘clean’ option repairs an input index by removing duplicate index values and sorting.

  • skiprows (list-like or integer or callable) –

    [optional, default is None which will infer header from first line, input filter]

    If a list, the line numbers to skip (0-indexed). If an integer, the number of lines to skip at the start of the file.

    In Python, ‘skiprows’ can also be a callable. The callable is evaluated against each row index and returns True if the row should be skipped and False otherwise. An example of a valid callable argument would be

    lambda x: x in [0, 2].

  • index_type (str) –

    [optional, default is ‘datetime’, output format]

    Can be either ‘number’ or ‘datetime’. Use ‘number’ with index values that are Julian dates, or other epoch reference.

  • names (str) –

    [optional, default is None, transformation]

    If None, the column names are taken from the first row after ‘skiprows’ from the input dataset.

    MUST include a name for all columns in the input dataset, including the index column.

  • source_units (str) –

    [optional, default is None, transformation]

    If units are specified for a column as the second field of a ‘:’ delimited column name, then those units and ‘source_units’ must match exactly.

    Any unit string compatible with the ‘pint’ library can be used.

  • target_units (str) –

    [optional, default is None, transformation]

    Specifies the target units for unit conversion. The source units of the input time-series or values are taken from the second field of a ‘:’ delimited name in the header line of the input, or from the ‘source_units’ keyword.

    Any unit string compatible with the ‘pint’ library can be used.

    This option will also add the ‘target_units’ string to the column names.

  • columns

    [optional, defaults to all columns, input filter]

    Columns to select from the input. Can use column names from the first line header or column numbers. If using numbers, column number 1 is the first data column. To pick multiple columns, separate them by commas with no spaces. This works as in the toolbox_utils pick command.

    Because columns can be rearranged as the data is read in, you do not have to create a data set with a particular column order.

  • from_columns (str or list) –

    [required if method=’from’, otherwise not used]

    List of column names/numbers from which good values will be taken to fill missing values in the columns given by the to_columns keyword.

  • to_columns (str or list) –

    [required if method=’from’, otherwise not used]

    List of column names/numbers in which missing values will be replaced with good values from the columns given by the from_columns keyword.

  • limit (int) –

    [default is None]

    Gaps of missing values greater than this number will not be filled.

  • order (int) –

    [required if method is ‘spline’ or ‘polynomial’, otherwise not used, default is None]

    The order of the ‘spline’ or ‘polynomial’ fit for missing values.

  • tablefmt (str) –

    [optional, default is ‘csv’, output format]

    The table format. Can be one of ‘csv’, ‘tsv’, ‘plain’, ‘simple’, ‘grid’, ‘pipe’, ‘orgtbl’, ‘rst’, ‘mediawiki’, ‘latex’, ‘latex_raw’ and ‘latex_booktabs’.

  • force_freq (Optional[str]) –

    [optional, output format]

    Force this frequency for the output. Typically you will only want to enforce a smaller interval, in which case toolbox_utils will insert missing values as needed. WARNING: you may lose data if you are not careful with this option. In general, letting the algorithm determine the frequency should always work, but this option overrides it. Use pandas offset codes.
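
Two of the distinctions above, ‘linear’ versus ‘time’ and filling ‘from’ other columns, may be easier to see with numbers. The following is a minimal sketch that uses pandas directly and is not taken from the tstoolbox source; the column names `gauge_a` and `gauge_b` and all values are hypothetical, and the pandas calls only approximate what the corresponding `method`, `from_columns` and `to_columns` settings are described as doing.

```python
import pandas as pd

# Irregularly spaced observations: the gap sits one day after the
# first good value and two days before the next one.
idx = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-04"])
s = pd.Series([0.0, float("nan"), 3.0], index=idx)

# 'linear' ignores the index and treats the values as equally spaced,
# so the gap lands halfway between its neighbours.
by_position = s.interpolate(method="linear")

# 'time' weights the interpolation by the datetime index instead.
by_time = s.interpolate(method="time")

# Sketch of the idea behind method='from': missing values in a target
# column are replaced by good values from a source column.
df = pd.DataFrame(
    {
        "gauge_a": [1.0, float("nan"), 3.0, float("nan")],
        "gauge_b": [10.0, 20.0, float("nan"), float("nan")],
    },
    index=pd.date_range("2000-01-01", periods=4, freq="D"),
)
filled = df.copy()
filled["gauge_a"] = df["gauge_a"].fillna(df["gauge_b"])
```

In this sketch the index-blind interpolation and the time-weighted one give different values for the same gap, and the last row of `gauge_a` stays missing because the source column has no good value there either.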