tstoolbox.tstoolbox.read

tstoolbox.tstoolbox.read(*filenames, force_freq=None, columns=None, start_date=None, end_date=None, dropna='no', skiprows=None, index_type='datetime', names=None, clean=False, source_units=None, target_units=None, round_index=None)

Combine time-series from different sources into single dataset.

Prints the time-series that were read in, using the tstoolbox standard format.

WARNING: Both naive and timezone-aware time-series are accepted; all are converted to UTC and the timezone information is then removed.
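As a rough illustration of what this warning means for the timestamps, the sketch below uses only the standard library. The helper name `to_naive_utc` is hypothetical, and the pass-through of naive timestamps is an assumption; it is not the tstoolbox implementation.

```python
from datetime import datetime, timedelta, timezone


def to_naive_utc(stamp):
    """Return `stamp` as a naive datetime expressed in UTC (hypothetical helper)."""
    if stamp.tzinfo is None:
        # Assumption: naive timestamps carry no offset, so they pass through unchanged.
        return stamp
    # Shift to UTC first, then drop the timezone information.
    return stamp.astimezone(timezone.utc).replace(tzinfo=None)


eastern = timezone(timedelta(hours=-5))
aware = datetime(2020, 1, 1, 7, 0, tzinfo=eastern)
naive = datetime(2020, 1, 1, 7, 0)

converted = [to_naive_utc(aware), to_naive_utc(naive)]
```

Note that the timezone-aware 07:00 reading ends up at 12:00 once expressed in UTC, so series that looked aligned in local time may no longer line up after reading.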

Parameters:
  • *filenames (str) –

    From the command line, a list of space-delimited filenames to read time-series from. From the Python API, a list or tuple of filenames.

    The supported file formats are CSV, Excel, WDM (Watershed Data Management), and HDF5. The file formats are determined by the file extension.

    Comma-separated values (CSV) files or tab-separated values (TSV):

    Separators will be automatically detected.  Columns can be
    selected by name or index, where the index for data columns starts
    at 1.
    
    CSV files require a single-line header of column names.  The
    default header is the first line of the input, but this can be
    changed for CSV files using the 'skiprows' option.
    
    Most common date formats can be used, but the closer to ISO 8601
    date/time standard the better.  ISO 8601 is roughly
    "YYYY-MM-DDTHH:MM:SS".
    

    Excel files (xls, xlsx, xlsm, xlsb, odf, ods, odt):

    The time-series data is read in from one or more sheets.  The first
    row is assumed to be the header.  The first column is assumed to be
    the index.  The top left cell of the table should be the name of
    the date/time index and must be in cell A1.
    

    WDM files:

    One or more Data Set Numbers (DSN) can be specified in any order.
    

    HDF5 files (h5, hdf5, hdf):

    One or more tables can be read from the HDF5 file.
    

    Command line examples:

    Keyword Example          Description
    -----------------------  ----------------------------------------------
    fname.csv                read all columns from ‘fname.csv’
    fname.csv,2,1            read data columns 2 and 1 from ‘fname.csv’
    fname.csv,2,skiprows=2   read data column 2 from ‘fname.csv’, skipping
                             the first 2 rows so the header is read from
                             the third row
    fname.xlsx,2,Sheet21     read all data from the 2nd sheet, then all
                             data from “Sheet21” of ‘fname.xlsx’
    fname.hdf5,Table12,T2    read all data from table “Table12”, then all
                             data from table “T2” of ‘fname.hdf5’
    fname.wdm,210,110        read DSNs 210, then 110 from ‘fname.wdm’
                             read all columns from standard input (stdin)
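To make the structure of the keyword examples above concrete, here is a minimal sketch that splits one such string into a filename, a list of selectors, and keyword options. The function `split_source` is hypothetical and is not tstoolbox's own parser; it assumes the filename itself contains no commas.

```python
def split_source(spec):
    """Split e.g. 'fname.csv,2,skiprows=2' into (filename, selectors, options)."""
    filename, *fields = spec.split(",")
    selectors, options = [], {}
    for field in fields:
        if "=" in field:
            # key=value fields such as 'skiprows=2' are treated as options.
            key, value = field.split("=", 1)
            options[key] = value
        elif field.isdigit():
            # Bare integers: column numbers, sheet numbers, or DSNs.
            selectors.append(int(field))
        else:
            # Anything else: column, sheet, or table names.
            selectors.append(field)
    return filename, selectors, options


parsed = [
    split_source(spec)
    for spec in (
        "fname.csv",
        "fname.csv,2,1",
        "fname.csv,2,skiprows=2",
        "fname.xlsx,2,Sheet21",
    )
]
```

The order of the selectors is preserved, which is what lets `fname.csv,2,1` return data column 2 before data column 1.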

    Python library examples:

    Each entry in the list can be one of a pandas DataFrame, pandas
    Series, dict, tuple, list, StringIO, or file name with the options
    listed above.
    
    newdf = tstoolbox.read(['fname.csv,4,1', 'fname.xlsx', 'fname.hdf5'])
    

  • force_freq

    [optional, output format]

    Force this frequency for the output. Typically you will only want to enforce a smaller interval, in which case toolbox_utils will insert missing values as needed. WARNING: you may lose data if you are not careful with this option. In general, letting the algorithm determine the frequency should always work, but this option overrides it. Use pandas offset codes.
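The effect of forcing a frequency can be sketched with the standard library alone. The helper `force_interval` below is hypothetical and only mimics the idea (a regular grid with missing values filled in as None); the real work is done by toolbox_utils on top of pandas.

```python
from datetime import datetime, timedelta


def force_interval(records, interval):
    """Place (stamp, value) records on a regular grid, using None for gaps."""
    lookup = dict(records)
    stamp, end = min(lookup), max(lookup)
    forced = []
    while stamp <= end:
        # Grid points with no observation become missing values;
        # observations that fall between grid points are dropped.
        forced.append((stamp, lookup.get(stamp)))
        stamp += interval
    return forced


hourly = [
    (datetime(2020, 1, 1, 0, 0), 1.0),
    (datetime(2020, 1, 1, 1, 0), 2.0),
    (datetime(2020, 1, 1, 2, 0), 3.0),
]

finer = force_interval(hourly, timedelta(minutes=30))
coarser = force_interval(hourly, timedelta(hours=2))
```

Forcing the smaller 30-minute interval keeps all three observations and adds two missing values, while forcing the larger 2-hour interval silently drops the 01:00 observation, which is the data loss the warning refers to.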

columns

[optional, defaults to all columns, input filter]

Columns to select out of the input. Can use column names from the first line header or column numbers. If using numbers, column number 1 is the first data column. To pick multiple columns, separate them with commas and no spaces. As used in the toolbox_utils pick command.

This means you do not have to create a data set with a particular column order; you can rearrange columns as the data is read in.
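A minimal sketch of the selection rule, assuming the index occupies position 0 of each row and is always kept first. The function `pick` is a hypothetical stand-in written for this illustration, not the toolbox_utils pick command itself.

```python
def pick(header, rows, columns):
    """Select and reorder data columns by name or 1-based number."""
    data_names = header[1:]  # header[0] is the date/time index
    positions = []
    for col in columns.split(","):
        if col.isdigit():
            # Data column 1 sits at position 1 because the index is at position 0.
            positions.append(int(col))
        else:
            positions.append(data_names.index(col) + 1)
    new_header = [header[0]] + [header[p] for p in positions]
    new_rows = [[row[0]] + [row[p] for p in positions] for row in rows]
    return new_header, new_rows


header = ["Datetime", "flow", "stage", "temp"]
rows = [
    ["2020-01-01", 1.0, 2.0, 3.0],
    ["2020-01-02", 4.0, 5.0, 6.0],
]

picked_header, picked_rows = pick(header, rows, "temp,1")
```

Names and numbers can be mixed, and the output follows the order in which the columns were requested.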

start_date: str

[optional, defaults to first date in time-series, input filter]

The start date of the series in ISO 8601 date/time format, or ‘None’ to start at the beginning.

end_date: str

[optional, defaults to last date in time-series, input filter]

The end date of the series in ISO 8601 date/time format, or ‘None’ to stop at the end.

dropna: str

[optional, defaults to ‘no’, input filter]

Set dropna to ‘any’ to have records dropped that have NA value in any column, or ‘all’ to have records dropped that have NA in all columns. Set to ‘no’ to not drop any records. The default is ‘no’.
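The three settings can be illustrated with a small standard-library sketch, using None to stand in for NA. The function `drop_records` is hypothetical and only mirrors the rule described above.

```python
def drop_records(records, dropna="no"):
    """Filter (stamp, values) records according to the dropna setting."""
    if dropna == "no":
        return list(records)
    if dropna not in ("any", "all"):
        raise ValueError("dropna must be 'no', 'any', or 'all'")
    # 'any' drops a record if any value is NA; 'all' only if every value is NA.
    is_dropped = any if dropna == "any" else all
    return [
        (stamp, values)
        for stamp, values in records
        if not is_dropped(value is None for value in values)
    ]


records = [
    ("2020-01-01", (1.0, 2.0)),
    ("2020-01-02", (None, 3.0)),
    ("2020-01-03", (None, None)),
]

kept_no = drop_records(records, "no")
kept_any = drop_records(records, "any")
kept_all = drop_records(records, "all")
```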

skiprows: list-like or integer or callable

[optional, default is None which will infer header from first line, input filter]

If a list, the line numbers to skip (0-indexed); if an integer, the number of lines to skip at the start of the file.

When used from Python, this can also be a callable. The callable is evaluated against the row indices and returns True if the row should be skipped and False otherwise. An example of a valid callable argument would be

lambda x: x in [0, 2].
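The three accepted forms can be compared side by side with a short sketch. The function `kept_lines` is hypothetical and written only to show which lines survive in each case; it does not reproduce how tstoolbox reads files.

```python
def kept_lines(lines, skiprows=None):
    """Return the lines that remain after applying a skiprows setting."""
    if skiprows is None:
        def skip(i):
            return False
    elif callable(skiprows):
        # Callable: evaluated against each 0-indexed row number.
        skip = skiprows
    elif isinstance(skiprows, int):
        # Integer: number of lines to skip at the start of the file.
        def skip(i):
            return i < skiprows
    else:
        # List-like: explicit 0-indexed line numbers to skip.
        wanted = set(skiprows)

        def skip(i):
            return i in wanted
    return [line for i, line in enumerate(lines) if not skip(i)]


lines = ["# comment", "Datetime,flow", "# units", "2020-01-01,1.0"]

by_callable = kept_lines(lines, lambda x: x in [0, 2])
by_list = kept_lines(lines, [0, 2])
by_integer = kept_lines(lines, 1)
```

With the callable or the list, the header line `Datetime,flow` becomes the first line that is read; with the integer 1, only the leading comment is skipped.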

index_type: str

[optional, default is ‘datetime’, output format]

Can be either ‘number’ or ‘datetime’. Use ‘number’ with index values that are Julian dates, or other epoch reference.

names: str

[optional, default is None, transformation]

If None, the column names are taken from the first row after ‘skiprows’ from the input dataset.

MUST include a name for all columns in the input dataset, including the index column.

clean

[optional, default is False, input filter]

The ‘clean’ command will repair an input index, removing duplicate index values and sorting.
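A minimal sketch of this kind of repair is shown below. The function `clean_index` is hypothetical, and keeping the first of several duplicate index values is an assumption made for the example; the source does not say which duplicate survives.

```python
def clean_index(records):
    """Remove duplicate index values and sort (stamp, value) records by stamp."""
    first_seen = {}
    for stamp, value in records:
        # Assumption: the first record seen for a given stamp is the one kept.
        first_seen.setdefault(stamp, value)
    return sorted(first_seen.items())


records = [
    ("2020-01-03", 3.0),
    ("2020-01-01", 1.0),
    ("2020-01-03", 9.0),
    ("2020-01-02", 2.0),
]

cleaned = clean_index(records)
```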

source_units: str

[optional, default is None, transformation]

If a unit is specified for the column as the second field of a ‘:’ delimited column name, then the specified units and the ‘source_units’ must match exactly.

Any unit string compatible with the ‘pint’ library can be used.

target_units: str

[optional, default is None, transformation]

The purpose of this option is to specify target units for unit conversion. The source units are specified in the header line of the input or using the ‘source_units’ keyword.

The units of the input time-series or values are specified as the second field of a ‘:’ delimited name in the header line of the input or in the ‘source_units’ keyword.

Any unit string compatible with the ‘pint’ library can be used.

This option will also add the ‘target_units’ string to the column names.
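The interplay between units in the column name and the ‘source_units’ keyword can be sketched as follows. Both helpers are hypothetical, the `name:units` form of the renamed column is an assumption, and the actual unit conversion (done with ‘pint’) is left out.

```python
def resolve_units(column, source_units=None):
    """Return (name, units) for a ':' delimited column name."""
    fields = column.split(":")
    name = fields[0]
    # Units, if present, are the second ':' delimited field.
    header_units = fields[1] if len(fields) > 1 and fields[1] else None
    if header_units and source_units and header_units != source_units:
        raise ValueError(
            f"units {header_units!r} in column name do not match "
            f"source_units {source_units!r}"
        )
    return name, header_units or source_units


def rename_with_target(column, target_units):
    """Assumed naming: replace the units field with the target units."""
    name = column.split(":")[0]
    return f"{name}:{target_units}"


from_header = resolve_units("flow:cfs")
from_keyword = resolve_units("flow", source_units="cfs")
renamed = rename_with_target("flow:cfs", "m**3/s")
```

Calling `resolve_units("flow:cfs", source_units="cms")` raises a ValueError in this sketch, reflecting the requirement that the two unit strings match exactly.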

float_format

[optional, output format]

Format for float numbers.

round_index

[optional, default is None which will do nothing to the index, output format]

Round the index to the nearest time point. This can significantly improve performance because it cuts down on memory and processing requirements; however, be cautious about rounding from a small interval to a very coarse one, which could lead to duplicate values in the index.
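The duplicate-index risk is easy to see in a small standard-library sketch. The helper `round_stamp` is hypothetical and rounds relative to an arbitrary epoch; it is only meant to show the effect, not how the option is implemented.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def round_stamp(stamp, interval):
    """Round a naive datetime to the nearest multiple of `interval`."""
    steps = round((stamp - EPOCH) / interval)
    return EPOCH + steps * interval


stamps = [
    datetime(2020, 1, 1, 0, 10),
    datetime(2020, 1, 1, 0, 20),
    datetime(2020, 1, 1, 0, 50),
]

rounded = [round_stamp(stamp, timedelta(hours=1)) for stamp in stamps]
duplicates = [stamp for stamp, count in Counter(rounded).items() if count > 1]
```

Rounding these 10-minute-scale readings to the hour sends the first two to the same time point, so the rounded index is no longer unique.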

tablefmt: str

[optional, default is ‘csv’, output format]

The table format. Can be one of ‘csv’, ‘tsv’, ‘plain’, ‘simple’, ‘grid’, ‘pipe’, ‘orgtbl’, ‘rst’, ‘mediawiki’, ‘latex’, ‘latex_raw’ and ‘latex_booktabs’.