Using the client

dapclient can be used as a client to inspect and retrieve data from any of the hundreds of scientific datasets available on the internet on OPeNDAP servers. This way, it’s possible to instrospect and manipulate a dataset as if it were stored locally, with data being downloaded on-the-fly as necessary.

Accessing gridded data

Let’s start accessing gridded data, i.e., data that is stored as a regular multidimensional array. Here’s a simple example where we access the COADS climatology from the official OPeNDAP server:

>>> from dapclient.client import open_url
>>> dataset = open_url("http://test.opendap.org/dap/data/nc/coads_climatology.nc")
>>> type(dataset)
<class 'dapclient.model.DatasetType'>

Here we use the dapclient.client.open_url function to open an URL specifying the location of the dataset; this URL should be stripped of the extensions commonly used for OPeNDAP datasets, like .dds or .das. When we access the remote dataset the function returns a DatasetType object, which is a Structure – a fancy dictionary that stores other variables. We can check the names of the store variables like we would do with a Python dictionary:

>>> list(dataset.keys())
['COADSX', 'COADSY', 'TIME', 'SST', 'AIRT', 'UWND', 'VWND']

Let’s work with the SST variable; we can reference it using the usual dictionary syntax of dataset['SST'], or using the “lazy” syntax dataset.SST:

>>> sst = dataset["SST"]  # or dataset.SST
>>> type(sst)
<class 'dapclient.model.GridType'>

Note that the variable is of type GridType, a multidimensional array with specific axes defining each of its dimensions:

>>> sst.dimensions
('TIME', 'COADSY', 'COADSX')
>>> sst.maps
OrderedDict([('TIME', <BaseType with data BaseProxy('http://test.opendap.org/dap/data/nc/coads_climatology.nc', 'SST.TIME', dtype('>f8'), (12,), (slice(None, None, None),))>), ('COADSY', <BaseType with data BaseProxy('http://test.opendap.org/dap/data/nc/coads_climatology.nc', 'SST.COADSY', dtype('>f8'), (90,), (slice(None, None, None),))>), ('COADSX', <BaseType with data BaseProxy('http://test.opendap.org/dap/data/nc/coads_climatology.nc', 'SST.COADSX', dtype('>f8'), (180,), (slice(None, None, None),))>)])

Each map is also, in turn, a variable that can be accessed using the same syntax used for Structures:

>>> sst.TIME
<BaseType with data BaseProxy('http://test.opendap.org/dap/data/nc/coads_climatology.nc', 'SST.TIME', dtype('>f8'), (12,), (slice(None, None, None),))>

The axes are all of type BaseType. This is the OPeNDAP equivalent of a multidimensional array, with a specific shape and type. Even though no data have been downloaded up to this point, we can introspect these attributes from the axes or from the Grid itself:

>>> sst.shape
(12, 90, 180)
>>> sst.dtype
dtype('>f4')
>>> sst.TIME.shape
(12,)
>>> sst.TIME.dtype
dtype('>f8')

We can also introspect the variable attributes; they are stored in an attribute appropriately called attributes, and they can also be accessed with a “lazy” syntax:

>>> import pprint
>>> pprint.pprint(sst.attributes)
{'_FillValue': -9.99999979e+33,
 'history': 'From coads_climatology',
 'long_name': 'SEA SURFACE TEMPERATURE',
 'missing_value': -9.99999979e+33,
 'units': 'Deg C'}
>>> sst.units
'Deg C'

Finally, we can also download some data. To download data we simply access it like we would access a Numpy array, and the data for the corresponding subset will be dowloaded on the fly from the server:

>>> sst.shape
(12, 90, 180)
>>> grid = sst[0, 10:14, 10:14]  # this will download data from the server
>>> grid
<GridType with array 'SST' and maps 'TIME', 'COADSY', 'COADSX'>

The data itself can be accessed in the array attribute of the Grid, and also on the individual axes:

>>> grid.array[:]
<BaseType with data array([[[ -1.26285708e+00,  -9.99999979e+33,  -9.99999979e+33,
          -9.99999979e+33],
        [ -7.69166648e-01,  -7.79999971e-01,  -6.75454497e-01,
          -5.95714271e-01],
        [  1.28333330e-01,  -5.00000156e-02,  -6.36363626e-02,
          -1.41666666e-01],
        [  6.38000011e-01,   8.95384610e-01,   7.21666634e-01,
           8.10000002e-01]]], dtype=float32)>
>>> print(grid.array[:].data)
[[[ -1.26285708e+00  -9.99999979e+33  -9.99999979e+33  -9.99999979e+33]
  [ -7.69166648e-01  -7.79999971e-01  -6.75454497e-01  -5.95714271e-01]
  [  1.28333330e-01  -5.00000156e-02  -6.36363626e-02  -1.41666666e-01]
  [  6.38000011e-01   8.95384610e-01   7.21666634e-01   8.10000002e-01]]]
>>> grid.COADSX[:]
<BaseType with data array([ 41.,  43.,  45.,  47.])>
>>> print(grid.COADSX[:].data)
[ 41.  43.  45.  47.]

Alternatively, we could have dowloaded the data directly, skipping the axes:

>>> print(sst.array[0, 10:14, 10:14].data)
[[[ -1.26285708e+00  -9.99999979e+33  -9.99999979e+33  -9.99999979e+33]
  [ -7.69166648e-01  -7.79999971e-01  -6.75454497e-01  -5.95714271e-01]
  [  1.28333330e-01  -5.00000156e-02  -6.36363626e-02  -1.41666666e-01]
  [  6.38000011e-01   8.95384610e-01   7.21666634e-01   8.10000002e-01]]]

Older Servers

Some servers using a very old OPeNDAP application might run of of memory when attempting to retrieve both the data and the coordinate axes of a variable. The work around is to simply disable the retrieval of coordinate axes by using the output_grid option to open url:

>>> from dapclient.client import open_url
>>> dataset = open_url(
...     "http://test.opendap.org/dap/data/nc/coads_climatology.nc", output_grid=False
... )
>>> grid = sst[0, 10:14, 10:14]  # this will download data from the server
>>> grid
<GridType with array 'SST' and maps 'TIME', 'COADSY', 'COADSX'>

Accessing sequential data

Now let’s see an example of accessing sequential data. Sequential data consists of one or more records of related variables, such as a simultaneous measurements of temperature and wind velocity, for example. In this example we’re going to access data from the Argo project, consisting of profiles made by autonomous buoys drifting on the ocean:

:: python

>>> from dapclient.client import open_url
>>> dataset = open_url("http://dapper.pmel.noaa.gov/dapper/argo/argo_all.cdp")

This dataset is fairly complex, with several variables representing heterogeneous 4D data. The layout of the dataset follows the Dapper in-situ conventions, consisting of two nested sequences: the outer sequence contains, in this case, a latitude, longitude and time variable, while the inner sequence contains measurements along a z axis.

The first thing we’d like to do is limit our region; let’s work with a small region in the Tropical Atlantic:

:: python

>>> type(dataset.location)
<class 'dapclient.model.SequenceType'>
>>> dataset.location.keys()
['LATITUDE', 'JULD', 'LONGITUDE', '_id', 'profile', 'attributes', 'variable_attributes']
>>> my_location = dataset.location[
...     (dataset.location.LATITUDE > -2)
...     & (dataset.location.LATITUDE < 2)
...     & (dataset.location.LONGITUDE > 320)
...     & (dataset.location.LONGITUDE < 330)
... ]

Note that the variable dataset.location is of type SequenceType – also a Structure that holds other variables. Here we’re limiting the sequence dataset.location to measurements between given latitude and longitude boundaries. Let’s access the identification number of the first 10-or-so profiles:

>>> for i, id_ in enumerate(my_location['_id'].iterdata()):
...     print(id_)
...     if i == 10:
...         print('...')
...         break
1125393
835304
839894
875344
110975
864748
832685
887712
962673
881368
1127922
...
>>> len(my_location['_id'].iterdata())
623

Note that calculating the length of a sequence takes some time, since the client has to download all the data and do the calculation locally. This is why you should use len(my_location['_id']) instead of len(my_location). Both should give the same result (unless the dataset changes between requests), but the former retrieves only data for the _id variable, while the later retrives data for all variables.

We can explicitly select just the first 5 profiles from our sequence:

:: python

>>> my_location = my_location[:5]
>>> len(my_location["_id"].iterdata())
5

And we can print the temperature profiles at each location. We’re going to use the coards module to convert the time to a Python datetime object:

>>> from coards import from_udunits
>>> for position in my_location.iterdata():
...     date = from_udunits(position.JULD.data, position.JULD.units.replace('GMT', '+0:00'))
...     print(position.LATITUDE.data, position.LONGITUDE.data, date)
...     print('=' * 40)
...     i = 0
...     for pressure, temperature in zip(position.profile.PRES, position.profile.TEMP):
...         print(pressure, temperature)
...         if i == 10:
...             print('...')
...             break
...         i += 1
-1.01 320.019 2009-05-03 11:42:34+00:00
========================================
5.0 28.59
10.0 28.788
15.0 28.867
20.0 28.916
25.0 28.94
30.0 28.846
35.0 28.566
40.0 28.345
45.0 28.05
50.0 27.595
55.0 27.061
...
-0.675 320.027 2006-12-25 13:24:11+00:00
========================================
5.0 27.675
10.0 27.638
15.0 27.63
20.0 27.616
25.0 27.617
30.0 27.615
35.0 27.612
40.0 27.612
45.0 27.605
50.0 27.577
55.0 27.536
...
-0.303 320.078 2007-01-12 11:30:31.001000+00:00
========================================
5.0 27.727
10.0 27.722
15.0 27.734
20.0 27.739
25.0 27.736
30.0 27.718
35.0 27.694
40.0 27.697
45.0 27.698
50.0 27.699
55.0 27.703
...
-1.229 320.095 2007-04-22 13:03:35.002000+00:00
========================================
5.0 28.634
10.0 28.71
15.0 28.746
20.0 28.758
25.0 28.755
30.0 28.747
35.0 28.741
40.0 28.737
45.0 28.739
50.0 28.748
55.0 28.806
...
-1.82 320.131 2003-04-09 13:20:03+00:00
========================================
5.1 28.618
9.1 28.621
19.4 28.637
29.7 28.662
39.6 28.641
49.6 28.615
59.7 27.6
69.5 26.956
79.5 26.133
89.7 23.937
99.2 22.029
...

These profiles could be easily plotted using matplotlib:

>>> for position in my_location.iterdata():
...     plot(position.profile.TEMP, position.profile.PRES)
>>> show()

You can also access the deep variables directly. When you iterate over these variables the client will download the data as nested lists:

>>> for value in my_location.profile.PRES.iterdata():
...     print(value[:10])
[5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0]
[5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0]
[5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0]
[5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0]
[5.0999999, 9.1000004, 19.4, 29.700001, 39.599998, 49.599998, 59.700001, 69.5, 79.5, 89.699997]

Authentication

Basic & Digest

To use Basic and Digest authentication, simply add your username and password to the dataset URL. Keep in mind that if the server only supports Basic authentication your credentials will be sent as plaintext, and could be sniffed on the network.:

>>> from dapclient.client import open_url
>>> dataset = open_url('http://username:password@server.example.com/path/to/dataset')

CAS

The Central Authentication Service (CAS) is a single sign-on protocol for the web, usually involving a web browser and cookies. Nevertheless it’s possible to use dapclient with an OPeNDAP server behind a CAS. The function install_cas_client below replaces dapclient’s default HTTP function with a new version able to submit authentication data to an HTML form and store credentials in cookies. (In this particular case, the server uses Javascript to redirect the browser to a new location, so the client has to parse the location from the Javascript code; other CAS would require a tweaked function.)

To use it, just attach a web browsing session with authentication cookies:

>>> from dapclient.client import open_url
>>> from dapclient.cas.get_cookies import setup_session
>>> session = setup_session(authentication_url, username, password)
>>> dataset = open_url('http://server.example.com/path/to/dataset', session=session)

This method could work but each CAS is slightly different and might require a specifically designed setup_session instance. Two CAS are however explicitly supported by dapclient:

URS NASA EARTHDATA

Authentication is done through a username and a password:

>>> from dapclient.client import open_url
>>> from dapclient.cas.urs import setup_session
>>> dataset_url = 'http://server.example.com/path/to/dataset'
>>> session = setup_session(username, password, check_url=dataset_url)
>>> dataset = open_url(dataset_url, session=session)

Earth System Grid Federation (ESGF)

Authentication is done through an openid and a password:

>>> from dapclient.client import open_url
>>> from dapclient.cas.esgf import setup_session
>>> dataset_url = 'http://server.example.com/path/to/dataset'
>>> session = setup_session(openid, password, check_url=dataset_url)
>>> dataset = open_url(dataset_url, session=session)

If your openid contains contains the string ceda.ac.uk authentication requires an additional username argument:

>>> from dapclient.client import open_url
>>> from dapclient.cas.esgf import setup_session
>>> session = setup_session(openid, password, check_url=dataset_url, username=username)
>>> dataset = open_url(dataset_url, session=session)

Advanced features

Calling server-side functions

When you open a remote dataset, the DatasetType object has a special attribute named functions that can be used to invoke any server-side functions. Here’s an example of using the geogrid function from Hyrax:

>>> dataset = open_url("http://test.opendap.org/dap/data/nc/coads_climatology.nc")
>>> new_dataset = dataset.functions.geogrid(dataset.SST, 10, 20, -10, 60)
>>> new_dataset.SST.shape
(12, 12, 21)
>>> new_dataset.SST.COADSY[:]
[-11.  -9.  -7.  -5.  -3.  -1.   1.   3.   5.   7.   9.  11.]
>>> new_dataset.SST.COADSX[:]
[ 21.  23.  25.  27.  29.  31.  33.  35.  37.  39.  41.  43.  45.  47.  49.
  51.  53.  55.  57.  59.  61.]

Unfortunately, there’s currently no standard mechanism to discover which functions the server support. The function attribute will accept any function name the user specifies, and will try to pass the call to the remote server.

Opening a specific URL

You can pass any URL to the open_url function, together with any valid constraint expression. Here’s an example of restricting values for the months of January, April, July and October:

>>> dataset = open_url(
...     "http://test.opendap.org/dap/data/nc/coads_climatology.nc?SST[0:3:11][0:1:89][0:1:179]"
... )
>>> dataset.SST.shape
(4, 90, 180)

This can be extremely useful for server side-processing; for example, we can create and access a new variable A in this dataset, equal to twice SSH:

>>> dataset = open_url(
...     "http://hycom.coaps.fsu.edu:8080/thredds/dodsC/las/dynamic/data_A5CDC5CAF9D810618C39646350F727FF.jnl_expr_%7B%7D%7Blet%20A=SSH*2%7D?A"
... )
>>> dataset.keys()
['A']

In this case, we’re using the Ferret syntax let A=SSH*2 to define the new variable, since the data is stored in an F-TDS server. Server-side processing is useful when you want to reduce the data before downloading it, to calculate a global average, for example.

Accessing raw data

The client module has a special function called open_dods, used to access raw data from a DODS response:

>>> from dapclient.client import open_dods
>>> dataset = open_dods_url(
...     "http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods?SST[0:3:11][0:1:89][0:1:179]"
... )

This function allows you to access raw data from any URL, including appending expressions to
>>> dataset = open_dods(
...     "http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods?SST[0:3:11][0:1:89][0:1:179]"
... )

This function allows you to access raw data from any URL, including appending expressions to F-TDS and GDS servers or calling server-side functions directly. By default this method downloads the data directly, and skips metadata from the DAS response; if you want to investigate and introspect datasets you should set the get_metadata parameter to true:

>>> dataset = open_dods(
...     "http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods?SST[0:3:11][0:1:89][0:1:179]",
...     get_metadata=True,
... )
>>> dataset.attributes["NC_GLOBAL"]["history"]
FERRET V4.30 (debug/no GUI) 15-Aug-96

Using a cache

You can specify a cache directory in the dapclient.lib.CACHE global variable. If this value is different than None, the client will try (if the server headers don’t prohibit) to cache the result, so repeated requests will be read from disk instead of the network:

>>> import dapclient.lib
>>> dapclient.lib.CACHE = "/tmp/dapclient-cache/"

Timeout

To specify a timeout for the client, just set the desired number of seconds using the timeout option to open_url(...) or open_dods(...). For example, the following commands would timeout after 30 seconds without receiving a response from the server:

>>> dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc', timeout=30)
>>> dataset = open_dods('http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods', timeout=30)

Configuring a proxy

It’s possible to configure dapclient to access the network through a proxy server. Here’s an example for an HTTP proxy running on localhost listening on port 8000:

>>> import httplib2
>>> from dapclient.util import socks
>>> import dapclient.lib
>>> dapclient.lib.PROXY = httplib2.ProxyInfo(
...         socks.PROXY_TYPE_HTTP, 'localhost', 8000)

This way, all further calls to dapclient.client.open_url will be routed through the proxy server. You can also authenticate to the proxy:

>>> dapclient.lib.PROXY = httplib2.ProxyInfo(
...         socks.PROXY_TYPE_HTTP, 'localhost', 8000,
...         proxy_user=USERNAME, proxy_pass=PASSWORD)

A user has reported that httplib2 has problems authenticating against a NTLM proxy server. In this case, a simple solution is to change the dapclient.http.request function to use urllib2 instead of httplib2, monkeypatching the code like in the CAS authentication example above:

import urllib2
import logging


def install_urllib2_client():
    def new_request(url):
        log = logging.getLogger("dapclient")
        log.INFO("Opening %s" % url)

        f = urllib2.urlopen(url.rstrip("?&"))
        headers = dict(f.info().items())
        body = f.read()
        return headers, body

    from dapclient.util import http

    http.request = new_request

The function install_urllib2_client should then be called before doing any requests.