Using the client¶
dapclient can be used as a client to inspect and retrieve data from any of the hundreds of scientific datasets available on the internet on OPeNDAP servers. This way, it’s possible to introspect and manipulate a dataset as if it were stored locally, with data being downloaded on-the-fly as necessary.
Accessing gridded data¶
Let’s start by accessing gridded data, i.e., data that is stored as a regular multidimensional array. Here’s a simple example where we access the COADS climatology from the official OPeNDAP server:
>>> from dapclient.client import open_url
>>> dataset = open_url("http://test.opendap.org/dap/data/nc/coads_climatology.nc")
>>> type(dataset)
<class 'dapclient.model.DatasetType'>
Here we use the dapclient.client.open_url function to open a URL specifying the location of the dataset; this URL should be stripped of the extensions commonly used for OPeNDAP datasets, like .dds or .das. When we access the remote dataset the function returns a DatasetType object, which is a Structure (a fancy dictionary that stores other variables). We can check the names of the stored variables like we would do with a Python dictionary:
>>> list(dataset.keys())
['COADSX', 'COADSY', 'TIME', 'SST', 'AIRT', 'UWND', 'VWND']
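The note above about stripping extensions such as .dds or .das can be sketched with the standard library alone. The helper name `strip_dap_suffix` and the suffix list are assumptions made for illustration; they are not part of dapclient.

```python
from urllib.parse import urlsplit, urlunsplit

# Response suffixes a server may append to a dataset URL (assumed list).
DAP_SUFFIXES = (".dds", ".das", ".dods")


def strip_dap_suffix(url):
    """Return url without a trailing DAP response suffix (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path
    for suffix in DAP_SUFFIXES:
        if path.endswith(suffix):
            path = path[: -len(suffix)]
            break
    # Rebuild the URL, leaving any query string or fragment untouched.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


clean = strip_dap_suffix("http://test.opendap.org/dap/data/nc/coads_climatology.nc.dds")
print(clean)
```

The cleaned URL is the form that would then be handed to open_url.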
Let’s work with the SST variable; we can reference it using the usual dictionary syntax of dataset['SST'], or using the “lazy” syntax dataset.SST:
>>> sst = dataset["SST"] # or dataset.SST
>>> type(sst)
<class 'dapclient.model.GridType'>
Note that the variable is of type GridType, a multidimensional array with specific axes defining each of its dimensions:
>>> sst.dimensions
('TIME', 'COADSY', 'COADSX')
>>> sst.maps
OrderedDict([('TIME', <BaseType with data BaseProxy('http://test.opendap.org/dap/data/nc/coads_climatology.nc', 'SST.TIME', dtype('>f8'), (12,), (slice(None, None, None),))>), ('COADSY', <BaseType with data BaseProxy('http://test.opendap.org/dap/data/nc/coads_climatology.nc', 'SST.COADSY', dtype('>f8'), (90,), (slice(None, None, None),))>), ('COADSX', <BaseType with data BaseProxy('http://test.opendap.org/dap/data/nc/coads_climatology.nc', 'SST.COADSX', dtype('>f8'), (180,), (slice(None, None, None),))>)])
Each map is also, in turn, a variable that can be accessed using the same syntax used for Structures:
>>> sst.TIME
<BaseType with data BaseProxy('http://test.opendap.org/dap/data/nc/coads_climatology.nc', 'SST.TIME', dtype('>f8'), (12,), (slice(None, None, None),))>
The axes are all of type BaseType. This is the OPeNDAP equivalent of a multidimensional array, with a specific shape and type. Even though no data have been downloaded up to this point, we can introspect these attributes from the axes or from the Grid itself:
>>> sst.shape
(12, 90, 180)
>>> sst.dtype
dtype('>f4')
>>> sst.TIME.shape
(12,)
>>> sst.TIME.dtype
dtype('>f8')
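Because shape and dtype are known before any data is transferred, they can be used to estimate how large a request would be. The following is a minimal sketch using only the standard library; the helper `nbytes` is hypothetical, and the struct format codes `>f` and `>d` are used here as stand-ins for the dtypes `>f4` and `>f8` shown above.

```python
import math
import struct


def nbytes(shape, fmt):
    """Bytes needed for an array of the given shape and struct format code."""
    return math.prod(shape) * struct.calcsize(fmt)


full_sst = nbytes((12, 90, 180), ">f")  # dtype('>f4'): big-endian 32-bit float
time_axis = nbytes((12,), ">d")         # dtype('>f8'): big-endian 64-bit float
print(full_sst, time_axis)
```

This counts only the raw array values; any protocol overhead in the server response is not included.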
We can also introspect the variable attributes; they are stored in an attribute appropriately called attributes, and they can also be accessed with a “lazy” syntax:
>>> import pprint
>>> pprint.pprint(sst.attributes)
{'_FillValue': -9.99999979e+33,
'history': 'From coads_climatology',
'long_name': 'SEA SURFACE TEMPERATURE',
'missing_value': -9.99999979e+33,
'units': 'Deg C'}
>>> sst.units
'Deg C'
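One common way to provide this kind of “lazy” lookup in Python is a `__getattr__` fallback that consults a dictionary. The class below is an illustrative sketch only; it is not dapclient's implementation, and the name `Variable` is made up for the example.

```python
class Variable:
    """Toy variable whose metadata can be read as plain Python attributes."""

    def __init__(self, name, attributes):
        self.name = name
        self.attributes = dict(attributes)

    def __getattr__(self, key):
        # Called only when normal attribute lookup fails, so real attributes
        # such as .name and .attributes are never shadowed.
        try:
            return self.__dict__["attributes"][key]
        except KeyError:
            raise AttributeError(key) from None


sst_like = Variable("SST", {"units": "Deg C", "long_name": "SEA SURFACE TEMPERATURE"})
print(sst_like.units)
```

With this pattern, `sst_like.units` and `sst_like.attributes["units"]` return the same value, while a missing key still raises AttributeError as Python code expects.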
Finally, we can also download some data. To download data we simply access it like we would access a NumPy array, and the data for the corresponding subset will be downloaded on the fly from the server:
>>> sst.shape
(12, 90, 180)
>>> grid = sst[0, 10:14, 10:14] # this will download data from the server
>>> grid
<GridType with array 'SST' and maps 'TIME', 'COADSY', 'COADSX'>
The data itself can be accessed in the array attribute of the Grid, and also on the individual axes:
>>> grid.array[:]
<BaseType with data array([[[ -1.26285708e+00, -9.99999979e+33, -9.99999979e+33,
-9.99999979e+33],
[ -7.69166648e-01, -7.79999971e-01, -6.75454497e-01,
-5.95714271e-01],
[ 1.28333330e-01, -5.00000156e-02, -6.36363626e-02,
-1.41666666e-01],
[ 6.38000011e-01, 8.95384610e-01, 7.21666634e-01,
8.10000002e-01]]], dtype=float32)>
>>> print(grid.array[:].data)
[[[ -1.26285708e+00 -9.99999979e+33 -9.99999979e+33 -9.99999979e+33]
[ -7.69166648e-01 -7.79999971e-01 -6.75454497e-01 -5.95714271e-01]
[ 1.28333330e-01 -5.00000156e-02 -6.36363626e-02 -1.41666666e-01]
[ 6.38000011e-01 8.95384610e-01 7.21666634e-01 8.10000002e-01]]]
>>> grid.COADSX[:]
<BaseType with data array([ 41., 43., 45., 47.])>
>>> print(grid.COADSX[:].data)
[ 41. 43. 45. 47.]
Alternatively, we could have downloaded the data directly, skipping the axes:
>>> print(sst.array[0, 10:14, 10:14].data)
[[[ -1.26285708e+00 -9.99999979e+33 -9.99999979e+33 -9.99999979e+33]
[ -7.69166648e-01 -7.79999971e-01 -6.75454497e-01 -5.95714271e-01]
[ 1.28333330e-01 -5.00000156e-02 -6.36363626e-02 -1.41666666e-01]
[ 6.38000011e-01 8.95384610e-01 7.21666634e-01 8.10000002e-01]]]
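The -9.99999979e+33 entries in the output above match the _FillValue listed in sst.attributes, so they should be excluded before doing any statistics. Here is a minimal standard-library sketch, using a few values copied from the output; the helper `mask_fill` and the tolerance are assumptions for illustration.

```python
import math

FILL_VALUE = -9.99999979e33  # the _FillValue reported in sst.attributes

# The first two rows of the subset printed above.
rows = [
    [-1.26285708e00, -9.99999979e33, -9.99999979e33, -9.99999979e33],
    [-7.69166648e-01, -7.79999971e-01, -6.75454497e-01, -5.95714271e-01],
]


def mask_fill(rows, fill_value):
    """Replace fill values with NaN so they can be skipped later."""
    return [
        [math.nan if math.isclose(v, fill_value, rel_tol=1e-6) else v for v in row]
        for row in rows
    ]


masked = mask_fill(rows, FILL_VALUE)
valid = [v for row in masked for v in row if not math.isnan(v)]
print(len(valid), sum(valid) / len(valid))
```

With NumPy arrays the same idea is usually expressed with a masked array, but the plain-Python version keeps the example free of extra dependencies.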
Older Servers¶
Some servers running a very old OPeNDAP application might run out of memory when attempting to retrieve both the data and the coordinate axes of a variable. The workaround is to simply disable the retrieval of coordinate axes by using the output_grid option to open_url:
>>> from dapclient.client import open_url
>>> dataset = open_url(
... "http://test.opendap.org/dap/data/nc/coads_climatology.nc", output_grid=False
... )
>>> sst = dataset["SST"]  # rebind sst to the dataset opened with output_grid=False
>>> grid = sst[0, 10:14, 10:14] # this will download data from the server
>>> grid
<GridType with array 'SST' and maps 'TIME', 'COADSY', 'COADSX'>
Accessing sequential data¶
Now let’s see an example of accessing sequential data. Sequential data consists of one or more records of related variables, such as simultaneous measurements of temperature and wind velocity. In this example we’re going to access data from the Argo project, consisting of profiles made by autonomous buoys drifting on the ocean:
>>> from dapclient.client import open_url
>>> dataset = open_url("http://dapper.pmel.noaa.gov/dapper/argo/argo_all.cdp")
This dataset is fairly complex, with several variables representing heterogeneous 4D data. The layout of the dataset follows the Dapper in-situ conventions, consisting of two nested sequences: the outer sequence contains, in this case, a latitude, longitude and time variable, while the inner sequence contains measurements along a z axis.
The first thing we’d like to do is limit our region; let’s work with a small region in the Tropical Atlantic:
>>> type(dataset.location)
<class 'dapclient.model.SequenceType'>
>>> dataset.location.keys()
['LATITUDE', 'JULD', 'LONGITUDE', '_id', 'profile', 'attributes', 'variable_attributes']
>>> my_location = dataset.location[
... (dataset.location.LATITUDE > -2)
... & (dataset.location.LATITUDE < 2)
... & (dataset.location.LONGITUDE > 320)
... & (dataset.location.LONGITUDE < 330)
... ]
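The longitude bounds in the filter above (320 to 330) appear to follow a 0 to 360 degrees-east convention, which is consistent with the longitudes printed later for this dataset. If your region is expressed in the -180 to 180 convention, it has to be converted first. A small sketch, with a hypothetical helper name:

```python
def to_0_360(longitude):
    """Map a longitude in degrees east from [-180, 180] to [0, 360)."""
    return longitude % 360.0


# 40W to 30W, written as negative degrees east.
west, east = to_0_360(-40.0), to_0_360(-30.0)
print(west, east)
```

The resulting values are the ones used as the LONGITUDE bounds in the filter.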
Note that the variable dataset.location is of type SequenceType, which is also a Structure that holds other variables. Here we’re limiting the sequence dataset.location to measurements between given latitude and longitude boundaries. Let’s access the identification number of the first 10-or-so profiles:
>>> for i, id_ in enumerate(my_location['_id'].iterdata()):
...     print(id_)
...     if i == 10:
...         print('...')
...         break
1125393
835304
839894
875344
110975
864748
832685
887712
962673
881368
1127922
...
>>> len(my_location['_id'].iterdata())
623
Note that calculating the length of a sequence takes some time, since the client has to download all the data and do the calculation locally. This is why you should use len(my_location['_id']) instead of len(my_location). Both should give the same result (unless the dataset changes between requests), but the former retrieves only data for the _id variable, while the latter retrieves data for all variables.
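The enumerate-and-break loop used earlier to peek at the first few records can also be written with itertools.islice, which stops consuming an iterator after a fixed number of items. The generator below is a stand-in so the sketch runs without a server; with dapclient the iterator would be my_location['_id'].iterdata(). Whether stopping early also reduces what is transferred depends on how the client streams the response, which is not covered here.

```python
from itertools import islice


def record_ids():
    """Stand-in generator for my_location['_id'].iterdata()."""
    yield from (1125393, 835304, 839894, 875344, 110975)


# Take only the first three identifiers and stop iterating.
first_three = list(islice(record_ids(), 3))
print(first_three)
```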
We can explicitly select just the first 5 profiles from our sequence:
>>> my_location = my_location[:5]
>>> len(my_location["_id"].iterdata())
5
And we can print the temperature profiles at each location. We’re going to use the coards module to convert the time to a Python datetime object:
>>> from coards import from_udunits
>>> for position in my_location.iterdata():
...     date = from_udunits(position.JULD.data, position.JULD.units.replace('GMT', '+0:00'))
...     print(position.LATITUDE.data, position.LONGITUDE.data, date)
...     print('=' * 40)
...     i = 0
...     for pressure, temperature in zip(position.profile.PRES, position.profile.TEMP):
...         print(pressure, temperature)
...         if i == 10:
...             print('...')
...             break
...         i += 1
-1.01 320.019 2009-05-03 11:42:34+00:00
========================================
5.0 28.59
10.0 28.788
15.0 28.867
20.0 28.916
25.0 28.94
30.0 28.846
35.0 28.566
40.0 28.345
45.0 28.05
50.0 27.595
55.0 27.061
...
-0.675 320.027 2006-12-25 13:24:11+00:00
========================================
5.0 27.675
10.0 27.638
15.0 27.63
20.0 27.616
25.0 27.617
30.0 27.615
35.0 27.612
40.0 27.612
45.0 27.605
50.0 27.577
55.0 27.536
...
-0.303 320.078 2007-01-12 11:30:31.001000+00:00
========================================
5.0 27.727
10.0 27.722
15.0 27.734
20.0 27.739
25.0 27.736
30.0 27.718
35.0 27.694
40.0 27.697
45.0 27.698
50.0 27.699
55.0 27.703
...
-1.229 320.095 2007-04-22 13:03:35.002000+00:00
========================================
5.0 28.634
10.0 28.71
15.0 28.746
20.0 28.758
25.0 28.755
30.0 28.747
35.0 28.741
40.0 28.737
45.0 28.739
50.0 28.748
55.0 28.806
...
-1.82 320.131 2003-04-09 13:20:03+00:00
========================================
5.1 28.618
9.1 28.621
19.4 28.637
29.7 28.662
39.6 28.641
49.6 28.615
59.7 27.6
69.5 26.956
79.5 26.133
89.7 23.937
99.2 22.029
...
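The from_udunits call in the loop above turns a numeric time and a units string into a datetime. If the coards module is unavailable, the same conversion for a "days since" unit can be sketched with the standard library. The reference date of 1950-01-01 UTC is an assumption here; the real one should be read from position.JULD.units. The helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed reference date; check position.JULD.units for the actual value.
EPOCH = datetime(1950, 1, 1, tzinfo=timezone.utc)


def from_days_since(days, epoch=EPOCH):
    """Convert a count of days since the epoch to a timezone-aware datetime."""
    return epoch + timedelta(days=days)


print(from_days_since(1.5))
```

This handles only units of the form "days since ..."; from_udunits parses the units string itself and supports other time units.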
These profiles could be easily plotted using matplotlib:
>>> from matplotlib.pyplot import plot, show
>>> for position in my_location.iterdata():
...     plot(position.profile.TEMP, position.profile.PRES)
>>> show()
You can also access the deep variables directly. When you iterate over these variables the client will download the data as nested lists:
>>> for value in my_location.profile.PRES.iterdata():
...     print(value[:10])
[5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0]
[5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0]
[5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0]
[5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0]
[5.0999999, 9.1000004, 19.4, 29.700001, 39.599998, 49.599998, 59.700001, 69.5, 79.5, 89.699997]
Authentication¶
Basic & Digest¶
To use Basic and Digest authentication, simply add your username and password to the dataset URL. Keep in mind that if the server only supports Basic authentication your credentials will be sent as plaintext, and could be sniffed on the network:
>>> from dapclient.client import open_url
>>> dataset = open_url('http://username:password@server.example.com/path/to/dataset')
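Usernames and passwords that contain characters such as "@", ":" or "/" cannot be pasted into a URL as they are; they need to be percent-encoded first. The sketch below builds such a URL with the standard library. The helper `add_credentials` is hypothetical, and it is an assumption that the client decodes percent-encoded credentials before sending them.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def add_credentials(url, username, password):
    """Return url with percent-encoded user info inserted before the host."""
    parts = urlsplit(url)
    # safe="" forces reserved characters such as "@", ":" and "/" to be encoded.
    userinfo = "%s:%s" % (quote(username, safe=""), quote(password, safe=""))
    netloc = "%s@%s" % (userinfo, parts.netloc)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


url = add_credentials(
    "http://server.example.com/path/to/dataset", "user@example.com", "p:ss/word"
)
print(url)
```

Avoid hard-coding real passwords in scripts; reading them from the environment or from getpass.getpass() keeps them out of source control.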
CAS¶
The Central Authentication Service (CAS) is a single sign-on protocol for the web, usually involving a web browser and cookies. Nevertheless it’s possible to use dapclient with an OPeNDAP server behind a CAS. The function install_cas_client below replaces dapclient’s default HTTP function with a new version able to submit authentication data to an HTML form and store credentials in cookies. (In this particular case, the server uses JavaScript to redirect the browser to a new location, so the client has to parse the location from the JavaScript code; other CAS deployments would require a tweaked function.)
To use it, just attach a web browsing session with authentication cookies:
>>> from dapclient.client import open_url
>>> from dapclient.cas.get_cookies import setup_session
>>> session = setup_session(authentication_url, username, password)
>>> dataset = open_url('http://server.example.com/path/to/dataset', session=session)
This generic method may work, but each CAS is slightly different and might require a purpose-built setup_session. Two CAS deployments are, however, explicitly supported by dapclient:
URS NASA EARTHDATA¶
Authentication is done through a username and a password:
>>> from dapclient.client import open_url
>>> from dapclient.cas.urs import setup_session
>>> dataset_url = 'http://server.example.com/path/to/dataset'
>>> session = setup_session(username, password, check_url=dataset_url)
>>> dataset = open_url(dataset_url, session=session)
Earth System Grid Federation (ESGF)¶
Authentication is done through an openid and a password:
>>> from dapclient.client import open_url
>>> from dapclient.cas.esgf import setup_session
>>> dataset_url = 'http://server.example.com/path/to/dataset'
>>> session = setup_session(openid, password, check_url=dataset_url)
>>> dataset = open_url(dataset_url, session=session)
If your openid contains the string ceda.ac.uk, authentication requires an additional username argument:
>>> from dapclient.client import open_url
>>> from dapclient.cas.esgf import setup_session
>>> session = setup_session(openid, password, check_url=dataset_url, username=username)
>>> dataset = open_url(dataset_url, session=session)
Advanced features¶
Calling server-side functions¶
When you open a remote dataset, the DatasetType object has a special attribute named functions that can be used to invoke any server-side functions. Here’s an example of using the geogrid function from Hyrax:
>>> dataset = open_url("http://test.opendap.org/dap/data/nc/coads_climatology.nc")
>>> new_dataset = dataset.functions.geogrid(dataset.SST, 10, 20, -10, 60)
>>> new_dataset.SST.shape
(12, 12, 21)
>>> new_dataset.SST.COADSY[:]
[-11. -9. -7. -5. -3. -1. 1. 3. 5. 7. 9. 11.]
>>> new_dataset.SST.COADSX[:]
[ 21. 23. 25. 27. 29. 31. 33. 35. 37. 39. 41. 43. 45. 47. 49.
51. 53. 55. 57. 59. 61.]
Unfortunately, there’s currently no standard mechanism to discover which functions the server supports. The functions attribute will accept any function name the user specifies, and will try to pass the call to the remote server.
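To make the idea of "passing the call to the remote server" more concrete, the sketch below formats a function call as a constraint-expression string of the form name(arg,...). The helper `function_call` is hypothetical, and the exact encoding dapclient sends is an assumption; only the geogrid arguments are taken from the example above.

```python
def function_call(name, *args):
    """Format a server-side function call as a constraint-expression string."""
    return "%s(%s)" % (name, ",".join(str(arg) for arg in args))


ce = function_call("geogrid", "SST", 10, 20, -10, 60)
# Hypothetical request URL: the dataset URL, a response suffix, then the call.
url = "http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods?" + ce
print(url)
```

Because the client cannot validate function names, a misspelled name would only be reported by the server when the request is made.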
Opening a specific URL¶
You can pass any URL to the open_url function, together with any valid constraint expression. Here’s an example of restricting values for the months of January, April, July and October:
>>> dataset = open_url(
... "http://test.opendap.org/dap/data/nc/coads_climatology.nc?SST[0:3:11][0:1:89][0:1:179]"
... )
>>> dataset.SST.shape
(4, 90, 180)
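The bracketed part of the constraint expression above uses the DAP2 hyperslab form [start:stride:stop], where stop is inclusive, unlike Python slices. The sketch below translates Python-style indices into that form; the helper `hyperslab` is hypothetical and handles only non-negative integers and slices with a positive step.

```python
def hyperslab(index, shape):
    """Translate a tuple of ints/slices into a DAP2 hyperslab string."""
    parts = []
    for item, size in zip(index, shape):
        if isinstance(item, int):
            item = slice(item, item + 1)  # a single index is a length-1 slice
        start, stop, step = item.indices(size)
        if step <= 0 or stop <= start:
            raise ValueError("only non-empty slices with a positive step are handled")
        # DAP2 writes [start:stride:stop] and includes the stop index.
        parts.append("[%d:%d:%d]" % (start, step, stop - 1))
    return "".join(parts)


every_third_month = hyperslab((slice(None, None, 3), slice(None), slice(None)), (12, 90, 180))
print("SST" + every_third_month)
```

The same helper gives [0:1:0][10:1:13][10:1:13] for the index (0, 10:14, 10:14) used in the gridded example earlier.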
This can be extremely useful for server-side processing; for example, we can create and access a new variable A in this dataset, equal to twice SSH:
>>> dataset = open_url(
... "http://hycom.coaps.fsu.edu:8080/thredds/dodsC/las/dynamic/data_A5CDC5CAF9D810618C39646350F727FF.jnl_expr_%7B%7D%7Blet%20A=SSH*2%7D?A"
... )
>>> dataset.keys()
['A']
In this case, we’re using the Ferret syntax let A=SSH*2 to define the new variable, since the data is stored in an F-TDS server.
Server-side processing is useful when you want to reduce the data before
downloading it, to calculate a global average, for example.
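The percent-encoded part of the F-TDS URL above (%7B, %7D, %20) is simply the text {}{let A=SSH*2} made URL-safe. A sketch of assembling such a URL with the standard library follows; the helper `ftds_expr_url` is hypothetical and mirrors only the pattern visible in the example URL.

```python
from urllib.parse import quote


def ftds_expr_url(base_url, expression, variable):
    """Append a Ferret expression to an F-TDS dataset URL and request one variable."""
    # Keep "=" and "*" literal, as in the example URL; encode braces and spaces.
    encoded = quote("{}{%s}" % expression, safe="=*")
    return "%s_expr_%s?%s" % (base_url, encoded, variable)


base = (
    "http://hycom.coaps.fsu.edu:8080/thredds/dodsC/las/dynamic/"
    "data_A5CDC5CAF9D810618C39646350F727FF.jnl"
)
print(ftds_expr_url(base, "let A=SSH*2", "A"))
```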
Accessing raw data¶
The client module has a special function called open_dods, used to access raw data from a DODS response:
>>> from dapclient.client import open_dods
>>> dataset = open_dods(
...     "http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods?SST[0:3:11][0:1:89][0:1:179]"
... )
This function allows you to access raw data from any URL, including appending expressions to F-TDS and GDS servers or calling server-side functions directly. By default this function downloads the data directly, and skips metadata from the DAS response; if you want to investigate and introspect datasets you should set the get_metadata parameter to True:
>>> dataset = open_dods(
... "http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods?SST[0:3:11][0:1:89][0:1:179]",
... get_metadata=True,
... )
>>> dataset.attributes["NC_GLOBAL"]["history"]
'FERRET V4.30 (debug/no GUI) 15-Aug-96'
Using a cache¶
You can specify a cache directory in the dapclient.lib.CACHE global variable. If this value is not None, the client will try (if the server headers don’t prohibit it) to cache the result, so repeated requests will be read from disk instead of the network:
>>> import dapclient.lib
>>> dapclient.lib.CACHE = "/tmp/dapclient-cache/"
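The hard-coded /tmp path above is not portable across operating systems. A small sketch of choosing and creating a cache directory with the standard library follows; the directory name is arbitrary, and whether dapclient creates a missing directory by itself is not stated here, so the sketch creates it explicitly.

```python
import os
import tempfile

# Build a per-platform cache path under the system temporary directory.
cache_dir = os.path.join(tempfile.gettempdir(), "dapclient-cache")
os.makedirs(cache_dir, exist_ok=True)  # no error if it already exists
print(cache_dir)

# With dapclient installed, the path would then be assigned as shown above:
# import dapclient.lib
# dapclient.lib.CACHE = cache_dir
```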
Timeout¶
To specify a timeout for the client, just set the desired number of seconds using the timeout option to open_url(...) or open_dods(...). For example, the following commands would time out after 30 seconds without receiving a response from the server:
>>> dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc', timeout=30)
>>> dataset = open_dods('http://test.opendap.org/dap/data/nc/coads_climatology.nc.dods', timeout=30)
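A timeout is often paired with a retry, since a slow server may respond on a later attempt. The wrapper below is a generic sketch, not part of dapclient: `with_retries` is a hypothetical helper, and because the exception raised on a timeout is not documented here, the caller passes the exception types to retry on.

```python
import time


def with_retries(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(delay * (2 ** attempt))


# Stand-in for a call such as: lambda: open_url(url, timeout=30)
calls = []


def flaky_open():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("no response from server")
    return "dataset"


result = with_retries(flaky_open, attempts=3, exceptions=(TimeoutError,))
print(result, len(calls))
```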
Configuring a proxy¶
It’s possible to configure dapclient to access the network through a proxy server. Here’s an example for an HTTP proxy running on localhost listening on port 8000:
>>> import httplib2
>>> from dapclient.util import socks
>>> import dapclient.lib
>>> dapclient.lib.PROXY = httplib2.ProxyInfo(
... socks.PROXY_TYPE_HTTP, 'localhost', 8000)
This way, all further calls to dapclient.client.open_url will be routed through the proxy server. You can also authenticate to the proxy:
>>> dapclient.lib.PROXY = httplib2.ProxyInfo(
... socks.PROXY_TYPE_HTTP, 'localhost', 8000,
... proxy_user=USERNAME, proxy_pass=PASSWORD)
A user has reported that httplib2 has problems authenticating against an NTLM proxy server. In this case, a simple solution is to change the request function of dapclient’s http module to use the standard library’s urllib.request (urllib2 on Python 2) instead of httplib2, monkeypatching the code:
import logging
import urllib.request

def install_urllib2_client():
    """Replace dapclient's HTTP request function with a urllib-based one."""

    def new_request(url):
        log = logging.getLogger("dapclient")
        log.info("Opening %s", url)
        # Strip any trailing "?" or "&" from the URL before opening it.
        f = urllib.request.urlopen(url.rstrip("?&"))
        headers = dict(f.info().items())
        body = f.read()
        return headers, body

    from dapclient.util import http
    http.request = new_request
The function install_urllib2_client should then be called before doing any requests.