Developer documentation¶
This documentation is intended for other developers who would like to contribute to the development of dapclient, or extend it in new ways. It assumes that you have a basic knowledge of Python and HTTP, and understand how data is stored in different formats. It also assumes some familiarity with the Data Access Protocol, though a lot of its inner workings will be explained in detail here.
The DAP data model¶
The DAP is a protocol designed for the efficient transmission of scientific data over the internet. In order to transmit data from the server to a client, both must agree on a way to represent data: is it an array of integers? A multi-dimensional grid? In order to do this, the specification defines a data model that, in theory, should be able to represent any existing dataset.
Metadata¶
dapclient has a series of classes in the dapclient.model module, representing the DAP data model. The most fundamental data type is called BaseType, and it represents a value or an array of values. Here is an example of creating one of these objects:
Note
Prior to dapclient 3.2, the name argument was optional for all data types. Since dapclient 3.2, it is mandatory.
>>> from dapclient.model import *
>>> import numpy as np
>>> a = BaseType(name="a", data=np.array([1]), attributes={"long_name": "variable a"})
All dapclient types have five attributes in common. The first one is the name of the variable; in this case, our variable is called “a”:
>>> a.name
'a'
Note that there’s a difference between the variable name (the local name a) and its attribute name; in this example they are equal, but we could reference our object using any other name:
>>> b = a # b now points to a
>>> b.name
'a'
We can use special characters for the variable names; they will be quoted accordingly:
>>> c = BaseType(name="long & complicated")
>>> c.name
'long%20%26%20complicated'
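The encoding shown above is standard URL percent-encoding. While dapclient quotes names internally, the same rules can be illustrated with Python's standard library (a sketch for illustration only, not dapclient's actual code path):

```python
from urllib.parse import quote, unquote

# Percent-encode a variable name the same way the example above shows:
# spaces become %20 and "&" becomes %26.
encoded = quote("long & complicated")
print(encoded)           # long%20%26%20complicated

# The encoding is reversible.
print(unquote(encoded))  # long & complicated
```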
The second attribute is called id. In the examples we’ve seen so far, id and name are equal:
>>> a.name
'a'
>>> a.id
'a'
>>> c.name
'long%20%26%20complicated'
>>> c.id
'long%20%26%20complicated'
This is because the id is used to show the position of the variable in a given dataset, and in these examples the variables do not belong to any dataset. First, let’s store our variables in a container object called StructureType. A StructureType is a special type of ordered dictionary that holds other dapclient types:
>>> s = StructureType("s")
>>> s["a"] = a
>>> s["c"] = c
Traceback (most recent call last):
...
KeyError: 'Key "c" is different from variable name "long%20%26%20complicated"!'
Note that the variable name has to be used as its key in the StructureType. This can be easily remedied:
>>> s[c.name] = c
There is a special derivative of StructureType called DatasetType, which represents the dataset. The difference between the two is that there should be only one DatasetType, though it may contain any number of StructureType objects, which can be deeply nested. Let’s create our dataset object:
>>> dataset = DatasetType(name="example")
>>> dataset["s"] = s
>>> dataset.id
'example'
>>> dataset["s"].id
's'
>>> dataset["s"]["a"].id
's.a'
Note that for objects on the first level of the dataset, like s, the id is identical to the name. Deeper objects, like a (which is stored in s), have their id calculated by joining the names of the variables with a period. One detail is that we can access variables stored in a structure using a “lazy” syntax like this:
>>> dataset.s.a.id
's.a'
The third common attribute that variables share is called attributes, which holds most of the variable’s metadata. This attribute is a dictionary of keys and values, and the values themselves can also be dictionaries. For our variable a we have:
>>> a.attributes
{'long_name': 'variable a'}
These attributes can be accessed lazily directly from the variable:
>>> a.long_name
'variable a'
But if you want to create a new attribute you’ll have to insert it directly into attributes:
>>> a.history = "Created by me"
>>> a.attributes
{'long_name': 'variable a'}
>>> a.attributes["history"] = "Created by me"
>>> sorted(a.attributes.items())
[('history', 'Created by me'),
('long_name', 'variable a')]
It’s always better to use the explicit syntax instead of the lazy one when writing code. Use the lazy syntax only when introspecting a dataset in the Python interpreter, to save a few keystrokes.
The fourth attribute is called data, and it holds a representation of the actual data. We’ll take a detailed look at this attribute in the next subsection.
Note
Prior to dapclient 3.2, all variables also had an attribute called _nesting_level. This attribute had value 1 if the variable was inside a SequenceType object, 0 if it was outside, and a value greater than 1 if it was inside a nested sequence. Since dapclient 3.2, _nesting_level has been deprecated, and there is no intrinsic way of finding where in a deep object a variable is located.
Data¶
As we saw in the last subsection, all dapclient objects have a data attribute that holds a representation of the variable’s data. This representation varies depending on the variable type.
BaseType¶
For simple BaseType objects the data attribute is usually a Numpy array, though we can also use a Numpy scalar or a Python number:
>>> a = BaseType(name="a", data=np.array(1))
>>> a.data
array(1)
>>> b = BaseType(name="b", data=np.arange(4))
>>> b.data
array([0, 1, 2, 3])
Note that starting from dapclient 3.2 the datatype is inferred from the input data:
>>> a.dtype
dtype('int64')
>>> b.dtype
dtype('int64')
When you slice a BaseType array, the slice is simply passed on to the data attribute. So we may have:
>>> b[-1]
<BaseType with data array(3)>
>>> b[-1].data
array(3)
>>> b[:2]
<BaseType with data array([0, 1])>
>>> b[:2].data
array([0, 1])
You can think of a BaseType object as a thin layer around Numpy arrays, until you realize that the data attribute can be any object implementing the array interface! This is how the DAP client works: instead of assigning an array with data directly to the attribute, we assign a special object which behaves like an array and acts as a proxy to a remote dataset.
Here’s an example:
>>> from dapclient.handlers.dap import BaseProxyDap2
>>> pseudo_array = BaseProxyDap2(
... "http://test.opendap.org/dap/data/nc/coads_climatology.nc",
... "SST.SST",
... np.float64,
... (12, 90, 180),
... )
>>> print(pseudo_array[0, 10:14, 10:14]) # download the corresponding data
[[[ -1.26285708e+00 -9.99999979e+33 -9.99999979e+33 -9.99999979e+33]
[ -7.69166648e-01 -7.79999971e-01 -6.75454497e-01 -5.95714271e-01]
[ 1.28333330e-01 -5.00000156e-02 -6.36363626e-02 -1.41666666e-01]
[ 6.38000011e-01 8.95384610e-01 7.21666634e-01 8.10000002e-01]]]
In the example above, the data is only downloaded in the last line, when the pseudo array is sliced. The object will construct the appropriate DAP URL, request the data, unpack it and return a Numpy array.
StructureType¶
A StructureType holds no data; instead, its data attribute is a property that collects data from the children variables:
>>> s = StructureType(name="s")
>>> s[a.name] = a
>>> s[b.name] = b
>>> a.data
array(1)
>>> b.data
array([0, 1, 2, 3])
>>> print(s.data)
[array(1), array([0, 1, 2, 3])]
The opposite is also true; it’s possible to specify the structure data and have it propagated to the children:
>>> s.data = (1, 2)
>>> print(s.a.data)
1
>>> print(s.b.data)
2
The same is true for objects of DatasetType, since the dataset is simply the root structure.
SequenceType¶
A SequenceType object is a special kind of StructureType holding sequential data. Here’s an example of a sequence holding the variables a and c that we created before:
>>> s = SequenceType(name="s")
>>> s[a.name] = a
>>> s[c.name] = c
Let’s add some data to our sequence. This can be done by assigning a structured Numpy array to the data attribute:
>>> print(s)
<SequenceType with children 'a', 'long%20%26%20complicated'>
>>> test_data = np.array(
... [(1, 10), (2, 20), (3, 30)],
... dtype=np.dtype([("a", np.int32), ("long%20%26%20complicated", np.int16)]),
... )
>>> s.data = test_data
>>> print(s.data)
[(1, 10) (2, 20) (3, 30)]
Note that the data for the sequence is an aggregation of the children’s data, similar to Python’s zip() builtin. Things become more complicated with nested sequences, but flat sequences behave exactly like zip().
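The zip() comparison can be checked with plain Python, using the same three records as above (a minimal sketch, no dapclient required):

```python
# The sequence data (1, 10), (2, 20), (3, 30) is just the children's
# data aggregated record by record, exactly what zip() produces:
a_data = [1, 2, 3]
c_data = [10, 20, 30]

records = list(zip(a_data, c_data))
print(records)  # [(1, 10), (2, 20), (3, 30)]
```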
We can also iterate over the SequenceType. In this case, it will return a series of tuples with the data:
>>> for record in s.iterdata():
... print(record)
...
(1, 10)
(2, 20)
(3, 30)
Prior to dapclient 3.2.2, this approach was not possible, and one had to iterate directly over the SequenceType:
>>> for record in s:
... print(record)
...
(1, 10)
(2, 20)
(3, 30)
This approach will be deprecated in dapclient 3.4.
The SequenceType behaves pretty much like a record array from Numpy, since we can reference it by column (s['a']) or by index:
>>> s[1].data
(2, 20)
>>> s[s.a < 3].data
array([(1, 10), (2, 20)],
dtype=[('a', '<i4'), ('long%20%26%20complicated', '<i2')])
Note that these objects are also SequenceType themselves. The basic rules when working with sequence data are:

1. When a SequenceType is sliced with a string, the corresponding child is returned. For example, s['a'] will return the child a.
2. When a SequenceType is iterated over (using .iterdata() after dapclient 3.2.2), it will return a series of tuples, each one containing the data for a record.
3. When a SequenceType is sliced with an integer, a comparison or a slice(), a new SequenceType will be returned.
4. When a SequenceType is sliced with a tuple of strings, a new SequenceType will be returned, containing only the children defined in the tuple, in the new order. For example, s[('c', 'a')] will return a sequence s with the children c and a, in that order.
Note that, except for rule 4, a SequenceType mimics the behavior of Numpy record arrays.
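Since a flat SequenceType mimics Numpy record arrays for rules 1 to 3, those rules can be tried out on a plain record array, independent of dapclient (a sketch using only Numpy, with a shortened field name "c" for readability):

```python
import numpy as np

# A record array with the same data as the sequence above.
rec = np.array(
    [(1, 10), (2, 20), (3, 30)],
    dtype=[("a", np.int32), ("c", np.int16)],
)

# Rule 1: slicing with a string returns the child column.
print(rec["a"])            # [1 2 3]

# Rule 2: iterating yields one record at a time.
print([tuple(r) for r in rec])  # [(1, 10), (2, 20), (3, 30)]

# Rule 3: integer, slice() and comparison indexing filter the records.
print(rec[1])              # (2, 20)
print(rec[rec["a"] < 3])   # [(1, 10) (2, 20)]
```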
Now imagine that we want to add to a SequenceType data pulled from a relational database. The easy way would be to fetch the data in the correct column order and insert it into the sequence. But what if we don’t want to store the data in memory, and instead would like to stream it directly from the database? In this case we can create an object that behaves like a record array, similar to the proxy object that implements the array interface. dapclient defines a “protocol” called IterData, which is simply any object that:
1. Returns data when iterated over.
2. Returns a new IterData when sliced, such that:
   - if the slice is a string, the new IterData contains data only for that child;
   - if the slice is a tuple of strings, the object contains only those children, in that order;
   - if the slice is an integer, a slice() or a comparison, the data is filtered accordingly.
The base implementation works by wrapping data from a basic Numpy array. And here is an example of how we would use it:
>>> from dapclient.handlers.lib import IterData
>>> s.data = IterData(np.array([(1, 2), (10, 20)]), s)
>>> print(s)
<SequenceType with children 'a', 'long%20%26%20complicated'>
>>> s2 = s.data[s["a"] > 1]
>>> print(s2)
<IterData to stream array([[ 1, 2],
[10, 20]])>
>>> for record in s2.iterdata():
... print(record)
...
(10, 20)
One can also iterate directly over the IterData object to obtain the data:
>>> for record in s2:
... print(record)
...
(10, 20)
This approach will not be deprecated in dapclient 3.4.
There are many implementations of classes derived from IterData: dapclient.handlers.dap.SequenceProxy is a proxy to sequential data on Opendap servers, dapclient.handlers.csv.CSVProxy wraps a CSV file, and dapclient.handlers.sql.SQLProxy works as a stream to a relational database.
GridType¶
A GridType is a special kind of object that behaves like an array and a StructureType. The class is derived from StructureType; the major difference is that the first defined variable is a multidimensional array, while subsequent children are vector maps that define the axes of the array. This way, the data attribute on a GridType returns the data of all its children: the n-dimensional array followed by the n maps.
Here is a simple example:
>>> g = GridType(name="g")
>>> data = np.arange(6)
>>> data.shape = (2, 3)
>>> g["a"] = BaseType(
... name="a", data=data, shape=data.shape, type=np.int32, dimensions=("x", "y")
... )
>>> g["x"] = BaseType(name="x", data=np.arange(2), shape=(2,), type=np.int32)
>>> g["y"] = BaseType(name="y", data=np.arange(3), shape=(3,), type=np.int32)
>>> g.data
[array([[0, 1, 2],
[3, 4, 5]]), array([0, 1]), array([0, 1, 2])]
Grids behave like arrays in that they can be sliced. When this happens, a new GridType is returned with the proper data and axes:
>>> print(g)
<GridType with array 'a' and maps 'x', 'y'>
>>> print(g[0])
<GridType with array 'a' and maps 'x', 'y'>
>>> print(g[0].data)
[array([0, 1, 2]), array(0), array([0, 1, 2])]
It is possible to disable this feature (some older servers might not handle it nicely):
>>> g = GridType(name="g")
>>> g.set_output_grid(False)
>>> data = np.arange(6)
>>> data.shape = (2, 3)
>>> g["a"] = BaseType(
... name="a", data=data, shape=data.shape, type=np.int32, dimensions=("x", "y")
... )
>>> g["x"] = BaseType(name="x", data=np.arange(2), shape=(2,), type=np.int32)
>>> g["y"] = BaseType(name="y", data=np.arange(3), shape=(3,), type=np.int32)
>>> g.data
[array([[0, 1, 2],
[3, 4, 5]]), array([0, 1]), array([0, 1, 2])]
>>> print(g)
<GridType with array 'a' and maps 'x', 'y'>
>>> print(g[0])
<BaseType with data array([0, 1, 2])>
>>> print(g[0].name)
a
>>> print(g[0].data)
[0 1 2]
Handlers¶
Now that we have seen the dapclient data model we can understand handlers: handlers are simply classes that convert data into the dapclient data model. The NetCDF handler, for example, reads a NetCDF file and builds a DatasetType object. The SQL handler reads a file describing the variables and maps them to a given table in a relational database. dapclient uses entry points in order to find handlers that are installed on the system. This means that handlers can be developed and installed separately from dapclient. Handlers are mapped to files by a regular expression that usually matches the extension of the file.
Here is the minimal configuration for a handler that serves .npz files from Numpy:
import os
import re

import numpy

from dapclient.model import *
from dapclient.handlers.lib import BaseHandler
from dapclient.handlers.helper import constrain


class Handler(BaseHandler):

    extensions = re.compile(r"^.*\.npz$", re.IGNORECASE)

    def __init__(self, filepath):
        self.filename = os.path.split(filepath)[1]
        self.f = numpy.load(filepath)

    def parse_constraints(self, environ):
        dataset = DatasetType(name=self.filename)
        for name in self.f.files:
            data = self.f[name][:]
            dataset[name] = BaseType(
                name=name, data=data, shape=data.shape, type=data.dtype.char
            )
        return constrain(dataset, environ.get("QUERY_STRING", ""))


if __name__ == "__main__":
    import sys
    from paste.httpserver import serve

    application = Handler(sys.argv[1])
    serve(application, port=8001)
So let’s go over the code. Our handler has a single class called Handler that should be configured as an entry point in the setup.py file of the handler:
[dapclient.handler]
npz = path.to.my.handler:Handler
Here the name of our handler (“npz”) can be anything, as long as it points to the correct class. In order for dapclient to be able to find the handler, it must be installed on the system with either python setup.py install or, even better, python setup.py develop.
The class-level attribute extensions defines a regular expression that matches the files supported by the handler. In this case, the handler will match all files ending with the .npz extension.
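The matching done with this attribute can be tried out directly with Python's re module (the pattern below is the same one used in the Handler example; the filenames are made up for illustration):

```python
import re

# The same pattern as the Handler.extensions attribute above.
extensions = re.compile(r"^.*\.npz$", re.IGNORECASE)

print(bool(extensions.match("data.npz")))  # True
print(bool(extensions.match("DATA.NPZ")))  # True, thanks to re.IGNORECASE
print(bool(extensions.match("data.nc")))   # False: a different handler's job
```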
When the handler is instantiated, the complete filepath to the data file is passed to __init__. With this information, our handler extracts the filename of the data file and opens it using the load() function from Numpy. The handler is initialized for every request, and immediately afterwards its parse_constraints method is called.
The parse_constraints method is responsible for building a dataset object based on the information about the request available in environ. In this simple handler we simply built a DatasetType object with the entirety of our dataset, i.e., we added all the data from all the variables available in the .npz file. Some requests will ask for only a few variables, and only a subset of their data. The easy way of parsing the request is simply passing the complete dataset together with the QUERY_STRING to the constrain() function:
return constrain(dataset, environ.get("QUERY_STRING", ""))
This will take care of filtering our dataset according to the request, although it may not be very efficient. For this reason, handlers usually implement their own parsing of the Opendap constraint expression. The SQL handler, for example, will translate the query string into an SQL expression that filters the data on the database, rather than in Python.
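The first step of such custom parsing can be sketched in a few lines. In a DAP2 constraint expression, the part before the first "&" lists comma-separated projections, and the remaining "&"-separated parts are selections. The helper below is hypothetical and heavily simplified (real handlers must also handle quoting, server-side functions and hyperslab syntax):

```python
# Hypothetical sketch: split a DAP2 constraint expression into
# projections (what to return) and selections (how to filter).
def split_constraints(query_string):
    if not query_string:
        return [], []
    parts = query_string.split("&")
    projections = parts[0].split(",") if parts[0] else []
    selections = parts[1:]
    return projections, selections

proj, sel = split_constraints("SST.SST[0:11][0:89][0:179]&SST.TIME>366")
print(proj)  # ['SST.SST[0:11][0:89][0:179]']
print(sel)   # ['SST.TIME>366']
```

A handler like the SQL one would then translate each selection into a WHERE clause instead of filtering in Python.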
Finally, note that the handler is a WSGI application: we can initialize it with a filepath and pass it to a WSGI server. This enables us to quickly test the handler, by checking the different responses at http://localhost:8001/.dds, for example. It also means that it is very easy to dynamically serve datasets by plugging them into a route dispatcher.
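Because the handler is a plain WSGI application, it can also be exercised without any server by calling it with a handcrafted environ dict. The sketch below uses a trivial stand-in app (running the real Handler would require dapclient and an actual .npz file), but the calling convention is identical:

```python
# Call a WSGI application directly, the way a server would.
# This app is a stand-in for illustration; the Handler from the
# example above could be invoked with the same environ dict.
def app(environ, start_response):
    body = ("query was: %s" % environ.get("QUERY_STRING", "")).encode()
    start_response("200 OK", [("Content-type", "text/plain")])
    return [body]

environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/.dds", "QUERY_STRING": "a,b"}
status_headers = []
result = app(environ, lambda status, headers: status_headers.append((status, headers)))
print(b"".join(result))  # b'query was: a,b'
```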
Responses¶
If handlers are responsible for converting data into the dapclient data model, responses do the opposite: they convert from the data model to different representations. The Opendap specification defines a series of standard responses that allow clients to introspect a dataset by downloading metadata, and later download data for the subset of interest. These standard responses are the DDS (Dataset Descriptor Structure), the DAS (Dataset Attribute Structure) and the DODS response.
Apart from these, there are additional non-standard responses that add functionality to the server. The ASCII response, for example, formats the data as ASCII for quick visualization on the browser; the HTML response builds a form from which the user can select a subset of the data.
Here is an example of a minimal dapclient response that returns the attributes of the dataset as JSON:
from simplejson import dumps

from dapclient.responses.lib import BaseResponse
from dapclient.lib import walk


class JSONResponse(BaseResponse):
    def __init__(self, dataset):
        BaseResponse.__init__(self, dataset)
        self.headers.append(("Content-type", "application/json"))

    @staticmethod
    def serialize(dataset):
        attributes = {}
        for child in walk(dataset):
            attributes[child.id] = child.attributes

        if hasattr(dataset, "close"):
            dataset.close()

        return [dumps(attributes)]
This response is mapped to a specific extension defined in its entry point:
[dapclient.response]
json = path.to.my.response:JSONResponse
In this example the response will be called when the .json extension is appended to any dataset.
The most important method in the response is serialize, which is responsible for serializing the dataset into the external format. The method should be a generator or return a list of strings, as in this example. Note that the method is responsible for calling dataset.close(), if available, since some handlers use this for closing file handles or database connections.
One important thing about responses is that, like handlers, they are also WSGI applications. WSGI applications should return an iterable when called; usually this is a list of strings corresponding to the output from the application. The BaseResponse application, however, returns a special iterable that contains both the DatasetType object and its serialization function. This means that WSGI middleware that manipulates the response has direct access to the dataset, avoiding the need for deserialization/serialization of the dataset in order to change it.
Here is a simple example of a middleware that adds an attribute to a given dataset:
from webob import Request

from dapclient.model import *
from dapclient.handlers.lib import BaseHandler


class AttributeMiddleware(object):
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # say that we want the parsed response
        environ["x-wsgiorg.want_parsed_response"] = True

        # call the app with the original request and get the response
        req = Request(environ)
        res = req.get_response(self.app)

        # get the dataset from the response and set attribute
        dataset = res.app_iter.x_wsgiorg_parsed_response(DatasetType)
        dataset.attributes["foo"] = "bar"

        # return original response
        response = BaseHandler.response_map[environ["dapclient.response"]]
        responder = response(dataset)
        return responder(environ, start_response)
The code should actually do more bookkeeping, like checking whether the dataset can be retrieved from the response or updating the Content-Length header, but I wanted to keep it simple. dapclient comes with a WSGI middleware for handling server-side functions (dapclient.wsgi.ssf) that makes heavy use of this feature. It works by removing function calls from the request, fetching the dataset from the modified request, applying the function calls and returning a new dataset.
Templating¶
dapclient uses an experimental backend-agnostic templating API for rendering HTML and XML in responses. Since the API is neutral, dapclient can use any templating engine, like Mako or Genshi, and templates can be loaded from disk, memory or a database. The server that comes with dapclient, for example, defines a renderer object that loads Genshi templates from a directory templatedir:
from dapclient.util.template import FileLoader, GenshiRenderer


class FileServer(object):
    def __init__(self, templatedir, ...):
        loader = FileLoader(templatedir)
        self.renderer = GenshiRenderer(options={}, loader=loader)
        ...

    def __call__(self, environ, start_response):
        environ.setdefault('dapclient.renderer', self.renderer)
        ...
And here is how the HTML response uses the renderer: the response requests a template called html.html, which is loaded from the directory by the FileLoader object, and renders it by passing in the context:
def serialize(dataset):
    ...
    renderer = environ['dapclient.renderer']
    template = renderer.loader('html.html')
    output = renderer.render(template, context, output_format='text/html')
    return [output]
(This is actually a simplification; if you look at the code you’ll notice that there is also code to fall back to a default renderer if one is not found in the environ.)
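That fallback pattern amounts to a dictionary lookup with a default. Here is a minimal sketch of the idea, with a hypothetical StubRenderer class standing in for the real renderer objects:

```python
# StubRenderer is a hypothetical stand-in for a real renderer object.
class StubRenderer(object):
    def __init__(self, name):
        self.name = name

default = StubRenderer("default")
custom = StubRenderer("genshi")

# Use the renderer the server placed in the environ, or the default one
# when none is present; mirrors environ.setdefault() on the server side.
def get_renderer(environ):
    return environ.get("dapclient.renderer", default)

print(get_renderer({}).name)                              # default
print(get_renderer({"dapclient.renderer": custom}).name)  # genshi
```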