CDBS 2.0 NetCDF File Design

for Observed Atmospheric and Water Data

INTRODUCTION

The CDBS database stores the observed data in a separate file for each data collection station, which makes subsets of the database easier to distribute. Within the Natural Resources Conservation Service (NRCS), the Water and Climate Center is the central repository for atmospheric and water data, and is responsible for periodically transferring localized subsets of that data to the various NRCS Field Offices. Since a Field Office may require data from as few as 3 sites, and rarely needs data from more than a dozen sites, keeping each station's netCDF data in its own file makes it easier to fulfill the mission of supporting the Field Offices.

Within each netCDF data file, there are several variables that describe the station, several more variables that describe the data, and a separate variable for each type of data that is collected at that station. Each variable also carries attributes that define the data stored in it. Within each data variable, data will be stored as a 2-dimensional array, with each row holding the data for one year, and the unlimited dimension being the number of years for which there is data. The number of columns in each row will depend on the data's duration (the frequency with which data is reported). For example, the variable containing daily data will have 366 columns in each row, and the variable containing hourly data will have 8784 columns per row (366 days x 24 hours). Variables containing other durations will have the appropriate number of columns. To allow easier reading of blocks of data, every row will have a column (or columns, depending on duration) for February 29. In non-leap years, these columns will contain the "_FillValue" value.
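The row layout described above can be illustrated with a minimal Python sketch. It assumes zero-based column numbers and hourly columns ordered by day and then by hour; neither detail is stated in this document, and the function names are hypothetical.

```python
from datetime import date, datetime

DAYS_PER_ROW = 366                   # every row reserves a column for February 29
HOURS_PER_ROW = DAYS_PER_ROW * 24    # 8784 columns for hourly data


def daily_column(d: date) -> int:
    """Zero-based column of a calendar date in a 366-column daily row."""
    # Measure the offset inside a leap year (2000) so that February 29 always
    # owns a column and March 1 falls in the same column every year.
    return (date(2000, d.month, d.day) - date(2000, 1, 1)).days


def hourly_column(t: datetime) -> int:
    """Zero-based column of an hourly report in an 8784-column hourly row."""
    return daily_column(t.date()) * 24 + t.hour
```

With this indexing, the column for February 29 (column 59 of a daily row) is simply left at "_FillValue" in non-leap years, and a block read of any date range uses the same column numbers for every year.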

Data for which quality flags or data source flags are available will have separate variables for the flags, paralleling the layout of the variable containing the observed data. These variables will have a third dimension based on the number of flags associated with each data value. The different flagging systems that are used by the different data providers vary from 1 flag to 3 flags, so the third dimension of the flags variable may be 1, 2, or 3, depending on the source from which CDBS receives the data.

There will also be an optional variable for observed data that contains a time stamp associated with each piece of data. The time stamp variable, like the flag variable, will parallel the layout of the observed data variable.

A more detailed description of the design of the netCDF data file follows, and the accompanying sample CDL shows an implementation of this design.

GLOBAL INFORMATION

FILE NAMES

The name of the netCDF file for each station must be unique, and must fit within the limitations of the most restrictive operating system on which these files will be used -- in this case, DOS. Since copies of the data files will be transferred to any user wanting data from a particular station, the file name should indicate the station whose data is in the file. The following convention is used to define names for CDBS netCDF files:

The name of each NetCDF data file will consist of the following 5 concatenated fields (all characters will be lower case):
data network 2-character code for the data collection network identifying the source of the data. (See the CDBS 2.0 Data Collection Network Codes document for a list of possible values.) This is included to eliminate confusion between stations in different networks that use the same station identifier.
station id This is the station identifier used by the data collection network. Note that the station id may consist of mixed digits and characters.
period 1-character separator (".", ASCII 46), used to separate the file name from the file suffix.
state code 2-character code identifying the state in which the station is located. This is the same as the postal code for that state.
data file type The letter "o." This is to mark this file as an observed data file. (The complete CDBS database design also includes netCDF files for Forecast, Statistical, Central Tendencies, and Simulated data.)

Example: file "co5614.ido"
This file contains observed data for station number 5614, a National Weather Service Cooperative Data Network station in Idaho.
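As an illustration of the naming convention, the following Python sketch builds and splits an observed data file name. The function names are hypothetical, and the 8-character limit on the part before the period is an assumption based on the DOS restriction mentioned above.

```python
def build_file_name(data_network: str, station_id: str, state_code: str) -> str:
    """Concatenate the 5 fields: network, station id, ".", state code, "o"."""
    base = f"{data_network}{station_id}".lower()
    if len(data_network) != 2 or len(state_code) != 2:
        raise ValueError("data network and state codes must be 2 characters")
    if len(base) > 8:
        # Assumption: DOS allows at most 8 characters before the period.
        raise ValueError("file name is too long for DOS")
    return f"{base}.{state_code.lower()}o"


def parse_file_name(name: str) -> dict:
    """Split an observed data file name back into its fields."""
    base, _, suffix = name.partition(".")
    if len(base) < 3 or len(suffix) != 3 or suffix[2] != "o":
        raise ValueError(f"not a CDBS observed data file name: {name!r}")
    return {
        "data_network": base[:2],
        "station_id": base[2:],
        "state_code": suffix[:2],
        "file_type": suffix[2],
    }
```

For the example above, build_file_name("co", "5614", "id") yields "co5614.ido", and parse_file_name recovers the same three codes.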

DIMENSIONS

Since each variable will store one year of data in each row, most of the dimensions just list the number of data reports a particular duration will report in a year. These dimensions are as follows:

The dimensions that do not follow this pattern are:
data_yr This is the unlimited dimension, and shows the number of years for which this database has data for this station.
inst This dimension will be large enough to hold the greatest number of instantaneous data reports received in a single year. The size of this dimension will vary from station to station. In practice, this dimension starts at a small number and, as the data loading program loads data, will be increased to meet the needs of the data. It is, in effect, a second, artificial "unlimited" dimension.
fg_type This is the number of data flags that are associated with each piece of data. The "type" portion of the name will be replaced by the 4- or 5-character code for one of the data flagging systems. See the CDBS 2.0 Data Flag System Codes document for a list of known flag systems. There may be several of these "fg_type" dimensions for a single station. For example, in the NWS Cooperative Data Network, each piece of daily data has 2 quality flags, while the monthly and yearly data values only have 1 quality flag. The dimension "fg_coop2" will be used in the variable containing the quality flags for the daily data, and the dimension "fg_coop1" will be used in the variables containing the quality flags for the monthly and yearly data.
sta_id_lgth=9 This is the maximum length of a station id, plus 1 character to hold the NULL that terminates the string.
hand_5_lgth=9 This is the maximum length of a Handbook 5 (SHEF) station id, plus 1 character to hold the NULL that terminates the string.
sta_nm_lgth=61 This is the maximum length of a station name, plus 1 character to hold the NULL that terminates the string.
st_cd_lgth=3 This is the maximum length of a FIPS alphabetic state code (postal code), plus 1 character to hold the NULL that terminates the string.
data_net_lgth=5 This is the maximum length of a data network code, plus 1 character to hold the NULL that terminates the string.
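The string-length dimensions above each reserve one character for the terminating NULL. A small Python sketch of how a program might pad a string to one of these dimensions and read it back follows; the helper names are hypothetical and ASCII text is assumed.

```python
def to_char_array(text: str, dim_length: int) -> bytes:
    """Pad text with NULs to fill a character dimension of dim_length."""
    encoded = text.encode("ascii")
    if len(encoded) > dim_length - 1:
        # One position is always kept for the terminating NULL.
        raise ValueError(f"{text!r} does not fit in {dim_length} characters")
    return encoded.ljust(dim_length, b"\0")


def from_char_array(data: bytes) -> str:
    """Return the text stored before the first NUL."""
    return data.split(b"\0", 1)[0].decode("ascii")
```

For example, a state code stored with st_cd_lgth=3 occupies two characters plus the NULL.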

GLOBAL ATTRIBUTES

The following list of global attributes is included in each CDBS netCDF observed data file.
Conventions "CDBS"
element_reference "Elements Used in CDBS 2.0"
duration_reference "CDBS 2.0 Duration Codes"
time_units The character string "minutes since 1800-1-1 00:00 time_zone_offset", where time_zone_offset is the difference between the station's reporting time and Greenwich Mean Time, listed as a signed 4-digit time. For example, a station located in the Eastern Time Zone would have a time_zone_offset of -05:00, and its "time_units" would be "minutes since 1800-1-1 00:00 -05:00".
history This attribute will hold the history of modifications to the file. To quote the netCDF User's Guide, this "is a character array with a line for each invocation of a program and arguments that were used to derive the file."
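A minimal Python sketch of how the "time_units" string might be assembled from a station's offset from Greenwich Mean Time is shown here. The function name is hypothetical, and the "+" sign for stations east of Greenwich is an assumption, since this document only shows a negative offset.

```python
def time_units(offset_minutes: int) -> str:
    """Build the time_units string for an offset given in minutes from GMT."""
    sign = "-" if offset_minutes < 0 else "+"   # "+" form is an assumption
    hours, minutes = divmod(abs(offset_minutes), 60)
    return f"minutes since 1800-1-1 00:00 {sign}{hours:02d}:{minutes:02d}"
```

A station in the Eastern Time Zone (300 minutes behind GMT) would call time_units(-300).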

INFORMATION ABOUT VARIABLES

VARIABLES DESCRIBING THE STATION

Within the netCDF file are several variables that identify the station. These variables hold a minimal set of metadata for the station; a much more extensive set of metadata is stored in the Informix side of the CDBS database. (You may notice that there is duplication between some of these variables and the information encoded in the file name. This is both for redundancy in case the file name gets inadvertently scrambled, and to allow better access to this information from within programs.)

The following is a list of the file's variables that describe the station, and the dimensions and attributes of each of these variables:

VARIABLES DESCRIBING THE DATA

Each CDBS netCDF data file will have several variables that describe the data. They are concerned primarily with indicating the time period for which data is available, and the nominal time-of-observation for observed data. Note that all times within the CDBS netCDF files use a time origin of either Jan. 1, 1800 00:00 GMT, or of January 1 at 00:00. All netCDF time representations will be offsets from one of these origins.

COORDINATE VARIABLES

The first coordinate variable will be for the unlimited dimension, and will hold the start of each year, listed as the number of minutes since CDBS 2.0's time origin (January 1, 1800 at 00:00 GMT). Also, for each duration of data in the data file, there will be a coordinate variable containing the nominal observation times for the data with that duration. (For those elements that also need to record the actual observation times for the data, the actual observation times will be stored in variables with the "_time_obs" suffix. See the section on Data Variables below.) Only the durations for which the station has data need have coordinate variables. For example, if a station only has hourly, daily, and monthly data values, then it only needs coordinate variables for those three durations, and need not clutter the data file with coordinate variables for all of the other durations.
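The values held by the coordinate variable for the unlimited dimension can be sketched in Python as follows. The sketch assumes each year is taken to start at 00:00 GMT on January 1; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# CDBS 2.0 time origin: January 1, 1800 at 00:00 GMT.
ORIGIN = datetime(1800, 1, 1, tzinfo=timezone.utc)


def year_start_minutes(year: int) -> int:
    """Minutes from the time origin to 00:00 GMT on January 1 of year."""
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    return (start - ORIGIN) // timedelta(minutes=1)
```

Successive values differ by 525600 minutes for a 365-day year and 527040 minutes for a leap year, so the coordinate variable is strictly increasing.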

The first variable described below is the coordinate variable for the unlimited dimension. The next two examples show the pattern for coordinate variables for the various durations of data. All of the other duration coordinate variables (except "inst") will be different only in their variable names, their dimensions, and their long_names. The "inst" duration will not have a coordinate variable.

DATA VARIABLES (element[_depth_height_code]_duration[_sensor number]_data type)

These variables hold the actual observed, derived, and interpreted data. The difference between data types is:

data_yr The unlimited dimension
duration_dim One of the dimensions defining the number of data reports in a year. See the list of Dimensions above for the list of possible dimensions. Examples: daily data will have a second dimension of "day," hourly data will have a second dimension of "hour_1," etc.
Attributes
The following attributes are given to each data variable:
long_name A character string of the format "data_type duration values for element_name", where data_type is either "observed", "derived", or "interpreted"; duration is a duration name such as "daily", "hourly", or "15-minute", taken from the CDBS 2.0 Duration Codes document; and element_name is the element description taken from the Elements Used in CDBS 2.0 document.
Examples: variable prcp_d_o would have a long_name of "observed daily values for precipitation-incremental", and variable tmin_m_d would have a long_name of "derived monthly values for temperature, minimum"
units One of the unit names used by the Unidata udunits package. For example, element prcp is measured in units of "inch", and element rhum is measured in units of "percent".
element The 5-character code for the type of data stored in this variable. This is the same element code as the one incorporated into the name of the variable, and is repeated here both for redundancy, and to eliminate the need for a program accessing this data to have to know how to decode the information encoded in a variable name.
depth_height_code (Optional attribute -- used only when the element has such a code) The 1-character code identifying the height above ground or the depth below ground at which the data is measured. See the documents CDBS 2.0 Sensor Depth Codes and CDBS 2.0 Sensor Height Codes for a list of the codes.
duration The 1-character code for the data duration. This is the same duration code as the one incorporated into the name of the variable, and is repeated here both for redundancy, and to eliminate the need for a program accessing this data to have to know how to decode the information encoded in a variable name.
mult_snsr_num (Optional attribute -- used only when the data is being measured by one of several sensors measuring the same element at a station.) The number assigned by the station's owner to one of several sensors collecting data for the same element. For example, if a station has 3 snow pillows, the station owner will designate one of them as pillow "1", another as pillow "2", and the third as pillow "3". The "mult_snsr_num" will be either 1, 2, or 3, depending on which snow pillow the data is collected by.
data_type The 1-character code for the type of data in this variable. This is the same data type as the one incorporated into the name of the variable, and is repeated here both for redundancy, and to eliminate the need for a program accessing this data to have to know how to decode the information encoded in a variable name.
decimal_places This is a number of type "short", holding the precision of the data as it is measured by the sensor, listed as the number of decimal places, and having default values taken from the Elements Used in CDBS 2.0 document. Examples: element prcp has a default decimal_places of 2 (meaning measurements are accurate to the nearest 0.01 units), element snwd has a default of 0 (meaning measurements are accurate to the nearest whole unit), etc.
_FillValue This is a number of type "float", and holds the value to be inserted into the variable when no data report was given. This will be the default FILL_FLOAT (defined in netcdf.h).
missing_value This is a number of type "float", and holds the value to be inserted into the variable when a data report was received, and the meaning of the data report was "data missing." This will be the value (-FILL_FLOAT).
last_data This is a number of type "double", and holds the time stamp of the most recent piece of data stored in this variable, in the form of the udunits equivalent of the phrase "nnn minutes since 1800-1-1 00:00 time_zone_offset" where time_zone_offset is the difference between the station's reporting time and Greenwich Mean Time, listed as a 4-digit time.
last_update This is a number of type "double", and holds the date of the most recent update to the data stored in this variable, in the form of the udunits equivalent of the phrase "nnn minutes since 1800-1-1 00:00 time_zone_offset" where time_zone_offset is the difference between the station's reporting time and Greenwich Mean Time, listed as a 4-digit time.
season_name (Optional attribute -- only present if the data duration is "season".) This is a character string containing the name of the season whose data is stored in this variable. The name will be one of the season names listed for the station in the CDBS 2.0 Informix metadata database.
source_variable (Optional attribute -- only present if the data is either "derived" or "interpreted".) A character string containing the name(s) of the variable(s) containing the observed data from which the data in this variable was created. If more than one observed data variable was used in creating the data in this variable, then the names of the source variables will be separated by commas. Examples: If a variable named prcp_m_d contains monthly data derived from daily data stored in variable prcp_d_o, then this source_variable attribute would contain the string "prcp_d_o". If a variable named evap_d_i contains daily evaporation values based on observed hourly temperature and relative humidity, then this source_variable attribute would contain the string "tobs_h_o, rhum_h_o".
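The variable naming pattern and the "long_name" format can be sketched in Python as follows. The data type codes "o", "d", and "i" are inferred from the example variable names in this section (prcp_d_o, prcp_m_d, evap_d_i), and the function names are hypothetical.

```python
# Inferred from the example variable names; not a complete code list.
DATA_TYPES = {"o": "observed", "d": "derived", "i": "interpreted"}


def variable_name(element, duration, data_type,
                  depth_height_code=None, sensor_number=None):
    """element[_depth_height_code]_duration[_sensor number]_data type"""
    parts = [element]
    if depth_height_code is not None:
        parts.append(depth_height_code)
    parts.append(duration)
    if sensor_number is not None:
        parts.append(str(sensor_number))
    parts.append(data_type)
    return "_".join(parts)


def long_name(data_type, duration_name, element_name):
    """Format: 'data_type duration values for element_name'."""
    return f"{DATA_TYPES[data_type]} {duration_name} values for {element_name}"
```

Because the same codes are repeated in the "element", "duration", and "data_type" attributes, a reading program never needs to run this construction in reverse.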

FLAG VARIABLES (element[_depth_height_code]_duration[_sensor number]_fg_flag-type)

These variables hold the data quality flags or data source flags associated with the observed data, and will only be present in the file if flags are available for the observed data.

TIME STAMP VARIABLES (element[_depth_height_code]_duration[_sensor number]_tm_obs)

These variables hold the time stamps associated with the observed data, and will only be present in the file if the data is defined as being of duration "instantaneous" (which requires a time stamp on each piece of data), or if the database needs to store the actual time of the data reports (as opposed to the nominal time of the data reports). These variables store the time stamp as a value created by the Unidata udunits utInvCalendar() library function (or created by any other function that produces the same value for the same date). Note that all times within the CDBS netCDF files use a time origin of Jan. 1, 1800 00:00 GMT. All time stamps will be the equivalent of the phrase "nnn minutes since 1800-1-1 00:00 time_zone_offset", where time_zone_offset is the difference between the station's reporting time and Greenwich Mean Time, listed as a 4-digit time.
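For readers without the udunits library, the time stamp encoding can be approximated with a short Python sketch. It assumes the origin is 00:00 on January 1, 1800 in the station's own reporting time zone, as the "time_zone_offset" in the phrase suggests; it has not been checked against utInvCalendar() output, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def _origin(offset_minutes: int) -> datetime:
    """1800-1-1 00:00 in the station's reporting time zone (assumed)."""
    return datetime(1800, 1, 1, tzinfo=timezone(timedelta(minutes=offset_minutes)))


def encode_time_stamp(when: datetime, offset_minutes: int) -> float:
    """Minutes elapsed from the origin to a time-zone-aware datetime."""
    return (when - _origin(offset_minutes)) / timedelta(minutes=1)


def decode_time_stamp(minutes: float, offset_minutes: int) -> datetime:
    """Inverse of encode_time_stamp; returns a time in the station's zone."""
    return _origin(offset_minutes) + timedelta(minutes=minutes)
```

A station reporting in Mountain Standard Time would pass offset_minutes=-420 to both functions.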

SAMPLE SET OF VARIABLES FOR OBSERVED DATA FOR ONE ELEMENT -- instantaneous precipitation data

PICTURES OF VARIABLES

VARIABLE NAME: prec_i_o
VARIABLE TYPE: float
VARIABLE'S ATTRIBUTES:
long_name = "observed instantaneous values for precipitation, cumulative"
units = "inch"
element = "prec"
duration = "i"
data_type = "o"
decimal_places = 2
_FillValue = 9.96e+36
missing_value = -9.96e+36
last_data = double equal to "1994-12-31 23:48 -07:00 @ 1800-1-1 00:00"
last_update = double equal to "1995-6-14 11:18 -07:00 @ 1800-1-1 00:00"
LAYOUT: 1 row per year, with enough columns for the largest year of data values.

VARIABLE NAME: prec_i_fg_qlty
VARIABLE TYPE: character
VARIABLE'S ATTRIBUTES:
long_name = "data quality flags for data in prec_i_o"
flag_sys = "coop2"
element = "prec"
duration = "i"
_FillValue = '\0'
reference = "NCDC TD-3200 document"
LAYOUT: 1 row per year, with enough columns for flags for the largest year of data, and 2 flags per data value.

VARIABLE NAME: prec_i_time_obs
VARIABLE TYPE: float
VARIABLE'S ATTRIBUTES:
long_name = "observation times for data in prec_i_o"
element = "prec"
duration = "i"
units = "minutes since 1800-01-01 00:00 -07:00"
_FillValue = 9.96e+36
missing_value = -9.96e+36
LAYOUT: 1 row per year, with enough columns for the largest year of data.
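When reading a row of one of these sample variables, a program needs to tell real data apart from the "_FillValue" (no data report was given) and the "missing_value" (a report was received, and it said "data missing"). The Python sketch below shows one way to do this. The constant is the value that netcdf.h is believed to assign to FILL_FLOAT, the tolerance is an assumption to allow for single-precision storage, and the function name is hypothetical.

```python
import math

# Assumed value of FILL_FLOAT from netcdf.h; the sample pictures above
# abbreviate it as 9.96e+36.
FILL_FLOAT = 9.9692099683868690e+36


def classify(value: float) -> str:
    """Label a stored value as 'no report', 'reported missing', or 'data'."""
    if math.isclose(value, FILL_FLOAT, rel_tol=1e-6):
        return "no report"           # _FillValue: nothing was received
    if math.isclose(value, -FILL_FLOAT, rel_tol=1e-6):
        return "reported missing"    # missing_value: the report said "missing"
    return "data"
```

In the 366-column daily layout, the February 29 column of a non-leap year would always classify as "no report".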