disdrodb.l0 package#

Subpackages#

Submodules#

disdrodb.l0.check_configs module#

Check configuration files.

class disdrodb.l0.check_configs.L0BEncodingSchema(*, contiguous: bool, dtype: str, zlib: bool, complevel: int, shuffle: bool, fletcher32: bool, chunksizes: Optional[Union[int, list[int]]])[source]#

Bases: BaseModel

Pydantic model for DISDRODB L0B encodings.

classmethod check_chunksizes_and_zlib(values)[source]#

Check the chunksizes validity.

classmethod check_contiguous_and_fletcher32(values)[source]#

Check the fletcher32 value validity.

classmethod check_contiguous_and_zlib(values)[source]#

Check the compression value validity.

chunksizes: Optional[Union[int, list[int]]]#
complevel: int#
contiguous: bool#
dtype: str#
fletcher32: bool#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'chunksizes': FieldInfo(annotation=Union[int, list[int], NoneType], required=True), 'complevel': FieldInfo(annotation=int, required=True), 'contiguous': FieldInfo(annotation=bool, required=True), 'dtype': FieldInfo(annotation=str, required=True), 'fletcher32': FieldInfo(annotation=bool, required=True), 'shuffle': FieldInfo(annotation=bool, required=True), 'zlib': FieldInfo(annotation=bool, required=True)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

model_post_init(__context: Any) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters
  • self – The BaseModel instance.

  • __context – The context.

shuffle: bool#
zlib: bool#
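The three class-method validators of L0BEncodingSchema cross-check related encoding options. The plain-Python sketch below illustrates the kind of constraints involved. The exact rules are an assumption inferred from the validator names and from netCDF4 storage semantics (contiguous storage cannot be combined with compression, checksums, or chunking), and `find_encoding_conflicts` is a hypothetical helper, not part of disdrodb.

```python
from typing import Any


def find_encoding_conflicts(encoding: dict[str, Any]) -> list[str]:
    """Return human-readable conflicts found in one variable encoding.

    Assumed rules (not taken from the disdrodb source):
    - contiguous storage excludes zlib compression;
    - contiguous storage excludes the fletcher32 checksum;
    - non-contiguous (chunked) storage needs explicit chunksizes.
    """
    conflicts = []
    contiguous = encoding.get("contiguous", False)
    if contiguous and encoding.get("zlib", False):
        conflicts.append("'zlib' must be False when 'contiguous' is True")
    if contiguous and encoding.get("fletcher32", False):
        conflicts.append("'fletcher32' must be False when 'contiguous' is True")
    if not contiguous and not encoding.get("chunksizes"):
        conflicts.append("'chunksizes' must be set when 'contiguous' is False")
    return conflicts


# Illustrative encodings using the field names of L0BEncodingSchema.
valid = {
    "contiguous": False,
    "dtype": "float32",
    "zlib": True,
    "complevel": 3,
    "shuffle": True,
    "fletcher32": False,
    "chunksizes": [5000],
}
invalid = dict(valid, contiguous=True)  # contiguous storage with zlib enabled

print(find_encoding_conflicts(valid))
print(find_encoding_conflicts(invalid))
```

In the actual package these checks run inside the Pydantic model, so a conflicting entry would surface as a validation error rather than a returned list.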
class disdrodb.l0.check_configs.RawDataFormatSchema(*, n_digits: Optional[int], n_characters: Optional[int], n_decimals: Optional[int], n_naturals: Optional[int], data_range: Optional[list[float]], nan_flags: Optional[Union[int, str]] = None, valid_values: Optional[list[float]] = None, dimension_order: Optional[list[str]] = None, n_values: Optional[int] = None, field_number: Optional[str] = None)[source]#

Bases: BaseModel

Pydantic model for the DISDRODB Raw Data Format YAML files.

classmethod check_list_length(value)[source]#

Check the data_range validity.

data_range: Optional[list[float]]#
dimension_order: Optional[list[str]]#
field_number: Optional[str]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'data_range': FieldInfo(annotation=Union[list[float], NoneType], required=True), 'dimension_order': FieldInfo(annotation=Union[list[str], NoneType], required=False), 'field_number': FieldInfo(annotation=Union[str, NoneType], required=False), 'n_characters': FieldInfo(annotation=Union[int, NoneType], required=True), 'n_decimals': FieldInfo(annotation=Union[int, NoneType], required=True), 'n_digits': FieldInfo(annotation=Union[int, NoneType], required=True), 'n_naturals': FieldInfo(annotation=Union[int, NoneType], required=True), 'n_values': FieldInfo(annotation=Union[int, NoneType], required=False), 'nan_flags': FieldInfo(annotation=Union[int, str, NoneType], required=False), 'valid_values': FieldInfo(annotation=Union[list[float], NoneType], required=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

n_characters: Optional[int]#
n_decimals: Optional[int]#
n_digits: Optional[int]#
n_naturals: Optional[int]#
n_values: Optional[int]#
nan_flags: Optional[Union[int, str]]#
valid_values: Optional[list[float]]#
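As a rough illustration of what check_list_length may enforce on data_range, the sketch below assumes that a data_range, when provided, must hold exactly two values (minimum and maximum). Both the rule and the helper name `check_data_range` are assumptions for illustration only.

```python
from typing import Optional


def check_data_range(data_range: Optional[list[float]]) -> Optional[list[float]]:
    """Accept None or a two-element [min, max] list; reject anything else."""
    if data_range is None:
        return None
    if len(data_range) != 2:
        raise ValueError(
            f"data_range must contain exactly 2 values, got {len(data_range)}."
        )
    return data_range


print(check_data_range([0.0, 9999.999]))
print(check_data_range(None))
```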
exception disdrodb.l0.check_configs.SchemaValidationException[source]#

Bases: Exception

Exception raised when schema validation fails.

disdrodb.l0.check_configs.check_all_sensors_configs() None[source]#

Check all sensors configuration YAML files.

disdrodb.l0.check_configs.check_l0a_encoding(sensor_name: str) None[source]#

Check l0a_encodings.yml file.

Parameters

sensor_name (str) – Name of the sensor.

Raises

ValueError – Error raised if the value of a key is not in the list of accepted values.

disdrodb.l0.check_configs.check_l0b_encoding(sensor_name: str) None[source]#

Check l0b_encodings.yml file based on the schema defined in the class L0BEncodingSchema.

Parameters

sensor_name (str) – Name of the sensor.

disdrodb.l0.check_configs.check_sensor_configs(sensor_name: str) None[source]#

Check validity of sensor configuration YAML files.

Parameters

sensor_name (str) – Name of the sensor.

disdrodb.l0.check_standards module#

Check data standards.

disdrodb.l0.check_standards.check_l0a_column_names(df: DataFrame, sensor_name: str) None[source]#

Checks that the dataframe columns respect DISDRODB standards.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • sensor_name (str) – Name of the sensor.

Raises

ValueError – Error if some columns do not meet the DISDRODB standards or if the 'time' column is missing in the dataframe.
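The two failure conditions described above can be sketched in plain Python. The list of valid column names is hypothetical (the real function derives it from the sensor configuration), and `check_column_names` is an illustrative stand-in, not the disdrodb implementation.

```python
def check_column_names(columns, valid_columns):
    """Raise ValueError for unknown columns or a missing 'time' column."""
    columns = list(columns)
    invalid = sorted(set(columns) - set(valid_columns))
    if invalid:
        raise ValueError(f"Columns not compliant with the standards: {invalid}")
    if "time" not in columns:
        raise ValueError("The 'time' column is missing in the dataframe.")


# Hypothetical set of accepted names for some sensor.
valid_columns = ["time", "rainfall_rate_32bit", "raw_drop_number"]

check_column_names(["time", "raw_drop_number"], valid_columns)  # passes silently

for bad_columns in (["time", "unknown_column"], ["raw_drop_number"]):
    try:
        check_column_names(bad_columns, valid_columns)
    except ValueError as error:
        print(error)
```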

disdrodb.l0.check_standards.check_l0a_standards(df: DataFrame, sensor_name: str, verbose: bool = True) None[source]#

Checks that a file respects the DISDRODB L0A standards.

Parameters
  • df (pd.DataFrame) – L0A dataframe.

  • sensor_name (str) – Name of the sensor.

  • verbose (bool, optional) – Whether to print detailed processing information. The default is True.

Raises

ValueError – Error if some columns have inconsistent values.

disdrodb.l0.check_standards.check_l0b_standards(x: str) None[source]#

Check L0B standards.

disdrodb.l0.io module#

Define DISDRODB Data Input/Output.

disdrodb.l0.io.get_l0a_filepaths(processed_dir, station_name, debugging_mode=False)[source]#

Retrieve L0A files for a given station.

Parameters
  • processed_dir (str) – Directory of the campaign where to search for the L0A files. Format: <..>/DISDRODB/Processed/<DATA_SOURCE>/<CAMPAIGN_NAME>.

  • station_name (str) – ID of the station

  • debugging_mode (bool, optional) – If True, select at most 3 files for debugging purposes. The default is False.

Returns

filepaths – List of L0A file paths.

Return type

list

disdrodb.l0.io.get_raw_filepaths(raw_dir, station_name, glob_patterns, verbose=False, debugging_mode=False)[source]#

Get the list of files from a directory based on input parameters.

Currently concatenates all files provided by the glob patterns. In future, this might be modified to enable DISDRODB processing when raw data are separated in multiple files.

Parameters
  • raw_dir (str) – Directory of the campaign where to search for files. Format <..>/DISDRODB/Raw/<DATA_SOURCE>/<CAMPAIGN_NAME>

  • station_name (str) – ID of the station

  • verbose (bool, optional) – Whether to print detailed processing information. The default is False.

  • debugging_mode (bool, optional) – If True, select at most 3 files for debugging purposes. The default is False.

Returns

filepaths – List of file paths.

Return type

list
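A minimal standard-library sketch of the file search described above: glob patterns are evaluated inside `<raw_dir>/data/<station_name>`, the matches are concatenated, and debugging mode keeps at most 3 files. The helper name `list_raw_files`, the sorting, and the choice of the first 3 files are assumptions, not the disdrodb implementation.

```python
import tempfile
from pathlib import Path


def list_raw_files(raw_dir, station_name, glob_patterns, debugging_mode=False):
    """Collect the files matching one or more glob patterns for a station."""
    if isinstance(glob_patterns, str):
        glob_patterns = [glob_patterns]
    data_dir = Path(raw_dir) / "data" / station_name
    # Concatenate the matches of all patterns, dropping duplicates.
    matches = {str(path) for pattern in glob_patterns for path in data_dir.glob(pattern)}
    filepaths = sorted(matches)
    if debugging_mode:
        filepaths = filepaths[:3]  # assumed: keep the first 3 files
    return filepaths


# Build a throwaway campaign directory with five raw files.
with tempfile.TemporaryDirectory() as raw_dir:
    station_dir = Path(raw_dir) / "data" / "STATION_1"
    station_dir.mkdir(parents=True)
    for i in range(5):
        (station_dir / f"file_{i}.txt").touch()
    all_files = list_raw_files(raw_dir, "STATION_1", "*.txt")
    debug_files = list_raw_files(raw_dir, "STATION_1", "*.txt", debugging_mode=True)

print(len(all_files), len(debug_files))
```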

disdrodb.l0.io.read_l0a_dataframe(filepaths: Union[str, list], verbose: bool = False, debugging_mode: bool = False) DataFrame[source]#

Read DISDRODB L0A Apache Parquet file(s).

Parameters
  • filepaths (str or list) – Either a list or a single filepath.

  • verbose (bool) – Whether to print detailed processing information into terminal. The default is False.

  • debugging_mode (bool) – If True, reduce the amount of data to process. If filepaths is a list, only the first 3 files are read, and only the first 100 rows of each file are selected. The default is False.

Returns

L0A Dataframe.

Return type

pd.DataFrame

disdrodb.l0.l0_processing module#

Implement DISDRODB L0 processing.

disdrodb.l0.l0_processing.run_l0a(raw_dir, processed_dir, station_name, glob_patterns, column_names, reader_kwargs, df_sanitizer_fun, parallel, verbose, force, debugging_mode)[source]#

Run the L0A processing for a specific DISDRODB station.

This function is called in each reader to convert raw text files into DISDRODB L0A products.

Parameters
  • raw_dir (str) –

    The directory path where all the raw content of a specific campaign is stored. The path must have the following structure: <...>/DISDRODB/Raw/<DATA_SOURCE>/<CAMPAIGN_NAME>. Inside the raw_dir directory, it is required to adopt the following structure:

    - ``/data/<station_name>/<raw_files>``
    - ``/metadata/<station_name>.yml``
    

    Important points:

    • For each <station_name>, there must be a corresponding YAML file in the metadata subdirectory.

    • The campaign_name is expected to be UPPER CASE.

    • The <CAMPAIGN_NAME> must semantically match between:
      • the raw_dir and processed_dir directory paths;

      • with the key campaign_name within the metadata YAML files.

  • processed_dir (str) – The desired directory path for the processed DISDRODB L0A and L0B products. The path should have the following structure: <...>/DISDRODB/Processed/<DATA_SOURCE>/<CAMPAIGN_NAME>. For testing purposes, this function exceptionally accepts also a directory path simply ending with <CAMPAIGN_NAME> (e.g., /tmp/<CAMPAIGN_NAME>).

  • station_name (str) – The name of the station.

  • glob_patterns (str) – Glob pattern to search for data files in <raw_dir>/data/<station_name>.

  • column_names (list) – Column names of the raw text file.

  • reader_kwargs (dict) – Arguments for Pandas read_csv function to open the text file.

  • df_sanitizer_fun (callable, optional) – Sanitizer function to format the DataFrame into DISDRODB L0A standard. Default is None.

  • parallel (bool, optional) – If True, process the files simultaneously in multiple processes. The number of simultaneous processes can be customized using the dask.distributed.LocalCluster. If False, process the files sequentially in a single process. Default is False.

  • verbose (bool, optional) – If True, print detailed processing information to the terminal. Default is False.

  • force (bool, optional) – If True, overwrite existing data in destination directories. If False, raise an error if data already exists in destination directories. Default is False.

  • debugging_mode (bool, optional) – If True, reduce the amount of data to process. Processes only the first 100 rows of 3 raw data files. Default is False.
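The raw_dir requirements listed above (one metadata YAML file per station directory, UPPER CASE campaign name) can be checked before launching a run. The sketch below is a hypothetical pre-flight helper built on the standard library only. `find_layout_problems` and the example directory names are assumptions and not part of disdrodb.

```python
import tempfile
from pathlib import Path


def find_layout_problems(raw_dir):
    """Return a list of problems found in a raw campaign directory."""
    raw_dir = Path(raw_dir)
    problems = []
    # The last path component is the <CAMPAIGN_NAME>.
    if raw_dir.name != raw_dir.name.upper():
        problems.append(f"Campaign name '{raw_dir.name}' is not UPPER CASE.")
    data_dir = raw_dir / "data"
    if not data_dir.is_dir():
        return problems + ["Missing 'data' directory."]
    # Each station directory needs a matching metadata/<station_name>.yml file.
    for station_dir in sorted(data_dir.iterdir()):
        if not station_dir.is_dir():
            continue
        metadata_file = raw_dir / "metadata" / f"{station_dir.name}.yml"
        if not metadata_file.is_file():
            problems.append(f"Missing metadata file for station '{station_dir.name}'.")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "DISDRODB" / "Raw" / "MY_SOURCE" / "EXAMPLE_CAMPAIGN"
    (raw_dir / "data" / "STATION_1").mkdir(parents=True)
    (raw_dir / "data" / "STATION_2").mkdir(parents=True)
    (raw_dir / "metadata").mkdir()
    (raw_dir / "metadata" / "STATION_1.yml").touch()
    problems = find_layout_problems(raw_dir)

print(problems)
```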

disdrodb.l0.l0_processing.run_l0a_station(data_source, campaign_name, station_name, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None)[source]#

Run the L0A processing of a specific DISDRODB station when invoked from the terminal.

This function is intended to be called through the disdrodb_run_l0a_station command-line interface.

Parameters
  • data_source (str) – The name of the institution (for campaigns spanning multiple countries) or the name of the country (for campaigns or sensor networks within a single country). Must be provided in UPPER CASE.

  • campaign_name (str) – The name of the campaign. Must be provided in UPPER CASE.

  • station_name (str) – The name of the station.

  • force (bool, optional) – If True, existing data in the destination directories will be overwritten. If False (default), an error will be raised if data already exists in the destination directories.

  • verbose (bool, optional) – If True, detailed processing information will be printed to the terminal. If False (default), less information will be displayed.

  • parallel (bool, optional) – If True (default), files will be processed in multiple processes simultaneously, with each process using a single thread. If False, files will be processed sequentially in a single process, and multi-threading will be automatically exploited to speed up I/O tasks.

  • debugging_mode (bool, optional) – If True, the amount of data processed will be reduced. Only the first 3 raw data files will be processed. By default, False.

  • base_dir (str, optional) – The base directory of DISDRODB, expected in the format <...>/DISDRODB. If not specified, the path specified in the DISDRODB active configuration will be used.

disdrodb.l0.l0_processing.run_l0b(processed_dir, station_name, parallel, force, verbose, debugging_mode)[source]#

Run the L0B processing for a specific DISDRODB station.

Parameters
  • raw_dir (str) –

    The directory path where all the raw content of a specific campaign is stored. The path must have the following structure: <...>/DISDRODB/Raw/<DATA_SOURCE>/<CAMPAIGN_NAME>. Inside the raw_dir directory, it is required to adopt the following structure:

    - ``/data/<station_name>/<raw_files>``
    - ``/metadata/<station_name>.yml``
    

    Important points:

    • For each <station_name>, there must be a corresponding YAML file in the metadata subdirectory.

    • The campaign_name is expected to be UPPER CASE.

    • The <CAMPAIGN_NAME> must semantically match between:
      • the raw_dir and processed_dir directory paths;

      • with the key campaign_name within the metadata YAML files.

  • processed_dir (str) – The desired directory path for the processed DISDRODB L0A and L0B products. The path should have the following structure: <...>/DISDRODB/Processed/<DATA_SOURCE>/<CAMPAIGN_NAME>. For testing purposes, this function exceptionally accepts also a directory path simply ending with <CAMPAIGN_NAME> (e.g., /tmp/<CAMPAIGN_NAME>).

  • station_name (str) – The name of the station.

  • force (bool, optional) – If True, overwrite existing data in destination directories. If False, raise an error if data already exists in destination directories. Default is False.

  • verbose (bool, optional) – If True, print detailed processing information to the terminal. Default is True.

  • parallel (bool, optional) – If True, process the files simultaneously in multiple processes. The number of simultaneous processes can be customized using the dask.distributed.LocalCluster. Ensure that threads_per_worker (the number of threads per process) is set to 1 to avoid HDF errors. Also, ensure that the HDF5_USE_FILE_LOCKING environment variable is set to False. If False, process the files sequentially in a single process. Default is False.

  • debugging_mode (bool, optional) – If True, reduce the amount of data to process. Only the first 3 raw data files will be processed. Default is False.
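The parallel setup recommended above could look roughly like the following sketch. It assumes Dask's distributed scheduler is installed. `start_single_threaded_cluster` is a hypothetical helper name and the worker count is arbitrary. The environment variable is set to the string "FALSE", which is one of the values the HDF5 library accepts for disabling file locking.

```python
import os

# Disable HDF5 file locking before any HDF5/netCDF library opens a file.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"


def start_single_threaded_cluster(n_workers=4):
    """Start a local Dask cluster with one thread per worker process."""
    # Imported lazily so this module can be imported without Dask installed.
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=1,  # one thread per process to avoid HDF errors
        processes=True,
    )
    return Client(cluster)
```

A script would typically call `client = start_single_threaded_cluster()` inside an `if __name__ == "__main__":` guard before invoking the processing with parallel=True, and close the client afterwards.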

disdrodb.l0.l0_processing.run_l0b_concat(processed_dir, station_name, verbose=False)[source]#

Concatenate all L0B netCDF files into a single netCDF file.

The single netCDF file is saved at <processed_dir>/L0B.

disdrodb.l0.l0_processing.run_l0b_concat_station(data_source, campaign_name, station_name, remove_l0b=False, verbose=True, base_dir: Optional[str] = None)[source]#

Define the L0B file concatenation of a station.

This function is intended to be called through the disdrodb_run_l0b_concat_station command-line interface.

Parameters
  • data_source (str) – The name of the institution (for campaigns spanning multiple countries) or the name of the country (for campaigns or sensor networks within a single country). Must be provided in UPPER CASE.

  • campaign_name (str) – The name of the campaign. Must be provided in UPPER CASE.

  • station_name (str) – The name of the station.

  • verbose (bool, optional) – If True (default), detailed processing information will be printed to the terminal. If False, less information will be displayed.

  • base_dir (str, optional) – The base directory of DISDRODB, expected in the format <...>/DISDRODB. If not specified, the path specified in the DISDRODB active configuration will be used.

disdrodb.l0.l0_processing.run_l0b_from_nc(raw_dir, processed_dir, station_name, glob_patterns, dict_names, ds_sanitizer_fun, parallel, verbose, force, debugging_mode)[source]#

Run the L0B processing for a specific DISDRODB station with raw netCDFs.

This function is called in the reader where raw netCDF files must be converted into DISDRODB L0B format.

Parameters
  • raw_dir (str) –

    The directory path where all the raw content of a specific campaign is stored. The path must have the following structure: <...>/DISDRODB/Raw/<DATA_SOURCE>/<CAMPAIGN_NAME>. Inside the raw_dir directory, it is required to adopt the following structure:

    - ``/data/<station_name>/<raw_files>``
    - ``/metadata/<station_name>.yml``
    

    Important points:

    • For each <station_name>, there must be a corresponding YAML file in the metadata subdirectory.

    • The campaign_name is expected to be UPPER CASE.

    • The <CAMPAIGN_NAME> must semantically match between:
      • the raw_dir and processed_dir directory paths;

      • with the key campaign_name within the metadata YAML files.

  • processed_dir (str) – The desired directory path for the processed DISDRODB L0A and L0B products. The path should have the following structure: <...>/DISDRODB/Processed/<DATA_SOURCE>/<CAMPAIGN_NAME>. For testing purposes, this function exceptionally accepts also a directory path simply ending with <CAMPAIGN_NAME> (e.g., /tmp/<CAMPAIGN_NAME>).

  • station_name (str) – The name of the station.

  • glob_patterns (str) – Glob pattern to search data files in <raw_dir>/data/<station_name>. Example: glob_patterns = "*.nc"

  • dict_names (dict) – Dictionary mapping raw netCDF variables/coordinates/dimension names to DISDRODB standards.

  • ds_sanitizer_fun (object, optional) – Sanitizer function to format the raw netCDF into DISDRODB L0B standard.

  • force (bool, optional) – If True, overwrite existing data in destination directories. If False, raise an error if data already exists in destination directories. Default is False.

  • verbose (bool, optional) – If True, print detailed processing information to the terminal. Default is True.

  • parallel (bool, optional) – If True, process the files simultaneously in multiple processes. The number of simultaneous processes can be customized using the dask.distributed.LocalCluster. Ensure that threads_per_worker (the number of threads per process) is set to 1 to avoid HDF errors. Also, ensure that the HDF5_USE_FILE_LOCKING environment variable is set to False. If False, process the files sequentially in a single process; multi-threading is then automatically exploited to speed up I/O tasks. Default is False.

  • debugging_mode (bool, optional) – If True, reduce the amount of data to process. Only the first 3 raw netCDF files will be processed. Default is False.

disdrodb.l0.l0_processing.run_l0b_station(data_source, campaign_name, station_name, force: bool = False, verbose: bool = True, parallel: bool = True, debugging_mode: bool = False, remove_l0a: bool = False, base_dir: Optional[str] = None)[source]#

Run the L0B processing of a specific DISDRODB station when invoked from the terminal.

This function is intended to be called through the disdrodb_run_l0b_station command-line interface.

Parameters
  • data_source (str) – The name of the institution (for campaigns spanning multiple countries) or the name of the country (for campaigns or sensor networks within a single country). Must be provided in UPPER CASE.

  • campaign_name (str) – The name of the campaign. Must be provided in UPPER CASE.

  • station_name (str) – The name of the station.

  • force (bool, optional) – If True, existing data in the destination directories will be overwritten. If False (default), an error will be raised if data already exists in the destination directories.

  • verbose (bool, optional) – If True (default), detailed processing information will be printed to the terminal. If False, less information will be displayed.

  • parallel (bool, optional) – If True (default), files will be processed in multiple processes simultaneously, with each process using a single thread to avoid issues with the HDF/netCDF library. If False, files will be processed sequentially in a single process, and multi-threading will be automatically exploited to speed up I/O tasks.

  • debugging_mode (bool, optional) – If True, the amount of data processed will be reduced. Only the first 100 rows of 3 L0A files will be processed. By default, False.

  • base_dir (str, optional) – The base directory of DISDRODB, expected in the format <...>/DISDRODB. If not specified, the path specified in the DISDRODB active configuration will be used.

disdrodb.l0.l0_reader module#

Define DISDRODB L0 reader routines.

disdrodb.l0.l0_reader.available_readers(data_sources=None, reader_path=False)[source]#

Retrieve available readers information.

disdrodb.l0.l0_reader.check_available_readers()[source]#

Check the reader arguments of all readers in the package.

disdrodb.l0.l0_reader.get_reader_function(reader_data_source: str, reader_name: str) object[source]#

Returns the reader function based on input parameters.

Parameters
  • reader_data_source (str) – The directory within which the reader_name is located in the disdrodb.l0.readers directory.

  • reader_name (str) – The reader name.

Returns

The reader() function

Return type

object

disdrodb.l0.l0_reader.get_reader_function_from_metadata_key(reader_data_source_name)[source]#

Retrieve the reader function from the reader metadata value.

The convention for metadata reader key: <data_source/reader_name> in disdrodb.l0.readers.
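Under the `<data_source/reader_name>` convention, resolving a reader starts with splitting the metadata value into its two parts. The sketch below shows only that parsing step. `split_reader_key` is a hypothetical helper, and the example key is made up.

```python
def split_reader_key(reader_key):
    """Split a '<data_source>/<reader_name>' string into its two components."""
    parts = reader_key.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"Expected '<data_source>/<reader_name>', got '{reader_key}'."
        )
    data_source, reader_name = parts
    return data_source, reader_name


print(split_reader_key("MY_SOURCE/MY_READER"))
```

The two components correspond to the reader_data_source and reader_name arguments of get_reader_function.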

disdrodb.l0.l0_reader.get_station_reader_function(data_source, campaign_name, station_name, base_dir=None)[source]#

Retrieve the reader function from the station metadata.

disdrodb.l0.l0_reader.is_documented_by(original)[source]#

Wrapper function to apply generic docstring to the decorated function.

Parameters

original (function) – Function to take the docstring from.
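A docstring-sharing decorator of this kind is commonly written as shown below. This is a generic sketch of the pattern under a different, hypothetical name (`copy_docstring_from`). The actual disdrodb implementation may differ in details.

```python
def copy_docstring_from(original):
    """Return a decorator that copies the docstring of `original`."""

    def decorator(target):
        target.__doc__ = original.__doc__
        return target

    return decorator


def generic_docstring():
    """Script to convert the raw data to L0A format."""


@copy_docstring_from(generic_docstring)
def reader(raw_dir, processed_dir, station_name):
    pass


print(reader.__doc__)
```

Each reader can then share one generic parameter description instead of repeating it.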

disdrodb.l0.l0_reader.reader_generic_docstring()[source]#

Script to convert the raw data to L0A format.

Parameters
  • raw_dir (str) –

    The directory path where all the raw content of a specific campaign is stored. The path must have the following structure: <...>/DISDRODB/Raw/<DATA_SOURCE>/<CAMPAIGN_NAME>. Inside the raw_dir directory, it is required to adopt the following structure:

    - ``/data/<station_name>/<raw_files>``
    - ``/metadata/<station_name>.yml``
    

    Important points:

    • For each <station_name>, there must be a corresponding YAML file in the metadata subdirectory.

    • The <CAMPAIGN_NAME> is expected to be UPPER CASE.

    • The <CAMPAIGN_NAME> must semantically match between:

      • the raw_dir and processed_dir directory paths;

      • with the key campaign_name within the metadata YAML files.

  • processed_dir (str) – The desired directory path for the processed DISDRODB L0A and L0B products. The path should have the following structure: <...>/DISDRODB/Processed/<DATA_SOURCE>/<CAMPAIGN_NAME>. For testing purposes, this function exceptionally accepts also a directory path simply ending with <CAMPAIGN_NAME> (e.g., /tmp/<CAMPAIGN_NAME>).

  • station_name (str) – The name of the station.

  • force (bool, optional) – If True, overwrite existing data in destination directories. If False, raise an error if data already exists in destination directories. Default is False.

  • verbose (bool, optional) – If True, print detailed processing information to the terminal. Default is True.

  • parallel (bool, optional) – If True, process the files simultaneously in multiple processes. The number of simultaneous processes can be customized using the dask.distributed.LocalCluster. If False, process the files sequentially in a single process. Default is False.

  • debugging_mode (bool, optional) – If True, reduce the amount of data to process. Only the first 3 raw data files will be processed. Default is False.

disdrodb.l0.l0a_processing module#

Functions to process raw text files into DISDRODB L0A Apache Parquet.

disdrodb.l0.l0a_processing.cast_column_dtypes(df: DataFrame, sensor_name: str) DataFrame[source]#

Convert 'object' dataframe columns into DISDRODB L0A dtype standards.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • sensor_name (str) – Name of the sensor.

Returns

Dataframe with corrected columns types.

Return type

pd.DataFrame

disdrodb.l0.l0a_processing.coerce_corrupted_values_to_nan(df: DataFrame, sensor_name: str) DataFrame[source]#

Coerce corrupted values in dataframe numeric columns to np.nan.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • sensor_name (str) – Name of the sensor.

Returns

Dataframe with string columns without corrupted values.

Return type

pd.DataFrame

disdrodb.l0.l0a_processing.concatenate_dataframe(list_df: list, verbose: bool = False) DataFrame[source]#

Concatenate a list of dataframes.

Parameters
  • list_df (list) – List of dataframes.

  • verbose (bool, optional) – If True, print messages. If False, no print.

Returns

Concatenated dataframe.

Return type

pd.DataFrame

Raises

ValueError – Raised if the dataframes cannot be concatenated.

disdrodb.l0.l0a_processing.drop_time_periods(df, time_periods)[source]#

Drop problematic time periods.

disdrodb.l0.l0a_processing.drop_timesteps(df, timesteps)[source]#

Drop problematic time steps.

disdrodb.l0.l0a_processing.process_raw_file(filepath, column_names, reader_kwargs, df_sanitizer_fun, sensor_name, verbose=True, issue_dict=None)[source]#

Read and parse a raw text file into an L0A dataframe.

Parameters
  • filepath (str) – File path

  • column_names (list) – Columns names.

  • reader_kwargs (dict) – Pandas read_csv arguments.

  • df_sanitizer_fun (object, optional) – Sanitizer function to format the dataframe.

  • sensor_name (str) – Name of the sensor.

  • verbose (bool) – Whether to print detailed processing information. The default is True.

  • issue_dict (dict) – Issue dictionary providing information on timesteps to remove. The default is an empty dictionary {}. Valid issue_dict keys are 'timesteps' and 'time_periods'. Valid issue_dict values are lists of datetime64 values (with second accuracy). To correctly format and check the validity of the issue_dict, use the disdrodb.l0.issue.check_issue_dict function.

Returns

Dataframe

Return type

pd.DataFrame
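To make the issue_dict structure concrete, the sketch below drops rows whose time matches a listed timestep or falls inside a listed time period. It uses plain Python datetimes and dictionaries instead of datetime64 values and a DataFrame. Treating each time period as an inclusive [start, end] pair is an assumption, and `drop_issue_rows` is a hypothetical helper.

```python
from datetime import datetime


def drop_issue_rows(rows, issue_dict):
    """Remove rows flagged by the 'timesteps' and 'time_periods' keys."""
    timesteps = set(issue_dict.get("timesteps", []))
    time_periods = issue_dict.get("time_periods", [])
    kept = []
    for row in rows:
        time = row["time"]
        if time in timesteps:
            continue  # individually flagged timestep
        if any(start <= time <= end for start, end in time_periods):
            continue  # inside a flagged period (assumed inclusive bounds)
        kept.append(row)
    return kept


issue_dict = {
    "timesteps": [datetime(2020, 1, 1, 0, 1, 0)],
    "time_periods": [
        [datetime(2020, 1, 1, 0, 3, 0), datetime(2020, 1, 1, 0, 4, 0)],
    ],
}
# One row per minute, from 00:00 to 00:05.
rows = [{"time": datetime(2020, 1, 1, 0, m, 0), "value": m} for m in range(6)]

clean_rows = drop_issue_rows(rows, issue_dict)
print([row["value"] for row in clean_rows])
```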

disdrodb.l0.l0a_processing.read_raw_file(filepath: str, column_names: list, reader_kwargs: dict) DataFrame[source]#

Read a raw file into a dataframe.

Parameters
  • filepath (str) – Raw file path.

  • column_names (list) – Column names.

  • reader_kwargs (dict) – Pandas pd.read_csv arguments.

Returns

Pandas dataframe.

Return type

pandas.DataFrame

disdrodb.l0.l0a_processing.read_raw_files(filepaths: Union[list, str], column_names: list, reader_kwargs: dict, sensor_name: str, verbose: bool, df_sanitizer_fun: object = None) DataFrame[source]#

Read and parse a list of raw files into a dataframe.

Parameters
  • filepaths (Union[list,str]) – File(s) path(s)

  • column_names (list) – Columns names.

  • reader_kwargs (dict) – Pandas read_csv arguments.

  • sensor_name (str) – Name of the sensor.

  • verbose (bool) – Whether to print detailed processing information. The default is False.

  • df_sanitizer_fun (object, optional) – Sanitizer function to format the dataframe.

Returns

Dataframe

Return type

pd.DataFrame

Raises

ValueError – Raised if the input parameters are invalid or the raw file cannot be processed.

disdrodb.l0.l0a_processing.remove_corrupted_rows(df)[source]#

Remove corrupted rows by checking conversion of raw fields to numeric.

Note: the delimiter must already be stripped from the start and end of the raw array fields.

disdrodb.l0.l0a_processing.remove_duplicated_timesteps(df: DataFrame, verbose: bool = False)[source]#

Remove duplicated timesteps.

Only the first occurrence of each timestep is kept.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • verbose (bool) – Whether to print detailed processing information. The default is False.

Returns

Dataframe with valid unique timesteps.

Return type

pd.DataFrame
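The keep-first behaviour can be illustrated without pandas. The sketch below works on a list of row dictionaries and is only meant to show the semantics. `keep_first_timestep` is a hypothetical helper.

```python
def keep_first_timestep(rows):
    """Keep only the first row for each distinct 'time' value."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row["time"] in seen:
            continue  # a later duplicate of an already-seen timestep
        seen.add(row["time"])
        unique_rows.append(row)
    return unique_rows


rows = [
    {"time": "2020-01-01 00:00:00", "value": 1},
    {"time": "2020-01-01 00:01:00", "value": 2},
    {"time": "2020-01-01 00:00:00", "value": 3},  # duplicate timestep
]
unique_rows = keep_first_timestep(rows)
print([row["value"] for row in unique_rows])
```

On a pandas DataFrame, the same semantics are available through `df.drop_duplicates(subset="time", keep="first")`.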

disdrodb.l0.l0a_processing.remove_issue_timesteps(df, issue_dict, verbose=False)[source]#

Drop dataframe rows with timesteps listed in the issue dictionary.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • issue_dict (dict) – Issue dictionary.

  • verbose (bool) – Whether to print detailed processing information. The default is False.

Returns

Dataframe with problematic timesteps removed.

Return type

pd.DataFrame

disdrodb.l0.l0a_processing.remove_rows_with_missing_time(df: DataFrame, verbose: bool = False)[source]#

Remove dataframe rows where the "time" is NaT.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • verbose (bool) – Whether to print detailed processing information. The default is False.

Returns

Dataframe with valid timesteps.

Return type

pd.DataFrame

disdrodb.l0.l0a_processing.replace_nan_flags(df, sensor_name, verbose=False)[source]#

Set values corresponding to nan_flags to np.nan.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • sensor_name (str) – Name of the sensor.

  • verbose (bool) – Whether to print detailed processing information. The default is False.

Returns

Dataframe without nan_flags values.

Return type

pd.DataFrame
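As an illustration of the nan_flags replacement, the sketch below maps per-column flag values to NaN on plain row dictionaries. The flag values and column name are made up (the real function reads the flags from the sensor configuration), and `replace_flags` is a hypothetical helper.

```python
import math


def replace_flags(rows, nan_flags):
    """Replace values listed in nan_flags[column] with NaN, column by column."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for column, flags in nan_flags.items():
            if column in new_row and new_row[column] in flags:
                new_row[column] = math.nan
        cleaned.append(new_row)
    return cleaned


# Hypothetical flags: -9999 marks a missing rainfall rate.
nan_flags = {"rainfall_rate": [-9999]}
rows = [{"rainfall_rate": 0.5}, {"rainfall_rate": -9999}]

cleaned_rows = replace_flags(rows, nan_flags)
print(cleaned_rows)
```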

disdrodb.l0.l0a_processing.set_nan_invalid_values(df, sensor_name, verbose=False)[source]#

Set invalid (class) values to np.nan.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • sensor_name (str) – Name of the sensor.

  • verbose (bool) – Whether to print detailed processing information. The default is False.

Returns

Dataframe without invalid values.

Return type

pd.DataFrame

disdrodb.l0.l0a_processing.set_nan_outside_data_range(df, sensor_name, verbose=False)[source]#

Set values outside the data range as np.nan.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • sensor_name (str) – Name of the sensor.

  • verbose (bool) – Whether to print detailed processing information. The default is False.

Returns

Dataframe without values outside the expected data range.

Return type

pd.DataFrame
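The range masking can be sketched on a plain list of values. The example assumes that data_range is a [min, max] pair with inclusive bounds. `mask_outside_range` is a hypothetical helper, and the range used is illustrative.

```python
import math


def mask_outside_range(values, data_range):
    """Return values with entries outside [min, max] replaced by NaN."""
    vmin, vmax = data_range
    return [value if vmin <= value <= vmax else math.nan for value in values]


masked = mask_outside_range([-5.0, 0.0, 12.5, 150.0], data_range=[0.0, 100.0])
print(masked)
```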

disdrodb.l0.l0a_processing.strip_delimiter_from_raw_arrays(df)[source]#

Remove the first and last delimiter occurrence from the raw array fields.
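For a single raw array string, stripping one leading and one trailing delimiter could look like the sketch below. The comma delimiter and the helper name `strip_outer_delimiter` are assumptions. The real function operates on the raw array columns of a dataframe.

```python
def strip_outer_delimiter(raw_array, delimiter=","):
    """Remove one leading and one trailing delimiter, if present."""
    # str.removeprefix / str.removesuffix require Python 3.9 or later.
    return raw_array.removeprefix(delimiter).removesuffix(delimiter)


print(strip_outer_delimiter(",000,001,002,"))
print(strip_outer_delimiter("000,001,002"))
```

Interior delimiters are left untouched, so the field can still be split into individual values afterwards.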

disdrodb.l0.l0a_processing.strip_string_spaces(df: DataFrame, sensor_name: str) DataFrame[source]#

Strip leading/trailing spaces from dataframe string columns.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • sensor_name (str) – Name of the sensor.

Returns

Dataframe with string columns without leading/trailing spaces.

Return type

pd.DataFrame

disdrodb.l0.l0a_processing.write_l0a(df: DataFrame, filepath: str, force: bool = False, verbose: bool = False)[source]#

Save the dataframe into an Apache Parquet file.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • filepath (str) – Output file path.

  • force (bool, optional) – Whether to overwrite existing data. If True, overwrite existing data in the destination directories. If False (default), raise an error if data already exist in the destination directories.

  • verbose (bool, optional) – Whether to print detailed processing information. The default is False.

Raises
  • ValueError – The input dataframe can not be written as an Apache Parquet file.

  • NotImplementedError – The input dataframe can not be processed.

disdrodb.l0.l0b_nc_processing module#

Functions to process DISDRODB raw netCDF files into DISDRODB L0B netCDF files.

disdrodb.l0.l0b_nc_processing.add_dataset_missing_variables(ds, missing_vars, sensor_name)[source]#

Add missing xr.Dataset variables as np.nan xr.DataArrays.

disdrodb.l0.l0b_nc_processing.create_l0b_from_raw_nc(ds, dict_names, ds_sanitizer_fun, sensor_name, verbose, attrs)[source]#

Convert a raw xr.Dataset into a DISDRODB L0B netCDF.

Parameters
  • ds (xr.Dataset) – Raw xarray dataset

  • dict_names (dict) – Dictionary mapping raw netCDF variables/coordinates/dimension names to DISDRODB standards.

  • ds_sanitizer_fun (function) – Sanitizer function to do ad-hoc processing of the xr.Dataset.

  • attrs (dict) – Global metadata to attach as global attributes to the xr.Dataset.

  • sensor_name (str) – Name of the sensor.

  • verbose (bool) – Whether to print detailed processing information.

Returns

L0B xr.Dataset

Return type

xr.Dataset

disdrodb.l0.l0b_nc_processing.preprocess_raw_netcdf(ds, dict_names, sensor_name)[source]#

Preprocess a raw netCDF dataset to improve compatibility with DISDRODB standards.

This function checks the validity of dict_names, then renames and subsets the data accordingly. If some variables specified in dict_names are missing, it adds them as np.nan xr.DataArrays.

Parameters
  • ds (xr.Dataset) – Raw netCDF to be converted to DISDRODB standards.

  • dict_names (dict) – Dictionary mapping raw netCDF variables/coordinates/dimension names to DISDRODB standards.

  • sensor_name (str) – Sensor name.

Returns

ds – xarray Dataset with variables compliant to DISDRODB conventions.

Return type

xr.Dataset

disdrodb.l0.l0b_nc_processing.rename_dataset(ds, dict_names)[source]#

Rename xr.Dataset variables, coordinates and dimensions.

disdrodb.l0.l0b_nc_processing.replace_custom_nan_flags(ds, dict_nan_flags, verbose=False)[source]#

Set values corresponding to nan_flags to np.nan.

This function must be used in a reader, if necessary.

Parameters
  • ds (xr.Dataset) – Input xarray dataset.

  • dict_nan_flags (dict) – Dictionary with the nan flag values to set as np.nan.

  • verbose (bool) – Whether to print detailed processing information. The default is False.

Returns

Dataset without nan_flags values.

Return type

xr.Dataset
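As a minimal sketch of the nan flag replacement, the example below uses plain Python lists instead of an xr.Dataset. It assumes that dict_nan_flags maps each variable name to a list of flag values; the function name replace_flags, the variable names and the flag values are hypothetical.

```python
import math


def replace_flags(data: dict, dict_nan_flags: dict) -> dict:
    """Return a copy of data where flagged values are replaced by NaN."""
    cleaned = {}
    for name, values in data.items():
        # Variables without an entry in dict_nan_flags are left unchanged.
        flags = set(dict_nan_flags.get(name, []))
        cleaned[name] = [math.nan if v in flags else v for v in values]
    return cleaned


# Hypothetical variables and flag values, for illustration only.
data = {"rainfall_rate": [0.5, -9999, 1.2], "temperature": [21.0, 99.0, 20.5]}
dict_nan_flags = {"rainfall_rate": [-9999], "temperature": [99.0]}
cleaned = replace_flags(data, dict_nan_flags)
```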

disdrodb.l0.l0b_nc_processing.replace_nan_flags(ds, sensor_name, verbose)[source]#

Set values corresponding to nan_flags to np.nan.

Parameters
  • ds (xr.Dataset) – Input xarray dataset

  • sensor_name (str) – Name of the sensor.

  • verbose (bool) – Whether to print detailed processing information.

Returns

Dataset without nan_flags values.

Return type

xr.Dataset

disdrodb.l0.l0b_nc_processing.set_nan_invalid_values(ds, sensor_name, verbose)[source]#

Set invalid (class) values to np.nan.

Parameters
  • ds (xr.Dataset) – Input xarray dataset

  • sensor_name (str) – Name of the sensor.

  • verbose (bool) – Whether to print detailed processing information.

Returns

Dataset without invalid values.

Return type

xr.Dataset

disdrodb.l0.l0b_nc_processing.set_nan_outside_data_range(ds, sensor_name, verbose)[source]#

Set values outside the data range as np.nan.

Parameters
  • ds (xr.Dataset) – Input xarray dataset

  • sensor_name (str) – Name of the sensor.

  • verbose (bool) – Whether to print detailed processing information.

Returns

Dataset without values outside the expected data range.

Return type

xr.Dataset
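The range check can be illustrated on a plain list. This is a sketch under two assumptions: the data range is given as a [min, max] pair, and both bounds are treated as valid. The function name mask_outside_range and the range values are hypothetical.

```python
import math


def mask_outside_range(values, data_range):
    """Replace values outside the inclusive [vmin, vmax] interval with NaN."""
    vmin, vmax = data_range
    return [v if vmin <= v <= vmax else math.nan for v in values]


# Hypothetical range; the actual ranges are sensor-specific.
data_range_dict = {"temperature": [-40, 70]}
masked = mask_outside_range([21.5, 150.0, -60.0, 70], data_range_dict["temperature"])
```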

disdrodb.l0.l0b_nc_processing.subset_dataset(ds, dict_names, sensor_name)[source]#

Subset xr.Dataset with expected variables.

disdrodb.l0.l0b_processing module#

Functions to process DISDRODB L0A files into DISDRODB L0B netCDF files.

disdrodb.l0.l0b_processing.add_dataset_crs_coords(ds)[source]#

Add the CRS coordinate to the xr.Dataset.

disdrodb.l0.l0b_processing.create_l0b_from_l0a(df: DataFrame, attrs: dict, verbose: bool = False) Dataset[source]#

Transform the L0A dataframe to the L0B xr.Dataset.

Parameters
  • df (pd.DataFrame) – DISDRODB L0A dataframe.

  • attrs (dict) – Station metadata.

  • verbose (bool, optional) – Whether to print detailed processing information. The default is False.

Returns

DISDRODB L0B dataset.

Return type

xr.Dataset

Raises

ValueError – Error if the DISDRODB L0B xarray dataset cannot be created.

disdrodb.l0.l0b_processing.finalize_dataset(ds, sensor_name)[source]#

Finalize DISDRODB L0B Dataset.

disdrodb.l0.l0b_processing.infer_split_str(string: str) str[source]#

Infer the delimiter inside a string.

Parameters

string (str) – Input string.

Returns

Inferred delimiter.

Return type

str
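One possible way to infer a delimiter is to count candidate separators and keep the most frequent one. The sketch below is an assumption about the general technique, not the disdrodb implementation; the function name infer_delimiter and the candidate list are hypothetical.

```python
def infer_delimiter(string: str, candidates=(";", ",", "\t", " ")) -> str:
    """Hypothetical sketch: return the candidate that occurs most often."""
    counts = {sep: string.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("No known delimiter found in the string.")
    return best


delimiter = infer_delimiter("000;001;000;012;")
```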

disdrodb.l0.l0b_processing.rechunk_dataset(ds: Dataset, encoding_dict: dict) Dataset[source]#

Coerce the dataset arrays to have the chunk size specified in the encoding dictionary.

Parameters
  • ds (xr.Dataset) – Input xarray dataset

  • encoding_dict (dict) – Dictionary containing the encoding to write the xarray dataset as a netCDF.

Returns

Output xarray dataset

Return type

xr.Dataset

disdrodb.l0.l0b_processing.retrieve_l0b_arrays(df: DataFrame, sensor_name: str, verbose: bool = False) dict[source]#

Retrieves the L0B data matrix.

Parameters
  • df (pd.DataFrame) – Input dataframe

  • sensor_name (str) – Name of the sensor

Returns

Dictionary with data arrays.

Return type

dict

disdrodb.l0.l0b_processing.sanitize_encodings_dict(encoding_dict: dict, ds: Dataset) dict[source]#

Ensure that the chunk sizes are not larger than the array shape.

Parameters
  • encoding_dict (dict) – Dictionary containing the encoding to write DISDRODB L0B netCDFs.

  • ds (xr.Dataset) – Input dataset.

Returns

Encoding dictionary.

Return type

dict
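The idea of fitting chunk sizes to the array shape can be sketched without xarray. The example assumes that the array shapes are available as a dictionary of tuples and that each chunk size is clipped to the length of its dimension. The function name clip_chunksizes, the variable name and the shape are hypothetical.

```python
def clip_chunksizes(encoding_dict: dict, shapes: dict) -> dict:
    """Return a copy of encoding_dict whose chunk sizes fit the array shapes."""
    sanitized = {}
    for name, encoding in encoding_dict.items():
        encoding = dict(encoding)  # shallow copy, the input is left untouched
        chunksizes = encoding.get("chunksizes")
        if chunksizes is not None and name in shapes:
            if isinstance(chunksizes, int):
                chunksizes = [chunksizes]
            # Clip each chunk size to the length of the matching dimension.
            encoding["chunksizes"] = [min(c, n) for c, n in zip(chunksizes, shapes[name])]
        sanitized[name] = encoding
    return sanitized


encoding_dict = {"raw_drop_number": {"chunksizes": [5000, 32, 32], "zlib": True}}
shapes = {"raw_drop_number": (1440, 32, 32)}  # hypothetical (time, velocity, diameter) shape
sanitized = clip_chunksizes(encoding_dict, shapes)
```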

disdrodb.l0.l0b_processing.set_encodings(ds: Dataset, sensor_name: str) Dataset[source]#

Apply the encodings to the xarray Dataset.

Parameters
  • ds (xr.Dataset) – Input xarray dataset.

  • sensor_name (str) – Name of the sensor.

Returns

Output xarray dataset.

Return type

xr.Dataset

disdrodb.l0.l0b_processing.write_l0b(ds: Dataset, filepath: str, force=False) None[source]#

Save the xarray dataset into a NetCDF file.

Parameters
  • ds (xr.Dataset) – Input xarray dataset.

  • filepath (str) – Output file path.

  • sensor_name (str) – Name of the sensor.

  • force (bool, optional) – Whether to overwrite existing data. If True, overwrite existing data in the destination directories. If False, raise an error if data already exist in the destination directories. The default is False.

disdrodb.l0.routines module#

Implement DISDRODB wrappers to launch L0 processing in the terminal.

disdrodb.l0.routines.click_l0_archive_options(function: object)[source]#

Click command line arguments for L0 processing archiving of a station.

Parameters

function (object) – Function.

disdrodb.l0.routines.click_l0_processing_options(function: object)[source]#

Click command line default parameters for L0 processing options.

Parameters

function (object) – Function.

disdrodb.l0.routines.click_l0_stations_options(function: object)[source]#

Click command line options for DISDRODB archive L0 processing.

Parameters

function (object) – Function.

disdrodb.l0.routines.click_l0b_concat_options(function: object)[source]#

Click command line default parameters for L0B concatenation.

Parameters

function (object) – Function.

disdrodb.l0.routines.click_remove_l0a_option(function: object)[source]#

Click command line argument for remove_l0a.

disdrodb.l0.routines.run_disdrodb_l0(data_sources=None, campaign_names=None, station_names=None, l0a_processing: bool = True, l0b_processing: bool = True, l0b_concat: bool = False, remove_l0a: bool = False, remove_l0b: bool = False, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None)[source]#

Run the L0 processing of DISDRODB stations.

This function launches the processing of many DISDRODB stations with a single command. From the list of all available DISDRODB stations, it runs the processing of the stations matching the provided data_sources, campaign_names and station_names.

Parameters
  • data_sources (list) – Name of data source(s) to process. The name(s) must be UPPER CASE. If campaign_names and station_names are not specified, process all stations. The default is None.

  • campaign_names (list) – Name of the campaign(s) to process. The name(s) must be UPPER CASE. The default is None.

  • station_names (list) – Station names to process. The default is None.

  • l0a_processing (bool) – Whether to launch processing to generate L0A Apache Parquet file(s) from raw data. The default is True.

  • l0b_processing (bool) – Whether to launch processing to generate L0B netCDF4 file(s) from L0A data. The default is True.

  • l0b_concat (bool) – Whether to concatenate all raw files into a single L0B netCDF file. If l0b_concat=True, all raw files will be saved into a single L0B netCDF file. If l0b_concat=False, each raw file will be converted into the corresponding L0B netCDF file. The default is False.

  • remove_l0a (bool) – Whether to remove the L0A files after having generated the L0B netCDF products. The default is False.

  • remove_l0b (bool) – Whether to remove the L0B files after having concatenated all L0B netCDF files. It takes place only if l0b_concat=True. The default is False.

  • force (bool) – If True, overwrite existing data in the destination directories. If False, raise an error if data already exist in the destination directories. The default is False.

  • verbose (bool) – Whether to print detailed processing information in the terminal. The default is False.

  • parallel (bool) – If True, the files are processed simultaneously in multiple processes. Each process will use a single thread to avoid issues with the HDF/netCDF library. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.

  • debugging_mode (bool) – If True, it reduces the amount of data to process. For L0A, it processes just the first 3 raw data files. For L0B, it processes just the first 100 rows of 3 L0A files. The default is False.

  • base_dir (str (optional)) – Base directory of DISDRODB. Format: <...>/DISDRODB. If None (the default), the base_dir path specified in the DISDRODB active configuration will be used.
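The station selection described for run_disdrodb_l0 can be sketched as a filter over (data_source, campaign_name, station_name) tuples. This is only an illustration, assuming that a filter set to None matches everything. The function name select_stations and the archive content are hypothetical.

```python
def select_stations(available, data_sources=None, campaign_names=None, station_names=None):
    """Keep the (data_source, campaign_name, station_name) tuples matching the filters.

    A filter set to None matches everything.
    """

    def matches(value, allowed):
        return allowed is None or value in allowed

    return [
        (source, campaign, station)
        for source, campaign, station in available
        if matches(source, data_sources)
        and matches(campaign, campaign_names)
        and matches(station, station_names)
    ]


# Hypothetical archive content, for illustration only.
available = [
    ("SOURCE_A", "CAMPAIGN_1", "station_10"),
    ("SOURCE_A", "CAMPAIGN_2", "station_20"),
    ("SOURCE_B", "CAMPAIGN_3", "station_10"),
]
selected = select_stations(available, data_sources=["SOURCE_A"], station_names=["station_10"])
```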

disdrodb.l0.routines.run_disdrodb_l0_station(data_source, campaign_name, station_name, l0a_processing: bool = True, l0b_processing: bool = True, l0b_concat: bool = False, remove_l0a: bool = False, remove_l0b: bool = False, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None)[source]#

Run the L0 processing of a specific DISDRODB station from the terminal.

Parameters
  • data_source (str) – Institution name (when campaign data spans more than 1 country), or country (when all campaigns (or sensor networks) are inside a given country). Must be UPPER CASE.

  • campaign_name (str) – Campaign name. Must be UPPER CASE.

  • station_name (str) – Station name

  • l0a_processing (bool) – Whether to launch processing to generate L0A Apache Parquet file(s) from raw data. The default is True.

  • l0b_processing (bool) – Whether to launch processing to generate L0B netCDF4 file(s) from L0A data. The default is True.

  • l0b_concat (bool) – Whether to concatenate all raw files into a single L0B netCDF file. If l0b_concat=True, all raw files will be saved into a single L0B netCDF file. If l0b_concat=False, each raw file will be converted into the corresponding L0B netCDF file. The default is False.

  • remove_l0a (bool) – Whether to remove the L0A files after having generated the L0B netCDF products. The default is False.

  • remove_l0b (bool) – Whether to remove the L0B files after having concatenated all L0B netCDF files. It takes place only if l0b_concat=True. The default is False.

  • force (bool) – If True, overwrite existing data in the destination directories. If False, raise an error if data already exist in the destination directories. The default is False.

  • verbose (bool) – Whether to print detailed processing information in the terminal. The default is False.

  • parallel (bool) – If True, the files are processed simultaneously in multiple processes. Each process will use a single thread to avoid issues with the HDF/netCDF library. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.

  • debugging_mode (bool) – If True, it reduces the amount of data to process. For L0A, it processes just the first 3 raw data files for each station. For L0B, it processes just the first 100 rows of 3 L0A files for each station. The default is False.

  • base_dir (str (optional)) – Base directory of DISDRODB. Format: <...>/DISDRODB. If None (the default), the base_dir path specified in the DISDRODB active configuration will be used.

disdrodb.l0.routines.run_disdrodb_l0a(data_sources=None, campaign_names=None, station_names=None, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None)[source]#

Run the L0A processing of DISDRODB stations.

This function launches the processing of many DISDRODB stations with a single command. From the list of all available DISDRODB stations, it runs the processing of the stations matching the provided data_sources, campaign_names and station_names.

Parameters
  • data_sources (list) – Name of data source(s) to process. The name(s) must be UPPER CASE. If campaign_names and station_names are not specified, process all stations. The default is None.

  • campaign_names (list) – Name of the campaign(s) to process. The name(s) must be UPPER CASE. The default is None.

  • station_names (list) – Station names to process. The default is None.

  • force (bool) – If True, overwrite existing data in the destination directories. If False, raise an error if data already exist in the destination directories. The default is False.

  • verbose (bool) – Whether to print detailed processing information in the terminal. The default is False.

  • parallel (bool) – If True, the files are processed simultaneously in multiple processes. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process.

  • debugging_mode (bool) – If True, it reduces the amount of data to process. For L0A, it processes just the first 3 raw data files. The default is False.

  • base_dir (str (optional)) – Base directory of DISDRODB. Format: <...>/DISDRODB. If None (the default), the base_dir path specified in the DISDRODB active configuration will be used.

disdrodb.l0.routines.run_disdrodb_l0a_station(data_source, campaign_name, station_name, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None)[source]#

Run the L0A processing of a station by calling disdrodb_l0a_station in the terminal.

disdrodb.l0.routines.run_disdrodb_l0b(data_sources=None, campaign_names=None, station_names=None, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None, remove_l0a: bool = False)[source]#

Run the L0B processing of DISDRODB stations.

This function launches the processing of many DISDRODB stations with a single command. From the list of all available DISDRODB L0A stations, it runs the processing of the stations matching the provided data_sources, campaign_names and station_names.

Parameters
  • data_sources (list) – Name of data source(s) to process. The name(s) must be UPPER CASE. If campaign_names and station_names are not specified, process all stations. The default is None.

  • campaign_names (list) – Name of the campaign(s) to process. The name(s) must be UPPER CASE. The default is None.

  • station_names (list) – Station names to process. The default is None.

  • force (bool) – If True, overwrite existing data in the destination directories. If False, raise an error if data already exist in the destination directories. The default is False.

  • verbose (bool) – Whether to print detailed processing information in the terminal. The default is False.

  • parallel (bool) – If True, the files are processed simultaneously in multiple processes. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process.

  • debugging_mode (bool) – If True, it reduces the amount of data to process. For L0B, it processes just the first 100 rows of 3 L0A files. The default is False.

  • base_dir (str (optional)) – Base directory of DISDRODB. Format: <...>/DISDRODB. If None (the default), the base_dir path specified in the DISDRODB active configuration will be used.

disdrodb.l0.routines.run_disdrodb_l0b_concat(data_sources=None, campaign_names=None, station_names=None, remove_l0b=False, verbose=False, base_dir=None)[source]#

Concatenate the L0B files of the DISDRODB archive.

This function is called by the disdrodb_run_l0b_concat script.

disdrodb.l0.routines.run_disdrodb_l0b_concat_station(data_source, campaign_name, station_name, remove_l0b=False, verbose=False, base_dir=None)[source]#

Concatenate the L0B files of a single DISDRODB station.

This function runs the disdrodb_run_l0b_concat_station script in the terminal.

disdrodb.l0.routines.run_disdrodb_l0b_station(data_source, campaign_name, station_name, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None, remove_l0a: bool = False)[source]#

Run the L0B processing of a station by calling disdrodb_run_l0b_station in the terminal.

disdrodb.l0.standards module#

Retrieve L0 sensor standards.

disdrodb.l0.standards.get_bin_coords_dict(sensor_name: str) dict[source]#

Retrieve diameter (and velocity) bin coordinates.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Dictionary with coordinates arrays.

Return type

dict

disdrodb.l0.standards.get_coords_attrs_dict()[source]#

Return dictionary with DISDRODB coordinates attributes.

disdrodb.l0.standards.get_data_format_dict(sensor_name: str) dict[source]#

Get a dictionary containing the data format of each sensor variable.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Data format of each sensor variable.

Return type

dict

disdrodb.l0.standards.get_data_range_dict(sensor_name: str) dict[source]#

Get the variable data range.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Dictionary with the expected data value range for each data field. It excludes variables without specified data_range key.

Return type

dict

disdrodb.l0.standards.get_diameter_bin_center(sensor_name: str) list[source]#

Get diameter bin center.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Diameter bin center.

Return type

list

disdrodb.l0.standards.get_diameter_bin_lower(sensor_name: str) list[source]#

Get diameter bin lower bound.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Diameter bin lower bound.

Return type

list

disdrodb.l0.standards.get_diameter_bin_upper(sensor_name: str) list[source]#

Get diameter bin upper bound.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Diameter bin upper bound.

Return type

list

disdrodb.l0.standards.get_diameter_bin_width(sensor_name: str) list[source]#

Get diameter bin width.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Diameter bin width.

Return type

list

disdrodb.l0.standards.get_diameter_bins_dict(sensor_name: str) dict[source]#

Get dictionary with sensor_name diameter bins information.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Sensor diameter bins information.

Return type

dict
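The four diameter bin quantities (lower bound, upper bound, center and width) are related to each other. The sketch below assumes that the center is the midpoint of the bounds and that the width is their difference; the actual values are sensor-specific. The function name build_bins_dict, the dictionary keys and the bin bounds are hypothetical.

```python
def build_bins_dict(lower, upper):
    """Hypothetical helper deriving bin centers and widths from the bin bounds."""
    center = [(lo + up) / 2 for lo, up in zip(lower, upper)]  # assumption: midpoint
    width = [up - lo for lo, up in zip(lower, upper)]
    return {"lower": lower, "upper": upper, "center": center, "width": width}


# Hypothetical bounds, for illustration only.
bins = build_bins_dict(lower=[0.0, 0.25, 0.5], upper=[0.25, 0.5, 1.0])
```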

disdrodb.l0.standards.get_dims_size_dict(sensor_name: str) dict[source]#

Get the number of bins for each dimension.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Dictionary with the number of bins for each dimension.

Return type

dict

disdrodb.l0.standards.get_field_nchar_dict(sensor_name: str) dict[source]#

Get the total number of characters from the instrument default string standards.

Important note: it also accounts for the comma and the minus sign.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Dictionary with the expected number of characters for each data field.

Return type

dict

disdrodb.l0.standards.get_field_ndigits_decimals_dict(sensor_name: dict) dict[source]#

Get number of digits on the right side of the comma from the instrument default string standards.

Example: 123,45 -> 45 -> 2 decimal digits.

Parameters

sensor_name (dict) – Name of the sensor.

Returns

Dictionary with the expected number of decimal digits for each data field.

Return type

dict

disdrodb.l0.standards.get_field_ndigits_dict(sensor_name: str) dict[source]#

Get number of digits from the instrument default string standards.

Important note: it excludes the comma but counts the minus sign.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Dictionary with the expected number of digits for each data field.

Return type

dict

disdrodb.l0.standards.get_field_ndigits_natural_dict(sensor_name: str) dict[source]#

Get number of digits on the left side of the comma from the instrument default string standards.

Example: 123,45 -> 123 -> 3 natural digits.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Dictionary with the expected number of natural digits for each data field.

Return type

dict
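The counting conventions stated for the character, digit, natural digit and decimal digit counts can be compared on a single string. The helper below is a hypothetical sketch, not the disdrodb implementation. It assumes a comma as decimal separator, as in the examples, and assumes that the minus sign is not counted among the natural digits.

```python
def count_field_characters(string: str, separator: str = ",") -> dict:
    """Hypothetical helper returning the four counts for a numeric string."""
    natural, _, decimals = string.partition(separator)
    return {
        # Every character, separator and minus sign included.
        "nchar": len(string),
        # Separator excluded, minus sign counted.
        "ndigits": len(string.replace(separator, "")),
        # Assumption: the minus sign is not a natural digit.
        "ndigits_natural": len(natural.lstrip("-")),
        "ndigits_decimals": len(decimals),
    }


counts = count_field_characters("123,45")
signed = count_field_characters("-123,45")
```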

disdrodb.l0.standards.get_l0a_dtype(sensor_name: str) dict[source]#

Get a dictionary containing the L0A dtype.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Dictionary with the L0A dtype.

Return type

dict

disdrodb.l0.standards.get_l0a_encodings_dict(sensor_name: str) dict[source]#

Get a dictionary containing the L0A encodings.

Parameters

sensor_name (str) – Name of the sensor.

Returns

L0A encodings.

Return type

dict

disdrodb.l0.standards.get_l0b_cf_attrs_dict(sensor_name: str) dict[source]#

Get a dictionary containing the CF attributes of each sensor variable.

Parameters

sensor_name (str) – Name of the sensor.

Returns

CF attributes of each sensor variable. For each variable, the ‘units’, ‘description’, and ‘long_name’ attributes are specified.

Return type

dict

disdrodb.l0.standards.get_l0b_encodings_dict(sensor_name: str) dict[source]#

Get a dictionary containing the encoding to write L0B netCDFs.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Encoding to write L0B netCDFs

Return type

dict

disdrodb.l0.standards.get_n_diameter_bins(sensor_name)[source]#

Get the number of diameter bins.

disdrodb.l0.standards.get_n_velocity_bins(sensor_name)[source]#

Get the number of velocity bins.

disdrodb.l0.standards.get_nan_flags_dict(sensor_name: str) dict[source]#

Get the variable nan_flags.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Dictionary with the expected nan_flags list for each data field. It excludes variables without specified nan_flags key.

Return type

dict

disdrodb.l0.standards.get_raw_array_dims_order(sensor_name: str) dict[source]#

Get the dimension order of the raw fields.

The order of dimension specified for raw_drop_number controls the reshaping of the precipitation raw spectrum.

Examples

OTT Parsivel spectrum [v1d1 … v1d32, v2d1, …, v2d32] -> dimension_order = ["velocity_bin_center", "diameter_bin_center"]

Thies LPM spectrum [v1d1 … v20d1, v1d2, …, v20d2] -> dimension_order = ["diameter_bin_center", "velocity_bin_center"]

Parameters

sensor_name (str) – Name of the sensor.

Returns

Dimension order dictionary.

Return type

dict
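The effect of the dimension order on the reshaping can be shown with a toy spectrum of 2 velocity bins and 3 diameter bins. This is a minimal sketch using plain lists and a row-major split; the function name reshape_spectrum is hypothetical and the bin counts are reduced for readability.

```python
def reshape_spectrum(flat, n_outer, n_inner):
    """Split a flat list into n_outer rows of n_inner values (row-major order)."""
    if len(flat) != n_outer * n_inner:
        raise ValueError("Unexpected number of values in the raw spectrum.")
    return [flat[i * n_inner:(i + 1) * n_inner] for i in range(n_outer)]


# Parsivel-like layout: the diameter index varies fastest,
# so the dimension order is [velocity, diameter].
parsivel_flat = ["v1d1", "v1d2", "v1d3", "v2d1", "v2d2", "v2d3"]
parsivel = reshape_spectrum(parsivel_flat, n_outer=2, n_inner=3)

# LPM-like layout: the velocity index varies fastest,
# so the dimension order is [diameter, velocity].
lpm_flat = ["v1d1", "v2d1", "v1d2", "v2d2", "v1d3", "v2d3"]
lpm = reshape_spectrum(lpm_flat, n_outer=3, n_inner=2)
```

In both cases the first dimension of the dimension order is the slowest-varying index of the flattened spectrum.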

disdrodb.l0.standards.get_raw_array_nvalues(sensor_name: str) dict[source]#

Get a dictionary with the number of values expected for each raw array.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Dictionary with the number of values expected for each raw array.

Return type

dict

disdrodb.l0.standards.get_sensor_logged_variables(sensor_name: str) list[source]#

Get the sensor logged variables list.

Parameters

sensor_name (str) – Name of the sensor.

Returns

List of the variables logged by the sensor.

Return type

list

disdrodb.l0.standards.get_time_encoding() dict[source]#

Create time encoding.

Returns

Time encoding.

Return type

dict

disdrodb.l0.standards.get_valid_coordinates_names(sensor_name)[source]#

Get list of valid coordinates for DISDRODB L0B.

disdrodb.l0.standards.get_valid_dimension_names(sensor_name)[source]#

Get list of valid dimension names for DISDRODB L0B.

disdrodb.l0.standards.get_valid_names(sensor_name)[source]#

Return the list of valid variable and coordinates names for DISDRODB L0B.

disdrodb.l0.standards.get_valid_values_dict(sensor_name: str) dict[source]#

Get the list of valid values for a variable.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Dictionary with the expected values for specific variables. It excludes variables without specified valid_values key.

Return type

dict

disdrodb.l0.standards.get_valid_variable_names(sensor_name)[source]#

Get list of valid variables.

disdrodb.l0.standards.get_variables_dimension(sensor_name: str)[source]#

Returns a dictionary with the variable dimensions of a L0B product.

disdrodb.l0.standards.get_velocity_bin_center(sensor_name: str) list[source]#

Get velocity bin center.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Velocity bin center.

Return type

list

disdrodb.l0.standards.get_velocity_bin_lower(sensor_name: str) list[source]#

Get velocity bin lower bound.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Velocity bin lower bound.

Return type

list

disdrodb.l0.standards.get_velocity_bin_upper(sensor_name: str) list[source]#

Get velocity bin upper bound.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Velocity bin upper bound.

Return type

list

disdrodb.l0.standards.get_velocity_bin_width(sensor_name: str) list[source]#

Get velocity bin width.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Velocity bin width.

Return type

list

disdrodb.l0.standards.get_velocity_bins_dict(sensor_name: str) dict[source]#

Get dictionary with sensor_name velocity bins information.

Parameters

sensor_name (str) – Name of the sensor.

Returns

Sensor velocity bins information.

Return type

dict

disdrodb.l0.standards.set_disdrodb_attrs(ds, product: str)[source]#

Add DISDRODB processing information to the netCDF global attributes.

It assumes that the station metadata have already been added to the dataset.

Parameters
  • ds (xr.Dataset) – Dataset.

  • product (str) – DISDRODB product.

Returns

Dataset.

Return type

xr.Dataset

disdrodb.l0.template_tools module#

Useful tools helping in the implementation of the DISDRODB L0 readers.

disdrodb.l0.template_tools.check_column_names(column_names: list, sensor_name: str) None[source]#

Check that the column names respect the DISDRODB standards.

Parameters
  • column_names (list) – List of columns names.

  • sensor_name (str) – Name of the sensor.

Raises

TypeError – Error if some columns do not meet the DISDRODB standards.

disdrodb.l0.template_tools.get_decimal_ndigits(string: str) int[source]#

Get the number of decimal digits.

Parameters

string (str) – Input string.

Returns

The number of decimal digits.

Return type

int

disdrodb.l0.template_tools.get_df_columns_unique_values_dict(df: DataFrame, column_indices: Optional[Union[int, slice, list]] = None, column_names: bool = True)[source]#

Create a dictionary {column: unique values}.

Parameters
  • df (pd.DataFrame) – Input dataframe

  • column_indices (Union[int,slice,list], optional) – Column indices. If None, select all columns.

  • column_names (bool, optional) – If True, the dictionary keys are the column names. The default is True.

disdrodb.l0.template_tools.get_natural_ndigits(string: str) int[source]#

Get the number of natural digits.

Parameters

string (str) – Input string.

Returns

The number of natural digits.

Return type

int

disdrodb.l0.template_tools.get_nchar(string: str) int[source]#

Get the number of characters.

Parameters

string (str) – Input string.

Returns

The number of characters.

Return type

int

disdrodb.l0.template_tools.get_ndigits(string: str) int[source]#

Get the number of total numeric digits.

Parameters

string (str) – Input string

Returns

The number of total digits.

Return type

int

disdrodb.l0.template_tools.infer_column_names(df: DataFrame, sensor_name: str, row_idx: int = 1)[source]#

Try to guess the dataframe columns names based on string characteristics.

Parameters
  • df (pd.DataFrame) – The dataframe to analyse.

  • sensor_name (str) – Name of the sensor.

  • row_idx (int, optional) – The row index of the dataframe to use to infer the column names. The default row index is 1.

Returns

Dictionary with the keys being the column id and the values being the guessed column names

Return type

dict

disdrodb.l0.template_tools.print_df_column_names(df: DataFrame) None[source]#

Print dataframe columns names.

Parameters

df (pd.DataFrame) – The dataframe.

disdrodb.l0.template_tools.print_df_columns_unique_values(df: DataFrame, column_indices: Optional[Union[int, slice, list]] = None, print_column_names: bool = True) None[source]#

Print columns’ unique values.

Parameters
  • df (pd.DataFrame) – Input dataframe

  • column_indices (Union[int,slice,list], optional) – Column indices. If None, select all columns.

  • print_column_names (bool, optional) – If True, print the column names. The default is True.

disdrodb.l0.template_tools.print_df_first_n_rows(df: DataFrame, n: int = 5, print_column_names: bool = True) None[source]#

Print the first n rows of the dataframe by column.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • n (int, optional) – Number of rows. The default is 5.

  • print_column_names (bool, optional) – If True, print the column names. The default is True.

disdrodb.l0.template_tools.print_df_random_n_rows(df: DataFrame, n: int = 5, print_column_names: bool = True) None[source]#

Print the content of the dataframe by column, randomly chosen.

Parameters
  • df (pd.DataFrame) – The dataframe.

  • n (int, optional) – The number of rows to print. The default is 5.

  • print_column_names (bool, optional) – If true, print the column names. The default is True.

disdrodb.l0.template_tools.print_df_summary_stats(df: DataFrame, column_indices: Optional[Union[int, slice, list]] = None, print_column_names: bool = True)[source]#

Create a columns statistics summary.

Parameters
  • df (pd.DataFrame) – Input dataframe

  • column_indices (Union[int,slice,list], optional) – Column indices. If None, select all columns.

  • print_column_names (bool, optional) – If True, print the column names. The default is True.

Raises

ValueError – Error if the column types are not numeric.

disdrodb.l0.template_tools.print_df_with_any_nan_rows(df: DataFrame) None[source]#

Print empty rows.

Parameters

df (pd.DataFrame) – Input dataframe.

disdrodb.l0.template_tools.print_valid_l0_column_names(sensor_name: str) None[source]#

Print valid columns names from the standard.

Parameters

sensor_name (str) – Name of the sensor.

disdrodb.l0.template_tools.str_has_decimal_digits(string: str) bool[source]#

Check if a string has decimals.

Parameters

string (str) – Input string.

Returns

True if the string has decimal digits.

Return type

bool

disdrodb.l0.template_tools.str_is_integer(string: str) bool[source]#

Check if a string represents an integer.

Parameters

string (str) – Input string.

Returns

True if integer.

Return type

bool

disdrodb.l0.template_tools.str_is_number(string: str) bool[source]#

Check if a string represents a number.

Parameters

string (str) – Input string.

Returns

True if the string represents a number.

Return type

bool
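The three string checks can be approximated with Python's built-in parsers. This is a sketch of the general technique and not the disdrodb implementation; the function names are hypothetical, and a dot is assumed as decimal separator. Note that float() also accepts strings such as "nan" and "inf".

```python
def is_number(string: str) -> bool:
    """Return True if the string can be parsed as a float."""
    try:
        float(string)
        return True
    except ValueError:
        return False


def is_integer(string: str) -> bool:
    """Return True if the string can be parsed as an int."""
    try:
        int(string)
        return True
    except ValueError:
        return False


def has_decimal_digits(string: str) -> bool:
    """Return True if the string is a number written with a decimal separator."""
    return is_number(string) and "." in string
```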

Module contents#

disdrodb.l0.available_readers(data_sources=None, reader_path=False)[source]#

Retrieve available readers information.

disdrodb.l0.run_disdrodb_l0(data_sources=None, campaign_names=None, station_names=None, l0a_processing: bool = True, l0b_processing: bool = True, l0b_concat: bool = False, remove_l0a: bool = False, remove_l0b: bool = False, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None)[source]#

Run the L0 processing of DISDRODB stations.

This function allows to launch the processing of many DISDRODB stations with a single command. From the list of all available DISDRODB stations, it runs the processing of the stations matching the provided data_sources, campaign_names and station_names.

Parameters
  • data_sources (list) – Name of data source(s) to process. The name(s) must be UPPER CASE. If campaign_names and station_names are not specified, process all stations. The default is None.

  • campaign_names (list) – Name of the campaign(s) to process. The name(s) must be UPPER CASE. The default is None.

  • station_names (list) – Station names to process. The default is None.

  • l0a_processing (bool) – Whether to launch processing to generate L0A Apache Parquet file(s) from raw data. The default is True.

  • l0b_processing (bool) – Whether to launch processing to generate L0B netCDF4 file(s) from L0A data. The default is True.

  • l0b_concat (bool) – Whether to concatenate all raw files into a single L0B netCDF file. If l0b_concat=True, all raw files will be saved into a single L0B netCDF file. If l0b_concat=False, each raw file will be converted into the corresponding L0B netCDF file. The default is False.

  • remove_l0a (bool) – Whether to remove the L0A files after having generated the L0B netCDF products. The default is False.

  • remove_l0b (bool) – Whether to remove the L0B files after having concatenated all L0B netCDF files. It takes place only if l0b_concat=True. The default is False.

  • force (bool) – If True, overwrite existing data in the destination directories. If False, raise an error if data already exists in the destination directories. The default is False.

  • verbose (bool) – Whether to print detailed processing information to the terminal. The default is False.

  • parallel (bool) – If True, the files are processed simultaneously in multiple processes. Each process will use a single thread to avoid issues with the HDF/netCDF library. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.

  • debugging_mode (bool) – If True, it reduces the amount of data to process. For L0A, it processes just the first 3 raw data files. For L0B, it processes just the first 100 rows of 3 L0A files. The default is False.

  • base_dir (str, optional) – Base directory of DISDRODB. Format: <...>/DISDRODB. If None (the default), the base_dir path specified in the DISDRODB active configuration will be used.
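The station selection described above can be sketched as a simple filter. This is only an illustration: select_stations and the inventory below are hypothetical, and the sketch assumes that the three filters are combined with a logical AND and that None places no restriction.

```python
def select_stations(available, data_sources=None, campaign_names=None, station_names=None):
    """Return the stations that satisfy every filter that was provided.

    `available` is an iterable of (data_source, campaign_name, station_name) tuples.
    A filter left as None places no restriction on that field.
    """

    def matches(value, allowed):
        return allowed is None or value in allowed

    return [
        (data_source, campaign_name, station_name)
        for data_source, campaign_name, station_name in available
        if matches(data_source, data_sources)
        and matches(campaign_name, campaign_names)
        and matches(station_name, station_names)
    ]


# Hypothetical inventory, for illustration only.
available = [
    ("SOURCE_A", "CAMPAIGN_1", "station_1"),
    ("SOURCE_A", "CAMPAIGN_2", "station_1"),
    ("SOURCE_B", "CAMPAIGN_3", "station_7"),
]

selected = select_stations(available, data_sources=["SOURCE_A"], station_names=["station_1"])
all_stations = select_stations(available)
```

With no filters at all, every available station is selected, which matches the documented behaviour when all three arguments are left as None.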

disdrodb.l0.run_disdrodb_l0_station(data_source, campaign_name, station_name, l0a_processing: bool = True, l0b_processing: bool = True, l0b_concat: bool = False, remove_l0a: bool = False, remove_l0b: bool = False, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None)[source]#

Run the L0 processing of a specific DISDRODB station from the terminal.

Parameters
  • data_source (str) – Institution name (when campaign data spans more than 1 country), or country (when all campaigns (or sensor networks) are inside a given country). Must be UPPER CASE.

  • campaign_name (str) – Campaign name. Must be UPPER CASE.

  • station_name (str) – Station name.

  • l0a_processing (bool) – Whether to launch processing to generate L0A Apache Parquet file(s) from raw data. The default is True.

  • l0b_processing (bool) – Whether to launch processing to generate L0B netCDF4 file(s) from L0A data. The default is True.

  • l0b_concat (bool) – Whether to concatenate all raw files into a single L0B netCDF file. If l0b_concat=True, all raw files will be saved into a single L0B netCDF file. If l0b_concat=False, each raw file will be converted into the corresponding L0B netCDF file. The default is False.

  • remove_l0a (bool) – Whether to remove the L0A files after having generated the L0B netCDF products. The default is False.

  • remove_l0b (bool) – Whether to remove the L0B files after having concatenated all L0B netCDF files. It takes place only if l0b_concat=True. The default is False.

  • force (bool) – If True, overwrite existing data in the destination directories. If False, raise an error if data already exists in the destination directories. The default is False.

  • verbose (bool) – Whether to print detailed processing information to the terminal. The default is False.

  • parallel (bool) – If True, the files are processed simultaneously in multiple processes. Each process will use a single thread to avoid issues with the HDF/netCDF library. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.

  • debugging_mode (bool) – If True, it reduces the amount of data to process. For L0A, it processes just the first 3 raw data files for each station. For L0B, it processes just the first 100 rows of 3 L0A files for each station. The default is False.

  • base_dir (str, optional) – Base directory of DISDRODB. Format: <...>/DISDRODB. If None (the default), the base_dir path specified in the DISDRODB active configuration will be used.
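The interplay of the processing and clean-up flags can be summarized with a small sketch. The helper describe_l0_flags is hypothetical and is not part of disdrodb; it encodes one reading of the parameter descriptions above, assuming that remove_l0a only applies once L0B products have been generated and that remove_l0b only applies when l0b_concat=True.

```python
def describe_l0_flags(
    l0a_processing=True,
    l0b_processing=True,
    l0b_concat=False,
    remove_l0a=False,
    remove_l0b=False,
):
    """Return the ordered list of steps implied by the flags (assumed semantics)."""
    steps = []
    if l0a_processing:
        steps.append("l0a")  # raw data -> L0A Apache Parquet files
    if l0b_processing:
        steps.append("l0b")  # L0A data -> L0B netCDF4 files
        if remove_l0a:
            steps.append("remove_l0a")  # only after the L0B products exist
    if l0b_concat:
        steps.append("concat_l0b")  # single concatenated L0B netCDF file
        if remove_l0b:
            steps.append("remove_l0b")  # ignored unless l0b_concat=True
    return steps
```

Listing the implied steps before a long run makes it easier to spot a flag that would have no effect, such as remove_l0b=True with l0b_concat=False.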

disdrodb.l0.run_disdrodb_l0a(data_sources=None, campaign_names=None, station_names=None, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None)[source]#

Run the L0A processing of DISDRODB stations.

This function launches the processing of many DISDRODB stations with a single command. From the list of all available DISDRODB stations, it processes the stations matching the provided data_sources, campaign_names and station_names.

Parameters
  • data_sources (list) – Name of data source(s) to process. The name(s) must be UPPER CASE. If campaign_names and station_names are not specified, process all stations. The default is None.

  • campaign_names (list) – Name of the campaign(s) to process. The name(s) must be UPPER CASE. The default is None.

  • station_names (list) – Station names to process. The default is None.

  • force (bool) – If True, overwrite existing data in the destination directories. If False, raise an error if data already exists in the destination directories. The default is False.

  • verbose (bool) – Whether to print detailed processing information to the terminal. The default is False.

  • parallel (bool) – If True, the files are processed simultaneously in multiple processes. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process.

  • debugging_mode (bool) – If True, it reduces the amount of data to process. For L0A, it processes just the first 3 raw data files. The default is False.

  • base_dir (str, optional) – Base directory of DISDRODB. Format: <...>/DISDRODB. If None (the default), the base_dir path specified in the DISDRODB active configuration will be used.

disdrodb.l0.run_disdrodb_l0a_station(data_source, campaign_name, station_name, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None)[source]#

Run the L0A processing of a station by calling disdrodb_l0a_station in the terminal.

disdrodb.l0.run_disdrodb_l0b(data_sources=None, campaign_names=None, station_names=None, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None, remove_l0a: bool = False)[source]#

Run the L0B processing of DISDRODB stations.

This function launches the processing of many DISDRODB stations with a single command. From the list of all available DISDRODB L0A stations, it processes the stations matching the provided data_sources, campaign_names and station_names.

Parameters
  • data_sources (list) – Name of data source(s) to process. The name(s) must be UPPER CASE. If campaign_names and station_names are not specified, process all stations. The default is None.

  • campaign_names (list) – Name of the campaign(s) to process. The name(s) must be UPPER CASE. The default is None.

  • station_names (list) – Station names to process. The default is None.

  • force (bool) – If True, overwrite existing data in the destination directories. If False, raise an error if data already exists in the destination directories. The default is False.

  • verbose (bool) – Whether to print detailed processing information to the terminal. The default is False.

  • parallel (bool) – If True, the files are processed simultaneously in multiple processes. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process.

  • debugging_mode (bool) – If True, it reduces the amount of data to process. For L0B, it processes just the first 100 rows of 3 L0A files. The default is False.

  • base_dir (str, optional) – Base directory of DISDRODB. Format: <...>/DISDRODB. If None (the default), the base_dir path specified in the DISDRODB active configuration will be used.
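Because run_disdrodb_l0a and run_disdrodb_l0b share most of their keyword arguments, the two stages can be driven from one set of options. The following is a rough sketch: the helper names and the source name "SOURCE_A" are hypothetical, and the import is deferred so the snippet loads without disdrodb installed. Only the keyword names are taken from the signatures documented above.

```python
def common_l0_kwargs(data_sources, campaign_names=None, station_names=None, debugging_mode=True):
    """Build the keyword arguments shared by run_disdrodb_l0a and run_disdrodb_l0b."""
    return {
        "data_sources": data_sources,
        "campaign_names": campaign_names,
        "station_names": station_names,
        "force": False,  # do not overwrite existing products
        "verbose": True,
        "debugging_mode": debugging_mode,  # small subset of the data for a quick trial
        "parallel": False,
    }


def run_l0a_then_l0b(data_sources, **filters):
    """Run the L0A stage and then the L0B stage with identical station filters."""
    # Deferred import: disdrodb is only required when the stages are actually run.
    from disdrodb.l0 import run_disdrodb_l0a, run_disdrodb_l0b

    kwargs = common_l0_kwargs(data_sources, **filters)
    run_disdrodb_l0a(**kwargs)
    run_disdrodb_l0b(remove_l0a=False, **kwargs)


# Hypothetical data source name, for illustration only.
kwargs = common_l0_kwargs(["SOURCE_A"])
```

Leaving base_dir out means the default of None applies, so the base directory from the active DISDRODB configuration is used.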

disdrodb.l0.run_disdrodb_l0b_station(data_source, campaign_name, station_name, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True, base_dir: Optional[str] = None, remove_l0a: bool = False)[source]#

Run the L0B processing of a station by calling disdrodb_run_l0b_station in the terminal.

disdrodb.l0.run_l0a(raw_dir, processed_dir, station_name, glob_patterns, column_names, reader_kwargs, df_sanitizer_fun, parallel, verbose, force, debugging_mode)[source]#

Run the L0A processing for a specific DISDRODB station.

This function is called in each reader to convert raw text files into DISDRODB L0A products.

Parameters
  • raw_dir (str) –

    The directory path where all the raw content of a specific campaign is stored. The path must have the following structure: <...>/DISDRODB/Raw/<DATA_SOURCE>/<CAMPAIGN_NAME>. Inside the raw_dir directory, it is required to adopt the following structure:

    - ``/data/<station_name>/<raw_files>``
    - ``/metadata/<station_name>.yml``
    

    Important points:

    • For each <station_name>, there must be a corresponding YAML file in the metadata subdirectory.

    • The campaign_name is expected to be UPPER CASE.

    • The <CAMPAIGN_NAME> must match semantically across:
      • the raw_dir and processed_dir directory paths;

      • the campaign_name key within the metadata YAML files.

  • processed_dir (str) – The desired directory path for the processed DISDRODB L0A and L0B products. The path should have the following structure: <...>/DISDRODB/Processed/<DATA_SOURCE>/<CAMPAIGN_NAME>. For testing purposes, this function exceptionally also accepts a directory path simply ending with <CAMPAIGN_NAME> (e.g., /tmp/<CAMPAIGN_NAME>).

  • station_name (str) – The name of the station.

  • glob_patterns (str) – Glob pattern to search for data files in <raw_dir>/data/<station_name>.

  • column_names (list) – Column names of the raw text file.

  • reader_kwargs (dict) – Arguments for Pandas read_csv function to open the text file.

  • df_sanitizer_fun (callable, optional) – Sanitizer function to format the DataFrame into DISDRODB L0A standard. Default is None.

  • parallel (bool, optional) – If True, process the files simultaneously in multiple processes. The number of simultaneous processes can be customized using a dask.distributed.LocalCluster. If False, process the files sequentially in a single process. Default is False.

  • verbose (bool, optional) – If True, print detailed processing information to the terminal. Default is False.

  • force (bool, optional) – If True, overwrite existing data in destination directories. If False, raise an error if data already exists in destination directories. Default is False.

  • debugging_mode (bool, optional) – If True, reduce the amount of data to process. Processes only the first 100 rows of 3 raw data files. Default is False.
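The raw_dir layout requirements listed above can be checked before launching a reader. The sketch below uses only the standard library; the helper names and the directory names in the demo are hypothetical, and the checks cover just two of the stated rules (one YAML file per station, upper-case campaign name).

```python
import tempfile
from pathlib import Path


def stations_missing_metadata(raw_dir):
    """Return station names found under data/ that lack metadata/<station_name>.yml."""
    raw_dir = Path(raw_dir)
    station_names = sorted(p.name for p in (raw_dir / "data").iterdir() if p.is_dir())
    return [
        name
        for name in station_names
        if not (raw_dir / "metadata" / f"{name}.yml").is_file()
    ]


def campaign_name_is_upper_case(raw_dir):
    """Check that the last path component (<CAMPAIGN_NAME>) is upper case."""
    name = Path(raw_dir).name
    return name == name.upper()


# Build a throwaway layout with hypothetical names to exercise the checks.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "DISDRODB" / "Raw" / "SOURCE_A" / "CAMPAIGN_1"
    (raw_dir / "data" / "station_1").mkdir(parents=True)
    (raw_dir / "data" / "station_2").mkdir(parents=True)
    (raw_dir / "metadata").mkdir()
    (raw_dir / "metadata" / "station_1.yml").write_text("campaign_name: CAMPAIGN_1\n")

    missing = stations_missing_metadata(raw_dir)
    upper_ok = campaign_name_is_upper_case(raw_dir)
```

In this demo, station_2 has a data directory but no YAML file, so it is the only station reported as missing metadata.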

disdrodb.l0.run_l0b_from_nc(raw_dir, processed_dir, station_name, glob_patterns, dict_names, ds_sanitizer_fun, parallel, verbose, force, debugging_mode)[source]#

Run the L0B processing for a specific DISDRODB station with raw netCDFs.

This function is called in the reader where raw netCDF files must be converted into DISDRODB L0B format.

Parameters
  • raw_dir (str) –

    The directory path where all the raw content of a specific campaign is stored. The path must have the following structure: <...>/DISDRODB/Raw/<DATA_SOURCE>/<CAMPAIGN_NAME>. Inside the raw_dir directory, it is required to adopt the following structure:

    - ``/data/<station_name>/<raw_files>``
    - ``/metadata/<station_name>.yml``
    

    Important points:

    • For each <station_name>, there must be a corresponding YAML file in the metadata subdirectory.

    • The campaign_name is expected to be UPPER CASE.

    • The <CAMPAIGN_NAME> must match semantically across:
      • the raw_dir and processed_dir directory paths;

      • the campaign_name key within the metadata YAML files.

  • processed_dir (str) – The desired directory path for the processed DISDRODB L0A and L0B products. The path should have the following structure: <...>/DISDRODB/Processed/<DATA_SOURCE>/<CAMPAIGN_NAME>. For testing purposes, this function exceptionally also accepts a directory path simply ending with <CAMPAIGN_NAME> (e.g., /tmp/<CAMPAIGN_NAME>).

  • station_name (str) – The name of the station.

  • glob_patterns (str) – Glob pattern to search data files in <raw_dir>/data/<station_name>. Example: glob_patterns = "*.nc"

  • dict_names (dict) – Dictionary mapping raw netCDF variables/coordinates/dimension names to DISDRODB standards.

  • ds_sanitizer_fun (object, optional) – Sanitizer function to format the raw netCDF into DISDRODB L0B standard.

  • force (bool, optional) – If True, overwrite existing data in destination directories. If False, raise an error if data already exists in destination directories. Default is False.

  • verbose (bool, optional) – If True, print detailed processing information to the terminal. Default is True.

  • parallel (bool, optional) – If True, process the files simultaneously in multiple processes. The number of simultaneous processes can be customized using a dask.distributed.LocalCluster. Ensure that threads_per_worker (the number of threads per process) is set to 1 to avoid HDF errors. Also, be sure to set the HDF5_USE_FILE_LOCKING environment variable to False. If False, process the files sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks. Default is False.

  • debugging_mode (bool, optional) – If True, reduce the amount of data to process. Only the first 3 raw netCDF files will be processed. Default is False.
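The parallel setup recommended above (one thread per worker, HDF5 file locking disabled) might be prepared as follows. This is a sketch under assumptions: start_single_threaded_cluster is a hypothetical helper, the environment variable is given the string value "FALSE", and the processing is assumed to pick up the active Dask client. The dask.distributed import is deferred so the snippet loads even where Dask is not installed.

```python
import os

# Disable HDF5 file locking before any HDF5/netCDF library is imported.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"


def start_single_threaded_cluster(n_workers=None):
    """Start a local Dask cluster with one thread per worker process."""
    # Deferred import: dask.distributed is only needed when parallel=True.
    from dask.distributed import Client, LocalCluster

    if n_workers is None:
        n_workers = os.cpu_count() or 1
    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=1,  # a single thread per process avoids HDF errors
        processes=True,
    )
    return Client(cluster)
```

A reader would typically call start_single_threaded_cluster() once, before invoking the processing with parallel=True, and close the returned client when the run finishes.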