DISDRODB reader preparation tutorial

This notebook aims to guide you through creating the reader for the raw files logged by a disdrometer device.

First, this notebook provides functions that display the content of your raw data files and let you investigate it.

Next, you will define a series of parameters that control the reader behaviour. These pieces of code will then be consolidated in the reader_template.py file to generate a DISDRODB L0 reader.

In this notebook, we use a lightweight dataset for illustrative purposes. You may reuse the notebook and adapt it to explore your own dataset when preparing a new reader.

Following the documentation in How to Contribute New Data to DISDRODB, you should have already:

  • defined the metadata for the stations for which you want to create the reader

  • copied the raw data within the correct folder of the local DISDRODB archive

  • copied reader_template.py, placed it in the correct disdrodb.l0.<READER_DATA_SOURCE> directory, and renamed it <READER_NAME>.py

For this tutorial, we have prepared some sample data in the folder data/DISDRODB of the disdrodb repository. This data/DISDRODB directory will act as a toy DISDRODB base directory.

The data corresponds to some measurements taken at two stations (station_name_1 and station_name_2) during two days of a field campaign led by the EPFL LTE laboratory.

📁 DISDRODB
├── 📁 Raw
    ├── 📁 DATA_SOURCE
        ├── 📁 CAMPAIGN_NAME
            ├── 📁 data
                ├── 📁 station_name_1
                    ├── 📜 file60_20180817.dat.gz
                    ├── 📜 file60_20180818.dat.gz
                ├── 📁 station_name_2
                    ├── 📜 file61_20180817.dat.gz
                    ├── 📜 file61_20180818.dat.gz
            ├── 📁 info
            ├── 📁 issue
                ├── 📜 station_name_1.yml
                ├── 📜 station_name_2.yml
            ├── 📁 metadata
                ├── 📜 station_name_1.yml
                ├── 📜 station_name_2.yml
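
As a rough illustration of this layout, the sketch below composes the expected location of a station's raw files and of its metadata file from the folder names shown in the tree. It relies only on pathlib; the helper names station_data_dir and station_metadata_path are hypothetical and are not part of the disdrodb API.

```python
from pathlib import PurePosixPath


def station_data_dir(base_dir, data_source, campaign_name, station_name):
    """Return the folder expected to hold the raw files of one station."""
    return PurePosixPath(base_dir, "Raw", data_source, campaign_name, "data", station_name)


def station_metadata_path(base_dir, data_source, campaign_name, station_name):
    """Return the expected path of the station metadata YAML file."""
    return PurePosixPath(
        base_dir, "Raw", data_source, campaign_name, "metadata", f"{station_name}.yml"
    )


base_dir = "data/DISDRODB"  # toy base directory used in this tutorial
print(station_data_dir(base_dir, "DATA_SOURCE", "CAMPAIGN_NAME", "station_name_1"))
print(station_metadata_path(base_dir, "DATA_SOURCE", "CAMPAIGN_NAME", "station_name_1"))
```

In the notebook itself, the campaign directory is built with define_campaign_dir; this sketch only makes the folder convention explicit.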

Step 1: Read and analyse the data

The goal of Step 1 is to define the specifications to read the raw data into a dataframe and ensure that the dataframe columns match the DISDRODB standards. At the end of this tutorial, you should be able to generate Apache Parquet files from your input raw data.


Here we load the required modules and packages. Nothing needs to be changed here.

[1]:
# Define project base directory
import os

root_path = os.path.dirname(os.getcwd())  # something like /home/ghiggi/Projects/disdrodb
print(root_path)
/home/ghiggi/Python_Packages/disdrodb
[4]:
import pandas as pd

from disdrodb.api.checks import check_sensor_name

# Directories and paths
from disdrodb.api.create_directories import create_l0_directory_structure
from disdrodb.api.info import infer_path_info_dict
from disdrodb.api.path import define_campaign_dir

# Standards
from disdrodb.l0.check_standards import check_l0a_column_names

# L0A processing
from disdrodb.l0.io import get_raw_filepaths
from disdrodb.l0.l0a_processing import (
    read_raw_file,
    read_raw_files,
)

# L0B processing
from disdrodb.l0.l0b_processing import (
    create_l0b_from_l0a,
)

# Tools to develop the reader
from disdrodb.l0.template_tools import (
    check_column_names,
    get_df_columns_unique_values_dict,
    infer_column_names,
    print_df_column_names,
    print_df_columns_unique_values,
    print_df_first_n_rows,
    print_df_random_n_rows,
    print_df_summary_stats,
    print_valid_l0_column_names,
)

# Metadata
from disdrodb.metadata import read_station_metadata

1. Define paths and running parameters

In the following section, define the raw and processed directory paths. This may be changed if you are using another folder.

NB:

  • In a real use case, DATA_SOURCE and CAMPAIGN_NAME should be replaced by meaningful names!

  • The raw_dir and processed_dir must end with the same CAMPAIGN_NAME (in upper case).

[5]:
base_dir = os.path.join(root_path, "data", "DISDRODB")
data_source = "DATA_SOURCE"
campaign_name = "CAMPAIGN_NAME"
raw_dir = define_campaign_dir(
    base_dir=base_dir,
    product="RAW",
    data_source=data_source,
    campaign_name=campaign_name,
)
processed_dir = define_campaign_dir(
    base_dir=base_dir,
    product="L0A",
    data_source=data_source,
    campaign_name=campaign_name,
)
assert os.path.exists(raw_dir), "Raw directory does not exist"
print(f"raw_dir: {raw_dir}")
print(f"processed_dir: {processed_dir}")
raw_dir: /home/ghiggi/Python_Packages/disdrodb/data/DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME
processed_dir: /home/ghiggi/Python_Packages/disdrodb/data/DISDRODB/Processed/DATA_SOURCE/CAMPAIGN_NAME
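
The rule that raw_dir and processed_dir must end with the same upper-case CAMPAIGN_NAME can be checked by hand before going further. The snippet below is a minimal sketch of such a check; the function same_campaign is a hypothetical helper written for this tutorial, not a disdrodb function.

```python
import os


def same_campaign(raw_dir, processed_dir):
    """Check that both directories end with the same upper-case campaign name."""
    raw_name = os.path.basename(os.path.normpath(raw_dir))
    processed_name = os.path.basename(os.path.normpath(processed_dir))
    return raw_name == processed_name and raw_name == raw_name.upper()


# Example paths mirroring the structure printed above (shortened, illustrative)
print(same_campaign("DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME",
                    "DISDRODB/Processed/DATA_SOURCE/CAMPAIGN_NAME"))
```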

Then we define the reader execution parameters. Once the new reader is created, these parameters will become the arguments of the reader function. Please have a look at the documentation for a full description.

[6]:
force = True
parallel = False
verbose = True
debugging_mode = True
sensor_name = "OTT_Parsivel"

2. Selection of the station

In this example, we choose to implement and run the reader for station station_name_1. However, feel free to change the station name :)

[7]:
station_name = "station_name_1"

3. Initialization

We run some checks and retrieve some variables. Nothing needs to be changed here.

[10]:
# Create directory structure
create_l0_directory_structure(
    raw_dir=raw_dir,
    processed_dir=processed_dir,
    station_name=station_name,
    force=force,
    product="L0A",
)

Please make sure to run the cell above only once. If it is run several times, the log file blocks the folder creation.

4. Get the list of files to process

We now list all raw data files that are available for the selected station. Here we need to specify the glob pattern that selects all the relevant data files. Since the files in this case study are named like file<XXX>_<TIME>.dat.gz, we define the glob pattern "*.dat*". Note that "*.dat.gz" or "file*.dat.gz" would also have worked.

[11]:
glob_pattern = "*.dat*"

filepaths = get_raw_filepaths(
    raw_dir=raw_dir,
    station_name=station_name,
    glob_patterns=glob_pattern,
    verbose=verbose,
    debugging_mode=debugging_mode,
)

print(filepaths)
 -  - 2 files to process in /home/ghiggi/Python_Packages/disdrodb/data/DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME/data/station_name_1
['/home/ghiggi/Python_Packages/disdrodb/data/DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME/data/station_name_1/file60_20180817.dat.gz', '/home/ghiggi/Python_Packages/disdrodb/data/DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME/data/station_name_1/file60_20180818.dat.gz']

🚨 The glob_pattern variable definition will be transferred into your reader function at the end of this notebook.

Remember that the glob_pattern variable depends on the file naming and extensions of your raw data !!!
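
To get a feeling for how a glob pattern behaves before committing to it, you can test candidate patterns against a few file names with the standard-library fnmatch module, which applies the same shell-style wildcard rules as glob. This is only a sketch: the file station_name_1.yml is added here as a hypothetical non-data file to show what a pattern rejects.

```python
from fnmatch import fnmatch

filenames = [
    "file60_20180817.dat.gz",
    "file60_20180818.dat.gz",
    "station_name_1.yml",  # hypothetical file that should NOT be selected
]
patterns = ["*.dat*", "*.dat.gz", "file*.dat.gz"]

# For each candidate pattern, collect the file names it selects
matches = {
    pattern: [name for name in filenames if fnmatch(name, pattern)]
    for pattern in patterns
}
for pattern, matched in matches.items():
    print(f"{pattern!r} selects {matched}")
```

All three patterns select the two .dat.gz files and skip the .yml file, which is why any of them would work for this case study.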

5. Retrieve metadata from YAML files

We now load the metadata file of the station.

If the name of the station is not correctly defined, an error message is raised.

[12]:
# Retrieve metadata
attrs = read_station_metadata(station_name=station_name,
                              product="RAW",
                              **infer_path_info_dict(raw_dir))
# Retrieve sensor name
sensor_name = attrs["sensor_name"]
check_sensor_name(sensor_name)

5. Load one file into a dataframe

In the reader_kwargs dictionary, you may set any arguments that need to be passed to read the raw text file into a pandas.DataFrame.

[13]:
reader_kwargs = {}

# - Define delimiter
reader_kwargs["delimiter"] = ","

# - Avoid first column to become df index !!!
reader_kwargs["index_col"] = False

# Since column names are expected to be passed explicitly, header is set to None
reader_kwargs["header"] = None

# - Number of rows to be skipped at the beginning of the file
reader_kwargs["skiprows"] = None

# - Define behaviour when encountering bad lines
reader_kwargs["on_bad_lines"] = "skip"

# - Define reader engine
#   - C engine is faster
#   - Python engine is more feature-complete
reader_kwargs["engine"] = "python"

# - Define on-the-fly decompression of on-disk data
#   - Available: gzip, bz2, zip
reader_kwargs["compression"] = "infer"

# - Strings to recognize as NA/NaN and replace with standard NA flags
#   - Already included: '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN',
#                       '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A',
#                       'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'
reader_kwargs["na_values"] = ["na", "", "error"]


# -----------------------------------------------------------
# Select first file
filepath = filepaths[0]

# Try to read the raw file
df_raw = read_raw_file(filepath, column_names=None, reader_kwargs=reader_kwargs)
# Print the dataframe
print(f"Dataframe for the file {os.path.basename(filepath)} :")
display(df_raw)
Dataframe for the file file60_20180817.dat.gz :
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
0 362511 4612.0301 00847.4977 01-08-2018 12:44:30 NaN OK 0000.000 0056.49 00 00 ... 035 0.06 24.9 0 005.649 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0
1 362512 4612.0301 00847.4978 01-08-2018 12:45:01 NaN OK 0000.000 0056.49 00 00 ... 035 0.06 24.9 0 005.649 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0
2 362513 4612.0301 00847.4985 01-08-2018 12:45:30 NaN OK 0000.000 0056.49 00 00 ... 035 0.06 24.9 0 005.649 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0
3 362514 4612.0305 00847.4990 01-08-2018 12:46:01 NaN OK 0000.000 0056.49 00 00 ... 035 0.05 24.9 0 005.649 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0
4 362515 4612.0303 00847.4992 01-08-2018 12:46:31 NaN OK 0000.000 0056.49 00 00 ... 034 0.06 24.9 0 005.649 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4736 367249 4612.0313 00847.4956 03-08-2018 04:13:25 NaN OK 0000.000 0056.71 00 00 ... 015 0.06 24.9 0 005.671 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0
4737 367250 4612.0313 00847.4955 03-08-2018 04:13:56 NaN OK 0000.000 0056.71 00 00 ... 015 0.06 24.9 0 005.671 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0
4738 367251 4612.0313 00847.4955 03-08-2018 04:14:26 NaN OK 0000.000 0056.71 00 00 ... 015 0.06 24.9 0 005.671 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0
4739 367252 4612.0313 00847.4954 03-08-2018 04:14:55 NaN OK 0000.000 0056.71 00 00 ... 015 0.06 24.9 0 005.671 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0
4740 367253 4612.0313 00847.4954 03-08-2018 04:15:25 NaN OK 0000.000 0056.71 00 00 ... 015 0.07 24.9 0 005.671 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0

4741 rows × 24 columns

[14]:
print("Column names:", df_raw.columns)
print("Row Index:", df_raw.index)
Column names: Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23],
      dtype='int64')
Row Index: RangeIndex(start=0, stop=4741, step=1)

Here we expect df_raw to have:

  • numeric column names (an integer index, shown as Int64Index in older pandas versions)

  • a numeric row index (i.e. RangeIndex)

If the structure of the dataframe looks fine (no header and no row index), we are on the right track!

Depending on the schema of your data, this reader_kwargs dictionary may be fairly different from the one above.

🚨 The reader_kwargs dictionary will be transferred to your reader function at the end of this notebook.
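
To see what these options do in isolation, the sketch below passes the same reader_kwargs directly to pandas.read_csv on a small in-memory sample. It assumes the options are ordinary pandas.read_csv keyword arguments, as their names suggest; the sample rows are made up, and dtype=str is an extra assumption added here so that values stay as strings. It is not the implementation of read_raw_file.

```python
import io

import pandas as pd

# Made-up sample mimicking a few comma-separated fields of a raw file
sample_text = (
    "362511,01-08-2018 12:44:30,OK,0000.000\n"
    "362512,01-08-2018 12:45:01,error,0000.000\n"
)

reader_kwargs = {
    "delimiter": ",",
    "index_col": False,  # do not use the first column as row index
    "header": None,  # the file has no header line
    "skiprows": None,
    "on_bad_lines": "skip",
    "engine": "python",
    "compression": "infer",
    "na_values": ["na", "", "error"],
}

# dtype=str keeps every value as text (assumption made for this sketch)
df_sample = pd.read_csv(io.StringIO(sample_text), dtype=str, **reader_kwargs)

print(list(df_sample.columns))  # integer column names
print(df_sample.index)  # default RangeIndex
print(df_sample)
```

The string "error" in the second row is read as a missing value because it is listed in na_values.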

6. Data exploration

Now that the settings for searching and reading the raw data are specified, we can load one file and analyse its content to see if there are any errors or inconsistencies.

Here are some instructions :

  • Do not assign column names to the dataframe columns yet

  • Do not assign a dtype to the dataframe columns yet

  • Possibly look at multiple files !

We print the content of the first 3 rows (feel free to change the value of n to see more or fewer rows):

[15]:
print_df_first_n_rows(df_raw, n=2, print_column_names=False)
 - Column 0 :
      ['362511' '362512' '362513']
 - Column 1 :
      ['4612.0301' '4612.0301' '4612.0301']
 - Column 2 :
      ['00847.4977' '00847.4978' '00847.4985']
 - Column 3 :
      ['01-08-2018 12:44:30' '01-08-2018 12:45:01' '01-08-2018 12:45:30']
 - Column 4 :
      [nan nan nan]
 - Column 5 :
      ['OK' 'OK' 'OK']
 - Column 6 :
      ['0000.000' '0000.000' '0000.000']
 - Column 7 :
      ['0056.49' '0056.49' '0056.49']
 - Column 8 :
      ['00' '00' '00']
 - Column 9 :
      ['00' '00' '00']
 - Column 10 :
      ['-9.999' '-9.999' '-9.999']
 - Column 11 :
      ['9999' '9999' '9999']
 - Column 12 :
      ['12611' '12617' '12600']
 - Column 13 :
      ['00000' '00000' '00000']
 - Column 14 :
      ['035' '035' '035']
 - Column 15 :
      ['0.06' '0.06' '0.06']
 - Column 16 :
      ['24.9' '24.9' '24.9']
 - Column 17 :
      ['0' '0' '0']
 - Column 18 :
      ['005.649' '005.649' '005.649']
 - Column 19 :
      ['000' '000' '000']
 - Column 20 :
      ['-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,'
 '-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,'
 '-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,']
 - Column 21 :
      ['00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,'
 '00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,'
 '00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,']
 - Column 22 :



 - Column 23 :
      ['0' '0' '0']
[16]:
df_raw.head(3)
[16]:
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
0 362511 4612.0301 00847.4977 01-08-2018 12:44:30 NaN OK 0000.000 0056.49 00 00 ... 035 0.06 24.9 0 005.649 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0
1 362512 4612.0301 00847.4978 01-08-2018 12:45:01 NaN OK 0000.000 0056.49 00 00 ... 035 0.06 24.9 0 005.649 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0
2 362513 4612.0301 00847.4985 01-08-2018 12:45:30 NaN OK 0000.000 0056.49 00 00 ... 035 0.06 24.9 0 005.649 000 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 0

3 rows × 24 columns

We print the content of n rows picked randomly :

[17]:
print_df_random_n_rows(df_raw, n=6, print_column_names=False)
 - Column 0 :
      ['362996' '367008' '364763' '363666' '366535' '366533']
 - Column 1 :
      ['4612.0298' '4612.0289' '4612.0321' '4612.0315' '4612.0310' '4612.0309']
 - Column 2 :
      ['00847.4959' '00847.4950' '00847.4965' '00847.4955' '00847.4951'
 '00847.4950']
 - Column 3 :
      ['01-08-2018 16:47:00' '03-08-2018 02:13:01' '02-08-2018 07:30:31'
 '01-08-2018 22:22:01' '02-08-2018 22:16:30' '02-08-2018 22:15:30']
 - Column 4 :
      [nan nan nan nan nan nan]
 - Column 5 :
      ['OK' 'OK' 'OK' 'OK' 'OK' 'OK']
 - Column 6 :
      ['0000.000' '0000.000' '0000.000' '0000.000' '0000.000' '0000.000']
 - Column 7 :
      ['0056.52' '0056.71' '0056.67' '0056.67' '0056.71' '0056.71']
 - Column 8 :
      ['00' '00' '00' '00' '00' '00']
 - Column 9 :
      ['00' '00' '00' '00' '00' '00']
 - Column 10 :
      ['-9.999' '-9.999' '-9.999' '-9.999' '-9.999' '-9.999']
 - Column 11 :
      ['9999' '9999' '9999' '9999' '9999' '9999']
 - Column 12 :
      ['12510' '11702' '12580' '12540' '12144' '12128']
 - Column 13 :
      ['00000' '00000' '00000' '00000' '00000' '00000']
 - Column 14 :
      ['021' '015' '025' '018' '016' '016']
 - Column 15 :
      ['0.06' '0.06' '0.05' '0.06' '0.06' '0.06']
 - Column 16 :
      ['24.9' '24.9' '24.9' '24.9' '24.9' '24.9']
 - Column 17 :
      ['0' '0' '0' '0' '0' '0']
 - Column 18 :
      ['005.652' '005.671' '005.667' '005.667' '005.671' '005.671']
 - Column 19 :
      ['000' '000' '000' '000' '000' '000']
 - Column 20 :
      ['-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,'
 '-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,'
 '-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,'
 '-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,'
 '-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,'
 '-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,']
 - Column 21 :
      ['00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,'
 '00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,'
 '00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,'
 '00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,'
 '00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,'
 '00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,']
 - Column 22 :

 '000,000,000,000,000,000,000,000,000,000,000,00...'




 - Column 23 :
      ['0' '0' '0' '0' '0' '0']

Get the number of columns:

[18]:
len(df_raw.columns)
[18]:
24

Look at unique values for a single column :

[19]:
print_df_columns_unique_values(df_raw, column_indices=11, print_column_names=False)
 - Column 11 :
      ['0824', '0906', '1363', '1397', '2921', '3203', '3326', '3816', '4465', '9999']

Look at unique values for a few columns :

Note: Use column_indices=None to get the unique values for all columns

[20]:
print_df_columns_unique_values(df_raw, column_indices=slice(10, 12), print_column_names=False)
 - Column 10 :
      ['-9.999', '02.669', '04.241', '04.745', '04.826', '04.879', '05.430', '06.095', '06.220', '07.415', '08.436', '08.489', '08.506', '08.724', '08.956', '09.079', '09.894', '10.057', '10.567', '11.705', '12.097', '12.390', '12.923', '13.114', '13.407', '13.684', '14.324', '15.060', '16.530', '16.636', '16.668', '17.194', '17.382', '17.829', '17.918', '18.334', '18.655', '19.526', '20.329', '21.134', '21.426', '23.098', '23.664', '23.760', '24.472', '25.473', '25.957', '29.270', '31.271', '32.255', '33.844', '36.196']
 - Column 11 :
      ['0824', '0906', '1363', '1397', '2921', '3203', '3326', '3816', '4465', '9999']

Get the unique values as a dictionary:

[21]:
get_df_columns_unique_values_dict(df_raw, column_indices=slice(10, 12), column_names=False)
[21]:
{'Column 10': ['-9.999',
  '02.669',
  '04.241',
  '04.745',
  '04.826',
  '04.879',
  '05.430',
  '06.095',
  '06.220',
  '07.415',
  '08.436',
  '08.489',
  '08.506',
  '08.724',
  '08.956',
  '09.079',
  '09.894',
  '10.057',
  '10.567',
  '11.705',
  '12.097',
  '12.390',
  '12.923',
  '13.114',
  '13.407',
  '13.684',
  '14.324',
  '15.060',
  '16.530',
  '16.636',
  '16.668',
  '17.194',
  '17.382',
  '17.829',
  '17.918',
  '18.334',
  '18.655',
  '19.526',
  '20.329',
  '21.134',
  '21.426',
  '23.098',
  '23.664',
  '23.760',
  '24.472',
  '25.473',
  '25.957',
  '29.270',
  '31.271',
  '32.255',
  '33.844',
  '36.196'],
 'Column 11': ['0824',
  '0906',
  '1363',
  '1397',
  '2921',
  '3203',
  '3326',
  '3816',
  '4465',
  '9999']}
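
If you want to reproduce this kind of summary outside of the template tools, plain pandas is enough. The sketch below builds a dictionary of sorted unique values per column for a tiny made-up dataframe; it mimics the "Column <i>" keys shown above but is not the implementation of get_df_columns_unique_values_dict.

```python
import pandas as pd

# Tiny made-up dataframe with integer column names, as returned for df_raw
df_demo = pd.DataFrame(
    {
        0: ["-9.999", "02.669", "-9.999"],
        1: ["9999", "0824", "9999"],
    }
)

# Sorted unique values of every column, keyed by column position
unique_values = {
    f"Column {i}": sorted(df_demo.iloc[:, i].dropna().unique().tolist())
    for i in range(df_demo.shape[1])
}
print(unique_values)
```

Sentinel-like values such as '-9.999' or '9999' stand out immediately in such a listing, which helps when deciding what to flag as missing later on.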

7. Column names

We have now inspected the content of our data. It’s time to take care of its structure: the column names.

The function infer_column_names() tries to guess the column names based on the type of sensor and the sensor specifications described in the raw_data_format.yml config file.

[23]:
infer_column_names(df_raw, sensor_name=sensor_name)
[23]:
{0: [],
 1: [],
 2: [],
 3: [],
 4: [],
 5: [],
 6: ['rainfall_rate_32bit'],
 7: ['rainfall_accumulated_32bit', 'rainfall_accumulated_16bit'],
 8: ['weather_code_synop_4680', 'weather_code_synop_4677'],
 9: ['weather_code_synop_4680', 'weather_code_synop_4677'],
 10: [],
 11: ['mor_visibility'],
 12: ['sample_interval', 'number_particles', 'laser_amplitude'],
 13: ['sample_interval', 'number_particles', 'laser_amplitude'],
 14: ['error_code', 'sensor_temperature'],
 15: ['sensor_heating_current'],
 16: ['sensor_battery_voltage'],
 17: ['sensor_status'],
 18: ['rainfall_amount_absolute_32bit'],
 19: ['error_code', 'sensor_temperature'],
 20: ['raw_drop_concentration', 'raw_drop_average_velocity'],
 21: ['raw_drop_concentration', 'raw_drop_average_velocity'],
 22: ['raw_drop_number'],
 23: ['sensor_status']}

This can help us define the column_names list later.
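
One way to picture this kind of guessing is to compare the string format of the values (total number of characters and number of decimal digits) with a per-variable specification. The sketch below does exactly that under loud assumptions: the FORMAT_SPECS values are chosen to match the sample strings of this tutorial and are not read from raw_data_format.yml, and the matching rule is a simplification, not the algorithm used by infer_column_names().

```python
# Hypothetical format specifications: (total characters, decimal digits).
# Chosen to match the sample strings of this tutorial, NOT taken from
# raw_data_format.yml.
FORMAT_SPECS = {
    "rainfall_rate_32bit": (8, 3),
    "rainfall_accumulated_32bit": (7, 2),
    "rainfall_amount_absolute_32bit": (7, 3),
    "mor_visibility": (4, 0),
}


def describe_format(value):
    """Return (total characters, decimal digits) of a numeric string."""
    _, _, decimals = value.partition(".")
    return len(value), len(decimals)


def candidate_names(values):
    """Return the variables whose format matches every value of a column."""
    formats = {describe_format(value) for value in values}
    return [name for name, spec in FORMAT_SPECS.items() if formats == {spec}]


print(candidate_names(["0000.000", "0000.000"]))
print(candidate_names(["0056.49", "0056.52"]))
print(candidate_names(["005.649", "005.671"]))
```

Several variables can share the same format, which is why the real function may return more than one candidate per column and why the final choice still requires the sensor documentation.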

As a reference, here is the list of valid column names (taken from l0a_encodings.yml):

[24]:
print_valid_l0_column_names(sensor_name)
['rainfall_rate_32bit', 'rainfall_accumulated_32bit', 'weather_code_synop_4680', 'weather_code_synop_4677', 'weather_code_metar_4678', 'weather_code_nws', 'reflectivity_32bit', 'mor_visibility', 'sample_interval', 'laser_amplitude', 'number_particles', 'sensor_temperature', 'sensor_serial_number', 'firmware_iop', 'firmware_dsp', 'sensor_heating_current', 'sensor_battery_voltage', 'sensor_status', 'start_time', 'sensor_time', 'sensor_date', 'station_name', 'station_number', 'rainfall_amount_absolute_32bit', 'error_code', 'rainfall_rate_16bit', 'rainfall_rate_12bit', 'rainfall_accumulated_16bit', 'reflectivity_16bit', 'raw_drop_concentration', 'raw_drop_average_velocity', 'raw_drop_number']

It’s now time to define our column names:

Hints to define the names:

  • get information from the disdrometer user guide and the data logger employed

  • use infer_column_names() to help you

  • analyse the content column by column with print_df_columns_unique_values()

[25]:
column_names = [
    "unknown1",
    "unknown2",
    "unknown3",
    "timestep",
    "unknown4",
    "unknown5",
    "rainfall_rate_32bit",
    "rainfall_accumulated_32bit",
    "weather_code_synop_4680",
    "weather_code_synop_4677",
    "reflectivity_32bit",
    "mor_visibility",
    "laser_amplitude",
    "number_particles",
    "sensor_temperature",
    "sensor_heating_current",
    "sensor_battery_voltage",
    "sensor_status",
    "rainfall_amount_absolute_32bit",
    "error_code",
    "raw_drop_concentration",
    "raw_drop_average_velocity",
    "raw_drop_number",
    "unknown6",
]

🚨 The column_names list will be transferred to the reader function at the end of this notebook.

Check the validity of your definition

[26]:
check_column_names(column_names, sensor_name)
The following columns do no met the DISDRODB standards: ['unknown4', 'unknown2', 'unknown3', 'unknown5', 'unknown1', 'unknown6', 'timestep'].
Please remove such columns within the df_sanitizer_fun
Please be sure to create the 'time' column within the df_sanitizer_fun.
The 'time' column must be datetime with resolution in seconds (dtype='M8[s]').

Ok, fair enough. There are columns that need to be removed, and we need to also define a column "time" with dtype datetime to meet the DISDRODB standards.

These points will be addressed in Section 10 of this notebook !
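
The logic behind this message can be pictured with a few lines of plain Python. The sketch below compares a list of column names with a set of valid names and reports the ones to drop, plus whether a "time" column is still missing. VALID_NAMES is only a subset of the names printed earlier in this section, the helper names are hypothetical, and treating "time" as always allowed is an assumption based on the message above; this is not the code of check_column_names.

```python
# Subset of the valid L0 column names listed earlier (illustrative only)
VALID_NAMES = {
    "rainfall_rate_32bit",
    "rainfall_accumulated_32bit",
    "weather_code_synop_4680",
    "mor_visibility",
    "raw_drop_number",
}


def find_invalid_columns(column_names, valid_names):
    """Return the names that are neither valid nor the required 'time' column."""
    allowed = set(valid_names) | {"time"}
    return [name for name in column_names if name not in allowed]


def is_time_missing(column_names):
    """Return True if the required 'time' column has not been defined yet."""
    return "time" not in column_names


demo_names = ["unknown1", "timestep", "rainfall_rate_32bit", "mor_visibility"]
print("Columns to remove:", find_invalid_columns(demo_names, VALID_NAMES))
print("Missing 'time' column:", is_time_missing(demo_names))
```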

8. Read the dataframe with the correct column names

We can now create a new dataframe with the column names:

[27]:
df = read_raw_file(filepath=filepath, column_names=column_names, reader_kwargs=reader_kwargs)

And print the dataframe column names :

[28]:
print_df_column_names(df)
 - Column 0 : unknown1
 - Column 1 : unknown2
 - Column 2 : unknown3
 - Column 3 : timestep
 - Column 4 : unknown4
 - Column 5 : unknown5
 - Column 6 : rainfall_rate_32bit
 - Column 7 : rainfall_accumulated_32bit
 - Column 8 : weather_code_synop_4680
 - Column 9 : weather_code_synop_4677
 - Column 10 : reflectivity_32bit
 - Column 11 : mor_visibility
 - Column 12 : laser_amplitude
 - Column 13 : number_particles
 - Column 14 : sensor_temperature
 - Column 15 : sensor_heating_current
 - Column 16 : sensor_battery_voltage
 - Column 17 : sensor_status
 - Column 18 : rainfall_amount_absolute_32bit
 - Column 19 : error_code
 - Column 20 : raw_drop_concentration
 - Column 21 : raw_drop_average_velocity
 - Column 22 : raw_drop_number
 - Column 23 : unknown6

9. Perform further tests and analysis to check the correctness of column_names

You can for example check some statistics for a specific column.

[29]:
column_name = "rainfall_rate_32bit"
array_of_values = df.loc[:, [column_name]].astype("float")
print_df_summary_stats(array_of_values)
 - Column 0 ( rainfall_rate_32bit ):

mean  0.005426
min   0.000000
25%   0.000000
50%   0.000000
75%   0.000000
max   2.881000
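
For readers who want to reproduce this kind of summary outside the DISDRODB helpers, a minimal pandas sketch follows. The sample values are made up and written as zero-padded strings, similar to the raw values; it is not meant to mirror what print_df_summary_stats does internally.

```python
import pandas as pd

# Zero-padded strings, in the style of the raw values (made-up sample).
raw_values = pd.Series(["0000.000", "0000.000", "0002.881", "0000.120"])

# Cast to float before computing statistics: strings cannot be averaged.
values = raw_values.astype("float")
stats = values.describe()

print(stats[["mean", "min", "50%", "max"]])
```

If a column that you labelled as a rain rate shows implausible statistics (for example, only large integers), the column name is probably assigned to the wrong position.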

10. Final column formatting

[30]:
check_l0a_column_names(df, sensor_name=sensor_name)
The following columns do no met the DISDRODB standards: ['unknown4', 'unknown2', 'unknown3', 'unknown5', 'unknown1', 'unknown6', 'timestep']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/ghiggi/Python_Packages/disdrodb/tutorials/reader_preparation.ipynb Cell 62 line 1
----> 1 check_l0a_column_names(df, sensor_name=sensor_name)

File ~/Python_Packages/disdrodb/disdrodb/l0/check_standards.py:154, in check_l0a_column_names(df, sensor_name)
    152     msg = f"The following columns do no met the DISDRODB standards: {invalid_columns}"
    153     logger.error(msg)
--> 154     raise ValueError(msg)
    155 # --------------------------------------------
    156 # Check time column is present
    157 if "time" not in df_columns:

ValueError: The following columns do no met the DISDRODB standards: ['unknown4', 'unknown2', 'unknown3', 'unknown5', 'unknown1', 'unknown6', 'timestep']
[31]:
check_column_names(column_names, sensor_name)
The following columns do no met the DISDRODB standards: ['unknown4', 'unknown2', 'unknown3', 'unknown5', 'unknown1', 'unknown6', 'timestep'].
Please remove such columns within the df_sanitizer_fun
Please be sure to create the 'time' column within the df_sanitizer_fun.
The 'time' column must be datetime with resolution in seconds (dtype='M8[s]').

Now it’s time to remove all the columns that do not match the DISDRODB standards.

[32]:
df = df.drop(columns=["unknown1", "unknown2", "unknown3", "unknown4", "unknown5", "unknown6"])

It’s also time to define the time column, which is required by the DISDRODB standards.

[33]:
df["time"] = pd.to_datetime(df["timestep"], format="%m-%d-%Y %H:%M:%S")
df = df.drop(columns=["timestep"])
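
The format string deserves a second look whenever day and month are both 12 or lower, because both orders then parse without raising an error. The minimal sketch below uses a made-up timestamp and only the standard library to show the two readings. Comparing the parsed dates with the dates in the raw file names is one way to decide which order your logger uses.

```python
from datetime import datetime

timestamp = "01-08-2018 12:44:30"  # made-up value with ambiguous day/month order

# Month-first interpretation: 8 January 2018.
month_first = datetime.strptime(timestamp, "%m-%d-%Y %H:%M:%S")

# Day-first interpretation: 1 August 2018.
day_first = datetime.strptime(timestamp, "%d-%m-%Y %H:%M:%S")

print(month_first.date())
print(day_first.date())
```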

Now let’s check that the column names, after custom processing, conform to the DISDRODB standards:

[34]:
check_l0a_column_names(df, sensor_name=sensor_name)

Finally, check if the dataframe looks as desired:

[35]:
print_df_column_names(df)
 - Column 0 : rainfall_rate_32bit
 - Column 1 : rainfall_accumulated_32bit
 - Column 2 : weather_code_synop_4680
 - Column 3 : weather_code_synop_4677
 - Column 4 : reflectivity_32bit
 - Column 5 : mor_visibility
 - Column 6 : laser_amplitude
 - Column 7 : number_particles
 - Column 8 : sensor_temperature
 - Column 9 : sensor_heating_current
 - Column 10 : sensor_battery_voltage
 - Column 11 : sensor_status
 - Column 12 : rainfall_amount_absolute_32bit
 - Column 13 : error_code
 - Column 14 : raw_drop_concentration
 - Column 15 : raw_drop_average_velocity
 - Column 16 : raw_drop_number
 - Column 17 : time
[36]:
print_df_random_n_rows(df, n=5)
 - Column 0 ( rainfall_rate_32bit ):
      ['0000.000' '0000.000' '0000.000' '0000.000' '0000.000']
 - Column 1 ( rainfall_accumulated_32bit ):
      ['0056.52' '0056.67' '0056.71' '0056.71' '0056.71']
 - Column 2 ( weather_code_synop_4680 ):
      ['00' '00' '00' '00' '00']
 - Column 3 ( weather_code_synop_4677 ):
      ['00' '00' '00' '00' '00']
 - Column 4 ( reflectivity_32bit ):
      ['-9.999' '-9.999' '-9.999' '-9.999' '-9.999']
 - Column 5 ( mor_visibility ):
      ['9999' '9999' '9999' '9999' '9999']
 - Column 6 ( laser_amplitude ):
      ['12529' '12595' '11388' '12456' '12248']
 - Column 7 ( number_particles ):
      ['00000' '00000' '00000' '00000' '00000']
 - Column 8 ( sensor_temperature ):
      ['023' '024' '015' '020' '017']
 - Column 9 ( sensor_heating_current ):
      ['0.05' '0.06' '0.06' '0.05' '0.06']
 - Column 10 ( sensor_battery_voltage ):
      ['24.9' '24.9' '24.9' '24.9' '24.9']
 - Column 11 ( sensor_status ):
      ['0' '0' '0' '0' '0']
 - Column 12 ( rainfall_amount_absolute_32bit ):
      ['005.652' '005.667' '005.671' '005.671' '005.671']
 - Column 13 ( error_code ):
      ['000' '000' '000' '000' '000']
 - Column 14 ( raw_drop_concentration ):
      ['-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,'
 '-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,'
 '-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,'
 '-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,'
 '-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,']
 - Column 15 ( raw_drop_average_velocity ):
      ['00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,'
 '00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,'
 '00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,'
 '00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,'
 '00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,']
 - Column 16 ( raw_drop_number ):
      ['000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,'




 - Column 17 ( time ):
      ['2018-01-08T15:26:31.000000000' '2018-02-08T06:21:30.000000000'
 '2018-03-08T02:27:30.000000000' '2018-02-08T17:41:01.000000000'
 '2018-02-08T21:08:01.000000000']
[37]:
print_df_columns_unique_values(df, column_indices=2, print_column_names=True)
 - Column 2 ( weather_code_synop_4680 ):
      ['00', '57', '61', '62', '71', '72', '88']

11. Define the dataframe sanitizer function

The df_sanitizer_fun encapsulates the code, specific to each reader/dataset, that is required to obtain a dataframe compliant with the DISDRODB standards.

With the data used in this notebook, we need to drop some columns and define the time column.

From the code defined in Section 10, we define the following function:

[38]:
def df_sanitizer_fun(df):
    # Import pandas
    import pandas as pd

    # - Drop invalid columns
    columns_to_drop = [
        "unknown1",
        "unknown2",
        "unknown3",
        "unknown4",
        "unknown5",
        "unknown6",
    ]

    df = df.drop(columns=columns_to_drop)

    # - Convert timestep column to datetime format
    df["time"] = pd.to_datetime(df["timestep"], format="%m-%d-%Y %H:%M:%S")
    df = df.drop(columns=["timestep"])

    # - Return the dataframe
    return df

🚨 The df_sanitizer_fun() function will be transferred to the reader function at the end of this notebook.
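
Before moving on, it can be reassuring to run the sanitizer on a tiny hand-made dataframe. The sketch below repeats the function so that it is self-contained; the two rows of toy data are made up and only mimic the column layout of the raw file.

```python
import pandas as pd


def df_sanitizer_fun(df):
    # Same logic as the function above, repeated to keep the sketch self-contained.
    columns_to_drop = ["unknown1", "unknown2", "unknown3", "unknown4", "unknown5", "unknown6"]
    df = df.drop(columns=columns_to_drop)
    df["time"] = pd.to_datetime(df["timestep"], format="%m-%d-%Y %H:%M:%S")
    df = df.drop(columns=["timestep"])
    return df


# Made-up two-row dataframe with the same kind of columns as the raw data.
toy_data = {f"unknown{i}": ["x", "y"] for i in range(1, 7)}
toy_data["timestep"] = ["01-08-2018 12:44:30", "01-08-2018 12:45:01"]
toy_data["rainfall_rate_32bit"] = ["0000.000", "0000.000"]
toy = pd.DataFrame(toy_data)

out = df_sanitizer_fun(toy)
print(list(out.columns))
print(out["time"].dtype)
```

The output should contain only standard column names plus a datetime time column, which is what the column-name check expects.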

12. Now let’s try calling the reader function as it will be called in the DISDRODB L0 reader

  • You may try with an increasing number of files (by updating subset_filepaths)

Here we combine all raw files in a single dataframe.

The function read_raw_files takes the following arguments:

  • filepaths: the list of files present in the specified station directory

  • column_names: the list of column names (defined previously)

  • reader_kwargs: the dictionary of arguments controlling how the data are loaded into the dataframe (defined previously)

  • sensor_name: taken from the sensor_name key in the metadata YAML file of the station

  • df_sanitizer_fun: the function to sanitize the dataframe (defined previously)

All these arguments are defined either in the data directory structure, or earlier in the code.

[39]:
subset_filepaths = filepaths[:1]

df = read_raw_files(
    filepaths=subset_filepaths,
    column_names=column_names,
    reader_kwargs=reader_kwargs,
    sensor_name=sensor_name,
    verbose=verbose,
    df_sanitizer_fun=df_sanitizer_fun,
)
display(df)
 - 1 / 1 processed successfully. File name: /home/ghiggi/Python_Packages/disdrodb/data/DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME/data/station_name_1/file60_20180817.dat.gz
 -  - 0 of 1 have been skipped.
rainfall_rate_32bit rainfall_accumulated_32bit weather_code_synop_4680 weather_code_synop_4677 reflectivity_32bit mor_visibility laser_amplitude number_particles sensor_temperature sensor_heating_current sensor_battery_voltage sensor_status rainfall_amount_absolute_32bit error_code raw_drop_concentration raw_drop_average_velocity raw_drop_number time
0 0.0 56.490002 0.0 0.0 -9.999 9999.0 12611.0 0.0 35.0 0.06 24.9 0.0 5.649 0.0 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 2018-01-08 12:44:30
1 0.0 56.490002 0.0 0.0 -9.999 9999.0 12617.0 0.0 35.0 0.06 24.9 0.0 5.649 0.0 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 2018-01-08 12:45:01
2 0.0 56.490002 0.0 0.0 -9.999 9999.0 12600.0 0.0 35.0 0.06 24.9 0.0 5.649 0.0 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 2018-01-08 12:45:30
3 0.0 56.490002 0.0 0.0 -9.999 9999.0 12603.0 0.0 35.0 0.05 24.9 0.0 5.649 0.0 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 2018-01-08 12:46:01
4 0.0 56.490002 0.0 0.0 -9.999 9999.0 12606.0 0.0 34.0 0.06 24.9 0.0 5.649 0.0 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 2018-01-08 12:46:31
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4736 0.0 56.709999 0.0 0.0 -9.999 9999.0 11059.0 0.0 15.0 0.06 24.9 0.0 5.671 0.0 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 2018-03-08 04:13:25
4737 0.0 56.709999 0.0 0.0 -9.999 9999.0 11175.0 0.0 15.0 0.06 24.9 0.0 5.671 0.0 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 2018-03-08 04:13:56
4738 0.0 56.709999 0.0 0.0 -9.999 9999.0 11275.0 0.0 15.0 0.06 24.9 0.0 5.671 0.0 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 2018-03-08 04:14:26
4739 0.0 56.709999 0.0 0.0 -9.999 9999.0 11361.0 0.0 15.0 0.06 24.9 0.0 5.671 0.0 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 2018-03-08 04:14:55
4740 0.0 56.709999 0.0 0.0 -9.999 9999.0 11492.0 0.0 15.0 0.07 24.9 0.0 5.671 0.0 -9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9... 00.000,00.000,00.000,00.000,00.000,00.000,00.0... 000,000,000,000,000,000,000,000,000,000,000,00... 2018-03-08 04:15:25

4741 rows × 18 columns
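
Conceptually, the multi-file read above amounts to reading each file, sanitizing it, and concatenating the results. The sketch below is a simplified stand-in written with plain pandas, not the DISDRODB implementation of read_raw_files: the helper names read_files_sketch and toy_sanitizer, the file names, and the file content are all hypothetical.

```python
import gzip
import os
import tempfile

import pandas as pd


def read_files_sketch(filepaths, column_names, reader_kwargs, df_sanitizer_fun):
    """Read each file, sanitize it, then concatenate (simplified stand-in)."""
    list_df = []
    for filepath in filepaths:
        # Read every field as a string, mirroring the string values shown earlier.
        df = pd.read_csv(filepath, names=column_names, dtype=str, **reader_kwargs)
        list_df.append(df_sanitizer_fun(df))
    return pd.concat(list_df, ignore_index=True)


def toy_sanitizer(df):
    # Drop the unknown column and build the 'time' column from 'timestep'.
    df = df.drop(columns=["unknown1"])
    df["time"] = pd.to_datetime(df["timestep"], format="%m-%d-%Y %H:%M:%S")
    return df.drop(columns=["timestep"])


# Write two small gzip-compressed files to a temporary directory.
tmp_dir = tempfile.mkdtemp()
toy_filepaths = []
for name, day in [("a.dat.gz", "01"), ("b.dat.gz", "02")]:
    path = os.path.join(tmp_dir, name)
    with gzip.open(path, "wt") as f:
        f.write(f"x,01-{day}-2018 00:00:00,0000.000\n")
        f.write(f"x,01-{day}-2018 00:00:30,0000.100\n")
    toy_filepaths.append(path)

toy_kwargs = {"delimiter": ",", "header": None, "compression": "infer"}
toy_columns = ["unknown1", "timestep", "rainfall_rate_32bit"]
df_all = read_files_sketch(toy_filepaths, toy_columns, toy_kwargs, toy_sanitizer)
print(df_all.shape)
```

Starting with a single file, as done above with subset_filepaths, keeps failures easy to trace before the whole station directory is processed.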

Here we derive the corresponding xr.Dataset object:

[40]:
ds = create_l0b_from_l0a(df, attrs, verbose=False)
print(ds)
<xarray.Dataset>
Dimensions:                         (time: 4741, diameter_bin_center: 32,
                                     velocity_bin_center: 32, crs: 1)
Coordinates: (12/13)
  * diameter_bin_center             (diameter_bin_center) float64 0.062 ... 24.5
    diameter_bin_lower              (diameter_bin_center) float64 0.0 ... 23.0
    diameter_bin_upper              (diameter_bin_center) float64 0.1245 ... ...
    diameter_bin_width              (diameter_bin_center) float64 0.125 ... 3.0
  * velocity_bin_center             (velocity_bin_center) float64 0.05 ... 20.8
    velocity_bin_lower              (velocity_bin_center) float64 0.0 ... 19.2
    ...                              ...
    velocity_bin_width              (velocity_bin_center) float64 0.1 ... 3.2
  * time                            (time) datetime64[ns] 2018-01-08T12:44:30...
    latitude                        float64 46.2
    longitude                       float64 8.792
    altitude                        int64 1671
  * crs                             (crs) <U5 'WGS84'
Data variables: (12/17)
    raw_drop_concentration          (time, diameter_bin_center) float64 0.0 ....
    raw_drop_average_velocity       (time, velocity_bin_center) float64 0.0 ....
    raw_drop_number                 (time, diameter_bin_center, velocity_bin_center) float64 ...
    rainfall_rate_32bit             (time) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
    rainfall_accumulated_32bit      (time) float32 56.49 56.49 ... 56.71 56.71
    weather_code_synop_4680         (time) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
    ...                              ...
    sensor_temperature              (time) float32 35.0 35.0 35.0 ... 15.0 15.0
    sensor_heating_current          (time) float32 0.06 0.06 0.06 ... 0.06 0.07
    sensor_battery_voltage          (time) float32 24.9 24.9 24.9 ... 24.9 24.9
    sensor_status                   (time) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
    rainfall_amount_absolute_32bit  (time) float32 5.649 5.649 ... 5.671 5.671
    error_code                      (time) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
Attributes: (12/61)
    data_source:                     DATA_SOURCE
    campaign_name:                   CAMPAIGN_NAME
    station_name:                    station_name_1
    sensor_name:                     OTT_Parsivel
    reader:                          EPFL/LOCARNO_2018
    raw_data_format:                 raw
    ...                              ...
    time_coverage_start:             2018-01-08T12:44:30.000000000
    time_coverage_end:               2018-03-08T04:15:25.000000000
    disdrodb_processing_date:        2023-12-01 13:36:52
    disdrodb_product_version:        V0
    disdrodb_software_version:       V0.0.18.dev57+g8911365.d20231103
    disdrodb_product:                L0B
/home/ghiggi/Python_Packages/disdrodb/disdrodb/l0/l0b_processing.py:475: UserWarning: Converting non-nanosecond precision datetime values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values. This warning is caused by passing non-nanosecond np.datetime64 or np.timedelta64 values to the DataArray or Variable constructor; it can be silenced by converting the values to nanosecond precision ahead of time.
  ds = xr.Dataset(

The dataset can be saved as a DISDRODB L0B netCDF file by running the following code:

[41]:
# ds = set_encodings(ds, sensor_name)
# ds.to_netcdf("/path/where/to/save/the/file.nc")

Step 2: Create the reader

Now we have all the parameters required to define a DISDRODB reader. All the DISDRODB reader parameters that we defined in this notebook must be transcribed into the reader function you are developing:

  1. Update the glob_patterns string

    Before :

    glob_patterns = "*"
    

    After :

    glob_patterns = "*.dat*"
    
  2. Update the column_names list

    Before :

    column_names = []
    

    After :

    column_names = [
        "unknown1",
        "unknown2",
        "unknown3",
        "timestep",
        "unknown4",
        "unknown5",
        "rainfall_rate_32bit",
        "rainfall_accumulated_32bit",
        "weather_code_synop_4680",
        "weather_code_synop_4677",
        "reflectivity_32bit",
        "mor_visibility",
        "laser_amplitude",
        "number_particles",
        "sensor_temperature",
        "sensor_heating_current",
        "sensor_battery_voltage",
        "sensor_status",
        "rainfall_amount_absolute_32bit",
        "error_code",
        "raw_drop_concentration",
        "raw_drop_average_velocity",
        "raw_drop_number",
        "unknown6",
    ]
    
  3. Update the reader_kwargs dictionary

    Before :

    reader_kwargs = {}
    

    After :

    reader_kwargs = {}

    # - Define delimiter
    reader_kwargs["delimiter"] = ","

    # - Avoid first column to become df index !!!
    reader_kwargs["index_col"] = False

    # - Since column names are expected to be passed explicitly, header is set to None
    reader_kwargs["header"] = None

    # - Number of rows to be skipped at the beginning of the file
    reader_kwargs["skiprows"] = None

    # - Define behaviour when encountering bad lines
    reader_kwargs["on_bad_lines"] = "skip"

    # - Define reader engine
    #   - C engine is faster
    #   - Python engine is more feature-complete
    reader_kwargs["engine"] = "python"

    # - Define on-the-fly decompression of on-disk data
    #   - Available: gzip, bz2, zip
    reader_kwargs["compression"] = "infer"

    # - Strings to recognize as NA/NaN and replace with standard NA flags
    #   - Already included: '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN',
    #                       '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A',
    #                       'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'
    reader_kwargs["na_values"] = ["na", "", "error"]
    
  4. Update the df_sanitizer_fun() function

    Before:

    def df_sanitizer_fun(df):
        # - Import dask or pandas
        import pandas as pd
    
        # - Add here below the reader required custom code
        pass
    
        # - Return the dataframe
        return df
    

    After :

    def df_sanitizer_fun(df):
        # Import pandas
        import pandas as pd
    
        # - Drop invalid columns
        columns_to_drop = ["unknown1", "unknown2", "unknown3", "unknown4", "unknown5", "unknown6"]
        df = df.drop(columns=columns_to_drop)
    
        # - Convert timestep column to datetime format
        df["time"] = pd.to_datetime(df["timestep"], format="%m-%d-%Y %H:%M:%S")
        df = df.drop(columns=["timestep"])
    
        # - Return the dataframe
        return df
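
To see which files a glob pattern such as the one in the first item would select, the standard-library fnmatch module can be applied to a list of names. This is only an approximation of how a reader might search the station directory; the file names are taken from the toy archive used in this tutorial.

```python
from fnmatch import fnmatch

# File names taken from the toy archive used in this tutorial.
filenames = ["file60_20180817.dat.gz", "file60_20180818.dat.gz", "station_name_1.yml"]

# Keep only the names that match the pattern.
pattern = "*.dat*"
matched = [name for name in filenames if fnmatch(name, pattern)]
print(matched)
```

A pattern that is too permissive (for example "*") would also pick up files that are not raw data, so it is worth checking the selection before running the reader.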
    

You have reached the end of the tutorial. Well done 👋👋👋

At this point, you should be able to create a new reader for your own data. When you think your reader is ready, you can test it by following the Test the DISDRODB L0 processing section of the How to Contribute New Data to DISDRODB guidelines.

Do not hesitate to open a GitHub Issue if you need any clarification.

The DISDRODB team hopes you enjoyed this tutorial.