Data Processing Documentation


Introduction

Since the initial creation of the legacy documentation, there have been significant changes to the VCS data processing workflow due to the introduction of the MWAX correlator and the new beamforming software, VCSBeam. The purpose of this documentation is to provide an updated description of the processing workflow. Please direct any questions or suggestions to @Christopher Lee or @Bradley Meyers.

This documentation deals with data collected with the MWA's Voltage Capture System (VCS), which is described in Tremblay et al. (2015). The MWA tied-array processing paper series provides details about calibration, beamforming, and signal processing – see Ord et al. (2015), Xue et al. (2019), McSweeney et al. (2020), and Swainston et al. (2022). Additionally, the VCSBeam documentation provides some theory and examples regarding calibration and beamforming.

While not strictly necessary, a basic understanding of radio interferometry will be helpful when reading this documentation, so this course may be a useful resource. Additionally, the PRESTO search tutorial provides a good introduction to the software tools available in PRESTO.



Software Usage on Setonix

VCS data processing is currently performed primarily on Pawsey's Setonix supercomputer. We refer you to the Setonix User Guide for the complete details on the software stack and how to run jobs. Relevant details are provided here.

Modules

Members of the mwavcs Pawsey group will have access to the common software used to calibrate and reduce VCS data. These can be browsed with:

module avail

To load a module, the user must specify the module name and version like so:

module load <module>/<version>
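For example, to load the version of Birli used later in this documentation (module versions on Setonix change over time, so check module avail for the current list):

module load birli/0.15.1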

Using Common Pulsar Software

On Setonix, the common pulsar software packages are provided as Singularity images (`.sif` files), which can be used interactively or to execute routines on the command line. There are two ways in which these software containers can be used: directly (as was previously done on Garrawarla), or via the module system - we'll describe both below.

There are three containers, all of which are located in `/software/projects/mwavcs/setonix/2024.05/containers/`. Specifically:

  • psr-timing: contains the standard pulsar timing software packages, tempo, tempo2 and PINT (along with clock files that were current when the image was produced).

  • psr-search: contains the most commonly used searching software packages, PRESTO and riptide

  • psr-analysis: contains the most commonly used analysis tools and utilities, PSRCHIVE, DSPSR, PSRSALSA, clfd and PulsePortraiture

If there are missing packages in any of the above containers, please let someone know via Slack or email so that we can re-build and include what you need. (Although be prepared to answer any compilation/installation questions that may arise!)

You can see the Dockerfiles used here: https://github.com/CIRA-Pulsars-and-Transients-Group/pulsar-software-containers/tree/main

Directly using the containers

First, we need to load the Singularity module,

Loading the Singularity module
module use /software/projects/mwavcs/setonix/2024.05/modules/zen3/gcc/12.2.0
module load singularity/4.1.0-mwavcs

You can then execute a command via the container of your choice, like

Execute a command via a Singularity container
singularity exec /software/projects/mwavcs/setonix/2024.05/containers/<container>.sif <command>

If you need an interactive X-Window session to appear, you will need to add -B ~/.Xauthority after the exec command and before the path to the container.
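For example, an interactive session inside one of the containers with X-forwarding enabled might look like this (a sketch only; the container name is one of those listed above):

Interactive container session with X11 forwarding
singularity exec -B ~/.Xauthority /software/projects/mwavcs/setonix/2024.05/containers/psr-analysis.sif bash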

Using the containers via modules

We can also make the singularity containers "look" like modules, with all the software available on your regular command line (rather than needing to execute a Singularity pre-command). Nominally, if you follow the below steps, you should be able to see the three containers installed as system modules for our group.

Using Singularity containers as modules
module use /software/projects/mwavcs/setonix/2024.05/modules/zen3/gcc/12.2.0
module av psr-

which should then show the three containers (psr-analysis, psr-search and psr-timing) listed as available modules.

These modules can then be loaded as per usual. For example, if you load the psr-analysis/24-11-07 module, you will have access to the various post-processing packages from the command line, e.g. as sketched below.
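A minimal check, assuming the PSRCHIVE and DSPSR binaries listed earlier are exposed on your PATH by the container module:

module use /software/projects/mwavcs/setonix/2024.05/modules/zen3/gcc/12.2.0
module load psr-analysis/24-11-07
which psrstat dspsr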

These modules can be used as normal within job submissions via SLURM as well.

Using Nextflow

Nextflow is workflow management software, described here. It is extremely useful for constructing data processing pipelines, including our own mwa_search pipeline used for the SMART survey.

On Setonix

Nextflow can be loaded with:

module load nextflow

Manual Installation

Nextflow requires Java to run. Installation instructions can be found here.

Following Nextflow's installation docs, start by downloading Nextflow with curl:

curl -s https://get.nextflow.io | bash

Then make Nextflow executable with chmod:

chmod +x nextflow

Move nextflow into a directory in your PATH in order to execute it anywhere.

Run nextflow self-update to upgrade to the latest version. This will update the Nextflow executable in-place. To see the available subcommands, run nextflow help.
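For example, assuming ~/bin is already a directory on your PATH (a hypothetical location; adjust as needed):

mkdir -p ~/bin
mv nextflow ~/bin/
nextflow -version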

Typical Usage

For local pipelines, the basic syntax is:

nextflow run pipeline.nf [PIPELINE OPTIONS]

where pipeline.nf contains an entry workflow (i.e. a workflow with no name).

For pipelines hosted on GitHub and configured for pipeline sharing, you can simply run the pipeline with:

nextflow run <username>/<repository> [-r <branch>] [PIPELINE OPTIONS]

where -r <branch> allows you to optionally specify a Git branch to use.

If the pipeline receives updates on GitHub, you can update your locally cached version of the pipeline with:

nextflow pull [-r <branch>] <username>/<repository>

If the pipeline name is unique among your locally cached pipelines, you can omit the GitHub username on execution.

As a simple example, a typical workflow could look like:

nextflow pull cplee1/vcs_recombine
nextflow run vcs_recombine [PIPELINE OPTIONS]

In general, you should execute Nextflow within a screen session so that it can run separately from your terminal session. On Setonix, pipelines must be run on a dedicated workflow login node. A typical workflow would be:

# Create a session
screen -S pipeline

# Start the pipeline
nextflow run pipeline.nf

# Detach from the session with Ctrl+A then D

# View all of your running screen sessions
screen -ls

# Reattach to your session
screen -r pipeline

Once the pipeline has completed and you no longer need access to the intermediate files, you should delete the work directory to save disk space. Its location can differ between pipelines, but by default it is created in the directory from which the pipeline was launched.
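For example, if the pipeline was launched from the current directory, the default work directory can usually be removed with the following (double-check the path first, as this is irreversible):

rm -rf ./work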


 

Downloading Data

Using the ASVO Web Application

MWA observation data are accessed via the ASVO. The simplest way to download an observation is via the ASVO Web Application. To find an observation, navigate to the Observations page and use the Search menu filters. To browse all SMART observations, find the Project option in the Advanced Search menu and select project G0057.

For VCS-mode observations, you will need to specify the time offset from the start of the observation and the duration of time to download. For example, to download the first 600 seconds of an observation, set the offset to 0 and the duration to 600 in the Job menu.

For correlator-mode observations, such as calibrator observations, navigate to the Visibility Download Job tab and select the delivery location as /scratch.

Once the jobs are submitted, they can be monitored via the My Jobs page, which will display the ASVO Job ID and the status of the download. The downloaded data will appear in /scratch/mwavcs/asvo/<ASVO Job ID>.

Using Giant Squid

Submitting ASVO jobs can also be done from the command line using Giant Squid. To use Giant Squid, you must first set your ASVO API key as an environment variable. You can find your API key in your Profile page on the ASVO Web Application. Then add the following to your .bashrc:

export MWA_ASVO_API_KEY=<api key>

On Setonix, Giant Squid can be loaded with:

module load giant-squid/<version>

VCS download jobs can be submitted with the submit-volt subcommand:

giant-squid submit-volt --delivery scratch --offset <offset> --duration <duration> <obs ID>
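For example, to download the first 600 seconds of a VCS observation (the obs ID here is illustrative):

giant-squid submit-volt --delivery scratch --offset 0 --duration 600 1221399680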

Visibility observation download jobs (e.g., for calibration) can be submitted using the submit-vis subcommand:

giant-squid submit-vis --delivery scratch <obs ID>

To monitor the status of your ASVO jobs, you can use the list subcommand:

giant-squid list

Data Directory Structure

On Setonix, we store VCS data on the scratch partition in user directories. Typically the file structure is the following:

  • ${MYSCRATCH}/<obs ID>

    • /raw

      • Raw VCS data, legacy only (.dat files)

    •  /combined

      • Combined VCS data (.sub or .dat files)

    • /cal

      • <cal ID>

        • Visibilities (GPU Box or MWAX correlator .fits files)

        • <cal ID>.metafits

        • /hyperdrive

          • Calibration solution (.fits and .bin file)

          • Calibration solution diagnostic plots (phases and amplitudes)

      • ...

    • /pointings

      • <Pointing 1>

        • PSRFITS search-mode output (.fits), and/or

        • VDIF voltage beam output (.vdif and .hdr)

      • ...

    • <obs ID>.metafits

 

Unpacking recombined data

In some cases, legacy VCS observations have already been recombined and re-uploaded to the MWA archive. In that case, you will have downloaded one .tar file and one .ics file per second recorded, in addition to the usual metadata. An efficient way to extract the voltage channel files in parallel is to use the following script.

#!/bin/bash -l
#SBATCH --account=mwavcs
#SBATCH --partition=mwa
#SBATCH --job-name=untar-vcs-recombined
#SBATCH --output=%x-%j.out
#SBATCH --time=8:00:00
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=109Gb

# where the tarballs are located (and where the files will be extracted to)
download_dir=/scratch/mwavcs/asvo/833965
cd $download_dir

# collect the files list
filelist=( *_combined.tar )

# how many parallel jobs to run
n_proc=${SLURM_NTASKS}

# number of input files
n_input_files=${#filelist[@]}

# number of batches
n_batch=$(((n_input_files + n_proc - 1) / n_proc))

for i in $(seq 1 ${n_batch}); do
    echo "batch ${i}: extracting up to ${n_proc} tarballs"
    start_id=$((n_proc * (i - 1)))
    for archive in ${filelist[@]:start_id:n_proc}; do
        echo "tar -v --touch --extract -f ${archive} &"
        tar -v --touch --extract -f ${archive} &
    done
    wait
done
wait

cd -

The amount of memory requested is overkill, but it is close to the nominal amount allocated based on the number of CPUs requested. Providing this option explicitly usually makes the job easier to schedule in the queue.

Calibration

Finding a calibration observation

In order to form a coherent tied-array beam, the antenna delays and gains need to be calibrated. This is commonly done using a dedicated observation of a bright calibrator source such as Centaurus A or Hercules A. Calibrator observations are taken in correlator mode and stored as visibilities. To find a calibration observation, either search the MWA archive using the ASVO Web Application, or run the following command from VCSTools:

mwa_metadb_utils.py -c <obs ID>

This will produce a list of calibrator observations, ranked based on their distance in time to the VCS observation.

VCSTools is not currently installed on Setonix (as of January 2025).

Preprocessing with Birli

Although Hyperdrive accepts the raw visibilities, it is beneficial to downsample the time and frequency resolution to reduce the data size and improve the calibration quality. Preprocessing of raw visibilities can be done with Birli, which has a variety of options for flagging and averaging data (refer to the Birli help information). In general, 2 sec / 40 kHz resolution is a good starting point. Birli can be loaded on Setonix with:

module load birli/<version>

The following demonstrates the typical usage:

birli --metafits <METAFITS_FILE> --uvfits-out <UVFITS_FILE> --avg-freq-res <FREQ_RES> --avg-time-res <TIME_RES>

In general, it is also a good idea to specify the maximum allowed RAM for Birli to use with the --max-memory option, which accepts a number corresponding to how many GB of RAM to allocate. Otherwise, Birli will attempt to load the entire raw visibility data set, which often requires >100 GB of memory.

On Setonix, specifying the maximum allowed memory is particularly important, as jobs will be terminated prematurely if they try to access more memory than is allocated based on the number of CPUs in the resource request. There is approximately 1.79 GB of RAM per CPU core allocated on the work/mwa partitions.
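For example, on a 32-core allocation (roughly 57 GB of usable RAM at ~1.79 GB per core), one might conservatively cap Birli at 50 GB; the averaging values follow the 2 sec / 40 kHz recommendation above:

birli --metafits <METAFITS_FILE> --uvfits-out <UVFITS_FILE> --avg-freq-res 40 --avg-time-res 2 --max-memory 50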

Example Birli job on Setonix

This script would be submitted from the ${MYSCRATCH}/<obs ID>/cal/<cal ID>/hyperdrive directory.

#!/bin/bash -l
#SBATCH --account=mwavcs
#SBATCH --partition=mwa
#SBATCH --job-name=vcs-calibrate-birli
#SBATCH --output=%x-%j.out
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=57500M
#SBATCH --time=01:00:00

module load birli/0.15.1
birli -V

# OpenMP settings
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores
export OMP_PROC_BIND=close

# MWA primary beam file
export MWA_BEAM_FILE=/scratch/references/mwa/beam-models/mwa_full_embedded_element_pattern.h5

# Get the ObsID from the metafits filename
mfits=$(basename -- "$(ls ../*.metafits)")
obsid="${mfits%.*}"

fres=40   # desired freq. resolution in kHz for cal. UVFITS
tres=2    # desired time resolution in seconds for cal. UVFITS

# Make the downsampled uvfits data if it's not already there
if [[ ! -r ../${obsid}_birli.uvfits ]]; then
    shopt -s nullglob
    srun -N ${SLURM_JOB_NUM_NODES} -n ${SLURM_NTASKS} -c ${OMP_NUM_THREADS} -m block:block:block \
        birli \
        --max-memory $(bc -l <<< "0.9*${SLURM_MEM_PER_NODE}/1000") \
        -m ../*.metafits \
        -u ../${obsid}_birli.uvfits \
        --avg-time-res ${tres} \
        --avg-freq-res ${fres} \
        ../*gpubox*.fits ../*ch???*.fits
    shopt -u nullglob
else
    echo "../${obsid}_birli.uvfits already exists! Move it or delete it. Exiting here."
fi

Calibrating with Hyperdrive

Hyperdrive is the latest generation calibration software for the MWA. It is written in Rust and uses GPU acceleration, making it several times faster than previous MWA calibration software. See the documentation for further details. Hyperdrive can be loaded on Setonix with:

# For the CPU version:
module load hyperdrive/<version>

# For the GPU version (recommended):
module load hyperdrive-amd-gfx90a/<version>

We use Hyperdrive to perform direction-independent (DI) calibration using a sky model provided by the user. The more accurately the sky model reflects the input data, the better the convergence of the calibration solution. The sky model source list can be extracted from an all-sky catalogue such as GLEAM, or one can use a dedicated model for a well-characterised source such as Centaurus A. To compile a list of 1000 sources within the beam from the GLEAM Global Sky Model (GGSM) and save it as srclist_1000.yaml, use Hyperdrive's srclist-by-beam subcommand as shown below:

hyperdrive srclist-by-beam --metafits <METAFITS_FILE> --number 1000 /software/projects/mwavcs/GGSM_updated_2023Oct23.txt srclist_1000.yaml

Calibration is performed using Hyperdrive's di-calibrate subcommand, as shown below.

hyperdrive di-calibrate --source-list <SRC_LIST> --data <UVFITS_FILE> <METAFITS_FILE> --fine-chan-flags-per-coarse-chan <CHAN_FLAGS>

Note that for 40 kHz fine channels, we typically flag channels 0 1 16 30 31 to zero-weight the channels at the edges and centre of each coarse channel's bandpass response.

This will produce hyperdrive_solutions.fits. To inspect the solution, use Hyperdrive's solutions-plot subcommand to generate two PNG images (for the amplitudes and phases) as shown below.

hyperdrive solutions-plot --metafits <METAFITS_FILE> hyperdrive_solutions.fits

An example calibration solution is shown below. The legend is shown in the upper right of each figure. The last (non-flagged) tile is the reference tile, unless a reference tile is selected with --ref-tile. All other tiles' solutions are divided by the reference tile's solution before plotting. The gains of the X and Y polarisations of each antenna, gx and gy, are plotted along with the leakage terms Dx and Dy. In the amplitudes plot, the gains should be flat across the band with a value of around 1, and the leakage terms should be around 0. In the phases plot, the gains should exhibit a linear ramp with frequency, and the leakage terms should be randomly distributed. Tiles which deviate far from these behaviours should be flagged by providing a space-separated list to the --tile-flags option in di-calibrate. Also note that the plot range covers the entire data range by default, which is often dominated by bad tiles. To rescale the plots, you can either flag the bad tiles or use the --max-amp and --min-amp options in solutions-plot.
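For example, to rescale the solution plots and to re-run the calibration with two misbehaving tiles flagged (the tile names are placeholders; use the names shown in your metafits or in the plot legend):

hyperdrive solutions-plot --metafits <METAFITS_FILE> --min-amp 0 --max-amp 2 hyperdrive_solutions.fits
hyperdrive di-calibrate --source-list <SRC_LIST> --data <UVFITS_FILE> <METAFITS_FILE> --fine-chan-flags-per-coarse-chan 0 1 16 30 31 --tile-flags <Tile1> <Tile2>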

Some examples of less-than-ideal calibration solutions are provided here:

In this case, we have a solution that is almost okay, except for two aspects.

One is clearly visible if you look closely at Tiles 104-111 where the phase ramp in the first ~third of the band has periodic rippling structure (likely due to loose cabling). Unfortunately, the best course of action is to just flag those tiles outright, as it is not currently possible to flag ranges of fine/coarse channels on a per-tile basis. (One could in principle "manually" open the FITS file containing the solutions and set the corresponding bad partial tiles solutions to NaN.)

The second is that in many of the tile solution plots, we see a nice phase ramp with individual points above/below the ramp in a periodic fashion. This is indicative that the edge channels have not been correctly flagged. Since our data in this case are at 40 kHz resolution, we can fix this simply by adding --fine-chan-flags-per-coarse-chan 0 1 16 30 31  (flagging the edge 80 kHz and central 40 kHz), as mentioned above.

Once we are satisfied with the solution, the FITS solutions file must be converted to Offringa format in order for VCSBeam to use it. This can be done with Hyperdrive's solutions-convert subcommand as shown below.

hyperdrive solutions-convert --metafits <METAFITS_FILE> hyperdrive_solutions.fits hyperdrive_solutions.bin

Example Hyperdrive job on Setonix

The following script will run Hyperdrive as described above. The script must be executed from the ${MYSCRATCH}/<obs ID>/cal/<cal ID>/hyperdrive directory.

#!/bin/bash -l
#SBATCH --account=mwavcs-gpu
#SBATCH --partition=mwa-gpu
#SBATCH --job-name=vcs-calibrate-hyperdrive
#SBATCH --output=%x-%j.out
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# Load the GPU-accelerated version of Hyperdrive
# (You can use the CPU-only version by just loading hyperdrive/0.4.1 - but, why would you?)
module load hyperdrive-amd-gfx90a/0.4.1
hyperdrive -V

# OpenMP settings
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores
export OMP_PROC_BIND=close

# MWA primary beam file
export MWA_BEAM_FILE=/scratch/references/mwa/beam-models/mwa_full_embedded_element_pattern.h5

# Get the ObsID from the metafits filename
mfits=$(basename -- "$(ls ../*.metafits)")
obsid="${mfits%.*}"

# Create a list of 1000 sources from a catalogue
srclist=srclist_1000.yaml
hyperdrive srclist-by-beam \
    -n 1000 \
    -m ../*.metafits \
    /software/projects/mwavcs/GGSM_updated_2023Oct23.txt \
    $srclist

# Perform DI calibration
# (If necessary, flag tiles with --tile-flags <Tile1> <Tile2> ... <TileN>)
hyperdrive di-calibrate \
    --max-iterations 1000 \
    --source-list $srclist \
    --data ../${obsid}_birli.uvfits ../*.metafits \
    --fine-chan-flags-per-coarse-chan 0 1 16 30 31

# Plot the solutions
hyperdrive solutions-plot \
    --metafits ../*.metafits \
    hyperdrive_solutions.fits

# Convert to Offringa format for VCSBeam
hyperdrive solutions-convert \
    --metafits ../*.metafits \
    hyperdrive_solutions.fits \
    hyperdrive_solutions.bin

On a Setonix GPU node, requesting a GPU actually reserves an "allocation-pack" - see https://pawsey.atlassian.net/wiki/spaces/US/pages/51928618/Setonix+GPU+Partition+Quick+Start

This means that for each GPU requested we have 8 CPU cores and ~25 GB RAM. In the case of running Hyperdrive, this memory needs to be able to store the entire UVFITS output from Birli.

If the UVFITS is larger than ~25 GB then you will need to use 2 “allocation-packs” worth of resources. This can be achieved by changing --gres=gpu:1 to --gres=gpu:2 in the above example script.

Recombining Legacy VCS Data

If you are using Setonix, there is a Nextflow pipeline available to make recombining data simple. To use it, run:

nextflow run cplee1/vcs_recombine --obsid <OBSID> --download_dir <DOWNLOAD_DIR> [--vcs_dir <VCS_DIR>]

The pipeline will look in the directory specified with the --download_dir option and determine what data is available. The combined data will be placed in <VCS_DIR>/<OBSID>/combined. The metafits file (with the name <OBSID>_metafits_ppds.fits) should exist in either the download directory or the <VCS_DIR>/<OBSID> directory. The default for --vcs_dir is /scratch/mwavcs/$USER.

For example:

nextflow run cplee1/vcs_recombine --obsid 1261241272 --download_dir /scratch/mwavcs/asvo/123456

In this case, the pipeline would recombine the files in /scratch/mwavcs/asvo/123456 and place them in /scratch/mwavcs/$USER/1261241272/combined.

By default, the pipeline will submit one job per 32 seconds of data, so a full SMART observation will be split into 150 jobs. Each job should take ~10-20 min to complete.

The pipeline does not delete the raw data. Therefore, to save disk space, it is recommended that you manually delete the raw data as soon as the pipeline has successfully completed.

This workflow does NOT untar data files, so you should only run this once you’re confident the legacy data have been extracted into the --download_dir location.

Beamforming

Finding a target

Pulsars can be located in the MWA primary beam using the mwa-source-finder, a pip-installable Python package with a command line interface. Python environments on Setonix are managed at the user level, so this software should be installed into a Python virtual environment. For example:

module load python/3.11.6
module load hyperbeam/0.9.3
module load py-numpy/1.26.1
python -m venv /path/to/myenv
source /path/to/myenv/bin/activate
python -m pip install git+https://github.com/cplee1/mwa_source_finder.git

Example 1: Finding all pulsars in an observation

source-finder -o 1221399680

To be specific, the software defines "in the beam" as the zenith-normalised beam power towards the source exceeding a specified minimum (by default 0.3, i.e. 30%). By default, it evaluates the primary beam model at the centre frequency of the observation, but this can be changed with --freq_mode {low/centre/high/multi}.
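For example, to use a higher power threshold and evaluate the beam at the low edge of the band (the obs ID and option values are illustrative):

source-finder -o 1221399680 --min_power 0.5 --freq_mode low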

Example 2: Finding all observations for a pulsar/source

source-finder -s J2241-5236 --obs_for_source [--filter_available]

The --filter_available option will exclude observations with no data files available. The source(s) can be specified as pulsar B/J names, or RA/Dec coordinates in sexagesimal or decimal form. The following are all acceptable: "B1451-68" "J1456-6843" "14:55:59.92_-68:43:39.50" "223.999679_-68.727639".

Example 3: A more complex query to find pulsars in an observation

source-finder -o 1221399680 --start 0.1 --end 0.5 --min_power 0.5 --condition 'P0<0.03 && DM<100' --time_plot

The above example will search for pulsars matching the given PSRCAT query that exceed a zenith-normalised beam power of 0.5 between the fractional observation times 0.1 and 0.5. It will also make a plot of the source power traces over time.

Pulsar coordinates can be obtained from the ATNF Pulsar Catalogue web portal. The coordinates should be written in a text file (RA and Dec in sexagesimal format separated by a space, one set of coordinates per line) to give to VCSBeam.
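As a sketch, a pointings file could be created with a heredoc; the first line reuses the example coordinates above, and the placeholder line stands in for any additional pointings:

cat > pointings.txt << EOF
14:55:59.92 -68:43:39.50
<RA> <Dec>
EOF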

Tied-array beamforming with VCSBeam

VCSBeam is the successor to the legacy VCS tied-array beamformer. It is capable of processing both legacy and MWAX data into any of the following formats:

  • fine-channelised (10 kHz) full-Stokes time series in PSRFITS format (-p option)

  • fine-channelised (10 kHz) Stokes I time series in PSRFITS format (-N option)

  • coarse-channelised (1.28 MHz) complex voltage time series in VDIF format (-v option)

The coarse channelised output is made possible by the inverse PFB, which reverses the 10 kHz channelisation of the initial PFB stage. To run the tied-array beamformer, we use the make_mwa_tied_array_beam program, which is MPI-enabled to process multiple coarse channels in a single Slurm job. Further details about this command can be found here.

An example Slurm script is given below, where 24 channels are being processed across 24 nodes. In this case, the -p option is used to specify PSRFITS output.

beamform.sbatch
#!/bin/bash -l
#SBATCH --account=mwavcs-gpu
#SBATCH --partition=mwa-gpu
#SBATCH --job-name=beamform
#SBATCH --output=%x-%j.out
#SBATCH --ntasks=24
#SBATCH --gpus-per-task=1
#SBATCH --time=01:00:00

module load vcsbeam/v5.2
make_mwa_tied_array_beam -V

#===============================================================================
# Required inputs
#-------------------------------------------------------------------------------
# path to VCS metafits file
metafits=
# path to combined (.dat) or MWAX (.sub) data directory
datadir=
# path to calibration solution from hyperdrive (should be a .bin file)
calsol=
# path to the calibrator observation metafits
calmetafits=
# the starting GPS second of the observation
startgps=
# how many seconds to process
duration=
# lowest coarse channel index
lowchan=
# path to pointings file
pointings=
#===============================================================================

srun make_mwa_tied_array_beam \
    -m ${metafits} \
    -b ${startgps} \
    -T ${duration} \
    -f ${lowchan} \
    -d ${datadir} \
    -P ${pointings} \
    -c ${calmetafits} \
    -C ${calsol} \
    -p -R NONE -U 0,0 -O -X -s

For VDIF output, the beamforming can be split into multiple smaller jobs using a Slurm job array. This is done with the sbatch option --array=LOWCHAN-HIGHCHAN. The -f option in make_mwa_tied_array_beam should then be given $SLURM_ARRAY_TASK_ID.
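As a sketch, assuming the observation covers coarse channels 109-132 (check your metafits for the actual range), one could modify the script above to use -v instead of -p, pass -f ${SLURM_ARRAY_TASK_ID} instead of -f ${lowchan}, reduce the per-job resources accordingly (e.g. --ntasks=1), and then submit it as a job array:

sbatch --array=109-132 beamform.sbatch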
