Data Processing Documentation


Introduction

Since the initial creation of the legacy documentation, there have been significant changes to the VCS data processing workflow due to the introduction of the MWAX correlator and the new beamforming software, VCSBeam. The purpose of this documentation is to provide an updated description of the processing workflow. Please direct any questions or suggestions to @Christopher Lee or @Bradley Meyers.

This documentation deals with data collected with the MWA's Voltage Capture System (VCS), which is described in Tremblay et al. (2015). The MWA tied-array processing paper series provides details about calibration, beamforming, and signal processing – see Ord et al. (2015), Xue et al. (2019), McSweeney et al. (2020), and Swainston et al. (2022). Additionally, the VCSBeam documentation provides some theory and examples regarding calibration and beamforming.

While not necessary, it will be useful to have a basic understanding of radio interferometry when reading this documentation, so this course may be a useful resource. Additionally, the PRESTO search tutorial provides a good introduction to the software tools available in PRESTO.



Software Usage on Setonix

VCS data processing is currently performed primarily on Pawsey's Setonix supercomputer. We refer you to the Setonix User Guide for complete details on the software stack and how to run jobs; the most relevant details are summarised below.

Modules

Members of the mwavcs Pawsey group will have access to the common software used to calibrate and reduce VCS data. These can be browsed with:

module avail

To load a module, the user must specify the module name and version like so:

module load <module>/<version>

Using Common Pulsar Software

On Setonix, the common pulsar software packages are provided as Singularity images (`.sif` files), which can be used interactively or to execute routines on the command line. There are two ways in which these software containers can be used: directly (as was previously done on Garrawarla), or via the module system. We'll describe both below.

There are three containers, all of which are located in `/software/projects/mwavcs/setonix/2024.05/containers/`. Specifically:

  • psr-timing: contains the standard pulsar timing software packages, tempo, tempo2, and PINT (along with clock files that were up to date when the image was produced).

  • psr-search: contains the most commonly used searching software packages, PRESTO and riptide.

  • psr-analysis: contains the most commonly used analysis tools and utilities, PSRCHIVE, DSPSR, PSRSALSA, clfd, and PulsePortraiture.

If there are missing packages in any of the above containers, please let someone know via Slack or email so that we can re-build and include what you need. (Although be prepared to answer any compilation/installation questions that may arise!)

You can see the Dockerfiles used here: https://github.com/CIRA-Pulsars-and-Transients-Group/pulsar-software-containers/tree/main

Directly using the containers

First, we need to load the Singularity module:

Loading the Singularity module
module use /software/projects/mwavcs/setonix/2024.05/modules/zen3/gcc/12.2.0
module load singularity/4.1.0-mwavcs

You can then execute a command via the container of your choice, like so:

Execute a command via a Singularity container
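For example (the exact `.sif` filename below is illustrative; check the containers directory for the current images):

singularity exec <path to .sif> <command>

e.g.

singularity exec /software/projects/mwavcs/setonix/2024.05/containers/psr-analysis.sif vap -h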

and, if you need an interactive X-Window session to appear, then you will need to add `-B ~/.Xauthority` after the exec command and before the path to the container.

Using the containers via modules

We can also make the Singularity containers "look" like modules, with all the software available on your regular command line (rather than needing to execute a Singularity pre-command). If you follow the steps below, you should be able to see the three containers installed as system modules for our group.

Using Singularity containers as modules
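For example, assuming the container modules are installed under the same group module tree as used above (treat this path as an assumption and adjust it to wherever the container modules actually live):

module use /software/projects/mwavcs/setonix/2024.05/modules/zen3/gcc/12.2.0
module avail psr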

which should then show the three container modules in the module listing.

These modules can then be loaded as per usual. For example, if you load the `psr-analysis/24-11-07` module, you will then have access to various post-processing packages from the command line, e.g.,
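# illustrative check that the container's tools are now on your PATH
module load psr-analysis/24-11-07
which dspsr psrstat pav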

These modules can be used as normal within job submissions via SLURM as well.

Using Nextflow

Nextflow is a workflow management tool, described here. It is extremely useful for constructing data processing pipelines, including our own mwa_search pipeline used for the SMART survey.

On Setonix

Nextflow can be loaded with:
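For example (check `module avail nextflow` for the currently installed version):

module load nextflow/<version>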

Manual Installation

Nextflow requires Java to run. Installation instructions can be found here.

Following Nextflow's installation docs, start by downloading Nextflow with curl:
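curl -s https://get.nextflow.io | bash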

Then make Nextflow executable with chmod:
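chmod +x nextflow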

Move nextflow into a directory in your PATH in order to execute it anywhere.
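For example, assuming `~/bin` is in your PATH:

mkdir -p ~/bin && mv nextflow ~/bin/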

Run nextflow self-update to upgrade to the latest version. This will update the Nextflow executable in-place. To see the available subcommands, run nextflow help.

Typical Usage

For local pipelines, the basic syntax is:
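nextflow run pipeline.nf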

where `pipeline.nf` contains an entry workflow (i.e. a workflow with no name).

For pipelines hosted on GitHub and configured for pipeline sharing, you can simply run the pipeline with:
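nextflow run <username>/<pipeline> -r <branch>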

where -r <branch> allows you to optionally specify a Git branch to use.

If the pipeline receives updates on GitHub, you can update your locally cached version of the pipeline with:
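nextflow pull <username>/<pipeline>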

If the pipeline name is unique among your locally cached pipelines, you can omit the GitHub username on execution.

As a simple example, a typical workflow could look like:
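For instance, a minimal sketch (the pipeline name, branch, and options here are placeholders):

nextflow pull <username>/<pipeline>
nextflow run <username>/<pipeline> -r <branch> --<option> <value>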

In general, you should execute Nextflow within a screen session so that it can run separately from your terminal session. On Setonix, pipelines must be run on a dedicated workflow login node. A typical workflow would be:
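For example, a minimal sketch (the session and pipeline names are placeholders):

screen -S my_pipeline                            # start a named screen session
nextflow run <username>/<pipeline> -r <branch>   # launch the pipeline
# detach with Ctrl+A then D; reattach later with:
screen -r my_pipeline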

Once the pipeline has completed and you no longer need access to the intermediate files, you should delete the work directory to save disk space. The location of the work directory will differ depending on the pipeline, but by default will be placed in the working directory in which the pipeline was launched.


 

Downloading Data

Using the ASVO Web Application

MWA observation data are accessed via the ASVO. The simplest way to download an observation is via the ASVO Web Application. To find an observation, navigate to the Observations page and use the Search menu filters. To browse all SMART observations, find the Project option in the Advanced Search menu and select project G0057.

For VCS-mode observations, you will need to specify the time offset from the start of the observation and the duration of time to download. The following example shows the Job menu for downloading 600 seconds at the start of an observation.

For correlator-mode observations, such as calibrator observations, navigate to the Visibility Download Job tab and select the delivery location as /scratch. An example is shown below.

Once the jobs are submitted, they can be monitored via the My Jobs page, which will display the ASVO Job ID and the status of the download. The downloaded data will appear in /scratch/mwavcs/asvo/<ASVO Job ID>.

Using Giant Squid

Submitting ASVO jobs can also be done from the command line using Giant Squid. To use Giant Squid, you must first set your ASVO API key as an environment variable. You can find your API key in your Profile page on the ASVO Web Application. Then add the following to your `.bashrc`:
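export MWA_ASVO_API_KEY="<your API key>"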

On Setonix, Giant Squid can be loaded with:
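For example (check `module avail giant-squid` for the installed version):

module load giant-squid/<version>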

VCS download jobs can be submitted with the submit-volt subcommand:
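For example (a sketch; the offset/duration flag names here are assumptions, so confirm the exact options with giant-squid submit-volt --help):

giant-squid submit-volt --delivery scratch --offset 0 --duration 600 <obs ID>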

Visibility observation download jobs (e.g., for calibration) can be submitted using the submit-vis subcommand:
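For example (again, confirm the exact options with giant-squid submit-vis --help):

giant-squid submit-vis --delivery scratch <cal obs ID>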

To monitor the status of your ASVO jobs, you can use the list subcommand:
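giant-squid list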

Data Directory Structure

On Setonix, we store VCS data on the scratch partition in user directories. Typically the file structure is the following:

  • ${MYSCRATCH}/<obs ID>

    • /raw

      • Raw VCS data, legacy only (.dat files)

    •  /combined

      • Combined VCS data (.sub or .dat files)

    • /cal

      • <cal ID>

        • Visibilities (GPU Box or MWAX correlator .fits files)

        • <cal ID>.metafits

        • /hyperdrive

          • Calibration solution (.fits and .bin file)

          • Calibration solution diagnostic plots (phases and amplitudes)

      • ...

    • /pointings

      • <Pointing 1>

        • PSRFITS search-mode output (.fits), and/or

        • VDIF voltage beam output (.vdif and .hdr)

      • ...

    • <obs ID>.metafits

 

Calibration

Finding a calibration observation

In order to form a coherent tied-array beam, the antenna delays and gains need to be calibrated. This is commonly done using a dedicated observation of a bright calibrator source such as Centaurus A or Hercules A. Calibrator observations are taken in correlator mode and stored as visibilities. To find a calibration observation, either search the MWA archive using the ASVO Web Application, or run the following command from VCSTools:

This will produce a list of calibrator observations, ranked based on their distance in time to the VCS observation.

VCSTools is not currently installed on Setonix (as of January 2025).

Preprocessing with Birli

Although Hyperdrive accepts the raw visibilities, it is beneficial to downsample the time and frequency resolution to reduce the data size and improve the calibration quality. Preprocessing of raw visibilities can be done with Birli, which has a variety of options for flagging and averaging data (refer to the Birli help information). In general, 2 sec / 40 kHz resolution is a good starting point. Birli can be loaded on Setonix with:
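For example (check `module avail birli` for the installed version):

module load birli/<version>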

The following demonstrates the typical usage:
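A sketch of a typical invocation, averaging to 2 sec / 40 kHz and writing a UVFITS file for Hyperdrive (the output filename and input file list are placeholders):

birli \
    --metafits <cal ID>.metafits \
    --avg-time-res 2 \
    --avg-freq-res 40 \
    --uvfits-out <cal ID>_birli.uvfits \
    <raw visibility .fits files>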

In general, it is also a good idea to specify the maximum allowed RAM for Birli to use with the --max-memory option, which accepts a number corresponding to how many GB of RAM to allocate. Otherwise, Birli will attempt to load the entire raw visibility data set, which often requires >100 GB of memory.

On Setonix, specifying the maximum allowed memory is particularly important, as jobs will be terminated prematurely if they try to access more memory than is allocated based on the number of CPU cores in the resource request. Approximately 1.79 GB of RAM is allocated per CPU core on the work/mwa partitions.

Example Birli job on Setonix
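A minimal sketch (the account, partition, resource requests, module version, and filenames are assumptions to adapt; the memory request follows the ~1.79 GB per core rule mentioned above):

#!/bin/bash -l
#SBATCH --account=mwavcs
#SBATCH --partition=work
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=57G
#SBATCH --time=01:00:00

module load birli/<version>

# average the raw visibilities to 2 sec / 40 kHz and write a UVFITS file
birli \
    --metafits ../<cal ID>.metafits \
    --avg-time-res 2 \
    --avg-freq-res 40 \
    --max-memory 50 \
    --uvfits-out <cal ID>_birli.uvfits \
    <raw visibility .fits files>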

This script would be submitted from the ${MYSCRATCH}/<obs ID>/cal/<cal ID>/hyperdrive directory.

Calibrating with Hyperdrive

Hyperdrive is the latest generation calibration software for the MWA. It is written in Rust and uses GPU acceleration, making it several times faster than previous MWA calibration software. See the documentation for further details. Hyperdrive can be loaded on Setonix with:
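For example (check `module avail hyperdrive` for the installed version):

module load hyperdrive/<version>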

We use Hyperdrive to perform direction-independent (DI) calibration using a sky model provided by the user. The more accurately the sky model reflects the input data, the better the convergence of the calibration solution. The sky model source list can be extracted from an all-sky catalogue such as GLEAM, or one can use a dedicated model for a well-characterised source such as Centaurus A. To compile a list of 1000 sources within the beam from the GLEAM Galactic Sky Model and save it as srclist_1000.yaml, use Hyperdrive's srclist-by-beam subcommand as shown below:
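For example (a sketch; the path to the GLEAM Galactic Sky Model catalogue is a placeholder, and the flags should be confirmed with hyperdrive srclist-by-beam --help):

hyperdrive srclist-by-beam \
    --metafits <cal ID>.metafits \
    -n 1000 \
    <path to GGSM catalogue> \
    srclist_1000.yaml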

Calibration is performed using Hyperdrive's di-calibrate subcommand, as shown below.
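For example, using the Birli-preprocessed UVFITS file from above (filenames are placeholders):

hyperdrive di-calibrate \
    --source-list srclist_1000.yaml \
    --data <cal ID>_birli.uvfits <cal ID>.metafits \
    --fine-chan-flags-per-coarse-chan 0 1 16 30 31 \
    --outputs hyperdrive_solutions.fits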

Note that for 40 kHz frequency channels, we typically flag channels 0 1 16 30 31 to zero-weight the channels at the edges and centre of the bandpass response.

This will produce hyperdrive_solutions.fits. To inspect the solution, use Hyperdrive's solutions-plot subcommand to generate two PNG images (for the amplitudes and phases) as shown below.
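For example:

hyperdrive solutions-plot \
    --metafits <cal ID>.metafits \
    hyperdrive_solutions.fits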

An example calibration solution is shown below. The legend is shown in the upper right of each figure. The last (non-flagged) tile is the reference tile, unless a reference tile is selected with --ref-tile. All other tiles' solutions are divided by the reference's solution before plotting. The gains of the X and Y polarisations of each antenna, gx and gy, are plotted along with the leakage terms Dx and Dy. In the amplitudes plot, the gains should be flat across the band with a value of around 1, and the leakage terms should be around 0. In the phases plot, the gains should exhibit a linear ramp with frequency, and the leakage terms should be randomly distributed. Tiles which deviate far from these behaviours should be flagged by providing a space-separated list to the --tile-flags option in di-calibrate. Also note that the plot range covers the entire data range by default, which is often dominated by bad tiles. To rescale the plots, you can either flag the bad tiles or use the --max-amp and --min-amp options in solutions-plot.

Some examples of less-than-ideal calibration solutions are provided here:

In this case, we have a solution that is almost okay, except for two aspects.

One is clearly visible if you look closely at Tiles 104-111, where the phase ramp in the first ~third of the band has a periodic rippling structure (likely due to loose cabling). Unfortunately, the best course of action is to flag those tiles outright, as it is not currently possible to flag ranges of fine/coarse channels on a per-tile basis. (One could in principle "manually" open the FITS file containing the solutions and set the affected tiles' partially bad solutions to NaN.)

The second is that in many of the tile solution plots, we see a nice phase ramp with individual points periodically above/below the ramp. This indicates that the edge channels have not been correctly flagged. Since our data in this case are at 40 kHz resolution, we can fix this simply by adding --fine-chan-flags-per-coarse-chan 0 1 16 30 31 (flagging the edge 80 kHz and central 40 kHz), as mentioned above.

Once we are satisfied with the solution, the FITS solutions file must be converted to Offringa format in order for VCSBeam to use it. This can be done with Hyperdrive's solutions-convert subcommand as shown below.
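For example:

hyperdrive solutions-convert \
    hyperdrive_solutions.fits \
    hyperdrive_solutions.bin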

Example Hyperdrive job on Setonix

The following script will run Hyperdrive as described above. The script must be executed from the ${MYSCRATCH}/<obs ID>/cal/<cal ID>/hyperdrive directory.
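A minimal sketch (the GPU account/partition names, resource requests, module version, and filenames are assumptions to adapt for your allocation):

#!/bin/bash -l
#SBATCH --account=mwavcs-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=00:30:00

module load hyperdrive/<version>

# generate a source list, calibrate, plot the solutions, and convert to Offringa format
hyperdrive srclist-by-beam --metafits <cal ID>.metafits -n 1000 <path to GGSM catalogue> srclist_1000.yaml

hyperdrive di-calibrate \
    --source-list srclist_1000.yaml \
    --data <cal ID>_birli.uvfits <cal ID>.metafits \
    --fine-chan-flags-per-coarse-chan 0 1 16 30 31 \
    --outputs hyperdrive_solutions.fits

hyperdrive solutions-plot --metafits <cal ID>.metafits hyperdrive_solutions.fits

hyperdrive solutions-convert hyperdrive_solutions.fits hyperdrive_solutions.bin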

Recombining Legacy VCS Data

If you are using Setonix, a Nextflow pipeline is available to make recombining data simple. To use it, run:
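The invocation looks something like the following (the pipeline name is a placeholder here; substitute the actual recombine pipeline):

nextflow run <recombine pipeline> --download_dir /scratch/mwavcs/asvo/<ASVO Job ID>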

The pipeline will look in the directory specified with the --download_dir option and determine what data is available. The combined data will be placed in <VCS_DIR>/<OBSID>/combined. The metafits file (with the name <OBSID>_metafits_ppds.fits) should exist in either the download directory or the <VCS_DIR>/<OBSID> directory. The default for --vcs_dir is /scratch/mwavcs/$USER.

For example:
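Again with the pipeline name as a placeholder:

nextflow run <recombine pipeline> --download_dir /scratch/mwavcs/asvo/123456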

In this case, the pipeline would recombine the files in /scratch/mwavcs/asvo/123456 and place them in /scratch/mwavcs/$USER/1261241272/combined.

By default, the pipeline will submit one job per 32 seconds of data, so a full SMART observation will be split into 150 jobs. Each job should take ~10-20 min to complete.

The pipeline does not delete the raw data. Therefore, to save disk space, it is recommended that you manually delete the raw data as soon as the pipeline has successfully completed.

Beamforming

Finding a target

Pulsars can be located in the MWA primary beam using the mwa-source-finder, a pip-installable Python package with a command line interface. Python environments on Setonix are managed at the user level, so this software should be installed into a Python virtual environment. For example:
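A minimal sketch (the Python module version and virtual environment location are placeholders, and the package name is assumed to match its PyPI name):

module load python/<version>
python -m venv ~/venvs/source-finder
source ~/venvs/source-finder/bin/activate
pip install mwa-source-finder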

Example 1: Finding all pulsars in an observation

To be specific, the software defines "in the beam" as the zenith-normalised beam power towards the source exceeding a specified minimum (by default 0.3, i.e. 30%). By default, it evaluates the primary beam model at the centre frequency of the observation, but this can be changed with --freq_mode {low/centre/high/multi}.

Example 2: Finding all observations for a pulsar/source

The --filter_available option will exclude observations with no data files available. The source(s) can be specified as pulsar B/J names, or RA/Dec coordinates in sexagesimal or decimal form. The following are all acceptable: "B1451-68" "J1456-6843" "14:55:59.92_-68:43:39.50" "223.999679_-68.727639".

Example 3: A more complex query to find pulsars in an observation

The above example will search for pulsars matching the given PSRCAT query that exceed a zenith-normalised beam power of 0.5 between the fractional observation times 0.1 and 0.5. It will also make a plot of the source power traces over time.

Pulsar coordinates can be obtained from the ATNF Pulsar Catalogue web portal. The coordinates should be written in a text file (RA and Dec in sexagesimal format separated by a space, one set of coordinates per line) to give to VCSBeam.

Tied-array beamforming with VCSBeam

VCSBeam is the successor to the legacy VCS tied-array beamformer. It is capable of processing both legacy and MWAX data into any of the following formats:

  • fine-channelised (10 kHz) full-Stokes time series in PSRFITS format (-p option)

  • fine-channelised (10 kHz) Stokes I time series in PSRFITS format (-N option)

  • coarse-channelised (1.28 MHz) complex voltage time series in VDIF format (-v option)

The coarse channelised output is made possible by the inverse PFB, which reverses the 10 kHz channelisation of the initial PFB stage. To run the tied-array beamformer, we use the make_mwa_tied_array_beam program, which is MPI-enabled to process multiple coarse channels in a single Slurm job. Further details about this command can be found here.

An example Slurm script is given below, where 24 channels are being processed across 24 nodes. In this case, the -p option is used to specify PSRFITS output.

beamform.sbatch
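A skeleton of such a script is sketched below (the account/partition names, resource requests, module name/version, and the remaining make_mwa_tied_array_beam arguments are assumptions or placeholders; consult the VCSBeam documentation for the full set of required options):

#!/bin/bash -l
#SBATCH --account=mwavcs-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=24
#SBATCH --ntasks=24
#SBATCH --gpus-per-node=1
#SBATCH --time=01:00:00

module load vcsbeam/<version>

# one MPI rank per coarse channel; -p selects PSRFITS output
srun make_mwa_tied_array_beam \
    -p \
    -f <lowest coarse channel> \
    <observation metafits, combined data directory, calibration solution, and pointings file arguments>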

For VDIF output, the beamforming can be split into multiple smaller jobs using a Slurm job array. This is done with the sbatch option --array=LOWCHAN-HIGHCHAN. The -f option in make_mwa_tied_array_beam should then be given $SLURM_ARRAY_TASK_ID.
