Data Processing Documentation
Introduction
Since the initial creation of the legacy documentation, there have been significant changes to the VCS data processing workflow due to the introduction of the MWAX correlator and the new beamforming software, VCSBeam. The purpose of this documentation is to provide an updated description of the processing workflow. Please direct any questions or suggestions to @Christopher Lee or @Bradley Meyers.
This documentation deals with data collected with the MWA's Voltage Capture System (VCS), which is described in Tremblay et al. (2015). The MWA tied-array processing paper series provides details about calibration, beamforming, and signal processing – see Ord et al. (2015), Xue et al. (2019), McSweeney et al. (2020), and Swainston et al. (2022). Additionally, the VCSBeam documentation provides some theory and examples regarding calibration and beamforming.
While not necessary, it will be useful to have a basic understanding of radio interferometry when reading this documentation, so this course may be a useful resource. Additionally, the PRESTO search tutorial provides a good introduction to the software tools available in PRESTO.
Software Usage on Setonix
VCS data processing is currently performed primarily on Pawsey's Setonix supercomputer. We refer you to the Setonix User Guide for the complete details on the software stack and how to run jobs. Relevant details are provided here.
Modules
Members of the mwavcs Pawsey group will have access to the common software used to calibrate and reduce VCS data. These can be browsed with:
module avail
To load a module, the user must specify the module name and version like so:
module load <module>/<version>
Using Common Pulsar Software
On Setonix, the common pulsar software packages are provided as singularity images (`.sif` files), which can be used interactively or to execute routines on the command line. There are two ways in which these software containers can be used: directly (as was previously done on Garrawarla), or via the module system. We'll describe both below.
There are three containers, all of which are located in `/software/projects/mwavcs/setonix/2024.05/containers/`. Specifically:
psr-timing: contains the standard pulsar timing software packages, tempo, tempo2, and PINT (along with updated clock files from when the image was produced).
psr-search: contains the most commonly used searching software packages, PRESTO and riptide.
psr-analysis: contains the most commonly used analysis tools and utilities, PSRCHIVE, DSPSR, PSRSALSA, clfd, and PulsePortraiture.
If there are missing packages in any of the above containers, please let someone know via Slack or email so that we can re-build and include what you need. (Although be prepared to answer any compilation/installation questions that may arise!)
You can see the Dockerfiles used here: https://github.com/CIRA-Pulsars-and-Transients-Group/pulsar-software-containers/tree/main
Directly using the containers
First, we need to load the Singularity module,
Loading the Singularity module
module use /software/projects/mwavcs/setonix/2024.05/modules/zen3/gcc/12.2.0
module load singularity/4.1.0-mwavcs
You can then execute a command via the container of your choice, like so:
Execute a command via a Singularity container
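For example, the following illustrative command runs PSRCHIVE's psrstat from the psr-analysis container; the exact .sif file name and the archive file are placeholders, so check the container directory listed above for the actual image names.
singularity exec /software/projects/mwavcs/setonix/2024.05/containers/psr-analysis.sif psrstat -c nsubint,nchan,nbin <archive>.ar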
If you need an interactive X-Window session to appear, you will need to add -B ~/.Xauthority after the exec command and before the path to the container.
Using the containers via modules
We can also make the singularity containers "look" like modules, with all the software available on your regular command line (rather than needing to execute a Singularity pre-command). Nominally, if you follow the below steps, you should be able to see the three containers installed as system modules for our group.
Using Singularity containers as modules
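The following is a sketch of the idea; it assumes the container modulefiles live in the same project module tree that was loaded above, so confirm the path with your group if it has moved.
module use /software/projects/mwavcs/setonix/2024.05/modules/zen3/gcc/12.2.0
module avail psr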
which should then show something like
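The output below is illustrative only; psr-analysis/24-11-07 is the version referred to later in this section, while the version tags shown for psr-search and psr-timing are assumptions.
psr-analysis/24-11-07    psr-search/24-11-07    psr-timing/24-11-07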
and then these modules can be loaded as per usual. For example, if you load the psr-analysis/24-11-07 module, you will then have access to various post-processing packages from the command line, e.g.,
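As an illustration (the specific tools available depend on the container contents listed above):
module load psr-analysis/24-11-07
psrstat -h      # PSRCHIVE
dspsr -h        # DSPSR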
These modules can be used as normal within job submissions via SLURM as well.
Using Nextflow
Nextflow is workflow management software, which is described here. It is extremely useful for constructing data processing pipelines, including our own mwa_search pipeline used for the SMART survey.
On Setonix
Nextflow can be loaded with:
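For example (substitute an available version, as listed by module avail nextflow):
module load nextflow/<version>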
Manual Installation
Nextflow requires Java to run. Installation instructions can be found here.
Following Nextflow's installation docs, start by downloading Nextflow with curl:
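This is the standard download command from the Nextflow documentation:
curl -s https://get.nextflow.io | bash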
Then make Nextflow executable with chmod:
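For example:
chmod +x nextflow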
Move nextflow into a directory in your PATH in order to execute it anywhere.
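For example, assuming ~/bin exists and is in your PATH:
mv nextflow ~/bin/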
Run nextflow self-update to upgrade to the latest version. This will update the Nextflow executable in-place. To see the available subcommands, run nextflow help.
Typical Usage
For local pipelines, the basic syntax is:
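A minimal example, using the script name referred to below:
nextflow run pipeline.nf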
where pipeline.nf contains an entry workflow (i.e. a workflow with no name).
For pipelines hosted on GitHub and configured for pipeline sharing, you can simply run the pipeline with:
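For example, with placeholder GitHub user and pipeline names:
nextflow run <github user>/<pipeline> -r <branch>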
where -r <branch> allows you to optionally specify a Git branch to use.
If the pipeline receives updates on GitHub, you can update your locally cached version of the pipeline with:
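For example (again with placeholder names):
nextflow pull <github user>/<pipeline>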
If the pipeline name is unique among your locally cached pipelines, you can omit the GitHub username on execution.
As a simple example, a typical workflow could look like:
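The following sketch uses placeholder repository, branch, and option names:
nextflow pull <github user>/<pipeline>
nextflow run <pipeline> -r <branch> <pipeline options>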
In general, you should execute Nextflow within a screen session so that it can run separately from your terminal session. On Setonix, pipelines must be run on a dedicated workflow login node. A typical workflow would be:
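A sketch of this is shown below; connecting to the dedicated workflow login node is not shown (see the Setonix User Guide for the current procedure), and the module version and pipeline names are placeholders.
screen -S my_pipeline                                        # start a named screen session
module load nextflow/<version>
nextflow run <github user>/<pipeline> -r <branch> <pipeline options>
# detach with Ctrl-a d, and re-attach later with: screen -r my_pipeline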
Once the pipeline has completed and you no longer need access to the intermediate files, you should delete the work directory to save disk space. The location of the work directory will differ depending on the pipeline, but by default will be placed in the working directory in which the pipeline was launched.
Downloading Data
Using the ASVO Web Application
MWA observation data are accessed via the ASVO. The simplest way to download an observation is via the ASVO Web Application. To find an observation, navigate to the Observations page and use the Search menu filters. To browse all SMART observations, find the Project option in the Advanced Search menu and select project G0057.
For VCS-mode observations, you will need to specify the time offset from the start of the observation and the duration of time to download. The following example shows the Job menu for downloading 600 seconds at the start of an observation.
For correlator-mode observations, such as calibrator observations, navigate to the Visibility Download Job tab and select the delivery location as /scratch. An example is shown below.
Once the jobs are submitted, they can be monitored via the My Jobs page, which will display the ASVO Job ID and the status of the download. The downloaded data will appear in /scratch/mwavcs/asvo/<ASVO Job ID>.
Using Giant Squid
Submitting ASVO jobs can also be done from the command line using Giant Squid. To use Giant Squid, you must first set your ASVO API key as an environment variable. You can find your API key in your Profile page on the ASVO Web Application. Then add the following to your .bashrc
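Giant Squid reads the API key from the MWA_ASVO_API_KEY environment variable:
export MWA_ASVO_API_KEY=<api key>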
On Setonix, Giant Squid can be loaded with:
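For example (substitute an installed version, as listed by module avail giant-squid):
module load giant-squid/<version>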
VCS download jobs can be submitted with the submit-volt subcommand:
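An illustrative invocation is shown below; options controlling the time offset, duration, and delivery location also exist, so check giant-squid submit-volt --help for their exact names.
giant-squid submit-volt <obs ID>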
Visibility observation download jobs (e.g., for calibration) can be submitted using the submit-vis subcommand:
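For example (the --delivery option requests delivery to /scratch; confirm the current interface with giant-squid submit-vis --help):
giant-squid submit-vis --delivery scratch <obs ID>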
To monitor the status of your ASVO jobs, you can use the list subcommand:
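giant-squid list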
Data Directory Structure
On Setonix, we store VCS data on the scratch partition in user directories. Typically the file structure is the following:
${MYSCRATCH}/<obs ID>/
    raw/                          Raw VCS data, legacy only (.dat files)
    combined/                     Combined VCS data (.sub or .dat files)
    cal/
        <cal ID>/
            Visibilities (GPU Box or MWAX correlator .fits files)
            <cal ID>.metafits
            hyperdrive/
                Calibration solution (.fits and .bin files)
                Calibration solution diagnostic plots (phases and amplitudes)
        ...
    pointings/
        <Pointing 1>/
            PSRFITS search-mode output (.fits), and/or
            VDIF voltage beam output (.vdif and .hdr)
        ...
    <obs ID>.metafits
Calibration
Finding a calibration observation
In order to form a coherent tied-array beam, the antenna delays and gains need to be calibrated. This is commonly done using a dedicated observation of a bright calibrator source such as Centaurus A or Hercules A. Calibrator observations are taken in correlator mode and stored as visibilities. To find a calibration observation, either search the MWA archive using the ASVO Web Application, or run the following command from VCSTools:
This will produce a list of calibrator observations, ranked based on their distance in time to the VCS observation.
VCSTools is not currently installed on Setonix. (As of January 2025)
Preprocessing with Birli
Although Hyperdrive accepts the raw visibilities, it is beneficial to downsample the time and frequency resolution to reduce the data size and improve the calibration quality. Preprocessing of raw visibilities can be done with Birli, which has a variety of options for flagging and averaging data (refer to the Birli help information). In general, 2 sec / 40 kHz resolution is a good starting point. Birli can be loaded on Setonix with:
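For example (substitute an installed version, as listed by module avail birli):
module load birli/<version>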
The following demonstrates the typical usage:
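The command below is an illustrative sketch that averages the raw visibilities to 2 s / 40 kHz and writes a uvfits file; the file names are placeholders, and the flag names should be checked against birli --help.
birli \
    --metafits <cal ID>.metafits \
    --uvfits-out <cal ID>_birli.uvfits \
    --avg-time-res 2 \
    --avg-freq-res 40 \
    <path to raw visibility .fits files>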
In general, it is also a good idea to specify the maximum allowed RAM for Birli to use with the --max-memory option, which accepts a number corresponding to how many GB of RAM to allocate. Otherwise, Birli will attempt to load the entire raw visibility data set, which often requires >100 GB of memory.
On Setonix, specifying the maximum allowed memory is particularly important, as jobs will be terminated prematurely if they try to access more memory than is allocated based on the number of CPUs in the resource request. There is approximately 1.79 GB of RAM per CPU core on the work/mwa partitions.
Example Birli job on Setonix
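A sketch of such a job script is given below; the account, partition, resource requests, module version, and file paths are all illustrative and should be adapted to your data and the current Setonix configuration.
#!/bin/bash -l
#SBATCH --account=mwavcs
#SBATCH --partition=work
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=57G
#SBATCH --time=01:00:00

module load birli/<version>

# Average the raw visibilities to 2 s / 40 kHz, capping memory below the Slurm allocation
birli \
    --metafits ../<cal ID>.metafits \
    --uvfits-out <cal ID>_birli.uvfits \
    --avg-time-res 2 \
    --avg-freq-res 40 \
    --max-memory 50 \
    /scratch/mwavcs/asvo/<ASVO Job ID>/<raw visibility .fits files>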
This script would be submitted from the ${MYSCRATCH}/<obs ID>/cal/<cal ID>/hyperdrive directory.
Calibrating with Hyperdrive
Hyperdrive is the latest generation calibration software for the MWA. It is written in Rust and uses GPU acceleration, making it several times faster than previous MWA calibration software. See the documentation for further details. Hyperdrive can be loaded on Setonix with:
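For example (substitute an installed version, as listed by module avail hyperdrive):
module load hyperdrive/<version>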
We use Hyperdrive to perform direction-independent (DI) calibration using a sky model provided by the user. The more accurately the sky model reflects the input data, the better the convergence of the calibration solution. The sky model source list can be extracted from an all-sky catalogue such as GLEAM, or one can use a dedicated model for a well-characterised source such as Centaurus A. To compile a list of 1000 sources within the beam from the GLEAM Galactic Sky Model and save it as srclist_1000.yaml, use Hyperdrive's srclist-by-beam subcommand as shown below:
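An illustrative command is shown here; the path to the GLEAM Galactic Sky Model source list is a placeholder, and the options can be checked with hyperdrive srclist-by-beam --help.
hyperdrive srclist-by-beam --metafits <cal ID>.metafits --number 1000 <path to GGSM source list> srclist_1000.yaml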
Calibration is performed using Hyperdrive's di-calibrate subcommand, as shown below.
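A sketch of the command follows; the file names are placeholders, and the full set of options is listed by hyperdrive di-calibrate --help.
hyperdrive di-calibrate \
    --source-list srclist_1000.yaml \
    --data <cal ID>_birli.uvfits <cal ID>.metafits \
    --fine-chan-flags-per-coarse-chan 0 1 16 30 31 \
    --outputs hyperdrive_solutions.fits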
Note that for 40 kHz frequency channels, we typically flag channels 0 1 16 30 31 to zero-weight the channels at the edges and centre of the bandpass response.
This will produce hyperdrive_solutions.fits. To inspect the solution, use Hyperdrive's solutions-plot subcommand to generate two PNG images (for the amplitudes and phases), as shown below.
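For example:
hyperdrive solutions-plot --metafits <cal ID>.metafits hyperdrive_solutions.fits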
An example calibration solution is shown below. The legend is shown in the upper right of each figure. The last (non-flagged) tile is the reference tile, unless a reference tile is selected with --ref-tile. All other tiles' solutions are divided by the reference's solution before plotting. The gains of the X and Y polarisations of each antenna, gx and gy, are plotted along with the leakage terms Dx and Dy. In the amplitudes plot, the gains should be flat across the band with a value of around 1, and the leakage terms should be around 0. In the phases plot, the gains should exhibit a linear ramp with frequency and the leakage terms should be randomly distributed. Tiles which deviate far from these behaviours should be flagged by providing a space-separated list to the --tile-flags option in di-calibrate. Also note that the plot range covers the entire data range by default, which is often dominated by bad tiles. To rescale the plots, you can either flag the bad tiles or use the --max-amp and --min-amp options in solutions-plot.
Some examples of less-than-ideal calibration solutions are provided here:
Once we are satisfied with the solution, the FITS solutions file must be converted to Offringa format in order for VCSBeam to use it. This can be done with Hyperdrive's solutions-convert subcommand as shown below.
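For example:
hyperdrive solutions-convert hyperdrive_solutions.fits hyperdrive_solutions.bin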
Example Hyperdrive job on Setonix
The following script will run Hyperdrive as described above. The script must be executed from the ${MYSCRATCH}/<obs ID>/cal/<cal ID>/hyperdrive directory.
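A sketch of such a script is provided below; the account, partition, GPU resource request, and module version are illustrative and should be checked against the Setonix User Guide, and the file names are placeholders.
#!/bin/bash -l
#SBATCH --account=mwavcs-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

module load hyperdrive/<version>

# Direction-independent calibration against the source list generated above
hyperdrive di-calibrate \
    --source-list srclist_1000.yaml \
    --data <cal ID>_birli.uvfits <cal ID>.metafits \
    --fine-chan-flags-per-coarse-chan 0 1 16 30 31 \
    --outputs hyperdrive_solutions.fits

# Diagnostic plots of the solution amplitudes and phases
hyperdrive solutions-plot --metafits <cal ID>.metafits hyperdrive_solutions.fits

# Convert to Offringa format for VCSBeam
hyperdrive solutions-convert hyperdrive_solutions.fits hyperdrive_solutions.bin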
Recombining Legacy VCS Data
If you are using Setonix, there is a Nextflow pipeline available to make recombining data simple. To use it, run:
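The pipeline and option names below are placeholders, apart from --download_dir and --vcs_dir which are described next; substitute the actual recombine pipeline repository.
nextflow run <github user>/<recombine pipeline> --download_dir <path to downloaded data> --vcs_dir <VCS_DIR>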
The pipeline will look in the directory specified with the --download_dir option and determine what data is available. The combined data will be placed in <VCS_DIR>/<OBSID>/combined. The metafits file (with the name <OBSID>_metafits_ppds.fits) should exist in either the download directory or the <VCS_DIR>/<OBSID> directory. The default for --vcs_dir is /scratch/mwavcs/$USER.
For example:
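For instance (with the same placeholder pipeline name as above, and the ASVO job and observation IDs used in the example that follows):
nextflow run <github user>/<recombine pipeline> --download_dir /scratch/mwavcs/asvo/123456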
In this case, the pipeline would recombine the files in /scratch/mwavcs/asvo/123456 and place them in /scratch/mwavcs/$USER/1261241272/combined.
By default, the pipeline will submit one job per 32 seconds of data, so a full SMART observation will be split into 150 jobs. Each job should take ~10-20 min to complete.
The pipeline does not delete the raw data. Therefore, to save disk space, it is recommended that you manually delete the raw data as soon as the pipeline has successfully completed.
Beamforming
Finding a target
Pulsars can be located in the MWA primary beam using the mwa-source-finder, a pip-installable Python package with a command line interface. Python environments on Setonix are managed at the user level, so this software should be installed into a Python virtual environment. For example:
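A sketch of this is shown below; the Python module version and virtual environment location are illustrative, and the pip package name is assumed to match the project name (the package may alternatively be installed from its GitHub repository).
module load python/<version>
python -m venv ~/venvs/source-finder
source ~/venvs/source-finder/bin/activate
pip install mwa-source-finder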
Example 1: Finding all pulsars in an observation
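A hypothetical invocation is sketched here; the executable and option names are assumptions, so consult the package's help output for the actual interface.
source-finder -o <obs ID>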
To be specific, the software defines "in the beam" as the zenith-normalised beam power towards the source exceeding a specified minimum (by default 0.3, i.e. 30%). By default it evaluates the primary beam model at the centre frequency of the observation, but this can be changed with --freq_mode {low/centre/high/multi}.
Example 2: Finding all observations for a pulsar/source
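Another hypothetical invocation; only the --filter_available option is taken from the description below, while the executable and the remaining option names are assumptions.
source-finder --obs_for_source --filter_available -s "B1451-68"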
The --filter_available option will exclude observations with no data files available. The source(s) can be specified as pulsar B/J names, or RA/Dec coordinates in sexagesimal or decimal form. The following are all acceptable: "B1451-68" "J1456-6843" "14:55:59.92_-68:43:39.50" "223.999679_-68.727639".
Example 3: A more complex query to find pulsars in an observation
The above example will search for pulsars matching the given PSRCAT query that exceed a zenith-normalised beam power of 0.5 between the fractional observation times 0.1 and 0.5. It will also make a plot of the source power traces over time.
Pulsar coordinates can be obtained from the ATNF Pulsar Catalogue web portal. The coordinates should be written in a text file (RA and Dec in sexagesimal format separated by a space, one set of coordinates per line) to give to VCSBeam.
Tied-array beamforming with VCSBeam
VCSBeam is the successor to the legacy VCS tied-array beamformer. It is capable of processing both legacy and MWAX data into any of the following formats:
fine-channelised (10 kHz) full-Stokes time series in PSRFITS format (-p option)
fine-channelised (10 kHz) Stokes I time series in PSRFITS format (-N option)
coarse-channelised (1.28 MHz) complex voltage time series in VDIF format (-v option)
The coarse channelised output is made possible by the inverse PFB, which reverses the 10 kHz channelisation of the initial PFB stage. To run the tied-array beamformer, we use the make_mwa_tied_array_beam program, which is MPI-enabled to process multiple coarse channels in a single Slurm job. Further details about this command can be found here.
An example Slurm script is given below, where 24 channels are being processed across 24 nodes. In this case, the -p option is used to specify PSRFITS output.
beamform.sbatch
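A sketch of such a script follows; the account, partition, resource requests, and module version are illustrative, and the make_mwa_tied_array_beam arguments other than -p are shown only as a placeholder (see the VCSBeam documentation for the exact options).
#!/bin/bash -l
#SBATCH --account=mwavcs-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=24
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

module load vcsbeam/<version>

# One MPI task per coarse channel (24 channels across 24 nodes); -p selects PSRFITS output
srun -N 24 -n 24 make_mwa_tied_array_beam \
    -p \
    <observation metafits, calibration solution, pointings file, and data options>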
For VDIF output, the beamforming can be split into multiple smaller jobs using a Slurm job array. This is done with the sbatch option --array=LOWCHAN-HIGHCHAN. The -f option in make_mwa_tied_array_beam should then be given $SLURM_ARRAY_TASK_ID.