Updated Documentation
Introduction
Since the initial creation of the legacy documentation, there have been significant changes to the VCS data processing workflow due to the introduction of the MWAX correlator and the new beamforming software, VCSBeam. The purpose of this documentation is to provide an updated description of the processing workflow. Please direct any questions or suggestions to Christopher Lee.
This documentation deals with data collected with the MWA's Voltage Capture System (VCS), which is described in Tremblay et al. (2015). The MWA tied-array processing paper series provides details about calibration, beamforming, and signal processing – see Ord et al. (2015), Xue et al. (2019), McSweeney et al. (2020), and Swainston et al. (2022). Additionally, the VCSBeam documentation provides some theory and examples regarding calibration and beamforming.
While not necessary, it will be useful to have a basic understanding of radio interferometry when reading this documentation, so this course may be a useful resource. Additionally, the PRESTO search tutorial provides a good introduction to the software tools available in PRESTO.
Environment setup
Loading modules
The software used in this documentation is available on several supercomputing clusters. Below we describe how to access the software on Pawsey's Garrawarla cluster.
The software installed on Garrawarla is managed using Lmod, which keeps the login environment tidy and reproducible. Detailed documentation is provided by Pawsey here, but the basic usage is summarised below.
To access custom MWA software modules, run the following command:
module use /pawsey/mwa/software/python3/modulefiles
You can then browse the available modules which do not conflict with your machine using
module avail
To load a module, run the command
module load <module>/<version>
For example, to use Singularity containers, you must first load the singularity module like so:
module load singularity/3.7.4
Garrawarla also supports default modules. If singularity version 3.7.4 is the default module, then the above is equivalent to:
module load singularity
Using common pulsar software
Pulsar software is difficult to install at the best of times, so some common packages are not currently natively installed on Garrawarla, but are provided via containerisation. There are two generic Singularity containers available to users that focus on two different aspects of pulsar science/processing.
The psr-search container includes common pulsar searching tools including PRESTO and riptide (FFA implementation). It can be accessed as shown below.
singularity exec /pawsey/mwa/singularity/psr-search/psr-search.sif <command>
The psr-analysis container includes common pulsar analysis tools including DSPSR, PSRCHIVE, and various timing packages. It can be accessed as shown below.
singularity exec /pawsey/mwa/singularity/psr-analysis/psr-analysis.sif <command>
For programs with interactivity, ~/.Xauthority must be added as a bind path:
singularity exec -B ~/.Xauthority <container> <command>
Commands which do not place a significant load on the CPU/memory can usually be run on the login node, provided the supercomputing provider allows this. This is particularly useful for querying metadata and generating interactive plots. For convenience, you can add the file paths as environment variables and define aliases. For example, the following could be added to your .bashrc:
export PSR_SEARCH_CONT=/pawsey/mwa/singularity/psr-search/psr-search.sif
export PSR_ANALYSIS_CONT=/pawsey/mwa/singularity/psr-analysis/psr-analysis.sif
alias psredit="singularity exec $PSR_ANALYSIS_CONT psredit"
alias pav="singularity exec -B ~/.Xauthority $PSR_ANALYSIS_CONT pav"
Once the above .bashrc has been loaded, the commands psredit and pav will be available in an interactive session on the login node.
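The aliases above can be generalised into a single wrapper function. The following is a minimal sketch rather than part of the installed software: the function name psr_run and the PSR_X11 and DRY_RUN variables are hypothetical, and the container paths are the Garrawarla paths given above.

```shell
# Container locations on Garrawarla, as given above
PSR_SEARCH_CONT=/pawsey/mwa/singularity/psr-search/psr-search.sif
PSR_ANALYSIS_CONT=/pawsey/mwa/singularity/psr-analysis/psr-analysis.sif

# psr_run <search|analysis> <command> [args...]
# Hypothetical helper: set PSR_X11=1 to bind ~/.Xauthority for interactive
# programs, and DRY_RUN=1 to print the command instead of running it.
psr_run() (
    case "$1" in
        search)   cont=$PSR_SEARCH_CONT ;;
        analysis) cont=$PSR_ANALYSIS_CONT ;;
        *) echo "usage: psr_run <search|analysis> <command> [args...]" >&2; exit 2 ;;
    esac
    shift
    # Build the full singularity command line in the positional parameters
    if [ "${PSR_X11:-0}" = 1 ]; then
        set -- singularity exec -B "$HOME/.Xauthority" "$cont" "$@"
    else
        set -- singularity exec "$cont" "$@"
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
)
```

With this defined, `PSR_X11=1 psr_run analysis pav` should behave like the pav alias, and `DRY_RUN=1` can be used to check the command before running it.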
Using Nextflow
Nextflow is workflow management software, which is described here. It is extremely useful for constructing data processing pipelines, including our own mwa_search pipeline used for the SMART survey.
On Setonix
Nextflow can be loaded with:
module load nextflow
On Garrawarla (or your PC)
Following Nextflow's installation docs, start by downloading Nextflow with curl:
curl -s https://get.nextflow.io | bash
Then make Nextflow executable with chmod:
chmod +x nextflow
Move nextflow into a directory in your PATH in order to execute it anywhere.
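Before moving the executable, it can help to confirm that the target directory is actually on your PATH. The sketch below assumes ~/bin as the install location, which is only an example; the in_path function is a hypothetical helper.

```shell
# in_path <dir> [path-list]: succeed if <dir> is an entry of the
# colon-separated list (defaults to the current $PATH)
in_path() {
    case ":${2:-$PATH}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# ~/bin is an assumed install location; any directory on your PATH works
if in_path "$HOME/bin"; then
    echo "$HOME/bin is already on PATH"
else
    echo "add this to your .bashrc: export PATH=\"\$HOME/bin:\$PATH\""
fi
```

If the directory is on your PATH, `mv nextflow ~/bin/` completes the installation.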
Nextflow requires Java. On Garrawarla, this can be loaded with:
module load java/17
Run nextflow self-update to upgrade to the latest version. This will update the Nextflow executable in-place. To see the available subcommands, run nextflow help.
Typical Usage
For local pipelines, the basic syntax is:
nextflow run pipeline.nf [PIPELINE OPTIONS]
where pipeline.nf contains an entry workflow.
For pipelines hosted on GitHub and configured for pipeline sharing, you can simply run the pipeline with:
nextflow run <username>/<repository> [-r <branch>] [PIPELINE OPTIONS]
where -r <branch> allows you to optionally specify a Git branch to use.
If the pipeline receives updates on GitHub, you can update your locally cached version of the pipeline with:
nextflow pull [-r <branch>] <username>/<repository>
In general, you should execute Nextflow within a screen session so that it can run separately from your terminal session. On Garrawarla, we can run Nextflow on the login node, whereas on Setonix we run Nextflow on a dedicated workflow login node. For example:
# Create a session
screen -S pipeline
# Start the pipeline
nextflow run pipeline.nf
# Detach from the session with Ctrl+A then D
# View all of your running screen sessions
screen -ls
# Reattach to your session
screen -r pipeline
Once the pipeline has completed and you no longer need access to the intermediate files, you should delete the work directory to save disk space. The location of the work directory will differ depending on the pipeline, but by default will be placed in the working directory in which the pipeline was launched.
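As a minimal sketch of this clean-up step, the hypothetical helper below reports the size of a work directory and then deletes it. It assumes the default location, a directory called work inside the launch directory; pass a different path if your pipeline is configured otherwise.

```shell
# clean_work_dir [dir]: report the size of a Nextflow work directory and
# delete it. Defaults to ./work, Nextflow's default location relative to
# the launch directory.
clean_work_dir() (
    dir=${1:-work}
    if [ ! -d "$dir" ]; then
        echo "no work directory at $dir" >&2
        exit 1
    fi
    du -sh "$dir"      # show how much space will be freed
    rm -rf -- "$dir"   # irreversible: only run once the outputs are safe
)
```

Run `clean_work_dir` from the launch directory only after confirming that the pipeline outputs have been copied out of the work directory. Nextflow also provides its own clean subcommand, listed under nextflow help.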
WIP: Using common pulsar software (Setonix)
On Setonix, the common pulsar software packages are provided as Singularity images (`.sif` files), which can be used interactively or used to execute routines on the command line. There are two ways in which these software containers can be used: directly (as was previously done on Garrawarla), or via the module system. Both are described below.
There are three containers, all of which are located in `/software/projects/mwavcs/setonix/2024.05/containers/`. Specifically:
- psr-timing: contains the standard pulsar timing software packages, tempo, tempo2 and PINT (along with updated clock files from when the image was produced).
- psr-search: contains the most commonly used searching software packages, PRESTO and riptide
- psr-analysis: contains the most commonly used analysis tools and utilities, PSRCHIVE, DSPSR, PSRSALSA, clfd and PulsePortraiture
If there are missing packages in any of the above containers, please let someone know via Slack or email so that we can re-build and include what you need. (Although be prepared to answer any compilation/installation questions that may arise!)
You can see the Dockerfiles used here: https://github.com/CIRA-Pulsars-and-Transients-Group/pulsar-software-containers/tree/main
Directly using the containers
First, we need to load the Singularity module,
module use /software/projects/mwavcs/setonix/2024.05/modules/zen3/gcc/12.2.0
module load singularity/4.1.0-mwavcs
You can then execute a command via the container of your choice, like
singularity exec /software/projects/mwavcs/setonix/2024.05/containers/<container>.sif <command>
and, if you need an interactive X-Window session to appear, then you will need to add -B ~/.Xauthority after the exec command and before the path to the container.
Using the containers via modules
We can also make the singularity containers "look" like modules, with all the software available on your regular command line (rather than needing to execute a Singularity pre-command). Nominally, if you follow the below steps, you should be able to see the three containers installed as system modules for our group.
module use /software/projects/mwavcs/setonix/2024.05/modules/zen3/gcc/12.2.0
module av psr-
which should then list the psr-timing, psr-search and psr-analysis modules. These modules can then be loaded as usual. For example, if you load the psr-analysis/24-11-07 module, you will then have access to various post-processing packages from the command line.
These modules can be used as normal within job submissions via SLURM as well.
Downloading data
Using the ASVO Web Application
MWA observation data are accessed via the ASVO. The simplest way to download an observation is via the ASVO Web Application. To find an observation, navigate to the Observations page and use the Search menu filters. To browse all SMART observations, find the Project option in the Advanced Search menu and select project G0057.
For VCS-mode observations, you will need to specify the time offset from the start of the observation and the duration of time to download. The following example shows the Job menu for downloading 600 seconds at the start of an observation.
For correlator-mode observations, such as calibrator observations, navigate to the Visibility Download Job tab and select the delivery location as /scratch. An example is shown below.
Once the jobs are submitted, they can be monitored via the My Jobs page, which will display the ASVO Job ID and the status of the download. The downloaded data will appear in /scratch/mwavcs/asvo/<ASVO Job ID>.
Using Giant Squid
ASVO jobs can also be submitted from the command line using Giant Squid. To use Giant Squid, you must first set your ASVO API key as an environment variable. You can find your API key in your Profile page on the ASVO Web Application. Then add the following to your .bashrc:
export MWA_ASVO_API_KEY=<api key>
To run Giant Squid, add the following alias to your .bashrc:
alias giant-squid='module load singularity; singularity exec /pawsey/mwa/singularity/giant-squid/giant-squid_latest.sif'
VCS download jobs can then be submitted with the submit-volt subcommand as follows:
giant-squid submit-volt --delivery scratch --offset <offset> --duration <duration> <obs ID>
Visibility observation download jobs (e.g., for calibration) can be submitted using the submit-vis subcommand:
giant-squid submit-vis --delivery scratch <obs ID>
To monitor the status of your ASVO jobs, you can use the list subcommand:
giant-squid list
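To make the placeholders in the submit-volt command concrete, the sketch below prints the command for downloading the first 600 seconds of an observation. The volt_cmd function is a hypothetical helper that only checks its arguments and prints the command; it does not contact the ASVO. The observation ID is the one used in the recombine example later in this documentation.

```shell
# volt_cmd <obs ID> <offset> <duration>
# Hypothetical helper that prints a Giant Squid VCS download command after
# checking that the offset and duration are non-negative whole seconds.
volt_cmd() (
    obsid=$1 offset=$2 duration=$3
    for value in "$offset" "$duration"; do
        case "$value" in
            ''|*[!0-9]*) echo "offset and duration must be integers" >&2; exit 1 ;;
        esac
    done
    echo "giant-squid submit-volt --delivery scratch --offset $offset --duration $duration $obsid"
)

# First 600 seconds of observation 1261241272
volt_cmd 1261241272 0 600
```

The printed line can be pasted into a shell where the giant-squid alias is defined.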
The standard directory structure
Due to the large data volume of VCS observations and to assist with the automation of many of the repetitive processing steps, we use a standard directory structure to store downloaded data. Data on Garrawarla are currently stored on the /scratch partition under the /scratch/mwavcs group directory. In the past, downloaded data have been stored in a shared directory; however, pooling all users' downloads into one directory can be messy and difficult to manage. Therefore, VCS downloads should be stored in the user's personal VCS directory, i.e. /scratch/mwavcs/${USER}/<obs ID>. Within this directory, the VCS data (.sub or .dat files) should be stored in a subdirectory called raw or combined (depending on whether the data have been combined), while the metafits file of the VCS observation should remain in the parent directory. A second subdirectory called cal should contain a directory for each calibration observation, within which are the visibilities and the calibration metafits. Within each calibrator's directory, the calibration solutions are stored within a subdirectory called hyperdrive. Beamformed data are stored under a directory called pointings, organised by source name. This is summarised below:
- /scratch/mwavcs/${USER}/<obs ID>
  - /raw
    - Raw VCS data (.dat files)
  - /combined
    - Combined VCS data (.sub or .dat files)
  - /cal
    - <cal ID>
      - Visibilities (.fits files)
      - <cal ID>.metafits
      - /hyperdrive
        - Calibration solution
    - ...
    - <cal ID>
  - /pointings
    - <Pointing 1>
    - ...
  - <obs ID>.metafits
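The empty directory skeleton described above can be created in one step. The following is a minimal sketch: make_vcs_dirs is a hypothetical helper, not part of VCSTools or VCSBeam, and it creates only the directories, not the data or metafits files.

```shell
# make_vcs_dirs <root> <obs ID> [cal ID ...]
# Create the standard directory skeleton for one VCS observation and
# print the path of the observation directory.
make_vcs_dirs() (
    root=$1 obsid=$2
    shift 2
    base="$root/$obsid"
    mkdir -p "$base/raw" "$base/combined" "$base/cal" "$base/pointings"
    # One directory per calibration observation, each with a hyperdrive
    # subdirectory for the calibration solutions
    for calid in "$@"; do
        mkdir -p "$base/cal/$calid/hyperdrive"
    done
    echo "$base"
)
```

On Garrawarla this would be called as `make_vcs_dirs "/scratch/mwavcs/${USER}" <obs ID> <cal ID>`.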
Calibration
Finding a calibration observation
In order to form a coherent tied-array beam, the antenna delays and gains need to be calibrated. This is commonly done using a dedicated observation of a bright calibrator source such as Centaurus A or Hercules A. Calibrator observations are taken in correlator mode and stored as visibilities. To find a calibration observation, either search the MWA archive using the ASVO Web Application, or run the following command from VCSTools:
module load vcstools
mwa_metadb_utils.py -c <obs ID>
This will produce a list of calibrator observations, ranked based on their distance in time to the VCS observation.
Preprocessing with Birli
Although Hyperdrive accepts the raw visibilities, it is beneficial to downsample the time and frequency resolution to reduce the data size and improve the calibration quality. Preprocessing of FITS data can be done with Birli, which has a variety of options for flagging and averaging data (refer to the Birli help information). In general, 2 sec / 40 kHz resolution is a good starting point.
module load birli
birli --metafits <METAFITS_FILE> --uvfits-out <UVFITS_FILE> --avg-freq-res <FREQ_RES> --avg-time-res <TIME_RES>
Birli is quite resource intensive, and should be given adequate memory to run. An example Slurm script is given below, which should be executed from within the /scratch/mwavcs/${USER}/<obs ID>/cal/<cal ID>/hyperdrive directory, with the FITS files in the /scratch/mwavcs/${USER}/<obs ID>/cal/<cal ID> directory.
This will produce <cal ID>_birli.uvfits in the parent directory, which can be given to Hyperdrive to calibrate.
Calibrating with Hyperdrive
Hyperdrive is the latest generation calibration software for the MWA. It is written in Rust and uses GPU acceleration, making it several times faster than previous MWA calibration software. See the documentation for further details.
We use Hyperdrive to perform direction-independent (DI) calibration using a sky model provided by the user. The more accurately the sky model reflects the input data, the better the convergence of the calibration solution. The sky model source list can either be extracted from an all-sky catalogue such as GLEAM, or from a dedicated model for a well-characterised source such as Centaurus A. To compile a list of 1000 sources within the beam from the GLEAM Galactic Sky Model and save it as srclist_1000.yaml, use Hyperdrive's srclist-by-beam subcommand as shown below.
hyperdrive srclist-by-beam --metafits <METAFITS_FILE> --number 1000 /software/projects/mwavcs/GGSM_updated_2023Oct23.txt srclist_1000.yaml
Alternatively, you can browse a list of dedicated source models here: /pawsey/mwa/software/python3/mwa-reduce/mwa-reduce-git/models.
Calibration is performed using Hyperdrive's di-calibrate subcommand, as shown below.
hyperdrive di-calibrate --source-list <SRC_LIST> --data <UVFITS_FILE> <METAFITS_FILE> --fine-chan-flags-per-coarse-chan <CHAN_FLAGS>
Note that for 40 kHz frequency channels, we typically flag channels 0 1 16 30 31 to zero-weight the channels at the edges and centre of the bandpass response. This will produce hyperdrive_solutions.fits. To inspect the solution, use Hyperdrive's solutions-plot subcommand to generate two PNG images (for the amplitudes and phases) as shown below.
hyperdrive solutions-plot --metafits <METAFITS_FILE> hyperdrive_solutions.fits
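The channel flags quoted above for 40 kHz channels can be derived from the channel width. The sketch below assumes that the same rule, flagging 80 kHz at each edge of the 1.28 MHz coarse channel plus the single centre channel, carries over to other resolutions. That generalisation is an assumption and is not stated in this documentation. The chan_flags function is a hypothetical helper.

```shell
# chan_flags <fine channel width in kHz> [edge width in kHz]
# Print the fine channels to flag within each 1.28 MHz coarse channel:
# the edge channels on both sides plus the centre channel.
chan_flags() (
    res_khz=$1
    edge_khz=${2:-80}
    nchan=$(( 1280 / res_khz ))       # fine channels per coarse channel
    nedge=$(( edge_khz / res_khz ))   # channels flagged at each edge
    flags=""
    i=0
    while [ "$i" -lt "$nedge" ]; do
        flags="$flags $i"
        i=$(( i + 1 ))
    done
    flags="$flags $(( nchan / 2 ))"
    i=$(( nchan - nedge ))
    while [ "$i" -lt "$nchan" ]; do
        flags="$flags $i"
        i=$(( i + 1 ))
    done
    echo "${flags# }"
)

chan_flags 40   # prints: 0 1 16 30 31
```

The output can be passed to --fine-chan-flags-per-coarse-chan in place of <CHAN_FLAGS>.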
An example calibration solution is shown below.
The legend is shown in the upper right of each figure. The last (non-flagged) tile is the reference tile, unless a reference tile is selected with --ref-tile. All other tiles' solutions are divided by the reference's solution before plotting. The gains of the X and Y polarisations of each antenna, gx and gy, are plotted along with the leakage terms Dx and Dy. In the amplitudes plot, the gains should be flat across the band with a value of around 1, and the leakage terms should be around 0. In the phases plot, the gains should exhibit a linear ramp with frequency and the leakage terms should be randomly distributed. Tiles which deviate far from these behaviours should be flagged by providing a space-separated list to the --tile-flags option in di-calibrate. Also note that the plot range covers the entire data range by default, which is often dominated by bad tiles. To rescale the plots you can either flag the bad tiles or use the --max-amp and --min-amp options in solutions-plot.
Some examples of less-than-ideal calibration solutions are provided here:
Once we are satisfied with the solution, the FITS solutions file must be converted to Offringa format in order for VCSBeam to use it. This can be done with Hyperdrive's solutions-convert subcommand as shown below.
hyperdrive solutions-convert --metafits <METAFITS_FILE> hyperdrive_solutions.fits hyperdrive_solutions.bin
An example Slurm script is given below, which should be executed from within the /scratch/mwavcs/${USER}/<obs ID>/cal/<cal ID>/hyperdrive directory.
The execution time for Hyperdrive will depend on the size of the input files, the source list, and the resources allocated. Using the downsampled UVFITS as input and 1000 sources, this job ran on Garrawarla in under 2 minutes.
Recombining Legacy VCS Data
If you are using Garrawarla or Setonix, there is a Nextflow pipeline that is available to make recombining data simple. To use it, run:
nextflow run cplee1/vcs_recombine --obsid <OBSID> --download_dir <DOWNLOAD_DIR> [--vcs_dir <VCS_DIR>]
The pipeline will look in the directory specified with the --download_dir option and determine what data is available. The combined data will be placed in <VCS_DIR>/<OBSID>/combined. The metafits file (with the name <OBSID>_metafits_ppds.fits) should exist in either the download directory or the <VCS_DIR>/<OBSID> directory. The default for --vcs_dir is /scratch/mwavcs/$USER.
For example:
nextflow run cplee1/vcs_recombine --obsid 1261241272 --download_dir /scratch/mwavcs/asvo/123456
In this case, the pipeline would recombine the files in /scratch/mwavcs/asvo/123456 and place them in /scratch/mwavcs/$USER/1261241272/combined.
By default, the pipeline will submit one job per 32 seconds of data, so a full SMART observation will be split into 150 jobs. Each job should take ~10-20 min to complete.
The pipeline does not delete the raw data. Therefore, to save disk space, it is recommended that you manually delete the raw data as soon as the pipeline has successfully completed.
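The job count quoted above follows from dividing the observation length by the 32-second chunk size. As a rough sketch, the hypothetical helper below estimates the number of jobs for other durations. It assumes a partial final chunk is rounded up to its own job, which is not stated in this documentation, and the 4800-second length of a full SMART observation is inferred from 150 jobs of 32 seconds each.

```shell
# n_recombine_jobs <duration in seconds> [seconds per job]
# Estimate the number of recombine jobs, rounding up so that a partial
# final chunk counts as one job (assumed behaviour).
n_recombine_jobs() {
    echo $(( ($1 + ${2:-32} - 1) / ${2:-32} ))
}

n_recombine_jobs 4800   # prints: 150
n_recombine_jobs 600    # prints: 19
```

This can be useful for judging how long the queue may take to drain before the raw data can be deleted.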
Beamforming
Finding a target
Compiling a list of pulsars to beamform on can be done with the find_pulsar_in_obs.py script in vcstools. On Garrawarla, this can be loaded as follows:
module load vcstools
To find all of the pulsars within a given observation, the syntax is as follows:
find_pulsar_in_obs.py -o <obs ID>
To find all of the observations associated with a given source, you can either provide a pulsar J name or the equatorial coordinates:
find_pulsar_in_obs.py -p <Jname>
find_pulsar_in_obs.py -c <RA_DEC>
All of the above options also accept space-separated lists of arguments. For example, given a list of pulsars and a list of observations, to find which observations contain which pulsars, run the following:
find_pulsar_in_obs.py -o <obs ID 1> <obs ID 2> ... <obs ID N> -p <Jname 1> <Jname 2> ... <Jname N>
Note: For MWAX VCS observations, you must include the --all_volt option.
VCSBeam requires pointings to be specified in equatorial coordinates. To find these coordinates for a catalogued pulsar, you can use the ATNF catalogue's command line interface:
singularity exec /pawsey/mwa/singularity/psr-analysis/psr-analysis.sif psrcat -e3 -c "raj decj" <Jname>
Tied-array beamforming with VCSBeam
VCSBeam is the successor to the legacy VCS tied-array beamformer. It is capable of processing both legacy and MWAX data into any of the following formats:
- fine-channelised (10 kHz) full-Stokes time series in PSRFITS format (-p option)
- fine-channelised (10 kHz) Stokes I time series in PSRFITS format (-N option)
- coarse-channelised (1.28 MHz) complex voltage time series in VDIF format (-v option)
The coarse-channelised output is made possible by the inverse PFB, which reverses the 10 kHz channelisation of the initial PFB stage. To run the tied-array beamformer, we use the make_mwa_tied_array_beam program, which is MPI-enabled to process multiple coarse channels in a single Slurm job. Further details about this command can be found here.
An example Slurm script is given below, where 24 channels are being processed across 24 nodes. In this case, the -p option is used to specify PSRFITS output.
For VDIF output, the beamforming can be split into multiple smaller jobs using a Slurm job array. This can save time if the Slurm queue is busy. An example is given below for channels 109 to 132, where in this case the -v option is used to specify VDIF output.
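As a minimal sketch of how a job array maps onto coarse channels, the fragment below uses the array index directly as the channel number. It shows only the mapping and is not a complete beamforming script: the make_mwa_tied_array_beam call and its options are omitted, and the fallback value is an assumption used only for testing outside Slurm.

```shell
#!/bin/bash
#SBATCH --array=109-132

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; with the directive
# above it runs from 109 to 132, so it can be used as the coarse channel
# number directly. The fallback of 109 is only for testing outside Slurm.
chan=${SLURM_ARRAY_TASK_ID:-109}
echo "This task would beamform coarse channel ${chan}"
```

Each of the 24 array tasks is scheduled independently, which is why this approach can start sooner than a single 24-node job when the queue is busy.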