Data Processing Documentation
Introduction
Since the initial creation of the legacy documentation, there have been significant changes to the VCS data processing workflow due to the introduction of the MWAX correlator and the new beamforming software, VCSBeam. The purpose of this documentation is to provide an updated description of the processing workflow. Please direct any questions or suggestions to @Christopher Lee or @Bradley Meyers.
This documentation deals with data collected with the MWA's Voltage Capture System (VCS), which is described in Tremblay et al. (2015). The MWA tied-array processing paper series provides details about calibration, beamforming, and signal processing – see Ord et al. (2015), Xue et al. (2019), McSweeney et al. (2020), and Swainston et al. (2022). Additionally, the VCSBeam documentation provides some theory and examples regarding calibration and beamforming.
While not necessary, it will be useful to have a basic understanding of radio interferometry when reading this documentation, so this course may be a useful resource. Additionally, the PRESTO search tutorial provides a good introduction to the software tools available in PRESTO.
Table of Contents
Software Usage on Setonix
VCS data processing is currently performed primarily on Pawsey's Setonix supercomputer. We refer you to the Setonix User Guide for the complete details on the software stack and how to run jobs. Relevant details are provided here.
Modules
Members of the mwavcs Pawsey group will have access to the common software used to calibrate and reduce VCS data. These can be browsed with:
```
module avail
```
To load a module, the user must specify the module name and version like so:
```
module load <module>/<version>
```
Using Common Pulsar Software
On Setonix, the common pulsar software packages are provided as singularity images (`.sif` files), which can be used interactively or used to execute routines on the command line. There are two ways in which these software containers can be used: directly (as was previously done on Garrawarla), or via the module system - we'll describe both below.
There are three containers, all of which are located in `/software/projects/mwavcs/setonix/2024.05/containers/`. Specifically:
- psr-timing: contains the standard pulsar timing software packages, `tempo`, `tempo2` and `PINT` (along with updated clock files from when the image was produced).
- psr-search: contains the most commonly used searching software packages, `PRESTO` and `riptide`.
- psr-analysis: contains the most commonly used analysis tools and utilities, `PSRCHIVE`, `DSPSR`, `PSRSALSA`, `clfd` and `PulsePortraiture`.
If there are missing packages in any of the above containers, please let someone know via Slack or email so that we can re-build and include what you need. (Although be prepared to answer any compilation/installation questions that may arise!)
You can see the Dockerfiles used here: https://github.com/CIRA-Pulsars-and-Transients-Group/pulsar-software-containers/tree/main
Directly using the containers
First, we need to load the Singularity module,
Loading the Singularity module
```
module use /software/projects/mwavcs/setonix/2024.05/modules/zen3/gcc/12.2.0
module load singularity/4.1.0-mwavcs
```
You can then execute a command via the container of your choice, like
Execute a command via a Singularity container
```
singularity exec /software/projects/mwavcs/setonix/2024.05/containers/<container>.sif <command>
```
and, if you need an interactive X-Window session to appear, then you will need to add `-B ~/.Xauthority` after the `exec` command and before the path to the container.
Using the containers via modules
We can also make the singularity containers "look" like modules, with all the software available on your regular command line (rather than needing to execute a Singularity pre-command). Nominally, if you follow the steps below, you should be able to see the three containers installed as system modules for our group.
Using Singularity containers as modules
```
module use /software/projects/mwavcs/setonix/2024.05/modules/zen3/gcc/12.2.0
module av psr-
```
which should then show something like
and then these modules can be loaded as per usual. For example, if you load the psr-analysis/24-11-07 module, you will then have access to various post-processing packages from the command line. e.g.,
These modules can be used as normal within job submissions via SLURM as well.
Using Nextflow
Nextflow is a workflow management tool, described here. It is extremely useful for constructing data processing pipelines, including our own mwa_search pipeline used for the SMART survey.
On Setonix
Nextflow can be loaded with:
```
module load nextflow
```
Manual Installation
Nextflow requires Java to run. Installation instructions can be found here.
Following Nextflow's installation docs, start by downloading Nextflow with curl:
```
curl -s https://get.nextflow.io | bash
```
Then make Nextflow executable with chmod:
```
chmod +x nextflow
```
Move nextflow into a directory in your PATH so that it can be executed from anywhere.
Run nextflow self-update to upgrade to the latest version. This will update the Nextflow executable in-place. To see the available subcommands, run nextflow help.
Typical Usage
For local pipelines, the basic syntax is:
```
nextflow run pipeline.nf [PIPELINE OPTIONS]
```
where pipeline.nf contains an entry workflow (i.e. a workflow with no name).
For pipelines hosted on GitHub and configured for pipeline sharing, you can simply run the pipeline with:
```
nextflow run <username>/<repository> [-r <branch>] [PIPELINE OPTIONS]
```
where -r <branch> allows you to optionally specify a Git branch to use.
If the pipeline receives updates on GitHub, you can update your locally cached version of the pipeline with:
```
nextflow pull [-r <branch>] <username>/<repository>
```
If the pipeline name is unique among your locally cached pipelines, you can omit the GitHub username on execution.
As a simple example, a typical workflow could look like:
```
nextflow pull cplee1/vcs_recombine
nextflow run vcs_recombine [PIPELINE OPTIONS]
```
In general, you should execute Nextflow within a screen session so that it can run separately from your terminal session. On Setonix, pipelines must be run on a dedicated workflow login node. A typical workflow would be:
```
# Create a session
screen -S pipeline
# Start the pipeline
nextflow run pipeline.nf
# Detach from the session with Ctrl+A then D
# View all of your running screen sessions
screen -ls
# Reattach to your session
screen -r pipeline
```
Once the pipeline has completed and you no longer need access to the intermediate files, you should delete the work directory to save disk space. The location of the work directory differs depending on the pipeline, but by default it will be placed in the directory from which the pipeline was launched.
Downloading Data
Checking whether data is on disk already
Sometimes observations may already exist on disk. On Setonix, the data will be located in /scratch/mwavcs/asvo/<ASVO Job ID>. To check for data from a particular observation ID, you can simply run:
```
find /scratch/mwavcs/asvo -type f -name "${OBSID}_*" | sed -r 's|/[^/]+$||' | sort | uniq
```
which will return the directories containing the data file(s). (Here, `OBSID` is a shell variable set to the observation ID; note the double quotes, which allow the variable to expand while the `*` is still passed to find.)
Using the ASVO Web Application
MWA observation data are accessed via the ASVO. The simplest way to download an observation is via the ASVO Web Application. To find an observation, navigate to the Observations page and use the Search menu filters. To browse all SMART observations, find the Project option in the Advanced Search menu and select project G0057.
For VCS-mode observations, you will need to specify the time offset from the start of the observation and the duration of time to download. The following example shows the Job menu for downloading 600 seconds at the start of an observation.
For correlator-mode observations, such as calibrator observations, navigate to the Visibility Download Job tab and select the delivery location as /scratch. An example is shown below.
Once the jobs are submitted, they can be monitored via the My Jobs page, which will display the ASVO Job ID and the status of the download. The downloaded data will appear in /scratch/mwavcs/asvo/<ASVO Job ID>.
Using Giant Squid
Submitting ASVO jobs can also be done from the command line using Giant Squid. To use Giant Squid, you must first set your ASVO API key as an environment variable. You can find your API key in your Profile page on the ASVO Web Application. Then add the following to your .bashrc:
```
export MWA_ASVO_API_KEY=<api key>
```
On Setonix, Giant Squid can be loaded with:
```
module load giant-squid/<version>
```
VCS download jobs can be submitted with the submit-volt subcommand:
```
giant-squid submit-volt --delivery scratch --offset <offset> --duration <duration> <obs ID>
```
Visibility observation download jobs (e.g., for calibration) can be submitted using the submit-vis subcommand:
```
giant-squid submit-vis --delivery scratch <obs ID>
```
To monitor the status of your ASVO jobs, you can use the list subcommand:
```
giant-squid list
```
Data Directory Structure
On Setonix, we store VCS data on the scratch partition in user directories. Typically the file structure is the following:
```
${MYSCRATCH}/<obs ID>/
├── raw/                  Raw VCS data, legacy only (.dat files)
├── combined/             Combined VCS data (.sub or .dat files)
├── cal/
│   └── <cal ID>/         Visibilities (GPU Box or MWAX correlator .fits files)
│       ├── <cal ID>.metafits
│       └── hyperdrive/   Calibration solution (.fits and .bin files)
│                         Calibration solution diagnostic plots (phases and amplitudes)
│                         ...
├── pointings/
│   └── <Pointing 1>/     PSRFITS search-mode output (.fits), and/or
│                         VDIF voltage beam output (.vdif and .hdr)
│   ...
└── <obs ID>.metafits
```
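When starting on a new observation, the layout above can be created up front. A minimal sketch, assuming hypothetical observation and calibrator IDs (substitute your own values):

```shell
# Create the standard per-observation directory layout (sketch).
# The OBSID/CALID values below are placeholders, not real IDs to copy.
OBSID=1261241272
CALID=1261199336
base="${MYSCRATCH:-/tmp}/${OBSID}"
mkdir -p "${base}/raw" "${base}/combined" "${base}/pointings" \
         "${base}/cal/${CALID}/hyperdrive"
find "${base}" -type d | sort
```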
Unpacking recombined data
In some cases, legacy VCS observations have already been recombined and re-uploaded to the MWA archive. In this case, you will have downloaded one .tar file and one .ics file per second recorded, in addition to the usual metadata. An efficient way to extract the voltage channel files in parallel is to use the following script.
If successful, there should be 24 .dat files per .tar file, all of which should have identical sizes.
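A quick way to verify this is to count the extracted files and their distinct sizes; a sketch, to be run in the directory containing the extracted .dat files:

```shell
# Count .dat files and the number of distinct file sizes.
# For a complete, uncorrupted extraction there should be
# 24 x (number of .tar files) .dat files and exactly 1 distinct size.
ndat=$(find . -maxdepth 1 -name '*.dat' | wc -l)
nsizes=$(find . -maxdepth 1 -name '*.dat' -printf '%s\n' | sort -u | wc -l)
echo "${ndat} .dat files, ${nsizes} distinct size(s)"
```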
This operation extracts the raw data files from the tarballs but DOES NOT delete them; it therefore temporarily doubles the total data volume of the observation. You will need to manually delete the tarballs after verifying that the extraction completed successfully.
Calibration
Finding a calibration observation
In order to form a coherent tied-array beam, the antenna delays and gains need to be calibrated. This is commonly done using a dedicated observation of a bright calibrator source such as Centaurus A or Hercules A. Calibrator observations are taken in correlator mode and stored as visibilities. To find a calibration observation, either search the MWA archive using the ASVO Web Application, or run the following command from VCSTools:
```
mwa_metadb_utils.py -c <obs ID>
```
This will produce a list of calibrator observations, ranked based on their separation in time from the VCS observation.
VCSTools is not currently installed on Setonix. (As of January 2025)
Preprocessing with Birli
Although Hyperdrive accepts the raw visibilities, it is beneficial to downsample the time and frequency resolution to reduce the data size and improve the calibration quality. Preprocessing of raw visibilities can be done with Birli, which has a variety of options for flagging and averaging data (refer to the Birli help information). In general, 2 sec / 40 kHz resolution is a good starting point. Birli can be loaded on Setonix with:
```
module load birli/<version>
```
The following demonstrates the typical usage:
```
birli --metafits <METAFITS_FILE> --uvfits-out <UVFITS_FILE> --avg-freq-res <FREQ_RES> --avg-time-res <TIME_RES>
```
In general, it is also a good idea to specify the maximum allowed RAM for Birli to use with the --max-memory option, which accepts a number corresponding to how many GB of RAM to allocate. Otherwise, Birli will attempt to load the entire raw visibility data set, which often requires >100 GB of memory.
On Setonix, specifying the maximum allowed memory is particularly important, as jobs will be terminated prematurely if they use more memory than was allocated based on the number of CPUs in the resource request. Approximately 1.79 GB of RAM is allocated per CPU core on the work/mwa partitions.
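Inside a Slurm script, the memory cap can be derived from the CPU request rather than hard-coded. A sketch, using the ~1.79 GB/core figure quoted above (rounded down to 1.7 for headroom):

```shell
# Derive a safe --max-memory value (in GB) from the Slurm CPU allocation,
# assuming ~1.79 GB RAM per core on the work/mwa partitions.
cpus=${SLURM_CPUS_PER_TASK:-8}   # fall back to 8 cores if unset
max_mem=$(awk -v c="${cpus}" 'BEGIN { printf "%d", c * 1.7 }')
echo "Using --max-memory ${max_mem}"
```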
Example Birli job on Setonix
This script would be submitted from the ${MYSCRATCH}/<obs ID>/cal/<cal ID>/hyperdrive directory.
Calibrating with Hyperdrive
Hyperdrive is the latest generation calibration software for the MWA. It is written in Rust and uses GPU acceleration, making it several times faster than previous MWA calibration software. See the documentation for further details. Hyperdrive can be loaded on Setonix with:
```
module load hyperdrive/<version>
```
We use Hyperdrive to perform direction-independent (DI) calibration using a sky model provided by the user. The more accurately the sky model reflects the input data, the better the convergence of the calibration solution. The sky model source list can be extracted from an all-sky catalogue such as GLEAM, or one can use a dedicated model for a well-characterised source such as Centaurus A. To compile a list of 1000 sources within the beam from the GLEAM Galactic Sky Model and save it as srclist_1000.yaml, use Hyperdrive's srclist-by-beam subcommand as shown below:
```
hyperdrive srclist-by-beam --metafits <METAFITS_FILE> --number 1000 /software/projects/mwavcs/GGSM_updated_2023Oct23.txt srclist_1000.yaml
```
Calibration is performed using Hyperdrive's di-calibrate subcommand, as shown below.
```
hyperdrive di-calibrate --source-list <SRC_LIST> --data <UVFITS_FILE> <METAFITS_FILE> --fine-chan-flags-per-coarse-chan <CHAN_FLAGS>
```
Note that for 40 kHz frequency channels, we typically flag channels 0 1 16 30 31 to zero-weight the channels at the edges and centre of the bandpass response.
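These indices follow directly from the channelisation. A sanity-check sketch, assuming the standard 1.28 MHz coarse channel width:

```shell
# A 1.28 MHz coarse channel splits into 40 kHz fine channels
nfine=$(( 1280 / 40 ))   # 32 fine channels per coarse channel
centre=$(( nfine / 2 ))  # centre (DC) channel: 16
# Flag two channels at each band edge plus the centre channel
echo "0 1 ${centre} $(( nfine - 2 )) $(( nfine - 1 ))"
```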
Running di-calibrate will produce hyperdrive_solutions.fits. To inspect the solution, use Hyperdrive's solutions-plot subcommand to generate two PNG images (for the amplitudes and phases), as shown below.
```
hyperdrive solutions-plot --metafits <METAFITS_FILE> hyperdrive_solutions.fits
```
An example calibration solution is shown below. The legend is shown in the upper right of each figure. The last (non-flagged) tile is the reference tile, unless one is selected with --ref-tile. All other tiles' solutions are divided by the reference tile's solution before plotting. The gains of the X and Y polarisations of each antenna, gx and gy, are plotted along with the leakage terms Dx and Dy. In the amplitudes plot, the gains should be flat across the band with a value of around 1, and the leakage terms should be around 0. In the phases plot, the gains should exhibit a linear ramp with frequency, and the leakage terms should be randomly distributed. Tiles that deviate far from these behaviours should be flagged by providing a space-separated list to the --tile-flags option in di-calibrate. Also note that, by default, the plot range covers the entire data range, which is often dominated by bad tiles. To rescale the plots, either flag the bad tiles or use the --max-amp and --min-amp options in solutions-plot.
Some examples of less-than-ideal calibrations solutions are provided here:
Once we are satisfied with the solution, the FITS solutions file must be converted to Offringa format in order for VCSBeam to use it. This can be done with Hyperdrive's solutions-convert subcommand as shown below.
```
hyperdrive solutions-convert --metafits <METAFITS_FILE> hyperdrive_solutions.fits hyperdrive_solutions.bin
```
Example Hyperdrive job on Setonix
The following script will run Hyperdrive as described above. The script must be executed from the ${MYSCRATCH}/<obs ID>/cal/<cal ID>/hyperdrive directory.
On a Setonix GPU node, requesting a GPU actually reserves an "allocation-pack" - see https://pawsey.atlassian.net/wiki/spaces/US/pages/51928618/Setonix+GPU+Partition+Quick+Start
This means that for each GPU requested we have 8 CPU cores and ~25 GB RAM. In the case of running Hyperdrive, this memory needs to be able to store the entire UVFITS output from Birli.
If the UVFITS is larger than ~25 GB then you will need to use 2 “allocation-packs” worth of resources. This can be achieved by changing --gres=gpu:1 to --gres=gpu:2 in the above example script.
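In other words, the number of allocation-packs can be estimated from the UVFITS size. A sketch, using the ~25 GB/pack figure from the note above:

```shell
# Estimate how many GPU "allocation-packs" are needed to hold a UVFITS
# in RAM, at roughly 25 GB per pack (ceiling division).
uvfits_gb=40                          # example input size in GB
packs=$(( (uvfits_gb + 24) / 25 ))
echo "Request --gres=gpu:${packs}"
```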
Recombining Legacy VCS Data
If you are using Setonix, there is a Nextflow pipeline that is available to make recombining data simple. To use it, run:
```
nextflow run cplee1/vcs_recombine --obsid <OBSID> --download_dir <DOWNLOAD_DIR> [--vcs_dir <VCS_DIR>]
```
The pipeline will look in the directory specified with the --download_dir option and determine what data are available. The combined data will be placed in <VCS_DIR>/<OBSID>/combined. The metafits file (with the name <OBSID>_metafits_ppds.fits) should exist in either the download directory or the <VCS_DIR>/<OBSID> directory. The default for --vcs_dir is /scratch/mwavcs/$USER.
For example:
```
nextflow run cplee1/vcs_recombine --obsid 1261241272 --download_dir /scratch/mwavcs/asvo/123456
```
In this case, the pipeline would recombine the files in /scratch/mwavcs/asvo/123456 and place them in /scratch/mwavcs/$USER/1261241272/combined.
By default, the pipeline submits one job per 32 seconds of data, so a full SMART observation will be split into 150 jobs. Each job should take ~10-20 min to complete.
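The job count follows from the observation length; for instance, assuming the standard 4800 s (80 min) SMART observation:

```shell
# One recombine job is submitted per 32 s of data
obs_len=4800               # seconds in a full SMART observation
njobs=$(( obs_len / 32 ))
echo "${njobs} jobs"       # -> 150 jobs
```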
The pipeline does not delete the raw data. Therefore, to save disk space, it is recommended that you manually delete the raw data as soon as the pipeline has successfully completed.
This workflow does NOT untar data files, so you should only run this once you’re confident the legacy data have been extracted into the --download_dir location.
Beamforming
Finding a target
Pulsars can be located in the MWA primary beam using the mwa-source-finder, a pip-installable Python package with a command line interface. Python environments on Setonix are managed at the user level, so this software should be installed into a Python virtual environment. For example:
```
module load python/3.11.6
module load hyperbeam/0.9.3
module load py-numpy/1.26.1
python -m venv /path/to/myenv
source /path/to/myenv/bin/activate
python -m pip install git+https://github.com/cplee1/mwa_source_finder.git
```
Example 1: Finding all pulsars in an observation
```
source-finder -o 1221399680
```
To be specific, the software defines "in the beam" as the zenith-normalised beam power towards the source exceeding a specified minimum (by default 0.3, i.e. 30%). By default, it evaluates the primary beam model at the centre frequency of the observation, but this can be changed with --freq_mode {low/centre/high/multi}.
Example 2: Finding all observations for a pulsar/source
```
source-finder -s J2241-5236 --obs_for_source [--filter_available]
```
The --filter_available option will exclude observations with no data files available. The source(s) can be specified as pulsar B/J names, or RA/Dec coordinates in sexagesimal or decimal form. The following are all acceptable: "B1451-68" "J1456-6843" "14:55:59.92_-68:43:39.50" "223.999679_-68.727639".
Example 3: A more complex query to find pulsars in an observation
```
source-finder -o 1221399680 --start 0.1 --end 0.5 --min_power 0.5 --condition 'P0<0.03 && DM<100' --time_plot
```
The above example will search for pulsars matching the given PSRCAT query that exceed a zenith-normalised beam power of 0.5 between fractional observation times 0.1 and 0.5. It will also produce a plot of the source power traces over time.
Pulsar coordinates can be obtained from the ATNF Pulsar Catalogue web portal. The coordinates should be written to a text file (RA and Dec in sexagesimal format separated by a space, one set of coordinates per line) to give to VCSBeam.
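For example, a pointings file with two look-directions might be written as follows (the coordinates are illustrative; the first pair is the B1451-68 position quoted above):

```shell
# Write a pointings file in the format VCSBeam expects:
# RA and Dec in sexagesimal, space-separated, one pointing per line.
cat > pointings.txt <<EOF
14:55:59.92 -68:43:39.50
22:41:42.00 -52:36:36.00
EOF
wc -l pointings.txt
```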
Generating a grid of pointings
Sometimes it is necessary or desirable to form not only a single tied-array beam at a target, but also some close-packed (possibly overlapping) pointings to help with follow-up and/or to mitigate potential ionospheric offsets. There is an easy-to-use Python package here (mwa-tab-gridder) which can generate these in an optimal concentric centred-hexagonal pattern. It is reasonably flexible, and can be provided an MWA metafits file, which it will use to better estimate the tied-array beam size to ensure better packing and overlap properties.
Example: Forming a ring of hexagonally-configured, overlapping tied-array beam pointings around a central point.
```
mwa-tab-grid -m 1226062160.metafits -c "00:26:00 -19:55:00" --show
```
Tied-array beamforming with VCSBeam
VCSBeam is the successor to the legacy VCS tied-array beamformer. It is capable of processing both legacy and MWAX data into any of the following formats:
- fine-channelised (10 kHz) full-Stokes time series in PSRFITS format (`-p -N 4` option)
- fine-channelised (10 kHz) Stokes I time series in PSRFITS format (`-p -N 1` option)
- coarse-channelised (1.28 MHz) complex voltage time series in VDIF format (`-v` option)
The coarse channelised output is made possible by the inverse PFB, which reverses the 10 kHz channelisation of the initial PFB stage. To run the tied-array beamformer, we use the make_mwa_tied_array_beam program, which is MPI-enabled to process multiple coarse channels in a single Slurm job. Further details about this command can be found here.
An example Slurm script is given below, where 24 channels are being processed across 24 nodes. In this case, the -p option is used to specify PSRFITS output.
For VDIF output, the beamforming can be split into multiple smaller jobs using a Slurm job array. This is done with the sbatch option --array=LOWCHAN-HIGHCHAN. The -f option in make_mwa_tied_array_beam should then be given $SLURM_ARRAY_TASK_ID.
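A minimal sketch of the relevant script lines (the channel range 109-132 is only an example; use your observation's coarse channel numbers, and fill in the remaining beamformer options as in the PSRFITS example above):

```shell
#SBATCH --array=109-132   # one array task per coarse channel (example range)

# -v selects VDIF output; -f receives this task's coarse channel number.
srun make_mwa_tied_array_beam -v -f "$SLURM_ARRAY_TASK_ID" [OTHER OPTIONS]
```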
NOTE: MWAX VCS “sub” files include the channel number, but they are not zero-padded to three (3) places, as VCSBeam expects. If your observation contains any coarse channels <100 then you will need to execute the following (in the directory where the .sub files exist) to create symbolic links with the expected file naming convention.
```
find . -type f -name '*_[0-9][0-9].sub' | while read -r s; do
    d=$(dirname "$s")
    b=$(basename "$s")
    n="${b%_*}"
    m="${b##*_}"; m="${m%.sub}"
    p=$(printf "%03d" "$((10#$m))")   # force base-10 so channels 08/09 are not read as octal
    ln -sf "$b" "${d}/${n}_${p}.sub"  # link target is relative to the link's own directory
done
```
Picket fence processing - WIP
The flexibility of the MWA recording system allows for, in principle, any combination of 24 coarse channels across the accessible bandwidth. When these subbands are not contiguous, we call the data a “picket fence”. In theory, both the calibration and beamforming procedures outlined above are directly applicable, except that we now treat each subband within a given observation as its own “job” or task. For the most part, this can be achieved by writing some simple wrappers and templates based on the above scripts and then exploiting Slurm's ability to process subbands in parallel.
Calibrating picket fence data
Nominally, the Birli pre-processing script above will natively handle picket fence data in that a separate UVFITS output will be produced for each subband. We will assume at this point that the Birli pre-processing has already been executed and the UVFITS files exist in the directory “above” the current working directory.
The UVFITS files for each subband are then directly ingested into a Hyperdrive calibration job (one per subband). Copy this template into the current directory as calibrate_template.batch.
This template may then be executed by running the following bash script. Copy this into the current directory and execute as bash calibrate_picket_fence.sh.
You may then inspect the amplitude and phase diagnostics as in the normal calibration case.
Beamforming picket fence data
Similarly to the calibration jobs, the beamforming jobs essentially treat each subband as an independent task. Starting again with the template, which should be placed in the directory where you want the output to be written, copy the following to beamform_template.batch. You should also have a file named pointings.txt in the working directory that contains the look-directions (one per line) you want to beamform on. This template produces Full Stokes PSRFITS output, but you can change that by altering the executed command (see the options available in the normal beamforming section above).
NEW as of vcsbeam/v5.5(?): There is a new flag, -k, to correctly process calibration solutions for picket fence data. For picket fence data, the calibration solution provided must cover the same set of channels as those that you want to beamform. The template below processes the data correctly provided the -k flag is given.
Especially salient for picket fence beamforming is the note at the bottom of the normal beamforming section regarding MWAX VCS file naming conventions.
This template is then ingested by a wrapper script which sends off jobs for each subband. Copy this script to beamform_picket_fence.sh in the same directory as the template. You will need to manually enter the required information in this wrapper script to match the beamforming task(s) you wish to accomplish; the required inputs are self-explanatory. Execute with bash beamform_picket_fence.sh once the required inputs are filled in.