VCS pulsar data processing
This documentation is focused on the pulsar science applications of VCS data. For a detailed description of the VCS and its science goals and capabilities, please see Tremblay et al. (2015).
While not strictly necessary, a basic understanding of radio interferometry will be helpful when reading this documentation.
The wrapper script process_vcs.py is used for many of the steps needed to reduce the voltages to incoherently/coherently summed data in PSRFITS and/or VDIF format. It attempts to make things as streamlined and user-friendly as possible. To see detailed usage information, run process_vcs.py -h.
For our current Galaxy installation, there are a few specifics:
- In order to access any of the tools, you will need to source in your personal profile:
/group/mwavcs/PULSAR/psrBash.profile
- There is a master branch and a development branch of the main codebase. By default, psrBash.profile will load the master branch, which is meant to represent the most stable version. However, there may be bugs that were later discovered and fixed, but which have not yet been grafted back into the master branch. If you wish to try the more cutting-edge development branch, then after sourcing /group/mwavcs/PULSAR/psrBash.profile, run "module unload vcstools/master" followed by "module load vcstools/devel". (Using module unload & load is preferable to using module switch on account of certain undesirable quirks of the Galaxy system.)
- There is a strict directory structure for VCS processing, given the large volumes of data involved. Raw/recombined data are stored on the /astro file system, however all other products (incoherent sum products, coherent sum products, calibration solutions, etc.) will be stored on the /group file system. We have endeavoured to make sure this happens without requiring the user's attention, however, please double-check working directories and defaults for scripts.
A note regarding code examples:
Code snippets or complete command-line calls will be enclosed in a code block, i.e.
This is a code block
Furthermore, some script inputs are required and others are optional. Optional arguments will be surrounded by [] characters. In both cases, where the user needs to specify some input (rather than just toggling a mode of operation), a brief description of that input will be surrounded by <> characters. For example:
test.py --user_input <required input> [--optional_user_input <optional user input>] [--optional_flag]
Downloading data
Data are downloaded via the "copyq" on the Zeus machine at Pawsey. In the background on the NGAS servers, the raw voltages are (slowly) being "recombined"; once recombined, they are labelled with the "Processed" flag on the MWA data archive.
Therefore, there are two different "kinds" of downloads available: raw data OR recombined data. In both cases, the time from a download request to having the data on disk is typically ~few days (depending on volume, load on queues/archives, etc).
All being well, the decision between "raw" and "recombined" downloads should be auto-magically made for you by the process_vcs.py script. If the data on the NGAS server have been processed, the script will download the "tarballed" recombined files (and, rather nicely, send them off to be "untarred" on the fly). If the data have not been completely processed, then the raw data will be downloaded (in which case you will need to "recombine" the data yourself - and then delete the raw voltages). Additionally, if the data have been processed on the NGAS server, then you have the option to ONLY download the incoherently summed data files (*_ics.dat) with the mode option: -m download_ics.
- If raw voltages are downloaded, then they will be placed in:
/astro/mwavcs/vcs/[obs ID]/raw
- If the processed/recombined data are downloaded, then the relevant files are placed in:
/astro/mwavcs/vcs/[obs ID]/combined
These data will appear as symbolic links (raw and combined, respectively) in /group/mwavcs/vcs/[obs ID].
To download an entire observation (referred to by its observation ID, a GPS second, as labelled on the MWA data archive), use:
process_vcs.py -m download -o <obs ID> -a
otherwise, if you only want some interval [begin, end] of the full observation, then use:
process_vcs.py -m download -o <obs ID> -b <starting GPS second> -e <end GPS second>
To check progress, use:
squeue -p copyq -M zeus -u $USER
and you should see something like this:
Each user can have a maximum of 8 concurrent download jobs running.
A note on beginning and end times: the observation ID does not correspond to the starting GPS second of the data. In order to determine the beginning/end times of the observation data, use:
mwa_metadb_utils.py <obs ID>
and this will help you determine the correct GPS times to pass to the -b/-e options. This applies to all jobs that have optional beginning and end times.
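For example, a minimal sketch of this workflow (the obs ID and GPS seconds below are placeholders; take the real values from the metadata output):
# query the observation metadata to find the data start/stop GPS seconds
mwa_metadb_utils.py 1221832280
# then request only the interval of interest
process_vcs.py -m download -o 1221832280 -b 1221832288 -e 1221832888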
Checking downloaded data
By default, data are written to /astro/mwavcs/vcs/[obs ID]. Once the observation appears to have downloaded correctly, it is good practice to check. The easiest way is to check the output from the process_vcs.py download command. In /group/mwavcs/vcs/[obs ID] there is a subdirectory labelled "batch", in which all of the submitted SLURM scripts and their output are stored. If there has been a problem in downloading data, this will appear as up to 10 files named "check_volt*.out", i.e.
check_volt_[GPS time]_0.out check_volt_[GPS time]_1.out check_volt_[GPS time]_2.out ...
After 10 attempts to correctly download the data, we stop trying. In this case, the user will need to investigate the SLURM output to see what was going wrong.
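If you want to scan those batch outputs quickly, something like the following may help (this assumes failures are reported with the word "error" somewhere in the check_volt outputs; adjust the pattern to what you actually see):
# list any download-check outputs that mention an error (case-insensitive)
cd /group/mwavcs/vcs/<obs ID>/batch
grep -il "error" check_volt_*.out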
If the raw data were downloaded and you want to check them manually, then use:
checks.py -m download -o <obs ID> -a (or -b/-e)
otherwise, if the recombined data (i.e. with the "Processed" flag on the MWA data archive) were downloaded and you want to check them manually, then use:
checks.py -m recombine -o <obs ID> -a (or -b/-e)
The total data volume downloaded will vary, but for maximum-duration VCS observations this can easily be ~40 TB of just raw data. It is therefore important to keep in mind the amount of data you are processing and the disk space you are consuming. If only the raw voltages have been downloaded then you will need to recombine the data yourself, which doubles the amount of data (see next section).
Recombine
Recombine takes data spread over 32 files per second (each file contains 4 fine channels from one quarter of the array) and recombines them to 24+1 files per second (24 files with 128 fine channels from the entire array and one incoherent sum file); this is done on the GPU cluster ("gpuq") on Galaxy. When downloading the data, if you retrieved the "Processed" (i.e. recombined) data, then ignore this step as it has already been done on the NGAS server.
To recombine all of the data, use
process_vcs.py -m recombine -o <obs ID> -a
or, for only a subset of data, use
process_vcs.py -m recombine -o <obs ID> -b <starting GPS second> -e <end GPS second>
If you want to see the progress, then use:
squeue -p gpuq -u $USER
Generally, this processing should not take too long, typically ~few hours.
Checking the recombined data
As before, it is a good idea to check at this stage to make sure that all of the data were recombined properly. To do this, use:
checks.py -m recombine -o <obs ID>
This will check that all the recombined files are present and of the correct size. If there are missing raw files, the recombining process will make zero-padded files and leave gaps in your data. If you would like to do a more robust check, beamform and splice the data (using the following steps) and then run:
prepdata -o recombine_test -nobary -dm 0 <fits files>
Then you can look through the produced .dat file for gaps using:
exploredat <.dat file>
Once you are happy that the data have been recombined correctly then you should delete the raw voltages (as they are no longer used in the pipeline and are a massive drain on storage resources).
Incoherent sums
After the download has completed, you already have all that is necessary to create the first kind of beamformed data: an incoherent sum. The data used to create this kind of beamformed output are labelled as <obsID>_<GPS second>_ics.dat
in the downloaded data directory (/group/mwavcs/vcs/<obs ID>/combined
by default), hence we refer to the incoherent sum files as ICS files. The incoherent sum is, basically, the sum of the tile powers, producing a full tile's field of view, with a √N improvement in sensitivity over a single tile, where N is the number of tiles combined. Unfortunately, this kind of data is also very susceptible to RFI corruption, thus you will need to be quite stringent with your RFI mitigation techniques (in time and frequency).
To create the incoherent sum, we run:
create_ics_psrfits.py <obs ID> [-b base directory]
where the optional argument -b
can be used to make the script look in the input path for the ICS data, but the default is /group/mwavcs/vcs/<obs ID>
. This script will create a folder called ics
in the base directory and populate it with the incoherently summed PSRFITS data products. There will be one file per 200 seconds of data, and the files are labelled as <obs ID>_XXXX.fits
, where XXXX
corresponds to which 200-second chunk is included (i.e. 0001
means the first 200 seconds of data).
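As a usage sketch (the obs ID is a placeholder), running with the defaults or with an explicit base directory looks like:
# use the default base directory
create_ics_psrfits.py 1221832280
# or point the script at a non-standard location containing the *_ics.dat files
create_ics_psrfits.py 1221832280 -b /group/mwavcs/vcs/1221832280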
Calibration
In order to coherently beamform the data (i.e. form a tied-array or phased-array beam at a given pointing somewhere within the "incoherent" beam), we need to account for a number of factors. These corrections and how they are calculated are beyond the scope of this guide, but it amounts to: "What delay do we add to each tile to make sure they all phase up and point at the same place on the sky?"
There are different ways that we can get this information. In our case, there are two types of observations that can be used:
- dedicated calibrator: usually directly before or after the target observation, on a bright calibrator source. These are stored on the MWA data archive as visibilities (they are run through the online correlator like normal MWA observations). You can find the observation ID for the calibrator by searching on the MWA data archive or by using the following command, which will list compatible calibrator IDs sorted by how close they are in time to the observation:
mwa_metadb_utils.py -c <obs ID>
- in-beam calibrator: using data from the target observation itself, correlating a section offline, and using those visibilities. See Offline correlation.
In order to download the calibration observation, set your MWA ASVO API key as an environment variable. Below are the steps to do so:
- First go to the website https://asvo.mwatelescope.org
- Login with your username and password, or via eduGAIN (using your institution details)
- Go to your profile and get your API key
- Copy this API key to your .profile or .bashrc file (i.e. export MWA_ASVO_API_KEY=<your API key>), as sketched below
- For more information visit the manta-ray-client github
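A minimal sketch of setting the key (the key value is obviously a placeholder):
# append the MWA ASVO key to your ~/.bashrc and reload it
echo 'export MWA_ASVO_API_KEY=<your API key>' >> ~/.bashrc
source ~/.bashrc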
To download a dedicated calibrator observation, use:
process_vcs.py -m download_cal -o <obs ID> -O <cal obs ID>
which will automatically create the correct directory structure and symbolic links and download the calibrator data to /astro/mwavcs/vcs/[cal obs ID]/[cal obs ID]
. A symbolic link to this directory, called "vis", is created in /group/mwavcs/vcs/[obs ID]/cal/[cal obs ID].
For both of these types of calibration data, there are (technically) multiple methods we can use to get calibration solutions, including:
- using the Real Time System (RTS);
- using the MWA imaging software pipeline (i.e. AOFlagger, wsclean and cotter); or
- using the CASA software suite
We currently only use the RTS to produce solutions, though in principle the imaging software pipeline is just as valid.
Calibrating with the Real Time System (RTS)
There are a number of steps to go from the visibilities to a usable calibration solution, most of which involve setting up the RTS input files. In all cases, we need a metafits file, which contains a lot of information about the array setup, observation parameters, etc. When you download a dedicated calibrator observation, a [obs ID]*metafits*.fits file is
included. When correlating for in-beam calibration, process_vcs.py
will also check to see if there is a metafits file available and download it.
Getting calibrator source lists
In order to do the calibration, we need 1 or more bright sources in the field to actually get the phase and amplitude solutions. The RTS uses a list of known sources to calibrate the antennas.
First, change to the directory where the calibrator "vis"
symbolic link is located. This should be /group/mwavcs/vcs/[obs ID]/cal/[cal obs ID]
. We get a list of sources for the RTS to use by running:
srclist_by_beam.py -m <cal obs metafits file> -n <number of sources> -s <base RTS source list>
which will create a source list of one "super" source, which contains N components and is called "srclist_puma-*_[obs ID]_patch[N].txt". In general, it's a good idea to have N~1000, regardless of whether you are using a dedicated calibrator observation or doing in-beam calibration, but this number is totally arbitrary (in some cases, 100 may do). The RTS selects the brightest source per channel and uses it to calibrate the tile amplitudes and gains. The remaining N-1 sources ("components") are then used to correct for ionospheric shifts.
An important note here is that we have two options for the base RTS source list, which are captured with the following environment variables:
- $RTS_SRCLIST (newest source list, based on the GLEAM catalogue, but does have gaps, particularly around some of the "A-team" sources and the Galactic plane)
- $RTS_SRCLIST_V2 (the old source list, still has gaps around the Galactic plane, but includes the "A-team" sources we normally use for calibration)
By default, this code searches for a source with a beam-weighted flux density >10 Jy within 1 degree of the nominal pointing centre. If no source satisfies those criteria, the search radius is incrementally increased by 0.5 degrees until a source is found. The srclist_by_beam.py
code also has an experimental option "order" (-o
), which we can use to specify which sources to select based on flux density and distance from the pointing centre. For example, if we wanted to allow the primary calibrator to be 5 Jy (instead of the default minimum of 10 Jy), and search within a 20 degree radius of the pointing centre, then our command would become:
srclist_by_beam.py -m <cal obs metafits file> -n <number of sources> -s <base RTS source list> -o experimental=5,20
This is particularly handy if, by default, the code does not pick up on the desired calibrator source (i.e. if you know Pic A is in the field, but not within the central degree or so). You can specify a wider search radius to begin with, at which point Pic A should be selected as the "base" calibrator source. If all else fails, please use $RTS_SRCLIST_V2
as the base source list and try again.
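As a concrete sketch (the metafits file name is illustrative), generating a 1000-component source list from the newest base catalogue looks like:
# build a patch source list of ~1000 components from the GLEAM-based catalogue
srclist_by_beam.py -m 1221832168_metafits_ppds.fits -n 1000 -s $RTS_SRCLIST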
RTS configuration and tile/channel flagging
To create an RTS-specific configuration file (which the RTS reads to get all the information, file locations, etc. that it needs), use:
calibrate_vcs.py -o <obs ID> -O <cal obs ID> -m <cal obs metafits file> -s <output source list> [--gpubox_dir <path/to/visibilities>] [--rts_output_dir <output directory>] [--n_vis_grp <# groups>] [--offline] [--nosubmit]
The --gpubox_dir
option is by default pointing to /group/mwavcs/vcs/[obs ID]/cal/[cal obs ID]/vis
, and should only be changed if the visibilities are stored in a non-standard location.
The --rts_output_dir
is by default pointing to /group/mwavcs/vcs/[obs ID]/cal/[cal obs ID]
, and only experienced users should change the output directory. The script will create a "rts" subdirectory in the directory given by --rts_output_dir.
The --n_vis_grp
is an advanced option that sets how many bins the visibility data are chopped into. This affects how well the RTS can combat decorrelation over the band and will depend strongly on the baseline configuration (i.e. long baselines decorrelate faster, so more bins are required). In theory there is no downside to increasing this number other than computation time, but we are limited by GPU memory in this case. The default number of bins created is 6, but even this is not sufficient for the longest baselines. Given the current limitations of the Galaxy GPU queue hardware, we cannot use anything more than 6.
The --offline
flag allows for offline correlated data to be used with the RTS (experimental - we have had limited success in doing this).
The --nosubmit
flag tells the script to not submit the jobs to the GPU queue, but just to write the required scripts.
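For reference, a typical invocation that relies on the default visibility and output directories might look like the following sketch (the obs IDs and metafits name are placeholders; the source list is the output of srclist_by_beam.py above):
calibrate_vcs.py -o 1221832280 -O 1221832168 -m 1221832168_metafits_ppds.fits -s <output source list from srclist_by_beam.py>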
You should see an output like the following:
This script will create a "rts" subdirectory in whatever directory was given to --rts_output_dir and populate it with:
- flagged_tiles.txt, which contains the numbers of tiles that need to be flagged, based on what is in the metafits file provided,
- flagged_channels.txt, which contains the channel numbers (depending on the fine channel resolution provided) to be flagged, and
- a configuration file, rts_[cal obs ID].in, which is what the RTS will read during initialisation.
Inspecting the solutions
After the RTS has run, it will create a set of files in the "rts
" subdirectory (Bandpass calibration files and Direction Independent Jones matrix files) used to actually beamform the data.
Once the calibration is completed (it runs on the "gpuq"), move to the rts directory and run:
auto_plot.bash
which will show you the bandpass amplitude and phase calibration solutions for each coarse channel in the "attempt_1" subdirectory. From these plots, we are able to identify troublesome tiles and/or channels, add them to flagged_tiles.txt and flagged_channels.txt, and then re-run the calibration step.
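The per-coarse-channel Bandpass files can also be plotted by hand with plot_BPcal_128T.py; for example, a simple bash loop produces amplitude and phase plots for all 24 coarse channels:
for i in $(seq -w 1 24); do
    # amplitude solutions for coarse channel $i
    python3 `which plot_BPcal_128T.py` -r -o chan${i}_amp.png --file Bandpass*_node0${i}.dat
    # phase solutions for coarse channel $i
    python3 `which plot_BPcal_128T.py` -r -p -o chan${i}_phase.png --file Bandpass*_node0${i}.dat
done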
A good solution will have a relatively flat bandpass with a nominal "gain relative to band average" value of ~1 for Jones Matrix elements P←X (top left) and Q←Y (bottom right), and should have a gain value of ~0 for the P←Y (top right) and Q←X (bottom left) components. Below is an example of the amplitudes for an excellent calibration solution.
For the phase solutions, we expect that the P←X (top left) and Q←Y (bottom right) components are around 0 degrees, while the P←Y (top right) and Q←X (bottom left) should be random. Below are the phase solutions corresponding to the above amplitude solutions.
Remember to check the calibration solution for each coarse channel. If only a single coarse channel appears to have problems, that is okay; just keep it in mind when processing your data (with PRESTO or DSPSR), as you may have to flag that coarse channel.
Improving the RTS solutions
If your calibration solutions don't look as clean as the above images, you can use the following steps to attempt to improve them. Remember that calibration is more art than science, so try a few different methods and see what works. auto_plot.bash will make a new "attempt_N" subdirectory each time, so if you make your calibration solution worse you can copy the files from the attempt subdirectory with a better solution back into the rts directory to overwrite your latest attempt. If a calibration solution isn't converging, it may be easier to try a different calibration observation (3C444 often gives the best calibration solutions).
Here is an example of an imperfect calibration solution
The key in these plots gives suggestions of which tiles may need to be flagged. These suggestions mean different things for the amplitude and phase solution plots:
- Amplitudes: the relative gain amplitude threshold used to determine whether tiles are bad. Default is 0.35.
- Phases: the phase deviation threshold (in degrees) used to determine bad tiles. Default is 30 degrees.
I (Nick) normally use the amplitude plots to work out what needs flagging so you can assume I'm talking about amplitude plots from here onward. Here are some basic tips:
- If a tile has max>3 on multiple channels, it is worth flagging.
- If a tile has max=0, it is worth flagging (even if only one polarisation has max=0!), as it is contributing no signal and can cause errors in beamforming.
- It is best to only flag up to ~3 tiles at a time as bad tiles can affect other potentially good tiles
- Sensitivity scales as ~sqrt(128-N), where N is the number of tiles flagged, so flagging more than ~50 tiles will start to noticeably lower your sensitivity and the observation may be worth abandoning.
- Make sure you put the number on the right of the key (between "flag" and "?") into the flagged_tiles.txt file (see the sketch after this list).
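As a minimal sketch of one flagging iteration (the tile number 21 is purely illustrative):
# add the offending tile number (taken from the plot key) to the flag list
echo "21" >> flagged_tiles.txt
# then re-run the calibration step as described above and re-inspect with auto_plot.bash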
Picket fence calibration with the RTS
The MWA can optionally split its 30.72 MHz of bandwidth up into 24 x 1.28 MHz bands that can be spread anywhere within the nominal observing range (70-300 MHz). We call observations of this type "picket fence". Calibrating these observations and combining them can be a little more involved, but nonetheless doable.
The RTS takes quite a bit of work to get going with picket fence data. It assumes that all the frequency bands in the data you pass it are contiguous, so we have to do some "hacking" of the configuration files we send for each picket fence channel. We've incorporated all of the difficult stuff into calibrate_vcs.py
, so the user shouldn't need to worry whether they are calibrating using picket fence data or not.
Notes:
- The RTS basically does an "ls" command to find all of the gpubox files, which on a parallel file system can sometimes hang. This will be obvious when the RTS job for a channel takes much longer than the other channels. In this case, it is a good idea to look in the batch output file (named something like RTS_[obs ID]*.out) and the log files (which are written to the output directory specified when calling calibrate_vcs.py) to help determine which jobs are hanging. These jobs should be cancelled and resubmitted individually to the queue (see the sketch after these notes).
- You may find that some coarse channel solutions simply don't converge. In this case, look at those channels that did converge and flag any obviously bad tiles or fine frequency channels (by adding the relevant entry to either
flagged_tiles.txt
or flagged_channels.txt
) and re-run the calibration (often you will need to iterate a few times anyway). Hopefully, by eliminating really bad tiles/fine-channels, other coarse channels will converge.
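A possible way to handle a hung channel job, assuming the corresponding SLURM submission script sits alongside the RTS_[obs ID]*.out batch output (the job ID and script name are placeholders):
# cancel the hung RTS job for that coarse channel
scancel 4123456
# and resubmit its submission script by hand
sbatch RTS_1221832168.batch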
Offline correlation
In the case where trying to calibrate on a dedicated calibrator observation fails, we can try in-beam calibration. This requires visibilities made from the actual data you're ultimately trying to beamform on, thus we need to use the offline correlator. This is relatively easy, as it's a mode that process_vcs.py
has: it will send off the correlator jobs at the frequency and time resolution you request and will (by default) put the output in /astro/mwavcs/vcs/[obs ID]/vis.
While you technically can correlate the entire observation, that is not recommended - the data volume will be immense. Usually you can just correlate ~200 seconds and that will be sufficient for a calibration solution.
For a desired frequency and time resolution over the interval [begin : end] (in GPS seconds), use:
process_vcs.py -m correlate -o <obs ID> -b <starting GPS second> -e <end GPS second> --ft_res <freq. res.> <time res.>
Typical values for --ft_res
are 40 (kHz) and 1 (second) integrations. Note that the offline implementation of the correlator is (currently) LIMITED to a maximum dump time of 1 second. Again, see the process_vcs.py
help for details.
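For instance, to correlate a ~200 second slice at 40 kHz / 1 s resolution (the obs ID and GPS seconds are placeholders):
# offline-correlate 200 seconds of voltages for in-beam calibration
process_vcs.py -m correlate -o 1221832280 -b 1221832400 -e 1221832600 --ft_res 40 1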
The correlator jobs are run on Galaxy's "gpuq", thus you can check the jobs using:
squeue -u $USER -p gpuq
Beamforming
There are a few different modes in which the beamformer can be operated and these different "modes" produce different outputs.
With the calibration solutions we are able to form a tied-array beam at any pointing direction within the field-of-view. To start the beamforming process (it runs on the "gpuq"), use:
process_vcs.py -m beamform -o <obs ID> -a (or -b/-e) -p <"RA string"> <"DEC string"> --DI_dir </path/to/rts/output> --flagged_tiles </path/to/flagged_tiles.txt> [--incoh] [--bf_out_format <"psrfits" or "vdif" or "both">]
where <"RA string">
is formatted as "hh:mm:ss.ss" and <"DEC string">
is formatted as "dd:mm:ss.ss" (including the sign: "+" or "-"). If you also want the incoherent sum, you can pass the --incoh
flag.
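A full invocation might look like the following sketch (the obs ID, calibrator obs ID and pointing are placeholders; the --DI_dir path assumes the standard rts output directory described above):
# beamform the whole observation on a single pointing
process_vcs.py -m beamform -o 1221832280 -a -p "05:34:31.97" "+22:00:52.06" --DI_dir /group/mwavcs/vcs/1221832280/cal/1221832168/rts --flagged_tiles /group/mwavcs/vcs/1221832280/cal/1221832168/rts/flagged_tiles.txt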
In our experience, it typically takes ~3-5x real-time to beamform. The process will create a directory /group/mwavcs/vcs/[obs ID]/pointings/[RA string]_[DEC string]
where there will be several full polarisation (Stokes I, Q, U & V) PSRFITS files created: 1 file per coarse channel per 200 seconds of data. They are labelled as something like:
[Project ID]_[obs ID]_ch[channel]_????.fits
where [channel]
is the absolute frequency channel (0-256), and 00??
is which 200 second chunk the file corresponds to (e.g. 00?? = 0001
means it is the first 200 seconds of the observation).
We are also able to create VDIF format output by including --bf_out_format=vdif
which instead produces two files per channel:
[Project ID]_[obs ID]_ch[channel]_u.hdr [Project ID]_[obs ID]_ch[channel]_u.vdif
where [Project ID]
, [obs ID]
and [channel]
are as defined above. It is also possible to create VDIF and PSRFITS simultaneously with the --bf_out_format=both
argument.
Splicing channels together
We are able to splice together adjacent frequency channels for each 200 second chunk using the splice_psrfits
program. In its current form it accepts all the files you want to splice into one (i.e. the 24 coarse-channel data files for an individual 200 second chunk), plus an output suffix. It is invoked as follows:
splice_psrfits [file1] [file2] ... [suffix]
and the output file will be named [suffix]_0001.fits
.
Note that frequency ordering matters here! The beamformed PSRFITS are already ordered by channel, so just add them in order. i.e.
splice_psrfits [lowest channel] [second lowest channel] [...] [suffix]
If you give it channels in the wrong frequency order, it will complain about channels not being "adjacent" and crash.
For VDIF output, each channel is written to its own file. As far as we can tell, the only way to add them together is to first create the archives using DSPSR and then add them together with PSRCHIVE tools.
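As an untested sketch of that route (file names and the ephemeris are placeholders; dspsr's -E takes a pulsar ephemeris and psradd's -R appends archives in frequency):
# fold each coarse-channel VDIF file into a PSRCHIVE archive
for hdr in G0024_1221832280_ch*_u.hdr; do
    dspsr -E pulsar.eph -O ${hdr%_u.hdr} ${hdr}
done
# then append the per-channel archives in frequency
psradd -R -o G0024_1221832280_all.ar G0024_1221832280_ch*.ar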
To make this all easier, the splice.sh
script will handle the multiple frequency channels and multiple 200-second chunk aspect of this. To make use of this, change into the directory that contains the individual coarse channel PSRFITS, and then run:
splice.sh
at which point you will be asked to input the Project ID (usually G0024), the observation ID (this is required), how many 200-second chunks you want to combine, and the lowest and highest coarse channel numbers. All of these, except the observation ID, have sensible defaults. If you do not enter an observation ID, the script will not proceed.
Uploading to the MWA Pulsar Database
Finding targets in the VCS data archive
Estimating pulsar fluxes at MWA frequencies
The ATNF pulsar database is an abundant resource for published pulsar data, which may include flux densities over a broad range of frequencies. Should such data be available for a pulsar, we are able to make a first-order estimate of the flux density we expect from it.
To view the spectral data on the ATNF database for any pulsar:
sn_flux_est.py --mode ATNF -p <Pulsar Name>
Which will output a plot such as this one:
Note that a spectral index will be derived from the points. If there are sufficient data to do so, as in this example, this is calculated with a least-squares approach. Otherwise, the standard value of -1.4 +/- 1.0 from Bates et al. (2013) will be applied.
We can also make an estimate of a pulsar's flux for any observation ID:
sn_flux_est.py --mode SNFE -p <Pulsar Name> -o <Obs ID>
This will apply the radiometer equation and estimate the flux density and signal-to-noise ratio at the observation's central frequency. It is also possible to specify beginning and end times, if you don't intend to use the entire observation length, by adding:
-b <beginning GPS time> -e <end GPS time>
Plotting the SED like before is also possible with the tag:
--plot_est
For example:
sn_flux_est.py --mode SNFE -p J0630-2834 -o 1258221008 --plot_est
Will output the following:
J0630-2834 derived spectral index: -1.3698868864530596 +/- 0.04796669612862415 J0630-2834 flux estimate at 154.24 MHz: 0.41368443304349944 +/- 0.006638533625983657 Jy Pulsar S/N: 747.0705031130714 +/- 161.9689693670181