...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
VCS pulsar data processing
This documentation focuses on the pulsar science applications of VCS data. For a detailed description of the VCS and its science goals and capabilities, please see Tremblay et al. (2015).
While not strictly necessary, a basic understanding of radio interferometry will be useful when reading this documentation. We also have a series of papers on the tied-array beamformer: Ord et al. (2019), Xue et al. (2019) and McSweeney et al. (2020).
The wrapper script process_vcs.py is used for many of the steps needed to reduce the voltages to produce incoherently/coherently summed data in PSRFITS and/or VDIF format. It attempts to make things as streamlined and user-friendly as possible. To see detailed usage information, run process_vcs.py -h. There is also the more recently developed beamform.nf, which displays its detailed usage with beamform.nf --help.
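For quick reference, both help summaries can be printed directly from the command line:

```
# Print the detailed usage information for each script
process_vcs.py -h
beamform.nf --help
```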
...
...
Downloading data
Data are downloaded via the "copyq" on the Zeus machine at Pawsey. In the background on the NGAS servers, the raw voltages are (slowly) being "recombined"; recombined observations are labelled with the "Processed" flag on the MWA data archive.
There are therefore two different kinds of download available: raw data or recombined data. In both cases, the time from a download request to having the data on disk is typically a few days (depending on volume, load on the queues/archives, etc.).
All being well, the decision between "raw" and "recombined" downloads should be auto-magically made for you by the vcs_download.nf script (see Nextflow Notes for tips on running Nextflow scripts). If the data on the NGAS server have been processed, the script will download the tarballed recombined files (and, rather nicely, untars them on the fly). If the data have not been completely processed, the raw data will be downloaded and then recombined.
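A minimal sketch of a download invocation is given below. The flag names are illustrative assumptions (only the script name and the existence of begin/end times come from this page), so check vcs_download.nf --help for the actual interface.

```
# Hypothetical invocation: flag names are assumptions, see `vcs_download.nf --help`
vcs_download.nf --obsid 1234567890 --begin <GPS start> --end <GPS end>
```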
...
and this will help you determine the correct GPS times to pass to the -b/-e options. This applies to all jobs that have optional beginning and end times.
Recombine
As noted above in Downloading Data, the vcs_download.nf script includes the recombine step, so manually recombining will only be necessary where the Nextflow option is either unavailable or undesirable. (Note that with the recent switch to using ASVO as the data download server, vcs_download.nf is now a bit of a misnomer: it can still be used to do the recombining, even if the data are already downloaded.) Other methods of recombining are detailed on the Recombine page.
Incoherent sums
After the download has completed, you already have everything necessary to create the first kind of beamformed data: an incoherent sum. The data used to create this kind of beamformed output are labelled as <obsID>_<GPS second>_ics.dat in the downloaded data directory (/group/mwavcs/vcs/<obs ID>/combined by default), hence we refer to the incoherent sum files as ICS files. The incoherent sum is, basically, the sum of the tile powers, producing a full tile's field of view with a √N improvement in sensitivity over a single tile, where N is the number of tiles combined (for the full 128-tile array, a factor of √128 ≈ 11.3). Unfortunately, this kind of data is also very susceptible to RFI corruption, so you will need to be quite stringent with your RFI mitigation techniques (in time and frequency).
...
where the optional argument -b can be used to point the script at a different input path for the ICS data; the default is /group/mwavcs/vcs/<obs ID>. This script will create a folder called ics in the base directory and populate it with the incoherently summed PSRFITS data products. There will be one file per 200 seconds of data, and the files are labelled as <obs ID>_XXXX.fits, where XXXX identifies which 200-second chunk is included (i.e. 0001 means the first 200 seconds of data).
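As an illustration (1234567890 is a made-up observation ID), the first two chunks of an observation would appear as:

```
# Illustrative listing of the ics directory
ics/1234567890_0001.fits   # seconds 0-200
ics/1234567890_0002.fits   # seconds 200-400
```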
Calibration
In order to coherently beamform the data (i.e. form a tied-array or phased-array beam at a given pointing somewhere within the "incoherent" beam), we need to account for a number of factors. How these corrections are calculated is beyond the scope of this guide, but it amounts to answering: "What delay do we add to each tile to make sure they all phase up and point at the same place on the sky?"
...
- dedicated calibrator: usually taken directly before or after the target observation on a bright calibrator source. These are stored on the MWA data archive as visibilities (they are run through the online correlator like normal MWA observations). You can find the observation ID for the calibrator by searching on the MWA data archive or by using the following command, which lists compatible calibrator IDs sorted by how close they are in time to the observation:

```
mwa_metadb_utils.py -c <obs ID>
```

If this command generates an error, it may be due to a lack of calibration observations with the same frequency channels. If so, try the Calibration Combining Method.
- in-beam calibrator: using data from the target observation itself, correlating a section offline, and using those visibilities. See Offline correlation.
In order to download the calibration observation, set your MWA ASVO API key as an environment variable. The steps to do so are below:
...
We currently only use the RTS to produce solutions, though in principle the imaging software pipeline is just as valid.
Calibrating with the Real Time System (RTS)
There are a number of steps to go from the visibilities to a usable calibration solution, most of which involve setting up the RTS input files. In all cases, we need a metafits file, which contains a lot of information about the array setup, observation parameters, etc. When you download a dedicated calibrator observation, a [obs ID]*metafits*.fits file will be included. When correlating for in-beam calibration, process_vcs.py will also check to see if there is a metafits file available and download it.
Getting calibrator source lists
In order to do the calibration, we need one or more bright sources in the field from which to derive the phase and amplitude solutions. The RTS uses a list of known sources to calibrate the antennas.
...
This is particularly handy if, by default, the code does not pick up on the desired calibrator source (e.g. if you know Pic A is in the field, but not within the central degree or so). You can specify a wider search radius to begin with, at which point Pic A should be selected as the "base" calibrator source. If all else fails, use $RTS_SRCLIST_V2 as the base source list and try again.
RTS configuration and tile/channel flagging
To create an RTS-specific configuration file (which the RTS reads to get all the information, file locations, etc. that it needs), use:
...
- flagged_tiles.txt, which contains the numbers of the tiles that need to be flagged, based on what is in the metafits file provided,
- flagged_channels.txt, which contains the channel numbers (depending on the fine channel resolution provided) to be flagged, and
- a configuration file, rts_[cal obs ID].in, which is what the RTS will read during initialisation.
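Both flag files are plain text with one number per line. A hedged illustration (the values themselves are made up):

```
# flagged_tiles.txt: one tile number per line (illustrative values)
12
67

# flagged_channels.txt: one fine-channel number per line (illustrative values)
0
16
31
```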
Inspecting the solutions
After the RTS has run, it will create a set of files in the "rts" subdirectory (Bandpass calibration files and Direction Independent Jones matrix files) which are used to actually beamform the data.
Once the calibration is completed (it runs on the "gpuq"), move to the rts directory and run:
...
Remember to check the calibration solution for each coarse channel. If only a single coarse channel appears to have problems, that is okay; just remember that when processing your data (with PRESTO or DSPSR) you may have to flag that coarse channel.
Improving the RTS solutions
If your calibration solutions don't look as clean as the above images, you can use the following steps to attempt to improve them. Remember that calibration is more art than science, so try a few different methods and see what works. auto_plot.bash will make a new "attempt_N" subdirectory, so if you make your calibration solution worse, you can copy the files from the attempt subdirectory with a better solution back into the rts directory to overwrite your latest attempt. If a calibration solution isn't converging, it may be easier to try a different calibration observation (3C444 often gives the best calibration solutions).
...
- Any tile with max > 3 on multiple channels is worth flagging.
- Any tile with max = 0 is worth flagging (even if only one polarisation has max = 0!), as it contributes no signal and can cause errors in beamforming.
- It is best to only flag up to ~3 tiles at a time, as bad tiles can affect other, potentially good, tiles.
- Sensitivity scales as ~sqrt(128 - N), where N is the number of tiles flagged, so if you flag more than ~50 tiles the loss of sensitivity becomes significant and the observation may be worth abandoning.
- Make sure you put the number to the right of the key (between "flag" and "?") into the flagged_tiles.txt file.
In the "attempt_number_N" subdirectory are a chan_x_output.txt and phase_x_output.txt file that contains all of the recommended flags that the calibration plotting script creates. These can be useful when deciding which tile(s) to flag next. The following bash command will output the tile with the greatest gain value for each channel:
```
for i in $(ls chan*txt); do grep $(cat $i | cut -d '=' -f 3 | cut -d ' ' -f 1 | sort -n | tail -n 1 | head -n 1) $i; done
```
...
It is now clear that you should add 23 to flagged_channels.txt.
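Since flagged_channels.txt is a plain text list, the channel can be appended directly from the shell:

```
# Append fine channel 23 to the flag list, then re-run the calibration
echo 23 >> flagged_channels.txt
```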
Picket fence calibration with the RTS
The MWA can optionally split its 30.72 MHz of bandwidth into 24 x 1.28 MHz bands that can be spread anywhere within the nominal observing range (70-300 MHz). We call observations of this type "picket fence". Calibrating these observations and combining them can be a little more involved, but is nonetheless doable.
...
- The RTS basically does an "ls" command to find all of the gpubox files, which on a parallel file system can sometimes hang. This will be obvious when the RTS job for a channel takes much longer than the other channels. In this case, it is a good idea to look in the batch output file (named something like RTS_[obs ID]*.out) and the log files (which are written to the output directory specified when calling calibrate_vcs.py) to help determine which jobs are hanging. These jobs should be cancelled and resubmitted individually to the queue.
- You may find that some coarse channel solutions simply don't converge. In this case, look at those channels that did converge and flag any obviously bad tiles or fine frequency channels (by adding the relevant entry to either flagged_tiles.txt or flagged_channels.txt) and re-run the calibration (often you will need to iterate a few times anyway). Hopefully, by eliminating really bad tiles/fine channels, other coarse channels will converge.
Cleaning up after using the RTS
Once you have a calibration solution you're happy with, please remember to delete the raw visibilities, which are no longer needed.
...
Compression will take the size down from ~55 MB to ~20 MB. The final tarball can then be easily transported to a secure backup location of your choice.
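As a sketch (the archive name is arbitrary and the exact file selection is up to you), the solution directory can be compressed with standard tools:

```
# Illustrative only: compress the RTS solution directory for backup
tar -czvf rts_solutions_[cal obs ID].tar.gz rts/
```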
Calibrating with the calibrate software from the mwa-reduce package by Andre Offringa
This is an initial introduction to using the calibrate software from the mwa-reduce package by Andre Offringa (AO). It is available on Garrawarla after loading the module:
...
- check_solutions.sh 1103645160_calsol.bin
Offline correlation
In the case where calibrating on a dedicated calibrator observation fails, we can try in-beam calibration. This requires visibilities made from the actual data you're ultimately trying to beamform, so we need to use the offline correlator. This is relatively easy, as it's a mode that process_vcs.py provides: it will send off the correlator jobs at the frequency and time resolution you request and will (by default) put the output in /astro/mwavcs/vcs/[obs ID]/vis.
...
```
squeue -u $USER -p gpuq
```
Finding pulsars in the observation
The huge field of view of the MWA's tile beam means that there can be hundreds of pulsars in an observation. To list all pulsars in the beam, run the following command:
...
This estimates the signal-to-noise ratio of the pulsars in the observation. Note that this is a time-consuming calculation, so you'll have to be patient, and the estimates often have large uncertainties. See the section Estimating pulsar fluxes at MWA frequencies for more information on the method.
We do have a data processing pipeline that will beamform, fold and create Stokes profiles for all pulsars in the beam. This pipeline is extremely processing heavy and still in development, so it's best to ask Keegan or Nick to run the pipeline for you.
Beamforming (Nextflow method)
The new Nextflow beamforming method is just a better wrapper for the beamformer (see Nextflow Notes for pros, cons and tips) and automatically splices the fits files. The old method is shown below if you prefer it.
...
where [obs ID], [pointing] and [channel] are defined as above.
Pulsar Processing on Garrawarla
Pulsar software such as PRESTO and DSPSR is difficult to install at the best of times, so the common packages are not currently natively installed on Garrawarla; instead, use the following Singularity commands.
For PRESTO commands use:
...
, but are provided via containerisation. There are two generic Singularity containers available to users, each focused on a different aspect of pulsar science/processing.
Pulsar searching: the psr-search container includes the most common pulsar searching tools, namely PRESTO and riptide (an FFA implementation). It can be accessed as shown below.
```
/pawsey/mwa/singularity/psr-search/psr-search.sif <command>
```
Pulsar follow-up analysis: the psr-analysis container includes the common pulsar analysis tools, such as PSRCHIVE, DSPSR, and various pulsar timing packages. It can be accessed as shown below.
```
/pawsey/mwa/singularity/psr-analysis/psr-analysis.sif <command>
```
Both of these images have been built to enable interactivity if required. To use this, modify how you run the container as follows:
```
singularity run -B ~/.Xauthority <container> <command>
```
There are other containers also available, but from January 2023 they will likely not be maintained or updated. They are "use at your own risk".
For PRESTO commands use:
```
/pawsey/mwa/singularity/presto/presto.sif <command_here>
```
...
```
singularity run -B ~/.Xauthority /pawsey/mwa/singularity/pulseportraiture/pulseportraiture.sif <command_here>
```
The Observation Processing Pipeline
...
First, estimate the number of pulsars that will be beamformed by running find_pulsar_in_obs.py -o <obsid> and checking the output file. The actual number of sources may be higher, because we also search for other sources (FRBs, RRATs and Fermi candidates), as well as pulsars further away from the beam centre if we estimate them to be extremely bright. You can estimate the storage the beamformed fits files require as <number of pulsars> × <observation duration in minutes> TB; for example, 4 pulsars in a 10-minute observation come to roughly 40 TB. If this number is below 50 TB, use the method described in Beamforming on all pulsars and Launching the OPP; if it is over 50 TB, use the method described in Running the OPP in chunks.
The PPP Workflow
Image accurate as of commit e6215f42c1d7c0b5a255721bc46840335170e579 to mwa_search repo
- Input Data: The OPP requires the calibrated and beamformed products of the VCS. These data can be acquired using the method described here.
- Pulsar Search: Given an observation ID, each pulsar within the field is identified and handed to the Pulsar Processing Pipeline (PPP).
- Initial Fold(s): Performs a PRESTO fold on the data. For slow pulsars, this will probably use 100 bins; fast pulsars will use 50 bins.
- Classification: The resulting folds are classified as either a detection or a non-detection.
- Best Pointing: For the MWA's extended array configuration, there may be multiple pointings for a single pulsar. Should this be the case, we want to find the brightest detection to use for the rest of the pipeline. The "best" detection is decided on, and its pointing is the only one used going forward.
- Post Folds: A series of high-bin folds is done, in order to find the highest time resolution fold we can make while still getting a detection.
- Upload products to database: Uploads the initial fold and best fold to the pulsar database.
- IQUV Folding: Uses DSPSR to fold on Stokes IQUV, making a time-scrunched archive. This archive is immediately converted back to PSRFITS format for use with PSRSALSA.
- RM Synthesis: Runs RM synthesis on the archive. If successful, the resulting RM correction is applied.
- RVM Fitting: Attempts to fit the Rotating Vector Model to the profile. If successful, the products are uploaded to the database.
Beamforming on all pulsars
Before running the OPP, we need to beamform on all pulsars in the field of view of the observation we're processing. This can be done with the following command:
...
Like many Nextflow scripts, it's recommended that you run this in a screen (see Using a screen), as it may take some time to complete.
...
```
/astro/mwavcs/vcs/<OBSID>/pointings
```
Launching the OPP
To run the OPP, ensure you have the vcstools and mwa_search modules loaded. Then, the following command will launch the OPP:
...
- Fresh run: Creates a new config file for each pulsar, clears the working directories of any previous results and launches the PPP
- Rerun existing: Loads the existing config files in the 'dpp' directory and launches the PPP for each of them without any modifications
- Rerun broken: Loads the existing config files, checks to see which did not complete properly and relaunches these jobs only
The Pulsar Processing Pipeline
The PPP is a piece of software designed to work in conjunction with the OPP. It handles the processing requirements for a single pulsar. You shouldn't need to interact with the PPP directly; however, this page details how it works.
...
Note that, by default, the PPP will continue where it left off. If you want to restart the pipeline, add the --fresh_run flag. You can use the --help flag for more information.
OPP_Status
Many observations will contain a large number of pulsars in their field of view, which makes it inconvenient to check the outcome of each PPP run manually. Instead, you can use the following command to do so (provided the OPP run has completely finished):
...
Note that the positive detections are listed at the very top of the output. A detected pulsar is merely one that was classified as a detection by the LOTAAS classifier.
Running the OPP in chunks
This method explains how to run the OPP a chunk at a time, so that when you have a large number of pulsars in the FoV you can save storage space. It is best to run all of these commands in a screen (see Using a screen) from your Nextflow work directory (/astro/mwaops/<user>/<obsid>_work).
...
Repeat these 4 steps until you have used all the pointing chunk files.
Multi-Gaussian Profile Fitting
Should you need to fit a pulse profile, vcstools contains a program to do so using multiple Gaussian components. The purpose of the fitter is not to attempt to model individual beam components, but rather to dissect the profile into a clean mathematical entity that can be used to attain some useful measurements of the profile. The script we use to run the fitter is prof_estimate.py. This script requires a single input file in one of the following formats:
...
The fitter is also capable of fitting noisy profiles:
The MWA Pulsar Database
The MWA Pulsar Database is where we store all of our detections, and it is your responsibility to upload any detections you make.
Setting up your account
To get an account, email Nick Swainston (nickaswainston@gmail.com) and then perform the following steps to set up your account on the Pawsey supercomputers.
...
This will now set your username and password every time you log in, so log in again.
Uploading detections to the database
There are a variety of files that you can upload to the database so it's a good idea to look through the available options with
...
where the <bestprof_file> should end in .pfd.bestprof and the <presto_plot> should end in .pfd.ps.
Pulsar search pipelines
Nick has developed an MWA pulsar search pipeline which is becoming stable enough for most users to start running. These pipelines are very processing heavy and can cause bugs if run in a non-standard way, so it's best to discuss with Nick what sort of search you're planning to do.
Working out a pointing grid
If you are doing a large-scale search (more than a single pointing), for example a search of a supernova remnant, then you will likely have to play with how you grid your pointings using grid.py. If you want to do a single loop (-l 1) of beams at the half-power point (-f 0.5), then you could use the following command:
...
It's always best to look at the output png file to confirm that the grid looks reasonable. Once you are happy with your grid, you can run the search pipelines with --pointing_file <grid.py output>.
Running mwa_search_pipeline.nf
The mwa_search_pipeline.nf will beamform and search. A common search command is:
...
```
mwa_search_pipeline.nf --help
```
Running pulsar_search.nf
If you have already beamformed and have the fits files available, you can use pulsar_search.nf.
...
You should only input a single fits file, to prevent errors in the single pulse searcher. The candidates will also be output to /astro/mwavcs/pulsar_search/<obsid>_candidates.
Estimating pulsar fluxes at MWA frequencies
The ATNF pulsar database is an abundant resource for published pulsar data, which may include flux densities over a broad range of frequencies. Should such data be available for a pulsar, we are able to make a first-order estimate of the flux we expect at MWA frequencies.
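The estimate assumes a simple power-law spectrum, S(ν) = S(ν₀) × (ν/ν₀)^α, fitted to the catalogued flux densities and extrapolated to the observing frequency. A minimal sketch of the arithmetic, with made-up numbers, is:

```
# Power-law extrapolation to 154 MHz: S(nu) = S0 * (nu/nu0)^alpha
# S0 (Jy), nu0 (MHz) and alpha below are made-up values for illustration
awk 'BEGIN { S0 = 1.2; nu0 = 400; nu = 154; alpha = -1.37;
             printf "%.3f Jy\n", S0 * (nu/nu0)^alpha }'
```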
...
```
J0630-2834 derived spectral index: -1.3698868864530596 +/- 0.04796669612862415
J0630-2834 flux estimate at 154.24 MHz: 0.41368443304349944 +/- 0.006638533625983657 Jy
Pulsar S/N: 747.0705031130714 +/- 161.9689693670181
```
Nextflow Notes
This section explains the pros and cons of Nextflow and gives some tips on how to run the Nextflow scripts well.
...
- The Nextflow job runs for as long as the pipeline, so it's often best to put it in a screen (explained below)
- Nextflow scripts are newer, so there may be more bugs
Using a screen
Screens allow you to run a command in the background that won't be interrupted when you log out of an ssh session. Here is a good guide on how to use screens: https://www.howtogeek.com/662422/how-to-use-linuxs-screen-command/
I will go through the basics that you will require for running Nextflow.
...
```
nswainston@garrawarla-1:~> ssh garrawarla-2
```
Resuming Nextflow Pipelines
One large benefit of Nextflow pipelines is that you can resume them. Once you have fixed the bug that caused the pipeline to crash, simply relaunch the pipeline with the -resume option added. For the resume option to work, you must run the command from the same directory, and the work directory must not have been deleted.
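For example, for a beamforming run (the other arguments must be identical to the original invocation):

```
# Relaunch from the same directory with the same arguments, plus -resume
beamform.nf <original arguments> -resume
```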
Cleaning up the work directories
Once the pipeline is done and you are confident you don't need to resume it or keep the intermediate files, it is a good idea to remove the Nextflow work directories to save space. By default, the work directories are stored in /astro/mwavcs/$USER/<obsid>_work.
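A sketch of the clean-up (double-check the path before deleting):

```
# Remove the Nextflow work directory once you are sure it is no longer needed
rm -rf /astro/mwavcs/$USER/<obsid>_work
```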
Calibration Combining Method
http://ws.mwatelescope.org/admin/observation/observationsetting/ (username is MWA-guest and the password is bowtie) can be used to search for observations and look at their metadata. To look for calibrators for your observation, input your observation ID and click search, like so:
...
The name format for calibrator observations is the name of the calibrator source, an underscore, and the centre frequency channel ID. Try to find a pair of calibration observations that have the same calibrator source and, together, cover the entire frequency range of the target observation. For the above example, this was 1195317056 and 1195316936. If you can't find any suitable calibration observations, you can keep increasing the time search window, up to 48 hours.
Now that you know which calibration observations you need, download and calibrate them as you normally would, as explained in the Calibration section. It is best to use the same values in flagged_tiles.txt and flagged_channels.txt for all calibration observations to ensure your calibration solutions are consistent. Once the calibration is complete, you can combine the two calibrations into one using the script
...
This will output the combined calibration solution to /astro/mwavcs/vcs/[obs ID]/cal/<first calibration ID>_<second calibration ID>/rts, and you can treat the calibrator ID as <first calibration ID>_<second calibration ID> when it is used in other scripts.
Deprecated Methods
These are old methods that are not maintained, but they may be useful if you need to do something specific or the new scripts have failed.
Download (old python method)
If there are any issues with the above download method, you can always try the previous method using the process_vcs.py script.
...
Each user can have a maximum of 8 concurrent download jobs running.
Checking downloaded data
By default, data are written to /astro/mwavcs/vcs/[obs ID]. Once the observation appears to have downloaded correctly, it is good practice to check. The easiest way is to check the output from the process_vcs.py download command. In /group/mwavcs/vcs/[obs ID] there is a subdirectory labelled "batch", within which are stored all of the submitted SLURM scripts and their output. If there has been a problem in downloading data, this will appear as up to 10 files named "check_volt*.out", i.e.
...
The total data volume downloaded will vary, but for maximum-duration VCS observations this can easily be ~40 TB of raw data alone. It is therefore important to keep in mind the amount of data you are processing and the disk space you are consuming. If only the raw voltages have been downloaded, then you will need to recombine the data yourself, which doubles the amount of data (see the next section).
Beamforming (old python method)
There are a few different modes in which the beamformer can be operated, and these modes produce different outputs.
...
where ["RA string"]
is formatted as "hh:mm:ss.ss" and ["DEC string"]
is formatted as "dd:mm:ss.ss" (including the sign: "+" or "-"). If you also want the incoherent sum, you can pass the --incoh
flag.
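For instance, a pointing at the Crab pulsar would be written as follows (illustrative coordinates; note the explicit sign on the declination):

```
# Example coordinate strings in the required formats
"05:34:31.97" "+22:00:52.06"
```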
Splicing channels together
We are able to splice together adjacent frequency channels for each 200-second chunk using the splice_psrfits program. In its current form, it accepts all the files you want to splice into one (i.e. the 24 coarse-channel data files for an individual 200-second chunk), plus an output suffix. It is invoked as follows:
...