Input for SNaQ

SNaQ is a method implemented in the package to estimate a phylogenetic network from multiple molecular sequence alignments. There are two alternatives for the input data:

  1. A list of estimated gene trees for each locus, which can be obtained using MrBayes or RAxML. Or:
  2. A table of concordance factors (CF), i.e. gene tree frequencies, for each 4-taxon subset. This table can be obtained from BUCKy, to account for gene tree uncertainty

This pipeline can be used to obtain the table of quartet CF needed as input for SNaQ (see also the wiki.) It starts from the sequence alignments, runs MrBayes and then BUCKy (both parallelized), producing the table of estimated CFs and their credibility intervals. Additional details on this TICR pipeline describe how to insert data at various stages (e.g. after running MrBayes on each locus).

Tutorial data: gene trees

We suggest that you create a special directory for running these examples, where input files can be downloaded and where output files will be created (with estimated networks for instance). Enter this directory and run Julia from there.

Suppose you have a file with a list of gene trees in parenthetical format called raxmltrees.tre. You can access the example file of input trees here or here for easier download.

Do not copy-paste into a "smart" text-editor. Instead, save the file directly into your working directory using "save link as" or "download linked file as". This file contains 30 gene trees, each in parenthetical format on 6 taxa like this (with rounded branch lengths):

(E:0.038,((A:0.014,B:0.010):0.010,(C:0.008,D:0.002):0.010):0.025,O:0.078);

If raxmltrees.tre is in your working directory, you can view its content within Julia:

less("raxmltrees.tre")

or like this, to view the version downloaded with the package:

raxmltrees = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","raxmltrees.tre")
less(raxmltrees)

Just type q to quit viewing this file. You could read in these 30 trees and visualize the third one (say) like this:

using PhyloNetworks
raxmltrees = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","raxmltrees.tre");
julia> genetrees = readMultiTopology(raxmltrees);
julia> genetrees[3]HybridNetwork, Rooted Network 9 edges 10 nodes: 6 tips, 0 hybrid nodes, 4 internal tree nodes. tip labels: E, A, B, C, ... ((E:0.015,(A:0.006,B:0.006):0.003):0.041,(C:0.006,D:0.0):0.041,O:0.052);

To visualize any of these input trees, use the PhyloPlots package:

using PhyloPlots
plot(genetrees[3]); # tree for 3rd gene

gene3

To read in all gene trees and directly summarize them by a list of quartet CFs (proportion of input trees with a given quartet):

julia> q,t = countquartetsintrees(genetrees); # read in trees, calculate quartet CFsReading in trees, looking at 15 quartets in each...
0+------------------------------+100%
  ******************************
julia> df = writeTableCF(q,t) # data frame with observed CFs: gene frequencies15×9 DataFrame Row │ qind t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ⋯ │ Int64 String String String String Float64 Float64 Float64 ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ 1 A B C D 1.0 0.0 0.0 ⋯ 2 │ 2 A B C E 0.833333 0.0333333 0.133333 3 │ 3 A B D E 0.833333 0.0333333 0.133333 4 │ 4 A C D E 0.0 0.0 1.0 5 │ 5 B C D E 0.0 0.0 1.0 ⋯ 6 │ 6 A B C O 0.866667 0.0 0.133333 7 │ 7 A B D O 0.866667 0.0 0.133333 8 │ 8 A C D O 0.0 0.0 1.0 9 │ 9 B C D O 0.0 0.0 1.0 ⋯ 10 │ 10 A B E O 0.833333 0.0666667 0.1 11 │ 11 A C E O 0.533333 0.333333 0.133333 12 │ 12 B C E O 0.666667 0.266667 0.0666667 13 │ 13 A D E O 0.533333 0.333333 0.133333 ⋯ 14 │ 14 B D E O 0.666667 0.266667 0.0666667 15 │ 15 C D E O 1.0 0.0 0.0 1 column omitted
julia> using CSV
julia> CSV.write("tableCF.csv", df); # to save the data frame to a file
julia> raxmlCF = readTableCF("tableCF.csv") # read in the file and produces a "DataCF" object30.0 gene trees per 4-taxon set Object DataCF number of quartets: 15

less("tableCF.csv") lets you see the content of the newly created file "tableCF.csv", within Julia. Again, type q to quit viewing this file.

In this table, each 4-taxon set is listed in one row. The 3 "CF" columns gives the proportion of genes that has each of the 3 possible trees on these 4 taxa.

For more help on any function, type ? to enter the help mode, then type the name of the function. For example: type ? then countquartetsintrees for information on the various options of that function.

When there are many more taxa, the number of quartets might be very large and we might want to use a subset to speed things up. Here, if we wanted to use a random sample of 10 quartets instead of all quartets, we could do:

raxmlCF = readTrees2CF(raxmltrees, whichQ="rand", numQ=10, CFfile="tableCF10.txt")

Be careful to use a numQ value smaller than the total number of possible 4-taxon subsets, which is n choose 4 on n taxa (e.g. 15 on 6 taxa). To get a predictable random sample, you may set the seed with using Random; Random.seed!(12321) (for instance) prior to sampling the quartets as above. The readTrees2CF is much slower than the function countquartetsintrees to read in trees and calculate the quartet CFs observed in the trees, when we want to get all quartet CFs. But for a small sample of quartets, then readTrees2CF is available.

Tutorial data: quartet CFs

If we already have a table of quartet concordance factor (CF) values in a file buckyCF.csv in this format

Taxon1Taxon2Taxon3Taxon4CF12_34CF13_24CF14_23
DAEO0.5650.09030.3447
......

we would read it in one step like this: readTableCF("buckyCF.csv"). An example file comes with the package, available here or here.

julia> buckyCFfile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","buckyCF.csv");
julia> buckyCF = readTableCF(buckyCFfile)30.0 gene trees per 4-taxon set Object DataCF number of quartets: 15

The same thing could be done in 2 steps: first to read the file and convert it to a 'DataFrame' object, and then to convert this DataFrame into a DataCF object.

julia> using CSV, DataFrames
julia> dat = DataFrame(CSV.File(buckyCFfile); copycols=false);
julia> first(dat, 6) # to see the first 6 rows6×14 DataFrame Row │ taxon1 taxon2 taxon3 taxon4 CF12.34 CF12.34_lo CF12.34_hi ⋯ │ String1 String1 String1 String1 Float64 Float64 Float64 ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ D A E O 0.565033 0.5 0.633333 ⋯ 2 │ D A E B 0.0005 0.0 0.0 3 │ C A E B 0.0005 0.0 0.0 4 │ C A E O 0.565033 0.5 0.633333 5 │ D A O B 3.33333e-5 0.0 0.0 ⋯ 6 │ C D E O 0.999867 1.0 1.0 7 columns omitted
julia> buckyCF = readTableCF(dat)30.0 gene trees per 4-taxon set Object DataCF number of quartets: 15
julia> writeTableCF(buckyCF)15×8 DataFrame Row │ t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 nge ⋯ │ String String String String Float64 Float64 Float64 Flo ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ D A E O 0.565033 0.0903 0.344667 ⋯ 2 │ D A E B 0.0005 0.8599 0.139567 3 │ C A E B 0.0005 0.8599 0.139567 4 │ C A E O 0.565033 0.0903 0.344667 5 │ D A O B 3.33333e-5 0.8885 0.111467 ⋯ 6 │ C D E O 0.999867 6.66667e-5 6.66667e-5 7 │ C E O B 0.0401667 0.263067 0.696767 8 │ C D A E 1.0 0.0 0.0 9 │ C D O B 0.999867 6.66667e-5 6.66667e-5 ⋯ 10 │ C D E B 1.0 0.0 0.0 11 │ D E O B 0.0401667 0.263067 0.696767 12 │ C D A O 0.999867 6.66667e-5 6.66667e-5 13 │ C D A B 1.0 0.0 0.0 ⋯ 14 │ A E O B 0.0731667 0.0424667 0.884367 15 │ C A O B 3.33333e-5 0.8885 0.111467 1 column omitted

In the input file, columns need to be in the right order: with the first 4 columns giving the names of the taxa in each 4-taxon set. The CF values are assumed to be in columns named "CF12_34", etc., or else in columns 5,6,7. If available, a column named "ngenes" will be taken to have the the number of genes for each 4-taxon subset.

Tutorial data: starting tree

If we have a tree for the data set at hand, it can be used as a starting point for the optimization. From our gene trees, we estimated a species tree with ASTRAL. This tree comes with the package in file astral.tre here. This file has 102 trees: 100 bootstrap species trees, followed by their greedy consensus, followed by the best tree on the original data. It's this last tree that we are most interested in. We can read it with

astralfile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","astral.tre");
astraltree = readMultiTopology(astralfile)[102] # 102th tree: last tree here
plot(astraltree, showedgelength=true);

astraltree

To start its search, SNaQ will need a network of "level 1". All trees and all networks with 1 hybridization are of level 1. To make sure that a network with 2 or more hybridizations is of level 1, we can read it in with readTopologyLevel1 (which also unroots the tree, resolves polytomies, replaces missing branch lengths by 1 for starting values etc.):

T=readTopologyLevel1("startNetwork.txt")

(here startNetwork.txt is a hypothetical file: replace this by the name of a file that contains your network of interest.)

Next: Getting a Network