This is the pipeline used in the Pandan Plastome Phylogenetics Project manuscript. All data (raw, intermediate, and final) and code are stored in a single director, with its subdirectories organized thusly:
## levelName
## 1 PandanPlastomePhylogeneticsProject2024
## 2 ¦--0_Raw_Data
## 3 ¦--1_Functions
## 4 ¦ °--SoilsSeeR
## 5 ¦--2_Scripts
## 6 ¦--3_Plastome_Reconstruction
## 7 ¦ ¦--3.0_Uncurated_Reconstructions
## 8 ¦ °--3.1_Curated_Reconstructions
## 9 ¦--4_Plastome_Genes_Analysis
## 10 ¦--5_ML_Analysis
## 11 ¦--6_Phyly_Analysis
## 12 ¦ ¦--Constraint_Newicks
## 13 ¦ ¦--Trees
## 14 ¦ ¦--Clades
## 15 ¦ °--Subgenera
## 16 ¦--7_BEAST_Analysis
## 17 ¦ °--Trees
## 18 ¦--8_Climate_Analysis
## 19 ¦--9_Soils_Analysis
## 20 ¦--10_Hydrenchyma_Images
## 21 ¦--11_Morphological_Analysis
## 22 °--12_External_Data
## 23 ¦--DSMW
## 24 °--wc2.1_30s_bio
Input is in the form of sets of compressed fastq reads (forward and reverse). Forward and reverse files are designated by R1 vs R2 in the filename. They are available on the SRA. Below is a table with their accession numbers.
library(knitr)
uwu <- read.table("/Users/Mike/Downloads/metadata-14388163-processed-ok.tsv",sep = "\t")
colnames(uwu) <- uwu[1,]
uwu <- uwu[-1,]
uwu <- uwu[,colnames(uwu) %in% c("accession","biosample_accession","bioproject_accession","filetype")]
knitr::kable(uwu,col.names=colnames(uwu))
Illumina® barcodes were removed directly after sequencing. Paired-end (PE) reads were trimmed (i.e., had low-quality reads and their mates removed, as well as adaptors removed) using Trimmomatic (Bolger, 2014) with default parameters (i.e., threshold set to 33). Trimmed PE reads were then merged using interleave_fastq (Garrison and Boisvert, 2021). The template below was used to perform the analysis:
#!/bin/bash
#R1_reads is a placeholder for the forward reads filename
#R2_reads is a placeholder for the reverse reads filename
java -jar /usr/local/Cellar/trimmomatic/0.39/libexec/trimmomatic-0.39.jar PE R1_reads.fastq R2_reads.fastq R1_reads_P.fastq R1_reads_U.fastq R2_reads_P.fastq R2_reads_U.fastq -threads 10 -phred33 HEADCROP:20 SLIDINGWINDOW:4:30 MINLEN:30
To reconstruct plastomes, we used a hybrid de novo assembly approach (MITObim), with the P. tectorius (Wang B247) plastome from Tan et al. (2019; file renamed NC_042747_plastome.fasta) as the reference and a maximum of 30 iterations for each set of PE reads (Hahn et al., 2013). MITObim uses a hybrid de novo assembly approach, taking advantage of the first round of reference-based assembly (where only reads matching 100% of the reference genome were conserved) to seed a de novo assembly (done using MIRA 4.0.2, Hahn et al., 2013). This approach is scalable for assembling thousands of plastomes and is more time- and resource-efficient than pure de novo assembly methods (Hahn et al., 2013). Assembled plastome sequences (along with the reference sequence) were written in a fasta file and aligned using MAFFT v. 7.505 (Katoh et al., 2002) using default parameters. Alignments were then manually curated using Unipro UGENE (Okonechnikov et al., 2012). Manual curation identified the sort single-copy (SSC) region as being reverse-complemented in our reconstructions relative to that from Tan et al. (2019) and of the Carludovia palmata (Chase 14836) outgroup sequence from Mennes et al. (2015). We then corrected the orientation of the SSC in our reconstructions using Unipro UGENE.
#!/bin/bash
./interleave_fastq.sh R1_reads_P.fastq R2_reads_P.fastq > R12_reads_P.fastq
MITObim.pl -start 0 -end 100 -sample samplename -ref genome -readpool R12_reads_P.fastq --quick NC_042747_plastome.fasta -mirapath /Users/Mike/mira_4.0.2_darwin13.1.0_x86_64_static/bin --clean --verbose --paired
mafft --auto RevCompGood.fasta > AlignedRevCompGood.fasta
install.packages(c("remotes","seqinr","genbankr", "zoo", "ape","pegas","phangorn", "treeio", "phyloch","strap","coda","factoextra","FactoMineR","raster","varSelRF","ranger","devtools","measurements","readxl","geiger","sp","sf"))
remotes::install_github("wojahn/PrideBar")
remotes::install_github("wojahn/SoilSeeR")
To compare the plastomes’ phylogenetic potential relative to the regions used in prior studies, we inferred phylogenetic information for the sequences using PAUP* 4.0b10 (Swofford, 2003). The information obtained included No. of ingroup sampled species, No. sequences incl. outgroup, Sequence length range, Alignment length, Missing data (percentage of ingroup sequences; in brackets percentage of nucleotides for the combined analyses), No. constant characters (%), No. variable characters (%), and No. potentially parsimony-informative (PI) characters (%). Mean amount of phylogenetic information per sample (averaged by variable sites number/PI sites number). To determine where (i.e., in which regions, and in which genes or intergenic spacers) the plastomes’ nucleotide diversity (and thus potential phylogenetic signal) lay, we computed Nei nucleotide diversity (Nei, 1987) for each of the four regions of the plastome. The full pipeline for this part is contained in the R script Genewise_Nei_Analysis.R in the supplementary materials. Optimal moving window size was determined by taking the median of the gene sizes in Tan et al. (2019)’s annotation of their P. tectorius (Wang B247) plastome. This optimal window size was then used in the pipeline developed in Simmonds et al. (2021) to calculate Nei nucleotide diversity. The pipeline used the R packages ape, pegas (Paradis, 2010), seqinr (Charif and Lobry, 2007), and zoo (Zeileis and Grothendieck, 2005). For each of the four regions, peaks whose heights were in the 95th quantile and above had their location in the plastome and any associated gene(s)/intergenic spacer(s) recorded.
####################################################
# Inferring Median Gene Length
####################################################
PanTecGB <- genbankr::parseGenBank("4_Plastome_Genes_Analysis/sequence.gb")
message("...extracting features")
PanTecGB_f <- PanTecGB[["FEATURES"]]
Infoz <- as.data.frame(matrix(nrow=0,ncol=5))
colnames(Infoz) <- c("start","end","strand","type","name")
PrideBar::SetPrideBar(1,length(PanTecGB_f),2)
for(i in 1:length(PanTecGB_f))
{
PrideBar::PrideBar()
#message(sprintf("......processing No. %s of %s",i,length(PanTecGB_f)))
tmp <- as.data.frame(PanTecGB_f[[i]])
if(nrow(tmp) == 1)
{
toadd <- as.data.frame(matrix(nrow=1,ncol=5))
colnames(toadd) <- c("start","end","strand","type","name")
toadd[1,1] <- tmp[1,2]
toadd[1,2] <- tmp[1,3]
toadd[1,3] <- tmp[1,4]
toadd[1,4] <- tmp[1,5]
toadd[1,5] <- tmp[1,6]
Infoz <- rbind(Infoz,toadd)
}else{
toadd <- as.data.frame(matrix(nrow=nrow(tmp),ncol=5))
colnames(toadd) <- c("start","end","strand","type","name")
for(j in 1:nrow(tmp))
{
toadd[j,1] <- tmp[j,2]
toadd[j,2] <- tmp[j,3]
toadd[j,3] <- tmp[j,4]
toadd[j,4] <- tmp[j,5]
toadd[j,5] <- tmp[j,6]
}
Infoz <- rbind(Infoz,toadd)
}
}
PrideBar::FadePrideBar()
Infoz_smol <- Infoz[!Infoz[,2] %in% c("source","misc_feature","repeat_region"),]
Lengthz <- Infoz_smol[,2] - Infoz_smol[,1]
median(Lengthz) #272
mean(Lengthz) #1692.753
pdf("4_Plastome_Genes_Analysis/GeneLengthHistogram.pdf")
hist(Lengthz)
dev.off()
# Heavily skewed smaller, so we'll use median, round up to 300
####################################################
# Nei CBP
####################################################
message("Parsing GenBank File...")
GBFile <- "4_Plastome_Genes_Analysis/sequence.gb"
ToAnal <- "3_Plastome_Reconstruction/AlignedRevCompGood.fasta"
library(genbankr)
PanTecGB <- genbankr::parseGenBank(GBFile)
message("...extracting features")
PanTecGB_f <- PanTecGB[["FEATURES"]]
Infoz <- as.data.frame(matrix(nrow=0,ncol=5))
colnames(Infoz) <- c("start","end","strand","type","name")
PrideBar::SetPrideBar(1,length(PanTecGB_f),2)
for(i in 1:length(PanTecGB_f))
{
#message(sprintf("......processing No. %s of %s",i,length(PanTecGB_f)))
PrideBar::PrideBar()
tmp <- as.data.frame(PanTecGB_f[[i]])
if(nrow(tmp) == 1)
{
toadd <- as.data.frame(matrix(nrow=1,ncol=5))
colnames(toadd) <- c("start","end","strand","type","name")
toadd[1,1] <- tmp[1,2]
toadd[1,2] <- tmp[1,3]
toadd[1,3] <- tmp[1,4]
toadd[1,4] <- tmp[1,5]
toadd[1,5] <- tmp[1,6]
Infoz <- rbind(Infoz,toadd)
}else{
toadd <- as.data.frame(matrix(nrow=nrow(tmp),ncol=5))
colnames(toadd) <- c("start","end","strand","type","name")
for(j in 1:nrow(tmp))
{
toadd[j,1] <- tmp[j,2]
toadd[j,2] <- tmp[j,3]
toadd[j,3] <- tmp[j,4]
toadd[j,4] <- tmp[j,5]
toadd[j,5] <- tmp[j,6]
}
Infoz <- rbind(Infoz,toadd)
}
}
PrideBar::FadePrideBar()
AWD <- seqinr::read.fasta(ToAnal)
message("...isolating reference")
AWD_ref <- as.character(AWD[[which(names(AWD) == "gi|1693290702|ref|NC_042747.1|")]])
themz <- which(as.character(AWD_ref) != "-")
TruNumz <- rep(NA, length(AWD_ref))
for(i in 1:length(themz))
{
TruNumz[as.numeric(themz[i])] <- i
}
# turn to df
uwu <- as.data.frame(matrix(nrow=length(AWD),ncol=length(as.character(AWD[[1]]))))
for(i in 1:length(AWD))
{
message(sprintf("Processing No. %s of %s",i,length(AWD)))
uwu[i,] <- as.vector(as.character(AWD[[i]]))
}
LSCi <- Infoz[Infoz$name == "large single copy region (LSC)",]
IRBi <- Infoz[Infoz$name == "inverted repeat B (IRB)",]
SSCi <- Infoz[Infoz$name == "small single copy region (SSC)",]
IRAi <- Infoz[Infoz$name == "inverted repeat A (IRA)",]
countR <- 0
RealNumz <- rep(NA, length(AWD_ref))
for(i in 1:length(AWD_ref))
{
if(AWD_ref[i] != "-")
{
countR <- countR + 1
RealNumz[i] <- countR
}
}
LSCnumz <- seq(LSCi[1,1],LSCi[1,2],1)
IRBnumz <- seq(IRBi[1,1],IRBi[1,2],1)
SSCnumz <- seq(SSCi[1,1],SSCi[1,2],1)
IRAnumz <- seq(IRAi[1,1],IRAi[1,2],1)
LSC_only <- uwu[,seq(which(RealNumz == LSCi[1,1]),which(RealNumz == LSCi[1,2]))]
IRB_only <- uwu[,seq(which(RealNumz == IRBi[1,1]),which(RealNumz == IRBi[1,2]))]
SSC_only <- uwu[,seq(which(RealNumz == SSCi[1,1]),which(RealNumz == SSCi[1,2]))]
IRA_only <- uwu[,seq(which(RealNumz == IRAi[1,1]),which(RealNumz == IRAi[1,2]))]
LSC <- LSC_only
tmp <- LSC
outlist <- list()
for(i in 1:nrow(tmp))
{
message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(AWD), "4_Plastome_Genes_Analysis/LSC_Good.fasta")
#IRB
IRB <- IRB_only
tmp <- IRB
outlist <- list()
for(i in 1:nrow(tmp))
{
message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(AWD), "4_Plastome_Genes_Analysis/IRB_Good.fasta")
SSC <- SSC_only
tmp <- SSC
outlist <- list()
for(i in 1:nrow(tmp))
{
message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(AWD), "4_Plastome_Genes_Analysis/SSC_Good.fasta")
IRA <- IRA_only
tmp <- IRA
outlist <- list()
for(i in 1:nrow(tmp))
{
message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(AWD), "4_Plastome_Genes_Analysis/IRA_Good.fasta")
gc()
LSC <- seqinr::read.fasta("4_Plastome_Genes_Analysis/LSC_Good.fasta")
IRB <- seqinr::read.fasta("4_Plastome_Genes_Analysis/IRB_Good.fasta")
SSC <- seqinr::read.fasta("4_Plastome_Genes_Analysis/SSC_Good.fasta")
IRA <- seqinr::read.fasta("4_Plastome_Genes_Analysis/IRA_Good.fasta")
###########################################
# SLIDING WINDOW PART
###########################################
# Sliding window across cpDNA genome
require(seqinr)
require(zoo)
require(ape)
require(pegas)
# All first
Gen <- seqinr::read.fasta(ToAnal)
genelength <- 300
# Establish moving window
seq <- seq(from=1, to=length(Gen[[1]]), by=genelength)
# Infer nucleotide diversity for all species in matrix
INFO <- matrix(ncol=2, nrow=length(seq))
rownames(INFO) <- seq
colnames(INFO) <- c("Nuc_div", "N_Poly_char")
pb = txtProgressBar(min = 0, max = length(seq), initial = 0)
for(i in 1:length(seq))
{
setTxtProgressBar(pb,i)
# Matrix
val <- as.numeric(seq[i]):as.numeric(seq[i]+genelength-1)
Short <- matrix(ncol=genelength, nrow=length(names(Gen)))
for(j in 1:length(names(Gen)))
{
Short[j,] <- Gen[[j]][val]
}
# Infer nucleotide diversity and number of polymorphic char
INFO[i,1] <- nuc.div(as.DNAbin(Short))
INFO[i,2] <- length(seg.sites(as.DNAbin(Short)))
}
close(pb)
INFO[,1] <- gsub('NaN', '0', INFO[,1])
# Write results out
write.csv(INFO, file="4_Plastome_Genes_Analysis/All_moving_window_nuc_div.csv", quote = F)
INFO_all <- INFO
# Aligned cpDNA genomes
Gen <- seqinr::read.fasta('4_Plastome_Genes_Analysis/IRA_Good.fasta')
genelength <- 300
#Gen <- Gen[grepl("Pandanus",names(Gen))]
# Establish moving window
seq <- seq(from=1, to=length(Gen[[1]]), by=genelength)
# Infer nucleotide diversity for all species in matrix
INFO <- matrix(ncol=2, nrow=length(seq))
rownames(INFO) <- seq
colnames(INFO) <- c("Nuc_div", "N_Poly_char")
pb = txtProgressBar(min = 0, max = length(seq), initial = 0)
for(i in 1:length(seq))
{
setTxtProgressBar(pb,i)
# Matrix
val <- as.numeric(seq[i]):as.numeric(seq[i]+genelength-1)
Short <- matrix(ncol=genelength, nrow=length(names(Gen)))
for(j in 1:length(names(Gen)))
{
Short[j,] <- Gen[[j]][val]
}
# Infer nucleotide diversity and number of polymorphic char
INFO[i,1] <- nuc.div(as.DNAbin(Short))
INFO[i,2] <- length(seg.sites(as.DNAbin(Short)))
}
close(pb)
INFO[,1] <- gsub('NaN', '0', INFO[,1])
# Write results out
write.csv(INFO, file="4_Plastome_Genes_Analysis/IRA_moving_window_nuc_div.csv", quote = F)
INFO_IRA_T <- INFO
Gen <- seqinr::read.fasta('4_Plastome_Genes_Analysis/IRB_Good.fasta')
genelength <- 300
#Gen <- Gen[grepl("Pandanus",names(Gen))]
# Establish moving window
seq <- seq(from=1, to=length(Gen[[1]]), by=genelength)
# Infer nucleotide diversity for all species in matrix
INFO <- matrix(ncol=2, nrow=length(seq))
rownames(INFO) <- seq
colnames(INFO) <- c("Nuc_div", "N_Poly_char")
pb = txtProgressBar(min = 0, max = length(seq), initial = 0)
for(i in 1:length(seq))
{
setTxtProgressBar(pb,i)
# Matrix
val <- as.numeric(seq[i]):as.numeric(seq[i]+genelength-1)
Short <- matrix(ncol=genelength, nrow=length(names(Gen)))
for(j in 1:length(names(Gen)))
{
Short[j,] <- Gen[[j]][val]
}
# Infer nucleotide diversity and number of polymorphic char
INFO[i,1] <- nuc.div(as.DNAbin(Short))
INFO[i,2] <- length(seg.sites(as.DNAbin(Short)))
}
close(pb)
INFO[,1] <- gsub('NaN', '0', INFO[,1])
# Write results out
write.csv(INFO, file="4_Plastome_Genes_Analysis/IRB_moving_window_nuc_div.csv", quote = F)
INFO_IRB_T <- INFO
Gen <- seqinr::read.fasta('4_Plastome_Genes_Analysis/LSC_Good.fasta')
genelength <- 300
#Gen <- Gen[grepl("Pandanus",names(Gen))]
# Establish moving window
seq <- seq(from=1, to=length(Gen[[1]]), by=genelength)
# Infer nucleotide diversity for all species in matrix
INFO <- matrix(ncol=2, nrow=length(seq))
rownames(INFO) <- seq
colnames(INFO) <- c("Nuc_div", "N_Poly_char")
pb = txtProgressBar(min = 0, max = length(seq), initial = 0)
for(i in 1:length(seq))
{
setTxtProgressBar(pb,i)
# Matrix
val <- as.numeric(seq[i]):as.numeric(seq[i]+genelength-1)
Short <- matrix(ncol=genelength, nrow=length(names(Gen)))
for(j in 1:length(names(Gen)))
{
Short[j,] <- Gen[[j]][val]
}
# Infer nucleotide diversity and number of polymorphic char
INFO[i,1] <- nuc.div(as.DNAbin(Short))
INFO[i,2] <- length(seg.sites(as.DNAbin(Short)))
}
close(pb)
INFO[,1] <- gsub('NaN', '0', INFO[,1])
# Write results out
write.csv(INFO, file="4_Plastome_Genes_Analysis/LSC_moving_window_nuc_div.csv", quote = F)
INFO_LSC_T <- INFO
Gen <- seqinr::read.fasta('4_Plastome_Genes_Analysis/SSC_Good.fasta')
genelength <- 300
#Gen <- Gen[grepl("Pandanus",names(Gen))]
# Establish moving window
seq <- seq(from=1, to=length(Gen[[1]]), by=genelength)
# Infer nucleotide diversity for all species in matrix
INFO <- matrix(ncol=2, nrow=length(seq))
rownames(INFO) <- seq
colnames(INFO) <- c("Nuc_div", "N_Poly_char")
pb = txtProgressBar(min = 0, max = length(seq), initial = 0)
for(i in 1:length(seq))
{
setTxtProgressBar(pb,i)
# Matrix
val <- as.numeric(seq[i]):as.numeric(seq[i]+genelength-1)
Short <- matrix(ncol=genelength, nrow=length(names(Gen)))
for(j in 1:length(names(Gen)))
{
Short[j,] <- Gen[[j]][val]
}
# Infer nucleotide diversity and number of polymorphic char
INFO[i,1] <- nuc.div(as.DNAbin(Short))
INFO[i,2] <- length(seg.sites(as.DNAbin(Short)))
}
close(pb)
INFO[,1] <- gsub('NaN', '0', INFO[,1])
# Write results out
write.csv(INFO, file="4_Plastome_Genes_Analysis/SSC_moving_window_nuc_div.csv", quote = F)
INFO_SSC_T <- INFO
#############################################
# MAKING NEI DIVERSITY FIGURE
#############################################
INFO_IRA_T <- read.csv("4_Plastome_Genes_Analysis/IRA_moving_window_nuc_div.csv")
INFO_IRB_T <- read.csv("4_Plastome_Genes_Analysis/IRB_moving_window_nuc_div.csv")
INFO_LSC_T <- read.csv("4_Plastome_Genes_Analysis/LSC_moving_window_nuc_div.csv")
INFO_SSC_T <- read.csv("4_Plastome_Genes_Analysis/SSC_moving_window_nuc_div.csv")
{
pdf("4_Plastome_Genes_Analysis/300_bp_NeiPartitions_2024.pdf")
par(mfrow = c(2, 2))
plot(as.numeric(INFO_LSC_T[,1])/1000, as.numeric(INFO_LSC_T[,2]), type='l', xlab='Alignment (Kbp)', ylab='Nucleotide diversity', ylim=c(0, 0.1),las = 2,main="LSC")
abline(h = quantile(as.numeric(INFO_LSC_T[,1]), .95), lwd = 1, col="blue")
plot(as.numeric(INFO_IRB_T[,1])/1000, as.numeric(INFO_IRB_T[,2]), type='l', xlab='Alignment (Kbp)', ylab='Nucleotide diversity', ylim=c(0, 0.1),las = 2, main="IRB")
abline(h = quantile(as.numeric(INFO_IRB_T[,1]), .95), lwd = 1, col="blue")
plot(as.numeric(INFO_SSC_T[,1])/1000, as.numeric(INFO_SSC_T[,2]), type='l', xlab='Alignment (Kbp)', ylab='Nucleotide diversity', ylim=c(0, 0.1),las = 2, main="SSC")
abline(h = quantile(as.numeric(INFO_SSC_T[,1]), .95), lwd = 1, col="blue")
plot(as.numeric(INFO_IRA_T[,1])/1000, as.numeric(INFO_IRA_T[,2]), type='l', xlab='Alignment (Kbp)', ylab='Nucleotide diversity', ylim=c(0, 0.1),las = 2, main="IRA")
abline(h = quantile(as.numeric(INFO_IRA_T[,1]), .95), lwd = 1, col="blue")
dev.off()
}
INFO <- read.csv("4_Plastome_Genes_Analysis/All_moving_window_nuc_div.csv")
pdf("4_Plastome_Genes_Analysis/300_bp_NeiPartitions_Whole_2024.pdf")
par(mfrow = c(2, 2))
plot(as.numeric(row.names(INFO_all))/1000, as.numeric(INFO_all[,1]), type='l', xlab='Alignment (Kbp)', ylab='Nucleotide diversity', ylim=c(0, 0.1),las = 2,main="LSC")
abline(h = quantile(as.numeric(INFO_all[,1]), .95), lwd = 1, col="blue")
dev.off()
####################################################
# Exploring Peaks
####################################################
source("1_Functions/Peaks2Annotations.R")
#If unannotated Gene before then gene after for intergenic spacer
# add addrow with number for peak
# Table with arrow number, partition, gene, and diversity
#Put other regions in past (12 and 15) and tell Nei gd as comparison (could be in gray, if not retrieved in our system)
GBFile <- "4_Plastome_Genes_Analysis/sequence.gb"
ToAnal <- "4_Plastome_Genes_Analysis/LSC_Good.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/LSC_moving_window_nuc_div.csv"
Wherez <- "large single copy region (LSC)"
RefTaxon <- "Pandanus tectorius"
LSC_Infoz <- Peaks2Annotations(ToAnal,GBFile,NeiDiv,Wherez,RefTaxon)
write.csv(LSC_Infoz,"4_Plastome_Genes_Analysis/LSC_Infoz.csv",row.names=F)
GBFile <- "4_Plastome_Genes_Analysis/sequence.gb"
ToAnal <- "4_Plastome_Genes_Analysis/IRA_Good.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/IRA_moving_window_nuc_div.csv"
Wherez <- "inverted repeat A (IRA)"
RefTaxon <- "Pandanus tectorius"
IRA_Infoz <- Peaks2Annotations(ToAnal,GBFile,NeiDiv,Wherez,RefTaxon)
write.csv(IRA_Infoz,"4_Plastome_Genes_Analysis/IRA_Infoz.csv",row.names=F)
GBFile <- "4_Plastome_Genes_Analysis/sequence.gb"
ToAnal <- "4_Plastome_Genes_Analysis/IRB_Good.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/IRB_moving_window_nuc_div.csv"
Wherez <- "inverted repeat B (IRB)"
RefTaxon <- "Pandanus tectorius"
IRB_Infoz <- Peaks2Annotations(ToAnal,GBFile,NeiDiv,Wherez,RefTaxon)
write.csv(IRB_Infoz,"4_Plastome_Genes_Analysis/IRB_Infoz.csv",row.names=F)
GBFile <- "/Users/Mike/Downloads/sequence.gb"
ToAnal <- "4_Plastome_Genes_Analysis/SSC_Good.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/SSC_moving_window_nuc_div.csv"
Wherez <- "small single copy region (SSC)"
RefTaxon <- "Pandanus tectorius"
SSC_Infoz <- Peaks2Annotations(ToAnal,GBFile,NeiDiv,Wherez,RefTaxon)
write.csv(SSC_Infoz,"4_Plastome_Genes_Analysis/SSC_Infoz.csv",row.names=F)
SSC_Infoz <- read.csv("4_Plastome_Genes_Analysis/SSC_Infoz.csv")
LSC_Infoz <- read.csv("4_Plastome_Genes_Analysis/LSC_Infoz.csv")
IRA_Infoz <- read.csv("4_Plastome_Genes_Analysis/IRA_Infoz.csv")
IRB_Infoz <- read.csv("4_Plastome_Genes_Analysis/IRB_Infoz.csv")
SSC_Infoz <- cbind(rep("SSC",nrow(SSC_Infoz)),SSC_Infoz)
colnames(SSC_Infoz) <- "Partition"
IRB_Infoz <- cbind(rep("IRB",nrow(IRB_Infoz)),IRB_Infoz)
colnames(IRB_Infoz) <- "Partition"
LSC_Infoz <- cbind(rep("LSC",nrow(LSC_Infoz)),LSC_Infoz)
colnames(LSC_Infoz) <- "Partition"
IRA_Infoz <- cbind(rep("IRA",nrow(IRA_Infoz)),IRA_Infoz)
colnames(IRA_Infoz) <- "Partition"
All_Infoz <- rbind(LSC_Infoz,IRB_Infoz,SSC_Infoz,IRA_Infoz)
All_Infoz <- All_Infoz[,-2]
colnames(All_Infoz) <- c("Partition","Peak Nei Diversity","Name")
write.csv(All_Infoz,"4_Plastome_Genes_Analysis/AllInfoz2.csv",row.names = F)
GBFile <- "4_Plastome_Genes_Analysis/sequence.gb"
ToAnal <- "3_Plastome_Reconstruction/AlignedRevCompGood.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/All_moving_window_nuc_div.csv"
Kwery <- "trnL"
source("1_Functions/GetPeakFromName.R")
GetPeakFromName(ToAnal,GBFile,NeiDiv,Kwery="matK")
#0.01016148
#Buerki et al. 2012
#matK (only , not much)
#trnL-trnF (trnL gene is in but not IGS in top 5%) 49127-50108 0.005081316
#trnQ-rps16 (Found in top 5% of SSC peaks in ours) 0.0198589
#Gallaher et al. 2015
#trnL-trnF (trnL gene but not IGS in top 5%) 49127-50108 0.005081316
#ndhF-rpl32 (not in top 5%) 129243-130403 0.002525253
#trnQ-rps16 (Found in top 5% of SSC peaks in ours) 0.0198589
#There was no substantial conflict between the chloroplast and nuclear genomes (congruent topology and BPP > 0.9, >80% bootstrap for major clades, see Supplementary Fig. 1).
GBFile <- "/Users/Mike/Downloads/sequence.gb"
ToAnal <- "3_Plastome_Reconstruction/AlignedRevCompGood.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/All_moving_window_nuc_div.csv"
Kwery1 <- "ndhF"
source("1_Functions/IntergenicPeaks2Annotations.R")
IntergenicPeaks2Annotations(ToAnal,GBFile,NeiDiv,Poz1=49127,Poz2=50108) #0.005081316
IntergenicPeaks2Annotations(ToAnal,GBFile,NeiDiv,Poz1=129243,Poz2=130403) #0.002525253
#######################################
# MAKING ANNOTATED FIGURE
#######################################
Allz <- read.csv("4_Plastome_Genes_Analysis/All_moving_window_nuc_div.csv")
which(TruNumz == 1844) # 2734 matK
which(TruNumz == 49127) # 57966 trnL-trnF mean(c(57966,59226)) 58596
which(TruNumz == 50108) # 59226 trnL-trnF
which(TruNumz == 5910) # 7457 trnQ-rps16 mean(c(7457,9083)) 8270
which(TruNumz == 6950) # 9083 trnQ-rps16
which(TruNumz == 129243) # 150561 ndhF-rpl32 mean(c(150561,152573)) 151567
which(TruNumz == 130403) # 152573 ndhF-rpl32
which(TruNumz == 130403)
tmp <- read.fasta("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/4_Plastome_Genes_Analysis/IRB_Good.fasta")
length(as.character(tmp[[1]]))
#104751 + 28770
INFO_IRA_T <- read.csv("4_Plastome_Genes_Analysis/IRA_moving_window_nuc_div.csv")
INFO_IRB_T <- read.csv("4_Plastome_Genes_Analysis/IRB_moving_window_nuc_div.csv")
INFO_LSC_T <- read.csv("4_Plastome_Genes_Analysis/LSC_moving_window_nuc_div.csv")
INFO_SSC_T <- read.csv("4_Plastome_Genes_Analysis/SSC_moving_window_nuc_div.csv")
library(shape)
{
pdf("4_Plastome_Genes_Analysis/Annotated_300_bp_NeiPartitions_2024.pdf",width=6, height= 6)
{
#par(mfrow = c(2, 2),mar = c(1.5, 3, 1, 1),oma = c(2, 2, 0.1, 0.1),cex.lab=1.7)
par(mfrow = c(2, 2))#,oma = c(1,2,1,1))
par(mar = c(1.5, 4, 1, 0.1))
plot(as.numeric(INFO_LSC_T[,1])/1000, as.numeric(INFO_LSC_T[,2]), type='l', xlab='', ylab='Nucleotide Diversity', ylim=c(0, 0.1), mgp=c(2,1,0))
abline(h = quantile(as.numeric(INFO_LSC_T[,2]), .95), lwd = 1, col="blue")
text(x=2734/1000,y=.075,labels="matK",col="red",srt = 90,font = 3, cex=1)
Arrows(x0=2734/1000,y0=.06,x=2734/1000,y=.02,arr.type="curved", arr.width=0.2,col="red")
text(x=58801/1000,y=.08,labels="trnL-trnF",col="red",srt = 90,font = 3, cex=1)
Arrows(x0=58596/1000,y0=.06,x=58596/1000,y=.02,arr.type="curved", arr.width=0.2,col="red")
text(x=8270/1000,y=.083,labels="trnQ-rps16",col="red",srt = 90,font = 3, cex=1)
Arrows(x0=8270/1000,y0=.06,x=8270/1000,y=.02,arr.type="curved", arr.width=0.2,col="red")
par(mar = c(1.5, 1, 1, 1))
plot(as.numeric(INFO_IRB_T[,1])/1000, as.numeric(INFO_IRB_T[,2]), type='l', xlab='', ylab = "Nucleotide Diversity", ylim=c(0, 0.1),las = 0,yaxt='n',mgp=c(2,1,0))
abline(h = quantile(as.numeric(INFO_IRB_T[,2]), .95), lwd = 1, col="blue")
par(mar = c(3, 4, 1, 0.1))
plot(as.numeric(INFO_SSC_T[,1])/1000, as.numeric(INFO_SSC_T[,2]), type='l', xlab='Alignment (Kbp)', ylab='', ylim=c(0, 0.1),las = 0)
abline(h = quantile(as.numeric(INFO_SSC_T[,2]), .95), lwd = 1, col="blue")
text(x=18365/1000,y=.08,labels="ndhF-rpl32",col="red",srt = 90,font = 3, cex=1)
Arrows(x0=18365/1000,y0=.06,x=18365/1000,y=.02,arr.type="curved", arr.width=0.2,col="red")
par(mar = c(3, 1, 1, 1))
plot(as.numeric(INFO_IRA_T[,1])/1000, as.numeric(INFO_IRA_T[,2]), type='l', xlab='Alignment (Kbp)', ylab="", ylim=c(0, 0.1),las = 0,yaxt = "n")
abline(h = quantile(as.numeric(INFO_IRA_T[,2]), .95), lwd = 1, col="blue")
}
dev.off()
}
pdf("4_Plastome_Genes_Analysis/Annotated_300_bp_NeiPartitions_2024_LSC.pdf")
plot(as.numeric(INFO_LSC_T[,1])/1000, as.numeric(INFO_LSC_T[,2]), type='h', xlab='', ylab='Nucleotide Diversity', ylim=c(0, 0.1))
abline(h = quantile(as.numeric(INFO_LSC_T[,2]), .95), lwd = 1, col="gray")
dev.off()
pdf("4_Plastome_Genes_Analysis/Annotated_300_bp_NeiPartitions_2024_IRB.pdf")
plot(as.numeric(INFO_IRB_T[,1])/1000, as.numeric(INFO_IRB_T[,2]), type='h', xlab='', ylab = "Nucleotide Diversity", ylim=c(0, 0.1),las = 0,yaxt='n')
abline(h = quantile(as.numeric(INFO_IRB_T[,2]), .95), lwd = 1, col="gray")
dev.off()
pdf("4_Plastome_Genes_Analysis/Annotated_300_bp_NeiPartitions_2024_SSC.pdf")
plot(as.numeric(INFO_SSC_T[,1])/1000, as.numeric(INFO_SSC_T[,2]), type='h', xlab='Alignment (Kbp)', ylab='', ylim=c(0, 0.1),las = 0)
abline(h = quantile(as.numeric(INFO_SSC_T[,2]), .95), lwd = 1, col="gray")
dev.off()
pdf("4_Plastome_Genes_Analysis/Annotated_300_bp_NeiPartitions_2024_IRA.pdf")
plot(as.numeric(INFO_IRA_T[,1])/1000, as.numeric(INFO_IRA_T[,2]), type='h', xlab='Alignment (Kbp)', ylab="", ylim=c(0, 0.1),las = 0,yaxt = "n")
abline(h = quantile(as.numeric(INFO_IRA_T[,2]), .95), lwd = 1, col="gray")
dev.off()
# Making partition file
message("Parsing GenBank File...")
library(genbankr)
PanTecGB <- genbankr::parseGenBank(GBFile)
message("...extracting features")
PanTecGB_f <- PanTecGB[["FEATURES"]]
Infoz <- as.data.frame(matrix(nrow=0,ncol=5))
colnames(Infoz) <- c("start","end","strand","type","name")
for(i in 1:length(PanTecGB_f))
{
message(sprintf("......processing No. %s of %s",i,length(PanTecGB_f)))
tmp <- as.data.frame(PanTecGB_f[[i]])
if(nrow(tmp) == 1)
{
toadd <- as.data.frame(matrix(nrow=1,ncol=5))
colnames(toadd) <- c("start","end","strand","type","name")
toadd[1,1] <- tmp[1,2]
toadd[1,2] <- tmp[1,3]
toadd[1,3] <- tmp[1,4]
toadd[1,4] <- tmp[1,5]
toadd[1,5] <- tmp[1,6]
Infoz <- rbind(Infoz,toadd)
}else{
toadd <- as.data.frame(matrix(nrow=nrow(tmp),ncol=5))
colnames(toadd) <- c("start","end","strand","type","name")
for(j in 1:nrow(tmp))
{
toadd[j,1] <- tmp[j,2]
toadd[j,2] <- tmp[j,3]
toadd[j,3] <- tmp[j,4]
toadd[j,4] <- tmp[j,5]
toadd[j,5] <- tmp[j,6]
}
Infoz <- rbind(Infoz,toadd)
}
}
ToAnal <- "3_Plastome_Reconstruction/AlignedRevCompGood.fasta"
AWD <- seqinr::read.fasta(ToAnal)
message("...isolating reference")
AWD_ref <- as.character(AWD[[which(names(AWD) == "gi|1693290702|ref|NC_042747.1|")]])
AWD_c <- data.frame(as.numeric(as.numeric(INFO_AWD_T[,1])),as.numeric(INFO_AWD_T[,2]))
themz <- which(as.character(AWD_ref) != "-")
TruNumz <- rep(NA, length(AWD_ref))
for(i in 1:length(themz))
{
TruNumz[as.numeric(themz[i])] <- i
}
To infer a maximum likelihood tree to be used for testing subgeneric monophyly, evolutionary models for each partition were identified using using ModelTest-NG (Darriba et al., 2020)
untrimmed <- seqinr::read.fasta("/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/AlignedPandanusPlastomes.fasta")
ref <- untrimmed[[which(names(untrimmed) == "gi|1693290702|ref|NC_042747.1|")]]
refchar <- as.character(ref)
countR <- 0
poz <- rep(NA,length(refchar))
refpoz <- rep(NA,length(refchar))
for(i in 1:length(refchar))
{
if(refchar[[i]] != "-")
{
countR <- countR + 1
refpoz[i] <- countR
}else{
poz[i] <- i
}
}
# turn to df
uwu <- as.data.frame(matrix(nrow=length(untrimmed),ncol=length(as.character(untrimmed[[1]]))))
for(i in 1:length(untrimmed))
{
message(sprintf("Processing No. %s of %s",i,length(untrimmed)))
uwu[i,] <- as.vector(as.character(untrimmed[[i]]))
}
#LSC
LSC <- uwu[,c(seq(which(refpoz == 1),which(refpoz == 87448)))]
tmp <- LSC
outlist <- list()
for(i in 1:nrow(tmp))
{
message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(untrimmed), "/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/LSC.fasta")
#IRB
IRB <- uwu[,c(seq(which(refpoz == 87448),which(refpoz == 114149)))]
tmp <- IRB
outlist <- list()
for(i in 1:nrow(tmp))
{
message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(untrimmed), "/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/IRB.fasta")
SSC <- uwu[,c(seq(which(refpoz == 114150),which(refpoz == 132660)))]
tmp <- SSC
outlist <- list()
for(i in 1:nrow(tmp))
{
message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(untrimmed), "/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/SSC.fasta")
IRA <- uwu[,c(seq(which(refpoz == 132661),which(refpoz == 159361)))]
tmp <- IRA
outlist <- list()
for(i in 1:nrow(tmp))
{
message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(untrimmed), "/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/IRA.fasta")
A best-scoring tree was built by RAxML-HPC version 8 using the curated plastome alignment (Stamatakis, 2014). Default parameters were used, the analysis type was changed to rapid bootstrap analysis/search for best-scoring ML tree (-f a), bootstrapping type was set to rapid bootstrapping (-x), four partitions (corresponding to the LSC, IRB, SSC, and IRA) were set (-q), and autoMRE was used to determine when to stop bootstrapping.
raxmlHPC -T 28 -f a -N autoMRE -n Wojahn2024 -s AlignedRevCompGood.fasta -m GTRCAT -p 12345 -x 12345 -o OUTGROUP_Carludovica_palmata_voucher_K_Chase -q RAxML_Partition_Scheme.txt
To test subgeneric phyly and validate cladewise monophyly, we first made constraint newicks in R.
# Make constraint Newick files for each group for which we plan to test phyly
source("1_Functions/ConstraintNewickMakeR.R")
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Kurzia"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Vinsonia"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Pandanus"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Rykia"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Coronata"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Lophostigma"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "I"
KeyCol <- 15
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "II"
KeyCol <- 15
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Ia"
KeyCol <- 14
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "IIa"
KeyCol <- 14
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Ib"
KeyCol <- 14
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "IIb"
KeyCol <- 14
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)
We then used them to make ML best-scoring trees were built by RAxML-HPC version 8 (Stamatakis, 2014) using the curated plastome alignment with four partitions (LSC, IRB, SSC, IRA). Default parameters were used, the analysis type was changed to rapid bootstrap analysis/search for best-scoring ML tree (-f a), bootstrapping type was set to rapid bootstrapping (-x), and autoMRE was used to determine when to stop bootstrapping. Depending on the purpose of the tree, a constraint file was added using the -g option to constrain the clade being tested to monophyly. Here is an example where the subgenus/clad/subclade name is replaced by the placeholder XXX.
raxmlHPC -T 28 -f a -N autoMRE -n Wojahn2024_XXX -s AlignedRevCompGood.fasta -m GTRCAT -p 12345 -x 12345 -o OUTGROUP_Carludovica_palmata_voucher_K_Chase -q RAxML_Partition_Scheme.txt -g Newick_XXX.txt.
The phangorn package was used to conduct the Shimodira-Hasagawa tests in R. Phyly was tested for subgenera and for deep clades (i.e., clades I and II from Buerki et al., 2012 and four subclades Ia, Ib, IIa, and IIb).
setwd("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/6_Phyly_Analysis/Trees/Subgenera")
SubgenTreez <- list.files()
PhylyResultz <- list()
for(i in 1:length(SubgenTreez))
{
nomen <- gsub("RAxML_bipartitions.","",SubgenTreez[i])
nomen <- gsub("BackboneNewLowThread.tree","",nomen)
message(sprintf("Processing Subgenus '%s', No. %s of %s",nomen,i,length(SubgenTreez)))
library(phangorn)
UwUTree1 <- ape::read.tree("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/5_ML_Analysis/PartitionedMLPLastomeTree.tree")
UwUTree2<- ape::read.tree(SubgenTreez[i])
PhyData <- ape::read.FASTA("/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/AlignedRevCompGood.fasta")
PhyData <- phangorn::as.phyDat(PhyData)
fit1 <- phangorn::pml(UwUTree1, PhyData)
fit2 <- phangorn::pml(UwUTree2, PhyData)
fit1 <- phangorn::optim.pml(fit1) # optimize edge weights
fit2 <- phangorn::optim.pml(fit2)
#ape::write.tree(fit1,"6_Phyly_Analysis/fit1.tree")
#ape::write.tree(fit2,"6_Phyly_Analysis/fit2.tree")
# with pml objects as input
PhylyResult <- phangorn::SH.test(fit1, fit2, B=10000)
# in real analysis use larger B, e.g. 10000
PhylyResultz[[i]] <- PhylyResult
names(PhylyResultz)[[i]] <- nomen
}
SH_test_results <- as.data.frame(matrix(nrow=length(names(PhylyResultz)),ncol=8))
colnames(SH_test_results) <- c("Subgenus","Unconstrained ln L","Unconstrained Diff ln L", "Unconstrained p-value","Constrained ln L","Constrained Diff ln L", "Constrained p-value", "Phyly")
SH_test_results[,1] <- names(PhylyResultz)
for(i in 1:nrow(SH_test_results))
{
tmp <- PhylyResultz[[i]]
SH_test_results[i,2] <- tmp[1,2]
SH_test_results[i,3] <- tmp[1,3]
SH_test_results[i,4] <- tmp[1,4]
SH_test_results[i,5] <- tmp[2,2]
SH_test_results[i,6] <- tmp[2,3]
SH_test_results[i,7] <- tmp[2,4]
}
write.csv(SH_test_results,"/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/6_Phyly_Analysis/SH_test_results.csv",row.names=F)
setwd("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/6_Phyly_Analysis/Trees/Clades")
SubgenTreez <- list.files()
PhylyResultz <- list()
for(i in 1:length(SubgenTreez))
{
nomen <- gsub("RAxML_bipartitions.","",SubgenTreez[i])
nomen <- gsub("BackboneNewLowThread.tree","",nomen)
nomen <- gsub("_","",nomen)
message(sprintf("Processing Subgenus '%s', No. %s of %s",nomen,i,length(SubgenTreez)))
library(phangorn)
UwUTree1 <- ape::read.tree("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/5_ML_Analysis/PartitionedMLPLastomeTree.tree")
UwUTree2<- ape::read.tree(SubgenTreez[i])
PhyData <- ape::read.FASTA("/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/AlignedRevCompGood.fasta")
PhyData <- phangorn::as.phyDat(PhyData)
fit1 <- phangorn::pml(UwUTree1, PhyData)
fit2 <- phangorn::pml(UwUTree2, PhyData)
fit1 <- phangorn::optim.pml(fit1) # optimize edge weights
fit2 <- phangorn::optim.pml(fit2)
#ape::write.tree(fit1,"6_Phyly_Analysis/fit1.tree")
#ape::write.tree(fit2,"6_Phyly_Analysis/fit2.tree")
# with pml objects as input
PhylyResult <- phangorn::SH.test(fit1, fit2, B=10000)
# in real analysis use larger B, e.g. 10000
PhylyResultz[[i]] <- PhylyResult
names(PhylyResultz)[[i]] <- nomen
}
SH_test_results <- as.data.frame(matrix(nrow=length(names(PhylyResultz)),ncol=8))
colnames(SH_test_results) <- c("Subgenus","Unconstrained ln L","Unconstrained Diff ln L", "Unconstrained p-value","Constrained ln L","Constrained Diff ln L", "Constrained p-value", "Phyly")
SH_test_results[,1] <- names(PhylyResultz)
for(i in 1:nrow(SH_test_results))
{
tmp <- PhylyResultz[[i]]
SH_test_results[i,2] <- tmp[1,2]
SH_test_results[i,3] <- tmp[1,3]
SH_test_results[i,4] <- tmp[1,4]
SH_test_results[i,5] <- tmp[2,2]
SH_test_results[i,6] <- tmp[2,3]
SH_test_results[i,7] <- tmp[2,4]
}
write.csv(SH_test_results,"/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/6_Phyly_Analysis/Clades_SH_test_results.csv",row.names=F)
To examine phylogenetic clustering considering past events and to perform climatic, pedological, and morphological analyses, a rooted and time-measured isochronous phylogenetic tree was inferred. Beauti v. 1.10.4 (Drummond & Rambaut, 2018) was then used to create an input file for BEAST v. 1.10.4 (Drummond & Rambaut, 2018). Four partitions were set (LSC, IRB, SSC, and IRA). Uniform priors were set as follows using three secondary calibration points obtained by Gallaher et al. (2015; see Table 1 in reference and values of 95% HPD obtained from Cyclanthus and Pandanaceae fossils): (a) Crown of Pandanaceae (upper : 97.5 MA, lower : 41.9 MA), splits between b) Freycinetia and clade including Pandanus and Benstonea (upper : 60.7 MA, lower : 24.6 MA), and c) Pandanus and Benstonea (upper : 33.0 MA, lower : 12.5 MA). We used an uncorrelated relaxed molecular clock with a lognormal distribution of rates and a Yule speciation model. The analysis was run on the CIPRES portal 345,287,000 generations total (divided into four analyses), sampling one tree every 1,000th generation Parameter convergence was confirmed in Tracer v. 1.7.2 (Rambaut et al, 2018). A maximum clade credibility tree with median branch lengths and 95% highest posterior density (HPD) interval on nodes was reconstructed was using Tree Annotator v. 1.10.4. (Drummond & Rambaut, 2018) with the first 10% of trees discarded as burn-in.
# Read in tree, data, and key
library(treeio)
BeastTree <- treeio::read.beast("7_BEAST_Analysis/Trees/med.tree")
# Destroy the tibble heresy
BeastTable <- BeastTree@data
BeastTreePhylo <- treeio::as.phylo(BeastTree)
rm(BeastTree)
# Remove outgroup
Key <- read.csv("0_Raw_Data/Key.csv")
# Make tip labels
BT_names <- BeastTreePhylo$tip.label
NewNames <- rep(NA,length(BT_names))
nonPandanus <- c("OUTGROUP_Carludovica_palmata_voucher_K_Chase",
"Pandan0062-Sararanga_philippinensis-Chloroplast-Genome",
"Pandan0007-Freycinetia_caudata-Chloroplast-Genome",
"Pandan0003-Benstonea_inquilina-Chloroplast-Genome",
"Pandan0005-Benstonea_serpentinica-Chloroplast-Genome",
"Pandan0001-Benstonea_brevistyla-Chloroplast-Genome",
"Pandan0006-Benstonea_tunicata-Chloroplast-Genome",
"Pandan0002-Benstonea_epiphytica-Chloroplast-Genome",
"Pandan0004-Benstonea_pachyphylla-Chloroplast-Genome")
for(i in 1:length(BT_names))
{
uwu <- BT_names[i]
if(uwu %in% c(nonPandanus,"gi|1693290702|ref|NC_042747.1|"))
{
if(grepl("OUTGROUP",uwu))
{
NewNames[i] <- "Carludovica palmata (Chase 14836)"
}else if(uwu == "gi|1693290702|ref|NC_042747.1|"){
NewNames[i] <- "Pandanus tectorius (Wang B247)"
}else{
PNum <- strsplit(uwu, split="-")[[1]][1]
owo <- Key[Key$UniqueID == PNum,]
NewNames[i] <- sprintf("%s (%s %s)",owo$Taxon,owo$Collector,owo$Coll...)
}
}else{
PNum <- strsplit(uwu, split="-")[[1]][1]
owo <- Key[Key$UniqueID == PNum,]
NewNames[i] <- sprintf("%s (%s %s)",owo$Taxon,owo$Collector,owo$Coll...)
}
}
# Make figure
library("phytools")
library("phyloch")
library("strap")
library("coda")
t <- phyloch::read.beast("7_BEAST_Analysis/Trees/med.tree") # I couldn't upload the file here
t$tip.label <- NewNames
t <- drop.tip(t,"Carludovica palmata (Chase 14836)")
t$root.time <- t$height[1]
num_taxa <- length(t$tip.label)
display_all_node_bars <- TRUE
names_list <-vector()
for (name in t$tip){
v <- strsplit(name, "_")[[1]]
if(display_all_node_bars){
names_list = c(names_list, name)
}
else if(v[length(v)]=="0"){
names_list = c(names_list, name)
}
}
nidsnamez <- list()
nids <- vector()
pos <- 1
len_nl <- length(names_list)
countR <- 1
for(n in names_list){
for(nn in names_list[pos:len_nl]){
if(n != nn){
m <- getMRCA(t,c(n, nn))
nidsnamez[[countR]] <- paste(c(n, nn),collapse=" ")
if(m %in% nids == FALSE){
nids <- c(nids, m)
countR <- countR + 1
}
}
}
pos <- pos+1
}
nidsnamez <- unlist(nidsnamez)
t <- ladderize(t)
II_mrca <- phytools::fastMRCA(t,"Pandanus tectorius (Wang B247)","Pandanus multispicatus (Callmander 91)")
I_mrca <- phytools::fastMRCA(t,"Pandanus kaida (Altafhusain 3)","Pandanus pugnax (Callmander 1035)")
IIa_mrca <- phytools::fastMRCA(t,"Pandanus tectorius (Wang B247)","Pandanus aquaticus (Dowe 290809B)")
Ia_mrca <- phytools::fastMRCA(t,"Pandanus kaida (Altafhusain 3)","Pandanus calcis (Callmander 1048)")
IIb_mrca <- phytools::fastMRCA(t,"Pandanus multispicatus (Callmander 91)","Pandanus pygmaeus (Rakotovao 5009)")
Ib_mrca <- phytools::fastMRCA(t,"Pandanus pugnax (Callmander 1035)","Pandanus zea (Dowe 201009D)")
geoinfo <- matrix(nrow=7,ncol=2)
row.names(geoinfo) <- c("Paleocene","Eocene","Oligocene","Miocene","Pliocene","Pleistocene","Holocene")
geoinfo[,1] <- c(66,56,33.9,23.3,5.333,2.58,0.0117)
geoinfo[,2] <- c(56,33.9,23.3,5.333,2.58,0.0117,0)
library(strap)
library(phytools)
data(strat2012)
{
pdf("7_BEAST_Analysis/AnnotatedMedMCCTree.pdf", width = 30, height = 20)
plot.phylo(t,x.lim=c(-25,75),show.tip.label = T,cex=1)
#strap::geoscalePhylo(tree = t, units=c("Period", "Epoch", "Age"), boxes="Epoch", cex.tip=0.5, cex.age=0.7, cex.ts=0.7, label.offset=0, x.lim=c(-15,80), lwd=3, width=2)
axisGeo(strat2012, tip.time = 0, unit = c("epoch","period","eon"), gridty = 0, gridcol = "black",ages=F)
#col = c("antiquewhite1","moccasin","khaki1","yellow","burlywood2","sandybrown","khaki1","gold","tan1")
axisChrono(side = 3, unit = "Ma")
lastPP <- get("last_plot.phylo", envir = .PlotPhyloEnv)
for(nv in nids){
bar_xx_a <- c(lastPP$xx[nv]+t$height[nv-num_taxa]-t$"height_95%_HPD_MIN"[nv-num_taxa],
lastPP$xx[nv]-(t$"height_95%_HPD_MAX"[nv-num_taxa]-t$height[nv-num_taxa]))
lines(bar_xx_a, c(lastPP$yy[nv], lastPP$yy[nv]), col = rgb(0, 0, 1, alpha = 0.3), lwd = 5)
}
t$node.label <- t$posterior
p <- character(length(t$node.label))
p[t$node.label >= 0.95] <- "green"
p[t$node.label < 0.95 & t$node.label >= 0.75] <- "yellow"
p[t$node.label < 0.75] <- "red"
nodelabels(pch = 21, cex = 1.5, bg = p)
phytools::cladelabels(t,"Ia",Ia_mrca,offset=2,orientation="horizontal",cex=3)
phytools::cladelabels(t,"Ib",Ib_mrca,offset=2,orientation="horizontal",cex=3)
phytools::cladelabels(t,"IIa",IIa_mrca,offset=2,orientation="horizontal",cex=3)
phytools::cladelabels(t,"IIb",IIb_mrca,offset=2,orientation="horizontal",cex=3)
phytools::cladelabels(t,"I",I_mrca,offset=6,orientation="horizontal",cex=3)
phytools::cladelabels(t,"II",II_mrca,offset=6,orientation="horizontal",cex=3)
dev.off()
}
To explore cladewise climatological and pedological differences among clades (I, II) and subclades (Ia, Ib, IIa, IIb) as well as between species with and without a hypertrophied adaxial hydrenchyma we used a random forest approach (Breiman, 2001). We chose to use random forests because of it’s robustness to collinearity (Belgiu and Drăguţ, 2016; Matsuki et al., 2016), which is present in the Bioclim and Soils data (see supplementary materials for correlation matrices) and because it is not prone to overfitting (Matsuki et al., 2016). Geolocations for taxonomically verified vouchered non-cultivated samples were extracted from the Pandanaceae Project (2024) website (Fig. 6a) and bioclimatic variables were extracted for each sample using its latitude and longitude coordinates in R with the measurements and raster packages from the Worldclim data set (Worldclim.org; Birk, 2019 and Hijmans, 2020, respectively). Important variables were identified by successive feature elimination and OOB error of iteratively grown random forests (whole range computed, 50,000 trees for the first iteration, 50,000 for each successive iteration) using the varSelRF package (Diaz-Uriarte, 2007; Diaz-Uriarte and Alvarez de Andres, 2006). The climatic data was then filtered to contain only the important variables and a new model was created using ranger (citation) to compute updated variable importance using the parameters used in the best model as identified by the successive feature elimination. To validate variable importance and assign p-values to each variable, Altmann’s corrected permutation feature importance (Altmann et al., 2010) with 100 permutaions was performed on the new model using ranger. Directionality and additional validation was provided by performing Mann-Whitney U tests (Mann and Whitney, 1947) on cladewise data for each of the important variables to test for a statistically significant difference in means between clades. The vioplot package (Adler et al., 2022) was then used to produce violin plots for each important variable. The same method as above was used for the pedological analysis, however it used the Digital Soil Map of the World (Food and Agriculture Organization of the United Nations, 2003) rather than bioclim data.
# brew install libgit2
CallmOcc <- read.csv("12_External_Data/ClimatPP.csv")
GoodOccz <- as.data.frame(matrix(nrow=0, ncol=4))
colnames(GoodOccz) <- c("Species","Lat","Long","Clade")
for(i in 1:nrow(CallmOcc))
{
owo <- unlist(strsplit(CallmOcc[i,2], split = " "))
message(sprintf("Processing No. %s of %s",i,nrow(CallmOcc)))
if(T %in% grepl("°",owo))
{
message("......geolocation retrieved!")
uwu <- as.data.frame(matrix(nrow=1, ncol=3))
uwu[1,1] <- CallmOcc[i,1]
uwu[1,4] <- CallmOcc[i,3]
colnames(uwu) <- c("Species","Lat","Long","Clade")
uwu[1,c(2,3)] <- gsub(",","",owo[grep("°",owo)])
GoodOccz <- rbind(GoodOccz,uwu)
}
}
# Add our collections (if not already there)
PandanData <- read.csv("3_Plastome_Reconstruction/MasterKeyPandanusChloroplast2022.csv")
PandanData <- PandanData[!PandanData[,1] == "Pandan0034",] #remove dup
Key <- read.csv("5_ML_Analysis/SubcladeKey.csv")
Geolocales <- as.data.frame(matrix(nrow=nrow(Key),ncol=22))
colnames(Geolocales) <- c("Species","Lat","Long")
Geolocales$Species <- Key$Species
gonez <- c("Collection_Info_Missing","No_Coordinates","Only_In_Cultivation","Living_Collection")
# Ignore warning
for(i in 1:nrow(Geolocales))
{
message(sprintf("Processing No. %s of %s",i,nrow(Geolocales)))
if(Geolocales[i,1] == "Pandanus_sermollianus")
{
coords <- PandanData[PandanData$Taxon == "Pandanus_sermolliana",which(colnames(PandanData) == "Coordinates")]
}else if(Geolocales[i,1] == "Pandanus_purpurascens"){
coords <- PandanData[PandanData$Taxon == "Pandanus_purpuracens",which(colnames(PandanData) == "Coordinates")]
}else if(Geolocales[i,1] == "Pandanus_halmaherensis"){
coords <- PandanData[PandanData$Taxon == "Pandanus_halmaheraensis",which(colnames(PandanData) == "Coordinates")]
}else{
coords <- PandanData[PandanData$Taxon == Geolocales[i,1],which(colnames(PandanData) == "Coordinates")]
}
if(coords %in% gonez)
{
if(Geolocales[i,1] == "Pandanus_purpurascens")
{
coords <- PandanData[PandanData$Taxon == "Pandanus_purpuracens",17]
}else{
coords <- PandanData[PandanData$Taxon == Geolocales[i,1],17]
}
}
coords <- unlist(strsplit(coords,split = " "))
Geolocales[i,2] <- coords[1]
Geolocales[i,3] <- coords[2]
}
for(i in 1:nrow(Geolocales))
{
tmp <- Key[Key$Species == Geolocales[i,1],1]
if(tmp == "I"){tmp <- 1}
if(tmp == "II"){tmp <- 2}
Geolocales[i,4] <- tmp
}
Geolocales <- as.data.frame(Geolocales[,c(1,2,3,4)])
Geolocales <- Geolocales[!is.na(Geolocales[,3]),]
Geolocales_wilds <- Geolocales[!Geolocales$Species %in% c("Pandanus amaryllifolius","Pandanus kaida","Pandanus conoideus"),]
colnames(Geolocales) <- colnames(GoodOccz)
AllCoordz <- as.data.frame(rbind(GoodOccz,Geolocales))
AllCoordz <- unique(AllCoordz) # Remove duplicates (i.e. same specimen sent to different herbaria)
write.csv(AllCoordz,"8_Climate_Analysis/AllCoordz.csv",row.names=F)
AllCoordz <- read.csv("8_Climate_Analysis/AllCoordz.csv")
# PCA analysis
PCA_Food_Lat_Long <- data.frame(AllCoordz[,4],AllCoordz[,2],AllCoordz[,3])
WC_Nomen <- read.csv("8_Climate_Analysis/WC_Nomen.csv")
BC_Points <- PCA_Food_Lat_Long
BC_Points <- cbind(BC_Points,as.data.frame(matrix(nrow=nrow(BC_Points), ncol=length(WC_Nomen[,1]))))
colnames(BC_Points)[1:22] <- c("Species","ddlat","ddlong",WC_Nomen[,1])
# Ignore warning
for(i in 1:nrow(BC_Points))
{
message(sprintf("Processing No. %s of %s",i,nrow(BC_Points)))
BC_Points[i,2] <- gsub(",","",BC_Points[i,2])
if(grepl("N",BC_Points[i,2]))
{
tmp <- gsub("°"," ",BC_Points[i,2])
tmp <- gsub("'"," ",tmp)
tmp <- gsub("\"","",tmp)
tmp <- gsub("N","",tmp)
BC_Points[i,2] <- tmp
}else if(grepl("S",BC_Points[i,2])){
tmp <- gsub("°"," ",BC_Points[i,2])
tmp <- gsub("'"," ",tmp)
tmp <- gsub("\"","",tmp)
tmp <- gsub("S","",tmp)
BC_Points[i,2] <- paste0("-",tmp)
}
BC_Points[i,3] <- gsub(",","",BC_Points[i,3])
if(grepl("E",BC_Points[i,3])){
tmp <- gsub("°"," ",BC_Points[i,3])
tmp <- gsub("'"," ",tmp)
tmp <- gsub("\"","",tmp)
tmp <- gsub("E","",tmp)
BC_Points[i,3] <- tmp
}else if(grepl("W",BC_Points[i,3])){
tmp <- gsub("°"," ",BC_Points[i,3])
tmp <- gsub("'"," ",tmp)
tmp <- gsub("\"","",tmp)
tmp <- gsub("W","",tmp)
BC_Points[i,3] <- paste0("-",tmp)
}
BC_Points[i,2] <- as.numeric(measurements::conv_unit(BC_Points[i,2],from = "deg_min_sec", to = "dec_deg"))
BC_Points[i,3] <- as.numeric(measurements::conv_unit(BC_Points[i,3],from = "deg_min_sec", to = "dec_deg"))
}
UniNumz <- row.names(AllCoordz)
SppNamez <- AllCoordz[!is.na(BC_Points$ddlat),]
UniNumz <- UniNumz[!is.na(BC_Points$ddlat)]
BC_Points <- BC_Points[!is.na(BC_Points$ddlat),]
BC_Points[,2] <- as.numeric(BC_Points[,2])
BC_Points[,3] <- as.numeric(BC_Points[,3])
# bc converted point was over water so movedit directly inland a few meters
# BC_Points[BC_Points$Species == "Pandanus_leram",2:3] <- c(7.0104, 93.934)
for(i in 1:19)
{
print(sprintf("Processing BioClim Variable No. %s",i))
thisraster <- raster::raster(sprintf("12_External_Data/wc2.1_30s_bio/wc2.1_30s_bio_%s.tif",i))
BC_Points[,i+3] <- raster::extract(thisraster,data.frame(BC_Points$ddlong,BC_Points$ddlat))
}
write.csv(BC_Points,"8_Climate_Analysis/BC_Points.csv",row.names=F)
write.csv(SppNamez,"8_Climate_Analysis/BC_Points_Spp.csv",row.names=F)
write.csv(UniNumz,"8_Climate_Analysis/BC_Points_UniNumz.csv",row.names=F)
message("Preparing extracted BioClim data")
BC_Points_for_PCA <- BC_Points[,-2]
BC_Points_for_PCA <- BC_Points_for_PCA[,-2]
Nomen <- BC_Points_for_PCA$Species
BC_Points_for_PCA <- BC_Points_for_PCA[,-1]
message("Removing rows with an NA in any of the 19 variables")
togo <- c()
pb <- txtProgressBar(min = 1, max = nrow(BC_Points_for_PCA), style = 3)
for(i in 1:nrow(BC_Points_for_PCA))
{
setTxtProgressBar(pb, i)
if(T %in% is.na(BC_Points_for_PCA[i,]))
{
togo <- c(togo,i)
}
}
require(factoextra)
library("FactoMineR")
Nomen <- Nomen[-togo]
BC_Points_for_PCA_Clean <- BC_Points_for_PCA[-togo,]
##################################################
# Random Forest Feature Selection
# With Corrected Feature Importance (p-value)
##################################################
# Prepare data
BC_Points <- read.csv("8_Climate_Analysis/BC_Points.csv")
SppNamez<- read.csv("8_Climate_Analysis/BC_Points_Spp.csv")
message("Preparing extracted BioClim data")
BC_Points_for_PCA <- BC_Points[,-2]
BC_Points_for_PCA <- BC_Points_for_PCA[,-2]
Nomen <- BC_Points_for_PCA$Species
BC_Points_for_PCA <- BC_Points_for_PCA[,-1]
message("Removing rows with an NA in any of the 19 variables")
togo <- c()
pb <- txtProgressBar(min = 1, max = nrow(BC_Points_for_PCA), style = 3)
for(i in 1:nrow(BC_Points_for_PCA))
{
setTxtProgressBar(pb, i)
if(T %in% is.na(BC_Points_for_PCA[i,]))
{
togo <- c(togo,i)
}
}
Nomen <- Nomen[-togo]
BC_Points_for_PCA_Clean <- BC_Points_for_PCA[-togo,]
SppNamezClean <- SppNamez[-togo,]
write.csv(BC_Points_for_PCA_Clean,"8_Climate_Analysis/BC_Points_for_PCA_Clean.csv",row.names=F)
write.csv(SppNamezClean,"8_Climate_Analysis/SppNamezClean.csv",row.names=F)
RandomForestAnalyzeR <- function(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)
{
set.seed(123)
message("Preparing data")
Key <- read.csv(Keypath)
if("I" %in% c(comp1,comp2))
{
comp1_Spp <- unique(Key[Key$Clade == comp1,1])
comp2_Spp <- unique(Key[Key$Clade == comp2,1])
}else if("Ia" %in% c(comp1,comp2)){
comp1_Spp <- unique(Key[Key$Subclade == comp1,1])
comp2_Spp <- unique(Key[Key$Subclade == comp2,1])
}else{
comp1_Spp <- unique(Key[Key[,2] == comp1,1])
comp2_Spp <- unique(Key[Key[,2] == comp2,1])
}
BC_Points_for_PCA_Clean <- Data4Comp
comp1_Spp_numz <- which(SppNames[,1] %in% comp1_Spp)
comp2_Spp_numz <- which(SppNames[,1] %in% comp2_Spp)
BC_Points_Clean_comp1 <- BC_Points_for_PCA_Clean[comp1_Spp_numz,]
BC_Points_Clean_comp2 <- BC_Points_for_PCA_Clean[comp2_Spp_numz,]
BC_Points_Clean_comps12 <- cbind(c(rep(1,nrow(BC_Points_Clean_comp1)),rep(2,nrow(BC_Points_Clean_comp2))),rbind(BC_Points_Clean_comp1,BC_Points_Clean_comp2))
colnames(BC_Points_Clean_comps12)[1] <- "Subclade"
n_features <- length(colnames(BC_Points_Clean_comps12))-1
message("Selecting variables by successive feature elimination and OOB error")
message("...using 50,000 trees at each iteration")
varselrf_pandan <- varSelRF::varSelRF(BC_Points_Clean_comps12[,-1],Class = as.factor(BC_Points_Clean_comps12$Subclade),recompute.var.imp=F,whole.range=T,ntree=50000,ntreeIterat=50000, keep.forest = TRUE,verbose=TRUE,vars.drop.num=1,vars.drop.frac=NULL)
colnames(BC_Points_Clean_comps12)[1] <- "Subclade"
SelectedVars <- varselrf_pandan$selected.vars
sh <- varselrf_pandan$selec.history
message(sprintf("%s features were found as important",length(SelectedVars)))
print(SelectedVars)
NewRF <- ranger::ranger(as.formula(sprintf("Subclade ~ %s",paste(SelectedVars, collapse = " + "))), data = BC_Points_Clean_comps12,
num.trees = varselrf_pandan$ntreeIterat,importance = "permutation")
pz <- ranger::importance_pvalues(NewRF,method="altmann",num.permutations = 100,formula=as.formula(sprintf("Subclade ~ %s",paste(SelectedVars, collapse = " + "))),data = BC_Points_Clean_comps12)
PandanForest_P_p_df <- as.data.frame(pz)
PandanForest_P_p_df[,c(2,3)] <- PandanForest_P_p_df[,c(1,2)]
PandanForest_P_p_df[,1] <- row.names(PandanForest_P_p_df)
row.names(PandanForest_P_p_df) <- NULL
colnames(PandanForest_P_p_df) <- c("Variable","Importance","p-Value")
message("Writing variable importance and p-value table")
write.csv(PandanForest_P_p_df,sprintf("%s/RandomFOrest_P_%s_%s_VarImp.csv",folder,comp1,comp2),row.names = F)
SigVals <- PandanForest_P_p_df[PandanForest_P_p_df[,3] < 0.05,1]
if(length(SigVals) > 0)
{
message("Conducting Variablewise Wilcoxan Rank-Sum Tests")
ThisData <- BC_Points_Clean_comps12
CladewiseCompPvals <- as.data.frame(matrix(nrow=length(SigVals),ncol=3))
CladewiseCompPvals[,1] <- SigVals
colnames(CladewiseCompPvals) <- c("Variable","p-value","W")
for(i in 1:length(SigVals))
{
OneVar <- ThisData[ThisData[,1] == 1,colnames(ThisData) == SigVals[i]]
TwoVar <- ThisData[ThisData[,1] == 2,colnames(ThisData) == SigVals[i]]
Wilcox1 <- wilcox.test(as.numeric(OneVar),as.numeric(TwoVar))
CladewiseCompPvals[i,2] <- Wilcox1[[3]]
CladewiseCompPvals[i,3] <- Wilcox1[[1]]
}
write.csv(CladewiseCompPvals,sprintf("%s/CladewiseWilcox_Sig_%s%s%s.csv",folder,comp1,comp2,analtype),row.names=F)
message("Creating Variablewise Violin Plots")
if("Ia" %in% c(comp1,comp2))
{
uwucolorz <- c("brown2","brown")
}else if("IIa" %in% c(comp1,comp2)){
uwucolorz <- c("cyan2","cyan4")
}else{
uwucolorz <- c("brown1","cyan")
}
library(vioplot)
for(i in 1:length(SigVals))
{
OneVar <- ThisData[ThisData[,1] == 1,colnames(ThisData) == SigVals[i]]
TwoVar <- ThisData[ThisData[,1] == 2,colnames(ThisData) == SigVals[i]]
{
pdf(sprintf("%s/ViolinPlot_Sig_%s%s%s_%s.pdf",folder,comp1,comp2,analtype,SigVals[i]))
vioplot(OneVar, side = "left", plotCentre = "line", ylim=c(.8*min(OneVar,TwoVar),max(OneVar,TwoVar)+.2*max(OneVar,TwoVar)),col = uwucolorz[1], xaxt= "n",ylab=SigVals[i])# = TRUE)
vioplot(TwoVar, side = "right", plotCentre = "line", col = uwucolorz[2], add = TRUE )
legend("topleft", legend = c(comp1, comp2), fill = uwucolorz,cex=2)
dev.off()
}
}
}else{
message("NO SIGNIFICANT PREDICTORS")
}
}
# I v II
Keypath <- "0_Raw_Data/Key.csv"
comp1 <- "I"
comp2 <- "II"
analtype <- "BioClim"
folder <- "8_Climate_Analysis"
Data4Comp <- BC_Points_for_PCA_Clean
SppNames <- SppNamezClean
RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)
# Ia v Ib
Keypath <- "0_Raw_Data/Key.csv"
comp1 <- "Ia"
comp2 <- "Ib"
analtype <- "BioClim"
folder <- "8_Climate_Analysis"
Data4Comp <- BC_Points_for_PCA_Clean
SppNames <- SppNamezClean
RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)
# IIa v IIb
Keypath <- "0_Raw_Data/Key.csv"
comp1 <- "IIa"
comp2 <- "IIb"
analtype <- "BioClim"
folder <- "8_Climate_Analysis"
Data4Comp <- BC_Points_for_PCA_Clean
SppNames <- SppNamezClean
RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)
C_filez <- list.files("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/8_Climate_Analysis")
VI <- C_filez[grepl("VarImp.csv",C_filez)]
MWU <- c("CladewiseWilcox_Sig_IIIBioClim.csv","CladewiseWilcox_Sig_IaIbBioClim.csv" ,"CladewiseWilcox_Sig_IIaIIbBioClim.csv")
cladez <- c("I_II","Ia_Ib","IIa_IIb")
for(i in 1:length(VI))
{
uwu <- read.csv(sprintf("8_Climate_Analysis/%s",VI[i]))
owo <- read.csv(sprintf("8_Climate_Analysis/%s",MWU[i]))
alluwusmol <- cbind(uwu,owo)
alluwusmol <- cbind(rep(cladez[i],nrow(alluwusmol)),alluwusmol)
if(i == 1)
{
alluwu <- alluwusmol
}else{
alluwu <- rbind(alluwu,alluwusmol)
}
}
write.csv(alluwu, "8_Climate_Analysis/AllResultsTable.csv",row.names=F)
AllCoordz <- read.csv("8_Climate_Analysis/AllCoordz.csv")
DSMW <- "12_External_Data/DSMW/DSMW.shp"
# PCA analysis
PCA_Food_Lat_Long <- data.frame(AllCoordz[,4],AllCoordz[,2],AllCoordz[,3])
BC_Points <- PCA_Food_Lat_Long
BC_Points <- cbind(BC_Points,as.data.frame(matrix(nrow=nrow(BC_Points), ncol=3)))
colnames(BC_Points)[1:3] <- c("Species","ddlat","ddlong")
# Ignore warning
for(i in 1:nrow(BC_Points))
{
message(sprintf("Processing No. %s of %s",i,nrow(BC_Points)))
BC_Points[i,2] <- gsub(",","",BC_Points[i,2])
if(grepl("N",BC_Points[i,2]))
{
tmp <- gsub("°"," ",BC_Points[i,2])
tmp <- gsub("'"," ",tmp)
tmp <- gsub("\"","",tmp)
tmp <- gsub("N","",tmp)
BC_Points[i,2] <- tmp
}else if(grepl("S",BC_Points[i,2])){
tmp <- gsub("°"," ",BC_Points[i,2])
tmp <- gsub("'"," ",tmp)
tmp <- gsub("\"","",tmp)
tmp <- gsub("S","",tmp)
BC_Points[i,2] <- paste0("-",tmp)
}
BC_Points[i,3] <- gsub(",","",BC_Points[i,3])
if(grepl("E",BC_Points[i,3])){
tmp <- gsub("°"," ",BC_Points[i,3])
tmp <- gsub("'"," ",tmp)
tmp <- gsub("\"","",tmp)
tmp <- gsub("E","",tmp)
BC_Points[i,3] <- tmp
}else if(grepl("W",BC_Points[i,3])){
tmp <- gsub("°"," ",BC_Points[i,3])
tmp <- gsub("'"," ",tmp)
tmp <- gsub("\"","",tmp)
tmp <- gsub("W","",tmp)
BC_Points[i,3] <- paste0("-",tmp)
}
BC_Points[i,2] <- as.numeric(measurements::conv_unit(BC_Points[i,2],from = "deg_min_sec", to = "dec_deg"))
BC_Points[i,3] <- as.numeric(measurements::conv_unit(BC_Points[i,3],from = "deg_min_sec", to = "dec_deg"))
}
ptz <- paste(BC_Points[,2],BC_Points[,3])
ptz <- data.frame(BC_Points[,1],ptz)
UniNumz <- seq(1,nrow(AllCoordz),1)
SppNamez <- AllCoordz[-grep("NA",ptz[,2]),]
SL_Points <- BC_Points[-grep("NA",ptz[,2]),]
UniNumz <- UniNumz[-grep("NA",ptz[,2])]
which(ptz[,2] == "NA NA")
removedd <- grep("NA",ptz[,2])
ptz <- ptz[-grep("NA",ptz[,2]),]
library(SoilsSeeR)
Soilz <- SoilsInfoGrabbeR(DSMW,ptz)
write.csv(Soilz,"9_Soils_Analysis/AllCoordz.csv",row.names=F)
Soilz <- read.csv("9_Soils_Analysis/AllCoordz.csv")
length(which(Soilz$FAOSOIL == "No_Soil_Data"))/nrow(Soilz)*100 #7.06% have no data
median(table(SppNamez[-which(Soilz$FAOSOIL == "No_Soil_Data"),1])) # on average each species has 1 observations
Soilz_goods <- Soilz[!Soilz$FAOSOIL == "No_Soil_Data",]
SppNamez <- SppNamez[!Soilz$FAOSOIL == "No_Soil_Data",]
UniNumz <- UniNumz[!Soilz$FAOSOIL == "No_Soil_Data"]
FAO_sol <- Soilz_goods$FAOSOIL
FAO_sol_info <- as.data.frame(matrix(nrow=length(FAO_sol),ncol=2))
for(i in 1:length(FAO_sol))
{
if(grepl("-",FAO_sol[i]))
{
Nomen <- gsub("[0-9]","",gsub("-.*","",FAO_sol[i]))
Tex <- gsub("[^0-9]","",gsub(".*-","",FAO_sol[i]))
}else{
Nomen <- gsub("[0-9]","",FAO_sol[i])
Tex <- NA
}
FAO_sol_info[i,1] <- Nomen
FAO_sol_info[i,2] <- Tex
}
AllInfoz <- readxl::read_excel("12_External_Data/DSMW/Generalized_SU_Info.xls")
AllInfoz <- as.data.frame(AllInfoz) # Destroy tibble heresy
SoilInfoz <- AllInfoz[0,]
CladeInfo <- list()
tickeR <- 0
RepSppNamez <- c()
RepUniNumz <- c()
SppNamez <- SppNamez[,1]
for(i in 1:nrow(FAO_sol_info))
{
if(is.na(FAO_sol_info[i,2]) || FAO_sol_info[i,2] == "")
{
these <- AllInfoz[AllInfoz[,1] == toupper(FAO_sol_info[i,1]),]
SoilInfoz <- rbind(these,SoilInfoz)
#print(sprintf("%s %s",i,nrow(these)))
tickeR <- tickeR + 1
CladeInfo[[tickeR]] <- Soilz_goods[i,1]
RepSppNamez <- c(RepSppNamez,SppNamez[i])
RepUniNumz <- c(RepUniNumz,UniNumz[i])
}else{
if(as.numeric(FAO_sol_info[i,2]) < 3)
{
these <- AllInfoz[AllInfoz[,1] == sprintf("%s %s",toupper(FAO_sol_info[i,1]),FAO_sol_info[i,2]),]
SoilInfoz <- rbind(SoilInfoz,these)
tickeR <- tickeR + 1
CladeInfo[[tickeR]] <- Soilz_goods[i,1]
RepSppNamez <- c(RepSppNamez,SppNamez[i])
RepUniNumz <- c(RepUniNumz,UniNumz[i])
#print(sprintf("%s %s",i,nrow(these)))
}else{
#print(sprintf("%s %s %s",i,nrow(these),FAO_sol_info[i,2]))
uwu <- unlist(strsplit(FAO_sol_info[i,2], split=""))
for(j in 1:length(uwu))
{
these <- AllInfoz[AllInfoz[,1] == sprintf("%s %s",toupper(FAO_sol_info[i,1]),uwu[j]),]
SoilInfoz <- rbind(SoilInfoz,these)
tickeR <- tickeR + 1
CladeInfo[[tickeR]] <- Soilz_goods[i,1]
RepSppNamez <- c(RepSppNamez,SppNamez[i])
RepUniNumz <- c(RepUniNumz,UniNumz[i])
}
}
}
}
write.csv(unlist(CladeInfo),"9_Soils_Analysis/CladeInfo.csv",row.names=F)
write.csv(SoilInfoz,"9_Soils_Analysis/SoilInfoz.csv",row.names=F)
write.csv(RepSppNamez,"9_Soils_Analysis/SoilInfoz_Spp.csv",row.names=F)
write.csv(RepUniNumz,"9_Soils_Analysis/SoilInfoz_UniNumz.csv",row.names=F)
RepSppNamez <- as.data.frame(read.csv("9_Soils_Analysis/SoilInfoz_Spp.csv"))
SoilInfoz <- read.csv("9_Soils_Analysis/SoilInfoz.csv")
SoilInfoz_Points_for_PCA_Clean <- SoilInfoz[,-1]
source("1_Functions/RandomForestAnalyzeR.R")
# I v II
Keypath <- "0_Raw_Data/Key.csv"
comp1 <- "I"
comp2 <- "II"
analtype <- "Soils"
folder <- "9_Soils_Analysis"
Data4Comp <- SoilInfoz_Points_for_PCA_Clean
SppNames <- RepSppNamez
RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)
# Ia v Ib
Keypath <- "0_Raw_Data/Key.csv"
comp1 <- "Ia"
comp2 <- "Ib"
analtype <- "Soils"
folder <- "9_Soils_Analysis"
Data4Comp <- SoilInfoz_Points_for_PCA_Clean
SppNames <- RepSppNamez
RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)
# IIa v IIb
Keypath <- "0_Raw_Data/Key.csv"
comp1 <- "IIa"
comp2 <- "IIb"
analtype <- "Soils"
folder <- "9_Soils_Analysis"
Data4Comp <- SoilInfoz_Points_for_PCA_Clean
SppNames <- RepSppNamez
RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)
C_filez <- list.files("9_Soils_Analysis")
VI <- C_filez[grepl("VarImp.csv",C_filez)]
MWU <- c("CladewiseWilcox_Sig_IIISoils.csv","CladewiseWilcox_Sig_IaIbSoils.csv" ,"CladewiseWilcox_Sig_IIaIIbSoils.csv")
cladez <- c("I_II","Ia_Ib","IIa_IIb")
for(i in 1:length(VI))
{
uwu <- read.csv(sprintf("9_Soils_Analysis/%s",VI[i]))
owo <- read.csv(sprintf("9_Soils_Analysis/%s",MWU[i]))
alluwusmol <- cbind(uwu,owo)
alluwusmol <- cbind(rep(cladez[i],nrow(alluwusmol)),alluwusmol)
if(i == 1)
{
alluwu <- alluwusmol
}else{
alluwu <- rbind(alluwu,alluwusmol)
}
}
write.csv(alluwu, "9_Soils_Analysis/AllResultsTable.csv",row.names=F)
To investigate morphological traits that we hypothesize could be of taxonomical or evolutionary importance, we scored six pistillate traits (spadix: solitary or spicate; carpels: connate or free; endocarp: basal, median, or submedian; supramedian; upper mesocarp: air-filled, fibrous, or solid; lower mesocarp: air-filled, fibrous, or fibrous-fleshy; stigma: acute, auricular, bilobate, circle, cordate, cristiform plate, deltoid, filiform, lanceolate, lobed sunburst, lunate, obovate, reniform, spinescent, sublinear, suborbicular, wider than long), three staminate traits (staminal arrangement: infundibuliform column, basally fused, single, peltate umbrellaform, column, and racinous; filament insertion: candelabra, column rim, edge subumbellately, fasciculate column, peltate umbrellaform, racinous, single, subumbellate, under apical disk, under apical disk margin; anther shape: deltoid, elongate, oblong, weakly deltoid) and one sterile trait (hydrenchyma: normal, i.e., 0-1 layers of hydrenchyma cells) and hypertrophied, i.e., 2+ layers of hydrenchyma cells). These traits were chosen because they were identified as being the most promising based on a thorough review of past Pandanus taxonomic works. We used stochastic character mapping (Huelsenbeck et al., 2003; Bollback, 2006) to reconstruct ancestral states for clades I and II and subclades Ia, Ib, IIa, and IIb. Individuals with multiple traits observed were excluded from the analysis because the relative proportions of phenotypes in the species populations is unknown, the taxonomy of the species used to compile morphological data may not be verifiable, and species-level taxonomical investigation is out of the purvue of this study. The R-package geiger (Pennell et a., 2014) was used to decide which continuous time Markov model of trait evolution best fit the data (equal rates, all rates different, and symmetric backward and forward; Anderson, 1991) based on lowest AICc (Akaike, 1974; Takeuchi, 1976). If more than one model had the lowest AICc, then one of them was picked at random. Stochastic character mapping was then done via the R-package phytools (Revell, 2024): the root prior was set to that of Fitzjohn et al. (2009), Bayesian MCMC (Metropolis et al., 1953) was used to calculate posterior probabilities of the transition matrix Q (with 20,000 generations sampling every 100 generations). Posterior probabilities of each node being in each state was extracted for the most recent common ancestors of six clades (I, II, Ia, Ib, IIa, and IIb), and a majority rule was used to determine if the ancestral state of a node could be estimated.
##########################################
# Morphological Analysis
##########################################
AncestralCharacterizeR <- function(CharData,ThisTree,KeyPath,AnalType)
{
# Get data ready
library(phytools)
# This function uses Bayesian MCMC to sample from the posterior distribution for the states at internal nodes in the tree.
TipData <- read.csv(CharData)
BeastTree <- treeio::read.beast(ThisTree)
BeastTreePhylo <- treeio::as.phylo(BeastTree)
rm(BeastTree)
# Remove outgroup
Key <- read.csv(KeyPath)
# Make tip labels
BT_names <- BeastTreePhylo$tip.label
NewNames <- rep(NA,length(BT_names))
nonPandanus <- c("OUTGROUP_Carludovica_palmata_voucher_K_Chase",
"Pandan0062-Sararanga_philippinensis-Chloroplast-Genome",
"Pandan0007-Freycinetia_caudata-Chloroplast-Genome",
"Pandan0003-Benstonea_inquilina-Chloroplast-Genome",
"Pandan0005-Benstonea_serpentinica-Chloroplast-Genome",
"Pandan0001-Benstonea_brevistyla-Chloroplast-Genome",
"Pandan0006-Benstonea_tunicata-Chloroplast-Genome",
"Pandan0002-Benstonea_epiphytica-Chloroplast-Genome",
"Pandan0004-Benstonea_pachyphylla-Chloroplast-Genome")
for(i in 1:length(BT_names))
{
uwu <- BT_names[i]
if(uwu %in% c(nonPandanus,"gi|1693290702|ref|NC_042747.1|"))
{
if(grepl("OUTGROUP",uwu))
{
NewNames[i] <- "Carludovica palmata (Chase 14836)"
}else if(uwu == "gi|1693290702|ref|NC_042747.1|"){
NewNames[i] <- "Pandanus tectorius (Wang B247)"
}else{
PNum <- strsplit(uwu, split="-")[[1]][1]
owo <- Key[Key$UniqueID == PNum,]
NewNames[i] <- sprintf("%s (%s %s)",owo$Taxon,owo$Collector,owo$Coll...)
}
}else{
PNum <- strsplit(uwu, split="-")[[1]][1]
owo <- Key[Key$UniqueID == PNum,]
NewNames[i] <- sprintf("%s (%s %s)",owo$Taxon,owo$Collector,owo$Coll...)
}
}
BeastTreePhylo$tip.label <- NewNames
olds<- NewNames
news <- rep(NA,length(olds))
for(i in 1:length(olds))
{
news[i] <- paste(unlist(strsplit(olds[i], split = " "))[c(1,2)], collapse="_")
}
for(i in 2:ncol(TipData))
{
message(sprintf("Processing Variable No. %s of %s",i-1,ncol(TipData)-1))
TipzVar <- rep(NA,length(NewNames))
for(j in 1:length(NewNames))
{
if(grepl("Pandanus",NewNames[j]))
{
thisspp <- NewNames[j]
if(length(TipData[grepl(paste(unlist(strsplit(thisspp, split = " "))[c(1,2)], collapse="_"),TipData$Species),i]) == 0)
{
TipzVar[j] <- NA
}else{
TipzVar[j] <- TipData[grepl(paste(unlist(strsplit(thisspp, split = " "))[c(1,2)], collapse="_"),TipData$Species),i]
}
}else{
TipzVar[j] <- NA
}
}
TipzVar<-setNames(TipzVar,NewNames)
TipzVar <- TipzVar[!is.na(TipzVar)]
TipzVar <- TipzVar[!TipzVar == ""]
BeastTreePhylo_smol <- ape::drop.tip(BeastTreePhylo,BeastTreePhylo$tip.label[!BeastTreePhylo$tip.label %in% names(TipzVar)])
tree <- BeastTreePhylo_smol
x <- TipzVar
rainboww <- c("darkblue","gold","orange","green","violet","blue","red",
"yellow","turquoise","maroon","beige","darkgreen",
"gray","pink","brown4","purple",
"coral","cyan","black","white")
cols<-setNames(rainboww[1:length(unique(x))],sort(unique(x)))
#blackz <- rep("black",length(unique(x)))
#stochastic character mapping (Huelsenbeck et al., 2003)
message("Fitting models")
library(geiger)
ERmod <- fitDiscrete(tree, x,
model = "ER",niter = 100, FAIL = 1e+200,ncores=8)
ARDmod <- fitDiscrete(tree, x,
model = "ARD",niter = 100, FAIL = 1e+200,ncores=8)
SYMmod <- fitDiscrete(tree, x,
model = "SYM",niter = 100, FAIL = 1e+200,ncores=8)
AICcResultz <- c(ERmod$opt$aicc,ARDmod$opt$aicc,SYMmod$opt$aicc)
AICcResultzNamez <- c("ER","ARD","SYM")
BestModel <- AICcResultzNamez[AICcResultz == min(AICcResultz)]
BestModel <- sample(BestModel,1)
write.csv(BestModel,sprintf("15_Morphological_Analysis/MorphologicalAnalysis_%s_%s_BestModel.csv",AnalType,colnames(TipData)[i]),row.names=F)
message(sprintf("Best model is: %s",BestModel))
message("Making SIMMAP")
mtrees<-make.simmap(tree,x,model=BestModel,nsim=200,Q="mcmc",pi="fitzjohn")
pd<-describe.simmap(mtrees,plot=F)
# Extract stochastic character mapping results for nodes of interest and place in table
message("...Making cladewise table")
noderatz <- as.data.frame(pd$ace)
noderatz <- cbind(row.names(noderatz),noderatz)
row.names(noderatz) <- NULL
colnames(noderatz)[1] <- "Node"
tipzz <- tree$tip.label
for(k in 1:length(tipzz))
{
tipzz[k] <- paste(unlist(strsplit(tipzz[k], split = " "))[c(1,2)], collapse=" ")
}
TipsKey <- Key[Key$Taxon %in% tipzz,]
CladeISp1 <- TipsKey[TipsKey$Clade == "I",1][1]
CladeIISp1 <- TipsKey[TipsKey$Clade == "II",1][1]
tipzzraw <- tree$tip.label
CladeISp1 <- tipzzraw[grepl(CladeISp1,tipzzraw)]
CladeIISp1 <- tipzzraw[grepl(CladeIISp1,tipzzraw)]
CladeIaSp1 <- TipsKey[TipsKey$Subclade == "Ia",1][1]
CladeIbSp1 <- TipsKey[TipsKey$Subclade == "Ib",1][1]
CladeIaSp1 <- tipzzraw[grepl(CladeIaSp1,tipzzraw)]
CladeIbSp1 <- tipzzraw[grepl(CladeIbSp1,tipzzraw)]
CladeIIaSp1 <- TipsKey[TipsKey$Subclade == "IIa",1][1]
CladeIIbSp1 <- TipsKey[TipsKey$Subclade == "IIb",1][1]
CladeIIaSp1 <- tipzzraw[grepl(CladeIIaSp1,tipzzraw)]
CladeIIbSp1 <- tipzzraw[grepl(CladeIIbSp1,tipzzraw)]
AllCladeIaSpp <- TipsKey[TipsKey$Subclade == "Ia",1]
treesppIa <- c()
for(k in 1:length(AllCladeIaSpp))
{
treesppIa <- unique(c(treesppIa,tipzzraw[grepl(AllCladeIaSpp[k],tipzzraw)]))
}
AllCladeIbSpp <- TipsKey[TipsKey$Subclade == "Ib",1]
treesppIb <- c()
for(k in 1:length(AllCladeIbSpp))
{
treesppIb <- unique(c(treesppIb,tipzzraw[grepl(AllCladeIbSpp[k],tipzzraw)]))
}
AllCladeIIaSpp <- TipsKey[TipsKey$Subclade == "IIa",1]
treesppIIa <- c()
for(k in 1:length(AllCladeIIaSpp))
{
treesppIIa <- unique(c(treesppIIa,tipzzraw[grepl(AllCladeIIaSpp[k],tipzzraw)]))
}
AllCladeIIbSpp <- TipsKey[TipsKey$Subclade == "IIb",1]
treesppIIb <- c()
for(k in 1:length(AllCladeIIbSpp))
{
treesppIIb <- unique(c(treesppIIb,tipzzraw[grepl(AllCladeIIbSpp[k],tipzzraw)]))
}
MRCA_Pandanus <- ape::getMRCA(tree,c(CladeISp1,CladeIISp1))
MRCA_CladeI <- ape::getMRCA(tree,c(CladeIaSp1,CladeIbSp1))
MRCA_CladeII <- ape::getMRCA(tree,c(CladeIIaSp1,CladeIIbSp1))
MRCA_CladeIa <- ape::getMRCA(tree,treesppIa)
MRCA_CladeIb <- ape::getMRCA(tree,treesppIb)
MRCA_CladeIIa <- ape::getMRCA(tree,treesppIIa)
MRCA_CladeIIb <- ape::getMRCA(tree,treesppIIb)
CladewiseOutputTable <- as.data.frame(matrix(nrow=7,ncol=length(colnames(noderatz))+1))
colnames(CladewiseOutputTable) <- c("MRCA",colnames(noderatz))
CladewiseOutputTable$MRCA <- c("MRCA_Pandanus","MRCA_CladeI","MRCA_CladeII","MRCA_CladeIa","MRCA_CladeIb","MRCA_CladeIIa","MRCA_CladeIIb")
CladewiseOutputTable[1,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_Pandanus,]
CladewiseOutputTable[2,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeI,]
CladewiseOutputTable[3,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeII,]
CladewiseOutputTable[4,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeIa,]
CladewiseOutputTable[5,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeIb,]
CladewiseOutputTable[6,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeIIa,]
CladewiseOutputTable[7,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeIIb,]
write.csv(CladewiseOutputTable,sprintf("15_Morphological_Analysis/MorphologicalAnalysis_%s_%s_Table.csv",AnalType,colnames(TipData)[i]),row.names=F)
murr <- pd$ace
murr[row.names(murr) == as.character(MRCA_Pandanus),] <- NA
library(phyloch)
data(strat2012)
message("Making plot")
pdf(sprintf("15_Morphological_Analysis/MCMC_MorphologicalAnalysis_%s_%s.pdf",AnalType,colnames(TipData)[i]),height=25,width=40)
par(mar=c(0,0,0,-5)+5)
plot.phylo(tree,show.tip.label = T,cex = 2,label.offset=0.666)
axisGeo(strat2012, tip.time = 0, unit = c("stage","epoch","period","eon"), gridty = 0, gridcol = "black",ages=F,cex=2)
axisChrono(side = 3, unit = "Ma",fact = 1,cex=1.666)
nodelabels(pie=murr,piecol=cols,cex = 0.33)
phytools::cladelabels(tree,"Ia",MRCA_CladeIa,offset=5,orientation="horizontal",cex=3)
phytools::cladelabels(tree,"Ib",MRCA_CladeIb,offset=5,orientation="horizontal",cex=3)
phytools::cladelabels(tree,"IIa",MRCA_CladeIIa,offset=5,orientation="horizontal",cex=3)
phytools::cladelabels(tree,"IIb",MRCA_CladeIIb,offset=5,orientation="horizontal",cex=3)
phytools::cladelabels(tree,"I",MRCA_CladeI,offset=8,orientation="horizontal",cex=3)
phytools::cladelabels(tree,"II",MRCA_CladeII,offset=8,orientation="horizontal",cex=3)
tiplabels(pie=to.matrix(x,sort(unique(x))),piecol = cols,cex = 0.33)
legend("topleft",names(cols),fill=cols,cex = 2.22,bty="n")
dev.off()
}
}
CharData <- "15_Morphological_Analysis/StaminateCurated.csv"
ThisTree <- "7_BEAST_Analysis/Trees/med.tree"
KeyPath <- "0_Raw_Data/Key.csv"
AnalType <- "Staminate"
AncestralCharacterizeR(CharData,ThisTree,KeyPath,AnalType)
source("1_Functions/AncestralCharacterizeR.R")
CharData <- "15_Morphological_Analysis/WDAST_Pandan.csv"
ThisTree <- "7_BEAST_Analysis/Trees/med.tree"
KeyPath <- "0_Raw_Data/Key.csv"
AnalType <- "WDAST"
AncestralCharacterizeR(CharData,ThisTree,KeyPath,AnalType)
source("1_Functions/AncestralCharacterizeR.R")
CharData <- "15_Morphological_Analysis/PistillateCurated.csv"
ThisTree <- "7_BEAST_Analysis/Trees/med.tree"
KeyPath <- "0_Raw_Data/Key.csv"
AnalType <- "Pistillate"
AncestralCharacterizeR(CharData,ThisTree,KeyPath,AnalType)
AncestralCharacterTabulatoR <- function(foldR,todoz,threshold)
{
filez <- list.files(foldR)
allTabz <- filez[grepl("_Table.csv",filez)]
todoz <- as.vector(todoz)
OUT <- as.data.frame(matrix(nrow=0,ncol=5))
colnames(OUT) <- c("Node","Type","Trait","Largest","Proportion")
for(i in 1:length(todoz))
{
message(sprintf("Processing No. %s of %s",i,length(todoz)))
theseTabz <- allTabz[grepl(todoz[i],allTabz)]
for(j in 1:length(theseTabz))
{
uwu <- read.csv(sprintf("15_Morphological_Analysis/%s",theseTabz[j]))
toBind <- as.data.frame(matrix(nrow=6,ncol=5))
colnames(toBind) <- c("Node","Type","Trait","Largest","Proportion")
toBind[,1] <- uwu[2:7,1]
toBind[,2] <- todoz[i]
toBind[,3] <- rep(gsub("_Table.csv","",gsub("MorphologicalAnalysis_","",theseTabz[j])),nrow(toBind))
for(k in 2:nrow(uwu))
{
toBind[k-1,4] <- paste(colnames(uwu)[which(uwu[k,c(3:ncol(uwu))] == max(uwu[k,c(3:ncol(uwu))]))+2],collapse=" & ")
toBind[k-1,5] <- paste(max(uwu[k,c(3:ncol(uwu))]),collapse=" & ")
}
OUT <- rbind(OUT,toBind)
}
}
ACTab <- OUT
UniCladez <- sort(unique(ACTab$Node))
UniCharz <- unique(ACTab$Trait)
ShortFormTab <- as.data.frame(matrix(nrow=length(UniCharz),ncol=length(UniCladez)+1))
ShortFormTab[,1] <- UniCharz
colnames(ShortFormTab) <- c("Character",UniCladez)
for(i in 1:length(UniCladez))
{
them <- ACTab[ACTab$Node == UniCladez[i],]
for(j in 1:length(them$Largest))
{
message(sprintf("Processing Clade No. %s of %s Character No. %s of %s",i,length(todoz),j,length(them$Largest)))
if(them$Proportion[j] > .5)
{
ShortFormTab[j,i+1] <- sprintf("%s (%s)",gsub("\\."," ",them$Largest[j]),them$Proportion[j])
}else{
ShortFormTab[j,i+1] <- "Uncertain"
}
}
}
return(ShortFormTab)
}
foldR <- "15_Morphological_Analysis"
todoz <- c("Staminate","Pistillate","WDAST")
ShortFormTab <- AncestralCharacterTabulatoR(foldR,todoz,threshold)
write.csv(ShortFormTab,"15_Morphological_Analysis/ShortFormTab.csv",row.names=F)
BC_Points_for_PCA_Clean <- read.csv("8_Climate_Analysis/BC_Points_for_PCA_Clean.csv")
SppNamezClean <- read.csv ("8_Climate_Analysis/SppNamezClean.csv")
Keypath <- "0_Raw_Data/Key.csv"
comp1 <- "Ia"
comp2 <- "Ib"
Key <- read.csv(Keypath)
if("I" %in% c(comp1,comp2))
{
comp1_Spp <- unique(Key[Key$Clade == comp1,1])
comp2_Spp <- unique(Key[Key$Clade == comp2,1])
}else{
comp1_Spp <- unique(Key[Key$Subclade == comp1,1])
comp2_Spp <- unique(Key[Key$Subclade == comp2,1])
}
comp1_Spp_Ia <- comp1_Spp
comp2_Spp_Ib <- comp2_Spp
Keypath <- "0_Raw_Data/Key.csv"
comp1 <- "IIa"
comp2 <- "IIb"
Key <- read.csv(Keypath)
if("I" %in% c(comp1,comp2))
{
comp1_Spp <- unique(Key[Key$Clade == comp1,1])
comp2_Spp <- unique(Key[Key$Clade == comp2,1])
}else{
comp1_Spp <- unique(Key[Key$Subclade == comp1,1])
comp2_Spp <- unique(Key[Key$Subclade == comp2,1])
}
comp1_Spp_IIa <- comp1_Spp
comp2_Spp_IIb <- comp2_Spp
Spp_Ia <- comp1_Spp_Ia
Spp_Ib <- comp2_Spp_Ib
Spp_IIa <- comp1_Spp_IIa
Spp_IIb <- comp2_Spp_IIb
Spp_Ia_geo <- unique(SppNamezClean[SppNamezClean$Species %in% Spp_Ia,])
Spp_Ib_geo <- unique(SppNamezClean[SppNamezClean$Species %in% Spp_Ib,])
Spp_IIa_geo <- unique(SppNamezClean[SppNamezClean$Species %in% Spp_IIa,])
Spp_IIb_geo <- unique(SppNamezClean[SppNamezClean$Species %in% Spp_IIb,])
measurements::conv_unit(Spp_Ia_geo$Lat, from = 'deg_min_sec', to = 'dec_deg')
source("https://raw.githubusercontent.com/AMBarbosa/unpackaged/master/dms2dec", encoding = "UTF-8")
Spp_Ia_geo$Lat <- dms2dec(Spp_Ia_geo$Lat)
Spp_Ia_geo$Long <- dms2dec(Spp_Ia_geo$Long)
Spp_Ib_geo$Lat <- dms2dec(Spp_Ib_geo$Lat)
Spp_Ib_geo$Long <- dms2dec(Spp_Ib_geo$Long)
Spp_IIa_geo$Lat <- dms2dec(Spp_IIa_geo$Lat)
Spp_IIa_geo$Long <- dms2dec(Spp_IIa_geo$Long)
Spp_IIb_geo$Lat <- dms2dec(Spp_IIb_geo$Lat)
Spp_IIb_geo$Long <- dms2dec(Spp_IIb_geo$Long)
Spp_Ia_geo <- cbind(Spp_Ia_geo,rep("Ia",nrow(Spp_Ia_geo)))
colnames(Spp_Ia_geo)[5] <-"Subclade"
Spp_Ib_geo <- cbind(Spp_Ib_geo,rep("Ia",nrow(Spp_Ib_geo)))
colnames(Spp_Ib_geo)[5] <-"Subclade"
Spp_IIa_geo <- cbind(Spp_IIa_geo,rep("Ia",nrow(Spp_IIa_geo)))
colnames(Spp_IIa_geo)[5] <-"Subclade"
Spp_IIb_geo <- cbind(Spp_IIb_geo,rep("Ia",nrow(Spp_IIb_geo)))
colnames(Spp_IIb_geo)[5] <-"Subclade"
Latz <- rbind(Spp_Ia_geo,Spp_Ib_geo,Spp_IIa_geo,Spp_IIb_geo)
library(sp)
Spp_Ia_geo_pts <- sp::SpatialPoints(coords = cbind(Spp_Ia_geo$Long,Spp_Ia_geo$Lat),CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
Spp_Ib_geo_pts <- sp::SpatialPoints(coords = cbind(Spp_Ib_geo$Long,Spp_Ib_geo$Lat),CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
Spp_IIa_geo_pts <- sp::SpatialPoints(coords = cbind(Spp_IIa_geo$Long,Spp_IIa_geo$Lat),CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
Spp_IIb_geo_pts <- sp::SpatialPoints(coords = cbind(Spp_IIb_geo$Long,Spp_IIb_geo$Lat),CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
world <- sf::st_read("/Users/Mike/Downloads/ne_10m_land/ne_10m_land.shp")
world_robin <- sf::st_transform(world, CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
world_robin_sp <- sf::as_Spatial(world_robin)
Spp_Ia_geo_pts_robin <- sf::as_Spatial(sf::st_transform(as(Spp_Ia_geo_pts,"sf"), CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")))
Spp_Ib_geo_pts_robin <- sf::as_Spatial(sf::st_transform(as(Spp_Ia_geo_pts,"sf"), CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")))
Spp_IIa_geo_pts_robin <- sf::as_Spatial(sf::st_transform(as(Spp_IIa_geo_pts,"sf"), CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")))
Spp_IIb_geo_pts_robin <- sf::as_Spatial(sf::st_transform(as(Spp_IIb_geo_pts,"sf"), CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")))
world_sp <- sf::as_Spatial(world)
library(sp)
pdf("newmap3.pdf")
plot(world_sp,col="lightgrey")
plot(Spp_Ib_geo_pts,col="pink",pch=8,add=T)
plot(Spp_Ia_geo_pts,col="red",pch=8,add=T)
plot(Spp_IIa_geo_pts,col="cyan",pch=8,add=T)
plot(Spp_IIb_geo_pts,col="blue",pch=8,add=T)
dev.off()