Introduction
Raw data
Trimming reads
Reconstructing plastomes
Downloading required R packages
Assessing phylogenetic utility of plastomes, genes/spacers, and regions
Inferring maximum likelihood tree
- Finding best model for each partition file.
Testing phyly
Inferring maximum clade credibility phylogenetic tree
- Making annotated BEAST tree figure
Abiotic analyses
- Climate analysis code
- Soil analysis code
Morphologic analyses
Making accessions map

Introduction

This is the pipeline used in the Pandan Plastome Phylogenetics Project manuscript. All data (raw, intermediate, and final) and code are stored in a single director, with its subdirectories organized thusly:

##                                 levelName
## 1  PandanPlastomePhylogeneticsProject2024
## 2   ¦--0_Raw_Data                        
## 3   ¦--1_Functions                       
## 4   ¦   °--SoilsSeeR                     
## 5   ¦--2_Scripts                         
## 6   ¦--3_Plastome_Reconstruction         
## 7   ¦   ¦--3.0_Uncurated_Reconstructions 
## 8   ¦   °--3.1_Curated_Reconstructions   
## 9   ¦--4_Plastome_Genes_Analysis         
## 10  ¦--5_ML_Analysis                     
## 11  ¦--6_Phyly_Analysis                  
## 12  ¦   ¦--Constraint_Newicks            
## 13  ¦   ¦--Trees                         
## 14  ¦   ¦--Clades                        
## 15  ¦   °--Subgenera                     
## 16  ¦--7_BEAST_Analysis                  
## 17  ¦   °--Trees                         
## 18  ¦--8_Climate_Analysis                
## 19  ¦--9_Soils_Analysis                  
## 20  ¦--10_Hydrenchyma_Images             
## 21  ¦--11_Morphological_Analysis         
## 22  °--12_External_Data                  
## 23      ¦--DSMW                          
## 24      °--wc2.1_30s_bio

Raw data

Input is in the form of sets of compressed fastq reads (forward and reverse). Forward and reverse files are designated by R1 vs R2 in the filename. They are available on the SRA. Below is a table with their accession numbers.

library(knitr)
uwu <- read.table("/Users/Mike/Downloads/metadata-14388163-processed-ok.tsv",sep = "\t")
colnames(uwu) <- uwu[1,]
uwu <- uwu[-1,]
uwu <- uwu[,colnames(uwu) %in% c("accession","biosample_accession","bioproject_accession","filetype")]
knitr::kable(uwu,col.names=colnames(uwu))

Trimming reads

Illumina® barcodes were removed directly after sequencing. Paired-end (PE) reads were trimmed (i.e., had low-quality reads and their mates removed, as well as adaptors removed) using Trimmomatic (Bolger, 2014) with default parameters (i.e., threshold set to 33). Trimmed PE reads were then merged using interleave_fastq (Garrison and Boisvert, 2021). The template below was used to perform the analysis:

#!/bin/bash

#R1_reads is a placeholder for the forward reads filename
#R2_reads is a placeholder for the reverse reads filename

java -jar /usr/local/Cellar/trimmomatic/0.39/libexec/trimmomatic-0.39.jar PE R1_reads.fastq R2_reads.fastq R1_reads_P.fastq R1_reads_U.fastq R2_reads_P.fastq R2_reads_U.fastq -threads 10 -phred33 HEADCROP:20 SLIDINGWINDOW:4:30 MINLEN:30

Reconstructing plastomes

To reconstruct plastomes, we used a hybrid de novo assembly approach (MITObim), with the P. tectorius (Wang B247) plastome from Tan et al. (2019; file renamed NC_042747_plastome.fasta) as the reference and a maximum of 30 iterations for each set of PE reads (Hahn et al., 2013). MITObim uses a hybrid de novo assembly approach, taking advantage of the first round of reference-based assembly (where only reads matching 100% of the reference genome were conserved) to seed a de novo assembly (done using MIRA 4.0.2, Hahn et al., 2013). This approach is scalable for assembling thousands of plastomes and is more time- and resource-efficient than pure de novo assembly methods (Hahn et al., 2013). Assembled plastome sequences (along with the reference sequence) were written in a fasta file and aligned using MAFFT v. 7.505 (Katoh et al., 2002) using default parameters. Alignments were then manually curated using Unipro UGENE (Okonechnikov et al., 2012). Manual curation identified the sort single-copy (SSC) region as being reverse-complemented in our reconstructions relative to that from Tan et al. (2019) and of the Carludovia palmata (Chase 14836) outgroup sequence from Mennes et al. (2015). We then corrected the orientation of the SSC in our reconstructions using Unipro UGENE.

#!/bin/bash

    ./interleave_fastq.sh R1_reads_P.fastq R2_reads_P.fastq > R12_reads_P.fastq


MITObim.pl -start 0 -end 100 -sample samplename -ref genome -readpool R12_reads_P.fastq --quick NC_042747_plastome.fasta -mirapath /Users/Mike/mira_4.0.2_darwin13.1.0_x86_64_static/bin --clean --verbose --paired

mafft --auto RevCompGood.fasta > AlignedRevCompGood.fasta

Downloading required R packages

install.packages(c("remotes","seqinr","genbankr", "zoo", "ape","pegas","phangorn", "treeio", "phyloch","strap","coda","factoextra","FactoMineR","raster","varSelRF","ranger","devtools","measurements","readxl","geiger","sp","sf"))
remotes::install_github("wojahn/PrideBar")
remotes::install_github("wojahn/SoilSeeR")

Assessing phylogenetic utility of plastomes, genes/spacers, and regions

To compare the plastomes’ phylogenetic potential relative to the regions used in prior studies, we inferred phylogenetic information for the sequences using PAUP* 4.0b10 (Swofford, 2003). The information obtained included No. of ingroup sampled species, No. sequences incl. outgroup, Sequence length range, Alignment length, Missing data (percentage of ingroup sequences; in brackets percentage of nucleotides for the combined analyses), No. constant characters (%), No. variable characters (%), and No. potentially parsimony-informative (PI) characters (%). Mean amount of phylogenetic information per sample (averaged by variable sites number/PI sites number). To determine where (i.e., in which regions, and in which genes or intergenic spacers) the plastomes’ nucleotide diversity (and thus potential phylogenetic signal) lay, we computed Nei nucleotide diversity (Nei, 1987) for each of the four regions of the plastome. The full pipeline for this part is contained in the R script Genewise_Nei_Analysis.R in the supplementary materials. Optimal moving window size was determined by taking the median of the gene sizes in Tan et al. (2019)’s annotation of their P. tectorius (Wang B247) plastome. This optimal window size was then used in the pipeline developed in Simmonds et al. (2021) to calculate Nei nucleotide diversity. The pipeline used the R packages ape, pegas (Paradis, 2010), seqinr (Charif and Lobry, 2007), and zoo (Zeileis and Grothendieck, 2005). For each of the four regions, peaks whose heights were in the 95th quantile and above had their location in the plastome and any associated gene(s)/intergenic spacer(s) recorded.

####################################################
# Inferring Median Gene Length
####################################################

  PanTecGB <- genbankr::parseGenBank("4_Plastome_Genes_Analysis/sequence.gb")
  message("...extracting features")
  PanTecGB_f <- PanTecGB[["FEATURES"]]
  Infoz <- as.data.frame(matrix(nrow=0,ncol=5))
  colnames(Infoz) <- c("start","end","strand","type","name")
  PrideBar::SetPrideBar(1,length(PanTecGB_f),2)
  for(i in 1:length(PanTecGB_f))
  {
    PrideBar::PrideBar()
    #message(sprintf("......processing No. %s of %s",i,length(PanTecGB_f)))
    tmp <- as.data.frame(PanTecGB_f[[i]])
    if(nrow(tmp) == 1)
    {
      toadd <- as.data.frame(matrix(nrow=1,ncol=5))
      colnames(toadd) <- c("start","end","strand","type","name")
      toadd[1,1] <- tmp[1,2]
      toadd[1,2] <- tmp[1,3]
      toadd[1,3] <- tmp[1,4]
      toadd[1,4] <- tmp[1,5]
      toadd[1,5] <- tmp[1,6]
      Infoz <- rbind(Infoz,toadd)
    }else{
      toadd <- as.data.frame(matrix(nrow=nrow(tmp),ncol=5))
      colnames(toadd) <- c("start","end","strand","type","name")
      for(j in 1:nrow(tmp))
      {
        toadd[j,1] <- tmp[j,2]
        toadd[j,2] <- tmp[j,3]
        toadd[j,3] <- tmp[j,4]
        toadd[j,4] <- tmp[j,5]
        toadd[j,5] <- tmp[j,6]
      }
      Infoz <- rbind(Infoz,toadd)
    }
  }
  PrideBar::FadePrideBar()
  Infoz_smol <- Infoz[!Infoz[,2] %in% c("source","misc_feature","repeat_region"),]
  Lengthz <- Infoz_smol[,2] - Infoz_smol[,1]
  median(Lengthz) #272
  mean(Lengthz) #1692.753
  pdf("4_Plastome_Genes_Analysis/GeneLengthHistogram.pdf")
  hist(Lengthz)
  dev.off()
  # Heavily skewed smaller, so we'll use median, round up to 300

####################################################
# Nei CBP
####################################################

  message("Parsing GenBank File...")
  GBFile <- "4_Plastome_Genes_Analysis/sequence.gb"
  ToAnal <- "3_Plastome_Reconstruction/AlignedRevCompGood.fasta"
  library(genbankr)
  PanTecGB <- genbankr::parseGenBank(GBFile)
  message("...extracting features")
  PanTecGB_f <- PanTecGB[["FEATURES"]]
  Infoz <- as.data.frame(matrix(nrow=0,ncol=5))
  colnames(Infoz) <- c("start","end","strand","type","name")
  PrideBar::SetPrideBar(1,length(PanTecGB_f),2)
  for(i in 1:length(PanTecGB_f))
  {
    #message(sprintf("......processing No. %s of %s",i,length(PanTecGB_f)))
    PrideBar::PrideBar()
    tmp <- as.data.frame(PanTecGB_f[[i]])
    if(nrow(tmp) == 1)
    {
      toadd <- as.data.frame(matrix(nrow=1,ncol=5))
      colnames(toadd) <- c("start","end","strand","type","name")
      toadd[1,1] <- tmp[1,2]
      toadd[1,2] <- tmp[1,3]
      toadd[1,3] <- tmp[1,4]
      toadd[1,4] <- tmp[1,5]
      toadd[1,5] <- tmp[1,6]
      Infoz <- rbind(Infoz,toadd)
    }else{
      toadd <- as.data.frame(matrix(nrow=nrow(tmp),ncol=5))
      colnames(toadd) <- c("start","end","strand","type","name")
      for(j in 1:nrow(tmp))
      {
        toadd[j,1] <- tmp[j,2]
        toadd[j,2] <- tmp[j,3]
        toadd[j,3] <- tmp[j,4]
        toadd[j,4] <- tmp[j,5]
        toadd[j,5] <- tmp[j,6]
      }
      Infoz <- rbind(Infoz,toadd)
    }
  }
  PrideBar::FadePrideBar()
  AWD <- seqinr::read.fasta(ToAnal)
  message("...isolating reference")
  AWD_ref <- as.character(AWD[[which(names(AWD) == "gi|1693290702|ref|NC_042747.1|")]])
  themz <- which(as.character(AWD_ref) != "-")
  TruNumz <- rep(NA, length(AWD_ref))
  for(i in 1:length(themz))
  {
    TruNumz[as.numeric(themz[i])] <- i
  }

# turn to df
uwu <- as.data.frame(matrix(nrow=length(AWD),ncol=length(as.character(AWD[[1]]))))
for(i in 1:length(AWD))
{
  message(sprintf("Processing No. %s of %s",i,length(AWD)))
  uwu[i,] <- as.vector(as.character(AWD[[i]]))
}

LSCi <- Infoz[Infoz$name == "large single copy region (LSC)",]
IRBi <- Infoz[Infoz$name == "inverted repeat B (IRB)",]
SSCi <- Infoz[Infoz$name == "small single copy region (SSC)",]
IRAi <- Infoz[Infoz$name == "inverted repeat A (IRA)",]
countR <- 0
RealNumz <- rep(NA, length(AWD_ref))
for(i in 1:length(AWD_ref))
{
  if(AWD_ref[i] != "-")
  {
    countR <- countR + 1
    RealNumz[i] <- countR
  }
}

LSCnumz <- seq(LSCi[1,1],LSCi[1,2],1)
IRBnumz <- seq(IRBi[1,1],IRBi[1,2],1)
SSCnumz <- seq(SSCi[1,1],SSCi[1,2],1)
IRAnumz <- seq(IRAi[1,1],IRAi[1,2],1)


LSC_only <- uwu[,seq(which(RealNumz == LSCi[1,1]),which(RealNumz == LSCi[1,2]))]
IRB_only <- uwu[,seq(which(RealNumz == IRBi[1,1]),which(RealNumz == IRBi[1,2]))]
SSC_only <- uwu[,seq(which(RealNumz == SSCi[1,1]),which(RealNumz == SSCi[1,2]))]
IRA_only <- uwu[,seq(which(RealNumz == IRAi[1,1]),which(RealNumz == IRAi[1,2]))]

LSC <- LSC_only
tmp <- LSC
outlist <- list()
for(i in 1:nrow(tmp))
{
  message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
  outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(AWD), "4_Plastome_Genes_Analysis/LSC_Good.fasta")

#IRB
IRB <-  IRB_only
tmp <- IRB
outlist <- list()
for(i in 1:nrow(tmp))
{
  message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
  outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(AWD), "4_Plastome_Genes_Analysis/IRB_Good.fasta")

SSC <- SSC_only
tmp <- SSC
outlist <- list()
for(i in 1:nrow(tmp))
{
  message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
  outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(AWD), "4_Plastome_Genes_Analysis/SSC_Good.fasta")

IRA <- IRA_only
tmp <- IRA
outlist <- list()
for(i in 1:nrow(tmp))
{
  message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
  outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(AWD), "4_Plastome_Genes_Analysis/IRA_Good.fasta")
gc()
LSC <- seqinr::read.fasta("4_Plastome_Genes_Analysis/LSC_Good.fasta")
IRB <- seqinr::read.fasta("4_Plastome_Genes_Analysis/IRB_Good.fasta")
SSC <- seqinr::read.fasta("4_Plastome_Genes_Analysis/SSC_Good.fasta")
IRA <- seqinr::read.fasta("4_Plastome_Genes_Analysis/IRA_Good.fasta")

###########################################
# SLIDING WINDOW PART
###########################################

# Sliding window across cpDNA genome
require(seqinr)
require(zoo)
require(ape)
require(pegas)

# All first
Gen <- seqinr::read.fasta(ToAnal)
genelength <- 300
# Establish moving window
seq <- seq(from=1, to=length(Gen[[1]]), by=genelength)

# Infer nucleotide diversity for all species in matrix
INFO <- matrix(ncol=2, nrow=length(seq))
rownames(INFO) <- seq
colnames(INFO) <- c("Nuc_div", "N_Poly_char")
pb = txtProgressBar(min = 0, max = length(seq), initial = 0)
for(i in 1:length(seq))
{
  setTxtProgressBar(pb,i)
  # Matrix
  val <- as.numeric(seq[i]):as.numeric(seq[i]+genelength-1)

  Short <- matrix(ncol=genelength, nrow=length(names(Gen)))
  for(j in 1:length(names(Gen)))
  {
    Short[j,] <- Gen[[j]][val]
  }
  # Infer nucleotide diversity and number of polymorphic char
  INFO[i,1] <- nuc.div(as.DNAbin(Short))
  INFO[i,2] <- length(seg.sites(as.DNAbin(Short)))
}
close(pb)

INFO[,1] <- gsub('NaN', '0', INFO[,1])
# Write results out
write.csv(INFO, file="4_Plastome_Genes_Analysis/All_moving_window_nuc_div.csv", quote = F)
INFO_all <- INFO

# Aligned cpDNA genomes
Gen <- seqinr::read.fasta('4_Plastome_Genes_Analysis/IRA_Good.fasta')
genelength <- 300
#Gen <- Gen[grepl("Pandanus",names(Gen))]
# Establish moving window
seq <- seq(from=1, to=length(Gen[[1]]), by=genelength)

# Infer nucleotide diversity for all species in matrix
INFO <- matrix(ncol=2, nrow=length(seq))
rownames(INFO) <- seq
colnames(INFO) <- c("Nuc_div", "N_Poly_char")
pb = txtProgressBar(min = 0, max = length(seq), initial = 0)
for(i in 1:length(seq))
{
  setTxtProgressBar(pb,i)
  # Matrix
  val <- as.numeric(seq[i]):as.numeric(seq[i]+genelength-1)

  Short <- matrix(ncol=genelength, nrow=length(names(Gen)))
  for(j in 1:length(names(Gen)))
  {
    Short[j,] <- Gen[[j]][val]
  }
  # Infer nucleotide diversity and number of polymorphic char
  INFO[i,1] <- nuc.div(as.DNAbin(Short))
  INFO[i,2] <- length(seg.sites(as.DNAbin(Short)))
}
close(pb)

INFO[,1] <- gsub('NaN', '0', INFO[,1])
# Write results out
write.csv(INFO, file="4_Plastome_Genes_Analysis/IRA_moving_window_nuc_div.csv", quote = F)
INFO_IRA_T <- INFO


Gen <- seqinr::read.fasta('4_Plastome_Genes_Analysis/IRB_Good.fasta')
genelength <- 300
#Gen <- Gen[grepl("Pandanus",names(Gen))]
# Establish moving window
seq <- seq(from=1, to=length(Gen[[1]]), by=genelength)

# Infer nucleotide diversity for all species in matrix
INFO <- matrix(ncol=2, nrow=length(seq))
rownames(INFO) <- seq
colnames(INFO) <- c("Nuc_div", "N_Poly_char")
pb = txtProgressBar(min = 0, max = length(seq), initial = 0)
for(i in 1:length(seq))
{
  setTxtProgressBar(pb,i)
  # Matrix
  val <- as.numeric(seq[i]):as.numeric(seq[i]+genelength-1)

  Short <- matrix(ncol=genelength, nrow=length(names(Gen)))
  for(j in 1:length(names(Gen)))
  {
    Short[j,] <- Gen[[j]][val]
  }
  # Infer nucleotide diversity and number of polymorphic char
  INFO[i,1] <- nuc.div(as.DNAbin(Short))
  INFO[i,2] <- length(seg.sites(as.DNAbin(Short)))
}
close(pb)

INFO[,1] <- gsub('NaN', '0', INFO[,1])
# Write results out
write.csv(INFO, file="4_Plastome_Genes_Analysis/IRB_moving_window_nuc_div.csv", quote = F)
INFO_IRB_T <- INFO

Gen <- seqinr::read.fasta('4_Plastome_Genes_Analysis/LSC_Good.fasta')
genelength <- 300
#Gen <- Gen[grepl("Pandanus",names(Gen))]
# Establish moving window
seq <- seq(from=1, to=length(Gen[[1]]), by=genelength)

# Infer nucleotide diversity for all species in matrix
INFO <- matrix(ncol=2, nrow=length(seq))
rownames(INFO) <- seq
colnames(INFO) <- c("Nuc_div", "N_Poly_char")
pb = txtProgressBar(min = 0, max = length(seq), initial = 0)
for(i in 1:length(seq))
{
  setTxtProgressBar(pb,i)
  # Matrix
  val <- as.numeric(seq[i]):as.numeric(seq[i]+genelength-1)

  Short <- matrix(ncol=genelength, nrow=length(names(Gen)))
  for(j in 1:length(names(Gen)))
  {
    Short[j,] <- Gen[[j]][val]
  }
  # Infer nucleotide diversity and number of polymorphic char
  INFO[i,1] <- nuc.div(as.DNAbin(Short))
  INFO[i,2] <- length(seg.sites(as.DNAbin(Short)))
}
close(pb)

INFO[,1] <- gsub('NaN', '0', INFO[,1])
# Write results out
write.csv(INFO, file="4_Plastome_Genes_Analysis/LSC_moving_window_nuc_div.csv", quote = F)
INFO_LSC_T <- INFO

Gen <- seqinr::read.fasta('4_Plastome_Genes_Analysis/SSC_Good.fasta')
genelength <- 300
#Gen <- Gen[grepl("Pandanus",names(Gen))]
# Establish moving window
seq <- seq(from=1, to=length(Gen[[1]]), by=genelength)

# Infer nucleotide diversity for all species in matrix
INFO <- matrix(ncol=2, nrow=length(seq))
rownames(INFO) <- seq
colnames(INFO) <- c("Nuc_div", "N_Poly_char")
pb = txtProgressBar(min = 0, max = length(seq), initial = 0)
for(i in 1:length(seq))
{
  setTxtProgressBar(pb,i)
  # Matrix
  val <- as.numeric(seq[i]):as.numeric(seq[i]+genelength-1)

  Short <- matrix(ncol=genelength, nrow=length(names(Gen)))
  for(j in 1:length(names(Gen)))
  {
    Short[j,] <- Gen[[j]][val]
  }
  # Infer nucleotide diversity and number of polymorphic char
  INFO[i,1] <- nuc.div(as.DNAbin(Short))
  INFO[i,2] <- length(seg.sites(as.DNAbin(Short)))
}
close(pb)

INFO[,1] <- gsub('NaN', '0', INFO[,1])
# Write results out
write.csv(INFO, file="4_Plastome_Genes_Analysis/SSC_moving_window_nuc_div.csv", quote = F)
INFO_SSC_T <- INFO

#############################################
# MAKING NEI DIVERSITY FIGURE
#############################################

INFO_IRA_T <- read.csv("4_Plastome_Genes_Analysis/IRA_moving_window_nuc_div.csv")
INFO_IRB_T <- read.csv("4_Plastome_Genes_Analysis/IRB_moving_window_nuc_div.csv")
INFO_LSC_T <- read.csv("4_Plastome_Genes_Analysis/LSC_moving_window_nuc_div.csv")
INFO_SSC_T <- read.csv("4_Plastome_Genes_Analysis/SSC_moving_window_nuc_div.csv")

{
pdf("4_Plastome_Genes_Analysis/300_bp_NeiPartitions_2024.pdf")
par(mfrow = c(2, 2))
plot(as.numeric(INFO_LSC_T[,1])/1000, as.numeric(INFO_LSC_T[,2]), type='l', xlab='Alignment (Kbp)', ylab='Nucleotide diversity', ylim=c(0, 0.1),las = 2,main="LSC")
abline(h = quantile(as.numeric(INFO_LSC_T[,1]), .95), lwd = 1, col="blue")
plot(as.numeric(INFO_IRB_T[,1])/1000, as.numeric(INFO_IRB_T[,2]), type='l', xlab='Alignment (Kbp)', ylab='Nucleotide diversity', ylim=c(0, 0.1),las = 2, main="IRB")
abline(h = quantile(as.numeric(INFO_IRB_T[,1]), .95), lwd = 1, col="blue")
plot(as.numeric(INFO_SSC_T[,1])/1000, as.numeric(INFO_SSC_T[,2]), type='l', xlab='Alignment (Kbp)', ylab='Nucleotide diversity', ylim=c(0, 0.1),las = 2, main="SSC")
abline(h = quantile(as.numeric(INFO_SSC_T[,1]), .95), lwd = 1, col="blue")
plot(as.numeric(INFO_IRA_T[,1])/1000, as.numeric(INFO_IRA_T[,2]), type='l', xlab='Alignment (Kbp)', ylab='Nucleotide diversity', ylim=c(0, 0.1),las = 2, main="IRA")
abline(h = quantile(as.numeric(INFO_IRA_T[,1]), .95), lwd = 1, col="blue")
dev.off()
}

INFO <- read.csv("4_Plastome_Genes_Analysis/All_moving_window_nuc_div.csv")

pdf("4_Plastome_Genes_Analysis/300_bp_NeiPartitions_Whole_2024.pdf")
par(mfrow = c(2, 2))
plot(as.numeric(row.names(INFO_all))/1000, as.numeric(INFO_all[,1]), type='l', xlab='Alignment (Kbp)', ylab='Nucleotide diversity', ylim=c(0, 0.1),las = 2,main="LSC")
abline(h = quantile(as.numeric(INFO_all[,1]), .95), lwd = 1, col="blue")
dev.off()

####################################################
# Exploring Peaks
####################################################

source("1_Functions/Peaks2Annotations.R")

#If unannotated Gene before then gene after for intergenic spacer
# add addrow with number for peak
# Table with arrow number, partition, gene, and diversity
#Put other regions in past (12 and 15) and tell Nei gd as comparison (could be in gray, if not retrieved in our system)
GBFile <- "4_Plastome_Genes_Analysis/sequence.gb"
ToAnal <- "4_Plastome_Genes_Analysis/LSC_Good.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/LSC_moving_window_nuc_div.csv"
Wherez <- "large single copy region (LSC)"
RefTaxon <- "Pandanus tectorius"
LSC_Infoz <- Peaks2Annotations(ToAnal,GBFile,NeiDiv,Wherez,RefTaxon)
write.csv(LSC_Infoz,"4_Plastome_Genes_Analysis/LSC_Infoz.csv",row.names=F)

GBFile <- "4_Plastome_Genes_Analysis/sequence.gb"
ToAnal <- "4_Plastome_Genes_Analysis/IRA_Good.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/IRA_moving_window_nuc_div.csv"
Wherez <- "inverted repeat A (IRA)"
RefTaxon <- "Pandanus tectorius"
IRA_Infoz <- Peaks2Annotations(ToAnal,GBFile,NeiDiv,Wherez,RefTaxon)
write.csv(IRA_Infoz,"4_Plastome_Genes_Analysis/IRA_Infoz.csv",row.names=F)

GBFile <- "4_Plastome_Genes_Analysis/sequence.gb"
ToAnal <- "4_Plastome_Genes_Analysis/IRB_Good.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/IRB_moving_window_nuc_div.csv"
Wherez <- "inverted repeat B (IRB)"
RefTaxon <- "Pandanus tectorius"
IRB_Infoz <- Peaks2Annotations(ToAnal,GBFile,NeiDiv,Wherez,RefTaxon)
write.csv(IRB_Infoz,"4_Plastome_Genes_Analysis/IRB_Infoz.csv",row.names=F)

GBFile <- "/Users/Mike/Downloads/sequence.gb"
ToAnal <- "4_Plastome_Genes_Analysis/SSC_Good.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/SSC_moving_window_nuc_div.csv"
Wherez <- "small single copy region (SSC)"
RefTaxon <- "Pandanus tectorius"
SSC_Infoz <- Peaks2Annotations(ToAnal,GBFile,NeiDiv,Wherez,RefTaxon)
write.csv(SSC_Infoz,"4_Plastome_Genes_Analysis/SSC_Infoz.csv",row.names=F)

SSC_Infoz <- read.csv("4_Plastome_Genes_Analysis/SSC_Infoz.csv")
LSC_Infoz <- read.csv("4_Plastome_Genes_Analysis/LSC_Infoz.csv")
IRA_Infoz <- read.csv("4_Plastome_Genes_Analysis/IRA_Infoz.csv")
IRB_Infoz <- read.csv("4_Plastome_Genes_Analysis/IRB_Infoz.csv")

SSC_Infoz <- cbind(rep("SSC",nrow(SSC_Infoz)),SSC_Infoz)
colnames(SSC_Infoz) <- "Partition"
IRB_Infoz <- cbind(rep("IRB",nrow(IRB_Infoz)),IRB_Infoz)
colnames(IRB_Infoz) <- "Partition"
LSC_Infoz <- cbind(rep("LSC",nrow(LSC_Infoz)),LSC_Infoz)
colnames(LSC_Infoz) <- "Partition"
IRA_Infoz <- cbind(rep("IRA",nrow(IRA_Infoz)),IRA_Infoz)
colnames(IRA_Infoz) <- "Partition"

All_Infoz <- rbind(LSC_Infoz,IRB_Infoz,SSC_Infoz,IRA_Infoz)
All_Infoz <- All_Infoz[,-2]
colnames(All_Infoz) <- c("Partition","Peak Nei Diversity","Name")
write.csv(All_Infoz,"4_Plastome_Genes_Analysis/AllInfoz2.csv",row.names = F)

GBFile <- "4_Plastome_Genes_Analysis/sequence.gb"
ToAnal <- "3_Plastome_Reconstruction/AlignedRevCompGood.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/All_moving_window_nuc_div.csv"
Kwery <- "trnL"

source("1_Functions/GetPeakFromName.R")

GetPeakFromName(ToAnal,GBFile,NeiDiv,Kwery="matK")
#0.01016148

#Buerki et al. 2012
#matK (only , not much)
#trnL-trnF (trnL gene is in but not IGS in top 5%) 49127-50108 0.005081316
#trnQ-rps16 (Found in top 5% of SSC peaks in ours) 0.0198589

#Gallaher et al. 2015
#trnL-trnF (trnL gene but not IGS in top 5%) 49127-50108 0.005081316
#ndhF-rpl32 (not in top 5%) 129243-130403 0.002525253
#trnQ-rps16 (Found in top 5% of SSC peaks in ours) 0.0198589

#There was no substantial conflict between the chloroplast and nuclear genomes (congruent topology and BPP > 0.9, >80% bootstrap for major clades, see Supplementary Fig. 1).
GBFile <- "/Users/Mike/Downloads/sequence.gb"
ToAnal <- "3_Plastome_Reconstruction/AlignedRevCompGood.fasta"
NeiDiv <- "4_Plastome_Genes_Analysis/All_moving_window_nuc_div.csv"
Kwery1 <- "ndhF"

source("1_Functions/IntergenicPeaks2Annotations.R")

IntergenicPeaks2Annotations(ToAnal,GBFile,NeiDiv,Poz1=49127,Poz2=50108) #0.005081316
IntergenicPeaks2Annotations(ToAnal,GBFile,NeiDiv,Poz1=129243,Poz2=130403) #0.002525253

#######################################
# MAKING ANNOTATED FIGURE
#######################################

Allz <- read.csv("4_Plastome_Genes_Analysis/All_moving_window_nuc_div.csv")

which(TruNumz == 1844) # 2734 matK
which(TruNumz == 49127) # 57966 trnL-trnF mean(c(57966,59226)) 58596
which(TruNumz == 50108) # 59226 trnL-trnF
which(TruNumz == 5910) # 7457 trnQ-rps16 mean(c(7457,9083)) 8270
which(TruNumz == 6950) # 9083 trnQ-rps16
which(TruNumz == 129243) # 150561 ndhF-rpl32 mean(c(150561,152573)) 151567
which(TruNumz == 130403) # 152573 ndhF-rpl32
which(TruNumz == 130403)

tmp <- read.fasta("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/4_Plastome_Genes_Analysis/IRB_Good.fasta")
length(as.character(tmp[[1]]))
#104751 + 28770
INFO_IRA_T <- read.csv("4_Plastome_Genes_Analysis/IRA_moving_window_nuc_div.csv")
INFO_IRB_T <- read.csv("4_Plastome_Genes_Analysis/IRB_moving_window_nuc_div.csv")
INFO_LSC_T <- read.csv("4_Plastome_Genes_Analysis/LSC_moving_window_nuc_div.csv")
INFO_SSC_T <- read.csv("4_Plastome_Genes_Analysis/SSC_moving_window_nuc_div.csv")

library(shape)
{
  pdf("4_Plastome_Genes_Analysis/Annotated_300_bp_NeiPartitions_2024.pdf",width=6, height= 6)
  {
    #par(mfrow = c(2, 2),mar = c(1.5, 3, 1, 1),oma = c(2, 2, 0.1, 0.1),cex.lab=1.7)
    par(mfrow = c(2, 2))#,oma = c(1,2,1,1))
    par(mar = c(1.5, 4, 1, 0.1))
    plot(as.numeric(INFO_LSC_T[,1])/1000, as.numeric(INFO_LSC_T[,2]), type='l', xlab='', ylab='Nucleotide Diversity', ylim=c(0, 0.1), mgp=c(2,1,0))
      abline(h = quantile(as.numeric(INFO_LSC_T[,2]), .95), lwd = 1, col="blue")
      text(x=2734/1000,y=.075,labels="matK",col="red",srt = 90,font = 3, cex=1)
      Arrows(x0=2734/1000,y0=.06,x=2734/1000,y=.02,arr.type="curved", arr.width=0.2,col="red")
      text(x=58801/1000,y=.08,labels="trnL-trnF",col="red",srt = 90,font = 3, cex=1)
      Arrows(x0=58596/1000,y0=.06,x=58596/1000,y=.02,arr.type="curved", arr.width=0.2,col="red")
      text(x=8270/1000,y=.083,labels="trnQ-rps16",col="red",srt = 90,font = 3, cex=1)
      Arrows(x0=8270/1000,y0=.06,x=8270/1000,y=.02,arr.type="curved", arr.width=0.2,col="red")
    par(mar = c(1.5, 1, 1, 1))
    plot(as.numeric(INFO_IRB_T[,1])/1000, as.numeric(INFO_IRB_T[,2]), type='l', xlab='', ylab = "Nucleotide Diversity", ylim=c(0, 0.1),las = 0,yaxt='n',mgp=c(2,1,0))
      abline(h = quantile(as.numeric(INFO_IRB_T[,2]), .95), lwd = 1, col="blue")
    par(mar = c(3, 4, 1, 0.1))
    plot(as.numeric(INFO_SSC_T[,1])/1000, as.numeric(INFO_SSC_T[,2]), type='l', xlab='Alignment (Kbp)', ylab='', ylim=c(0, 0.1),las = 0)
      abline(h = quantile(as.numeric(INFO_SSC_T[,2]), .95), lwd = 1, col="blue")
      text(x=18365/1000,y=.08,labels="ndhF-rpl32",col="red",srt = 90,font = 3, cex=1)
      Arrows(x0=18365/1000,y0=.06,x=18365/1000,y=.02,arr.type="curved", arr.width=0.2,col="red")
    par(mar = c(3, 1, 1, 1))
    plot(as.numeric(INFO_IRA_T[,1])/1000, as.numeric(INFO_IRA_T[,2]), type='l', xlab='Alignment (Kbp)', ylab="", ylim=c(0, 0.1),las = 0,yaxt = "n")
      abline(h = quantile(as.numeric(INFO_IRA_T[,2]), .95), lwd = 1, col="blue")
  }
  dev.off()
}

pdf("4_Plastome_Genes_Analysis/Annotated_300_bp_NeiPartitions_2024_LSC.pdf")
 plot(as.numeric(INFO_LSC_T[,1])/1000, as.numeric(INFO_LSC_T[,2]), type='h', xlab='', ylab='Nucleotide Diversity', ylim=c(0, 0.1))
      abline(h = quantile(as.numeric(INFO_LSC_T[,2]), .95), lwd = 1, col="gray")
dev.off()
pdf("4_Plastome_Genes_Analysis/Annotated_300_bp_NeiPartitions_2024_IRB.pdf")
    plot(as.numeric(INFO_IRB_T[,1])/1000, as.numeric(INFO_IRB_T[,2]), type='h', xlab='', ylab = "Nucleotide Diversity", ylim=c(0, 0.1),las = 0,yaxt='n')
      abline(h = quantile(as.numeric(INFO_IRB_T[,2]), .95), lwd = 1, col="gray")
dev.off()
pdf("4_Plastome_Genes_Analysis/Annotated_300_bp_NeiPartitions_2024_SSC.pdf")
    plot(as.numeric(INFO_SSC_T[,1])/1000, as.numeric(INFO_SSC_T[,2]), type='h', xlab='Alignment (Kbp)', ylab='', ylim=c(0, 0.1),las = 0)
      abline(h = quantile(as.numeric(INFO_SSC_T[,2]), .95), lwd = 1, col="gray")
dev.off()
pdf("4_Plastome_Genes_Analysis/Annotated_300_bp_NeiPartitions_2024_IRA.pdf")
    plot(as.numeric(INFO_IRA_T[,1])/1000, as.numeric(INFO_IRA_T[,2]), type='h', xlab='Alignment (Kbp)', ylab="", ylim=c(0, 0.1),las = 0,yaxt = "n")
      abline(h = quantile(as.numeric(INFO_IRA_T[,2]), .95), lwd = 1, col="gray")
      dev.off()

# Making partition file

  message("Parsing GenBank File...")
  library(genbankr)
  PanTecGB <- genbankr::parseGenBank(GBFile)
  message("...extracting features")
  PanTecGB_f <- PanTecGB[["FEATURES"]]
  Infoz <- as.data.frame(matrix(nrow=0,ncol=5))
  colnames(Infoz) <- c("start","end","strand","type","name")
  for(i in 1:length(PanTecGB_f))
  {
    message(sprintf("......processing No. %s of %s",i,length(PanTecGB_f)))
    tmp <- as.data.frame(PanTecGB_f[[i]])
    if(nrow(tmp) == 1)
    {
      toadd <- as.data.frame(matrix(nrow=1,ncol=5))
      colnames(toadd) <- c("start","end","strand","type","name")
      toadd[1,1] <- tmp[1,2]
      toadd[1,2] <- tmp[1,3]
      toadd[1,3] <- tmp[1,4]
      toadd[1,4] <- tmp[1,5]
      toadd[1,5] <- tmp[1,6]
      Infoz <- rbind(Infoz,toadd)
    }else{
      toadd <- as.data.frame(matrix(nrow=nrow(tmp),ncol=5))
      colnames(toadd) <- c("start","end","strand","type","name")
      for(j in 1:nrow(tmp))
      {
        toadd[j,1] <- tmp[j,2]
        toadd[j,2] <- tmp[j,3]
        toadd[j,3] <- tmp[j,4]
        toadd[j,4] <- tmp[j,5]
        toadd[j,5] <- tmp[j,6]
      }
      Infoz <- rbind(Infoz,toadd)
    }
  }
  ToAnal <- "3_Plastome_Reconstruction/AlignedRevCompGood.fasta"
  AWD <- seqinr::read.fasta(ToAnal)
  message("...isolating reference")
  AWD_ref <- as.character(AWD[[which(names(AWD) == "gi|1693290702|ref|NC_042747.1|")]])
  AWD_c <- data.frame(as.numeric(as.numeric(INFO_AWD_T[,1])),as.numeric(INFO_AWD_T[,2]))
  themz <- which(as.character(AWD_ref) != "-")
  TruNumz <- rep(NA, length(AWD_ref))
  for(i in 1:length(themz))
  {
    TruNumz[as.numeric(themz[i])] <- i
  }

Inferring maximum likelihood tree

Finding best model for each partition file.

To infer a maximum likelihood tree to be used for testing subgeneric monophyly, evolutionary models for each partition were identified using using ModelTest-NG (Darriba et al., 2020)

untrimmed <- seqinr::read.fasta("/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/AlignedPandanusPlastomes.fasta")
ref <- untrimmed[[which(names(untrimmed) == "gi|1693290702|ref|NC_042747.1|")]]
refchar <- as.character(ref)
countR <- 0
poz <- rep(NA,length(refchar))
refpoz <- rep(NA,length(refchar))
for(i in 1:length(refchar))
{
  if(refchar[[i]] != "-")
  {
    countR <- countR + 1
    refpoz[i] <- countR
  }else{
    poz[i] <- i
  }
}

# turn to df
uwu <- as.data.frame(matrix(nrow=length(untrimmed),ncol=length(as.character(untrimmed[[1]]))))
for(i in 1:length(untrimmed))
{
  message(sprintf("Processing No. %s of %s",i,length(untrimmed)))
  uwu[i,] <- as.vector(as.character(untrimmed[[i]]))
}

#LSC
LSC <- uwu[,c(seq(which(refpoz == 1),which(refpoz == 87448)))]
tmp <- LSC
outlist <- list()
for(i in 1:nrow(tmp))
{
  message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
  outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(untrimmed), "/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/LSC.fasta")

#IRB
IRB <- uwu[,c(seq(which(refpoz == 87448),which(refpoz == 114149)))]
tmp <- IRB
outlist <- list()
for(i in 1:nrow(tmp))
{
  message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
  outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(untrimmed), "/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/IRB.fasta")

SSC <- uwu[,c(seq(which(refpoz == 114150),which(refpoz == 132660)))]
tmp <- SSC
outlist <- list()
for(i in 1:nrow(tmp))
{
  message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
  outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(untrimmed), "/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/SSC.fasta")
IRA <- uwu[,c(seq(which(refpoz == 132661),which(refpoz == 159361)))]
tmp <- IRA
outlist <- list()
for(i in 1:nrow(tmp))
{
  message(sprintf("Processing No. %s of %s",i,nrow(tmp)))
  outlist[[i]] <- as.vector(tmp[i,])
}
seqinr::write.fasta(outlist, names(untrimmed), "/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/IRA.fasta")

A best-scoring tree was built by RAxML-HPC version 8 using the curated plastome alignment (Stamatakis, 2014). Default parameters were used, the analysis type was changed to rapid bootstrap analysis/search for best-scoring ML tree (-f a), bootstrapping type was set to rapid bootstrapping (-x), four partitions (corresponding to the LSC, IRB, SSC, and IRA) were set (-q), and autoMRE was used to determine when to stop bootstrapping.


raxmlHPC -T 28 -f a -N autoMRE -n Wojahn2024 -s AlignedRevCompGood.fasta -m GTRCAT -p 12345 -x 12345 -o OUTGROUP_Carludovica_palmata_voucher_K_Chase -q RAxML_Partition_Scheme.txt

Testing phyly

Making constraint Newicks

To test subgeneric phyly and validate cladewise monophyly, we first made constraint newicks in R.

# Make constraint Newick files for each group for which we plan to test phyly
source("1_Functions/ConstraintNewickMakeR.R")

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Kurzia"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Vinsonia"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Pandanus"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Rykia"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Coronata"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Lophostigma"
KeyCol <- 4
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "I"
KeyCol <- 15
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "II"
KeyCol <- 15
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Ia"
KeyCol <- 14
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "IIa"
KeyCol <- 14
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "Ib"
KeyCol <- 14
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

BaseTree <- "5_ML_Analysis/PartitionedMLPLastomeTree.tree"
Keypath <- "0_Raw_Data/Key.csv"
ContraintName <- "IIb"
KeyCol <- 14
ConstraintNewickMakeR(Keypath,ContraintName,KeyCol,BaseTree)

Inferring ML Trees

We then used them to make ML best-scoring trees were built by RAxML-HPC version 8 (Stamatakis, 2014) using the curated plastome alignment with four partitions (LSC, IRB, SSC, IRA). Default parameters were used, the analysis type was changed to rapid bootstrap analysis/search for best-scoring ML tree (-f a), bootstrapping type was set to rapid bootstrapping (-x), and autoMRE was used to determine when to stop bootstrapping. Depending on the purpose of the tree, a constraint file was added using the -g option to constrain the clade being tested to monophyly. Here is an example where the subgenus/clad/subclade name is replaced by the placeholder XXX.


raxmlHPC -T 28 -f a -N autoMRE -n Wojahn2024_XXX -s AlignedRevCompGood.fasta -m GTRCAT -p 12345 -x 12345 -o OUTGROUP_Carludovica_palmata_voucher_K_Chase -q RAxML_Partition_Scheme.txt -g Newick_XXX.txt.

Conducting Shimodaira-Hasegawa tests

The phangorn package was used to conduct the Shimodira-Hasagawa tests in R. Phyly was tested for subgenera and for deep clades (i.e., clades I and II from Buerki et al., 2012 and four subclades Ia, Ib, IIa, and IIb).

setwd("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/6_Phyly_Analysis/Trees/Subgenera")
SubgenTreez <- list.files()
PhylyResultz <- list()
for(i in 1:length(SubgenTreez))
{
  nomen <- gsub("RAxML_bipartitions.","",SubgenTreez[i])
  nomen <- gsub("BackboneNewLowThread.tree","",nomen)
  message(sprintf("Processing Subgenus '%s', No. %s of %s",nomen,i,length(SubgenTreez)))
  library(phangorn)
  UwUTree1 <- ape::read.tree("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/5_ML_Analysis/PartitionedMLPLastomeTree.tree")
  UwUTree2<- ape::read.tree(SubgenTreez[i])
  PhyData <- ape::read.FASTA("/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/AlignedRevCompGood.fasta")
  PhyData <- phangorn::as.phyDat(PhyData)
  fit1 <- phangorn::pml(UwUTree1, PhyData)
  fit2 <- phangorn::pml(UwUTree2, PhyData)
  fit1 <- phangorn::optim.pml(fit1) # optimize edge weights
  fit2 <- phangorn::optim.pml(fit2)
  #ape::write.tree(fit1,"6_Phyly_Analysis/fit1.tree")
  #ape::write.tree(fit2,"6_Phyly_Analysis/fit2.tree")
  # with pml objects as input
  PhylyResult <- phangorn::SH.test(fit1, fit2, B=10000)
  # in real analysis use larger B, e.g. 10000
  PhylyResultz[[i]] <- PhylyResult
  names(PhylyResultz)[[i]] <- nomen
}

SH_test_results <- as.data.frame(matrix(nrow=length(names(PhylyResultz)),ncol=8))
colnames(SH_test_results) <- c("Subgenus","Unconstrained ln L","Unconstrained Diff ln L", "Unconstrained p-value","Constrained ln L","Constrained Diff ln L", "Constrained p-value", "Phyly")
SH_test_results[,1] <- names(PhylyResultz)
for(i in 1:nrow(SH_test_results))
{
  tmp <- PhylyResultz[[i]]
  SH_test_results[i,2] <- tmp[1,2]
  SH_test_results[i,3] <- tmp[1,3]
  SH_test_results[i,4] <- tmp[1,4]
  SH_test_results[i,5] <- tmp[2,2]
  SH_test_results[i,6] <- tmp[2,3]
  SH_test_results[i,7] <- tmp[2,4]
}
write.csv(SH_test_results,"/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/6_Phyly_Analysis/SH_test_results.csv",row.names=F)


setwd("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/6_Phyly_Analysis/Trees/Clades")
SubgenTreez <- list.files()
PhylyResultz <- list()
for(i in 1:length(SubgenTreez))
{
  nomen <- gsub("RAxML_bipartitions.","",SubgenTreez[i])
  nomen <- gsub("BackboneNewLowThread.tree","",nomen)
  nomen <- gsub("_","",nomen)
  message(sprintf("Processing Subgenus '%s', No. %s of %s",nomen,i,length(SubgenTreez)))
  library(phangorn)
  UwUTree1 <- ape::read.tree("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/5_ML_Analysis/PartitionedMLPLastomeTree.tree")
  UwUTree2<- ape::read.tree(SubgenTreez[i])
  PhyData <- ape::read.FASTA("/Users/Mike/BSU_DRIVE/BiorganellarManuscript/ReAnalysisPlastomes/PartitionedML/New/AlignedRevCompGood.fasta")
  PhyData <- phangorn::as.phyDat(PhyData)
  fit1 <- phangorn::pml(UwUTree1, PhyData)
  fit2 <- phangorn::pml(UwUTree2, PhyData)
  fit1 <- phangorn::optim.pml(fit1) # optimize edge weights
  fit2 <- phangorn::optim.pml(fit2)
  #ape::write.tree(fit1,"6_Phyly_Analysis/fit1.tree")
  #ape::write.tree(fit2,"6_Phyly_Analysis/fit2.tree")
  # with pml objects as input
  PhylyResult <- phangorn::SH.test(fit1, fit2, B=10000)
  # in real analysis use larger B, e.g. 10000
  PhylyResultz[[i]] <- PhylyResult
  names(PhylyResultz)[[i]] <- nomen
}

SH_test_results <- as.data.frame(matrix(nrow=length(names(PhylyResultz)),ncol=8))
colnames(SH_test_results) <- c("Subgenus","Unconstrained ln L","Unconstrained Diff ln L", "Unconstrained p-value","Constrained ln L","Constrained Diff ln L", "Constrained p-value", "Phyly")
SH_test_results[,1] <- names(PhylyResultz)
for(i in 1:nrow(SH_test_results))
{
  tmp <- PhylyResultz[[i]]
  SH_test_results[i,2] <- tmp[1,2]
  SH_test_results[i,3] <- tmp[1,3]
  SH_test_results[i,4] <- tmp[1,4]
  SH_test_results[i,5] <- tmp[2,2]
  SH_test_results[i,6] <- tmp[2,3]
  SH_test_results[i,7] <- tmp[2,4]
}
write.csv(SH_test_results,"/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/6_Phyly_Analysis/Clades_SH_test_results.csv",row.names=F)

Inferring maximum clade credibility phylogenetic tree

To examine phylogenetic clustering considering past events and to perform climatic, pedological, and morphological analyses, a rooted and time-measured isochronous phylogenetic tree was inferred. Beauti v. 1.10.4 (Drummond & Rambaut, 2018) was then used to create an input file for BEAST v. 1.10.4 (Drummond & Rambaut, 2018). Four partitions were set (LSC, IRB, SSC, and IRA). Uniform priors were set as follows using three secondary calibration points obtained by Gallaher et al. (2015; see Table 1 in reference and values of 95% HPD obtained from Cyclanthus and Pandanaceae fossils): (a) Crown of Pandanaceae (upper : 97.5 MA, lower : 41.9 MA), splits between b) Freycinetia and clade including Pandanus and Benstonea (upper : 60.7 MA, lower : 24.6 MA), and c) Pandanus and Benstonea (upper : 33.0 MA, lower : 12.5 MA). We used an uncorrelated relaxed molecular clock with a lognormal distribution of rates and a Yule speciation model. The analysis was run on the CIPRES portal 345,287,000 generations total (divided into four analyses), sampling one tree every 1,000th generation Parameter convergence was confirmed in Tracer v. 1.7.2 (Rambaut et al, 2018). A maximum clade credibility tree with median branch lengths and 95% highest posterior density (HPD) interval on nodes was reconstructed was using Tree Annotator v. 1.10.4. (Drummond & Rambaut, 2018) with the first 10% of trees discarded as burn-in.

Making annotated BEAST tree figure

# Read in tree, data, and key
library(treeio)
BeastTree <- treeio::read.beast("7_BEAST_Analysis/Trees/med.tree")
# Destroy the tibble heresy
BeastTable <- BeastTree@data
BeastTreePhylo <- treeio::as.phylo(BeastTree)
rm(BeastTree)
# Remove outgroup
Key <- read.csv("0_Raw_Data/Key.csv")

# Make tip labels
BT_names <- BeastTreePhylo$tip.label
NewNames <- rep(NA,length(BT_names))
nonPandanus <- c("OUTGROUP_Carludovica_palmata_voucher_K_Chase",
                "Pandan0062-Sararanga_philippinensis-Chloroplast-Genome",
                "Pandan0007-Freycinetia_caudata-Chloroplast-Genome",
                "Pandan0003-Benstonea_inquilina-Chloroplast-Genome",
                "Pandan0005-Benstonea_serpentinica-Chloroplast-Genome",
                "Pandan0001-Benstonea_brevistyla-Chloroplast-Genome",
                "Pandan0006-Benstonea_tunicata-Chloroplast-Genome",
                "Pandan0002-Benstonea_epiphytica-Chloroplast-Genome",
                "Pandan0004-Benstonea_pachyphylla-Chloroplast-Genome")
for(i in 1:length(BT_names))
{
  uwu <- BT_names[i]
  if(uwu %in% c(nonPandanus,"gi|1693290702|ref|NC_042747.1|"))
  {
    if(grepl("OUTGROUP",uwu))
    {
       NewNames[i] <- "Carludovica palmata (Chase 14836)"
    }else if(uwu == "gi|1693290702|ref|NC_042747.1|"){
       NewNames[i] <- "Pandanus tectorius (Wang B247)"
    }else{
      PNum <- strsplit(uwu, split="-")[[1]][1]
      owo <- Key[Key$UniqueID == PNum,]
      NewNames[i] <- sprintf("%s (%s %s)",owo$Taxon,owo$Collector,owo$Coll...)
    }
  }else{
    PNum <- strsplit(uwu, split="-")[[1]][1]
    owo <- Key[Key$UniqueID == PNum,]
    NewNames[i] <- sprintf("%s (%s %s)",owo$Taxon,owo$Collector,owo$Coll...)
  }
}

# Make figure
library("phytools")
library("phyloch")
library("strap")
library("coda")

t <- phyloch::read.beast("7_BEAST_Analysis/Trees/med.tree") # I couldn't upload the file here
t$tip.label <- NewNames
t <- drop.tip(t,"Carludovica palmata (Chase 14836)")
t$root.time <- t$height[1]

num_taxa <- length(t$tip.label)

display_all_node_bars <- TRUE

names_list <-vector()
for (name in t$tip){
  v <- strsplit(name, "_")[[1]]
  if(display_all_node_bars){
    names_list = c(names_list, name)
  }
  else if(v[length(v)]=="0"){
    names_list = c(names_list, name)
  }
}
nidsnamez <- list()
nids <- vector()
pos <- 1
len_nl <- length(names_list)
countR <- 1
for(n in names_list){
  for(nn in names_list[pos:len_nl]){
    if(n != nn){
      m <- getMRCA(t,c(n, nn))
      nidsnamez[[countR]] <- paste(c(n, nn),collapse=" ")
      if(m %in% nids == FALSE){
        nids <- c(nids, m)
        countR <- countR + 1
      }
    }
  }
  pos <- pos+1
}
nidsnamez <- unlist(nidsnamez)

t <- ladderize(t)
II_mrca <- phytools::fastMRCA(t,"Pandanus tectorius (Wang B247)","Pandanus multispicatus (Callmander 91)")
I_mrca <- phytools::fastMRCA(t,"Pandanus kaida (Altafhusain 3)","Pandanus pugnax (Callmander 1035)")
IIa_mrca <- phytools::fastMRCA(t,"Pandanus tectorius (Wang B247)","Pandanus aquaticus (Dowe 290809B)")
Ia_mrca <- phytools::fastMRCA(t,"Pandanus kaida (Altafhusain 3)","Pandanus calcis (Callmander 1048)")
IIb_mrca <- phytools::fastMRCA(t,"Pandanus multispicatus (Callmander 91)","Pandanus pygmaeus (Rakotovao 5009)")
Ib_mrca <- phytools::fastMRCA(t,"Pandanus pugnax (Callmander 1035)","Pandanus zea (Dowe 201009D)")




geoinfo <- matrix(nrow=7,ncol=2)
row.names(geoinfo) <- c("Paleocene","Eocene","Oligocene","Miocene","Pliocene","Pleistocene","Holocene")
geoinfo[,1] <- c(66,56,33.9,23.3,5.333,2.58,0.0117)
geoinfo[,2] <- c(56,33.9,23.3,5.333,2.58,0.0117,0)
library(strap)
library(phytools)
data(strat2012)
{
  pdf("7_BEAST_Analysis/AnnotatedMedMCCTree.pdf", width = 30, height = 20)
  plot.phylo(t,x.lim=c(-25,75),show.tip.label = T,cex=1)
  #strap::geoscalePhylo(tree = t, units=c("Period", "Epoch", "Age"), boxes="Epoch", cex.tip=0.5, cex.age=0.7, cex.ts=0.7, label.offset=0, x.lim=c(-15,80), lwd=3, width=2)
  axisGeo(strat2012, tip.time = 0, unit = c("epoch","period","eon"),    gridty = 0, gridcol = "black",ages=F)
  #col = c("antiquewhite1","moccasin","khaki1","yellow","burlywood2","sandybrown","khaki1","gold","tan1")
  axisChrono(side = 3, unit = "Ma")
  lastPP <- get("last_plot.phylo", envir = .PlotPhyloEnv)
  for(nv in nids){
    bar_xx_a <- c(lastPP$xx[nv]+t$height[nv-num_taxa]-t$"height_95%_HPD_MIN"[nv-num_taxa],
                  lastPP$xx[nv]-(t$"height_95%_HPD_MAX"[nv-num_taxa]-t$height[nv-num_taxa]))
    lines(bar_xx_a, c(lastPP$yy[nv], lastPP$yy[nv]), col = rgb(0, 0, 1, alpha = 0.3), lwd = 5)
  }

  t$node.label <- t$posterior
  p <- character(length(t$node.label))
  p[t$node.label >= 0.95] <- "green"
  p[t$node.label < 0.95 & t$node.label >= 0.75] <- "yellow"
  p[t$node.label < 0.75] <- "red"
  nodelabels(pch = 21, cex = 1.5, bg = p)
  phytools::cladelabels(t,"Ia",Ia_mrca,offset=2,orientation="horizontal",cex=3)
  phytools::cladelabels(t,"Ib",Ib_mrca,offset=2,orientation="horizontal",cex=3)
  phytools::cladelabels(t,"IIa",IIa_mrca,offset=2,orientation="horizontal",cex=3)
  phytools::cladelabels(t,"IIb",IIb_mrca,offset=2,orientation="horizontal",cex=3)
  phytools::cladelabels(t,"I",I_mrca,offset=6,orientation="horizontal",cex=3)
  phytools::cladelabels(t,"II",II_mrca,offset=6,orientation="horizontal",cex=3)
  dev.off()
}

Abiotic analyses

To explore cladewise climatological and pedological differences among clades (I, II) and subclades (Ia, Ib, IIa, IIb) as well as between species with and without a hypertrophied adaxial hydrenchyma we used a random forest approach (Breiman, 2001). We chose to use random forests because of it’s robustness to collinearity (Belgiu and Drăguţ, 2016; Matsuki et al., 2016), which is present in the Bioclim and Soils data (see supplementary materials for correlation matrices) and because it is not prone to overfitting (Matsuki et al., 2016). Geolocations for taxonomically verified vouchered non-cultivated samples were extracted from the Pandanaceae Project (2024) website (Fig. 6a) and bioclimatic variables were extracted for each sample using its latitude and longitude coordinates in R with the measurements and raster packages from the Worldclim data set (Worldclim.org; Birk, 2019 and Hijmans, 2020, respectively). Important variables were identified by successive feature elimination and OOB error of iteratively grown random forests (whole range computed, 50,000 trees for the first iteration, 50,000 for each successive iteration) using the varSelRF package (Diaz-Uriarte, 2007; Diaz-Uriarte and Alvarez de Andres, 2006). The climatic data was then filtered to contain only the important variables and a new model was created using ranger (citation) to compute updated variable importance using the parameters used in the best model as identified by the successive feature elimination. To validate variable importance and assign p-values to each variable, Altmann’s corrected permutation feature importance (Altmann et al., 2010) with 100 permutaions was performed on the new model using ranger. Directionality and additional validation was provided by performing Mann-Whitney U tests (Mann and Whitney, 1947) on cladewise data for each of the important variables to test for a statistically significant difference in means between clades. The vioplot package (Adler et al., 2022) was then used to produce violin plots for each important variable. The same method as above was used for the pedological analysis, however it used the Digital Soil Map of the World (Food and Agriculture Organization of the United Nations, 2003) rather than bioclim data.

Climate analysis code

# brew install libgit2
CallmOcc <- read.csv("12_External_Data/ClimatPP.csv")
GoodOccz <- as.data.frame(matrix(nrow=0, ncol=4))
colnames(GoodOccz) <- c("Species","Lat","Long","Clade")
for(i in 1:nrow(CallmOcc))
{
  owo <- unlist(strsplit(CallmOcc[i,2], split = " "))
  message(sprintf("Processing No. %s of %s",i,nrow(CallmOcc)))
  if(T %in% grepl("°",owo))
  {
    message("......geolocation retrieved!")
    uwu <- as.data.frame(matrix(nrow=1, ncol=3))
    uwu[1,1] <- CallmOcc[i,1]
    uwu[1,4] <- CallmOcc[i,3]
    colnames(uwu) <- c("Species","Lat","Long","Clade")
    uwu[1,c(2,3)] <- gsub(",","",owo[grep("°",owo)])
    GoodOccz <- rbind(GoodOccz,uwu)
  }
}

# Add our collections (if not already there)
PandanData <- read.csv("3_Plastome_Reconstruction/MasterKeyPandanusChloroplast2022.csv")
PandanData <- PandanData[!PandanData[,1] == "Pandan0034",] #remove dup
Key <- read.csv("5_ML_Analysis/SubcladeKey.csv")
Geolocales <- as.data.frame(matrix(nrow=nrow(Key),ncol=22))
colnames(Geolocales) <- c("Species","Lat","Long")
Geolocales$Species <- Key$Species
gonez <- c("Collection_Info_Missing","No_Coordinates","Only_In_Cultivation","Living_Collection")
# Ignore warning

for(i in 1:nrow(Geolocales))
{
  message(sprintf("Processing No. %s of %s",i,nrow(Geolocales)))
  if(Geolocales[i,1] == "Pandanus_sermollianus")
  {
    coords <- PandanData[PandanData$Taxon == "Pandanus_sermolliana",which(colnames(PandanData) == "Coordinates")]

  }else if(Geolocales[i,1] == "Pandanus_purpurascens"){
    coords <- PandanData[PandanData$Taxon == "Pandanus_purpuracens",which(colnames(PandanData) == "Coordinates")]
  }else if(Geolocales[i,1] == "Pandanus_halmaherensis"){
    coords <- PandanData[PandanData$Taxon == "Pandanus_halmaheraensis",which(colnames(PandanData) == "Coordinates")]
  }else{
    coords <- PandanData[PandanData$Taxon == Geolocales[i,1],which(colnames(PandanData) == "Coordinates")]
  }
  if(coords %in% gonez)
  {
    if(Geolocales[i,1] == "Pandanus_purpurascens")
    {
      coords <- PandanData[PandanData$Taxon == "Pandanus_purpuracens",17]
    }else{
      coords <- PandanData[PandanData$Taxon == Geolocales[i,1],17]
    }
  }
  coords <- unlist(strsplit(coords,split = " "))
  Geolocales[i,2] <- coords[1]
  Geolocales[i,3] <- coords[2]
}
for(i in 1:nrow(Geolocales))
{
  tmp <- Key[Key$Species == Geolocales[i,1],1]
  if(tmp == "I"){tmp <- 1}
  if(tmp == "II"){tmp <- 2}
  Geolocales[i,4] <- tmp
}
Geolocales <- as.data.frame(Geolocales[,c(1,2,3,4)])
Geolocales <- Geolocales[!is.na(Geolocales[,3]),]
Geolocales_wilds <- Geolocales[!Geolocales$Species %in% c("Pandanus amaryllifolius","Pandanus kaida","Pandanus conoideus"),]
colnames(Geolocales) <- colnames(GoodOccz)
AllCoordz <- as.data.frame(rbind(GoodOccz,Geolocales))
AllCoordz <- unique(AllCoordz) # Remove duplicates (i.e. same specimen sent to different herbaria)
write.csv(AllCoordz,"8_Climate_Analysis/AllCoordz.csv",row.names=F)

AllCoordz <- read.csv("8_Climate_Analysis/AllCoordz.csv")

# PCA analysis
PCA_Food_Lat_Long <- data.frame(AllCoordz[,4],AllCoordz[,2],AllCoordz[,3])
WC_Nomen <- read.csv("8_Climate_Analysis/WC_Nomen.csv")
BC_Points <- PCA_Food_Lat_Long
BC_Points <- cbind(BC_Points,as.data.frame(matrix(nrow=nrow(BC_Points), ncol=length(WC_Nomen[,1]))))
colnames(BC_Points)[1:22] <- c("Species","ddlat","ddlong",WC_Nomen[,1])
# Ignore warning
for(i in 1:nrow(BC_Points))
{
  message(sprintf("Processing No. %s of %s",i,nrow(BC_Points)))
  BC_Points[i,2] <- gsub(",","",BC_Points[i,2])
  if(grepl("N",BC_Points[i,2]))
  {
    tmp <- gsub("°"," ",BC_Points[i,2])
    tmp <- gsub("'"," ",tmp)
    tmp <- gsub("\"","",tmp)
    tmp <- gsub("N","",tmp)
    BC_Points[i,2] <- tmp
  }else if(grepl("S",BC_Points[i,2])){
    tmp <- gsub("°"," ",BC_Points[i,2])
    tmp <- gsub("'"," ",tmp)
    tmp <- gsub("\"","",tmp)
    tmp <- gsub("S","",tmp)
    BC_Points[i,2] <- paste0("-",tmp)
  }
  BC_Points[i,3] <- gsub(",","",BC_Points[i,3])
  if(grepl("E",BC_Points[i,3])){
    tmp <- gsub("°"," ",BC_Points[i,3])
    tmp <- gsub("'"," ",tmp)
    tmp <- gsub("\"","",tmp)
    tmp <- gsub("E","",tmp)
    BC_Points[i,3] <- tmp
  }else if(grepl("W",BC_Points[i,3])){
    tmp <- gsub("°"," ",BC_Points[i,3])
    tmp <- gsub("'"," ",tmp)
    tmp <- gsub("\"","",tmp)
    tmp <- gsub("W","",tmp)
    BC_Points[i,3] <- paste0("-",tmp)
  }
  BC_Points[i,2] <- as.numeric(measurements::conv_unit(BC_Points[i,2],from = "deg_min_sec", to = "dec_deg"))
  BC_Points[i,3] <- as.numeric(measurements::conv_unit(BC_Points[i,3],from = "deg_min_sec", to = "dec_deg"))
}
UniNumz <- row.names(AllCoordz)
SppNamez <- AllCoordz[!is.na(BC_Points$ddlat),]
UniNumz <- UniNumz[!is.na(BC_Points$ddlat)]
BC_Points <- BC_Points[!is.na(BC_Points$ddlat),]

BC_Points[,2] <- as.numeric(BC_Points[,2])
BC_Points[,3] <- as.numeric(BC_Points[,3])

# bc converted point was over water so movedit directly inland a few meters
# BC_Points[BC_Points$Species == "Pandanus_leram",2:3] <- c(7.0104, 93.934)

for(i in 1:19)
{
    print(sprintf("Processing BioClim Variable No. %s",i))
    thisraster <- raster::raster(sprintf("12_External_Data/wc2.1_30s_bio/wc2.1_30s_bio_%s.tif",i))
     BC_Points[,i+3] <- raster::extract(thisraster,data.frame(BC_Points$ddlong,BC_Points$ddlat))
 }

write.csv(BC_Points,"8_Climate_Analysis/BC_Points.csv",row.names=F)
write.csv(SppNamez,"8_Climate_Analysis/BC_Points_Spp.csv",row.names=F)
write.csv(UniNumz,"8_Climate_Analysis/BC_Points_UniNumz.csv",row.names=F)

message("Preparing extracted BioClim data")
  BC_Points_for_PCA <- BC_Points[,-2]
  BC_Points_for_PCA <- BC_Points_for_PCA[,-2]
  Nomen <- BC_Points_for_PCA$Species
  BC_Points_for_PCA <- BC_Points_for_PCA[,-1]
  message("Removing rows with an NA in any of the 19 variables")
  togo <- c()
  pb <- txtProgressBar(min = 1, max = nrow(BC_Points_for_PCA), style = 3)
  for(i in 1:nrow(BC_Points_for_PCA))
  {
    setTxtProgressBar(pb, i)
    if(T %in% is.na(BC_Points_for_PCA[i,]))
    {
      togo <- c(togo,i)
    }
  }
  require(factoextra)
  library("FactoMineR")
  Nomen <- Nomen[-togo]
  BC_Points_for_PCA_Clean <- BC_Points_for_PCA[-togo,]

##################################################
# Random Forest Feature Selection
# With Corrected Feature Importance (p-value)
##################################################

# Prepare data
BC_Points <- read.csv("8_Climate_Analysis/BC_Points.csv")
SppNamez<- read.csv("8_Climate_Analysis/BC_Points_Spp.csv")
message("Preparing extracted BioClim data")
  BC_Points_for_PCA <- BC_Points[,-2]
  BC_Points_for_PCA <- BC_Points_for_PCA[,-2]
  Nomen <- BC_Points_for_PCA$Species
  BC_Points_for_PCA <- BC_Points_for_PCA[,-1]
  message("Removing rows with an NA in any of the 19 variables")
  togo <- c()
  pb <- txtProgressBar(min = 1, max = nrow(BC_Points_for_PCA), style = 3)
  for(i in 1:nrow(BC_Points_for_PCA))
  {
    setTxtProgressBar(pb, i)
    if(T %in% is.na(BC_Points_for_PCA[i,]))
    {
      togo <- c(togo,i)
    }
  }
  Nomen <- Nomen[-togo]
  BC_Points_for_PCA_Clean <- BC_Points_for_PCA[-togo,]
  SppNamezClean <- SppNamez[-togo,]
  write.csv(BC_Points_for_PCA_Clean,"8_Climate_Analysis/BC_Points_for_PCA_Clean.csv",row.names=F)
  write.csv(SppNamezClean,"8_Climate_Analysis/SppNamezClean.csv",row.names=F)

RandomForestAnalyzeR <- function(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)
{
  set.seed(123)
  message("Preparing data")
  Key <- read.csv(Keypath)
  if("I" %in% c(comp1,comp2))
  {
    comp1_Spp <- unique(Key[Key$Clade == comp1,1])
    comp2_Spp <- unique(Key[Key$Clade == comp2,1])
  }else if("Ia" %in% c(comp1,comp2)){
    comp1_Spp <- unique(Key[Key$Subclade == comp1,1])
    comp2_Spp <- unique(Key[Key$Subclade == comp2,1])
  }else{
    comp1_Spp <- unique(Key[Key[,2] == comp1,1])
    comp2_Spp <- unique(Key[Key[,2] == comp2,1])
  }
  BC_Points_for_PCA_Clean <- Data4Comp
  comp1_Spp_numz <- which(SppNames[,1] %in% comp1_Spp)
  comp2_Spp_numz <- which(SppNames[,1] %in% comp2_Spp)
  BC_Points_Clean_comp1 <- BC_Points_for_PCA_Clean[comp1_Spp_numz,]
  BC_Points_Clean_comp2 <- BC_Points_for_PCA_Clean[comp2_Spp_numz,]

  BC_Points_Clean_comps12 <- cbind(c(rep(1,nrow(BC_Points_Clean_comp1)),rep(2,nrow(BC_Points_Clean_comp2))),rbind(BC_Points_Clean_comp1,BC_Points_Clean_comp2))
  colnames(BC_Points_Clean_comps12)[1] <- "Subclade"
  n_features <- length(colnames(BC_Points_Clean_comps12))-1
  message("Selecting variables by successive feature elimination and OOB error")
  message("...using 50,000 trees at each iteration")
  varselrf_pandan <- varSelRF::varSelRF(BC_Points_Clean_comps12[,-1],Class = as.factor(BC_Points_Clean_comps12$Subclade),recompute.var.imp=F,whole.range=T,ntree=50000,ntreeIterat=50000, keep.forest = TRUE,verbose=TRUE,vars.drop.num=1,vars.drop.frac=NULL)
  colnames(BC_Points_Clean_comps12)[1] <- "Subclade"
  SelectedVars <- varselrf_pandan$selected.vars
  sh <- varselrf_pandan$selec.history
  message(sprintf("%s features were found as important",length(SelectedVars)))
  print(SelectedVars)
  NewRF <- ranger::ranger(as.formula(sprintf("Subclade ~ %s",paste(SelectedVars, collapse = " + "))), data = BC_Points_Clean_comps12,
             num.trees = varselrf_pandan$ntreeIterat,importance = "permutation")
  pz <- ranger::importance_pvalues(NewRF,method="altmann",num.permutations = 100,formula=as.formula(sprintf("Subclade ~ %s",paste(SelectedVars, collapse = " + "))),data = BC_Points_Clean_comps12)
  PandanForest_P_p_df <- as.data.frame(pz)



  PandanForest_P_p_df[,c(2,3)] <- PandanForest_P_p_df[,c(1,2)]
  PandanForest_P_p_df[,1] <- row.names(PandanForest_P_p_df)
  row.names(PandanForest_P_p_df) <- NULL
  colnames(PandanForest_P_p_df) <- c("Variable","Importance","p-Value")
  message("Writing variable importance and p-value table")
  write.csv(PandanForest_P_p_df,sprintf("%s/RandomFOrest_P_%s_%s_VarImp.csv",folder,comp1,comp2),row.names = F)
  SigVals <- PandanForest_P_p_df[PandanForest_P_p_df[,3] < 0.05,1]
  if(length(SigVals) > 0)
  {
    message("Conducting Variablewise Wilcoxan Rank-Sum Tests")
    ThisData <- BC_Points_Clean_comps12
    CladewiseCompPvals <- as.data.frame(matrix(nrow=length(SigVals),ncol=3))
    CladewiseCompPvals[,1] <- SigVals
    colnames(CladewiseCompPvals) <- c("Variable","p-value","W")
    for(i in 1:length(SigVals))
    {
      OneVar <- ThisData[ThisData[,1] == 1,colnames(ThisData) == SigVals[i]]
      TwoVar <- ThisData[ThisData[,1] == 2,colnames(ThisData) == SigVals[i]]
      Wilcox1 <- wilcox.test(as.numeric(OneVar),as.numeric(TwoVar))
      CladewiseCompPvals[i,2] <- Wilcox1[[3]]
      CladewiseCompPvals[i,3] <- Wilcox1[[1]]
    }
    write.csv(CladewiseCompPvals,sprintf("%s/CladewiseWilcox_Sig_%s%s%s.csv",folder,comp1,comp2,analtype),row.names=F)
    message("Creating Variablewise Violin Plots")
    if("Ia" %in% c(comp1,comp2))
    {
      uwucolorz <- c("brown2","brown")
    }else if("IIa" %in% c(comp1,comp2)){
      uwucolorz <- c("cyan2","cyan4")
    }else{
      uwucolorz <- c("brown1","cyan")
    }
    library(vioplot)
    for(i in 1:length(SigVals))
    {
      OneVar <- ThisData[ThisData[,1] == 1,colnames(ThisData) == SigVals[i]]
      TwoVar <- ThisData[ThisData[,1] == 2,colnames(ThisData) == SigVals[i]]
      {
        pdf(sprintf("%s/ViolinPlot_Sig_%s%s%s_%s.pdf",folder,comp1,comp2,analtype,SigVals[i]))
        vioplot(OneVar, side = "left", plotCentre = "line", ylim=c(.8*min(OneVar,TwoVar),max(OneVar,TwoVar)+.2*max(OneVar,TwoVar)),col = uwucolorz[1], xaxt= "n",ylab=SigVals[i])# = TRUE)
        vioplot(TwoVar, side = "right", plotCentre = "line", col = uwucolorz[2], add = TRUE )
        legend("topleft", legend = c(comp1, comp2), fill = uwucolorz,cex=2)
        dev.off()
      }
    }
  }else{
    message("NO SIGNIFICANT PREDICTORS")
  }
}


  # I v II
  Keypath <- "0_Raw_Data/Key.csv"
  comp1 <- "I"
  comp2 <- "II"
  analtype <- "BioClim"
  folder <- "8_Climate_Analysis"
  Data4Comp <- BC_Points_for_PCA_Clean
  SppNames <- SppNamezClean
  RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)


  # Ia v Ib
  Keypath <- "0_Raw_Data/Key.csv"
  comp1 <- "Ia"
  comp2 <- "Ib"
  analtype <- "BioClim"
  folder <- "8_Climate_Analysis"
  Data4Comp <- BC_Points_for_PCA_Clean
  SppNames <- SppNamezClean
  RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)

  # IIa v IIb
  Keypath <- "0_Raw_Data/Key.csv"
  comp1 <- "IIa"
  comp2 <- "IIb"
  analtype <- "BioClim"
  folder <- "8_Climate_Analysis"
  Data4Comp <- BC_Points_for_PCA_Clean
  SppNames <- SppNamezClean
  RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)


  C_filez <- list.files("/Users/Mike/BSU_DRIVE/PlastomeAnalysis2024/8_Climate_Analysis")
  VI <- C_filez[grepl("VarImp.csv",C_filez)]
  MWU <- c("CladewiseWilcox_Sig_IIIBioClim.csv","CladewiseWilcox_Sig_IaIbBioClim.csv" ,"CladewiseWilcox_Sig_IIaIIbBioClim.csv")
  cladez <- c("I_II","Ia_Ib","IIa_IIb")
  for(i in 1:length(VI))
  {
    uwu <- read.csv(sprintf("8_Climate_Analysis/%s",VI[i]))
    owo <- read.csv(sprintf("8_Climate_Analysis/%s",MWU[i]))
    alluwusmol <- cbind(uwu,owo)
    alluwusmol <- cbind(rep(cladez[i],nrow(alluwusmol)),alluwusmol)
    if(i == 1)
    {
      alluwu <- alluwusmol
    }else{
      alluwu <- rbind(alluwu,alluwusmol)
    }
  }
  write.csv(alluwu, "8_Climate_Analysis/AllResultsTable.csv",row.names=F)

Soil analysis code

AllCoordz <- read.csv("8_Climate_Analysis/AllCoordz.csv")
DSMW <- "12_External_Data/DSMW/DSMW.shp"
# PCA analysis
PCA_Food_Lat_Long <- data.frame(AllCoordz[,4],AllCoordz[,2],AllCoordz[,3])
BC_Points <- PCA_Food_Lat_Long
BC_Points <- cbind(BC_Points,as.data.frame(matrix(nrow=nrow(BC_Points), ncol=3)))
colnames(BC_Points)[1:3] <- c("Species","ddlat","ddlong")
# Ignore warning
for(i in 1:nrow(BC_Points))
{
  message(sprintf("Processing No. %s of %s",i,nrow(BC_Points)))
  BC_Points[i,2] <- gsub(",","",BC_Points[i,2])
  if(grepl("N",BC_Points[i,2]))
  {
    tmp <- gsub("°"," ",BC_Points[i,2])
    tmp <- gsub("'"," ",tmp)
    tmp <- gsub("\"","",tmp)
    tmp <- gsub("N","",tmp)
    BC_Points[i,2] <- tmp
  }else if(grepl("S",BC_Points[i,2])){
    tmp <- gsub("°"," ",BC_Points[i,2])
    tmp <- gsub("'"," ",tmp)
    tmp <- gsub("\"","",tmp)
    tmp <- gsub("S","",tmp)
    BC_Points[i,2] <- paste0("-",tmp)
  }
  BC_Points[i,3] <- gsub(",","",BC_Points[i,3])
  if(grepl("E",BC_Points[i,3])){
    tmp <- gsub("°"," ",BC_Points[i,3])
    tmp <- gsub("'"," ",tmp)
    tmp <- gsub("\"","",tmp)
    tmp <- gsub("E","",tmp)
    BC_Points[i,3] <- tmp
  }else if(grepl("W",BC_Points[i,3])){
    tmp <- gsub("°"," ",BC_Points[i,3])
    tmp <- gsub("'"," ",tmp)
    tmp <- gsub("\"","",tmp)
    tmp <- gsub("W","",tmp)
    BC_Points[i,3] <- paste0("-",tmp)
  }
  BC_Points[i,2] <- as.numeric(measurements::conv_unit(BC_Points[i,2],from = "deg_min_sec", to = "dec_deg"))
  BC_Points[i,3] <- as.numeric(measurements::conv_unit(BC_Points[i,3],from = "deg_min_sec", to = "dec_deg"))
}
ptz <- paste(BC_Points[,2],BC_Points[,3])
ptz <- data.frame(BC_Points[,1],ptz)
UniNumz <- seq(1,nrow(AllCoordz),1)
SppNamez <- AllCoordz[-grep("NA",ptz[,2]),]
SL_Points <- BC_Points[-grep("NA",ptz[,2]),]
UniNumz <- UniNumz[-grep("NA",ptz[,2])]
which(ptz[,2] == "NA NA")
removedd <- grep("NA",ptz[,2])
ptz <- ptz[-grep("NA",ptz[,2]),]
library(SoilsSeeR)
Soilz <- SoilsInfoGrabbeR(DSMW,ptz)
write.csv(Soilz,"9_Soils_Analysis/AllCoordz.csv",row.names=F)
Soilz <- read.csv("9_Soils_Analysis/AllCoordz.csv")
length(which(Soilz$FAOSOIL == "No_Soil_Data"))/nrow(Soilz)*100 #7.06% have no data

median(table(SppNamez[-which(Soilz$FAOSOIL == "No_Soil_Data"),1])) # on average each species has 1 observations

Soilz_goods <- Soilz[!Soilz$FAOSOIL == "No_Soil_Data",]
SppNamez <- SppNamez[!Soilz$FAOSOIL == "No_Soil_Data",]
UniNumz <- UniNumz[!Soilz$FAOSOIL == "No_Soil_Data"]

FAO_sol <- Soilz_goods$FAOSOIL
FAO_sol_info <- as.data.frame(matrix(nrow=length(FAO_sol),ncol=2))
for(i in 1:length(FAO_sol))
{
  if(grepl("-",FAO_sol[i]))
  {
    Nomen <- gsub("[0-9]","",gsub("-.*","",FAO_sol[i]))
    Tex <-  gsub("[^0-9]","",gsub(".*-","",FAO_sol[i]))
  }else{
    Nomen <- gsub("[0-9]","",FAO_sol[i])
    Tex <- NA
  }
  FAO_sol_info[i,1] <- Nomen
  FAO_sol_info[i,2] <- Tex
}
AllInfoz <- readxl::read_excel("12_External_Data/DSMW/Generalized_SU_Info.xls")
AllInfoz <- as.data.frame(AllInfoz) # Destroy tibble heresy
SoilInfoz <- AllInfoz[0,]
CladeInfo <- list()
tickeR <- 0
RepSppNamez <- c()
RepUniNumz <- c()
SppNamez <- SppNamez[,1]
for(i in 1:nrow(FAO_sol_info))
{
  if(is.na(FAO_sol_info[i,2]) || FAO_sol_info[i,2] == "")
  {
    these <- AllInfoz[AllInfoz[,1] == toupper(FAO_sol_info[i,1]),]
    SoilInfoz <- rbind(these,SoilInfoz)
    #print(sprintf("%s %s",i,nrow(these)))
    tickeR <- tickeR + 1
    CladeInfo[[tickeR]] <- Soilz_goods[i,1]
    RepSppNamez <- c(RepSppNamez,SppNamez[i])
    RepUniNumz <- c(RepUniNumz,UniNumz[i])
  }else{
    if(as.numeric(FAO_sol_info[i,2]) < 3)
    {
      these <- AllInfoz[AllInfoz[,1] == sprintf("%s %s",toupper(FAO_sol_info[i,1]),FAO_sol_info[i,2]),]
      SoilInfoz <- rbind(SoilInfoz,these)
      tickeR <- tickeR + 1
      CladeInfo[[tickeR]] <- Soilz_goods[i,1]
      RepSppNamez <- c(RepSppNamez,SppNamez[i])
      RepUniNumz <- c(RepUniNumz,UniNumz[i])
      #print(sprintf("%s %s",i,nrow(these)))
    }else{
      #print(sprintf("%s %s %s",i,nrow(these),FAO_sol_info[i,2]))
      uwu <- unlist(strsplit(FAO_sol_info[i,2], split=""))
      for(j in 1:length(uwu))
      {
        these <- AllInfoz[AllInfoz[,1] == sprintf("%s %s",toupper(FAO_sol_info[i,1]),uwu[j]),]
        SoilInfoz <- rbind(SoilInfoz,these)
        tickeR <- tickeR + 1
        CladeInfo[[tickeR]] <- Soilz_goods[i,1]
        RepSppNamez <- c(RepSppNamez,SppNamez[i])
        RepUniNumz <- c(RepUniNumz,UniNumz[i])
      }
    }
  }
}
write.csv(unlist(CladeInfo),"9_Soils_Analysis/CladeInfo.csv",row.names=F)
write.csv(SoilInfoz,"9_Soils_Analysis/SoilInfoz.csv",row.names=F)
write.csv(RepSppNamez,"9_Soils_Analysis/SoilInfoz_Spp.csv",row.names=F)
write.csv(RepUniNumz,"9_Soils_Analysis/SoilInfoz_UniNumz.csv",row.names=F)


RepSppNamez <- as.data.frame(read.csv("9_Soils_Analysis/SoilInfoz_Spp.csv"))
SoilInfoz <- read.csv("9_Soils_Analysis/SoilInfoz.csv")
SoilInfoz_Points_for_PCA_Clean <- SoilInfoz[,-1]

source("1_Functions/RandomForestAnalyzeR.R")
# I v II
  Keypath <- "0_Raw_Data/Key.csv"
  comp1 <- "I"
  comp2 <- "II"
  analtype <- "Soils"
  folder <- "9_Soils_Analysis"
  Data4Comp <- SoilInfoz_Points_for_PCA_Clean
  SppNames <- RepSppNamez
  RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)


# Ia v Ib
  Keypath <- "0_Raw_Data/Key.csv"
  comp1 <- "Ia"
  comp2 <- "Ib"
  analtype <- "Soils"
  folder <- "9_Soils_Analysis"
  Data4Comp <- SoilInfoz_Points_for_PCA_Clean
  SppNames <- RepSppNamez
  RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)

  # IIa v IIb
  Keypath <- "0_Raw_Data/Key.csv"
  comp1 <- "IIa"
  comp2 <- "IIb"
  analtype <- "Soils"
  folder <- "9_Soils_Analysis"
  Data4Comp <- SoilInfoz_Points_for_PCA_Clean
  SppNames <- RepSppNamez
  RandomForestAnalyzeR(Keypath, analtype, folder, comp1, comp2, Data4Comp, SppNames)


  C_filez <- list.files("9_Soils_Analysis")
  VI <- C_filez[grepl("VarImp.csv",C_filez)]
  MWU <- c("CladewiseWilcox_Sig_IIISoils.csv","CladewiseWilcox_Sig_IaIbSoils.csv" ,"CladewiseWilcox_Sig_IIaIIbSoils.csv")
  cladez <- c("I_II","Ia_Ib","IIa_IIb")
  for(i in 1:length(VI))
  {
    uwu <- read.csv(sprintf("9_Soils_Analysis/%s",VI[i]))
    owo <- read.csv(sprintf("9_Soils_Analysis/%s",MWU[i]))
    alluwusmol <- cbind(uwu,owo)
    alluwusmol <- cbind(rep(cladez[i],nrow(alluwusmol)),alluwusmol)
    if(i == 1)
    {
      alluwu <- alluwusmol
    }else{
      alluwu <- rbind(alluwu,alluwusmol)
    }
  }
  write.csv(alluwu, "9_Soils_Analysis/AllResultsTable.csv",row.names=F)

Morphologic analyses

To investigate morphological traits that we hypothesize could be of taxonomical or evolutionary importance, we scored six pistillate traits (spadix: solitary or spicate; carpels: connate or free; endocarp: basal, median, or submedian; supramedian; upper mesocarp: air-filled, fibrous, or solid; lower mesocarp: air-filled, fibrous, or fibrous-fleshy; stigma: acute, auricular, bilobate, circle, cordate, cristiform plate, deltoid, filiform, lanceolate, lobed sunburst, lunate, obovate, reniform, spinescent, sublinear, suborbicular, wider than long), three staminate traits (staminal arrangement: infundibuliform column, basally fused, single, peltate umbrellaform, column, and racinous; filament insertion: candelabra, column rim, edge subumbellately, fasciculate column, peltate umbrellaform, racinous, single, subumbellate, under apical disk, under apical disk margin; anther shape: deltoid, elongate, oblong, weakly deltoid) and one sterile trait (hydrenchyma: normal, i.e., 0-1 layers of hydrenchyma cells) and hypertrophied, i.e., 2+ layers of hydrenchyma cells). These traits were chosen because they were identified as being the most promising based on a thorough review of past Pandanus taxonomic works. We used stochastic character mapping (Huelsenbeck et al., 2003; Bollback, 2006) to reconstruct ancestral states for clades I and II and subclades Ia, Ib, IIa, and IIb. Individuals with multiple traits observed were excluded from the analysis because the relative proportions of phenotypes in the species populations is unknown, the taxonomy of the species used to compile morphological data may not be verifiable, and species-level taxonomical investigation is out of the purvue of this study. The R-package geiger (Pennell et a., 2014) was used to decide which continuous time Markov model of trait evolution best fit the data (equal rates, all rates different, and symmetric backward and forward; Anderson, 1991) based on lowest AICc (Akaike, 1974; Takeuchi, 1976). If more than one model had the lowest AICc, then one of them was picked at random. Stochastic character mapping was then done via the R-package phytools (Revell, 2024): the root prior was set to that of Fitzjohn et al. (2009), Bayesian MCMC (Metropolis et al., 1953) was used to calculate posterior probabilities of the transition matrix Q (with 20,000 generations sampling every 100 generations). Posterior probabilities of each node being in each state was extracted for the most recent common ancestors of six clades (I, II, Ia, Ib, IIa, and IIb), and a majority rule was used to determine if the ancestral state of a node could be estimated.

##########################################
#     Morphological Analysis
##########################################

AncestralCharacterizeR <- function(CharData,ThisTree,KeyPath,AnalType)
{
  # Get data ready
  library(phytools)
  # This function uses Bayesian MCMC to sample from the posterior distribution for the states at internal nodes in the tree.
  TipData <- read.csv(CharData)
  BeastTree <- treeio::read.beast(ThisTree)
  BeastTreePhylo <- treeio::as.phylo(BeastTree)
  rm(BeastTree)
  # Remove outgroup
  Key <- read.csv(KeyPath)
  # Make tip labels
  BT_names <- BeastTreePhylo$tip.label
  NewNames <- rep(NA,length(BT_names))
  nonPandanus <- c("OUTGROUP_Carludovica_palmata_voucher_K_Chase",
                  "Pandan0062-Sararanga_philippinensis-Chloroplast-Genome",
                  "Pandan0007-Freycinetia_caudata-Chloroplast-Genome",
                  "Pandan0003-Benstonea_inquilina-Chloroplast-Genome",
                  "Pandan0005-Benstonea_serpentinica-Chloroplast-Genome",
                  "Pandan0001-Benstonea_brevistyla-Chloroplast-Genome",
                  "Pandan0006-Benstonea_tunicata-Chloroplast-Genome",
                  "Pandan0002-Benstonea_epiphytica-Chloroplast-Genome",
                  "Pandan0004-Benstonea_pachyphylla-Chloroplast-Genome")
  for(i in 1:length(BT_names))
  {
    uwu <- BT_names[i]
    if(uwu %in% c(nonPandanus,"gi|1693290702|ref|NC_042747.1|"))
    {
      if(grepl("OUTGROUP",uwu))
      {
         NewNames[i] <- "Carludovica palmata (Chase 14836)"
      }else if(uwu == "gi|1693290702|ref|NC_042747.1|"){
         NewNames[i] <- "Pandanus tectorius (Wang B247)"
      }else{
        PNum <- strsplit(uwu, split="-")[[1]][1]
        owo <- Key[Key$UniqueID == PNum,]
        NewNames[i] <- sprintf("%s (%s %s)",owo$Taxon,owo$Collector,owo$Coll...)
      }
    }else{
      PNum <- strsplit(uwu, split="-")[[1]][1]
      owo <- Key[Key$UniqueID == PNum,]
      NewNames[i] <- sprintf("%s (%s %s)",owo$Taxon,owo$Collector,owo$Coll...)
    }
  }
  BeastTreePhylo$tip.label <- NewNames

  olds<- NewNames
  news <- rep(NA,length(olds))
  for(i in 1:length(olds))
  {
    news[i] <- paste(unlist(strsplit(olds[i], split = " "))[c(1,2)], collapse="_")
  }

  for(i in 2:ncol(TipData))
  {
    message(sprintf("Processing Variable No. %s of %s",i-1,ncol(TipData)-1))
    TipzVar <- rep(NA,length(NewNames))
    for(j in 1:length(NewNames))
      {
      if(grepl("Pandanus",NewNames[j]))
      {
        thisspp <- NewNames[j]
        if(length(TipData[grepl(paste(unlist(strsplit(thisspp, split = " "))[c(1,2)], collapse="_"),TipData$Species),i]) == 0)
        {
          TipzVar[j] <- NA
        }else{
          TipzVar[j] <- TipData[grepl(paste(unlist(strsplit(thisspp, split = " "))[c(1,2)], collapse="_"),TipData$Species),i]
        }
      }else{
        TipzVar[j] <- NA
      }
    }
    TipzVar<-setNames(TipzVar,NewNames)
    TipzVar <- TipzVar[!is.na(TipzVar)]
    TipzVar <- TipzVar[!TipzVar == ""]
    BeastTreePhylo_smol <- ape::drop.tip(BeastTreePhylo,BeastTreePhylo$tip.label[!BeastTreePhylo$tip.label %in% names(TipzVar)])
    tree <- BeastTreePhylo_smol
    x <- TipzVar
    rainboww <- c("darkblue","gold","orange","green","violet","blue","red",
  "yellow","turquoise","maroon","beige","darkgreen",
  "gray","pink","brown4","purple",
  "coral","cyan","black","white")
    cols<-setNames(rainboww[1:length(unique(x))],sort(unique(x)))
    #blackz <- rep("black",length(unique(x)))
    #stochastic character mapping (Huelsenbeck et al., 2003)
    message("Fitting models")
    library(geiger)
    ERmod <- fitDiscrete(tree, x,
                model = "ER",niter = 100, FAIL = 1e+200,ncores=8)
    ARDmod <- fitDiscrete(tree, x,
                model = "ARD",niter = 100, FAIL = 1e+200,ncores=8)
    SYMmod <- fitDiscrete(tree, x,
                model = "SYM",niter = 100, FAIL = 1e+200,ncores=8)
    AICcResultz <- c(ERmod$opt$aicc,ARDmod$opt$aicc,SYMmod$opt$aicc)
    AICcResultzNamez <- c("ER","ARD","SYM")
    BestModel <- AICcResultzNamez[AICcResultz == min(AICcResultz)]
    BestModel <- sample(BestModel,1)
    write.csv(BestModel,sprintf("15_Morphological_Analysis/MorphologicalAnalysis_%s_%s_BestModel.csv",AnalType,colnames(TipData)[i]),row.names=F)
    message(sprintf("Best model is: %s",BestModel))
    message("Making SIMMAP")
    mtrees<-make.simmap(tree,x,model=BestModel,nsim=200,Q="mcmc",pi="fitzjohn")
    pd<-describe.simmap(mtrees,plot=F)
    # Extract stochastic character mapping results for nodes of interest and place in table
    message("...Making cladewise table")
    noderatz <- as.data.frame(pd$ace)
    noderatz <- cbind(row.names(noderatz),noderatz)
    row.names(noderatz) <- NULL
    colnames(noderatz)[1] <- "Node"
    tipzz <- tree$tip.label
    for(k in 1:length(tipzz))
    {
      tipzz[k] <-  paste(unlist(strsplit(tipzz[k], split = " "))[c(1,2)], collapse=" ")
    }
    TipsKey <- Key[Key$Taxon %in% tipzz,]
    CladeISp1 <- TipsKey[TipsKey$Clade == "I",1][1]
    CladeIISp1 <- TipsKey[TipsKey$Clade == "II",1][1]
    tipzzraw <- tree$tip.label
    CladeISp1 <- tipzzraw[grepl(CladeISp1,tipzzraw)]
    CladeIISp1 <- tipzzraw[grepl(CladeIISp1,tipzzraw)]
    CladeIaSp1 <- TipsKey[TipsKey$Subclade == "Ia",1][1]
    CladeIbSp1 <- TipsKey[TipsKey$Subclade == "Ib",1][1]
    CladeIaSp1 <- tipzzraw[grepl(CladeIaSp1,tipzzraw)]
    CladeIbSp1 <- tipzzraw[grepl(CladeIbSp1,tipzzraw)]
    CladeIIaSp1 <- TipsKey[TipsKey$Subclade == "IIa",1][1]
    CladeIIbSp1 <- TipsKey[TipsKey$Subclade == "IIb",1][1]
    CladeIIaSp1 <- tipzzraw[grepl(CladeIIaSp1,tipzzraw)]
    CladeIIbSp1 <- tipzzraw[grepl(CladeIIbSp1,tipzzraw)]
    AllCladeIaSpp <- TipsKey[TipsKey$Subclade == "Ia",1]
    treesppIa <- c()
    for(k in 1:length(AllCladeIaSpp))
    {
      treesppIa <- unique(c(treesppIa,tipzzraw[grepl(AllCladeIaSpp[k],tipzzraw)]))
    }
    AllCladeIbSpp <- TipsKey[TipsKey$Subclade == "Ib",1]
    treesppIb <- c()
    for(k in 1:length(AllCladeIbSpp))
    {
      treesppIb <- unique(c(treesppIb,tipzzraw[grepl(AllCladeIbSpp[k],tipzzraw)]))
    }
    AllCladeIIaSpp <- TipsKey[TipsKey$Subclade == "IIa",1]
    treesppIIa <- c()
    for(k in 1:length(AllCladeIIaSpp))
    {
      treesppIIa <- unique(c(treesppIIa,tipzzraw[grepl(AllCladeIIaSpp[k],tipzzraw)]))
    }
    AllCladeIIbSpp <- TipsKey[TipsKey$Subclade == "IIb",1]
    treesppIIb <- c()
    for(k in 1:length(AllCladeIIbSpp))
    {
      treesppIIb <- unique(c(treesppIIb,tipzzraw[grepl(AllCladeIIbSpp[k],tipzzraw)]))
    }
    MRCA_Pandanus <- ape::getMRCA(tree,c(CladeISp1,CladeIISp1))
    MRCA_CladeI <- ape::getMRCA(tree,c(CladeIaSp1,CladeIbSp1))
    MRCA_CladeII <- ape::getMRCA(tree,c(CladeIIaSp1,CladeIIbSp1))
    MRCA_CladeIa <- ape::getMRCA(tree,treesppIa)
    MRCA_CladeIb <- ape::getMRCA(tree,treesppIb)
    MRCA_CladeIIa <- ape::getMRCA(tree,treesppIIa)
    MRCA_CladeIIb <- ape::getMRCA(tree,treesppIIb)
    CladewiseOutputTable <- as.data.frame(matrix(nrow=7,ncol=length(colnames(noderatz))+1))
    colnames(CladewiseOutputTable) <- c("MRCA",colnames(noderatz))
    CladewiseOutputTable$MRCA <- c("MRCA_Pandanus","MRCA_CladeI","MRCA_CladeII","MRCA_CladeIa","MRCA_CladeIb","MRCA_CladeIIa","MRCA_CladeIIb")
    CladewiseOutputTable[1,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_Pandanus,]
    CladewiseOutputTable[2,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeI,]
    CladewiseOutputTable[3,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeII,]
    CladewiseOutputTable[4,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeIa,]
    CladewiseOutputTable[5,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeIb,]
    CladewiseOutputTable[6,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeIIa,]
    CladewiseOutputTable[7,seq(2,ncol(noderatz)+1,1)] <- noderatz[noderatz$Node == MRCA_CladeIIb,]
    write.csv(CladewiseOutputTable,sprintf("15_Morphological_Analysis/MorphologicalAnalysis_%s_%s_Table.csv",AnalType,colnames(TipData)[i]),row.names=F)
    murr <- pd$ace
    murr[row.names(murr) == as.character(MRCA_Pandanus),] <- NA
    library(phyloch)
    data(strat2012)
    message("Making plot")
    pdf(sprintf("15_Morphological_Analysis/MCMC_MorphologicalAnalysis_%s_%s.pdf",AnalType,colnames(TipData)[i]),height=25,width=40)
    par(mar=c(0,0,0,-5)+5)
    plot.phylo(tree,show.tip.label = T,cex = 2,label.offset=0.666)
    axisGeo(strat2012, tip.time = 0, unit = c("stage","epoch","period","eon"),  gridty = 0, gridcol = "black",ages=F,cex=2)
    axisChrono(side = 3, unit = "Ma",fact = 1,cex=1.666)
    nodelabels(pie=murr,piecol=cols,cex = 0.33)
    phytools::cladelabels(tree,"Ia",MRCA_CladeIa,offset=5,orientation="horizontal",cex=3)
    phytools::cladelabels(tree,"Ib",MRCA_CladeIb,offset=5,orientation="horizontal",cex=3)
    phytools::cladelabels(tree,"IIa",MRCA_CladeIIa,offset=5,orientation="horizontal",cex=3)
    phytools::cladelabels(tree,"IIb",MRCA_CladeIIb,offset=5,orientation="horizontal",cex=3)
    phytools::cladelabels(tree,"I",MRCA_CladeI,offset=8,orientation="horizontal",cex=3)
    phytools::cladelabels(tree,"II",MRCA_CladeII,offset=8,orientation="horizontal",cex=3)
    tiplabels(pie=to.matrix(x,sort(unique(x))),piecol = cols,cex = 0.33)
    legend("topleft",names(cols),fill=cols,cex = 2.22,bty="n")
    dev.off()
  }
}

CharData <- "15_Morphological_Analysis/StaminateCurated.csv"
ThisTree <- "7_BEAST_Analysis/Trees/med.tree"
KeyPath <- "0_Raw_Data/Key.csv"
AnalType <- "Staminate"
AncestralCharacterizeR(CharData,ThisTree,KeyPath,AnalType)

source("1_Functions/AncestralCharacterizeR.R")
CharData <- "15_Morphological_Analysis/WDAST_Pandan.csv"
ThisTree <- "7_BEAST_Analysis/Trees/med.tree"
KeyPath <- "0_Raw_Data/Key.csv"
AnalType <- "WDAST"
AncestralCharacterizeR(CharData,ThisTree,KeyPath,AnalType)


source("1_Functions/AncestralCharacterizeR.R")
CharData <- "15_Morphological_Analysis/PistillateCurated.csv"
ThisTree <- "7_BEAST_Analysis/Trees/med.tree"
KeyPath <- "0_Raw_Data/Key.csv"
AnalType <- "Pistillate"
AncestralCharacterizeR(CharData,ThisTree,KeyPath,AnalType)

AncestralCharacterTabulatoR <- function(foldR,todoz,threshold)
{
  filez <- list.files(foldR)
  allTabz <- filez[grepl("_Table.csv",filez)]
  todoz <- as.vector(todoz)
  OUT <- as.data.frame(matrix(nrow=0,ncol=5))
  colnames(OUT) <- c("Node","Type","Trait","Largest","Proportion")
  for(i in 1:length(todoz))
  {
    message(sprintf("Processing No. %s of %s",i,length(todoz)))
    theseTabz <- allTabz[grepl(todoz[i],allTabz)]
    for(j in 1:length(theseTabz))
    {
      uwu <- read.csv(sprintf("15_Morphological_Analysis/%s",theseTabz[j]))
      toBind <- as.data.frame(matrix(nrow=6,ncol=5))
      colnames(toBind) <- c("Node","Type","Trait","Largest","Proportion")
      toBind[,1] <- uwu[2:7,1]
      toBind[,2] <- todoz[i]
      toBind[,3] <- rep(gsub("_Table.csv","",gsub("MorphologicalAnalysis_","",theseTabz[j])),nrow(toBind))
      for(k in 2:nrow(uwu))
      {
        toBind[k-1,4] <- paste(colnames(uwu)[which(uwu[k,c(3:ncol(uwu))] == max(uwu[k,c(3:ncol(uwu))]))+2],collapse=" & ")
        toBind[k-1,5] <- paste(max(uwu[k,c(3:ncol(uwu))]),collapse=" & ")
      }
      OUT <- rbind(OUT,toBind)
    }
  }
  ACTab <- OUT
  UniCladez <- sort(unique(ACTab$Node))
  UniCharz <- unique(ACTab$Trait)
  ShortFormTab <- as.data.frame(matrix(nrow=length(UniCharz),ncol=length(UniCladez)+1))
  ShortFormTab[,1] <- UniCharz
  colnames(ShortFormTab) <- c("Character",UniCladez)
  for(i in 1:length(UniCladez))
  {
    them <- ACTab[ACTab$Node == UniCladez[i],]
    for(j in 1:length(them$Largest))
    {
      message(sprintf("Processing Clade No. %s of %s Character No. %s of %s",i,length(todoz),j,length(them$Largest)))
      if(them$Proportion[j] > .5)
      {
        ShortFormTab[j,i+1] <- sprintf("%s (%s)",gsub("\\."," ",them$Largest[j]),them$Proportion[j])
      }else{
        ShortFormTab[j,i+1] <- "Uncertain"
      }
    }
  }
  return(ShortFormTab)
}


foldR <- "15_Morphological_Analysis"
todoz <- c("Staminate","Pistillate","WDAST")
ShortFormTab <- AncestralCharacterTabulatoR(foldR,todoz,threshold)

write.csv(ShortFormTab,"15_Morphological_Analysis/ShortFormTab.csv",row.names=F)

Making accessions map

BC_Points_for_PCA_Clean <- read.csv("8_Climate_Analysis/BC_Points_for_PCA_Clean.csv")
SppNamezClean <- read.csv ("8_Climate_Analysis/SppNamezClean.csv")

 Keypath <- "0_Raw_Data/Key.csv"
  comp1 <- "Ia"
  comp2 <- "Ib"
  Key <- read.csv(Keypath)
  if("I" %in% c(comp1,comp2))
  {
    comp1_Spp <- unique(Key[Key$Clade == comp1,1])
    comp2_Spp <- unique(Key[Key$Clade == comp2,1])
  }else{
    comp1_Spp <- unique(Key[Key$Subclade == comp1,1])
    comp2_Spp <- unique(Key[Key$Subclade == comp2,1])
  }

  comp1_Spp_Ia <- comp1_Spp
  comp2_Spp_Ib <- comp2_Spp


   Keypath <- "0_Raw_Data/Key.csv"
  comp1 <- "IIa"
  comp2 <- "IIb"
  Key <- read.csv(Keypath)
  if("I" %in% c(comp1,comp2))
  {
    comp1_Spp <- unique(Key[Key$Clade == comp1,1])
    comp2_Spp <- unique(Key[Key$Clade == comp2,1])
  }else{
    comp1_Spp <- unique(Key[Key$Subclade == comp1,1])
    comp2_Spp <- unique(Key[Key$Subclade == comp2,1])
  }

  comp1_Spp_IIa <- comp1_Spp
  comp2_Spp_IIb <- comp2_Spp

  Spp_Ia <- comp1_Spp_Ia
  Spp_Ib <- comp2_Spp_Ib
  Spp_IIa <- comp1_Spp_IIa
  Spp_IIb <- comp2_Spp_IIb

  Spp_Ia_geo <- unique(SppNamezClean[SppNamezClean$Species %in% Spp_Ia,])
  Spp_Ib_geo <- unique(SppNamezClean[SppNamezClean$Species %in% Spp_Ib,])
  Spp_IIa_geo <- unique(SppNamezClean[SppNamezClean$Species %in% Spp_IIa,])
  Spp_IIb_geo <- unique(SppNamezClean[SppNamezClean$Species %in% Spp_IIb,])

  measurements::conv_unit(Spp_Ia_geo$Lat, from = 'deg_min_sec', to = 'dec_deg')

  source("https://raw.githubusercontent.com/AMBarbosa/unpackaged/master/dms2dec", encoding = "UTF-8")

  Spp_Ia_geo$Lat <- dms2dec(Spp_Ia_geo$Lat)
  Spp_Ia_geo$Long <- dms2dec(Spp_Ia_geo$Long)
  Spp_Ib_geo$Lat <- dms2dec(Spp_Ib_geo$Lat)
  Spp_Ib_geo$Long <- dms2dec(Spp_Ib_geo$Long)
  Spp_IIa_geo$Lat <- dms2dec(Spp_IIa_geo$Lat)
  Spp_IIa_geo$Long <- dms2dec(Spp_IIa_geo$Long)
  Spp_IIb_geo$Lat <- dms2dec(Spp_IIb_geo$Lat)
  Spp_IIb_geo$Long <- dms2dec(Spp_IIb_geo$Long)
  Spp_Ia_geo <- cbind(Spp_Ia_geo,rep("Ia",nrow(Spp_Ia_geo)))
  colnames(Spp_Ia_geo)[5] <-"Subclade"
  Spp_Ib_geo <- cbind(Spp_Ib_geo,rep("Ia",nrow(Spp_Ib_geo)))
  colnames(Spp_Ib_geo)[5] <-"Subclade"
  Spp_IIa_geo <- cbind(Spp_IIa_geo,rep("Ia",nrow(Spp_IIa_geo)))
  colnames(Spp_IIa_geo)[5] <-"Subclade"
  Spp_IIb_geo <- cbind(Spp_IIb_geo,rep("Ia",nrow(Spp_IIb_geo)))
  colnames(Spp_IIb_geo)[5] <-"Subclade"
  Latz <- rbind(Spp_Ia_geo,Spp_Ib_geo,Spp_IIa_geo,Spp_IIb_geo)

  library(sp)
  Spp_Ia_geo_pts <- sp::SpatialPoints(coords = cbind(Spp_Ia_geo$Long,Spp_Ia_geo$Lat),CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
  Spp_Ib_geo_pts <- sp::SpatialPoints(coords = cbind(Spp_Ib_geo$Long,Spp_Ib_geo$Lat),CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
  Spp_IIa_geo_pts <- sp::SpatialPoints(coords = cbind(Spp_IIa_geo$Long,Spp_IIa_geo$Lat),CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
  Spp_IIb_geo_pts <- sp::SpatialPoints(coords = cbind(Spp_IIb_geo$Long,Spp_IIb_geo$Lat),CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))

  world <- sf::st_read("/Users/Mike/Downloads/ne_10m_land/ne_10m_land.shp")
  world_robin <- sf::st_transform(world, CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
  world_robin_sp <- sf::as_Spatial(world_robin)
  Spp_Ia_geo_pts_robin <- sf::as_Spatial(sf::st_transform(as(Spp_Ia_geo_pts,"sf"), CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")))
  Spp_Ib_geo_pts_robin <- sf::as_Spatial(sf::st_transform(as(Spp_Ia_geo_pts,"sf"), CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")))
  Spp_IIa_geo_pts_robin <- sf::as_Spatial(sf::st_transform(as(Spp_IIa_geo_pts,"sf"), CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")))
  Spp_IIb_geo_pts_robin <- sf::as_Spatial(sf::st_transform(as(Spp_IIb_geo_pts,"sf"), CRS("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")))
  world_sp <- sf::as_Spatial(world)

  library(sp)
  pdf("newmap3.pdf")
  plot(world_sp,col="lightgrey")
  plot(Spp_Ib_geo_pts,col="pink",pch=8,add=T)
  plot(Spp_Ia_geo_pts,col="red",pch=8,add=T)
  plot(Spp_IIa_geo_pts,col="cyan",pch=8,add=T)
  plot(Spp_IIb_geo_pts,col="blue",pch=8,add=T)
  dev.off()

Pandan Plastome Project Pipeline

J. M. A. Wojahn, M. W. Callmander, S. Buerki

2024