3. How to use Pear on python¶
Here we report the code used to generate the examples in the manuscript, providing a general outline of the use of the python library
tree_set
is the main module of pear_ebi
- it contains both the tree_set and the set_collection objects.
from pear_ebi import tree_set
import numpy as np
import pandas as pd
import time
import os
Examples¶
Here we do not delve into the specifics or the rationale of the experiments - for that we redirect you to the related manuscript - but rather focus on the use of the library to analyze sets of trees in different settings.
BEAST¶
Bayesan - Markow Chain Monte Carlo ¶
This is a relatively simple example: we have a few files containing trees in Newick format, where each set of trees is produced by a program that produced them sequentially. We want to represent the distribution of these trees in order to analyze the single trajectories and to compare the different trjectories coming from different runs.
With this idea in mind, we simply compute the square distance matrix using the Robinson Foulds
metric, which generally represents the relations between trees in an effective way, using the most efficient algorithm: hashrf
.
We then compute and plot the PCoA
embeddings in 2 and 3 dimensions.
Step 1: load trees¶
tree_set is meant to store and manage the analysis of a single set of trees stored in the same file. These trees have to be in the (standard) Newick format.
tree_set = tree_set.tree_set(file, output_file=None, distance_matrix=None, metadata=None)
The first argument indicates the file where the trees are stored. output_file
can be empty and may be used to indicate a specific name and path for the output distance matrix. distance_matrix
can be used to indicate a precomputed distance matrix in the .csv
format. metadata
may be used to indicate a .csv
file with additional information relative to the trees. It must be of the same dimensionality of the set of trees, hence \(size = (|trees|, |meta\)-\(variables|)\).
set_collection performs the same tasks of tree_set, but stores multiple set of trees. NB: for each file, a set of trees is defined within the set collection. In practice, a set_collection is a collection of different tree_set, each one containing a set of trees coming from a different file. It should not come as a surprise, then, that the input collection=list()
, potentially empty (one may initialize an empty collection), is a list of files containing trees in the Newick format. set_collection=tree_set.set_collection(collection=list(), file="Set_collection_", output_file=None, distance_matrix=None, metadata=None,)
file
may be used to indicate an alternative name for the file containing the collection information. This file will be used by pear to store information relative to the sets of trees. output_file
may be used to indicate a specific name for the distance matrix file. distance_matrix
can be used to indicate a precomputed distance matrix in the .csv
format. metadata
may be used to indicate a .csv
file with additional information relative to the trees. It must be of the same dimensionality of the set of trees, hence \(size = (|trees|, |meta\)-\(variables|)\).
beast_dir = '../beast_trees'
beast_runs = [os.path.join(beast_dir, f"beast_run{i}.trees") for i in range(1,9)] + [os.path.join(beast_dir, f"beast_long.trees")]
beast_collection = tree_set.set_collection(beast_runs)
print(beast_collection)
─────────────────────────────
Tree set collection containing 9009 trees;
File: Set_collection_d67b3fc3-bac4-4e5f-934c-0c8dc668f012;
Distance matrix: not computed.
─────────────────────────────
beast_run1; Containing 1001 trees.
beast_run2; Containing 1001 trees.
beast_run3; Containing 1001 trees.
beast_run4; Containing 1001 trees.
beast_run5; Containing 1001 trees.
beast_run6; Containing 1001 trees.
beast_run7; Containing 1001 trees.
beast_run8; Containing 1001 trees.
beast_long; Containing 1001 trees.
Step 2: compute distances¶
beast_collection.calculate_distances(method="hashrf_RF")
# where method can be chosen among hashrf_RF,
# hashrf_wRF, smart_RF, tqdist_quartet, tqdist_triplet
Output()
hashrf_RF | Done!
Step 3: compute embeddings¶
beast_collection.embed(method="pcoa", dimensions=3, quality=False)
# where method can be chosen among pcoa, tsne, isomap, lle
# pro tip: if the distance matrix has not been computed prior to calling this function,
# it will automatically be computed using hashrf_RF
Output()
pcoa | Done!
Step 4: Plot embeddings¶
beast_collection.plot_2D(
method='pcoa', # method used for the embedding - if not previously computed, it will be computed when calling this function
save=False, # save the plot in a pdf format? The .html will be saved anyway!
name_plot=None, # specific name for the plot
static=False, # create a static element rather than a dynamic widget
plot_meta="SET-ID", # meta-variable used to colour the points (trees)
plot_set=None, # select specific set of sets of trees in the collection for the plot
select=False, # create widgets to hide/display tree_sets in the graph
same_scale=False,) # use the same colorscale for each tree_set (useful when the same metric is compared among sets)
beast_collection.plot_3D(
method='pcoa',
save=False,
name_plot=None,
static=False,
plot_meta="SET-ID",
plot_set=None,
select=False,
same_scale=False,
z_axis=None,) # substitute the 3rd axis (Z-axis/3rd PCoA) with a meta-variable of choice
RAxML¶
Bootstrap Analysis ¶
We do not delve into the biological interpretation of this specific example, as we discussed that in our manuscript.
Instead, we point your attention to some specific features of pear's interactive representations:
- First: you can directly plot the
PCoA
embedding of the Robinson Foulds distances computed withhashrf
without explicitly going through these steps. In fact, running a plotting function bypasses steps 2 and 3, performing them "under the hood"; - In the 2D representation:
- From the dropdown menu, one can choose the meta-vatiable used to color the points in the graphs;
- The red button "Save plot as pdf" saves the plot in a
.pdf
file; - The \(#show
name of the tree set
\) buttons allow showing and hiding specific sets of trees; - One may choose between the scatter representation or a contour plot;
- While visualizing a contour plot, one may choose to hide or show the points (trees) with the light blue button "Show points on a Contour plot";
- One can choose between visualizing only the points or also the connections between them (indicating the progress through an hypotetical sequential order);
- One can use the native tools of the plot so zoom in to focus on a specific area of the graph;
- If one wants to focus their attention on specific sets of trees, they just have to click them: this will highlight the selected set and make the other sets transparent. One may continue and highlight on other sets by simply clicking on them. In order to go back to the original representation, one just needs to click on one of the previously highlighted sets for a second time.
- In the 3D representation:
- Same features of the 2D representation. However, there is no contour plot in this case. ───────────────────────────── Tree set collection containing 3656 trees; File: Set_collection_4cbb9500-1e98-40cd-84ec-a9c8bd511698; Distance matrix: not computed. ───────────────────────────── bootstrap_6086; Containing 500 trees. bootstrap_5261; Containing 600 trees. bootstrap_5092; Containing 800 trees. bootstrap_281; Containing 600 trees. bootstrap_11289; Containing 600 trees. bootstrap_10409; Containing 550 trees. bestTree_6086; Containing 1 trees. bestTree_5261; Containing 1 trees. bestTree_5092; Containing 1 trees. bestTree_281; Containing 1 trees. bestTree_11289; Containing 1 trees. bestTree_10409; Containing 1 trees.
input_dir = '../bootstrap_mammals/' files = ([f'bootstrap_{i}' for i in [6086, 5261, 5092, 281, 11289, 10409]] + [f'bestTree_{i}' for i in [6086, 5261, 5092, 281, 11289, 10409]]) collection = tree_set.set_collection([os.path.join(input_dir, f) for f in files]) print(collection)
collection.plot_2D('pcoa', select=True)
collection.plot_3D('pcoa', select = True)
MAPLE
Maximum Likelihood Search
In this specific example, we focus the readers' attention on the possibility of specifying a metadata file to add meta-variable and use them to color the points in the plots or to replace the $Z_{axis}$ in the 3D plot.
In the cell below we specify a list of files containing trees and a list of files containing the likelihood of each tree.In the cell below we check that "DisMat_Maple.csv", the precomputed distance matrix, is present in the folder. We precomputed the distances as the `smart_RF` algorithm, which computes a modified version of the Robinson Foulds metric, takes much longer than `hashrf_RF`.maple_tree = [ "IQtreeStartingTree_Trees", "MapleStartingTree_Trees", "ParsimonyRAxMLStartingTree_GTRmodel_Trees", "RAxMLNGStartingTree_Trees", "UshERStartingTree_Trees", "TrueTreeSimulations", ] maple_LK = [ "IQtreeStartingTree_LK", "MapleStartingTree_LK", "ParsimonyRAxMLStartingTree_GTRmodel_LK", "RAxMLNGStartingTree_LK", "UshERStartingTree_LK", ] maple_dir = '../MAPLE_res/'
───────────────────────────── Tree set collection containing 138 trees; File: Set_collection_17e30756-0754-4fbb-9ed7-cb862716a7f8; Distance matrix: computed. ───────────────────────────── IQtreeStartingTree_Trees; Containing 29 trees. MapleStartingTree_Trees; Containing 5 trees. ParsimonyRAxMLStartingTree_GTRmodel_Trees; Containing 47 trees. RAxMLNGStartingTree_Trees; Containing 26 trees. UshERStartingTree_Trees; Containing 30 trees. TrueTreeSimulations; Containing 1 trees. This is how we extracted the information from multiple files - one may simply skip this step as we then saved the information in the "Likelihoods.csv" file.set_list = [os.path.join(maple_dir, tree) for tree in maple_tree] try:collection_maple = tree_set.set_collection(set_list, distance_matrix = "DistMat_Maple.csv") except:collection_maple = tree_set.set_collection(set_list) finally:print(collection_maple)
LKs = dict() for lk_file in maple_LK: file = os.path.join(maple_dir, lk_file) with open(file, 'r') as f: LKs[lk_file] = np.array(f.readlines()) f.close() LK = list() for lk in LKs.values(): LK.extend(lk) LK = np.array([tree.replace('\n', '') for tree in LK], dtype=np.float64) LK = np.concatenate([LK,np.array([-257513])], axis = 0) LK = np.interp(LK, (LK.min(), LK.max()), (0, +1)) # scale LKs between 0 and 1 df_LK = pd.DataFrame({'LK':LK}) df_LK.to_csv("Likelihoods.csv") df_LK
LK 0 0.983439 1 0.983822 2 0.984404 3 0.985124 4 0.985631 ... ... 133 0.999844 134 0.999850 135 0.999976 136 1.000000 137 0.706890 138 rows × 1 columns
# Method 1: # Given that metadata is a pandas dataframe, # one may simply add columns to it! df_LK = pd.read_csv("Likelihoods.csv") collection_maple.metadata['LK'] = df_LK['LK'] collection_maple.metadata
SET-ID STEP LK 0 IQtreeStartingTree_Trees 0 0.983439 1 IQtreeStartingTree_Trees 1 0.983822 2 IQtreeStartingTree_Trees 2 0.984404 3 IQtreeStartingTree_Trees 3 0.985124 4 IQtreeStartingTree_Trees 4 0.985631 ... ... ... ... 133 UshERStartingTree_Trees 26 0.999844 134 UshERStartingTree_Trees 27 0.999850 135 UshERStartingTree_Trees 28 0.999976 136 UshERStartingTree_Trees 29 1.000000 137 TrueTreeSimulations 0 0.706890 138 rows × 3 columns
# Method 2: # One may specify a file with additional # metadata when the object is created collection_maple = tree_set.set_collection(set_list, distance_matrix = "DistMat_Maple.csv", metadata="Likelihoods.csv") collection_maple.metadata
SET-ID STEP LK 0 IQtreeStartingTree_Trees 0 0.983439 1 IQtreeStartingTree_Trees 1 0.983822 2 IQtreeStartingTree_Trees 2 0.984404 3 IQtreeStartingTree_Trees 3 0.985124 4 IQtreeStartingTree_Trees 4 0.985631 ... ... ... ... 133 UshERStartingTree_Trees 26 0.999844 134 UshERStartingTree_Trees 27 0.999850 135 UshERStartingTree_Trees 28 0.999976 136 UshERStartingTree_Trees 29 1.000000 137 TrueTreeSimulations 0 0.706890 138 rows × 3 columns
(138, 138) We colour the trees based on their likelihood in the 2D plot, while we replace the $Z_{axis}$ with the likelihood in the 3D plot.if collection_maple.distance_matrix is None: start = time.time() collection_maple.calculate_distances('days_RF') np.savetxt("DistMat_Maple.csv", collection_maple.distance_matrix, delimiter=",") print(time.time() - start) collection_maple.distance_matrix.shape
`same_scale` makes sure that points with the same value have the same colour in the graphs.collection_maple.plot_3D('pcoa', plot_meta = 'LK', same_scale = True, select = True)
What if I wanted to make one of these trees to "pop out"? Well, we can add a coulumn called `"highlight"` that will automatically be read by the plotting function to "highloght" the points specified. The columns has to be binary (0s and 1s), where the 1s indicate that a tree should be highlighted. It is crucial here to know the order and size of the sets, as the column is shared by the whole collection. Note that: the vector `"highlight"` must be integer (dtype = int)!.collection_maple.plot_3D('pcoa', plot_meta = 'LK', same_scale = True, select = True, z_axis = 'LK')
Easy example:
we want to highight the true tree, which is the last in the collection.highlight_mask = np.zeros(collection_maple.metadata.shape[0], dtype=int) # vector of 0s with size = n_trees highlight_mask[-1] = 1 # last element set to 1 collection_maple.metadata['highlight'] = highlight_mask collection_maple.metadata
SET-ID STEP LK highlight 0 IQtreeStartingTree_Trees 0 0.983439 0 1 IQtreeStartingTree_Trees 1 0.983822 0 2 IQtreeStartingTree_Trees 2 0.984404 0 3 IQtreeStartingTree_Trees 3 0.985124 0 4 IQtreeStartingTree_Trees 4 0.985631 0 ... ... ... ... ... 133 UshERStartingTree_Trees 26 0.999844 0 134 UshERStartingTree_Trees 27 0.999850 0 135 UshERStartingTree_Trees 28 0.999976 0 136 UshERStartingTree_Trees 29 1.000000 0 137 TrueTreeSimulations 0 0.706890 1 138 rows × 4 columns
collection_maple.plot_2D('pcoa', plot_meta = 'LK', same_scale = True, select = False, static = True)
Hard (not really) example:
we want to highight the true tree, and the last tree of each set. In order to perform this task, we can exploit an additional tool: the `set_collection.data`. This object contains some useful information regarding the structure of each set composing the collection. In particular, we may be interested in the `n_trees` variable which can be used to assess the size of each set!IQtreeStartingTree_Trees has 29 trees MapleStartingTree_Trees has 5 trees ParsimonyRAxMLStartingTree_GTRmodel_Trees has 47 trees RAxMLNGStartingTree_Trees has 26 trees UshERStartingTree_Trees has 30 trees TrueTreeSimulations has 1 trees# this is the object --> collection_maple.data # Note that this is a nested dictionary for Tset, info in collection_maple.data.items(): print(Tset, f"has {info['n_trees']} trees")
highlight_mask = np.zeros(collection_maple.metadata.shape[0], dtype=int) last_tree = 0 for Tset, info in collection_maple.data.items(): last_tree += info["n_trees"] highlight_mask[last_tree - 1] = 1 collection_maple.metadata['highlight'] = highlight_mask collection_maple.metadata
SET-ID STEP LK highlight 0 IQtreeStartingTree_Trees 0 0.983439 0 1 IQtreeStartingTree_Trees 1 0.983822 0 2 IQtreeStartingTree_Trees 2 0.984404 0 3 IQtreeStartingTree_Trees 3 0.985124 0 4 IQtreeStartingTree_Trees 4 0.985631 0 ... ... ... ... ... 133 UshERStartingTree_Trees 26 0.999844 0 134 UshERStartingTree_Trees 27 0.999850 0 135 UshERStartingTree_Trees 28 0.999976 0 136 UshERStartingTree_Trees 29 1.000000 1 137 TrueTreeSimulations 0 0.706890 1 138 rows × 4 columns
collection_maple.plot_2D('pcoa', plot_meta = 'LK', same_scale = True, select = False)
Last update: 2024-04-29
- Same features of the 2D representation. However, there is no contour plot in this case.