
1. How to use Pear

This tutorial goes through the basic features of pear_ebi and how to use them.


Setup

First of all, check that you have a supported Python version (3.7, 3.8, or 3.9). We strongly encourage creating a dedicated virtual environment to avoid conflicts with other libraries caused by mismatched dependency versions. We also support the use of mamba as a more efficient alternative to conda.
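If you are unsure which interpreter your environment picks up, a quick check like the one below can help. It is only a convenience sketch: the helper name is_supported is ours (it is not part of pear_ebi), and the version list simply mirrors the one stated above.

```python
import sys

# Versions listed as supported in this tutorial.
SUPPORTED_VERSIONS = {(3, 7), (3, 8), (3, 9)}


def is_supported(version_info=None):
    """Return True if the (major, minor) pair is in the supported list."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    status = "supported" if is_supported() else "not in the supported list"
    print("Python {} is {}".format(current, status))
```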

In your terminal:

Download and install mamba:
See documentation
Create new environment with one of the supported versions of python:
mamba create -n pear_env python=3.9
Activate environment:
mamba activate pear_env
Install pear:
python -m pip install pear_ebi


Alternatively:

Install virtualenv:
pip install virtualenv
Create new environment with one of the supported versions of python:
python3.9 -m venv pear_env
Activate environment:
source pear_env/bin/activate
Install pear:
python -m pip install pear_ebi

Optional

If you're planning on performing more advanced analyses, such as the ones described in the "Advanced Examples", you should install the extended requirements:
python -m pip install -r requirements.txt
and also install the new jupyter kernel:
python -m ipykernel install --user --name=pear_ebi


Basic Use

After following the steps above to set up pear_ebi, you should be ready to use all of its features!
This notebook is a good guide to learn how to use it and to check that your installation was successful. If you run into any problems, please contact us by filing an issue on GitHub.
To start with, simply check your installation by running:

!pear_ebi
PEAR v0.1.85
No files specified (see --help for instructions)
- Leaving PEAR -

Pear is complaining because no file was given... since it looks like you don't know how to use it, it kindly suggests seeking help with the --help (or simply -h) flag. Good idea!

!pear_ebi -h
usage: PEAR [-h] [-o output] [--interactive] [-d distance_matrix]
            [--meta metadata] [-m METHOD] [--pcoa PCOA] [--tsne TSNE] [--plot]
            [--config CONFIG] [--quality] [--dir DIR] [--pattern PATTERN]
            [input ...]

PEAR-EBI v0.1.85 | Phylogeny Embedding and Approximate Representation
Calculates Robison-Foulds distances between large set of trees

positional arguments:
  input                 input file : tree set in newic format

optional arguments:
  -h, --help            show this help message and exit
  -o output             output file : storage of distance matrix
  --interactive, -i     run the program in interactive mode - only the input
                        file, distance matrix, output file, and metadata
                        arguments will be considered
  -d distance_matrix, --dM distance_matrix
                        distance matrix : specify file containing a
                        precomputed distance matrix
  --meta metadata       metadata : csv file with metadata for each tree
  -m METHOD, --method METHOD
                        calculates tree distances using specified method
                        (hashrf_RF, hashrf_wRF, smart_RF, tqdist_quartet,
                        tqdist_triplet)
  --pcoa PCOA           embedding using PCoA: select number of components
                        (int) to be calculated
  --tsne TSNE           embedding using t-SNE: select number of components
                        (int) to be calculated
  --plot, -p            plot embedding in 2 or 3 dimensions
  --config CONFIG, -c CONFIG
                        toml config file
  --quality, -q         asess quality of embedding
  --dir DIR             directory with files
  --pattern PATTERN     pattern of files in directory

Author: Andrea Rubbi - Goldman Group | European Bioinformatics Institute

Essentially, pear_ebi does the following:

  • Reads phylogenetic trees in newick format from one or multiple files;
  • Computes the distances between the trees with one of the available metrics (refer to --help to see them and to the manuscript for their specific functions);
  • Embeds the distances in a lower dimensional space. Basic usage allows using either PCoA or tSNE;
  • Plots the resulting embeddings.
Why should you use pear_ebi? Because it's simple, fast, and produces nice graphs!

Load trees

Pear has two kinds of structures for your trees:

  1. tree_set;
  2. set_collection.
You define a tree_set with a set of trees coming from a single file, whereas a set_collection is composed of trees coming from multiple files, divided into groups depending on the file of origin. In practice, computationally and conceptually, a set_collection is a set of multiple tree_set objects.
We'll make this clear by exploring how to load trees into pear with a few examples.

# load single set of trees into a tree_set #
!pear_ebi beast_trees/beast_long.trees
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 1001 trees;
 File: beast_trees/beast_long.trees;
 Distance matrix: not computed.
─────────────────────────────


- Leaving PEAR -
# load two sets of trees into a set_collection #
!pear_ebi beast_trees/beast_run1.trees beast_trees/beast_run2.trees
PEAR v0.1.85
Your input:
─────────────────────────────            
 Tree set collection containing 2002 trees;            
 File: Set_collection_1bfec55f-592e-43a6-bf2b-4000219b3092;
 Distance matrix: not computed.                
───────────────────────────── 
beast_run1; Containing 1001 trees. 
beast_run2; Containing 1001 trees.


- Leaving PEAR -

Although you may be perplexed about the actual practical difference between these structures, we suggest you keep these questions for the next chapters and bear with us to explore some more funky ways of loading trees onto pear.
In fact, you can avoid boring repetitions of the path of your files and simply indicate the directory!

# load files from directory #
!pear_ebi --dir beast_trees
PEAR v0.1.85
Your input:
─────────────────────────────            
 Tree set collection containing 9009 trees;            
 File: Set_collection_696e3bb7-7deb-46e1-97c8-3bcce630cddd;
 Distance matrix: not computed.                
───────────────────────────── 
beast_run7; Containing 1001 trees. 
beast_run2; Containing 1001 trees. 
beast_long; Containing 1001 trees. 
beast_run5; Containing 1001 trees. 
beast_run6; Containing 1001 trees. 
beast_run3; Containing 1001 trees. 
beast_run8; Containing 1001 trees. 
beast_run4; Containing 1001 trees. 
beast_run1; Containing 1001 trees.


- Leaving PEAR -

Whoa! Too many... thankfully we can indicate a pattern to look for.

# load files from directory using pattern #
!pear_ebi --dir beast_trees --pattern "*run*"
PEAR v0.1.85
Your input:
─────────────────────────────            
 Tree set collection containing 8008 trees;            
 File: Set_collection_9781a60f-d6d7-4ffb-97c3-36d7aed9af4b;
 Distance matrix: not computed.                
───────────────────────────── 
beast_run7; Containing 1001 trees. 
beast_run2; Containing 1001 trees. 
beast_run5; Containing 1001 trees. 
beast_run6; Containing 1001 trees. 
beast_run3; Containing 1001 trees. 
beast_run8; Containing 1001 trees. 
beast_run4; Containing 1001 trees. 
beast_run1; Containing 1001 trees.


- Leaving PEAR -

If you are a wildcard wizard you can probably select any set of similarly-named files in this way (the patterns shown here are shell-style wildcards such as * and [12], rather than full regular expressions). Me? I generally ask GPT to write the magic formula. NB: you can also combine --dir, --pattern, and explicit file paths if one or more files come from other directories.

!pear_ebi beast_trees/beast_long.trees --dir beast_trees --pattern "*run[1,2]*" 
PEAR v0.1.85
Your input:
─────────────────────────────            
 Tree set collection containing 3003 trees;            
 File: Set_collection_21d56458-1d6b-4dab-9842-f8a65287e845;
 Distance matrix: not computed.                
───────────────────────────── 
beast_run2; Containing 1001 trees. 
beast_run1; Containing 1001 trees. 
beast_long; Containing 1001 trees.


- Leaving PEAR -
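If you want to try a pattern before handing it to pear, Python's standard fnmatch module implements the same family of shell-style wildcards. This is a sketch under an assumption: we have not checked how pear matches patterns internally, so treat the result as a sanity check rather than a guarantee. The file names mirror the beast_trees directory used above, and the helper select is ours.

```python
from fnmatch import fnmatchcase

# File names mirroring the beast_trees directory used above.
files = ["beast_long.trees"] + [
    "beast_run{}.trees".format(i) for i in range(1, 9)
]


def select(pattern, names):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(select("*run*", files))      # the eight run files, beast_long excluded
print(select("*run[12]*", files))  # only run1 and run2

# Inside brackets a comma is just one more character to match,
# so "[1,2]" accepts "1", "2" and also a literal ",".
print(fnmatchcase("beast_run,.trees", "*run[1,2]*"))
```

For these file names "*run[1,2]*" and "*run[12]*" select the same two files, but the second form says what you mean.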

Compute Distances

You can compute the distance matrix using different methods. Each method has a specific purpose, which is outlined in the associated paper. In general, however, the Robinson-Foulds distance is a good default choice. Additionally, pear_ebi computes this metric using the hashrf algorithm, which is the fastest way of computing it to date. You can select any available method with the -m (or --method) flag. When a method is specified, pear uses it to compute the distance matrix even if a precomputed distance matrix is given, and it overwrites any matrix previously saved to a file with the same standard format (realistically, this happens only if the matrix was produced by pear during an interactive or advanced session; see the Interactive Sessions chapter).
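For intuition about what the Robinson-Foulds metric measures, here is a deliberately naive sketch in plain Python. It is not hashrf and not what pear runs: trees are written as nested tuples of leaf names (our own toy representation), and the distance is the number of non-trivial leaf splits found in one tree but not in the other.

```python
from itertools import chain


def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset(chain.from_iterable(leaves(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial leaf splits induced by the internal edges."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick a canonical side
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        # Keep only splits with at least two leaves on each side.
        if len(side) > 1 and len(other) > 1:
            splits.add(other if anchor in side else side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(tree_a, tree_a))  # identical trees
print(robinson_foulds(tree_a, tree_b))  # B and C swapped
```

Comparing every pair of trees this way quickly becomes slow for thousands of trees, which is exactly where a dedicated algorithm pays off.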

# compute Robinson-Foulds distances using hashrf #
!pear_ebi beast_trees/beast_run1.trees -m hashrf_RF
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 1001 trees;
 File: beast_trees/beast_run1.trees;
 Distance matrix: not computed.
─────────────────────────────

⠙ Calculating distances...
hashrf_RF | Done!

- Leaving PEAR -
# compute Robinson-Foulds distances using hashrf on a set_collection #
!pear_ebi --dir beast_trees --pattern "*run[12]*" --method hashrf_RF
PEAR v0.1.85
Your input:
─────────────────────────────            
 Tree set collection containing 2002 trees;            
 File: Set_collection_c9cf5472-d392-4825-b88f-e9feffdb331a;
 Distance matrix: not computed.                
───────────────────────────── 
beast_run2; Containing 1001 trees. 
beast_run1; Containing 1001 trees.

⠙ Calculating distances...
hashrf_RF | Done!

- Leaving PEAR -
# compute quartet distances #
!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees -m tqdist_quartet
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: not computed.
─────────────────────────────

⠴ Calculating distances...
tqdist_quartet | Done!

- Leaving PEAR -
# compute modified RF distances #
!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees -m smart_RF
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: not computed.
─────────────────────────────

⠙ Calculating distances...
smart_RF | Done!

- Leaving PEAR -

If you run the pear_ebi commands above you will notice the difference in time efficiency between hashrf and the other algorithms. Please note the astonishing performance of that algorithm, especially considering the considerable difference in the number of trees analyzed!

Given that one may want to analyze the distance matrix in multiple ways, possibly desiring to skip the hefty distance-computation step, we introduced the convenient -d flag, which allows specifying a previously computed distance matrix. Please note that the order of the trees in the distance matrix is preserved, so you must be consistent with the input specification when reusing a distance matrix. Should you disregard this suggestion, we suggest you also disregard your downstream analyses (and perhaps everything else as well).

!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees -m hashrf_RF -o precomputed_distance_matrix
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: not computed.
─────────────────────────────

⠴ Calculating distances...
hashrf_RF | Done!

- Leaving PEAR -
# reusing distances #
!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees -d precomputed_distance_matrix
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: computed.
─────────────────────────────


- Leaving PEAR -

Please note the difference in the input "Distance matrix" status.
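To see why the tree order matters when a matrix is reused, consider the sketch below. It is purely illustrative: the labels and the helper reorder_matrix are hypothetical, and nothing here reflects how pear stores its matrices on disk. Row and column i of a distance matrix belong to the i-th tree of the input, so a different input order needs the rows and columns permuted accordingly.

```python
def reorder_matrix(matrix, saved_order, wanted_order):
    """Permute rows and columns so they follow wanted_order.

    Labels are assumed to be unique.
    """
    if sorted(saved_order) != sorted(wanted_order):
        raise ValueError("the two orders must list the same trees")
    index = {label: i for i, label in enumerate(saved_order)}
    picks = [index[label] for label in wanted_order]
    return [[matrix[i][j] for j in picks] for i in picks]


# Toy matrix for three trees originally given in the order a, b, c.
matrix = [
    [0, 1, 2],
    [1, 0, 3],
    [2, 3, 0],
]
reordered = reorder_matrix(matrix, ["a", "b", "c"], ["c", "a", "b"])
for row in reordered:
    print(row)
```

Feeding the original matrix together with the reordered input would silently pair each tree with the wrong row, which is why keeping the input specification identical is the simplest policy.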

Embed distances

Here we show the easy way to embed the distances and produce nice plots! First of all, you may choose any number of dimensions \(M \leq N\), where \(N\) is the number of trees and \(M\) is the chosen dimensionality of the embedded space. If \(M \lt 3\), pear produces a single 2D plot; if \(M \geq 3\), pear produces both a 2D and a 3D plot. Please note that we are currently unable to produce human-friendly representations of \(\gt 3\)D data... In any case, whichever dimension \(M\) is chosen, an \(M\)-dimensional embedding is produced and then saved on the machine.

# compute triplet distances, embed in 2D using pcoa #
!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees -m tqdist_triplet --pcoa 2
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: not computed.
─────────────────────────────

⠹ Calculating distances...
tqdist_triplet | Done!
⠋ Embedding distances...
pcoa | Done!

- Leaving PEAR -
# compute RF distances, embed in 5D using tsne #
!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees -m hashrf_RF --tsne 5
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: not computed.
─────────────────────────────

⠦ Calculating distances...
hashrf_RF | Done!
⠸ Embedding distances...
tsne | Done!

- Leaving PEAR -
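For readers curious about what a PCoA embedding involves, here is a minimal, dependency-free sketch of classical PCoA: square the distances, double-centre them, and keep the leading eigenvectors scaled by the square root of their eigenvalues. It uses simple power iteration so that it runs with the standard library alone; it is not pear's implementation, and the function name and parameters are ours.

```python
import math
import random


def pcoa(distances, n_components, n_iter=500, seed=0):
    """Classical PCoA via power iteration; one coordinate tuple per item."""
    n = len(distances)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    squared = [[d * d for d in row] for row in distances]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    axes = []
    for _ in range(n_components):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no structure left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient: eigenvalue associated with the unit vector v.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))
        # Negative eigenvalues (non-Euclidean distances) contribute no axis.
        scale = math.sqrt(max(eigenvalue, 0.0))
        axes.append([scale * x for x in v])
        # Deflate B so the next pass finds the next component.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return [tuple(axis[i] for axis in axes) for i in range(n)]


def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(p, q)))


# Four corners of a 3 x 1 rectangle: a 2D embedding should recover them
# up to rotation and reflection, so pairwise distances are preserved.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0), (0.0, 1.0)]
D = [[euclid(p, q) for q in points] for p in points]
coords = pcoa(D, 2)
print(coords)
```

Tree distances are generally not exactly Euclidean, so in practice a low-dimensional embedding only approximates the original distances; the quality measures discussed below quantify how well.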

Use the --plot flag to show the plots at the end. Note that, when an embedding method is specified, the plots are produced regardless of whether the --plot flag is present or not.

# compute RF distances, embed in 5D using tsne, show the plots #
!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees -m hashrf_RF --tsne 5 --plot
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: not computed.
─────────────────────────────

⠴ Calculating distances...
hashrf_RF | Done!
⠸ Embedding distances...
tsne | Done!

- Leaving PEAR -

NB: this is something you are expected to run in a terminal, which is why the graph doesn't show up here.

The --quality (or -q) flag tells pear to report some quality metrics for the generated embedding.
The set of quality measures provided by pear may be expanded upon request in the future.

# compute RF distances, embed in 5D using tsne, computing quality #
!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees -m hashrf_RF --tsne 5 --quality
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: not computed.
─────────────────────────────

⠼ Calculating distances...
hashrf_RF | Done!
⠼ Embedding distances...
tsne | Done!
With 5 components/dimensions, the estimated correlation with the 32-dimensional 
coordinates is -0.03

- Leaving PEAR -
# compute RF distances, embed in 2D using pcoa, computing quality #
!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees -m hashrf_RF --pcoa 2 --quality
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: not computed.
─────────────────────────────

⠼ Calculating distances...
hashrf_RF | Done!
⠋ Embedding distances...
pcoa | Done!
With 2 components/dimensions, the explained variance is 96.56,
 with an estimated correlation 1.00 with the 32-dimensional coordinates

- Leaving PEAR -
# compute RF distances, embed in 10D using pcoa, computing quality #
!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees -m hashrf_RF --pcoa 10 --quality
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: not computed.
─────────────────────────────

⠦ Calculating distances...
hashrf_RF | Done!
⠋ Embedding distances...
pcoa | Done!
With 10 components/dimensions, the explained variance is 99.98,
 with an estimated correlation 1.00 with the 32-dimensional coordinates

- Leaving PEAR -
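One common way to judge an embedding is to correlate the original pairwise distances with the distances between the embedded points. The sketch below does exactly that with a hand-written Pearson correlation. We are assuming this is the general spirit of the correlation reported above; pear's own estimator may well differ, and the function names are ours.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def embedding_correlation(distances, coords):
    """Correlate original distances with distances in the embedding."""
    original, embedded = [], []
    for i, j in combinations(range(len(coords)), 2):
        original.append(distances[i][j])
        gap = math.sqrt(sum((a - b) ** 2
                            for a, b in zip(coords[i], coords[j])))
        embedded.append(gap)
    return pearson(original, embedded)


# Three items whose distances are reproduced exactly by points on a line,
# so the correlation is 1 up to floating-point rounding.
distances = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
coords = [(0.0,), (1.0,), (3.0,)]
print(embedding_correlation(distances, coords))
```

A value close to 1 means the embedding keeps far trees far and close trees close; a value near 0, like the t-SNE run above, suggests the embedded distances should not be read as faithful global distances.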

Pro tip: since the distance matrix is needed before an embedding can be computed, pear automatically computes the distances with hashrf_RF when no method is indicated (and no precomputed distance matrix is specified).

!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees --pcoa 2
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: not computed.
─────────────────────────────

⠦ Calculating distances...
hashrf_RF | Done!
⠋ Embedding distances...
pcoa | Done!

- Leaving PEAR -

tree_set vs set_collection

Pear has two kinds of structures for your trees:

  1. tree_set;
  2. set_collection.
tree_set allows for analyzing a single set of trees. set_collection allows for analyzing multiple sets of trees and the divergences between them. In our interactive plots, set_collection generates different sets of points that are colored according to the file of origin. This could be useful, for instance, if one performs multiple runs of a phylogenetic-tree estimation algorithm and wants to assess the robustness/consistency of that method.
The distance computation is essentially the same in both cases: a single square distance matrix is generated, encompassing the whole collection of trees.
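The idea of one square matrix over the whole collection, plus a label recording each tree's file of origin, can be sketched as follows. Everything here is a toy stand-in: plain numbers play the role of trees, the absolute difference plays the role of a tree metric, and the function names are ours rather than pear's.

```python
def flatten_collection(collection):
    """Concatenate all sets, remembering which set each tree came from."""
    trees, origins = [], []
    for name, tree_list in collection.items():
        trees.extend(tree_list)
        origins.extend([name] * len(tree_list))
    return trees, origins


def distance_matrix(trees, distance):
    """Build a square, symmetric matrix over the whole collection."""
    n = len(trees)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = distance(trees[i], trees[j])
    return matrix


# Toy stand-ins: numbers instead of trees, absolute difference as metric.
collection = {"run1": [0.0, 1.0, 2.0], "run2": [10.0, 11.0]}
trees, origins = flatten_collection(collection)
matrix = distance_matrix(trees, lambda a, b: abs(a - b))

print(len(matrix), len(matrix[0]))  # one row and one column per tree
print(origins)                      # group label for coloring each point
```

The origins list is what lets a plot color each embedded point by its file of origin, while the matrix itself makes no distinction between within-set and between-set pairs.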


Interactive Sessions

All the above, but staying in the loop!
You can add sets of trees iteratively and compute distances/embeddings as many times as you want.

!pear_ebi -i
PEAR v0.1.85
Specify file with tree set

File: ^C

- Leaving PEAR -
!pear_ebi MAPLE_res/IQtreeStartingTree_slower_Trees -i 
PEAR v0.1.85
Your input:
─────────────────────────────
 Tree set containing 32 trees;
 File: MAPLE_res/IQtreeStartingTree_slower_Trees;
 Distance matrix: not computed.
─────────────────────────────

PEAR | Interactive Mode
⣿⣿⣿⣿⣿⣿⣿⣿⠿⣟⠉⡿⠿⣿⣿⣿⣿⣿⣿⣿ -- Controls --
⣿⣿⣿⣿⡿⣿⢉⢳⠴⣞⠉⡷⢥⡏⡙⡿⢿⣿⣿⣿ 1 --> see status
⣿⣿⡋⢻⡤⣼⠉⢯⡆⣞⠙⣧⣢⠏⠪⣣⢦⡛⠹⣿ 2 --> calculate distances
⣿⣿⠓⢻⣄⣼⠋⢷⡠⡽⠚⣉⣤⡞⢚⢦⢢⠟⠹⣿ 3 --> embed distances
⣿⣿⢓⢻⡄⡼⠗⢃⣂⡒⠻⣧⣂⡿⠚⣨⢨⡓⠻⣿ 4 --> plot embeddings
⣿⣿⠗⢎⢄⠂⠾⣯⣂⡽⢓⢆⢔⠐⠿⣇⢅⡗⠻⣿ 5 --> add set to collection
⣿⣿⠖⢯⡡⣹⠗⣤⣉⠛⠶⣏⢌⡿⠲⡌⢌⢞⠼⣿ 6 --> get subset
⣿⣿⣮⣾⡉⣹⠦⣞⡉⡽⠶⡌⠌⡮⠲⣏⢩⣷⣼⣿
⣿⣿⣿⣿⣿⣿⣤⣞⢉⣳⠥⣏⠍⣧⣵⣿⣿⣿⣿⣿ 7 --> exit
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⣿⣿⣿⣿⣿⣿⣿⣿⣿ 8 --> see list of controls
Command:

There is more...

The flags --meta and --config introduce another layer of flexibility into our analyses.
Please refer to the "Advanced Examples" folder to get the gist of the full potential of pear!


Last update: 2024-04-29