BioPerl

February 2016

Bérénice Batut
berenice.batut@udamail.fr


![BioPerl logo](images/bioperl/BioPerl-1.png "BioPerl logo")

## Principle

- Collection of Perl modules
- Goal: make it easier to develop Perl scripts for bioinformatics applications
- Open source, hosted in a [GitHub](https://github.com/bioperl) organization
- Supported by the Open Bioinformatics Foundation

## History

- 1996: project start
- 2002
    - First Open Bio Hackathon
    - BioPerl 1.0
    - [Article](http://genome.cshlp.org/content/12/10/1611.abstract)

## Current status

- [GitHub](https://github.com/bioperl)
- 31 contributors
- Latest release: 1.6.924, July 2014
- Object-oriented
- More than 40 Perl modules

## Comparison with other Bio toolkits

Bio toolkit | Release 1.0 | Latest release | Main article | Citations
--- | --- | --- | --- | ---
[BioPerl](http://www.bioperl.org/) | 2002 | 07/2014 | [2002](http://genome.cshlp.org/content/12/10/1611.abstract) | 1,306
[BioPython](http://biopython.org/) | 2000 | 10/2015 | [2009](http://bioinformatics.oxfordjournals.org/content/25/11/1422.short) | 608
[BioJava](http://biojava.org/) | 2008 | 07/2015 | [2008](http://bioinformatics.oxfordjournals.org/content/24/18/2096.short) | 201
[BioRuby](http://www.bioruby.org/) | 2006 | 07/2015 | - | -
[BioPHP](http://www.biophp.org/) | 2003 | ? | - | -
[BioJS](http://biojs.net/) | 2013 | 09/2014 | [2013](http://bioinformatics.oxfordjournals.org/content/early/2013/02/23/bioinformatics.btt100.short) | 44
[Bioconductor](https://www.bioconductor.org/) | 2001 | 10/2015 | |

![](images/bioperl/google_scholar_results.svg "")

## Installation

### Installation on Linux/Mac OS

```
$ (sudo) cpan -i CPAN
$ cpan
cpan[1]> d /BioPerl/
Reading '/Users/cidam/.cpan/Metadata'
  Database was generated on Thu, 14 Jan 2016 13:53:43 GMT
Distribution    BOZO/Fry-Lib-BioPerl-0.15.tar.gz
Distribution    CDRAUG/Dist-Zilla-PluginBundle-BioPerl-0.20.tar.gz
Distribution    CJFIELDS/BioPerl-1.6.901.tar.gz
Distribution    CJFIELDS/BioPerl-1.6.923.tar.gz
Distribution    CJFIELDS/BioPerl-1.6.924.tar.gz
...
11 items found
cpan[2]> install CJFIELDS/BioPerl-1.6.924.tar.gz
```

# Working with sequences

## Representing a sequence

### Three object types for a sequence

- `Bio::PrimarySeq`
    - Sequence + name
    - FASTA file

### Three object types for a sequence

- `Bio::SeqFeatureI`
    - A feature on a sequence (sequence, location and annotation)
    - A single entry in an EMBL/GenBank/DDBJ feature table

### Three object types for a sequence

- `Bio::Seq`
    - One sequence and a collection of features
    - A single EMBL/GenBank/DDBJ entry
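The relationship between the three object types can be sketched as follows. This is a minimal illustration, not part of the original slides: the identifier `example_seq`, the feature coordinates and the `misc_feature` tag are made up, and `Bio::SeqFeature::Generic` is used here as one concrete class implementing the `Bio::SeqFeatureI` interface.

```perl
use strict;
use warnings;
use Bio::PrimarySeq;
use Bio::SeqFeature::Generic;
use Bio::Seq;

# Bio::PrimarySeq: only the sequence and its identifiers
my $primary = Bio::PrimarySeq->new(
    -seq => 'ACTGTGTGTCC',
    -id  => 'example_seq',          # hypothetical identifier
);

# A feature: a location on a sequence plus a tag describing it
my $feature = Bio::SeqFeature::Generic->new(
    -start       => 5,              # hypothetical coordinates
    -end         => 10,
    -strand      => 1,
    -primary_tag => 'misc_feature',
);

# Bio::Seq: a sequence that carries a collection of features
my $seq = Bio::Seq->new(
    -seq => $primary->seq,
    -id  => $primary->display_id,
);
$seq->add_SeqFeature($feature);

# List each feature with the part of the sequence it covers
for my $feat ($seq->get_SeqFeatures) {
    print join("\t",
        $feat->primary_tag,
        $feat->start,
        $feat->end,
        $seq->subseq($feat->start, $feat->end),
    ), "\n";
}
```

A `Bio::PrimarySeq` is enough when only the residues matter (for example a FASTA record), while `Bio::Seq` is the natural choice as soon as annotations have to travel with the sequence.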

## The `Bio::Seq` class

### The `Bio::Seq` class

```
$ perldoc Bio::Seq
NAME
    Bio::Seq - Sequence object, with features
...
DESCRIPTION
    A Seq object is a sequence with sequence features placed on it. The Seq
    object contains a PrimarySeq object for the actual sequence and also
    implements its interface.
...
```

Note: Show in a terminal

### Creating a `Bio::Seq` object

```
use Bio::Seq;

my $seqobj = Bio::Seq->new(
    -seq              => "ACTGTGTGTCC",
    -id               => "Chlorella sorokiniana",
    -accession_number => "CAA41635"
);
```

Note: Notebook

### Methods (1)

Methods that return strings; some also accept a string to set the corresponding property

```
$seqobj->seq();              # string of sequence
$seqobj->subseq(5,10);       # part of the sequence as a string
$seqobj->accession_number(); # when there, the accession number
$seqobj->alphabet();         # one of 'dna', 'rna', or 'protein'
$seqobj->version();          # when there, the version
$seqobj->length();           # length
$seqobj->desc();             # description
$seqobj->primary_id();       # a unique id for this sequence regardless
                             # of its display_id or accession number
$seqobj->display_id();       # the human readable id of the sequence
```

Note: Notebook

### Methods (2)

Methods that return new `Bio::Seq` objects

```
$seqobj->trunc(5,10);  # truncation from 5 to 10 as new object
$seqobj->revcom;       # reverse complements sequence
$seqobj->translate;    # translation of the sequence
```

Note: Notebook

### Methods (3)

Method that checks whether a string can be accepted by the `seq()` method

```
$seqobj->validate_seq($string);
```

Note: Notebook

### Manipulating sequences

```
$seq = $seqobj->seq();
$length = $seqobj->length();
$subseq = $seqobj->subseq($length/2, $length);
$new_seq = $seq.$subseq;
if ($seqobj->validate_seq($new_seq)) {
    $seqobj->seq($new_seq);
}
print $seqobj->seq(), " ", $seqobj->length(), "\n";
```

What will be printed?

Note: Notebook

### Translation

```
$translated_obj = $seqobj;
# 'eq' compares strings; '==' would compare them as numbers
if ($seqobj->alphabet() eq 'dna') {
    $translated_obj = $seqobj->translate();
}
print $translated_obj->seq(), "\n";
```

What will be printed?

Note: Notebook

## Getting statistics on a sequence

### The `Bio::Tools::SeqStats` class

```
$ perldoc Bio::Tools::SeqStats
NAME
    Bio::Tools::SeqStats - Object holding statistics for one particular
    sequence
...
DESCRIPTION
    Bio::Tools::SeqStats is a lightweight object for the calculation of
    simple statistical and numerical properties of a sequence. By
    "lightweight" I mean that only "primary" sequences are handled by the
    object. The calling script needs to create the appropriate primary
    sequence to be passed to SeqStats if statistics on a sequence feature
    are required. Similarly if a codon count is desired for a
    frame-shifted sequence and/or a negative strand sequence, the calling
    script needs to create that sequence and pass it to the SeqStats
    object.
...
```

Note: Show in a terminal

### Creation

```
use Bio::Tools::SeqStats;

$seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
```

Note: Notebook

### Methods

- `count_monomers`
    - Counts the occurrences of each type of monomer
- `get_mol_wt`
    - Computes the molecular weight
- `count_codons`
    - Counts the occurrences of each type of codon
- `hydropathicity`
    - Computes the mean Kyte-Doolittle hydropathicity

Note: Notebook
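A short sketch of how these methods might be called on a DNA sequence. The 12-nucleotide sequence and the identifier `example_seq` are made up for the illustration. `hydropathicity` is left out because it only applies to protein sequences.

```perl
use strict;
use warnings;
use Bio::PrimarySeq;
use Bio::Tools::SeqStats;

# Hypothetical DNA sequence whose length is a multiple of 3
my $seqobj = Bio::PrimarySeq->new(
    -seq      => 'ACTGTGTGTCCA',
    -alphabet => 'dna',
    -id       => 'example_seq',
);
my $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);

# count_monomers returns a hash reference: monomer => count
my $monomers = $seq_stats->count_monomers();
for my $base (sort keys %$monomers) {
    print "$base: $monomers->{$base}\n";
}

# get_mol_wt returns a reference to a two-element array holding
# the lower and upper bounds of the molecular weight
my $weight = $seq_stats->get_mol_wt();
print "Molecular weight between $weight->[0] and $weight->[1]\n";

# count_codons returns a hash reference: codon => count
my $codons = $seq_stats->count_codons();
for my $codon (sort keys %$codons) {
    print "$codon: $codons->{$codon}\n";
}
```

The molecular weight comes back as a pair of bounds because ambiguous symbols in a sequence can stand for residues of different weights.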

# Working with sequence files

### The `Bio::SeqIO` class

```
$ perldoc Bio::SeqIO
NAME
    Bio::SeqIO - Handler for SeqIO Formats
...
DESCRIPTION
    Bio::SeqIO is a handler module for the formats in the SeqIO set (eg,
    Bio::SeqIO::fasta). It is the officially sanctioned way of getting at
    the format objects, which most people should use.

    The Bio::SeqIO system can be thought of like biological file handles.
    They are attached to filehandles with smart formatting rules (eg,
    genbank format, or EMBL format, or binary trace file format) and can
    either read or write sequence objects (Bio::Seq objects, or more
    correctly, Bio::SeqI implementing objects, of which Bio::Seq is one
...
```

Note: Show in a terminal

### Creating a `Bio::SeqIO` object

Opens a stream on a file or on a string

### Constructor

Possible parameters

- `-file`
- `-string`
- `-format`: `fasta`, `nexus`, `fastq`, `quality`, `excel`, `raw`, `tab`, ...
- `-alphabet`: `dna`, `rna` or `protein`

### Methods

- `next_seq`
    - Reads the next sequence object from the stream
    - Returns a `Bio::Seq` object, or nothing if no sequence is left
- `write_seq`
    - Writes a `Bio::Seq` object to the stream
- `format`, `alphabet`, ...
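As a minimal sketch of these parameters and methods together, the code below reads FASTA records from a string with `-string` and writes them back out in `tab` format. The FASTA content is made up, and `-fh` (an already opened filehandle, here standard output) is a constructor parameter that the list above does not mention.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical FASTA content held in memory instead of a file
my $fasta = ">seq1 first example\nACTGTGTGTCC\n"
          . ">seq2 second example\nACTGTGTGTCCTGTGTCC\n";

# Input stream on the string
my $in = Bio::SeqIO->new(-string => $fasta, -format => 'fasta');

# Output stream on standard output, in tab-delimited format
my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'tab');

my $count = 0;
while (my $seq_obj = $in->next_seq) {   # false once the stream is exhausted
    $out->write_seq($seq_obj);
    $count++;
}
print STDERR "$count sequences converted\n";
```

Pairing an input stream with an output stream in another format is the usual way to convert between formats: the loop itself does not change, only the `-format` values do.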

### Writing sequences to a file

```
use Bio::SeqIO;
use Bio::Seq;

my $seqio_obj = Bio::SeqIO->new(-file   => '>sequence.fasta',
                                -format => 'fasta' );

my $seqobj = Bio::Seq->new(
    -seq => "ACTGTGTGTCC",
    -id  => "Chlorella sorokiniana"
);
$seqio_obj->write_seq($seqobj);

# Reuse the variable: a second 'my' would mask the first declaration
$seqobj = Bio::Seq->new(
    -seq => "ACTGTGTGTCCTGTGTCC",
    -id  => "Modified Chlorella sorokiniana"
);
$seqio_obj->write_seq($seqobj);
```

What does this code do?

Note: Notebook. Also show the contents of the output file

### Reading sequences from a file

```
use Bio::SeqIO;

$seqio_obj = Bio::SeqIO->new(-file   => "sequence.fasta",
                             -format => "fasta" );

while ($seq_obj = $seqio_obj->next_seq) {
    print $seq_obj->seq, "\n";
}
```

What does this code do?

Note: Notebook

# Accessing databases

## Retrieving a sequence from a database

### Accessible databases

Database | Module
--- | ---
[GenBank](http://www.ncbi.nlm.nih.gov/genbank/) | `Bio::DB::GenBank`
[SwissProt](http://web.expasy.org/docs/swiss-prot_guideline.html) | `Bio::DB::SwissProt`
[GenPept](http://www.ncbi.nlm.nih.gov/protein/) | `Bio::DB::GenPept`
[EMBL](http://www.ebi.ac.uk/ena) | `Bio::DB::EMBL`
SeqHound | `Bio::DB::SeqHound`
[Entrez Gene](http://www.ncbi.nlm.nih.gov/gene) | `Bio::DB::EntrezGene`
[RefSeq](http://www.ncbi.nlm.nih.gov/refseq/) | `Bio::DB::RefSeq`

### The `Bio::DB::GenBank` class

```
$ perldoc Bio::DB::GenBank
NAME
    Bio::DB::GenBank - Database object interface to GenBank
...
DESCRIPTION
    Allows the dynamic retrieval of Bio::Seq sequence objects from the
    GenBank database at NCBI, via an Entrez query
...
```

Note: Show in a terminal

### Constructor

```
use Bio::DB::GenBank;

$db_obj = Bio::DB::GenBank->new;
```

Note: Notebook

### Retrieving a sequence from a database

```
use Bio::DB::GenBank;
use Bio::Seq;

$db_obj = Bio::DB::GenBank->new;

$seq_obj = $db_obj->get_Seq_by_id(2);
print $seq_obj->display_id(), "\n";
```

Note: Notebook

## Retrieving several sequences with more complex queries

### Databases and modules for queries

Database | Module
--- | ---
[GenBank](http://www.ncbi.nlm.nih.gov/genbank/) | `Bio::DB::Query::GenBank`
[SwissProt](http://web.expasy.org/docs/swiss-prot_guideline.html) | `Bio::DB::Query::SwissProt`
[GenPept](http://www.ncbi.nlm.nih.gov/protein/) | `Bio::DB::Query::GenPept`
[EMBL](http://www.ebi.ac.uk/ena) | `Bio::DB::Query::EMBL`
SeqHound | `Bio::DB::Query::SeqHound`
[Entrez Gene](http://www.ncbi.nlm.nih.gov/gene) | `Bio::DB::Query::EntrezGene`
[RefSeq](http://www.ncbi.nlm.nih.gov/refseq/) | `Bio::DB::Query::RefSeq`

### The `Bio::DB::Query::GenBank` class

```
$ perldoc Bio::DB::Query::GenBank
NAME
    Bio::DB::Query::GenBank - Build a GenBank Entrez Query
...
DESCRIPTION
    This class encapsulates NCBI Entrez queries. It can be used to store a
    list of GI numbers, to translate an Entrez query expression into a
    list of GI numbers, or to count the number of terms that would be
    returned by a query. Once created, the query object can be passed to a
    Bio::DB::GenBank object in order to retrieve the entries corresponding
    to the query.
...
```

Note: Show in a terminal

### Creating a `Bio::DB::Query::GenBank` object

Opens a stream of `Bio::Seq` objects

### Constructor

Possible parameters

- `-db`: `protein`, `nucleotide`, ...
- `-query`
- `-mindate`
- `-maxdate`
- `-reldate`
- `-datetype`
- `-ids`
- `-maxids`

### Methods

- `count`
    - Returns the number of results of the query
- `ids`
    - Returns or sets the list of identifiers of the results

### Retrieving several sequences

```
use Bio::DB::GenBank;
use Bio::DB::Query::GenBank;

$query = "Arabidopsis[ORGN] AND topoisomerase[TITL] and 0:3000[SLEN]";
$query_obj = Bio::DB::Query::GenBank->new(-db    => 'nucleotide',
                                          -query => $query );

$gb_obj = Bio::DB::GenBank->new;

$stream_obj = $gb_obj->get_Stream_by_query($query_obj);

while ($seq_obj = $stream_obj->next_seq) {
    print $seq_obj->display_id, "\t", $seq_obj->length, "\n";
}
```

Note: Notebook

# Parsing search reports

### The `Bio::SearchIO` class

```
$ perldoc Bio::SearchIO
NAME
    Bio::SearchIO - Driver for parsing Sequence Database Searches (BLAST,
    FASTA, ...)
...
DESCRIPTION
    This is a driver for instantiating a parser for report files from
    sequence database searches. This object serves as a wrapper for the
    format parsers in Bio::SearchIO::* - you should not need to ever use
    those format parsers directly. (For people used to the SeqIO system
    it, we are deliberately using the same pattern).
...
```

Note: Notebook

### Creating a `Bio::SearchIO` object

Opens a stream on the file containing a search report

### Constructor

Possible parameters

- `-file`
- `-format`
- `-output_format`
- `-inclusion_threshold`
- `-signif`
- `-check_all_hits`
- `-min_query_len`
- `-best`

### Formats

Name | Format
--- | ---
`blast` | BLAST (WUBLAST, NCBIBLAST, bl2seq)
`fasta` | FASTA `-m9` and `-m0`
`blasttable` | BLAST `-m9` or `-m8` output (both NCBI and WUBLAST tabular)
`megablast` | MEGABLAST
`blastxml` | NCBI BLAST XML
... | ...

### Methods

- `next_result`
- `write_result`
- `write_report`
- `result_count`
- `best_hit_only`
- `check_all_hits`

### Data representation in `Bio::Search`

- `Bio::Search`
- `Bio::Search::Result`
- `Bio::Search::Hit`
- `Bio::Search::HSP` (high-scoring segment pair)

### Methods of `Bio::Search::Result`

- `algorithm`
- `query_name`
- `query_accession`
- `query_length`
- `query_description`
- `database_name`
- `available_statistics`
- `available_parameters`
- `num_hits`
- `hits`

### Methods of `Bio::Search::Hit`

- `name`
- `length`
- `accession`
- `description`
- `algorithm`
- `raw_score`
- `significance`
- `hsps`
- `num_hsps`
- `locus`
- `accession_number`

### Methods of `Bio::Search::HSP` (1)

- `algorithm`
- `evalue`
- `expect`
- `frac_identical`
- `frac_conserved`
- `gaps`
- `query_string`
- `hit_string`
- `length('total'/'hit'/'query')`
- `num_conserved`
- `num_identical`

### Methods of `Bio::Search::HSP` (2)

- `rank`
- `seq_inds('hit'/'query', 'identical'/'conserved'/'conserved-notidentical')`
- `score`
- `range('hit'/'query')`
- `percent_identity`
- `strand('hit'/'query')`
- `start('hit'/'query')`
- `end('hit'/'query')`
- `matches('hit'/'query')`
- `get_aln`

### Iterating over a BLAST report file

```
use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => "blast",
                            -file   => "report.bls");

# A report holds results, each result holds hits, each hit holds HSPs
while (my $result = $in->next_result) {
    while (my $hit = $result->next_hit) {
        while (my $hsp = $hit->next_hsp) {
            print "Query=", $result->query_name,
                " Hit=", $hit->name,
                " Length=", $hsp->length('total'),
                " Percent_id=", $hsp->percent_identity,
                "\n";
        }
    }
}
```

## References

- [BioPerl GitHub Page](http://bioperl.github.io/index.html)
- [BioPerl Wiki](http://www.bioperl.org/wiki/)