LOSCHMIDT LABORATORIES
Bioinformatics: protein sequences and databases
□ Introduction
□ Primary sequence of proteins
□ Protein sequence databases
□ Sequence alignments
■ Evolution of proteins
■ Sequence-structure-function paradigm
■ Alignment of sequences
□ Prediction of protein properties from sequence
Bioinformatics databases & Structure prediction
3-Binf DB & Str. Pred -> Intro
Structure prediction
[Figure: headline "Artificial intelligence solves 50-year-old science problem (AlphaFold)", Google DeepMind]
Let's start from the beginning...
Protein synthesis
[Figure: transcription. One strand of the double-stranded DNA is transcribed into mRNA; each nucleotide triplet of the mRNA is a codon]
Protein synthesis proceeds through the following steps:
• Transcription: DNA -> RNA
• Splicing: RNA -> mRNA
• Translation: mRNA -> protein
• Post-translational modifications: protein -> mature protein
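The transcription and translation steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not a full model of protein synthesis: it assumes the input is the coding DNA strand written 5'->3', uses the standard genetic code, and skips splicing and post-translational modifications. The function names `transcribe` and `translate` are made up for this example.

```python
# Standard genetic code, built from the conventional U/C/A/G ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding DNA strand (assumption: 5'->3', no introns)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA codon by codon and stop at the first stop codon ('*')."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[start:start + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# Illustrative input only: a short made-up coding sequence.
peptide = translate(transcribe("ATGTCTCTTGGTTAA"))
```

Each codon (triplet) maps to exactly one amino acid, which is why the primary sequence of a protein can be derived directly from the nucleotide sequence.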
[Figure: translation. The strand of mRNA is read while the growing amino acid chain is assembled]
3-Binf DB & Str. Pred -> 1^ sequence of proteins
Levels of protein structure
• Primary structure: sequence of amino acids
• Secondary structure: α-helices
• Tertiary structure: folded polypeptide chain
• Quaternary structure: complex of protein molecules
Sources of protein sequences
□ Multiple databases available:
□ With different scope:
■ Generalist: sequences from any source (UniProtKB)
■ Specialist: sequences focusing on one or more specific conditions (e.g. biological pathway, disease, organism) (WormBase)
□ With different types of sequence content:
■ Primary sequence of proteins, plus annotations and cross-references to that sequence (UniProtKB)
■ Motif or profile databases: contain information derived from the primary sequence, in the form of abstractions (patterns) that distil the most conserved features among related proteins (Pfam)
3-Binf DB & Str. Pred -> protein seq. databases
□ UniProtKB
■ Collaboration between EBI, Swiss Institute of Bioinformatics and Protein Information Resource (PIR)
■ Central repository of protein sequences and functional information
■ Quality annotations: information on protein function and individual amino acids, experimental information, biological ontologies, classification, links to other databases
■ Quality level of the annotation is indicated (manual vs. automatic)
UniProt KB
□ Proteins: UniProt Knowledgebase, with Reviewed (Swiss-Prot) and Unreviewed (TrEMBL) entries
□ Species: Proteomes. Protein sets for species with sequenced genomes from across the tree of life
□ Protein Clusters: UniRef. Clusters of protein sequences at 100%, 90% & 50% identity
□ Sequence Archive: UniParc. Non-redundant archive of publicly available protein sequences seen across different databases
□ Supporting Data: Diseases, Keywords, Taxonomy, Literature Citations, Subcellular locations, Cross-referenced databases, UniRule automatic annotation, ARBA automatic annotation
□ Analysis Tools:
■ BLAST: search with a sequence to find homologs through pairwise sequence alignment
■ Align: align two or more protein sequences to find conserved regions
■ Search with Lists / Map IDs: find proteins with lists of UniProt IDs or convert from/to other database IDs
■ Search Peptides: search with a peptide sequence to find all UniProt proteins that contain exact matches
UniProt Knowledgebase (UniProtKB): Reviewed (Swiss-Prot) and Unreviewed (TrEMBL)
□ Main component of the database
□ Reviewed protein entries (Swiss-Prot):
• High-quality manual annotations
• Manual annotations -> reliable information
• >570,000 protein records (2024)
□ Automatic protein entries (TrEMBL):
• Automatic translation of coding sequences from the EMBL nucleotide data bank
• Automatic annotations -> lower quality, chance of errors
• ~250,000,000 protein records (2024), roughly 400x as many as Swiss-Prot
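The reviewed/unreviewed split is visible in UniProtKB FASTA headers, where the database prefix is `sp` for Swiss-Prot and `tr` for TrEMBL. The sketch below parses such a header. The helper name `parse_uniprot_header` is hypothetical, and the example header is assembled by hand from the LinB entry used in these slides (organism name shortened), so treat it as illustrative rather than as a verbatim download.

```python
import re

# UniProtKB FASTA header layout: >db|Accession|EntryName Description ...
HEADER = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+(?P<description>.*)$"
)


def parse_uniprot_header(line):
    """Split a UniProtKB FASTA header and flag whether the entry is reviewed."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError("not a UniProtKB FASTA header: %r" % line)
    fields = match.groupdict()
    fields["reviewed"] = fields["db"] == "sp"  # sp = Swiss-Prot, tr = TrEMBL
    return fields


# Illustrative header (hand-written for this example).
example = ">sp|D4Z2G1|LINB_SPHJU Haloalkane dehalogenase OS=Sphingobium japonicum OX=452662 GN=linB PE=1 SV=1"
entry = parse_uniprot_header(example)
```

Filtering on the `reviewed` flag is a simple way to restrict an analysis to manually curated records when reliability matters more than coverage.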
UniProt KB
□ Species: Proteomes. Protein sets for species with sequenced genomes from across the tree of life
• Proteomes for 25,000 model organisms available
• Different degrees of coverage (another 160,000 available)
□ Protein Clusters: UniRef. Clusters of protein sequences at 100%, 90% & 50% identity
• Groups of similar proteins to sample from
□ Sequence Archive: UniParc. Non-redundant archive of publicly available protein sequences seen across different databases
• Stable identifier repository
• Cross-references to 40 different external databases (generalist and specialist)
UniProt KB
[Screenshot: UniProtKB search for "LinB", 106 results]
• Filters: Status (Reviewed (Swiss-Prot): 4, Unreviewed (TrEMBL): 102), Taxonomy, Proteins with (3D structure: 4, Active site: 26, Activity regulation: 1, Beta strand: 2, Binary interaction: 1), Protein existence
• D4Z2G1 • LINB_SPHJU: Haloalkane dehalogenase • Sphingobium japonicum (strain DSM 16413 / CCM 7287 / MTCC 6362 / UT26 / NBRC 101211 / UT26S) • EC number: 3.8.1.5 • Gene: linB • 296 amino acids • Evidence at protein level • #Hydrolase #Detoxification • 1 domain • 3 active sites • 16 3D structures • 14 reviewed publications
• A0A1L5BTC1 • LINB_SPHIB: Haloalkane dehalogenase • 296 amino acids • #Hydrolase
• A4PEU6 • A4PEU6_9SPHN: Haloalkane dehalogenase • Sphingobium sp. MI1205 • EC number: 3.8.1.5 • Gene: linB (dhaA) • 296 amino acids • Evidence at protein level • #Hydrolase • 1 domain • 3 active sites • 8 3D structures • 4 publications
Each result shows:
• Annotation quality
• Info: name, source organism, EC activity, gene name, length
• Protein evidence
• Additional info: domains, 3D structures, active sites, publications
• Filters to narrow down the results
UniProt KB
Entry sections: Function, Names & Taxonomy, Subcellular Location, Phenotypes, PTM/Processing, Expression, Interaction, Structure, Family & Domains, Sequence, Similar Proteins
Function
Human-readable explanation of the protein function. Wealth of systematically organized information. In the illustrated example:
• Catalytic activity: with details of the enzymatic reaction and cross-links to chemical databases
• Activity regulation: competitive inhibitors
• Kinetics: experimental measurements towards n substrates
• Optimal pH
• Implication in biological pathways
• Catalytic and key residues (active/binding sites)
• Gene Ontology (GO) annotations (enrichment values)
• Enzyme/pathway and protein family databases
• Keywords
D4Z2G1 • LINB_SPHJU
Haloalkane dehalogenase • Sphingobium japonicum (strain DSM 16413 / CCM 7287 / MTCC 6362 / UT26 / NBRC 101211 / UT26S) • EC number: 3.8.1.5 • Gene: linB • 296 amino acids • Evidence at protein level
Function
Catalyzes hydrolytic cleavage of carbon-halogen bonds in halogenated aliphatic compounds, leading to the formation of the corresponding primary alcohols, halide ions and protons. Has a broad substrate specificity since not only monochloroalkanes (C3 to C10) but also dichloroalkanes (>C3), bromoalkanes, and chlorinated aliphatic alcohols are good substrates (PubMed:9293022, PubMed:10100638). Shows almost no activity with 1,2-dichloroethane, but very high activity with the brominated analog (PubMed:9293022).
Is involved in the degradation of the important environmental pollutant gamma-hexachlorocyclohexane (gamma-HCH or lindane) as it also catalyzes conversion of 1,3,4,6-tetrachloro-1,4-cyclohexadiene (1,4-TCDN) to 2,5-dichloro-2,5-cyclohexadiene-1,4-diol (2,5-DDOL) via the intermediate 2,4,5-trichloro-2,5-cyclohexadiene-1-ol (2,4,5-DNOL) (PubMed:7691794).
This degradation pathway allows S. japonicum UT26 to grow on gamma-HCH as the sole source of carbon and energy. (3 publications)
Miscellaneous
Is not N-terminally processed during export, so it may be secreted into the periplasmic space via a hitherto unknown mechanism. (1 publication)
Catalytic Activity
• 1-haloalkane + H2O = a halide anion + a primary alcohol + H(+) (EC 3.8.1.5; source: Rhea 19081)
• (3R,6R)-1,3,4,6-tetrachlorocyclohexa-1,4-diene + 2 H2O = 2,5-dichlorocyclohexa-2,5-dien-1,4-diol + 2 chloride + 2 H(+)
• ChEBI cross-links: 1-haloalkane CHEBI:18060, H2O CHEBI:15377, a halide anion CHEBI:16042, a primary alcohol CHEBI:15734, H(+) CHEBI:15378
Activity Regulation
Competitively inhibited by the key pollutants 1,2-dichloroethane (1,2-DCE) and 1,2-dichloropr…
Kinetics
• KM = 1.9 mM for 1,2-dibromoethane
• KM = 3.9 mM for 1-chloro-2-bromoethane
• KM = 0.9 mM for 1,2-dibromopropane
• KM = 0.05 mM for 1-bromo-2-methylpropane
• KM = 0.7 mM for 2,3-dichloropropene
• KM = 0.14 mM for 1-chlorobutane
• kcat is 0.98 sec(-1) with 1-chlorobutane as substrate
pH Dependence
Optimum pH is 8.2.
Pathway
Xenobiotic degradation; gamma-hexachlorocyclohexane degradation.
Features
Showing features for domain, active site, binding site.
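The kinetic constants listed in this entry for 1-chlorobutane (KM = 0.14 mM, kcat = 0.98 s^-1) can be turned into numbers that are easier to compare across substrates. The sketch below assumes simple Michaelis-Menten kinetics, which the entry itself does not state, and the function name is made up for the example.

```python
def michaelis_menten_rate(substrate_mM, kcat_per_s, km_mM):
    """Turnover per enzyme molecule (s^-1), assuming Michaelis-Menten kinetics."""
    return kcat_per_s * substrate_mM / (km_mM + substrate_mM)


# Values quoted in the UniProtKB entry for 1-chlorobutane.
KCAT_PER_S = 0.98
KM_MM = 0.14

# Catalytic efficiency kcat/KM, here in s^-1 mM^-1 (about 7).
efficiency = KCAT_PER_S / KM_MM

# At a substrate concentration equal to KM the rate is half of kcat.
half_max_rate = michaelis_menten_rate(KM_MM, KCAT_PER_S, KM_MM)
```

A low KM combined with a reasonable kcat indicates a good substrate, which is how the "good substrates" statement in the Function text can be read quantitatively.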
Features
Showing features for domain, active site, binding site.
• Active sites: nucleophile, proton donor, proton acceptor (3 publications each)
• Binding sites: chloride (two sites; 1 publication and 2 publications, combined sources)
• Positions visible in the screenshot: 132-132, 109-109
GO Annotations
ASPECT: TERM
• Cellular Component: periplasmic space (IEA:UniProtKB-SubCell)
• Molecular Function: haloalkane dehalogenase activity (IEA:UniProtKB-UniRule)
• Biological Process: response to toxic substance (IEA:UniProtKB-KW)
Keywords
• Molecular function: #Hydrolase
• Biological process: #Detoxification
Enzyme and pathway databases
• BRENDA: 3.8.1.5, 10293
• UniPathway: UPA00689
Protein family/group databases
• ESTHER: sphpi-linb, Haloalkane_dehalogenase-HLD2
Names & Taxonomy
Protein names
• Recommended name: Haloalkane dehalogenase (1 automatic annotation, 1 publication)
• EC number: 3.8.1.5 (1 automatic annotation, 1 publication)
• Alternative names: 1,3,4,6-tetrachloro-1,4-cyclohexadiene halidohydrolase (1,4-TCDN halidohydrolase) (1 publication)
Gene names
• Name: linB
• Ordered locus names: SJA_C1-19590 (imported)
Organism names
• Organism: Sphingobium japonicum (strain DSM 16413 / CCM 7287 / MTCC 6362 / UT26 / NBRC 101211 / UT26S)
• Taxonomic identifier: 452662 (NCBI)
• Taxonomic lineage: Bacteria > Proteobacteria > Alphaproteobacteria > Sphingomonadales > Sphingomonadaceae > Sphingobium
Accessions
• Primary accession: D4Z2G1
• Secondary accessions: P51698
Proteome
• Identifier: UP000007753
• Component: chromosome 1
Unique accession numbers (e.g. D4Z2G1 for entry LINB_SPHJU)
Serialized for sequence variants (covered later)
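Accession numbers follow a fixed pattern, so they are easy to recognize programmatically. The sketch below uses the accession pattern published in the UniProt help pages and adds an optional `-N` suffix for serialized isoforms. The helper name `split_accession` is hypothetical, and P21397-2 is used only as an example of a serialized accession.

```python
import re

# UniProtKB accession pattern, with an optional isoform suffix such as "-2".
ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    r"(?:-(?P<isoform>[0-9]+))?$"
)


def split_accession(identifier):
    """Return (accession, isoform number or None), or None if the pattern fails."""
    match = ACCESSION.match(identifier)
    if match is None:
        return None
    isoform = match.group("isoform")
    return identifier.split("-")[0], (int(isoform) if isoform else None)


primary = split_accession("D4Z2G1")      # primary accession of LinB
secondary = split_accession("P51698")    # secondary accession of the same entry
variant = split_accession("P21397-2")    # serialized isoform accession
entry_name = split_accession("LINB_SPHJU")  # an entry name, not an accession
```

Entry names such as LINB_SPHJU are mnemonic and may change, while accession numbers are the stable identifiers to use in cross-references.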
Subcellular Location
UniProt Annotation / GO Annotation
• Periplasm (1 publication)
Keywords
• Cellular component: #Periplasm
Phenotypes
Features
Showing features for mutagenesis.
TYPE: DESCRIPTION
• Mutagenesis (five entries shown): four report loss of activity, one reports activity relative to wild type (1 publication each)
Describes the effect of mutations on the activity of the protein
Mutations are mapped on the protein sequence
PTM/Processing
Features
Showing features for initiator methionine, chain.
TYPE: ID, DESCRIPTION
• Initiator methionine: Removed (2 publications)
• Chain: PRO_0000216778, Haloalkane dehalogenase
Describes post-translational modifications and other processing of the protein (e.g. cleavage for activation). Positions are mapped on the protein sequence.
Expression
• Induction: constitutively expressed.
• Describes the expression conditions of the protein
Interaction
• Subunit: monomer (1 publication)
• Protein-protein interaction databases: STRING 452662.SJA_C1-19590
• Refers to the quaternary structure of the protein
• Describes its native oligomeric state
• Lists interactions with other proteins
Structure
SOURCE	IDENTIFIER	METHOD	RESOLUTION	CHAIN	POSITIONS	LINKS
PDB	1CV2	X-ray	1.58 Å	A	1-296	PDBe · RCSB-PDB · PDBj · PDBsum
PDB	1D07	X-ray	2.00 Å	A	1-296	PDBe · RCSB-PDB · PDBj · PDBsum
PDB	1G42	X-ray	1.80 Å	A	1-296	PDBe · RCSB-PDB · PDBj · PDBsum
PDB	1G4H	X-ray	1.80 Å	A	1-296	PDBe · RCSB-PDB · PDBj · PDBsum
PDB	1G5F	X-ray	1.80 Å	A	1-296	PDBe · RCSB-PDB · PDBj · PDBsum
• Displays available tertiary structures (experimentally determined) for the protein
• Links to AlphaFold predictions if available (covered later)
• Describes secondary structure content mapped to the sequence
• Links to databases with 3D structure models
Structure
Features
Showing features for beta strand, helix, turn (combined sources).
3D structure databases
• SMR: D4Z2G1
• ModBase: search
• PDBe-KB: search
Family & Domains
Features
Showing features for domain.
• Domain (positions 31-15…): AB hydrolase-1 (1 automatic annotation)
Similarity
Belongs to the haloalkane dehalogenase family. Type 2 subfamily. (1 automatic annotation)
Phylogenomic databases
• HOGENOM: CLU_020336_13_3_5
• OMA: TLFCQDW
• eggNOG: COG0596 (Bacteria)
Family and domain databases
• Gene3D: 3.40.50.1820 (1 hit)
• HAMAP: MF_01231, Haloalk_dehal_type2 (1 hit)
• InterPro: IPR029058 AB_hydrolase; IPR000073 AB_hydrolase_1; IPR000639 Epox_hydrolase-like; IPR023594 Haloalkane_dehalogenase_2
• PRINTS: PR00412, EPOXHYDRLASE
• Pfam: PF00561, Abhydrolase_1 (1 hit)
• SUPFAM: SSF53474 (1 hit)
• MobiDB: search
• ProtoNet: search
Cross-references to motif and profile databases
Convenient for finding other proteins that share one particular sequence feature.
Sequence
Length 296 · Mass (Da) 33,108 · Last updated 2010-06-15 v1 · Checksum 6EEE011B157DBAE1
        10         20         30         40         50         60         70         80         90
MSLGAKPFGE KKFIEIKGRR MAYIDEGTGD PILFQHGNPT SSYLWRNIMP HCAGLGRLIA CDLIGMGDSD KLDPSGPERY AYAEHRDYLD
       100        110        120        130        140        150        160        170        180
ALWEALDLGD RVVLVVHDWG SALGFDWARR HRERVQGIAY MEAIAMPIEW ADFPEQDRDL FQAFRSQAGE ELVLQDNVFV EQVLPGLILR
       190        200        210        220        230        240        250        260        270
PLSEAEMAAY REPFLAAGEA RRPTLSWPRQ IPIAGTPADV VAIARDYAGW LSESPIPKLF INAEPGALTT GRMRDFCRTW PNQTEITVAG
       280        290
AHFIQEDSPD EIGAAIAAFV RRLRPA
When multiple isoforms exist due to alternative splicing, the different sequences are listed here with serialized accession codes (e.g. P21397-1, P21397-2).
Keywords
• Technical term: #3D-structure, #Direct protein sequencing, #Reference proteome
Genome annotation databases
• EnsemblBacteria: BAI96793; SJA_C1-19590
• KEGG: sjp:SJA_C1-19590
Sequence databases
• EMBL / GenBank / DDBJ: D14594 (Genomic DNA; translation: BAA03443.2)
• EMBL / GenBank / DDBJ: AP010803 (Genomic DNA; translation: BAI96793.1)
• PIR: A49896
• RefSeq: WP_013040256.1; NC_014006.1
Similar Proteins
Tabs: 100% identity · 90% identity · 50% identity
Cluster UniRef100_D4Z2G1 (LINB_SPHJU):
• A0A258B05… · Haloalkane dehalogenase · Sphingopyxis lindanitolerans
• A8CFB7 · Haloalkane dehalogenase · Sphingobium indicum
• A8CFC0 · Haloalkane dehalogenase · Sphingobium sp. SSW-4
• 1 more
Retrieve groups of proteins that are 100%, at least 90%, or at least 50% identical
Protein Clusters: UniRef. Clusters of protein sequences at 100%, 90% & 50% identity
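The identity thresholds behind these clusters can be illustrated with a small calculation on two already aligned sequences. This is a simplified sketch: the real UniRef procedure also applies overlap criteria and works relative to a cluster seed, and both function names are made up for the example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over the aligned (non-gap) columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def uniref_levels(identity):
    """Identity thresholds (100, 90, 50) that a pair with this identity satisfies."""
    return [level for level in (100, 90, 50) if identity >= level]


# Two short made-up fragments that differ at two of eight positions.
identity = percent_identity("VIAEPEGT", "VLVEPEGT")
levels = uniref_levels(identity)
```

Lower thresholds group more distant relatives, so the 50% level gives the broadest set of similar proteins to sample from.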
Uses for protein sequences
What can we do with protein sequences and computers?
Different properties or characteristics of a protein can be predicted from its primary sequence:
• Secondary structure
• Solvent accessibility
• Solubility/expressibility
• Transmembrane regions
The methods that make such predictions improve if they consider evolutionary information
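As a minimal illustration of predicting a property from sequence alone, the sketch below computes a sliding-window hydropathy profile with the Kyte-Doolittle scale and flags strongly hydrophobic windows as candidate transmembrane regions. This classic heuristic is swapped in here for illustration; it is not the method used by current predictors, and the window size of 19 and the threshold of 1.6 are the conventional choices for this scale rather than values taken from these slides.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every window of the given length."""
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(sequence) - window + 1)
    ]


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start positions (0-based) of windows hydrophobic enough to span a membrane."""
    profile = hydropathy_profile(sequence, window)
    return [start for start, value in enumerate(profile) if value >= threshold]


# Made-up toy sequences: a leucine stretch versus a lysine stretch.
hydrophobic_hits = candidate_tm_windows("L" * 19)
charged_hits = candidate_tm_windows("K" * 30)
```

A single-sequence heuristic like this is noisy, which is one reason why methods that add evolutionary information perform better.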
Protein sequences can also be directly "compared" with one another, so that their similarities or differences can be assessed.
Alignments are models that aim to pair the most similar parts among different proteins. If the model considers evolutionary information (and biologically relevant protein alignments do), evolutionary relationships (homology) can be inferred from sequence similarity.
Analysis Tools
• BLAST: search with a sequence to find homologs through pairwise sequence alignment
• Align: align two or more protein sequences with Clustal Omega to find conserved regions
3-Binf DB & Str. Pred -> Sequence alignments
[Figure: Darwin's finches. 1. Geospiza magnirostris. 2. Geospiza fortis. 3. Geospiza parvula. 4. Certhidea olivacea.]
"[...] one might really fancy that from an original paucity of birds in this archipelago, one species had been taken and modified for different ends"
3-Binf DB & Str. Pred -> Seq align -> Evolution of proteins
Darwinian ideas on evolution:
All species of organisms arise and develop through the natural selection of small, inherited variations that increase the individual's ability to compete, survive, and reproduce (biological fitness).
Inter-individual differences need to be:
• Small
• Inheritable
There exists a natural selective pressure.
Variations that make an individual fitter (improve its functions) to the conditions of the selective pressure are more likely to be transmitted to next generations.
Accumulation of variation causes speciation.
A few words on molecular evolution
Improved function in a given environment (adaptation) is a key concept in evolution.
How does this apply to proteins?
How do proteins function?
• Molecular catalyst [gift box]
• Molecular pore [tube]
Function is dictated by shape (3D structure)
Structure is determined by sequence
3-Binf DB & Str. Pred -> Seq align -> Seq/Str/Function Paradigm
Sequence, Structure, Function Paradigm
□ 3D structure is determined by the sequence
□ Function is dictated by 3D structure
MSLGAKPFGEKKFIEIKGRRMAYIDEGTGDPILFQHGNPTSSYLWRNIMPHCAGLGRLIACDLIGMGDSDKLDPSGPERYAYAEHRDYLDALWEALDLGDRVVLVVHDWGSALGFDWARRHRERVQGIAYMEAIAMPIEWADFPEQDRDLFQAFRSQAGEELVLQD
sequence -> structure -> function
A few words on molecular evolution
□ Innovation happens at the sequence level
• Mutations (small changes) introduced in DNA (inheritable)
• Subsequently transcribed, processed, and translated into polypeptide chains (proteins)
□ Selective pressure operates at the function level
• Proteins working better in their environments make individuals fitter (example: adaptation in the human lineage)
Schaffner S. & Sabeti P. (2008) Evolutionary adaptation in the human lineage. Nature Education 1:14.
Homology: two proteins are homologous if they are the products of genes that evolved from the same ancestor
[Figure: homologous proteins compared at the sequence, structure and function levels (bird example), shown in three steps: Diversity, Paralogs, Annotation problem]
Sequence alignments
Alignments are models that aim to pair the most similar parts among different proteins.
• Global alignments: consider similarity across the entire sequence
• Local alignments: consider similarity across sequence fragments
• Pairwise alignments: two sequences compared
• Multiple sequence alignments: multiple sequences compared
3-Binf DB & Str. Pred -> Seq align -> Alignments -> Classification
Pairwise alignment techniques:
• Dot-plot methods
• Dynamic programming algorithms:
  • Needleman & Wunsch (global)
  • Smith & Waterman (local)
• Word methods
Multiple sequence alignment techniques:
• Dynamic programming
• Progressive methods
• Iterative methods
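The two dynamic programming algorithms named above differ only in how the score table is initialized and whether scores may drop below zero. The sketch below computes the optimal score for both variants in one function. It is a teaching sketch under simple assumptions: match = +1, mismatch = -1 and a linear gap penalty of -2 are arbitrary illustrative values, only the score is returned (no traceback), and the function name is hypothetical.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score: Needleman-Wunsch (global) or Smith-Waterman (local)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                table[i - 1][j - 1] + pair,  # align the two residues
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # a local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


# Peptide fragments like those in the alignment illustration.
global_score = align_score("SFDGIWKA", "DGIWKAS")
local_score = align_score("SFDGIWKA", "DGIWKAS", local=True)
```

The local score only rewards the shared fragment DGIWKA, while the global score also pays for the unmatched ends, which is the practical difference between the two approaches.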
How can similarity among different parts of proteins be measured?
3-Binf DB & Str. Pred -> Seq align -> Alignments -> Substitution Models
Similarity between amino acids:
A. Amino acids with electrically charged side chains
• Positive: Arginine (Arg), Histidine (His), Lysine (Lys)
• Negative: Aspartic acid (Asp), Glutamic acid (Glu)
C. Special cases
• Cysteine (Cys), Glycine (Gly), Proline (Pro)
D. Amino acids with hydrophobic side chains
• Alanine (Ala), Valine (Val), Isoleucine (Ile), Leucine (Leu), Methionine (Met), Phenylalanine (Phe), Tyrosine (Tyr), Tryptophan (Trp)
Assessing similarity in pairs of amino acids:
• Each possible pair of amino acids is given a substitution score (substitution matrix)
• Amino acids from the (two) sequences should be paired so that the total alignment score is optimized.
• Sometimes no good pairing can be found and a gap needs to be introduced.
• Gaps require a special penalty (negative score) in order to force longer and biologically meaningful alignments.
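The scoring scheme described above can be made concrete by scoring a given alignment column by column. In this sketch the substitution score is passed in as a function, gaps are written as `-`, and the gap penalty of -2 and the +1/-1 toy substitution scores are illustrative assumptions, not values from the slides.

```python
def score_alignment(row_a, row_b, substitution, gap=-2):
    """Total score of a fixed alignment: substitution scores plus gap penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap                 # column containing a gap
        else:
            total += substitution(x, y)  # column pairing two residues
    return total


def toy_substitution(x, y):
    """Toy substitution score: +1 for identical residues, -1 otherwise."""
    return 1 if x == y else -1


# Same two fragments, aligned with and without gaps around the shared part.
gapped = score_alignment("SFDGIWKA-", "--DGIWKAS", toy_substitution)
shifted = score_alignment("SFDGIWKA", "DGIWKAS-", toy_substitution)
```

Here the three gaps are worth paying for because they bring six identical residues into register; an alignment algorithm searches for the pairing that maximizes this total.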
• Identity matrix (dot-matrix plots):
  • 1 if same amino acid
  • 0 otherwise
  -> Limited model: forces the introduction of too many gaps.
[Figure: example dot-matrix plot]
• Substitution models:
  • Score depends on the probability of observing a substitution (mutation) of one particular amino acid for another (e.g. Arg -> Lys should score better than Arg -> Glu)
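The Arg/Lys versus Arg/Glu example can be checked against a real substitution matrix. The sketch below contrasts the identity model with a hand-copied excerpt of BLOSUM62. BLOSUM62 is swapped in only because its values are widely tabulated; the same ordering is expected from any evolution-aware matrix. The dictionary name and helper functions are hypothetical.

```python
# Hand-copied excerpt of BLOSUM62 for Arg (R), Lys (K) and Glu (E).
BLOSUM62_EXCERPT = {
    ("R", "R"): 5, ("K", "K"): 5, ("E", "E"): 5,
    ("R", "K"): 2, ("R", "E"): 0, ("K", "E"): 1,
}


def identity_score(x, y):
    """Identity model: 1 if same amino acid, 0 otherwise."""
    return 1 if x == y else 0


def substitution_score(x, y):
    """Look up a pair in the excerpt, in either order."""
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]


# The identity model cannot tell the two substitutions apart...
identity_view = (identity_score("R", "K"), identity_score("R", "E"))
# ...while the substitution model favors the conservative Arg -> Lys change.
substitution_view = (substitution_score("R", "K"), substitution_score("R", "E"))
```

Both Arg and Lys carry a positive charge, so exchanging them is tolerated more often in evolution than replacing Arg with the negatively charged Glu.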
Substitution models include evolutionary information
Margaret Dayhoff: Atlas of Protein Sequence and Structure
[Figure: pages from the Atlas of Protein Sequence and Structure]
Substitution models include evolutionary information
Dayhoff Mutation Data Matrix
• Scores are based on the concept of the Point Accepted Mutation (PAM).
• Evolutionary distance: 1 PAM = the time in which 1 in 100 amino acids is expected to mutate.
• Longer evolutionary distances are inferred from a Markov chain model: the PAM1 matrix is multiplied by itself (PAMn = PAM1 raised to the n-th power).
• The PAM250 matrix targets the limit at which it is still safe to infer homology between proteins (the twilight zone).
• Limitation: derived from 1572 observed mutations in (manual) alignments of sequences >85% identical.
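The Markov-chain extrapolation can be sketched as repeated matrix multiplication. The three-letter alphabet and the probabilities in `TOY_PAM1` are invented for illustration (the real PAM1 matrix is 20 x 20), and rows are taken here as the original residue and columns as the replacement.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise matrix m to the n-th power (identity matrix for n = 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 1-PAM step: each residue stays unchanged with probability 0.99.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])  # each row still sums to 1
```

After 250 steps the diagonal entries are far below 0.99, which is why matrices for large PAM distances tolerate many more substitutions.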
Substitution models include evolutionary information
PAM250
Columns: original amino acid; rows: replacement amino acid (? = value not legible in the source)
         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
       Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
A Ala   13   6   9   0   5   8   9  12   ?   8   6   7   7   4  11  11  11   2   4   ?
R Arg    3   ?   4   3   2   5   3   2   6   3   2   9   4   1   ?   ?   ?   7   2   2
N Asn    4   ?   6   3   2   5   6   ?   6   3   2   5   3   2   4   ?   4   2   3   3
D Asp    5   4   ?  11   1   7  10   5   6   3   2   5   3   ?   4   5   5   1   2   3
C Cys    ?   ?   ?   1  52   ?   1   2   2   2   1   1   1   1   2   3   2   1   4   2
Q Gln    3   ?   5   6   ?  10   7   ?   7   2   3   ?   2   ?   4   3   3   1   2   3
E Glu    5   4   7  11   1   ?  12   5   6   3   2   5   3   1   4   5   5   1   2   3
G Gly   12   ?  10  10   4   7   9  27   5   5   4   6   5   3   ?   ?   ?   2   3   7
H His    2   5   5   4   2   1   4   2  15   2   2   3   2   2   3   3   2   2   3   2
I Ile    3   2   2   2   2   2   2   2   2  10   5   2   6   5   2   3   4   1   3   9
L Leu    6   4   ?   3   2   6   4   ?   5  15  34   4  20  13   5   4   6   6   7  13
K Lys    ?  15  10   8   2  10   8   5   3   5   4  24   ?   2   6   8   8   4   3   5
M Met    ?   1   1   1   ?   ?   1   ?   1   2   3   2   6   2   1   1   1   1   1   2
F Phe    2   1   2   ?   1   ?   1   1   3   5   6   1   4  32   1   ?   2   4  20   3
P Pro    7   ?   5   1   3   5   4   5   5   3   ?   4   2   2  20   ?   5   1   2   4
S Ser    9   5   8   7   7   6   7   0   6   5   4   7   5   3   9  13   9   4   4   6
T Thr    3   ?   6   6   4   5   5   5   4   6   4   6   5   ?   6   3  11   2   3   6
W Trp    ?   ?   0   0   0   0   0   ?   ?   0   1   0   0   ?   0   1   0  55   1   0
Y Tyr    1   1   ?   1   3   ?   ?   1   3   2   2   1   2  15   1   ?   2   3  31   2
V Val    7   4   4   ?   4   4   4   5   4  15  10   4   ?   5   5   5   7   2   4   ?
Substitution models include evolutionary information
BLOSUM matrices
• BLOcks SUbstitution Matrix
• Derived from blocks of aligned sequences in the BLOCKS database; implicitly represents distant relationships.
• Bias from identical sequences is removed by clustering at a sequence identity threshold.
• BLOSUM62 = matrix derived from sequences clustered at 62% or greater identity
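BLOSUM scores are log-odds values: the observed frequency of a residue pair in the blocks is compared with the frequency expected by chance. The sketch below assumes the usual half-bit scaling (2 x log2, rounded) and uses an invented two-letter alphabet with made-up pair frequencies; the clustering step is not shown.

```python
import math


def log_odds_scores(pair_freq, scale=2.0):
    """pair_freq maps unordered pairs (a, b) to observed frequencies that sum to 1."""
    # Background frequency of each residue, derived from the pair frequencies.
    p = {}
    for (a, b), q in pair_freq.items():
        if a == b:
            p[a] = p.get(a, 0.0) + q
        else:
            p[a] = p.get(a, 0.0) + q / 2
            p[b] = p.get(b, 0.0) + q / 2
    scores = {}
    for (a, b), q in pair_freq.items():
        # Chance expectation: p_a * p_a for identical pairs, 2 * p_a * p_b otherwise.
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(q / expected))
    return scores


# Hypothetical observations: identical pairs are over-represented.
toy_pairs = {("A", "A"): 0.4, ("A", "B"): 0.2, ("B", "B"): 0.4}
print(log_odds_scores(toy_pairs))
```

Pairs seen more often than expected get positive scores, and pairs seen less often get negative scores.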
PAM | BLOSUM
Similar proteins compared as a whole | Conserved blocks (fragments) compared
PAM1 corresponds to 1 mutated residue in 100 (99% identity) | BLOSUMn is derived from sequences clustered at n% identity
Other PAM matrices are extrapolated from PAM1 | Each matrix is based on observed alignments
Higher numbers: more evolutionary distance | Higher numbers: more similarity (less evolutionary distance)
Roughly corresponding matrices:
PAM100 | BLOSUM90
PAM120 | BLOSUM80
PAM160 | BLOSUM62
PAM200 | BLOSUM50
PAM250 | BLOSUM45
Dynamic Programming Algorithm
Matrix:
• Each dimension corresponds to one of the proteins to be aligned.
• Each cell contains the score value from the substitution model corresponding to the residue pair.
• Diagonal transitions represent aligned positions
• Vertical and horizontal transitions represent gaps and are penalized.
• The final alignment corresponds to the path in the matrix that maximizes the score.
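The matrix procedure described above can be sketched as a global alignment with a linear gap penalty. This is a minimal illustration: the function names, the +1/-1 scoring, and the gap penalty are assumptions, and a real tool would use a substitution matrix and usually affine gap penalties.

```python
def needleman_wunsch(seq_a, seq_b, score, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap  # leading gaps in seq_b
    for j in range(1, m + 1):
        F[0][j] = j * gap  # leading gaps in seq_a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # diagonal: aligned pair
                F[i - 1][j] + gap,                                    # vertical: gap in seq_b
                F[i][j - 1] + gap,                                    # horizontal: gap in seq_a
            )
    # Back-trace from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def match_mismatch(a, b):
    """Toy substitution model: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(needleman_wunsch("ACGT", "AGT", match_mismatch, gap=-1))  # (2, 'ACGT', 'A-GT')
```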
Dynamic Programming Algorithm
Pair of protein sequences:
U   GGQLAKEEAL
T   EGQPVEVL
Optimal alignments (no gaps):
U   GGQLAKEEAL
T1         EVL
T2  EGQPVEVL
Optimal alignment (with gaps):
U   GGQLAKEEAL
T   EGQP.VE.VL
Matrix (columns: U, rows: T; ? = value not legible in the source):
    G G Q L A K E E A L
E   0 0 0 0 0 0 1 1 0 0
G   1 1 0 0 0 0 0 1 1 0
Q   0 ? ? 0 0 0 0 0 1 1
P   0 0 1 2 0 0 0 0 0 1
V   0 0 0 1 2 0 0 0 0 0
E   0 0 0 0 1 2 1 1 0 0
V   0 0 0 0 0 1 2 1 1 0
L   0 0 0 1 0 0 1 2 1 2
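Back-tracing recovers the alignment from the filled matrix:
• Global (Needleman & Wunsch): back-trace from the bottom-right corner.
• Local (Smith & Waterman): back-trace from any position.
• Advantage: deterministic.
• Drawback: computationally expensive.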
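For comparison with the global case, a local alignment can be sketched as follows. Cell values are never allowed to drop below zero and the back-trace starts at the highest-scoring cell. The function names and the +2/-1/-2 scoring scheme are assumptions chosen for the example.

```python
def smith_waterman(seq_a, seq_b, score, gap=-2):
    """Local alignment; returns (best_score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,  # a local alignment may start anywhere
                H[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Back-trace from the best cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def local_score(a, b):
    """Toy substitution model: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


print(smith_waterman("TTTACGTTT", "GGACGGG", local_score, gap=-2))  # (6, 'ACG', 'ACG')
```

Both variants fill a table with len(seq_a) x len(seq_b) cells, which is the source of the computational cost.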
Word methods
• Short non-overlapping sequence stretches (k-tuples or words) are identified in the query sequence and matched in the target sequence(s).
• The relative positions of a matching word define an offset (its position in one sequence subtracted from its position in the other).
• Multiple words matching with a similar offset define a region prone to alignment.
• Alignments are subsequently extended in alignment-prone regions.
• Drawback: heuristic, the optimal alignment is not guaranteed.
• Advantage: efficient for database searches.
• Examples: BLAST, FASTA.
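The seeding step of a word method can be sketched as below. This is not the BLAST or FASTA implementation: exact word matching, the word length k = 3, and the function names are simplifying assumptions, and the extension step is omitted.

```python
from collections import Counter


def word_hits(query, target, k=3):
    """Match non-overlapping query words against every k-length word of the target.

    Returns a list of (query_pos, target_pos, offset) with offset = target_pos - query_pos.
    """
    index = {}
    for j in range(len(target) - k + 1):
        index.setdefault(target[j:j + k], []).append(j)
    hits = []
    for i in range(0, len(query) - k + 1, k):  # step k: non-overlapping words
        for j in index.get(query[i:i + k], []):
            hits.append((i, j, j - i))
    return hits


def most_common_offset(query, target, k=3):
    """Offset shared by the most word hits, as (offset, count), or None without hits."""
    counts = Counter(offset for _, _, offset in word_hits(query, target, k))
    return counts.most_common(1)[0] if counts else None


# Two words (GQL, AKE) hit the target with the same offset of 2.
print(most_common_offset("GQLAKE", "XXGQLAKEXX"))  # (2, 2)
```

A region where several hits share an offset lies on one diagonal of the dot matrix and is the candidate for extension.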
3-Binf DB & Str. Pred -> Seq align -> Alignments -> Pairwise alignment
Sequence alignments
Analysis Tools
BLAST		MMII..I IIMIK.IttfctM      i M IM IK1
		^"Tl , 1 u i^V |SI            K \>1 iEVl)('lu K xw
		
		
Search with a sequence to find homology		Align two or more protetn sequences
through i '■ >      ■ sequence alignment		with r   , ■ il Omega to find conserved
		
Multiple sequence alignments
* Dynamic programming algorithm (N-dimensional matrix)
* Progressive methods
• First align the most similar pair
• Subsequently add less similar sequences
• Sensitive to inaccuracies in the similarity estimate (e.g. due to differences in sequence length)
• Example: CLUSTAL
• T-Coffee considers additional information (slow)
* Iterative methods
• Initial global alignment
• Objective function (based on score) to optimise the similarity assessment; choose the best.
• All possible remaining sequence subsets are re-aligned and re-scored
• The best subset is included in the alignment at each iteration
• Typically slower, more accurate
• Examples: MUSCLE, MAFFT.
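The ordering used by progressive methods (most similar pair first, then increasingly dissimilar sequences) can be sketched as follows. This only computes the order of addition, not the alignment itself, and it is not how CLUSTAL works internally: `difflib.SequenceMatcher` from the Python standard library is used as a crude stand-in for a pairwise alignment score, and the sequences are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Crude similarity in [0, 1]; a real tool would use pairwise alignment scores."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def progressive_order(seqs):
    """Return sequence names in the order a progressive method would add them."""
    names = list(seqs)
    # Start with the most similar pair.
    first = max(combinations(names, 2),
                key=lambda pair: similarity(seqs[pair[0]], seqs[pair[1]]))
    order = list(first)
    remaining = [name for name in names if name not in order]
    # Then repeatedly add the sequence closest to anything already included.
    while remaining:
        nxt = max(remaining,
                  key=lambda name: max(similarity(seqs[name], seqs[o]) for o in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


toy = {"s1": "GGQLAKEEAL", "s2": "GGQLAKEDAL", "s3": "EGQPVEVL", "s4": "MKTWWYFFPPHH"}
print(progressive_order(toy))  # ['s1', 's2', 's3', 's4']
```

Because early decisions are never revisited, an inaccurate similarity estimate at the start propagates into the final alignment, which is the weakness that iterative methods address.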
Beyond pure sequences: patterns and models
• Aligned sequences can be used to define patterns, which can then be used to search databases.
• Position Specific Scoring Matrices
• Hidden Markov Models
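A position-specific scoring matrix can be derived from an alignment roughly as sketched below. The uniform background frequencies, the pseudocount of 1, and the toy alignment are assumptions for illustration; real profiles use observed background frequencies and sequence weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """One dict per alignment column: log2 of (column frequency / background frequency)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != "-")  # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_window(pssm, window):
    """Score a sequence window of the same length as the profile."""
    if len(window) != len(pssm):
        raise ValueError("window and profile must have the same length")
    return sum(column[aa] for column, aa in zip(pssm, window))


profile = build_pssm(["GQLA", "GQIA", "GELA"])
print(round(score_window(profile, "GQLA"), 2), round(score_window(profile, "WWWW"), 2))
```

Sliding `score_window` along a database sequence gives a simple profile search; hidden Markov models extend the same idea with position-specific insertions and deletions.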
Summary of 1D predictions
Different properties or characteristics of a protein can be predicted from its primary sequence:
• Secondary structure
• Solvent accessibility
• Solubility/expressability
• Transmembrane regions
The methods that do such predictions improve if they consider evolutionary information
Secondary structure prediction
□ prediction of the conformational state of each amino acid (AA) residue of a protein sequence as one of the possible states:
■ helix (H)
■ strand (S)
■ coil (C)
Secondary structure prediction
□ amino acid propensities derived from known 3D structures
■ probability of a particular AA for a particular secondary structure state
■ first-generation methods - low accuracy
□ propensities of segments of adjacent residues
■ local environment of residues considered (3-51 consecutive residues)
■ second-generation methods - accuracy ~ 60 % - 65 %
□ evolutionary information combined with machine learning
■ training set - sequence profiles associated with a particular secondary structure arrangement (based on known 3D structures)
■ sequence profiles derived from family sequence alignments
■ state-of-the-art methods - accuracy ~ 70 % - 80 %
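The single-residue propensities used by first-generation methods can be illustrated as the ratio between how often an amino acid occurs in a given state and how often it occurs overall. The function name and the four-residue training example are invented; real propensities are derived from many known 3D structures.

```python
from collections import Counter


def propensities(sequences, states):
    """Propensity of residue aa for state s: (n(aa, s) / n(s)) / (n(aa) / N)."""
    pair_counts, aa_counts, state_counts = Counter(), Counter(), Counter()
    for seq, ss in zip(sequences, states):
        if len(seq) != len(ss):
            raise ValueError("sequence and state string must have the same length")
        for aa, s in zip(seq, ss):
            pair_counts[(aa, s)] += 1
            aa_counts[aa] += 1
            state_counts[s] += 1
    total = sum(aa_counts.values())
    return {(aa, s): (pair_counts[(aa, s)] / state_counts[s]) / (aa_counts[aa] / total)
            for aa in aa_counts for s in state_counts}


# Hypothetical training data: Ala only in helix (H), Gly only in coil (C).
table = propensities(["AAGG"], ["HHCC"])
print(table[("A", "H")], table[("A", "C")])  # 2.0 0.0
```

Values above 1 mean the residue is over-represented in that state. Second-generation methods apply the same idea to segments of adjacent residues.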
Secondary structure prediction programs
□ PSI-PRED
■ http://bioinf.cs.ucl.ac.uk/psipred/
■ combination of PSI-BLAST profiles and neural networks
■ careful selection of sequences used for profile construction
[Figure: PSI-PRED output, sequence positions 130 to 160]
Pred: HHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCEEEEEEE
AA:   QQMNQKAVTSFLSVQDGIYNSDLTPKSDIKNPDVWYEFF
Legend: Conf = confidence of prediction; Pred = predicted secondary structure (helix, strand, coil); AA = target sequence
Secondary structure prediction programs
□ Quick2D (MPI toolkit)
■ https://toolkit.tuebingen.mpg.de/tools/quick2d
■ overview of secondary structure features (α-helices, extended β-strands, coiled coils, transmembrane helices, disorder regions)
■ predictions by PSI-PRED, JNET, Prof, Coils, MEMSAT2, HMMTOP, ...
[Figure: Quick2D output for an example sequence; rows: SS (PSIPRED, JNET, Prof (Ouali)), CC (Coils), TM (HMMTOP, MEMSAT-SVM, PHOBIUS), DO (DISOPRED2, IUPRED), SO (JNET)]
Secondary structure prediction programs
□ GeneSilico metaserver
■ https://genesilico.pl/meta2/
■ meta-server for protein structure prediction, including secondary structure prediction
[Figure: GeneSilico metaserver secondary structure prediction for an example sequence; rows: sspro4, cdm, psipred, fdm, jnet, porter, sable, prof, gor, consensus]
Solvent accessibility prediction
□ prediction of the extent to which a residue embedded in a protein structure is accessible to solvent
■ comparison of accessibility of different amino acids - relative values (actual area as percentage of maximally accessible area)
■ simplified two state description - buried vs. exposed residues
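The relative value and the two-state description can be written out as a short worked example. The surface areas below are made-up numbers, and the 25% threshold is an assumption chosen for illustration; different methods use different cut-offs.

```python
def relative_accessibility(asa, max_asa):
    """Actual accessible area as a percentage of the maximally accessible area."""
    return 100.0 * asa / max_asa


def two_state(rsa_percent, threshold=25.0):
    """Simplified two-state description: exposed at or above the threshold, else buried."""
    return "exposed" if rsa_percent >= threshold else "buried"


# Hypothetical residue: 30 square angstroms accessible out of a maximum of 120.
rsa = relative_accessibility(30.0, 120.0)
print(rsa, two_state(rsa))  # 25.0 exposed
```

Using relative values makes a small residue such as Gly comparable with a large one such as Trp.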
□ residue hydrophobicity
■ very hydrophobic stretches are predicted as buried
□ propensities of single residues or segments of residues to be solvent accessible
■ superior to simple hydrophobicity analyses
□ evolutionary information
■ solvent accessibility at each position of a protein structure is evolutionarily conserved within sequence families -> methods using multiple sequence alignment information
■ prediction accuracy above 75%
Solvent accessibility prediction programs
□ PHD
■ http://www.predictprotein.org/
■ combination of evolutionary information with neural network
□ PROFphd
■ http://www.predictprotein.org/
■ improved version of PHD
■ combination of evolutionary information and secondary structure prediction with neural network
■ trained only on high resolution structures
□ SABLE2
■ http://sable.cchmc.org/
■ combines solvent accessibility and secondary structure predictions
□ GeneSilico metaserver
■ https://genesilico.pl/meta2/
■ meta-server for structure prediction, including solvent accessibility
[Figure: GeneSilico metaserver solvent accessibility (protein solvation) prediction for an example sequence; rows: netsurfp_sol25, soprano_sol25, sable_acc, spine_sol25, spineX_sol25, paleale_sol25, accpro_sol25, jnet_sol25, paleale_sol5]
Solubility and expressability prediction
□ Complicated definition of the property:
■ prediction of the extent to which a given sequence will produce a soluble protein in a given expression system, or
■ prediction of aggregation propensity
□ Methods rely heavily on machine learning.
Solubility and expressability prediction
□ Methods based on:
■ Plain protein sequences
■ Evolutionary information implicit in the learning data
■ SOLpro http://scratch.proteomics.ics.uci.edu
■ ESPRESSO http://mbs.cbrc.jp/ESPRESSO
■ SoluProt https://loschmidt.chemi.muni.cz/soluprot/
■ Sequence profiles
■ Evolutionary Information implicit in the profile
■ AGGRESCAN http://bioinf.uab.es/aggrescan/
■ TANGO http://tango.crg.es
■ PASTA http://protein.cribi.unipd.it/pasta/
Transmembrane region prediction
□ transmembrane (TM) proteins - challenge for experimental determination of 3D structure -> structure prediction needed even more than for globular water-soluble proteins
□ two major classes of integral membrane proteins
■ transmembrane helices (TMH)
■ transmembrane beta-strand barrels (TMB)
[Figure: TMH example: bacteriorhodopsin (PDB ID 1ap9); TMB example: matrix porin (PDB ID 2omf)]
Transmembrane region prediction
□ prediction of TMH simplified by strong environmental constraints - lipid bilayer of the membrane
■ TMHs are predominantly apolar and 12-35 residues long (hydrophobicity)
■ specific distribution of Arg and Lys (positively charged) -> connecting loop regions at the inside of the membrane have more positive charges than loop regions at the outside = positive-inside rule
[Figure: topology of a transmembrane helical protein; labels: NH2, COOH, cytoplasmic]
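The positive-inside rule can be turned into a simple orientation guess once the helix positions are known. The function names, the helix coordinates (0-based, end-exclusive), and the toy sequences are assumptions made for this sketch; real predictors combine the rule with other signals.

```python
def count_positive(segment):
    """Number of Lys (K) and Arg (R) residues in a loop segment."""
    return sum(segment.count(aa) for aa in "KR")


def predict_n_terminus(sequence, helices):
    """Guess whether the N-terminus is 'inside' (cytoplasmic) or 'outside'.

    helices: sorted list of (start, end) positions of the transmembrane helices.
    Loops alternate sides, so every second loop lies on the N-terminal side.
    """
    loops, previous_end = [], 0
    for start, end in helices:
        loops.append(sequence[previous_end:start])
        previous_end = end
    loops.append(sequence[previous_end:])
    n_side = sum(count_positive(loop) for loop in loops[0::2])
    other_side = sum(count_positive(loop) for loop in loops[1::2])
    return "inside" if n_side >= other_side else "outside"


# Hypothetical two-helix protein with K/R-rich terminal loops.
toy = "MKRK" + "L" * 20 + "DDSE" + "I" * 20 + "RRKK"
print(predict_n_terminus(toy, [(4, 24), (28, 48)]))  # inside
```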
Transmembrane region prediction
□ prediction of TMB
■ transmembrane beta-strands contain 10 - 25 residues
■ only every second residue faces the lipid bilayer and is hydrophobic; the other residues face the pore of the β-barrel and are more hydrophilic -> analysis of hydrophobicity NOT useful for TMB prediction
Transmembrane region prediction
□ hydrophobicity-based methods (for TMH)
■ hydrophobicity along the sequence, hydrophobic moment or other membrane-specific amino acid preferences
■ averaging hydrophobicity values over windows of adjacent residues
■ prediction of orientation of TMH using positive-inside rule
□ evolutionary information combined with machine learning or hidden Markov models (for TMH)
■ superior to methods based solely on hydrophobicity
□ evolutionary information combined with machine learning or hidden Markov models (for TMB)
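A hydrophobicity-based scan for TMH candidates can be sketched with a sliding window. The scale below is the Kyte-Doolittle hydropathy scale; the window of 19 residues and the threshold of 1.6 are the commonly cited settings for that scale and are used here as assumptions, as are the function names and the toy sequences.

```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19):
    """Average hydropathy for every full window; entry k is centred on k + window // 2."""
    half = window // 2
    return [sum(KYTE_DOOLITTLE[aa] for aa in seq[c - half:c + half + 1]) / window
            for c in range(half, len(seq) - half)]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Runs of window centres whose average hydropathy reaches the threshold."""
    half = window // 2
    segments, start = [], None
    for k, value in enumerate(window_hydropathy(seq, window)):
        centre = k + half
        if value >= threshold and start is None:
            start = centre
        elif value < threshold and start is not None:
            segments.append((start, centre - 1))
            start = None
    if start is not None:
        segments.append((start, len(seq) - half - 1))
    return segments


# Hypothetical sequence: a 25-residue Leu stretch flanked by charged residues.
toy = "DEKRDEKRDE" * 2 + "L" * 25 + "DEKRDEKRDE" * 2
print(candidate_tm_segments(toy))
```

The same scan gives no useful signal for beta-barrel strands, where hydrophobic and hydrophilic residues alternate within the window.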
Transmembrane region prediction programs
□ no appropriate estimate of performance available
■ insufficient number of high-resolution structures (needed for a statistically significant analysis)
■ accuracy reported in papers is usually largely overestimated: methods perform much better on the proteins for which they were developed than on new proteins
■ the best methods for TMH estimated to have ~70% accuracy
Transmembrane region prediction programs
□ TMHMM 2.0
■ http://www.cbs.dtu.dk/services/TMHMM/
■ a number of statistical preferences and rules embedded in a hidden Markov model -> localization and orientation of TMH
[Figure: TMHMM posterior probabilities for sp_P785B8_FREL_CANAL; curves: transmembrane, inside, outside]
Transmembrane region prediction programs
□ TOPCONS
■ http://topcons.cbr.su.se/
■ consensus prediction of TMHs
[Figure: TOPCONS output; rows: SCAMPI-seq, SCAMPI-msa, PRODIV, PRO, OCTOPUS; legend: inside, outside, TM-helix (IN->OUT), TM-helix (OUT->IN), reentrant region]
Transmembrane region prediction programs
□ TBBpred
■ http://www.imtech.res.in/raghava/tbbpred/
■ prediction of TMB using machine learning
□ PROFtmb
■ http://www.predictprotein.org/
■ profile-based hidden Markov model
■ prediction of bacterial TMB
□ ...