LOSCHMIDT LABORATORIES
Bioinformatics: protein sequences and databases

□ Introduction
□ Primary sequence of proteins
□ Protein sequence databases
□ Sequence alignments
  ■ Evolution of proteins
  ■ Sequence-structure-function paradigm
  ■ Alignment of sequences
□ Prediction of protein properties from sequence

Bioinformatics databases & Structure prediction
3-Binf DB & Str. Pred -> Intro

Structure prediction
[Figure: headline "Artificial intelligence solves 50-year-old science problem" (Google DeepMind, AlphaFold)]

Let's start from the beginning...

Protein synthesis
[Figure: double strand of DNA, single coding strand of DNA, transcription into a strand of mRNA read as triplet codons, translation into a growing amino acid chain]
Protein synthesis proceeds in four steps:
• Transcription: DNA -> RNA
• Splicing: RNA -> mRNA
• Translation: mRNA -> Protein
• Post-translational modifications: protein -> mature protein

3-Binf DB & Str. Pred -> Primary sequence of proteins
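As a rough illustration of the transcription and translation steps of protein synthesis, the sketch below turns a coding DNA strand into mRNA and reads it codon by codon. It assumes the standard genetic code, ignores splicing and post-translational modifications, and expects only A, C, G, T/U characters; the names `transcribe` and `translate` are illustrative and not taken from any library.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (third base varies fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Transcription: the mRNA matches the coding strand with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translation: read triplets from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGTCTCTTGGTTGA")
    print(mrna)
    print(translate(mrna))
```

The codon table is built from a compact string rather than typed out by hand, which keeps the sketch short; a real pipeline would normally rely on an established library instead.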
Levels of protein structure
• Primary structure: amino acid sequence
• Secondary structure: α-helices
• Tertiary structure: polypeptide chain
• Quaternary structure: complex of protein molecules

Sources of protein sequences
□ Multiple databases available
□ With different scope:
  ■ Generalist: sequences from any source (UniProtKB)
  ■ Specialist: sequences focusing on one or more specific conditions (e.g. biological pathway, disease, organism) (WormBase)
□ With different types of sequence content:
  ■ Primary sequence of proteins, with annotations and cross-references to that sequence (UniProtKB)
  ■ Motif or profile databases: contain information derived from the primary sequence, in the form of abstractions (patterns) that distil the most conserved features among related proteins (Pfam)

3-Binf DB & Str. Pred -> protein seq. databases

□ UniProtKB
  ■ Collaboration between EBI, the Swiss Institute of Bioinformatics and the Protein Information Resource (PIR)
  ■ Central repository of protein sequences and functional information
  ■ Quality annotations: information on protein function and individual amino acids, experimental information, biological ontologies, classification, links to other databases
  ■ Quality level of the annotation (manual vs. automatic)
UniProt KB
[Figure: UniProt home page]
Proteins: UniProt Knowledgebase (UniProtKB)
• Reviewed (Swiss-Prot)
• Unreviewed (TrEMBL)
Species: Proteomes
• Protein sets for species with sequenced genomes from across the tree of life
Protein Clusters: UniRef
• Clusters of protein sequences at 100%, 90% & 50% identity
Sequence Archive: UniParc
• Non-redundant archive of publicly available protein sequences seen across different databases
Supporting Data
• Taxonomy, Subcellular locations, Diseases, Keywords, Literature Citations, Cross-referenced databases, UniRule automatic annotation, ARBA automatic annotation
Analysis Tools
• BLAST: search with a sequence to find homologs through pairwise sequence alignment
• Align: align two or more protein sequences to find conserved regions
• Search with Lists / Map IDs: find proteins with lists of UniProt IDs or convert from/to other database IDs
• Search Peptides: search with a peptide sequence to find all UniProt proteins that contain exact matches

Proteins: UniProt Knowledgebase
□ Main component of the database
□ Reviewed protein entries (Swiss-Prot):
  • High-quality manual annotations -> reliable information
  • >570,000 protein records (2024)
□ Automatic protein entries (TrEMBL):
  • Automatic translation of protein sequences from the EMBL data bank
  • Automatic annotations -> lower quality, chance of errors
  • ~250,000,000 protein records (2024) (about 400x the amount of information)
UniProt KB
Species: Proteomes
• Protein sets for species with sequenced genomes from across the tree of life
• Proteomes for 25,000 model organisms available
• Different degrees of coverage (other 160,000 available)
Protein Clusters: UniRef
• Clusters of proteins at 100%, 90%, and 50% sequence identity
• Groups of similar proteins to sample from
Sequence Archive: UniParc
• Non-redundant archive of publicly available protein sequences seen across different databases
• Stable identifier repository
• Cross-references to a wealth of 40 different external databases (generalist and specialist)

UniProt KB
[Figure: UniProtKB search for "LinB", 106 results. Filters: Status (Reviewed (Swiss-Prot) (4), Unreviewed (TrEMBL) (102)), Taxonomy, Proteins with 3D structure (4), Active site (26), Activity regulation (1), Beta strand (2), Binary interaction (1), Protein existence (Predicted (62))]
Example hits:
• D4Z2G1 · LINB_SPHJU · Haloalkane dehalogenase · Sphingobium japonicum (strain DSM 16413 / CCM 7287 / MTCC 6362 / UT26 / NBRC 101211 / UT26S) · EC number: 3.8.1.5 · Gene: linB · 296 amino acids · Evidence at protein level · #Hydrolase #Detoxification · 1 domain · 3 active sites · 16 3D structures · 14 reviewed publications
• A0A1L5BTC1 · LINB_SPHIB · Haloal… · 296 a…
• A4PEU6 · A4PEU6_9SPHN · Haloalkane dehalogenase · Sphingobium sp. MI1205 · EC number: 3.8.1.5 · Gene: linB (dhaA) · 296 amino acids · Evidence at protein level · #Hydrolase · 1 domain · 3 active sites · 8 3D structures · 4 publications
Each result shows:
• Quality
• Info: name / source organism / EC activity / gene name / length
• Filters
• Protein evidence
• Additional info: domain / 3D structure / active site / publications
UniProt KB
Entry sections: Function · Names & Taxonomy · Subcellular Location · Phenotypes · PTM/Processing · Expression · Interaction · Structure · Family & Domains · Sequence · Similar Proteins

Function
• Human-readable explanation of the protein function
• Wealth of systematically organized information. In the illustrated example:
  • Catalytic activity: with details of the enzymatic reaction and cross-links to chemical databases
  • Activity regulation: competitive inhibitors
  • Kinetics: experimental measurements towards n substrates
  • Optimal pH
  • Implication in biological pathways
  • Catalytic and key residues (active/binding sites)
  • Gene Ontology (GO) annotations (enrichment values)
  • Enzyme/pathway and protein family databases
  • Keywords

Example entry: D4Z2G1 · LINB_SPHJU · Haloalkane dehalogenase · Sphingobium japonicum (strain DSM 16413 / CCM 7287 / MTCC 6362 / UT26 / NBRC 101211 / UT26S) · EC number: 3.8.1.5 · Gene: linB · 296 amino acids · Evidence at protein level

Function
Catalyzes hydrolytic cleavage of carbon-halogen bonds in halogenated aliphatic compounds, leading to the formation of the corresponding primary alcohols, halide ions and protons. Has a broad substrate specificity since not only monochloroalkanes (C3 to C10) but also dichloroalkanes (>C3), bromoalkanes, and chlorinated aliphatic alcohols are good substrates (PubMed:9293022, PubMed:10100638). Shows almost no activity with 1,2-dichloroethane, but very high activity with the brominated analog (PubMed:9293022). Is involved in the degradation of the important environmental pollutant gamma-hexachlorocyclohexane (gamma-HCH or lindane) as it also catalyzes conversion of 1,3,4,6-tetrachloro-1,4-cyclohexadiene (1,4-TCDN) to 2,5-dichloro-2,5-cyclohexadiene-1,4-diol (2,5-DDOL) via the intermediate 2,4,5-trichloro-2,5-cyclohexadiene-1-ol (2,4,5-DNOL) (PubMed:7691794). This degradation pathway allows S. japonicum UT26 to grow on gamma-HCH as the sole source of carbon and energy.

Miscellaneous
Is not N-terminally processed during export, so it may be secreted into the periplasmic space via a hitherto unknown mechanism.

Catalytic Activity
• 1-haloalkane + H2O = a halide anion + a primary alcohol + H(+) (EC 3.8.1.5; source: Rhea 19081)
  ChEBI identifiers: 1-haloalkane CHEBI:18060, H2O CHEBI:15377, a halide anion CHEBI:16042, a primary alcohol CHEBI:15734, H(+) CHEBI:15378
• (3R,6R)-1,3,4,6-tetrachlorocyclohexa-1,4-diene + 2 H2O = 2,5-dichlorocyclohexa-2,5-dien-1,4-diol + 2 chloride + 2 H(+)

Activity Regulation
Competitively inhibited by the key pollutants 1,2-dichloroethane (1,2-DCE) and 1,2-dichloroprc…

Kinetics
• KM = 1.9 mM for 1,2-dibromoethane
• KM = 3.9 mM for 1-chloro-2-bromoethane
• KM = 0.9 mM for 1,2-dibromopropane
• KM = 0.05 mM for 1-bromo-2-methylpropane
• KM = 0.7 mM for 2,3-dichloropropene
• KM = 0.14 mM for 1-chlorobutane
• kcat is 0.98 sec(-1) with 1-chlorobutane as substrate

pH Dependence
Optimum pH is 8.2.

Pathway
Xenobiotic degradation; gamma-hexachlorocyclohexane degradation.
Function: Features
Showing features for domain, active site, binding site (mapped on the sequence, e.g. positions 109 and 132).
• Binding sites: chloride
• Active sites: nucleophile, proton donor, proton acceptor

Function: GO Annotations
• Cellular Component: periplasmic space (IEA:UniProtKB-SubCell)
• Molecular Function: haloalkane dehalogenase activity (IEA:UniProtKB-UniRule)
• Biological Process: response to toxic substance (IEA:UniProtKB-KW)

Function: Keywords and cross-references
• Keywords: Molecular function #Hydrolase; Biological process #Detoxification
• Enzyme and pathway databases: BRENDA 3.8.1.5 (10293); UniPathway UPA00689
• Protein family/group databases: ESTHER sphpi-linb (Haloalkane_dehalogenase-HLD2)

Names & Taxonomy
Protein names
• Recommended name: Haloalkane dehalogenase
• EC number: 3.8.1.5
• Alternative names: 1,3,4,6-tetrachloro-1,4-cyclohexadiene halidohydrolase (1,4-TCDN halidohydrolase)
Gene names
• Name: linB
• Ordered locus names: SJA_C1-19590
Organism names
• Organism: Sphingobium japonicum (strain DSM 16413 / CCM 7287 / MTCC 6362 / UT26 / NBRC 101211 / UT26S)
• Taxonomic identifier: 452662 (NCBI)
• Taxonomic lineage: Bacteria > Proteobacteria > Alphaproteobacteria > Sphingomonadales > Sphingomonadaceae > Sphingobium
Accessions
• Primary accession: D4Z2G1
• Secondary accessions: P51698
Proteome
• Identifier: UP000007753
• Component: chromosome 1
Notes:
• Unique accession numbers
• Serialized for sequence variants (covered later)

Subcellular Location
• UniProt annotation: Periplasm
• Keywords: Cellular component #periplasm

Phenotypes
Features: showing features for mutagenesis (five entries).
• Loss of activity
• Loss of activity
• … wild-type activity
• Loss of activity
• Loss of activity
Notes:
• Describes the effect of mutations on the activity of the protein
• Mutations are mapped on the protein sequence

PTM/Processing
Features: showing features for initiator methionine, chain.
• Initiator methionine: removed
• Chain PRO_0000216778: Haloalkane dehalogenase
Notes:
• Describes post-translational modifications and other processing of the protein (e.g. cleavage for activation)
• Positions are mapped on the protein sequence

Expression and Interaction
• Induction: constitutively expressed
• Subunit: monomer
• Protein-protein interaction databases: STRING 452662.SJA_C1-19590
Expression:
• Describes the expression conditions of the protein
Interaction:
• Refers to the quaternary structure of the protein
• Describes its native oligomeric state
• Lists interactions with other proteins
Structure
SOURCE | IDENTIFIER | METHOD | RESOLUTION | CHAIN | POSITIONS
PDB | 1CV2 | X-ray | 1.58 Å | A | 1-296
PDB | 1D07 | X-ray | 2.00 Å | A | 1-296
PDB | 1G42 | X-ray | 1.80 Å | A | 1-296
PDB | 1G4H | X-ray | 1.80 Å | A | 1-296
PDB | 1G5F | X-ray | 1.80 Å | A | 1-296
Links for each entry: PDB · RCSB-PDB · PDBj · PDBsum
Features: showing features for beta strand, helix, turn.
Notes:
• Displays available tertiary structures (experimentally determined) for the protein
• Links to AlphaFold predictions if available (covered later)
• Describes secondary structure content mapped to the sequence
• Links to databases with 3D structure models

Family & Domains
Features: showing features for domain.
• Domain: positions 31-15…, AB hydrolase-1
Similarity
• Belongs to the haloalkane dehalogenase family. Type 2 subfamily.
Phylogenomic databases
• HOGENOM: CLU_020336_13_3_5
• OMA: TLFCQDW
• eggNOG: COG0596 (Bacteria)
Family and domain databases
• Gene3D: 3.40.50.1820 (1 hit)
• HAMAP: MF_01231, Haloalk_dehal_type2 (1 hit)
• InterPro: IPR029058 AB_hydrolase; IPR000073 AB_hydrolase_1; IPR000639 Epox_hydrolase-like; IPR023594 Haloalkane_dehalogenase_2
• PRINTS: PR00412 EPOXHYDRLASE
• Pfam: PF00561 Abhydrolase_1 (1 hit)
• SUPFAM: SSF53474 (1 hit)
• MobiDB, ProtoNet
Notes:
• Cross-references to motif and profile databases
• Convenient for finding other proteins that share one particular sequence feature

Sequence
Length 296 · Mass (Da) 33,108 · Last updated 2010-06-15 v1 · Checksum 6EEE011B157DBAE1

MSLGAKPFGE KKFIEIKGRR MAYIDEGTGD PILFQHGNPT SSYLWRNIMP HCAGLGRLIA CDLIGMGDSD KLDPSGPERY AYAEHRDYLD   90
ALWEALDLGD RVVLVVHDWG SALGFDWARR HRERVQGIAY MEAIAMPIEW ADFPEQDRDL FQAFRSQAGE ELVLQDNVFV EQVLPGLILR  180
PLSEAEMAAY REPFLAAGEA RRPTLSWPRQ IPIAGTPADV VAIARDYAGW LSESPIPKLF INAEPGALTT GRMRDFCRTW PNQTEITVAG  270
AHFIQEDSPD EIGAAIAAFV RRLRPA                                                                       296

When multiple isoforms are available due to alternative splicing, the different sequences are available here, with serialized accession codes (e.g. P21397-1, P21397-2).
Sequence: Keywords and cross-references
• Keywords: Technical term #3D-structure #Direct protein sequencing #Reference proteome
• Genome annotation databases: EnsemblBacteria BAI96793 (SJA_C1-19590); KEGG sjp:SJA_C1-19590
• Sequence databases (EMBL | GenBank | DDBJ): D14594, Genomic DNA, Translation: BAA03443.2; AP010803, Genomic DNA, Translation: BAI96793.1
• PIR: A49896
• RefSeq: WP_013040256.1; NC_014006.1

Similar Proteins
Tabs: 100% identity · 90% identity · 50% identity
Example, 100% identity cluster UniRef100_D4Z2G1 (LINB_SPHJU): haloalkane dehalogenases from Sphingopyxis lindanitolerans, Sphingobium indicum (A8CFB7) and Sphingobium sp. (A8CFC0), plus 1 more.
• Retrieve groups of proteins that are 100%, at least 90%, or at least 50% identical
• Protein Clusters (UniRef): clusters of protein sequences at 100%, 90% & 50% identity

Uses for protein sequences
What can we do with protein sequences and computers?
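One simple thing a computer can do with two protein sequences is quantify how identical they are, which is the quantity behind the 100%, 90% and 50% identity levels mentioned for UniRef. The sketch below computes percent identity for two sequences that are already aligned (gaps written as `-`). It is a minimal illustration: the function name `percent_identity` is hypothetical, the denominator (aligned, non-gap columns) is only one of several conventions in use, and UniRef's own clustering procedure is not reproduced here.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) columns that hold the same residue.

    Both inputs must come from the same alignment, so they have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Two short, made-up aligned fragments (illustrative only).
    print(percent_identity("MSLGAKPFGE", "MSLGAKPFGE"))
    print(percent_identity("MSLGAKPFGE", "MSLG-KPYGE"))
```

Because the choice of denominator changes the number, identity values from different tools are only comparable when the same convention is used.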
Different protein properties or characteristics can be predicted from the primary sequence:
• Secondary structure
• Solvent accessibility
• Solubility/expressibility
• Transmembrane regions
The methods that make such predictions improve if they consider evolutionary information.

Protein sequences can also be directly compared with one another, and their similarities or differences can be assessed.
Alignments are models that aim to pair the most similar parts among different proteins.
If the model considers evolutionary information (and biologically relevant protein alignments do), evolutionary relationships (homology) can be inferred from sequence similarity.
Analysis Tools
• BLAST: search with a sequence to find homologs through pairwise sequence alignment
• Align: align two or more protein sequences with Clustal Omega to find conserved regions

3-Binf DB & Str. Pred -> Sequence alignments
3-Binf DB & Str. Pred -> Seq align -> Evolution of proteins

[Figure: finch heads, 1. Geospiza …]
Darwinian ideas on evolution:
All species of organisms arise and develop through the natural selection of small, inherited variations that increase the individual's ability to compete, survive, and reproduce (biological fitness).
Inter-individual differences need to be:
• Small
• Inheritable
There exists a natural selective pressure.
Variations that make an individual fitter (improve its functions) under the conditions of the selective pressure are more likely to be transmitted to the next generations.
Accumulation of variation causes speciation.

A few words on molecular evolution
Improved function in a given environment (adaptation) is a key concept in evolution. How does this apply to proteins?
How do proteins function?
• Molecular catalyst
• Molecular pore
Structure is determined by sequence.
Function is dictated by shape (3D structure).

3-Binf DB & Str. Pred -> Seq align -> Seq/Str/Function Paradigm
Sequence, Structure, Function Paradigm
□ 3D structure is determined by the sequence
□ Function is dictated by 3D structure
sequence -> structure -> function
MSLGAKPFGEKKFIEIKGRRMAYIDEGTGDPILFQHGNPTSSYLWRNIMPHCA
GLGRLIACDLIGMGDSDKLDPSGPERYAYAEHRDYLDALWEALDLGDRVVLVV
HDWGSALGFDWARRHRERVQGIAYMEAIAMPIEWADFPEQDRDLFQAFRS
QAGEELVLQD

A few words on molecular evolution
□ Innovation happens at the sequence level
  • Mutations (small changes) are introduced in DNA (inheritable)
  • They are subsequently transcribed, processed, and translated into polypeptide chains (proteins)
□ Selective pressure operates at the function level
  • Proteins working better in their environments make individuals fitter
[Figure: adaptation in the human lineage. Schaffner S. & Sabeti P. (2008) Evolutionary adaptation in human lineage. Nature Education 1:14.]

Homology: two proteins are homologous if they are the products of genes that evolved from the same ancestor.
[Figures: structure, function and sequence compared among homologs]
• Diversity
• Paralogs
• Annotation problem
Sequence alignments
Alignments are models that aim to pair the most similar parts among different proteins.
• Global alignments: consider similarity across the entire sequence
• Local alignments: consider similarity across sequence fragments
• Pairwise alignments: two sequences compared
• Multiple sequence alignments: more than two sequences compared

3-Binf DB & Str. Pred -> Seq align -> Alignments -> Classification
Pairwise alignment techniques:
• Dot-plot methods
• Dynamic programming algorithm
  • Needleman & Wunsch (global)
  • Smith & Waterman (local)
• Word methods
Multiple sequence alignment techniques:
• Dynamic programming
• Progressive methods
• Iterative methods

How can similarity among different parts of proteins be measured?

3-Binf DB & Str. Pred -> Seq align -> Alignments -> Substitution Models
Similarity between amino acids:
[Figure: amino acids grouped by side-chain properties]
A. Amino acids with electrically charged side chains
  • Positive: arginine (Arg), histidine (His), lysine (Lys)
  • Negative: aspartic acid (Asp), glutamic acid (Glu)
C. Special cases: cysteine (Cys), glycine (Gly), proline (Pro)
D. Amino acids with hydrophobic side chains: alanine (Ala), valine (Val), isoleucine (Ile), leucine (Leu), methionine (Met), phenylalanine (Phe), tyrosine (Tyr), tryptophan (Trp)

Assessing similarity in pairs of amino acids:
• Each possible pair of amino acids is given a substitution score (substitution matrix).
• Amino acids from the two sequences should be paired such that the total alignment score is optimized.
• Sometimes no good pairing can be found and a gap needs to be introduced.
• Gaps require a special penalty (negative score) in order to favour longer and biologically meaningful alignments.

How can similarity among different parts of proteins be measured?
• Identity matrix (dot-matrix plots):
  • 1 if same amino acid
  • 0 otherwise
  -> Limited model: forces the introduction of too many gaps.
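The identity matrix behind a dot-matrix plot can be sketched in a few lines: one sequence runs along the rows, the other along the columns, and a cell is 1 when the two residues are the same and 0 otherwise. The function names `dot_matrix` and `render` are illustrative, and the two short sequences are used purely as sample input.

```python
def dot_matrix(rows_seq, cols_seq):
    """Identity matrix: 1 where the residues are identical, 0 otherwise."""
    return [[1 if r == c else 0 for c in cols_seq] for r in rows_seq]


def render(rows_seq, cols_seq):
    """Text dot plot: '*' marks identical residues, '.' everything else."""
    matrix = dot_matrix(rows_seq, cols_seq)
    lines = ["  " + " ".join(cols_seq)]
    for residue, row in zip(rows_seq, matrix):
        lines.append(residue + " " + " ".join("*" if v else "." for v in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Runs of '*' along a diagonal indicate stretches shared by both sequences.
    print(render("EGQPVEVL", "GGQLAKEEAL"))
```

Diagonal runs of marks are what the eye picks out in a dot plot; the model has no notion of similar but non-identical residues, which is the limitation noted above.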
Sequence alignments

How can similarity among different parts of proteins be measured?
• Identity matrix (dot-matrix plots):
  • 1 if the two amino acids are the same
  • 0 otherwise
  -> Limited model: forces the introduction of too many gaps.
• Substitution models:
  • The score depends on the probability of observing a substitution (mutation) of one particular amino acid for another (e.g. Arg -> Lys should score better than Arg -> Glu).

Substitution models include evolutionary information
Margaret Dayhoff, Atlas of Protein Sequence and Structure
[Figure: cover of the Atlas of Protein Sequence and Structure (Margaret O. Dayhoff, Richard V. Eck, Marie A. Chang, Minnie R. Sochard) and a sample table from it]

Dayhoff mutation data matrix
• The score is based on the concept of Point Accepted Mutation (PAM).
• Evolutionary distance: 1 PAM = the time in which 1 in 100 amino acids is expected to mutate.
• Longer evolutionary times are inferred from a Markov chain model: PAM matrix product (the PAM1 matrix multiplied by itself).
• The PAM250 matrix targets the limit at which it is still safe to infer homology between proteins (the twilight zone).
• Limitation: derived from 1572 observed mutations in (manual) alignments of sequences that are >85% identical.

PAM250
[Table: PAM250 matrix indexed by the 20 amino acids (Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, Val), with the original amino acid in columns; the individual values are not recoverable from the extraction]

BLOSUM matrices
• BLOcks SUbstitution Matrix
• Derived from blocks of aligned sequences in the BLOCKS database; implicitly represents distant relationships.
• Bias from identical sequences is removed by clustering at a sequence identity threshold.
• BLOSUM62 = matrix derived from sequences clustered at 62% or greater identity.
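The Markov-chain extrapolation behind the PAM series, mentioned earlier in this section, can be sketched in a few lines. This is a toy under stated assumptions: a three-letter alphabet and invented substitution probabilities stand in for the real 20 x 20 PAM1 matrix, rows (rather than columns) hold the original residue, and the helper names `mat_mul` and `mat_pow` are hypothetical. It only illustrates that the probabilities for n PAM are obtained by multiplying the 1-PAM matrix by itself n times.

```python
# Toy illustration of PAM extrapolation: P(n PAM) = P(1 PAM) ** n.
# ASSUMPTION: 3-letter alphabet and invented probabilities, NOT Dayhoff's data.

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power (repeated squaring)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Row i: probability that residue i is found as residue j after 1 PAM.
# Each row sums to 1 and every residue is unchanged with probability 0.99.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print(" ".join(f"{p:.3f}" for p in row))
```

In the toy, the diagonal entries shrink well below 0.99 after 250 steps, which mirrors why matrices for large PAM distances tolerate many more substitutions than PAM1.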
PAM | BLOSUM
Similar proteins compared as a whole | Conserved blocks (fragments) compared
PAM1 corresponds to 1 changed residue in 100 -> 99% identity | BLOSUM1 corresponds to 1% identity
Other PAM matrices are extrapolated from PAM1 | Each matrix is based on observed alignments
Higher numbers, more evolutionary distance | Higher numbers, more similarity (less evolutionary distance)

Roughly equivalent matrices:
PAM100 | BLOSUM90
PAM120 | BLOSUM80
PAM160 | BLOSUM62
PAM200 | BLOSUM50
PAM250 | BLOSUM45

Dynamic programming algorithm
Matrix:
• Each dimension corresponds to one of the proteins to be aligned.
• Each cell contains the score value from the substitution model corresponding to the residue pair.
• Diagonal transitions represent aligned positions.
• Vertical and horizontal transitions represent gaps and are penalized.
• The final alignment corresponds to the path in the matrix that maximizes the score.

Dynamic programming algorithm
Pair of protein sequences
  U   GGQLAKEEAL
  T   EGQPVEVL
Optimal alignment (no gaps)
  U   GGQLAKEEAL
  T1  EVL
  T2  EGQPVEVL
Optimal alignment (with gaps)
  U   GGQLAKEEAL
  T   EGQP.VE.VL

[Figure: score matrix of U (columns) against T (rows) with the back-trace path]

• Back-trace from the bottom-right.
• Global: Needleman & Wunsch. From the corner.
• Local: Smith & Waterman. From any position.
• Deterministic, but computationally expensive.
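The matrix fill and back-trace described above can be sketched as a minimal global (Needleman & Wunsch style) aligner. Several things here are assumptions made for brevity rather than anything prescribed by the slides: the scoring uses a plain match/mismatch value instead of a PAM or BLOSUM matrix, the gap penalty is linear, and the function name `needleman_wunsch` and its default parameters are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: a prefix aligned against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill: diagonal = aligned pair, vertical/horizontal = gap.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Back-trace from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


# The pair of sequences used in the example above.
total, aligned_u, aligned_t = needleman_wunsch("GGQLAKEEAL", "EGQPVEVL")
print(total)
print(aligned_u)
print(aligned_t)
```

Because the scoring scheme differs from the one on the slide, the alignment printed here need not match the slide's gapped alignment exactly. A local (Smith & Waterman style) variant would clamp cell scores at zero and start the back-trace from the highest-scoring cell instead of the corner.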
Word methods
• Short non-overlapping sequence stretches (k-tuples or words) are identified in the query sequence and matched in the target sequence(s).
• The relative positions of a matching region define an offset (a subtraction of positions).
• Multiple words matching with a similar offset define a region prone to alignment.
• Alignments are subsequently extended in alignment-prone regions.
• Heuristic: the optimal alignment is not guaranteed.
• Efficient for database searches.
• BLAST, FASTA.

Multiple sequence alignments
* Dynamic programming algorithm (N-dimensional matrix)
* Progressive methods
  • First align the most similar pair.
  • Subsequently add less similar sequences.
  • Sensitive to similarity inaccuracy (e.g. due to differences in sequence length).
  • CLUSTAL
  • Additional information considered: T-Coffee (slow)
* Iterative methods
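Returning to the word methods listed above, the first two steps (matching words and computing offsets) can be sketched as follows. This is a simplified illustration, not the BLAST or FASTA implementation: the function names `word_hits` and `best_offset`, the word length `k=3` and the exact-match requirement are all assumptions, and the extension step is left out.

```python
from collections import Counter, defaultdict


def word_hits(query, target, k=3):
    """Return (query_pos, target_pos) pairs for query words found in the target.

    Query words are non-overlapping (taken every k residues); the target is
    indexed at every position.
    """
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    hits = []
    for i in range(0, len(query) - k + 1, k):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def best_offset(query, target, k=3):
    """Return (offset, hit_count) for the best-supported offset, or None.

    The offset is target_pos - query_pos; several hits sharing an offset mark
    a region worth extending into an alignment.
    """
    offsets = Counter(j - i for i, j in word_hits(query, target, k))
    if not offsets:
        return None
    return offsets.most_common(1)[0]


print(word_hits("QLAKEE", "GGQLAKEEAL"))    # [(0, 2), (3, 5)]
print(best_offset("QLAKEE", "GGQLAKEEAL"))  # (2, 2)
```

Both query words hit the target with the same offset of 2, so that diagonal would be the candidate region for extension.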
Iterative methods
• Initial global alignment.
• An objective function (based on the score) is used to optimise the similarity assessment. Choose the best.
• All possible remaining sequence subsets are realigned and re-scored.
• The best subset is included in the alignment at each iteration.
• Typically slower, more accurate.
• MUSCLE, MAFFT.

Beyond pure sequences: patterns and models
• Aligned sequences can be used to define patterns, which can then be used to perform searches in databases.
• Position-specific scoring matrices
• Hidden Markov models

Summary of 1D predictions
Different properties or characteristics of a protein can be predicted from its primary sequence:
• Secondary structure
• Solvent accessibility
• Solubility/expressability
• Transmembrane regions
The methods that make such predictions improve if they consider evolutionary information.

Secondary structure prediction
□ prediction of the conformational state of each amino acid (AA) residue of a protein sequence as one of the possible states:
  ■ helix (H)
  ■ strand (S)
  ■ coil (C)
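Since the task assigns exactly one state to each residue, a prediction can be compared with the observed states position by position. The sketch below computes this per-residue agreement, a measure often called Q3. It is an assumption that accuracy percentages quoted for secondary structure predictors refer to this kind of measure; the function name `q3` and the example strings are hypothetical.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not predicted:
        raise ValueError("state strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(predicted)


# Hypothetical 10-residue example: one helix residue is mispredicted as coil.
print(q3("HHHCCCCEEE", "HHHHCCCEEE"))  # 90.0
```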
Secondary structure prediction
□ amino acid propensities derived from known 3D structures
  ■ probability of a particular AA for a particular secondary structure state
  ■ first-generation methods: low accuracy
□ propensities of segments of adjacent residues
  ■ local environment of residues considered (3-51 consecutive residues)
  ■ second-generation methods: accuracy ~60-65%
□ evolutionary information combined with machine learning
  ■ training set: sequence profiles associated with a particular secondary structure arrangement (based on known 3D structures)
  ■ sequence profiles derived from family sequence alignments
  ■ state-of-the-art methods: accuracy ~70-80%

Secondary structure prediction programs
□ PSI-PRED
  ■ http://bioinf.cs.ucl.ac.uk/psipred/
  ■ combination of PSI-BLAST profiles and neural networks
  ■ careful selection of sequences used for profile construction

  Example output (residues 130-160 of the target):
  Pred: HHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCEEEEEEE
    AA: QQMNQKAVTSFLSVQDGIYNSDLTPKSDIKNPDVWYEFF
  Legend: H = helix, E = strand, C = coil; Conf: confidence of prediction; Pred: predicted secondary structure; AA: target sequence

□ Quick2D (MPI toolkit)
  ■ https://toolkit.tuebingen.mpg.de/tools/quick2d
  ■ overview of secondary structure features (α-helices, extended β-strands, coiled coils, transmembrane helices, disorder regions)
  ■ predictions by PSI-PRED, JNET, Prof, Coils, MEMSAT2, HMMTOP, ...
[Figure: Quick2D output for an example sequence, with tracks for secondary structure (SS: PSIPRED, JNET, Prof (Ouali)), coiled coils (CC: Coils), transmembrane helices (TM: HMMTOP, MEMSAT-SVM, PHOBIUS), disorder (DO: DISOPRED2, IUPRED) and solvent accessibility (SO: JNET)]

□ GeneSilico metaserver
  ■ https://genesilico.pl/meta2/
  ■ meta-server for protein structure prediction, including secondary structure prediction

[Figure: GeneSilico secondary structure predictions for an example sequence from sspro4, cdm, psipred, fdm, jnet, porter, sable, prof and gor, with a consensus row]

Solvent accessibility prediction
□ prediction of the extent to which a residue embedded in a protein structure is accessible to solvent
  ■ comparison of accessibility of different amino acids: relative values (actual area as a percentage of the maximally accessible area)
  ■ simplified two-state description: buried vs. exposed residues
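The relative value and the two-state description above amount to a division and a threshold, sketched below. The numbers are assumptions for illustration only: the maximum accessible areas in `MAX_ASA` are placeholder values of roughly the right magnitude for a handful of residues (published scales differ and should be used in practice), and the 25% cut-off between buried and exposed is one common choice, not a value fixed by the slides. The function names are hypothetical.

```python
# ASSUMPTION: illustrative maximum accessible surface areas in square angstroms.
# Take real values from a published scale before using this for analysis.
MAX_ASA = {"A": 129.0, "G": 104.0, "L": 201.0, "K": 236.0, "F": 240.0}


def relative_accessibility(residue, asa):
    """Accessible area of a residue as a percentage of its maximum area."""
    return 100.0 * asa / MAX_ASA[residue]


def two_state(residue, asa, threshold=25.0):
    """Classify a residue as 'exposed' or 'buried' from its relative accessibility."""
    return "exposed" if relative_accessibility(residue, asa) >= threshold else "buried"


print(relative_accessibility("A", 64.5))  # 50.0
print(two_state("A", 10.0))               # buried
print(two_state("K", 118.0))              # exposed
```

Using relative rather than absolute areas is what makes residues of different sizes comparable: 60 square angstroms is a large share of a glycine but a small share of a lysine.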
Solvent accessibility prediction
□ residue hydrophobicity
  ■ very hydrophobic stretches are predicted as buried
□ propensities of single residues or segments of residues to be solvent accessible
  ■ superior to simple hydrophobicity analyses
□ evolutionary information
  ■ solvent accessibility at each position of a protein structure is evolutionarily conserved within sequence families -> methods using multiple sequence alignment information
  ■ prediction accuracy above 75%

Solvent accessibility prediction programs
□ PHD
  ■ http://www.predictprotein.org/
  ■ combination of evolutionary information with a neural network
□ PROFphd
  ■ http://www.predictprotein.org/
  ■ improved version of PHD
  ■ combination of evolutionary information and secondary structure prediction with a neural network trained only on high-resolution structures
□ SABLE2
  ■ http://sable.cchmc.org/
  ■ combines solvent accessibility and secondary structure predictions
□ GeneSilico metaserver
  ■ https://genesilico.pl/meta2/
  ■ meta-server for structure prediction, including solvent accessibility

[Figure: GeneSilico "Protein Solvation" output for an example sequence, with tracks netsurfp_sol25, soprano_sol25, sable_acc, spine_sol25, spineX_sol25, paleale_sol25, accpro_sol25, jnet_sol25 and paleale_sol5]
Solubility and expressability prediction
□ Complicated definition of the property
□ Prediction of the extent to which a given sequence will produce a soluble protein in a given expression system, or
□ Prediction of aggregation propensity
□ Methods rely heavily on machine learning.
□ Methods based on:
  ■ Plain protein sequences
    • Evolutionary information implicit in the learning data
    • SOLpro http://scratch.proteomics.ics.uci.edu
    • ESPRESSO http://mbs.cbrc.jp/ESPRESSO
    • SoluProt https://loschmidt.chemi.muni.cz/soluprot/
  ■ Sequence profiles
    • Evolutionary information implicit in the profile
    • AGGRESCAN http://bioinf.uab.es/aggrescan/
    • TANGO http://tango.crg.es
    • PASTA http://protein.cribi.unipd.it/pasta/

Transmembrane region prediction
□ transmembrane (TM) proteins: a challenge for experimental determination of 3D structure -> structure prediction is needed even more than for globular water-soluble proteins
□ two major classes of integral membrane proteins
  ■ transmembrane helices (TMH)
  ■ transmembrane beta-strand barrels (TMB)

[Figure: TMH: bacteriorhodopsin (PDB ID 1ap9); TMB: matrix porin (PDB ID 2omf)]

Transmembrane region prediction
□ prediction of TMH is simplified by strong environmental constraints: the lipid bilayer of the membrane
  ■ TMHs are predominantly apolar and 12-35 residues long (hydrophobicity)
  ■ specific distribution of Arg and Lys (positively charged) -> connecting loop regions on the inside of the membrane have more positive charges than loop regions on the outside = positive-inside rule

[Figure: membrane topology sketch with the NH2 and COOH termini and the cytoplasmic side labelled]
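The two signals described above, apolar stretches and positively charged loops, can be illustrated with a small hydropathy scan. The sketch uses the Kyte-Doolittle hydropathy scale; the window length of 19 and the cut-off of 1.6 are conventional choices for that scale but are assumptions here, as are the function names and the toy sequence. It is a first-pass filter in the spirit of the hydrophobicity-based methods, not a substitute for the dedicated predictors.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    values = [KD[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to be candidate TMHs."""
    return [i for i, h in enumerate(window_hydropathy(seq, window))
            if h >= threshold]


def positive_charge(segment):
    """Number of Arg and Lys residues, the quantity used by the positive-inside rule."""
    return sum(segment.count(aa) for aa in "RK")


# Hypothetical toy sequence: charged loop, apolar stretch, acidic loop.
toy = "MKKRR" + "LLIVALLGAVLLIAVLLAL" + "DDESG"
starts = candidate_tm_windows(toy)
print(starts)
print(positive_charge(toy[:5]), positive_charge(toy[-5:]))  # 4 0
```

In the toy sequence the loop before the apolar stretch carries four positive charges and the loop after it carries none, so the positive-inside rule would place the first loop on the cytoplasmic side.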
Transmembrane region prediction
□ prediction of TMB
  ■ transmembrane beta-strands contain 10-25 residues
  ■ only every second residue faces the lipid bilayer and is hydrophobic; the other residues face the pore of the β-barrel and are more hydrophilic -> analysis of hydrophobicity is NOT useful for TMB prediction

Transmembrane region prediction
□ hydrophobicity-based methods (for TMH)
  ■ hydrophobicity along the sequence, hydrophobic moment or other membrane-specific amino acid preferences
  ■ averaging hydrophobicity values over windows of adjacent residues
  ■ prediction of the orientation of TMH using the positive-inside rule
□ evolutionary information combined with machine learning or hidden Markov models (for TMH)
  ■ superior to methods based solely on hydrophobicity
□ evolutionary information combined with machine learning or hidden Markov models (for TMB)

Transmembrane region prediction programs
□ no appropriate estimate of performance available
  ■ insufficient number of high-resolution structures (needed for a statistically significant analysis)
  ■ in the papers, the accuracy of methods is usually largely overestimated: methods perform much better on proteins for which they were developed than on new proteins
  ■ the best methods for TMH are estimated to have ~70% accuracy
□ TMHMM 2.0
  ■ http://www.cbs.dtu.dk/services/TMHMM/
  ■ a number of statistical preferences and rules embedded in a hidden Markov model -> localization and orientation of TMH

[Figure: TMHMM posterior probabilities for sp_P785B8_FREL_CANAL (transmembrane, inside, outside)]
□ TOPCONS
  ■ http://topcons.cbr.su.se/
  ■ consensus prediction of TMHs

[Figure: TOPCONS output combining SCAMPI-seq, SCAMPI-msa, PRODIV, PRO and OCTOPUS; legend: inside, outside, TM-helix (IN->OUT), TM-helix (OUT->IN), reentrant region]

□ TBBpred
  ■ http://www.imtech.res.in/raghava/tbbpred/
  ■ prediction of TMB using machine learning
□ PROFtmb
  ■ http://www.predictprotein.org/
  ■ profile-based hidden Markov model
  ■ prediction of bacterial TMB
□ ...

References
□ Gu, J. & Bourne, P. E. (2009). Structural Bioinformatics, 2nd Edition. Wiley-Blackwell, Hoboken, p. 1067.
□ Xiong, J. (2006). Essential Bioinformatics. Cambridge University Press, New York, p. 352.
□ Schwede, T. & Peitsch, M. C. (2008). Computational Structural Biology: Methods and Applications. World Scientific Publishing Company, Singapore, p. 700.