Edward N. Trifonov
Early Molecular Evolution

Trifonov, E. N., Bettecken, T., Sequence fossils, triplet expansion, and reconstruction of earliest codons. Gene 205, 1-6 (1997)
Trifonov, E. N., Consensus temporal order of amino acids and evolution of the triplet code. Gene 261, 139-151 (2000)
Trifonov, E. N., The triplet code from first principles. J Biomolec Str Dyn 22, 1-11 (2004)
Trifonov, E. N., Genetic Code: Evolution, Encyclopedia of Life Sciences, John Wiley & Sons, Ltd, Chichester, UK, 2008

Contents:
Introduction
Chapter I. Prebiotic syntheses. Combinatorics. Complementarity.
Chapter II. Nucleic acids - key component of Life. Definition of Life.
Chapter III. Amino acid chronology
  A. Ancient triplet repeats and first codons
  B. Consensus temporal order of amino acids
Chapter IV. Evolutionary chart of codons.
Chapter V. Predictive power of the evolutionary chart.
  A. Glycine clock
  B. Binary code of protein sequences
  C. The size of the earliest proteins (peptides)
  D. The earliest mRNA hairpins
Chapter VI. Omnipresent protein sequences.
Chapter VII. Ancient closed loop modules.
  A. The size of the modules
  B. Loop-n-lock structure
  C. Linear arrays of the closed loops
  D. Prototypes, proteomic code
Chapter VIII. Last Universal Common Ancestor (LUCA)
  A. LUCA modules
  B. Sequence space
  C. The earliest gene pair
Chapter IX. Genome segmentation

Introduction

Molecular evolution is commonly known as the discipline initiated by the seminal study of E. Zuckerkandl and L. Pauling on evolutionary distances between similar protein sequences. It deals with the events of the last 2-3 billion years, when life already operated with long sequences.

Zuckerkandl, E., and Pauling, L. (1962) Molecular disease, evolution and genetic heterogeneity. In: Kasha, M., and Pullman, B. (eds.) Horizons in Biochemistry. Academic Press, New York, pp. 189-225.

Early Molecular Evolution is a new discipline.
It is the reconstruction of the earliest molecular events and structures, starting with the origin of the triplet code and continuing to the very first small nucleic acids and short protein chains. The first steps of the reconstruction were made by W. Löb, S. Miller, M. Eigen and P. Schuster.

Löb W (1913) Über das Verhalten des Formamids unter der Wirkung der stillen Entladung: Ein Beitrag zur Frage der Stickstoff-Assimilation. Ber 46:684-697
Yockey HP (1997) Walther Löb, Stanley L. Miller and prebiotic "building blocks" in the silent electrical discharge. Persp Biol Med 41:125-131
Miller SL (1953) A production of amino acids under possible primitive earth conditions. Science 117:528-529
Miller SL, Urey HC (1959) Organic compound synthesis on the primitive Earth. Science 130:245-251
Miller SL (1987) Which organic compounds could have occurred on the prebiotic Earth? Cold Spr Harb Symp Quant Biol 52:17-27
Eigen M, Schuster P (1978) The hypercycle. A principle of natural self-organization. Part C: The realistic hypercycle. Naturwissenschaften 65:341-369

Origin of life: the most difficult scientific and philosophical problem

From the earliest days of science (then natural philosophy) the hardest question was that of the divine nature of life. Two major ideas have been constructive in the history of the problem:
1. Occam's principle (Let's assume that God does exist…)
2. The Cartesian division of the problem of Life in two: Body and Soul

Father William Occam: 14th-century English logician, theologian and Franciscan friar. In the search for a proof that God exists, in fashion at that time, he suggested assuming the simplest: yes, it does, and we should stay with that statement until confronted with a contradiction (if at all).
Pluralitas non est ponenda sine necessitate.
(Don't complicate the matter without necessity)
(By the same token one can assume another simplest statement, that God does not exist, and proceed until a contradiction, if at all.)

There is still the Soul (thought, mind, consciousness) component, which is hard to deal with without assuming the existence and power of God. With this in mind René Descartes (Cartesius, 17th century), mathematician and philosopher, suggested separating all that relates to the material (subject to physical laws) from the immaterial, hence the Body/Soul dualism.

We are going to stay on the Body (physics, chemistry) side, that is, the structure and mechanisms of life.

The strategy and scientific method of the new discipline of Early Molecular Evolution is to stay away from speculation and from attempts to invent and build a scenario of the earliest steps of life, and instead to reconstruct the molecular past from the living molecules of today, DNA and proteins.

The major assumption is: if in the earliest and most difficult times of evolution some molecules happened to become survivors, winners (which also means good performers), these molecules and their sequences may well still be around, in the form of their descendants, as successful as in the past and not much different.

[Figure: hammers, from Stone Age axes to the modern claw hammer]
The hammer, a heavy thing with a handle, for hitting, has existed for over 100 000 years (since the early Paleolithic).

There are sequences, in DNA and in proteins, which apparently have been perpetuated from the earliest times. We thus look for sequence fossils in modern molecules.

[Diagram: Abiotic syntheses, Earliest genes and proteins, LUCA, First cellular species; RECONSTRUCTION]

“millions of years, in pain, labors and fight this shining beauty has been created from primordial slime, and here it is: just a rooster walking on the grass.
And it occurs to nobody what a cost Life has paid… …in a thousand year long blink, in a tremendous effort dead particles fused together - and the Life, self-confident, joyfully runs across the road, disregarding those incredible sufferings that have been sacrificed to its being”.
(V. V. Veresaev, Dead End [V tupike], 1922. Transl. by ENT)
The passage in the Russian original opens with: “Slowly stepping across the grass by the well was a bird of unseen size and beauty, with a fiery-red neck and a lush tail shimmering with greenish black …”

[Figure: periodic table of the elements]
Living matter: O C H N
Earth: O Si Al Fe
Ocean: O H Cl Na
Atmosphere: N O C H

Steps of reconstruction of the earliest Life:
1953-1983  Stanley Miller's imitation experiments yielded A, G, V, D, S, E, P, L, T, I: 10 natural amino acids
1976       Manfred Eigen and Peter Schuster noted that alanine and glycine are encoded today by the most stable and complementary codons, GCC/GGC
1987-92    Jaime Lagunez-Otero and ENT discovered that the consensus of mRNA is (GCU)n
1997       Thomas Bettecken and ENT speculated that (GCC)n/(GGC)n could be the first duplex gene. This duplex is still the most expandable today.
2000       The Evolutionary Chart of Codons is derived
2001-2007  Abiotic synthesis of nucleobases on minerals, Raffaele Saladino, Ernesto Di Mauro

Miller's Soup
[Figures: Miller's experiment; Miller's products; Miller's atmosphere]

Amino-acid composition of modern proteins compared with the Miller mix:
Modern proteins   Miller mix
L                 L
A                 A
G                 G
S                 S
V                 V
E                 E
I                 I
T                 T
K
D                 D
R
P                 P
N
Q
F
Y
M
H
C
W

Opportunism of life: Miller did not pay attention to the fact that about 90% of the amino acids in modern proteins are those synthesized in Miller's system. Life built its proteins from the amino acids that the environment could provide, that is, abiotically synthesized amino acids.

The imitation experiments of Miller, then a Ph.D. student of Harold Urey, were conducted as a side project, with the permission of his supervisor.
Walther Löb (1913) was the first to synthesize glycine in experiments imitating primordial conditions. This was recognized only in 1995, when a translation mistake (German to English) was noticed: “Kohlenoxyd”, carbon monoxide CO, instead of “Kohlensäure”, carbonic acid H2CO3 (carbon dioxide CO2).

Raffaele Saladino, Umberto Ciambecchini, Claudia Crestini, Giovanna Costanzo, Rodolfo Negri and Ernesto Di Mauro (2003) were the first to synthesize, under primordial conditions and in the presence of catalysts (TiO2), all four nucleobases in appreciable amounts. J. Biol. Chem. 2007

What are the simplest living organisms? Bacteria? Viruses?
The simplest are viroids. They consist of just infectious RNA molecules of about 300 bases. They attack plants (avocado, citrus, potato).
Is that life? But what is life?

“The evolution of life is a trick of nature to ensure a faster and better reproduction of the nucleic acids”. Sol Spiegelman

MASTER tRNA SEQUENCE (Eigen and Winkler-Oswatitsch, Naturwissenschaften 68, 217, 1981)
GCC GGG GUA GCU CAG UUG GUA GAG CGC CGG ACU XXX AAU CCG GAG GUC GCG GGU UCG AAU CCC GUC CCC GGC ACC A
(XXX: anticodon)

Consensus sequence of ancient RNA: (RNY)n (Eigen, Schuster, 1976)

MASTER tRNA, base counts by triplet position:
        I    II   III
A+G    16    10    11
C+U     8    13    13

BUT, ACTUALLY:
        I    II   III
A       4     5     2
C       6     8     8
G      12     5     9
U       2     5     5
that is, (GNN)n

If tRNA used to be a protein-coding sequence, with the ancient periodical sequence background, why does it still have that periodicity, when the protein-coding function of tRNA is long gone? The answer is: there are many overlapping codes in the sequences. They utilize the periodical letters, so that the periodicity does not go away.
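The positional counts in the master tRNA table can be checked with a few lines of code. The sketch below is only an illustration: it splits the sequence as transcribed on this slide into triplets, skips the XXX anticodon placeholder and the unpaired final A (both choices are assumptions), and tallies each base per triplet position. The names MASTER_TRNA and positional_counts are hypothetical. Because the result depends on the exact transcription of the sequence, individual counts may differ by a base from the tabulated ones.

```python
from collections import Counter

# Master tRNA sequence as transcribed on the slide (XXX marks the anticodon).
MASTER_TRNA = (
    "GCC GGG GUA GCU CAG UUG GUA GAG CGC CGG ACU XXX AAU CCG GAG GUC "
    "GCG GGU UCG AAU CCC GUC CCC GGC ACC A"
)


def positional_counts(seq):
    """Count bases at triplet positions I, II and III.

    Assumption: incomplete triplets and the XXX placeholder are ignored.
    """
    triplets = [t for t in seq.split() if len(t) == 3 and "X" not in t]
    return [Counter(t[i] for t in triplets) for i in range(3)]


counts = positional_counts(MASTER_TRNA)
for label, c in zip(("I", "II", "III"), counts):
    purines = c["A"] + c["G"]
    pyrimidines = c["C"] + c["U"]
    print(label, dict(sorted(c.items())), "A+G =", purines, "C+U =", pyrimidines)
```

In this transcription G is by far the most frequent base in position I, which is the (GNN)n pattern noted in the table.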
“We must admit that we had expected more noise accumulation during later stages of evolution, so that the memory of a triplet pattern - which has no foundation in tRNA present adaptor function - came out as a true surprise”
Eigen, Winkler-Oswatitsch, Naturwissenschaften 68, 282-292, 1981
This surprise remained a headache from 1979 (Braunlage) until 2006 (Les Treilles).

JUST ONE EXAMPLE OF OVERLAPPING
gaattccacattgtttgccgcacgttggattttgaaatgccagggaactttgggagactcatatttctgggccagaggatctgtggaccacaagatctt
tttatgatgacagtagcaatgtatctgtggagccggattctgggttgggagtgcaaggaaaagaatgtactaaatgccaagacatctatttcaggagca
tgaggaataaaagttctagtttctggtctcagagtggtgcagggatcagggagtctcacaatctcctgagtgccggtgtcttagggcacactgggtctt
ggagtgcaaaggatctaggcacgtgaggccttgtatgaagaatcggggatcgtacccaccgccgccgccgccgccgccgccgccgccgccgccgccgcc
gccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccgccccctgtttctgtttcatc
ctgggcgtgtctcctctgcctttgtcccctagatgaagtctccatgagccaagggcctggtgcatccagggtgatctagtaattgcagaacagcaagtg
ccagccctccctccccttccacagccctggatgtgggagggggttgtccagcctccagcagcatggggagggccttggtcagcctctgggtgccagcag
ggcaggggcggagtcctggggaatgaaggttttatagggcccctgggggaggccccccagccccaagcctaccacctgcacccggagagccgtgtcacc
atgtgggtcccggttgtcttcctcaccctgtccgtgacgtggattggtgagaggggccatggttggggggatgcaggagagggagccagccctgactgt
caagccgaggccctttcccccccaacccagcaccccagcccagacagggagccgggcccttttctgtctctcccagccccactccaagcccataccccc
agcccctccatattgcaacagtcctcactcccacaccaggtccccgccccctcccacttaccccagaactttctccccatt

Three known classes of pathologically expanding (“aggressive”) triplets: GCU (GCU, CUG, UGC, AGC, GCA, CAG), GCC (GCC, CCG, CGC, GGC, GCG, CGG) and GAA (AAG, AGA, GAA, CUU, UUC, UCU).

The occurrence of expansion of the same triplets in eukaryotes and in prokaryotes points to a special property of the repeating triplets, irrespective of the enzymatic environment. This property is the formation of slippage structures during replication.

If GCC had been the very first codon, then the next codons would likely be its point-mutated versions:
GCC
GCA GCG GCU
GAC GGC GUC
ACC CCC UCC
And they would have been the codons for the very first amino acids.

There are four obvious ways to evaluate which of the 20 amino acids were first on the evolutionary scene:
1. Chemical simplicity
2. Presence in the Miller soup
3. They, perhaps, have been served by the oldest, class II, aminoacyl-tRNA synthetases

Columns: structurally simple amino acids; amino acids of Miller's mixture; class II aa-tRNA synthetases; earliest amino acids
Ala + ............ + ............ + ............ +
Arg
Asn + +
Asp + ............ + ............ + ............ +
Cys +
Gln
Glu +
Gly + ............ + ............ + ............ +
His +
Ile + +
Leu + +
Lys +
Met +
Phe +
Pro + ............ + ............ + ............ +
Ser + ............ + ............ + ............ +
Thr + ............ + ............ + ............ +
Trp
Tyr
Val + +

Triplet code and its early form
UUU Phe   UCU Ser   UAU Tyr   UGU Cys
UUC Phe   UCC Ser   UAC Tyr   UGC Cys
UUA Leu   UCA Ser   UAA TRM   UGA TRM
UUG Leu   UCG Ser   UAG TRM   UGG Trp
CUU Leu   CCU Pro   CAU His   CGU Arg
CUC Leu   CCC Pro   CAC His   CGC Arg
CUA Leu   CCA Pro   CAA Gln   CGA Arg
CUG Leu   CCG Pro   CAG Gln   CGG Arg
AUU Ile   ACU Thr   AAU Asn   AGU Ser
AUC Ile   ACC Thr   AAC Asn   AGC Ser
AUA Ile   ACA Thr   AAA Lys   AGA Arg
AUG Met   ACG Thr   AAG Lys   AGG Arg
GUU Val   GCU Ala   GAU Asp   GGU Gly
GUC Val   GCC Ala   GAC Asp   GGC Gly
GUA Val   GCA Ala   GAA Glu   GGA Gly
GUG Val   GCG Ala   GAG Glu   GGG Gly

This spectacular match of the (speculated) earliest codons to the (speculated) earliest amino acids, tacitly assuming that the code today is the same as 3.9 billion years ago (another wild speculation), is too much of a coincidence. The analyses put together three independent speculations, which converged in a match that could not be realized unless all three speculations are correct.

Why is the RNA-protein translation code triplet, and not doublet or tetraplet?
1. A doublet code would not be enough for 20 amino acids.
2. Triplets make slippage structures more likely than repeats of other lengths, because of the formation of stabilizing structures within the slippage loops.
The structures are imperfect hairpins (for gcc and GCC), a triplex structure (for GAA) and, possibly, the Kuryavyi-Jovin structure (for GCA).

Evolutionary chart of codons

There are many more criteria, opinions and experimental suggestions as to the hypothetical order of appearance of the amino acids on the evolutionary scene. They are too numerous and highly diverse, and there is no way to reach a consensus by way of argument.

39 criteria for amino-acid chronology (2000)
1. Simplicity (number of non-hydrogen atoms)
2. Involvement with more ancient synthetases of class II
3. Yield in Miller's experiments
4. Amino-acid composition of extant proteins
5. Chemical inertness
6. Stability of codon-anticodon interactions
7. Molecular clock sequence analysis of synthetases
8. Stability of (“older”) assignments in the table of the code
9. Jukes' theory of the origin of the code
10. Coevolution theory of Wong
11. GCU-based theory of Trifonov and Bettecken
12. RRY hypothesis of Crick
13. RNY hypothesis, Eigen and Schuster
14. Hypothesis of Hartman
15. Hypothesis of Ferreira
16. Prebiotic physicochemical code of Altshtein-Efimov
17. Early copolymerization code of Nelsestuen
18. Composition of proteinoids of Fox
19. Coevolution theory of Dillon
20. Yield in imitation experiments of Fox and Windsor
21. Yield in experiments of Harada and Fox, high temperatures
22. Yield in shock wave experiments of Bar-Nun
23. Coevolution theory of Wächtershäuser
24. Remnants of primordial code in tRNA (Möller and Janssen)
25. Evolutionary distances between isoacceptor tRNAs
26. Hypothesis of O. Ivanov
27. Match scores of BLOSUM matrix
28. A/U start, Jimenez-Sanchez
29. N-fixing amino acids first, Davis
30. GNN codons first, Taylor and Coates
31. Algebraic model of Hornos and Hornos
32. Composition of translated Urgen
33. Murchison meteorite
34. Minimal graph complexity, amino acids
35. Minimal graph complexity, amino-acid residues
36. Hypothesis of Jimenez-Montano
37. “Size/complexity” score, Dufton
38. Minimal alphabet for folding
39. DNA stability
40. RNA duplex stability
(ENT, 2000)

Rank: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1. G A -CS- - PTV - - -DILMN- - - EKQ - H -FR- Y W
2. -AG- - - -DFHKNPST- - - - - - CEILMQRVWY - - -
3. A G D V L E I S P T M K - - -CFHNQRWY- - -
4. L A G S -VE- -IT- K D R P N Q F Y -HM- -CW-
5. - - AFGILPV - - - NQST - - - - CDEHKMRWY - - -
6. A -GP- - DES - -TV- R L - CHQW - -IM- Y - FKN -
7. - - - - - - ACDEFGHIKLMNPQRSTV - - - - - - -WY-
8. - - ADEFGHP - - V I - KNSY - - MTW - L R -CQ-
9. - - - ADEGHLPQRV - - - - - -CFIKNSTY- - - -MW-
10. - -ADEGS- - V -PT- -IL- F C Y -KR- -NQ- H -MW-
11. A - - DGPSTV - - E - - - - CFHIKLMNQRWY - - - -
12. - DGNS - - - - - - ACEFHIKLMPQRTVWY - - - - -
13. -AG- - - DINSTV - - - - - - CEFHKLMPQRWY - - - -
14. G P A R - - DENQST - - -HK- C - -FILVY- - -MW-
15. - - FGKLNP - - - - - CDEHQRSTVW - - - - AIMY -
16. - - - ADEGKRSTV - - - - - - -CFHILMNPQWY- - - -
17. - - - - DEFHIKLMSTVY - - - - - - -ACGNPQRW- - -
18. A E V -GK- M L C Y -NQ- I -DF- R H P W T S
19. G A D V E Q - HLPR - N T -IS- -KM- F -CY- W
20. G I -AP- S E D F L V - - - CHKMNQRTWY - - -
21. G A E D L -PV- S I T -FY- - - -CHKMNQRW- - -
22. G A V L - - - - - CDEFHIKMNPQRSTWY - - - - -
23. -DE- - - -ACGNPQST- - - - ILMV - - - FHKRWY - -
24. - ADGV - - - - - - CEFHIKLMNPQRSTWY - - - - -
25. Q H P -LS- G C W R V -DE- A Y T -IM- F -KN-
26. - - - ADEGLPRSTV - - - - - - CFHIKMNQWY - - -
27. - -AILSV- - - - EKMQRT - - - DFGN - -PY- H C W
28. - - FIKLMNY - - - - - CDEHQRSTVW - - - - AGP -
29. - DENQ - - APSV - -CG- T - ILM - R K -FY- H W
30. - -ADEGV- - - - - - - CFHIKLMNPQRSTWY - - - - -
31. - -CDFSV- - - -EKLRY- - -HP- - - -AGIMNQTW- - -
32. V - AGP - - ENRT - - LQS - - - - CDFHIKMYW - - -
33. -AG- - DEPV - - - - - -CFHIKLMNQRSTWY- - - - -
34. G A D P -CS- N E V K Q T L M I R H F Y W
35. G A -CS- P V K M T L -DI- N E Q H F R Y W
36. - ADGV - - LPR - - - CIKQST - - - - EFHMNWY - -
37. G A V -IL- S T K P D N E Q F R Y C H M W
38. - -AGEIK- - - - - - - CDFHLMNPQRSTVWY - - - - -
39. A G S R C T D V P E W -HN- F L I Y M -KQ-
40. G A P W -RS- C D T E H V -LM- Q I Y N F K
41. - ADGS - - CPQV - - - EFIKNT - - - - HLMRWY - -
42. G A C S D V - - - - -EFHIKLMNPQRTWY- - - - -
43. - -AGPTV- - L R S I - - - CDEFHKNQY - - - -MW-
44. G A S P D C N T E V Q H M -LI- K R F Y W
45. G L A V D E P I T R F K S Y N H Q M W C
46. - - - - ADEFGIKLNQTVY - - - - - - CHMPRSW - -
47. - - - ADGHINSTV - - - - MPR - - - -CEFKLQWY- - -
48. - - - - - -ADEGHIKLMNPQRSTVW- - - - - - - CFY -
49. - AGPR - - - - CDEHLQSTVW - - - - - FIKMNY - -
50. - ADGV - - - EHLPQR - - - - - CFIKMNSTWY - - -
51. D N T E Q K P I M S G A R V L C H Y F W
52. - - - - ADEFGHLPQRSTV - - - - - - CIKMNWY - -
53. - -AILPV- - - -DEGST- - - -CFHNY- - - -KMQRW- -
54. - - ADEGSV - - - -KLPRT- - - - - CFHIMNQWY - - -

Table 2. Thermostability of the codons (complementary pairs, kcal/mol) (Xia et al., 1998)
A  GCC 28.3     K  AAG 17.3     R  AGG 23.9
   GCG 25.5        AAA 13.6        AGA 22.9
   GCU 25.4     L  CUC 22.9     S  UCC 25.8
   GCA 25.3        CUG 20.9        UCG 23.1
C  UGC 25.3        CUA 18.2        UCU 22.9
   UGU 21.8        CUU 17.3        UCA 22.9
D  GAC 23.8     L  UUG 17.3     S  AGC 25.4
   GAU 21.8        UUA 14.5        AGU 21.9
E  GAG 22.9     M  AUG 19.8     T  ACC 24.8
   GAA 19.3     N  AAC 18.2        ACG 22.0
F  UUC 19.3        AAU 16.3        ACU 21.9
   UUU 13.6     P  CCC 26.8        ACA 21.8
G  GGC 28.3        CCG 24.0     V  GUC 23.8
   GGG 26.8        CCU 23.9        GUG 21.8
   GGA 25.8        CCA 23.8        GUA 19.1
   GGU 24.8     Q  CAG 20.9        GUU 18.2
H  CAC 21.8        CAA 17.3     W  UGG 23.8
   CAU 19.8     R  CGC 25.5     Y  UAC 19.1
I  AUC 21.8        CGG 24.0        UAU 17.1
   AUA 17.1        CGA 23.1
   AUU 16.3        CGU 22.0

Consensus temporal order of amino acids (single-factor criteria)
Miller   Amino acid   Average rank (± 0.7)   Order   Codon capture cases
+        G             2.8                    1
+        A             3.9                    2
+        V             6.5                    3
+        S             7.1                    4
+        P             7.4                    5
+        D             7.7                    6
+        T             9.0                    7
+        E             9.9                    8
+        L            10.3                    9      (+)
+        I            10.9                   10      (+)
         N            11.2                   11
         R            11.7                   12
         H            12.7                   13      +
         Q            12.8                   14      +
         K            13.2                   15
         F            13.2                   16      +
         C            13.9                   17      +
         M            15.0                   18      +
         W            15.3                   19      +
         Y            15.3                   20      +

Consensus temporal order of amino acids (multi-factor criteria)
Miller   Amino acid   Average rank (± 0.7)   Order   Codon capture cases
+        A             4.1                    1
+        G             4.2                    2
+        D             4.2                    3
+        V             6.1                    4
+        E             6.3                    5
+        P             7.2                    6
+        S             8.0                    7
+        L             9.5                    8      (+)
+        T             9.8                    9
         Q             9.9                   10      (+)
         R            10.2                   11
         N            11.4                   12
+        I            11.9                   13      (+)
         H            13.2                   14      +
         K            13.4                   15
         C            13.8                   16      +
         F            15.1                   17      +
         Y            15.2                   18      +
         M            15.9                   19      +
         W            17.7                   20      +

Consensus temporal order of amino acids (final)
Miller   Amino acid   Average rank (± 0.7)   Order   Codon capture cases
+        G             3.5                    1
+        A             4.0                    2
+        D             6.0                    3
+        V             6.3                    4
+        P             7.3                    5
+        S             7.6                    6
+        E             8.1                    7
+        T             9.4                    8
+        L             9.9                    9      (+)
         R            11.0                   10
         N            11.3                   11
+        I            11.4                   12      (+)
         Q            11.4                   13      (+)
         H            13.0                   14      +
         K            13.3                   15
         C            13.8                   16      +
         F            14.2                   17      +
         Y            15.2                   18      +
         M            15.4                   19      +
         W            16.5                   20      +
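The averaging behind a consensus order can be illustrated on a small scale. The sketch below takes three of the complete rankings from the criteria table (rows 34, 37 and 44), treats a multi-letter token such as -CS- as a tie, gives tied amino acids the mean of the positions they occupy, and averages the ranks across rows. The mid-rank treatment of ties and the helper names parse_row, mid_ranks and consensus are assumptions made for illustration; the slides do not state how ties were scored, and the published consensus uses all criteria plus filtering.

```python
from collections import defaultdict

# Three complete rankings copied from the criteria table (rows 34, 37, 44).
ROWS = {
    34: "G A D P -CS- N E V K Q T L M I R H F Y W",
    37: "G A V -IL- S T K P D N E Q F R Y C H M W",
    44: "G A S P D C N T E V Q H M -LI- K R F Y W",
}


def parse_row(row):
    """Split a row into groups of tied amino acids, ignoring bare dashes."""
    groups = []
    for token in row.split():
        letters = token.strip("-")
        if letters:
            groups.append(letters)
    return groups


def mid_ranks(groups):
    """Assign each amino acid the mean position of its tie group (assumption)."""
    ranks = {}
    position = 0
    for group in groups:
        k = len(group)
        rank = position + (k + 1) / 2
        for aa in group:
            ranks[aa] = rank
        position += k
    return ranks


def consensus(rows):
    """Average the ranks over all rows and sort from earliest to latest."""
    totals = defaultdict(float)
    for row in rows:
        for aa, rank in mid_ranks(parse_row(row)).items():
            totals[aa] += rank
    averages = {aa: total / len(rows) for aa, total in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1])


order = consensus(ROWS.values())
for aa, avg in order:
    print(f"{aa}  {avg:.1f}")
```

Even with only three criteria, glycine and alanine come out first and tryptophan last, in line with the consensus tables above.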
Persistence of the ranking
Columns: number of criteria used (simple averaging: 3, 7, 25, 28, 40; filtered: one, two)
Rank    3    7    25   28   40   one  two
1.      G    A    G    G    G    G    G
2.      A    G    A    A    A    A    A
3.      S    S    D    V    V    V    V
4.      D    P    V    D    D    D    D
5.      P    V    P    S    S    S    E
6.      T    T    S    P    E    E    P
7.      V    L    E    E    P    P    S
8.      L    D    L    L    L    L    L
9.      I    I    T    T    T    T    T
10.     K    E    I    I    I    N    R
11.     N    N    N    N    N    R    N
12.     E    F    F    R    R    K    K
13.     C    K    H    F    K    I    Q
14.     M    R    K    K    Q    Q    I
15.     H    Q    R    Q    C    H    C
16.     F    C    Q    H    F    C    H
17.     Q    H    C    C    H    F    F
18.     R    M    M    M    M    M    M
19.     Y    W    Y    Y    Y    Y    Y
20.     W    Y    W    W    W    W    W

Consensus chronology of amino acids (2000)
Raw data                 Filtered data            Miller
aa   rank   ±    order   aa   rank   ±    order
G     4.4   0.7    1     G     2.9   0.3    1     G
A     4.9   0.8    2     A     2.9   0.3    2     A
V     6.9   0.6    3     V     6.6   0.6    3     V
D     7.2   0.7    4     D     7.0   0.7    4     D
S     7.9   0.7    5     E     7.2   0.6    5     E
E     8.2   0.7    6     P     7.5   0.6    6     P
P     8.3   0.7    7     S     7.7   0.7    7     S
L     9.4   0.7    8     L     9.5   0.7    8     L
T    10.1   0.6    9     T     9.8   0.6    9     T
I    11.2   0.7   10     R    11.5   0.7   10
N    11.8   0.7   11     N    12.2   0.7   11
R    12.0   0.7   12     K    12.3   0.5   12
K    12.0   0.7   13     Q    13.0   0.4   13
Q    12.4   0.7   14     I    13.0   0.5   14     I
C    12.4   0.7   15     C    14.3   0.6   15
F    13.0   0.7   16     H    14.9   0.5   16
H    13.3   0.6   17     F    15.1   0.4   17
M    14.0   0.6   18     M    15.4   0.4   18
Y    14.7   0.5   19     Y    15.6   0.4   19
W    15.8   0.6   20     W    16.7   0.5   20

GCC is a codon for alanine (A), GGC a codon for glycine (G). Both amino acids have the highest yields in the imitation experiments of Stanley Miller.

EVOLUTION OF THE TRIPLET CODE
E. N. Trifonov, December 2007, Chart 101

Consensus temporal order of amino acids:
UCX CUX CGX AGY UGX AGR UUY UAX
Gly Ala Asp Val Ser Pro Glu Leu Thr Arg Ser TRM Arg Ile Gln Leu TRM Asn Lys His Phe Cys Met Tyr Trp Sec Pyl
1 GGC-GCC . . . . . . . . . . . . . . . . . | . . . . . . . .
2 | | GAC-GUC . . . . . . . . . . . . . . . | . . . . . . . .
3 GGA--|---|---|--UCC . . . . . . . . . . . . . . | . . . . . . . .
4 GGG--|---|---|---|--CCC . . . . . . . . . . . . . | . . . . . . . .
5 | | (gag)-|---|---|--GAG-CUC . . . . . . . . . . . | . . . . . . . .
6 GGU--|---|---|---|---|---|---|--ACC . . . . . . . . . . | . . . . . . . .
7 . GCG--|---|---|---|---|---|---|--CGC . . . . . . . . . | . . . . . . . .
8 . GCU--|---|---|---|---|---|---|---|--AGC . . . . . . . . | . . . . . . . .
9 . GCA--|---|---|---|---|---|---|---|---|--ugc . . . . . . . | . . UGC . . . . .
10 . . | | | CCG--|---|---|--CGG | | . . . . . . . | . . | . . . . .
11 . . | | | CCU--|---|---|---|---|---|--AGG . . . . . . | . . | . . . . .
12 . . | | | CCA--|---|---|---|---|--ugg | . . . . . . | . . | . . UGG . .
13 . . | | UCG------|---|---|--CGA | | | . . . . . . | . . | . . . . .
14 . . | | UCU------|---|---|---|---|---|--AGA . . . . . . | . . | . . . . .
15 . . | | UCA------|---|---|---|---|--UGA . . . . . . . | . . | . . . UGA .
16 . . | | . . | | ACG-CGU | | . . . . . . . | . . | . . . . .
17 . . | | . . | | ACU-----AGU | . . . . . . . | . . | . . . . .
18 . . | | . . | | ACA---------ugu . . . . . . . | . . UGU . . . . .
19 . . GAU--|-----------|---|----------------------AUC . . . . . | . . . . . . . .
20 . . . GUG----------|---|-----------------------|--cac . . . . |CAC . . . . . . .
21 . . . | . . | CUG----------------------|--CAG . . . . | | . . . . . . .
22 . . . | . . | | . . . . . aug-cau . . . . |CAU . . AUG . . . .
23 . . . | . . GAA--|-----------------------|---|--uuc . . . | . UUC . . . . . .
24 . . . GUA--------------|-----------------------|---|---|--uac . . | . | . . UAC . . .
25 . . . | . . . CUA----------------------|---|---|--UAG . . | . | . . | . . UAG
26 . . . GUU--------------|-----------------------|---|---|---|--AAC . | . | . . | . . .
27 . . . . . . . CUU----------------------|---|---|---|---|--AAG| . | . . | . . .
28 . . . . . . . . . . . . . | CAA-UUG | | | | . | . . | . . .
29 . . . . . . . . . . . . . AUA------|--uau | | | . | . . UAU . . .
30 . . . . . . . . . . . . . AUU------|---|--AAU | | . | . . . . . .
31 . . . . . . . . . . . . . . . UUA-UAA | | . | . . . . . .
32 . . . . . . . . . . . . . . . uuu---------AAA| . UUU . . . . . .
CONSECUTIVE ASSIGNMENT OF 64 TRIPLETS          CODON CAPTURE
aa "age": 17 17 16 16 15 14 13 13 12 11 10 9 8 7 6 5 4 3 2 1

THE OLD NEW RULES IN EVOLUTION OF THE TRIPLET CODE
1. ABIOTIC START (Miller, 1953)
   The initial set of amino acids is of purely chemical origin.
2. COMPLEMENTARITY (Eigen and Schuster, 1978)
   New codons are introduced as complementary pairs.
3. THERMOSTABILITY (Eigen and Schuster, 1978)
   The codons that make the most stable pairs with their anticodons are engaged first.
4. PROCESSIVITY
   New codons are derived from the earlier ones by mutations in redundant third positions and by complementary copying.

GLYCINE CLOCK

Content of shared glycine (%) in kingdom-to-kingdom protein sequence alignments
PLANTA
  vs ANIMALIA:     8.8 ± 0.4 (51)
  Branching level: 8.8 ± 0.4 (426/4862, 51)
FUNGI
  vs ANIMALIA:     8.8 ± 0.4 (573/6479, 70)
  vs PLANTA:       8.8 ± 0.4 (391/4427, 50)
  Branching level: 8.8 ± 0.3 (964/10906, 120)
PROTOCTISTA
  vs ANIMALIA:     9.6 ± 0.6 (300/3127, 28)
  vs PLANTA:       9.9 ± 0.6 (324/3283, 27)
  vs FUNGI:        9.8 ± 0.5 (321/3262, 27)
  Branching level: 9.8 ± 0.3 (945/9672, 82)
ARCHAEA
  vs ANIMALIA:     11.1 ± 0.7 (222/1994, 30)
  vs PLANTA:       12.9 ± 0.9 (215/1669, 26)
  vs FUNGI:        12.5 ± 0.8 (245/1961, 31)
  vs PROTOCTISTA:  13.9 ± 1.3 (109/787, 13)
  Branching level: 12.3 ± 0.4 (791/6411, 100)
EUBACTERIA
  vs ANIMALIA:     14.9 ± 0.6 (685/4590, 70)
  vs PLANTA:       13.5 ± 0.6 (546/4041, 44)
  vs FUNGI:        13.4 ± 0.5 (667/4966, 70)
  vs PROTOCTISTA:  11.4 ± 0.7 (304/2656, 28)
  vs ARCHAEA:      13.3 ± 0.8 (304/2288, 35)
  Branching level: 13.5 ± 0.3 (2506/18541, 247)

Ancient binary alphabet
  Gly Ala Val Asp Ser Pro ...
1 GGC--GCC
2 | | GUC--GAC
3 GGA---|----|----|---UCC
4 GGG---|----|----|----|---CCC
. . ↓

At every step of the evolution of the codons, middle purines remain purines (R→R) and middle pyrimidines remain pyrimidines (Y→Y). Reconstruction of the evolutionary history of the triplet code suggests that the earliest protein sequences could be presented in a binary alphabet of two types of amino acids: those encoded by xYx triplets (Ala family, A) and those encoded by xRx triplets (Gly family, G).
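The xYx/xRx split can be derived mechanically from the standard genetic code. The following sketch assumes the standard code table, classifies every codon by its middle base, and assigns an amino acid to the Ala family if all of its codons are xYx and to the Gly family if all are xRx. Serine has codons of both kinds (UCX and AGY), so it is left unassigned here and rendered as "?"; that choice, like the names codon_family and to_binary, is an assumption of this illustration rather than part of the original analysis.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def codon_family(codon):
    """Return 'A' for xYx codons (middle C/U) and 'G' for xRx codons."""
    middle = codon[1].upper().replace("T", "U")
    return "A" if middle in "CU" else "G"


# Collect, for each amino acid, the families of all its codons.
families = {}
for i, b1 in enumerate(BASES):
    for j, b2 in enumerate(BASES):
        for k, b3 in enumerate(BASES):
            aa = AMINO[16 * i + 4 * j + k]
            if aa != "*":  # skip terminators
                families.setdefault(aa, set()).add(codon_family(b1 + b2 + b3))

ALA_FAMILY = {aa for aa, fam in families.items() if fam == {"A"}}
GLY_FAMILY = {aa for aa, fam in families.items() if fam == {"G"}}


def to_binary(protein, ambiguous="?"):
    """Rewrite a one-letter protein sequence in the two-letter alphabet."""
    out = []
    for aa in protein.upper():
        if aa in ALA_FAMILY:
            out.append("A")
        elif aa in GLY_FAMILY:
            out.append("G")
        else:
            out.append(ambiguous)  # serine, or any non-standard letter
    return "".join(out)


print("Ala family:", "".join(sorted(ALA_FAMILY)))
print("Gly family:", "".join(sorted(GLY_FAMILY)))
print(to_binary("AFLIIMVRKREDQNFFVTAMAQQNEDGR"))
```

The sample sequence is the 28-residue example used in these slides for rewriting a modern sequence in its presumed ancient form.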
A F I L M P T V|C D E G H K N Q R W Y A 1 1 | 1 4 F | I 1 1 3| Ala L 1 3 1| alphabet M 1 3 1| P 1 | T 1 | V 3 1 1 |_____________________ C | D | 3 2 1 E | 3 1 2 G 1 | Gly H | 2 3 1 alphabet K | 1 2 N | 2 1 2 1 Q | 1 2 3 1 R | 1 2 1 1 W | 1 2 Y 4 | 2 Rearranged PAM120 substitution matrix (original matrix in Altschul SF, JMB 219, 555, 1991) The conclusion about two alphabets is strongly supported by respective rearrangements of substitution matrices: A F I L M P T V|C D E G H K N Q R W Y A | F | 1 3 I 2 1 3| Ala L 2 2 1| alphabet M 1 2 1| P | T | V 3 1 1 |_____________________ C | D | 2 1 E | 2 1 2 G | Gly H | 1 2 alphabet K | 1 1 2 N | 1 1 Q | 2 1 1 R | 2 1 W 1 | 2 Y 3 | 2 2 Rearranged BLOSUM substitution matrix (original matrix in Henikoff S, Henikoff JG, PNAS 89, 10915,1992) 5D9A07F6 50D05E95 Using the two-letter alphabet one can rewrite modern sequences in their (presumed) ancient version AFLIIMVRKREDQNFFVTAMAQQNEDGR AFLIIMVRKREDQNFFVTAMAQQNEDGR AAAAAAAGGGGGGGAAAAAAAGGGGGGG E6E20ECE C0710FC5 “I assume that the earliest proteins were small peptides of about ten amino acids, and specified by small primitive genes, probably made of RNA” “In the next stage, I postulate that the genes become joined together at random and a primitive splicing mechanism concatenates the peptides into longer molecules” Sidney Brenner, Nature 334, 528-530, 1988 Rewriting modern amino acid sequence in the binary form would suggest what was the ancestral form of that sequence, all the way to original Alanines and Glycines only The G to A and G to G distance analysis of modern protein sequences suggests that the very first miniproteins had the structure GGGGGGG and AAAAAAA encoded by the duplex xRx xRx xRx xRx xRx xRx xRx The size of the original miniproteins is estimated from modern sequences written in binary form to be 7 amino acid residues (J. Mol. Evol. 
53, 394-401, 2001).The same estimate is provided by sequence fossils of ancient hairpins in mRNA(J Biomol Str Dyn 24, 163-170, 2006) untitled2 D27F9B1B In the complementarity distance analysis surviving ancient hairpins would deliver complementary bases at a distance up to the hairpin sequence length 23DEDB4A Possible early hairpins The picture has period of 10-11 bases After removing it we observe expected transition – at ~18-21 bases Hairpins of 6-7 triplets Codon evolution chart as basis of the new theory of early evolution: predictions and confirmations 1. Oldest proteins were glycine-rich. Glycine clock. 2. Alanine- and Glycine-family amino acids. Binary code. Substitutions keep the code. 3. The earliest mini-proteins had the size of 6-7 amino acids. 4. The earliest mini-genes had the size of 18-21 bases. 5. The earliest mRNA were duplexes, coding in both strands. 6. The most conserved protein sequence motifs are enriched by early amino acids (see below). Protein modules (closed loops) Polymer statistics of polypeptide chains The chain returns to itself with optimal loop closure size of 3-4 persistence lengths (Shimada and Yamakawa). Persistence length of mixed sequence polypeptides is ~5 amino acid residues (Flory). Natural closed loops are expected to be 15-20 residues (non-structured) and 25-35 residues long (α-helix containing loops). AEF79858 58AE3A00 D7F5B00E Beta-galactosidase OUT-OF-CONTEXT SEQUENCES I, II and III original seq. ACC GCU AUA CAG AUG UGU CAU ACC GCC CAU GAC GGC ACU UGC AAU GCA CGU UUA I A G A C A U C A G C G G A U A G C U II C C U A U G A C C A A G C G A C G U III C U A G G U U C C U C C U C U A U A original seq. ACCGCUAUACAGAUGUGUCAUACCGCCCAUGACGGCACUUGCAAUGCACGUUUA I AGACAUCAGCGGAUAGCU II CCUAUGACCAAGCGACGU III CUAGGUUCCUCCUCUAUA A. 
Rapoport, 2008

               Position I                 Position II                Position III
      Natural  Random  Ratio     Natural  Random  Ratio     Natural  Random  Ratio
Bradyrhizobium japonicum
Y5      29757   26041   1.14      157363  146121   1.08      214525  150012   1.43
Y6      12846   10460   1.23       95764   83157   1.15      135458   84731   1.6
Y7       5616    4213   1.33       60556   47624   1.27       85807   47918   1.79
Y8       2499    1700   1.47       39758   27455   1.45       54740   27139   2.02
Y9       1166     687   1.7        26915   15938   1.69       35100   15396   2.28
Chromobacterium violaceum
Y5      22413   18361   1.22       70680   62766   1.13      104311   60872   1.71
Y6      10443    7910   1.32       41858   34333   1.22       65390   33047   1.98
Y7       4894    3431   1.43       25831   18923   1.37       41265   18046   2.29
Y8       2358    1498   1.57       16602   10514   1.58       26237    9918   2.65
Y9       1207     658   1.84       10904    5891   1.85       16775    5488   3.06
Thermotoga maritima
Y5       3285    2783   1.18       26752   23210   1.15       20941   15676   1.34
Y6       1246     992   1.26       16412   12540   1.31       10960    7656   1.43
Y7        470     358   1.31       10659    6862   1.55        5755    3751   1.53
Y8        177     131   1.35        7329    3806   1.93        3105    1843   1.68
Y9         61      48   1.27        5216    2139   2.44        1688     909   1.86
Methanosarcina acetivorans
Y5       9255    8316   1.11       61310   54328   1.13       60914   56666   1.07
Y6       3780    3143   1.2        36752   29118   1.26       33395   30070   1.11
Y7       1676    1221   1.37       23284   15797   1.47       18493   16031   1.15
Y8        846     490   1.72       15559    8682   1.79       10343    8592   1.2
Y9        444     204   2.18       10759    4837   2.22        5806    4634   1.25
Sulfolobus solfataricus
Y5       6380    4193   1.52       43090   36761   1.17       21356   18400   1.16
Y6       2783    1529   1.82       26790   20511   1.31       10867    8693   1.25
Y7       1220     568   2.15       17416   11632   1.5         5553    4130   1.34
Y8        556     214   2.6        11810    6704   1.76        2834    1974   1.44
Y9        250      81   3.1         8212    3922   2.09        1457     949   1.53

Pyrimidine clusters in different codon positions. The highest ratios are in red.

Pyrimidines of the 2nd and 3rd codon positions cluster at a distance of 25-30 triplets.

Levinthal paradox: $t = n^{L} \cdot \tau = 3^{150} \cdot 10^{-12}$ s $= 10^{48}$ yrs ($L = 150$ residues)
Solution: $t = n^{L} \cdot \tau = 3^{23\ \mathrm{to}\ 31} \cdot 10^{-12}$ s $= 0.1$ to $1000$ sec ($L = 23$ to $31$ residues)
Berezovsky, ENT, 2002

Hullabaloo around Levinthal
Berezovsky, I. N., Trifonov, E. N., Loop fold structure of proteins: Resolution of Levinthal’s paradox, J. Biomolec.
Str. Dyn. 20, 5-6 (2002)
Finkelstein A. V., Cunning simplicity of a hierarchical folding, J. Biomolec. Str. Dyn. 20, 311-313 (2002)
Berezovsky, I. N., Trifonov, E. N., Back to units of protein folding, J. Biomolec. Str. Dyn. 20, 315-316 (2002)
Grosberg, A., A few disconnected notes related to Levinthal paradox, J. Biomolec. Str. Dyn. 20, 317-321 (2002)
Kloczkowski, A., Jernigan, R. L., Loop folds in proteins and evolutionary conservation of folding nuclei, J. Biomolec. Str. Dyn. 20, 323-325 (2002)
Rooman M., Dehouck, Y., Kwasigroch, J. M., Biot, C., Gilis, D., What is paradoxical about Levinthal paradox? J. Biomolec. Str. Dyn. 20, 327-329 (2002)
Fernandez, A., Belinky, A., de las Mercedes Boland, M., Protein folding: where is the paradox? J. Biomolec. Str. Dyn. 20, 331-332 (2002)

1. Zwanzig R et al. Proc Natl Acad Sci USA 1992
2. Ngo JT et al. Prot Fold Probl & Tert Str Pred 1994
3. Finkelstein AV et al. Prog Biop MB 1996
4. Karplus M. Folding and Design 1997
5. Dill K, Chan HS. Nature Struct Biol 1997
6. Durup J. J Molec Struct 1998
7. Honig B. J Molec Biol 1999
8. Clote P. Proc ICAL 1999
9. Moret A et al. Phys Rev E 2001
10. Berezovsky IN, Trifonov EN. JBSD 2002
11. Rooman M et al. JBSD 2002
12. Grosberg A. JBSD 2002
13. Bai Y. Bioph Bioch Res Comm 2003
14. Kaya H, Chan HS. Proteins SFG 2003

2004-2011: no papers on Levinthal’s paradox!
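The Levinthal estimate quoted above, t = n^L · τ with n = 3 conformations per residue and τ = 10^-12 s per conformation, can be checked with a few lines of Python. This is only an illustrative sketch: the function name and the Julian-year conversion are assumptions made here, not part of the original analysis. Direct arithmetic with these inputs gives about 0.1 s for L = 23, several hundred seconds for L = 31, and a time on the order of 10^52 years for L = 150, the same astronomically large regime as the rounded figure quoted for whole-chain search.

```python
import math

# Assumed convention for converting seconds to years (Julian year).
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_time(length, n_states=3, tau=1e-12):
    """Seconds needed to sample n_states**length conformations at tau seconds each.

    n_states = 3 and tau = 1e-12 s are the values used in the estimate above.
    """
    return float(n_states) ** length * tau


if __name__ == "__main__":
    for L in (23, 31, 150):
        t = exhaustive_search_time(L)
        log_years = math.log10(t / SECONDS_PER_YEAR)
        print(f"L = {L:3d}: t = {t:.3g} s (about 10^{log_years:.1f} years)")
```

The contrast between the loop-sized chains (23 to 31 residues) and the full 150-residue chain is the whole point of the argument: the search time grows exponentially with L, so splitting the chain into closed-loop units brings it down to seconds.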
α/β Sandwich Trefoil Doubly Wound Jelly Roll TATA binding protein

Cytochrome C Cytochrome 256b

TIM barrel protein
Generic closed loop of TIM barrel proteins
ILLLGIGSPEEVRELARAAKEAGADALI
aaaaaaggggggggaaaaaggggggaaa
Examples of TIM barrel proteins

First five presumably ancient sequence prototypes identified
Aleph  GEIVALVGPSGSGKSTLLRALAGLLKPTSG
Beth   LSGGQRQRVAIARALALEPKLLLLDEPTSALD
Gimel  DVIVVGAGPAGLAAALVLARAGAKVLVIE
Dalet  RRGIGMVFQNYALFPHLTVLENVALGL
Heh    PVIILTARDDEEDRVEGLELGADDYLTKPF

Histidine permease
Aleph Dalet Beth Vav Zayin
Aleph Beth Dalet Zayin Vav
Vav in PDB crystals
Zayin in PDB crystals

Seven prototypes
Aleph  GEIVALVGPSGSGKSTLLRALAGLLKPDGG
Beth   LSGGQRQRVAIARALALEPKLLLLDEPTSALD
Gimel  DVIVVGAGPAGLAAALVLARAGAKVLVIE
Dalet  RRRIGMVFQNYALFPHLTVLENVALGL
Heh    PVIILTARDDEEDRVEGLELGADDYLTKPF
Vav    VLGLSKEEARERALKLLAKVGLDERADGKP
Zayin  LLKKLQKELGLTILLVTHDLGEA

THE EARLIEST STEPS OF LIFE
• 0. Heptapeptides GGGGGGG and AAAAAAA encoded in RNA duplexes of 21 bp.
• 1. "Complementary" heptapeptides of Gly- and Ala-alphabets. Some encoded by hairpins.
• 2. The peptides fuse into closed loops of ~28 aa, by end-ligation of the alternating minigenes for all-Gly and all-Ala fragments.
• 3. The closed loops develop into standard sequence/structure/function prototype modules.

The terms “closed loop” and “lock” were first introduced by E. Haas. He did not notice that the closed loops have an almost standard size.

Preferred distance between hydrophobic triplets
VAI-EVL SGG-SAL GIG-GLG VIG-GVG GGG-LGG ALN-LAE

Two related sequences, aligned, 33% match
Q816J5
DVNLPKFDGFYWCRQIRHESTCPIIFISARAGEMEQIMAIESGADDYITKPFHYDVVMAKIKGQLRR
|||||-|||----|--|--|----------------------||||---|||------|-----|||
DVNLPGIDGWDLLRRLRERSSARVMMLTGHGRLTDKVRGLDLGADDFMVKPFQFPELLARVRSLLRR
Q7DCC5

CPIIFISARAGEMEQIMAIE Q816J5 Two-component response regulator B.
cereus
-||||||||---|-|-||||
VPIIFISARDSDMDQVMAIE Q97IX4 Response regulator C. acetobutylicum
||-|||||||-|-|-|---|
VPVIFISARDADIDRVLGLE O32192 Transcr. regulatory protein cssR B. subtilis
||--|-||||--||||||||
VPILFLSARDEEIDRVLGLE Q89D26 Two-component response regulator B. japonicum
-||--|-||-||-|-|||||
IPIIMLTARSEEFDKVLGLE Q8R9H7 Response regulators Th. tengcongensis
--|-||||||---|||-|||
SRIMMLTARSRLADKVRGLE Q88RT2 heavy metal response regulator Ps. putida
-|-||||---||-||||||-
ARVMMLTGHGRLTDKVRGLD Q7DCC5 Two-component response regulator Ps. aeruginosa

Q816J5 Two-component response regulator
DVNLPKFDGFYWCRQIRHESTCPIIFISARAGEMEQIMAIESGADDYITKPFHYDVVMAKIKGQLRR
|||||-|||----|--|--|----------------------||||---|||------|-----|||
DVNLPGIDGWDLLRRLRERSSARVMMLTGHGRLTDKVRGLDLGADDFMVKPFQFPELLARVRSLLRR
Q7DCC5 Probable two-component response regulator

No-match relatives
LEVALALSQADIIVRDALVS Q8UBQ7 Uroporphyrin-III C-methyltransferase A. tumefaciens
|--|-||-|||-||-||||-
LHAANALRQADVIVHDALVN Q92P47 probable Uroporphyrin-III C-methyltransferase Rh. meliloti
|-|---|--||||||||||-
LRAQRVLMEADVIVHDALVP Q8YEV9 Uroporphyrin-III C-methyltransferase B. melitensis
|||-|-||||||||||||||
LRAHRLLMEADVIVHDALVP Q98GP6 Siroheme synthase (precorrin methyltransferase) Rh. loti
|---|||-|||||-------
LKGQRLLQEADVILYADSLV Q8DLD2 Precorrin-4 C11-methyltransferase S. elongatus
-||||---|||||-||-|||
IKGQRIVKEADVIIYAGSLV Q8REX7 Precorrin-4 C11-methyltransferase F. nucleatum
-||||------|||||||||
VKGQRLIRQCPVIIYAGSLV Q88HF0 Precorrin-4 C11-methyltransferase Ps. putida
|-|--||--|||--||||||
VRGRDLIAACPVCLYAGSLV Q8UBQ5 Precorrin-4 C11-methyltransferase A. tumefaciens

Q8UBQ7 methyltransferase
HVWLAGAGPGDVRYLTLEVALALSQADIIVRDALVS
-|---|||||-----|--------------------
TVHFIGAGPGAADLITVRGRDLIAACPVCLYAGSLV
Q8UBQ5 methyltransferase

No-match relatives
To be related, sequences do not have to be similar (up to complete mismatch).
The most advanced existing sequence alignment techniques (e. g.
BLAST) would not be able to qualify such fully dissimilar sequences as relatives unless many intermediate sequences are analyzed (which amounts to a whole research project).

One can make long walks from fragment to fragment in the formatted protein sequence space (sequence fragments of the same length, 20 residues, gathered from all or many proteomes).
Pair-wise connected matching fragments also make networks.

WALK
NETWORK
Frenkel, 2006

Network of GTP binding proteins
Sequence fragments with the same function are found in the same network.

1 Putative peptidoglycan bound protein
2 Collagen adhesion protein
3 Ribosomal protein L11
4 Penicillin-binding protein 2x
5 Penicillin-binding protein 1
6 Penicillin binding protein 2A
7 D-alanyl-D-alanine carboxypeptidase
8 cytochrome
9 Beta-Lactamase
10 Mannitol-1-phosphate 5-dehydrogenase
11 glutaminase
12 Beta-lactamase
13 Esterase EstB

Fragments of the same network have, essentially, the same structure. Peripheral fragments may be different.

What are the protein modules:
Their sequences are represented by networks in the protein sequence space: a separate network (or group of related networks) for each module.
Each module has its own unique structure. Typically, these are closed loops with a contour length of 25-30 residues.
Apart from the general activity ascribed to the protein that harbors a given module, each module type has its own specific function.
Individual modules, even of the same type, are often different sequence-wise.
Their evolution from ancestral prototypes may be traced along walks and networks in the sequence space.
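The idea of walks and networks in the formatted sequence space can be sketched in a few lines of Python. This is a minimal illustration, not the published procedure: the function names are hypothetical, the 0.6 match threshold is an assumed parameter, and the seven 20-residue fragments are the response-regulator walk shown earlier. Fragments are nodes, two fragments are connected when their fraction of identical positions exceeds the threshold, and a walk is a path found by breadth-first search.

```python
from collections import deque
from itertools import combinations


def match_fraction(a, b):
    """Fraction of identical positions between two equal-length fragments."""
    if len(a) != len(b):
        raise ValueError("fragments must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def build_network(fragments, threshold=0.6):
    """Connect every pair of fragments matching above the (assumed) threshold."""
    graph = {fragment: set() for fragment in fragments}
    for a, b in combinations(graph, 2):
        if match_fraction(a, b) > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def find_walk(graph, start, goal):
    """Breadth-first search for a shortest walk from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            walk = []
            while node is not None:
                walk.append(node)
                node = previous[node]
            return walk[::-1]
        for neighbour in graph[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# The response-regulator walk from the text (Q816J5 ... Q7DCC5).
FRAGMENTS = [
    "CPIIFISARAGEMEQIMAIE",
    "VPIIFISARDSDMDQVMAIE",
    "VPVIFISARDADIDRVLGLE",
    "VPILFLSARDEEIDRVLGLE",
    "IPIIMLTARSEEFDKVLGLE",
    "SRIMMLTARSRLADKVRGLE",
    "ARVMMLTGHGRLTDKVRGLD",
]

network = build_network(FRAGMENTS)
walk = find_walk(network, FRAGMENTS[0], FRAGMENTS[-1])

if __name__ == "__main__":
    print("end-to-end match:", match_fraction(FRAGMENTS[0], FRAGMENTS[-1]))
    print("walk length:", len(walk))
```

The two end fragments share no identical position, yet they are joined by a chain of closely matching intermediates, which is what "no-match relatives" means here. On a real sequence space the all-pairs comparison would need indexing; the quadratic loop is kept only for clarity.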
Omnipresent oligopeptides GHVDHGKT 131 SGSGKSTL 125 LSGGQQQR 125 GPPGTGKT 122 KMSKSLGN 121 LRPGRFDR 119 QRVAIARA 119 DEPTSALD 119 SIGEPGTQ 117 SGGLHGVG 117 VEGDSAGG 116 GLPNVGKS 116 DEPSIGLH 115 DLGGGTFD 115 GPNGAGKS 114 GIDLGTTN 113 VITVPAYF 113 LNRAPTLH 113 NADFDGDQ 113 NLLGKRVD 113 AGDGTTTA 112 GPTGVGKT 112 GIAVGMAT 112 GFDYLRDN 112 ERERGITI 111 KPNSALRK 111 NMITGAAQ 111 SHRSGETE 110 MAGRGTDI 110 IIFIDEID 110 GGTVGDIE 110 KFSTYATW 109 DEARTPLI 108 HHNVGGLP 108 GHNLQEHS 107 GGRVKDLP 107 LPDKAIDL 107 NPRSTVGT 107 NEKRMLQE 106 CPIETPEG 106 NPETVSTD 106 LEYRGYDS 106 SRSSALAS 106 HTRWATHG 106 DEREQTLN 105 DVSGEGVQ 105 GPSGCGKS 105 KTKPTQHS 105 DHPHGGGE 105 GRFRQNLL 105 AGRHGNKG 104 PRSNPATY 104 MTDADVDG 104 LTEAGYVG 104 INGFGRIG 104 TQQPLGGK 104 PIGRTPRS 104 LPGKLADC 104 GDEGGFAP 104 ERHRHRYE 103 RYKGLGEM 103 ATPIPRTL 103 AVKAPGFG 103 ATWWIRQA 103 GTQLTMRT 102 EPTAAALA 102 TLHRLGIQ 102 NIIDTPGH 102 SYYDYYQP 101 EMFVGVGA 101 LFGGAGVG 101 TGRTHQIR 101 PESSGKTT 101 KPETINYR 101 RERIRQIE 101 GQRFGEME 100 GVQQALLK 100 PSAVGYQP 100 EPTTALDV 99 QLSQFMDQ 99 SRQLWWGH 99 DVLDTWFS 99 ADKEGFLR 99 AHIDAGKT 99 VRKRPGMY 99 GYLTRRLV 98 AAQMDGAI 98 GVGERTRE 98 NVISITDG 98 GGITQHIG 98 NMQRQAVP 97 RIDNQLRG 97 DCPGHADY 97 EMEVWALE 97 GPGSICTT 97 GLTGRKII 97 VDYSGRSV 96 NPLGVPSR 96 SAASFQET 96 VPSGASTG 96 SSDSQAMG 30 LRQDPDII 30 TGGEPLLR 30 SGVSGAGR 30 PAMREGSG 30 QASRISGV 30 TSMGFTPL 30 GHRELPIR 30 LNVFPVPD 30 AFANAFLG 30 LLKILEGT 30 AYLFSGPR 30 LLTFFYRY 30 MLLRGQNL 30 DTALKTAD 30 GQLTEKVR 30 ASDMSGWL 30 DNHYVPNL 30 FPFIFRGA 30 PVGFKNGT 30 EDWGRRQL 30 DASAERSA 30 IGHTQPRR 30 AINAPMQG 30 ETDSPYLA 30 KQFDVTRE 30 GREQILKV 30 DVAGCDEA 30 AGANSIFY 30 MAGLQGAG 30 KGPAVRAT 30 ATHYFELT 30 GSKVSTKL 30 RALWRATG 30 GMPESFNV 30 KISVDSAT 30 GGVQPQSE 30 GYMYMLKL 30 GRIVEIYG 30 ALTPKAEI 30 GDLKYGRT 30 TNGDTHLG 30 ASSSSVYG 30 QTIISGMG 30 ILHVSAKD 30 AYIRFASV 30 GYNFEDSI 30 RTTDVTGV 30 WDDPRMPT 30 AYLKISEG 30 TGNTVIDA 30 GAIEQDAD 30 VNAQQARR 30 HDVKAVEY 30 LTDSTVLR 30 NVVMMGMG 30 VQIPCIER 30 WREPGCSM 30 
GHEQYTRN 30 TGYITEGQ 30 KATKVDGV 30 TESFISAA 30 RRLPKRGF 30 AYSARNRS 30 SHEIRTPM 30 GKSPNIFF 30 EIWNLVFM 30 NVNDSVTK 30 GTAAGPHP 30 SVKVPDPK 30 FWAEWCGP 30 GLPGNPVS 30 CRNVLIYF 30 FLTGITEP 30 GIEYGDMQ 30 GAIGTGLF 30 AVMGCVVN 30 RRLLWPIK 30 DAANILKP 30 RISLGIKQ 30 DYVGSWGP 30 LVKTMRAS 30 GDVSAFVP 30 KPIVVINK 30 FPDLNTGN 30 GPVKDYEC 30 DPHNLGAC 30 LEEVGKQF 30 EADESDAS 30 GGGIANTF 30 ALIIDSWF 30 NAGSFFKN 30 IATDHAPH 30 RAGTKAGN 30 IAGNWKMN 30 NAGMNQFK 30 HGTgccLS 30 GTSHGAYK 30 TEETTTGV 30 LGIFLPLI 30

Omnipresent and frequent motifs
Less frequent motifs

MOST COMMON PROTEIN SEQUENCE MODULES (PROTOTYPES)
Aleph  GEIVLLVGPSGSGKTTLLRALAGLLGPDGG
Beth   LSGGQRQRVAIARALALEPKLLLLDEPTSALD
Gimel  DVVVIGAGGAGLAAALALARAGAKVVVVE
Dalet  RRGIGMVFQEYALFPHLTVLENVALGL
Heh    PVIMLTARGDEEDRVEALLEAGADDYLTKPF
Vav    LLGLSKKEARERALELLELVGLEEKADRYP
Zayin  LLLKLLKELGLTVLLVTHDLEEA
Berezovsky et al. 2000-2003
The underlined motifs are omnipresent

KVALVGRSGSGKTTVTSLLM
FIAVEGIDGAGKTTLAKSLS
GxxxxGKT - Walker A motif (NTP binding)

Phylogenetically diverse prokaryotes used for calculation of the omnipresent motifs
Bradyrhizobium japonicum
Streptomyces coelicolor
Rhodopirellula baltica
Bacillus cereus
Bacteroides thetaiotaomicron
Gloeobacter violaceus
Treponema denticola
Thermus thermophilus
Fusobacterium nucleatum
Thermotoga maritima
Aquifex aeolicus
Chlamydophila pneumoniae
Methanosarcina acetivorans
Nanoarchaeum equitans
Sulfolobus solfataricus

sequences   NATURAL  SHUFFLE1  SHUFFLE2  SHUFFLE3
Tetramers     36593     40553     40485     40652
Pentamers      2326      1554      1442      1527
Hexamers         46         0         0         0
Heptamers        21         0         0         0
Octamers          9         0         0         0
Nonamers          3         0         0         0

Omnipresent 6-9 mers of 15 prokaryotes from different phyla
ALEPH ATP/GTP binding
1 HVDHGKTTL 2 GPPGTGKT 3 GHVDHGKT 4 GSGKTTLL 5 IDTPGHV 6 GPSGSGK 7 PTGSGKT 8 NGSGKTT 9 GKSTLLN 10 SGSGKT 11 TGSGKS 12 PGVGKT 13 PNVGKS 14 GVGKTT 15 GTGKTT 16 DHGKST 17 GKTTLA 18 GKTTLV 19 KSTLLK
BETH ATPases of ABC transporters
20 QRVAIARAL 21 LSGGQQQRV 22
LADEPT 23 TLSGGE
Other omni:
24 FIDEID 25 KMSKSL 26 WTTTPWT 27 NADFDGD

Omnipresence is a new measure of sequence conservation. These elements are the most conserved ones, presumably coming from the last common ancestor.

EVOLUTIONARY ELITE (OMNIPRESENT 6- to 9-MERS)
HVDHGKTTL  Aleph
LSGGQQQRV  Beth
QRVAIARAL  Beth
GHVDHGKT   Aleph
GPPGTGKT   Aleph
GSGKTTLL   Aleph
GKSTLLN    Aleph
GPPGTGK    Aleph
GPSGSGK    Aleph
IDTPGHV    Dalet
NADFDGD
NGSGKTT    Aleph
PTGSGKT    Aleph
WTTTPWT
DHGKST     Aleph
FIDEID
GKTTLA     Aleph
GKTTLV     Aleph
GTGKTT     Aleph
GVGKTT     Aleph
KMSKSL
KSTLLK     Aleph
LADEPT     Beth
PGVGKT     Aleph
PNVGKS     Aleph
SGSGKT     Aleph
TGSGKS     Aleph
TLSGGE     Beth

Functional involvement of the most conserved octamers present in all (131) or almost all (125 and less) prokaryotic proteomes.
    octamer   number of genomes   protein function
1.  GHVDHGKT  131  ● ■ initiation and elongation factors
2.  SGSGKSTL  125  ● ■ ABC transporter family proteins
3.  LSGGQQQR  125  ● ■ ABC cassettes, transporters
4.  GPPGTGKT  122  ● cell division proteins
5.  KMSKSLGN  121  aa-tRNA synthetases class I
6.  QRVAIARA  119  ● ■ ABC cassettes, transporters
7.  DEPTSALD  119  ● ■ ABC cassettes, transporters
8.  LRPGRFDR  119  cell division proteins
9.  SIGEPGTQ  117  DNA-directed RNA polymerases
10. SGGLHGVG  117  topoisomerases
11. VEGDSAGG  116  topoisomerases
12. GLPNVGKS  116  ● GTP/ATP binding proteins
13. DEPSIGLH  115  ■ excinuclease ABC (UvrA)
14. DLGGGTFD  115  chaperones (heat shock) proteins
15. GPNGAGKS  114  ● ■ ABC transporters
16. GIDLGTTN  113  chaperones
17. VITVPAYF  113  ■ ATPase of heat shock protein 70
18. LNRAPTLH  113  RNA polymerase beta' subunit
19. NADFDGDQ  113  RNA polymerase beta' subunit
20. NLLGKRVD  113  RNA polymerase beta' subunit
21. AGDGTTTA  112  chaperonin GroEL
22. GPTGVGKT  112  ● chaperone ClpB
23. GIAVGMAT  112  DNA gyrase subunit A
24. GFDYLRDN  112  preprotein translocase secA subunit
25. ERERGITI  111  ● GTP-binding protein lepA
26. KPNSALRK  111  30S ribosomal protein S12
27. NMITGAAQ  111  elongation factor TU
28. SHRSGETE  110  enolase (phosphopyruvate hydratase)
29.
MAGRGTDI  110  preprotein translocase secA subunit
30. IIFIDEID  110  cell division protein FtsH
31. GGTVGDIE  110  CTP synthase
32. KFSTYATW  109  RNA polymerase sigma factor rpoD
33. DEARTPLI  108  preprotein translocase secA subunit
34. HHNVGGLP  108  GMP synthase
35. GHNLQEHS  107  30S ribosomal protein S12
36. GGRVKDLP  107  30S ribosomal protein S12
37. LPDKAIDL  107  chaperone ClpB
38. NPRSTVGT  107  ■ excinuclease ABC subunit A
39. NEKRMLQE  106  DNA-directed RNA polymerase beta' chain
40. CPIETPEG  106  DNA-directed RNA polymerase beta chain
41. NPETVSTD  106  carbamoyl-phosphate synthase large chain
42. LEYRGYDS  106  glucosamine-fructose-6-phosphate aminotransferase
43. SRSSALAS  106  carbamoyl-phosphate synthase large chain
44. HTRWATHG  106  glucosamine-fructose-6-phosphate aminotransferase
45. DEREQTLN  105  cell division protein FtsH
46. DVSGEGVQ  105  ● Clp protease ATP-binding subunit clpX
47. GPSGCGKS  105  ● phosphate import ATP-binding protein pstB
48. KTKPTQHS  105  CTP synthase

Motifs involved in elementary syntheses appear late.

Many of the 27 omnipresent elements do not match one another (e. g. WTTTPWT and QRVAIARAL), yet they turn out to belong to the same network.

Major nuclei in sequence space (10% Monster)
LSGGQRQRVAIARALALDPD  3753  60%
+++++++++++++++++-+-
LSGGQRQRVAIARALALEPKLLLLDEPTSALD  Beth

GEFVAIVGPSGCGKSTLLRL  3043  60%
++-+--+++++-++-++++-
GEIVLLVGPSGSGKTTLLRALAGLLGPDGG  Aleph

All 20 aa fragments of all proteins of prokaryotes make a sequence space.
Those fragments that are close relatives (matching >60%) are pair-wise connected.
This makes networks that allow tracing the evolutionary relatedness of protein sequence motifs.

10% MONSTER network (107 fragments)

Sequence space based evolutionary tree of omnipresent elements
All omnipresent elements are relatives! They belong to the same 60% match network.

Contradiction: According to the Evolutionary Chart of Codons, the first genes were complementary, residing in one duplex.
That implies that there were two starting genes, each with its separate evolutionary tree. We, however, observe only one tree.
One possible solution: two identical self-complementary genes (like restriction sites).

RECONSTRUCTION OF COMMON PROTOTYPE OF OMNIPRESENT ELEMENTS. ALIGNMENT OF FOUR GROUPS.
AGAAGGAGGGGAAAAG    Aleph
AASGGGGGGAAAAGAA    Beth
GAAGSGGAAAA         rest of Aleph
GAAAGGAA            rest of omni
--------------------
AGAAGGAGGGGAAAAGAA  common prototype

The above-mentioned example of no match:
 GAAAAGA
 WTTTPWT
GGAAAAGAA
QRVAIARAL
This is, apparently, why the omnipresent elements belong to one common network of relatives.

A G AA GG A GGGG AAAA G AA   prototype
| | || || | |||| |||| | ||
I D TP GH V DHGK TTLL N      Aleph
    || *| * |||| |||| | ||
    TL SG G QQQR VAIA R AL   Beth

AGAAGGAGGGGAAAAG
  ++-+-+++++++++
  AASGGGGGGAAAAGAA
In binary form ALEPH and BETH are rather similar. Compare to
IDTPGHVDHGKTTLLN
  +
  TLSGGQQQRVAIARAL

Symmetry properties of common prototype
AGAAGGAGGGGAAAAGAA
AGAA|GGAGGGG|AAAAGAA
AAAAGAA GGAGGGG AAAAGAA
This is blunt-end fusion of the same element GGAGGGG
← ← →

OMNIPRESENT ELEMENTS
RECONSTRUCTION OF ALEPH AND BETH
ALEPH: IDTPGHVDHGKTTLL[n/k]
BETH: TLSGG[q/e]QQRVAIARAL
COMMON BINARY PROTOTYPE OF ALEPH AND BETH
AGAAGGAGGGGAAAAGAA
↓ ↓
AAAAAAA | GGGGGGG | AAAAAAA
AGAA | GGAGGGG | AAAAGAA

from first amino acids to first protein modules
EVOLUTIONARY CHART OF CODONS
↑
BINARY ALPHABET
↑
FIRST PEPTIDES: GGGGGGG & AAAAAAA
↑
BINARY MOSAIC: AAAAAAAGGGGGGGAAAAAAA (Alanine and Glycine only)
TWO RECONSTRUCTIONS MEET
GGAGGGG first mixed alphabet minigene
↑
AAAAGAA GGAGGGG AAAAGAA (fusion of three minigenes)
↑
ALEPH: IDTPGHVDHGKTTLLN (ATP binding P-loop)
BETH: TLSGGQQQRVAIARAL (ATPases of ABC transporters, signature loop)

According to the same theory (reconstruction of the evolutionary history of the triplet code), the earliest proteins were encoded in both strands of the gene duplexes, so that the xYx codons of one strand would be complementary to the xRx codons of the other strand.
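The binary rewriting and the strand complementarity used in this reconstruction can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, the two alphabets are taken from the rearranged substitution matrices shown earlier (A F I L M P T V and C D E G H K N Q R W Y), and serine, which appears in neither alphabet there, is passed through unchanged. Reading the complementary strand in reverse order is an assumption that follows from the strands of a duplex being antiparallel.

```python
# Alphabets as listed in the rearranged substitution matrices.
ALA_ALPHABET = set("AFILMPTV")     # amino acids of xYx codons
GLY_ALPHABET = set("CDEGHKNQRWY")  # amino acids of xRx codons


def to_binary(sequence):
    """Rewrite a protein sequence in the two-letter (A/G) alphabet.

    Residues outside both alphabets (e.g. S) are kept as they are.
    """
    out = []
    for residue in sequence.upper():
        if residue in ALA_ALPHABET:
            out.append("A")
        elif residue in GLY_ALPHABET:
            out.append("G")
        else:
            out.append(residue)
    return "".join(out)


def complementary_binary(binary):
    """Binary peptide encoded by the opposite strand of a duplex gene.

    xYx codons pair with xRx codons, so A and G swap; the opposite strand
    is read in the reverse direction (assumption: antiparallel strands).
    """
    swap = {"A": "G", "G": "A"}
    return "".join(swap.get(letter, letter) for letter in reversed(binary))


def reverse_complement(rna):
    """Reverse complement of an RNA string over A, C, G, U."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna.upper()))


if __name__ == "__main__":
    print(to_binary("AFLIIMVRKREDQNFFVTAMAQQNEDGR"))
    print(to_binary("IDTPGHVDHGKTTLLN"))   # Aleph
    print(to_binary("TLSGGQQQRVAIARAL"))   # Beth
    print(complementary_binary("AAAAGAA"))
    print(reverse_complement("GCC"))       # Ala codon -> Gly codon
```

The Ala codon GCC and the Gly codon GGC are reverse complements of each other, which is the simplest instance of the xYx/xRx pairing; at the binary level the same relation turns AAAAGAA into GGAGGGG.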
Remarkably, the above ALEPH and BETH are, indeed, complementary:
ALEPH AGAAGGAGGGGAAAAG
|||||||||||-

Gimel→ GAAAAGAGGAGAAAAAAAAGAGAGAAAAG
•• •••••••••••••• • • •
AAGAAGGGAGAGAAGAGGGGGGGAAAAAAA ←Heh

Zayin→ AAAGAAGGAGAAAAAAAGGAGGA
•• ••• ••••• ••••••
AAGAAGAAAAGGGGAGGAAGAGAGG ←Chet

Aleph→ GGAAAAAGAAGAGGAAAAGAAAGAAGAGGG
•••• •• •• • •••••••• ••
AGGGAGGGAGAAGAAGAAGGGAGGGAAGAA ←Vav

Beth→ AAGGGGGGAAAAGAAAAGAGAAAAGGAAAAAG
• •• •• • • • •••• ••• •••
AGAAAGGAAAAGAAAAGGGAAAGAGGG ←Dalet

All 27 omnipresent LUCA motifs originate from one prototype sequence, which is:
Ala Ala Ala Ala Gly Ala Ala Gly Gly Ala Gly Gly Gly Gly
encoded in
GCC GCC GCC GCC GGC GCC GCC GGC GGC GCC GGC GGC GGC GGC
which is self-complementary:
GCC GCC GCC GCC GGC GCC GCC GGC GGC GCC GGC GGC GGC GGC
(as expected)
The very first gene was a short duplex, encoding the same thing in both strands.

ENZYMATIC REPERTOIRE OF LUCA
Omnipresent cassette of ABC transporters
Omnipresent cassette of Proteases (cell division protein FtsH, zinc-dependent metalloprotease)
(146-463)LLVGPPGTGKTLLARAVAGEA (7) SGSDFVEMFVGVGASRVRD (9) PCIIFIDEIDAVGR(7-11)DEREQTLNQLLVEMDGF consensus (cont.)
(191) LLyGePGvGKTLLAkAiAGEA (7) SGSDFVEMFVGVGAaRVRD (9) PCIIFIDEIDAVGR (10) DEREQTLNQLLVEMDGF O67077 - Aae (198) LLVGPPGTGKTLLARAVAGEA (7) SGSDFVEMFVGVGASRVRD (9) PCIIFIDEIDAVGR (11) DEREQTLNQLLVEMDGF Q81J82 - Bce (192) LLVGPPGTGKTLiARAVAGEA (7) SGSDFVEMFVGVGASRVRD (9) PCIIFIDEIDAVGR (11) DEREQTLNQLLVEMDGF Q9XBG5 - Bja (213) LLVGPPGTGKTLLAkAVAGEA (7) aGSDFVEMFVGVGASRVRD (9) PCIvFIDEIDAVGR (10) DEREnTLNQLLtEMDGF Q8A0L4 - Bth (463) LLiGPPGTGKTLiAkAVsGEA (7) aGSDFVEMFVGVGASRiRD (9) PCIIFIDEIDAVGR (11) DEREQTLNQLLVEMDGF Q9Z6R1 - Cpn (309) LLlGePGTGKTLLAkAVAGEA (7) SGSeFVEMFVGVGASRVRD (9) PCIvFIDEIDAVGR (11) DEREQTLNQLLVEMDGF Q8R6D4 - Fnu (210) LLVGPPGTGKTLLAkAiAGEA (7) SGSeFVEMFVGVGASRVRD (9) PCIvFIDEIDAVGR (11) DEREQTLNQLLVEMDGF Q7NHF9 - Gvi (233) LLnGPPGTGKTLLARAVAGEA (7) nGSeFiqMFVGVGASRVRD (9) PsIIFIDEIDAVGR (11) DEREQTLNQILgEMDGF Q7UUZ7 – Rba (239) LLtGPPGTGKTLLARAVAGEA (7) SaSeFiEMiVGVGASRVRe (9) PsIIFIDEIDtiGR (10) DEREQTLNQILtEMDGF O69875 - Sco (241) LLVGPPGTGKTLLARAVAGEA (7) SGSDFVEMFVGVGASRVRD (9) PCIIFIDElDAiGk (11) DEREQTLNQLLVEMDGF AAS10965 - Tde (197) LLVGPPGTGKTLLARAVAGEA (7) SGSDFVElFVGVGAaRVRD (9) PCIvFIDEIDAVGR (11) DEREQTLNQLLVEMDGF Q9WZ49 - Tma (192) LLVGPPGvGKThLARAVAGEA (7) SGSDFVEMFVGVGAaRVRD (9) PCIvFIDEIDAVGR (11) DEREQTLNQLLVEMDGF AAS81470 – Tth (213) LLhGPPGTGKTmiAkAVAsEt (7) SGpeiVskyyGeseqklRe (9) PsIIFIDEIDsiap (11) emerrvvaQLLslMDGl Q8THE2 - Mac (146) LLyGPPGTGKTLigkAlAksA (7) vGSelVqkyiGeGAklVke (9) PaIvFIDEIDAiaa (11) rEvqrTfmQLLaEiDGF AAR39040 – Neq (238) LLyGPPGvGKTLLARAlAnEi (7) nGpeimskFyGeseqRlRe (9) PaIIFIDEIDAiap (7) evekrvvaQLLtlMDGi Q97ZZ9 - Sso (8) IAATNRPDxLDPALLRPGRFDRQ (95-415) consensus (8) IAATNRPDILDPALLRPGRFDRQ (314) O67077 - Aae (8) vAATNRPDILDPALLRPGRFDRQ (307) Q81J82 - Bce (8) IAATNRPDvLDPALLRPGRFDRQ (320) Q9XBG5 - Bja (8) lAATNRvDvLDkALLRaGRFDRQ (354) Q8A0L4 - Bth (8) mAATNRPDvLDkALLRPGRFDRr (319) Q9Z6R1 - Cpn (8) lAATNRaDvLDkALrRPGRFDRQ (277) Q8R6D4 - Fnu (8) IAATNRPDvLDaAiLRPGRFDRQ (292) Q7NHF9 - Gvi (8) 
IAATNRPDvLDPALLRPGRFDRh (311) Q7UUZ7 – Rba (8) IAATNRaDILDaALtRPGRFDRv (280) O69875 - Sco (8) lAATNRPDvLDPALLRPGRFDRQ (290) AAS10965 - Tde (8) mAATNRPDILDPALLRPGRFDkk (285) Q9WZ49 - Tma (8) mAATNRPDILDPALLRPGRFDRQ (304) AAS81470 – Tth (8) IAATNRPnsiDeALrRgGRFDRe (415) Q8THE2 - Mac (8) IgATNRlDILDPAiLRPGRFDRi (95) AAR39040 – Neq (8) IgATNRPDavDPALrRPGRFDRe (406) Q97ZZ9 - Sso Omnipresent cassette of Initiation factor 2 (10-546)MGHVDHGKTTLL (11) EAGGITQHIGA(11-29)FIDTPGHEAFT (14) LVVAADDGV (21) INKIDLP(381-458)consensus (313) MGHVDHGKTTLL (11) EkGGITQHIGA (12) FlDTPGHEAFT (14) LVVAADDGV (21) vNKIDKP (384) O67825 - Aae (195) MGHVDHGKTTLL (11) EAGGITQHIGA (11) FlDTPGHaAFT (14) LVVAADDGV (21) vNKmDKP (384) Q812X7 - Bce (345) MGHVDHGKTsLL (11) EAGGITQHIGA (13) FIDTPGHaAFT (14) LVVAADDGV (21) INKIDKP (388) Q89WA9 - Bja (546) MGHVDHGKTsLL (11) EAGGITQHIGA (12) FlDTPGHEAFT (14) iiVAADDnV (21) INKvDKP (386) Q8A2A1 - Bth (342) MGHVDHGKTTLI (11) EAGaITQHmGA (11) ilDTPGHEAFs (14) LVVAgDeGi (21) INKcDKP (381) Q9Z8M1 - Cpn (244) MGHVDHGKTsLL (11) EAGGITQkIGA (11) FIDTPGHEAFT (14) LVVAADDGV (21) vNKIDKP (386) Q8R5Z1 - Fnu (424) MGHVDHGKTsLL (11) EAGGITQHIGA (15) FlDTPGHEAFT (14) LVVAADDGV (21) INKvDKP (390) Q7NH85 - Gvi (536) lGHVDHGKTsLL (11) EAGGITQHIrA (11) FvDTPGHEAFT (14) LVVAADDGi (21) lNKIDle (395) Q7URR0 - Rba (533) MGHVDHGKTrLL (11) EAGGITQHIGA (15) FIDTPGHEAFT (14) LVVAAnDGV (21) vNKIDve (389) Q8CJQ8 - Sco (322) MGHVDHGKTKTL (11) EfGGITQHIGA (11) FlDTPGHEAFT (14) LVVAADDGV (21) vNKvDKP (407) AAS11595 - Tde (185) MGHVDHGKTTLL (11) EeGGITQsIGA (11) FIDTPGHElFT (14) LVVAADDGV (21) INKIDKP (398) Q9WZN3 - Tma (78) MGHVDHGKTTLL (11) EAGGITQHvGA (11) FIDTPGHEAFT (14) iViAADDGi (21) INKIDlP (386) AAS80695 – Tth (20) MGHVDHGKTTLL (11) EAGAITQHIGA (27) FIDTPGHhAFT (14) vVVdineGf (21) aNKIDri (454) Q8TQL5 - Mac (10) lGHVDHGKTTLL (11) EAGGITQHIGA (29) FIDTPGHEAFs (14) vVidineGi (21) aNKIDKi (439) AAR39338 – Neq (17) lGHVDHGKTTLL (11) EpGemTQevGA (29) FIDTPGHEyFs (14) LVVditeGl (21) 
aNKIDKi (458) Q980Q8 – Sso Omnipresent cassette of Aminoacyl-tRNA synthases (class I) (495-671) DQTRGWF(29-84)GRKMSKSLGN(318-467)consensus (585) DQhRGWF (29) GRKMSKSLGN (325) O66651 - Aae (554) DQyRGWF (29) GRKMSKSiGN (321) Q819R4 - Bce (632) DQhRGWF (29) GRKMSKSLGN (324) Q89DF8 - Bja (671) DQTRGWF (29) GnKMSKrLnN (445) Q8A9K9 - Bth (552) DQTRGWF (29) GnKMSKrLnN (445) Q9Z972 - Cpn (568) DQhRGWF (29) GkKMSKSLGN (320) Q8RH47 - Fnu (606) DQhRGWF (29) GRKMSKSLGN (327) Q7NF75 - Gvi (648) DQTRGWF (84) tgKMSKSLrN (464) Q7UNZ2 - Rba (562) DQTRGWF (29) GRKMSKhLGN (440) Q9S2X5 - Sco (587) DQTRGWF (29) GkKMSKSLrN (467) AAS13180 – Tde (555) DQhRGWF (29) GRKMSKSLGN (318) P46213 - Tma (576) DQTRGWF (29) GqKMSKSkGN (445) AAS81050 – Tth (556) DQTRGWF (29) GkKMSKSLGN (455) Q8TN62 - Mac (622) DQiRGWF (29) GRKMSKSLGN (348) AAR39083 – Neq (495) DQlRGWF (29) GReMhKSLGN (445) Q9UXB1 - Sso Elongation factors (5-21)RNIGIMAHIDAGKTTTTERIL(15-19)TMDWMEQEQERGITITSAATT(7-22)INIIDTPGHVDFTVEVERSLRVLDGAVAV consensus (cont.) 
(10) RNIGIVAHIDAGKTTTTERIL (17) TMDWMPQEKERGITITVATTA (11) INIIDTPGHVDFSVEVVRSMKVLDGIVFI (Aae) (10) RNIGIMAHIDAGKTTATERIL (17) QMDWMEQEQERGITITSAATT (7) VNIIDTPGHVDFTVEVERSLRVLDGAVAV (Bce) (10) RNFGIMAHIDAGKTTTTERIL (17) TMDWMEQEQERGITITSAATT (7) LNIIDTPGHVDFTIEVERSLRVLDGAVCV (Bja) (9) RNIGIMAHIDAGKTTTSERIL (17) TMDWMEQEQERGITITSAATT (11) INLIDTPGHVDFTAEVERSLRILDGAVAA (Bth) (11) RNIGIMAHIDAGKTTTTERIL (17) TMDWMAQEQERGITITSAATT (7) INIIDTPGHVDFTIEVERSLRVLDGAVAV (Cpn) (10) RNVGIMAHIDAGKTTTTERIL (17) TMDWMEQEQERGITITSAATT (7) INIIDTPGHVDFTVEVERSLRVLDGAVAV (Fnu) (10) RNIGIAAHIDAGKTTTTERIL (17) VTDWMAQERERGITITAAAIT (22) INIIDTPGHVDFTIEVERSMRVLDGVITV (Gvi) (6) RNIGISAHIDSGKTTLSERIL (19) TMDHMELEKERGITITSAATS (7) INLIDTPGHVDFTVEVERSLRVLDGAVLV (Rba) (11) RNIGIMAHIDAGKTTTTERIL (17) TMDWMEQEQERGITITSAATT (11) INIIDTPGHVDFTVEVERSLRVLDGAVTV (Sco) (5) RNIGIMAHIDAGKTTTTERIL (17) TMDWMAQEQDRGITIQSAATT (7) INIIDTPGHVDFTAEVERSLRVLDGAVAV (Tde) (11) RNIGIMAHIDAGKTTTTERIL (17) TTDWMPQEKERGITIQSAATT (7) INIIDTPGHVDFTAEVERALRVLDGAIAV (Tma) (12) RNIGIAAHIDAGKTTTTERIL (17) TMDFMEQERERGITITAAVTT (7) INIIDTPGHVDFTIEVERSMRVLDGAIVV (Tth) (21) RNIGIVAHIDHGKTTLSDNLL (15) FMDSDEEEQARGITIDSSNVS (11) INLIDTPGHVDFGGDVTRAMRAVDGAVVV (Mac) (21) RNIGIIAHIHHGKTTLTDNLL (15) FTWWHEQEREREMTIYGAAVS (11) INLIDTPGHVEFGGEVTRAVRAIDGAVVV (Nec) (20) RNIGIIAHVDHGKTTTSDTLL (15) ALDYLNVEQQRGITVKAANIS (11) INLIDTPGHVDFSGRVTRSLRVLDGSIVV (Sso) (8) PQSETVWRQA (4) VPRIAFVNKMDRTGA(261-98)EDPTF (14) GMGELHLEI (228-287) consensus (8) PQSEANWRWA (4) VPRIAFINKMDRLGA (289) EDPTF (14) GMGELHLEI (236) O66428 Aae (8) PQTETVWRQA (4) VPRIVFVNKMDKIGA (289) EDPTF (14) GMGELHLDI (233) Q814C5 Bce (8) PQTETVWRQG (4) VPRIVFANKMDKTGA (289) EDPSF (14) GMGELHLDI (231) Q89J81 Bja (8) PQSETVWRQA (4) VPRIAYVNKMDRSGA (291) EDPTF (14) GMGELHLDI (241) Q8A474 Bth (8) PQSETVWRQA (4) VPRIAFVNKMDRMGA (294) EDPTF (14) GMGELHLDI (229) Q9Z802 Cpn (8) PQSETVWRQA (4) VPRLAFFNKMDRIGA (292) EDPTF (14) GMGELHLEI (231) Q8R602 Fnu (8) PQTETVWRQA (4) 
VPRFIFVNKMDRTGA (287) EDPTF (14) GMGELHLEI (235) Q7NEF2 Gvi (8) SQSITVDRQM (4) IPRLAFINKMDRTGA (286) EDPTF (14) GMGELHLEI (241) Q7URV2 Rba (8) PQSETVWRQA (4) VPRICFVNKLDRTGA (298) EDPSF (14) GMGELHLEV (235) P40173 Sco (8) PQTETVWHQA (4) VPRICFVNKMDRIGA (290) EDPTF (14) GMGELHIDV (228) AAS10780 Tde (8) PQSETVWRQA (4) VPRIAFMNKMDKVGA (289) EDPTL (14) GMGELHLEI (232) P38525 Tma (8) PQSETVWRQA (4) VPRIAFANKMDKTGA (289) EDPTF (14) GMGELHLEI (230) AAS81673 Tth (8) PQTETVLRQA (4) VRPVLFVNKVDRLIN (261) EDPTL (14) GMGELHLEV (286) Q8TRC3 Mac (8) PQTETVLKQA (4) VKPVLFINKVDRAIK (273) EDPTL (14) GLGDLHLEI (287) AAR39383 Neq (8) TQTETVLRQS (4) VRPILFINKVDRLVK (267) EDPNL (14) GMGFLHLEV (282) P30925 Sso RNA polymerase (224-538)LDGGRFATSDLNDLYRRVINRNNRLK (13) NEKRMLQEAVDAL(25-33)GKQGRFRQNLLGKRVDYSGRSVIVVGP(59-84) consensus (continued) (385) LDGGRFATSDLNDFYRRVINRNNRLK (13) NEKRMLQEAVDAL (25) GKQGRFRQNLLGKRVDYSGRSVIVVGP (59) (Aae) (243) LDGGRFATSDLNDLYRRVINRNNRLK (13) NEKRMLQEAVDAL (26) GKQGRFRQNLLGKRVDYSGRSVIVVGP (59) (Bce) (256) LDGGRFATSDLNDLYRRVINRNNRLK (13) NEKRMLQEAVDAL (26) GKQGRFRQNLLGKRVDYSGRSVIVVGP (59) (Bja) (266) LDGGRFATSDLNDLYRRVIIRNNRLK (13) NEKRMLQESVDSL (26) GKQGRFRQNLLGKRVDYSARSVIVVGP (59) (Bth) (257) LDGGRFATSDLNDLYRRVINRNNRLK (13) NEKRMLQEAVDAL (26) GKNGRFRQNLLGKRVDYSGRSVIIVGP (59) (Cpn) (243) LDGGRFATSDLNDLYRRVINRNNRLK (13) NEKRMLQEAVDAL (26) GKQGRFRQNLLGKRVDYSARSVIVVGP (59) (Fnu) (262) LDGGRFATSDLNDLYRRVINRNNRLA (13) NEKRMLQEAVDAL (26) GKQGRFRQNLLGKRVDYSGRSVIVVGP (59) (Gvi) (253) LDSGNFATSDLNDLYRRIINRNNRLR (13) NEKRMLQQSVDAL (26) GKQGRFRENLLGKRVDYSARSVIVVGP (59) (Rba) (329) LDGGRFATSDLNDLYRRVINRNNRLK (13) NEKRMLQEAVDAL (26) GKQGRFRQNLLGKRVDYSARSVIVVGP (59) (Sco) (243) LDGGRFATSDLNDLYRRVIHRNSRLS (13) NEKRMLQEAVDAL (26) GKQGRFRQNLLGKRVDYSGRSVIVVGP (59) (Tde) (538) IEGGRFATTDLNELYRRLINRNNRLK (13) NEKRMLQEAVDAL (33) GKKGRFRRNLLGKRVDYSGRAVIVVGP (61) (Tma) (518) VDGGRFATSDLNDLYRRLINRNNRLK (13) NEKRMLQEAVDAL (27) GKQGRFRQNLLGKRVDYSGRSVIVVGP (62) (Tth) (224) 
LESGQRSEDDLTHKLVDIIRINQRFQ (13) DLWELLQYHVTTF (27) GKEGRFRGSLSGKRVNFSARTVISPDP (80) (Mac) (232) LETGERSEDDLTHKIADIVRINNRIE (13) ENWEMLQYHVATY (27) GKEGRLRRNLAGKRVNFSARGVISVDP (83) (Neq) (224) IESGIRAEDDLTHKLVDIVRINERLK (13) DLWDLLQYHVATY (27) GKEGRFRGNLSGKRVDFSSRTVISPDP (84) (Sso) HPVLLNRAPTLHRLGIQAF (19) FNADFDGDQMAVH(169-975) consensus HPVLLNRAPTLHRPSIQAF (19) FNADFDGDQMAVH (975) O67763 Aae HPVLLNRAPTLHRLGIQAF (19) YNADFDGDQMAVH (745) Q81J47 Bce HPVLLNRAPTLHRLGIQAF (19) FNADFDGDQMAVH (927) Q89J75 Bja HPVLLNRAPTLHRLGIQAF (19) FNADFDGDQMAVH (946) Q8A470 Bth HPVLLNRAPTLHRLGIQAF (19) FNADFDGDQMAVH (921) Q9Z999 Cpn HPVLLNRAPTLHRLSIQAF (19) FNADFDGDQMAVH (861) Q8RHI7 Fnu HPVLLNRAPTLHRLGIQAF (19) FNADFDGDQMAVH (169) Q7NDF8 Gvi HPVLLNRAPTLHRMGIQAF (19) FNADFDGDQMAVH (961) Q7URW4 Rba HPVLLNRAPTLHRLGIQAF (19) FNADFDGDQMAVH (755) Q8CJT1 Sco HPVLLNRAPTLHRLGIQAF (19) FNADFDGDQMAIH (966) AAS12938 Tde SVVLLNRAPTLHRMSIQAF (19) FNADFDGDQMAVH (928) P36252 Tma KVVLLNRAPTLHRLGIQAF (19) FNADFDGDQMAVH (776) AAS81802 Tth DTVLFNRQPSLHKMSIMAH (19) YNADFDGDEMNMH (416) Q8TRB7 Mac DYALENRQPSLHKMSMMGH (19) YNADFDGDEMNYH (338) AAR39345 Neq DIVLENRQPSLHRISMMAH (19) YNADFDGDEMNLH (415) Q980R2 Sso Omnipresent cassettes (1) ABC transporters (32-72)GPSGSGKTTLL(29-41)MVFQNYALFPHLTALENV(31-42)QLSGGQQQRVAIARAL (6) LLADEPTSALD(21-22)IYVTHDQ(28-263) (2) Proteases (cell division protein FtsH, zinc-dependent metalloprotease) (146-463)LLVGPPGTGKTLLARAVAGEA (7) SGSDFVEMFVGVGASRVRD (9) PCIIFIDEIDAVGR(7-11)DEREQTLNQLLVEMDGF (3) RNA polymerase beta’ (gamma) subunit LDGGRFATSDLNDLYRRVINRNNRLK 12 RNEKRMLQEAVDAL 25-33 GKQGRFRQNLLGKRVDYSGRSVIVVGP 59-84 HPVLLNRAPTLHRLGIQAF 18 AFNADFDGDQMAVH (4) Initiation factor 2 MGHVDHGKTTLV 11 EAGGITQHIGA 12-29 FIDTPGHEAFT 14 LVVAADDGV 21 INKIDLP (5) Elongation factor G GIMAHIDAGKTTTTERIL 22-26 ERERGITIT 12-27 INIIDTPGHVDFTxEVERSLRVLDGAV 13 ETVWRQA (6) tRNA synthase (isoleucine synthases and class I synthases) (495-671) DQTRGWF(29-84)GRKMSKSLGN(318-467)consensus Two most 
widespread modules, ALEPH and BETH, apparently represent the earliest duplex gene, which in the earliest past encoded two vitally important activities involved in energy supply (ATP binding and ATPase).
Today the module ALEPH is located in a variety of enzymes that require ATP, including the most ancient ones:
1. ABC cassettes of transporters,
2. cell division proteins (proteases),
3. initiation and
4. elongation translation factors.
Other most ancient enzymes are
5. RNA polymerase and
6. aminoacyl-tRNA synthetase.

Functional definition of LUCA:
Early organism that contained functionally unique omnipresent cassettes and functionally unique omnipresent singular modules

HVDHGKTTL  Elongation factor EF-TU
GHVDHGKT   Elongation factor EF-TU
GSGKTTLL   ABC transporters (UraD)
GKSTLLN    ABC transporters
SGSGKT     Amino acid ABC transporters
GPSGSGK    Amino acid (glutamine) ABC transporter
NGSGKTT    ABC transporters
KSTLLK     ABC transporters
GPPGTGKT   Cell division control protein
GVGKTT     ParA (chromosome partitioning) family protein
PGVGKT     Clp protease, ATP binding
GKTTLA     Holliday junction DNA helicase RuvB
PTGSGKT    General secretion pathway protein
TGSGKS     Twitching motility protein
PNVGKS     GTP-binding protein era
GKTTLV     GTP-binding protein TypA
DHGKST     GTP-binding protein LepA
GTGKTT     Signal recognition particle receptor protein
LSGGQQQRV  ABC transporters, ATPases
QRVAIARAL  ABC transporters, ATPases
TLSGGE     ABC transporters, ATPases
LADEPT     ABC transporters, ATPases
IDTPGHV    Elongation factors G
NADFDGD    DNA-directed RNA polymerases
WTTTPWT    Isoleucyl-tRNA synthetases
KMSKSL     Aminoacyl-tRNA synthetases, class I
FIDEID     Cell division proteins

None of the omnipresent motifs is involved in elementary syntheses. They cover only ATP binding and hydrolysis, peptide digestion, membrane transport and template functions.
Most of the singular omnipresent modules are involved in many different multimodular activities.
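The notion of omnipresence used above (a k-mer is omnipresent if it occurs in every proteome of the chosen set) can be illustrated with a short Python sketch. The function names are hypothetical, and the three "proteomes" below are toy stand-ins that each hold a single fragment quoted earlier in the text (the Aleph prototype and the two Walker A examples), not real proteomes.

```python
def kmers(sequence, k):
    """All overlapping k-mers of one protein sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def proteome_kmers(proteins, k):
    """Union of the k-mers of all proteins in one proteome."""
    found = set()
    for protein in proteins:
        found |= kmers(protein, k)
    return found


def omnipresent_kmers(proteomes, k):
    """k-mers present in every proteome of the set."""
    per_proteome = [proteome_kmers(proteins, k) for proteins in proteomes.values()]
    return set.intersection(*per_proteome) if per_proteome else set()


# Toy data: one fragment per "proteome", taken from the text (illustration only).
TOY_PROTEOMES = {
    "toy_1": ["GEIVLLVGPSGSGKTTLLRALAGLLGPDGG"],
    "toy_2": ["KVALVGRSGSGKTTVTSLLM"],
    "toy_3": ["FIAVEGIDGAGKTTLAKSLS"],
}

if __name__ == "__main__":
    for k in (4, 5):
        print(k, sorted(omnipresent_kmers(TOY_PROTEOMES, k)))
```

In this toy set the only tetramer shared by all three fragments is GKTT, the core of the Walker A motif, and no pentamer survives. With whole proteomes the same intersection, compared against shuffled sequences, gives counts of the kind listed in the natural-versus-shuffle table.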
For a complete functional characterization of LUCA one has to determine the specific functions of the omnipresent modules themselves.

GENOME SEGMENTATION

“Evolution may have proceeded largely, rather than peripherally, through extrachromosomal elements” (D. Reanney, Bact. Rev. 40, 552, 1976)

7 aa → 25-30 aa (closed loops) → 120-150 aa (folds) → multifold proteins

As first explicitly suggested by Darwin, life developed from something simple to something more complex:
“… we could conceive in some warm little pond with all sort of ammonia and phosphoric salts, - light, heat, electricity etc., present, that a protein compound was chemically formed, ready to undergo still more complex changes, …” (Darwin 1871)
“… and probably all the organic beings which have ever lived on this earth have descended from some one primordial form, into which life was first breathed.”
“… from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.”

Does complexity go together
with evolution of species? YES: genome changes open new opportunities, new niches. NO: loss of functions/structures in parasites and symbionts.
with evolution of biosphere?
YES: speciation. NO: extinction.

Natural selection is often understood as brutal extermination of those individuals which do not fit, while it is actually mostly a very soft process in which the species (individuals and populations) are given a chance to select their niches themselves.

Active PATH SELECTION by life (marching to all permissive niches and subniches) VERSUS passive NATURAL SELECTION by environment (condemning unfortunate individuals and whole species in underpermissive conditions): “adaptive radiation”.

AT THE BORDER NON-LIFE/LIFE

The first known attempt at experimental modeling of events close to the origin-of-life border was Sol Spiegelman's study of the in vitro evolution of bacteriophage Qβ RNA (1973-75). Using purified Qβ replicase and the RNA of the phage, about 2000 bases long, he watched the evolution of the replicating RNA, which was losing weight in consecutive cycles until a size of only 208 bases was reached. The system involved the natural enzyme and design. But it certainly was a model of a living organism, with its reproduction assisted.

Self-ligation of trinucleotides on a complementary hexanucleotide, and replication of the hexanucleotide (von Kiedrowski, 1994). The system depends on the supply of the trinucleotides.
5’-C-C-G-C-G-G 3’-G-G-C 5’-C-C-G-
Sievers and von Kiedrowski, Nature 369, 221, 1994

Complementary primer extension on a C15 template in “protocells” using a chemical RNA analog (J. W. Szostak, 2008). No evolutionary changes have been observed.

A self-catalytic generation (“cross-replication”) of plus-strand and minus-strand ribozymes from their constituent 14-mer and 52-mer RNA chains (G. F. Joyce, 2009). The system also allows evolutionary changes in the ribozymes to be introduced and monitored. A complicated start: no elementary template polymerization steps one would expect in the simplest self-reproducing system.
Requires the 14-mer and 52-mer chains.

Elongation of oligo-riboA in aqueous solution due to formation of “complementary” pairs of polyA strands (E. Di Mauro, 2009). Similar hairpin extension of oligoG ligated to oligoC has been observed as well, with incorporation of complementary Gs. This is already complementary synthesis, i.e. replication.

“Creation” of a bacterial cell with a chemically synthesized genome (J. C. Venter, 2010): a very large scale bacterial transformation. The transforming DNA was chemically synthesized according to natural design, a previously fully sequenced genome, with some changes. The synthesized genome did replicate for many rounds. It is very much assisted replication, as it was provided with an initial natural cytoplasm. No evolutionary changes in the design have been monitored.

Arsenic DNA (F. Wolfe-Simon, 2010): bacteria living in volcanic geysers contain arsenic (As) in their DNA instead of phosphorus (P)?

In a system similar to that of Di Mauro (2009) above, the same laboratory demonstrated mutational events for the first time in a simple system (E. Di Mauro, 2011). On the template GCCGCCGCC, in the presence of only 3’-5’ cyclic GMP, the “complementary” GGGGGGGGG is synthesized non-enzymatically, thus allowing the G•G mismatch. In the beginning was… a mistake!

The first monomers formed abiotically in the water-formamide system (Saladino, Di Mauro, 2008-2011) are alanine, glycine, guanine and cytosine, the same basic monomers that are predicted from the Evolutionary Chart of Codons as starters of the triplet stage of life.

Cyanobacteria-like structures in meteorites (R. B. Hoover, 2011): a variety of apparently bacterial morphologies claimed to be observed in meteorites?

DEFINITIONS OF LIFE

Description of natural selection by Darwin: "...
if variations useful to any organic being ever do occur, assuredly individuals thus characterized will have the best chance of being preserved in the struggle for life; and from the strong principle of inheritance, these will tend to produce offspring similarly characterized" (Charles Darwin, Origin of Species, 1859).

Rephrasing (ET): Individuals with useful variations will self-reproduce. It is, actually, a definition of life.

The essential criteria of life are twofold: (1) the ability to direct chemical change by catalysis; (2) the ability to reproduce by autocatalysis. The ability to undergo heritable catalysis changes is general, and is essential where there is competition between different types of living things, as has been the case in the evolution of plants and animals (Alexander 1948).

Any system capable of replication and mutation is alive (Oparin 1961).

The criteria of living systems are: metabolism, self-reproduction and spatial proliferation. The more complicated kinds also have the ability to mutate and evolve (Gánti 1974).

We regard as alive any population of entities which has the properties of multiplication, heredity and variation (Maynard-Smith 1975).

Life is synonymous with the possession of genetic properties. Any system with the capacity to mutate freely and to reproduce its mutation must almost inevitably evolve in directions that will ensure its preservation. Given sufficient time, the system will acquire the complexity, variety and purposefulness that we recognize as being alive (Horowitz 1986).

To biologists, life is an outcome of ancient events that led to the assembly of nonliving materials into the first organized, living cells. ‘Life’ is a way of capturing and using energy and materials. ‘Life’ is a way of seeing and responding to specific changes in the environment. ‘Life’ is a capacity to reproduce; it is a capacity to follow programs of growth and development.
And ‘life’ evolves, meaning that details in the body plan and functions of each kind of organism can change through successive generations (Starr and Taggart 1992).

Life is a self-sustained chemical system capable of undergoing Darwinian Evolution (NASA working definition of life, Joyce 1994, 2002).

A living entity is defined as a system which, owing to its internal process of component production and coupled to the medium via adaptative changes, persists during the time history of the system (Luisi 1998).

Life on the Earth [. . .] seems to possess three properties (strongly related to each other and in fact being different aspects of the same thing) which are absent in inanimate systems. Namely, life is (1) composed of particular individuals, that (2) reproduce (which involves transferring their identity to progeny) and (3) evolve (their identity can change from generation to generation). A living individual is defined as a network of inferior negative feedbacks (regulatory mechanisms) subordinated to (being at the service of) a superior positive feedback (potential of expansion of life) (Korzeniewski 2001).

Life is the process of existence of open non-equilibrium complete systems, which are composed of carbon-based polymers and are able to self-reproduce and evolve on the basis of template synthesis of their polymer components (Altstein 2002).

Life is defined as a system capable of 1. self-organization; 2. self-replication; 3. evolution through mutation; 4. metabolism and 5. concentrative encapsulation (Arrhenius 2002).

Life is defined as a self-sustained molecular system transforming energy and matter, thus realizing its capacity of replication with mutations and anastrophic evolution (Baltscheffsky 2002).

Life is a chemical system capable of transferring its molecular information independently (self-reproduction) and also capable of making some accidental errors to allow the system to evolve (evolution) (Brack 2002).
Life is synonymous with the possession of genetic properties, i.e., the capacities for self-replication and mutation (Horowitz 2002).

A living entity is an ensemble of molecules which exhibit spatial organization and molecular-informational feedback loops in utilization of materials and energy from the environment for its growth, reproduction and evolution (Lahav and Nir 2002).

Any definition of life that is useful must be measurable. We must define life in terms that can be turned into measurables, and then turn these into a strategy that can be used to search for life. So what are these? a. structures, b. chemistry, c. replication with fidelity and d. evolution (Nealson 2002).

Life is a population of functionally connected, local, non-linear, informationally-controlled chemical systems that are able to self-reproduce, to adapt, and to coevolve to higher levels of global functional complexity (Von Kiedrowski 2002).

A living system is one capable of reproduction and evolution, with a fundamental logic that demands an incessant search for performance with respect to its building blocks and arrangement of these building blocks. The search will end only when perfection or near perfection is reached. Without this built-in search, living systems could not have achieved the level of complexity and excellence to deserve the designation of life (Wong 2002).

Rephrasing Darwin and all above: Life is self-reproduction with variations.

There are over 150 known definitions of life, but no consensus definition. One can try to derive the consensus by word count analysis, conceptually similar to Principal Component Analysis. The temporal order of appearance of amino acids in evolution has been calculated by Principal Component Analysis as well. Since the individual suggestions about the order are treated equally, as formal vectors, there is no discussion or evaluation of the appropriateness or weights of the various vectors. The approach is formal, “non-scientific” in that sense.
And yet it leads to a reasonable consensus, successful in specific predictions, thus qualifying as a theory.

Most frequent words of the vocabulary of definitions of life (word, count):
Life 123, Living 47, System 43, Matter 25, Systems 22, Environment 20, Energy 18, Chemical 17, Process 15,
Metabolism 14, Organism 14, Organization 14, Complexity 13, Ability 12, Itself 12,
Able 11, Capable 11, Definition 11, Organic 11,
Alive 10, Evolution 10, Materials 10, Reproduction 10, Existence 9,
Defined 8, Growth 8, Information 8, Open 8, Processes 8, Properties 8, Property 8, Reproduce 8, Through 8,
Complex 7, Evolve 7, Genetic 7, Internal 7, Replication 7,
Being 6, Change 6, Characteristics 6, Entity 6, External 6, Means 6, Molecules 6, One 6, Order 6, Organisms 6, State 6, Things 6, Time 6, Way 6,
Based 5, Biological 5, Capacity 5, Different 5, Force 5, Form 5, Functional 5, Highly 5, More 5, Mutation 5, Necessary 5, Network 5, Objects 5, Only 5, Organized 5, Reactions 5, Self-reproduction 5, Some 5, Three 5, …

These can be combined in groups of similar meaning. From the vocabulary of 123 known definitions of life the following groups of meanings are revealed:
LIFE 123, living 47, alive 10, being 6, biological 5, other related words 8. Sum 199
SYSTEM 43, systems 22, organization 14, organism 14, order 6, organisms 6, network 5, organized 5, other related words 40. Sum 155
MATTER 25, organic 11, materials 10, molecules 6, other related words 36. Sum 88
CHEMICAL 17, process 15, metabolism 14, processes 8, reactions 5, other related words 26. Sum 85
COMPLEXITY 13, information 8, complex 7, other related words 46. Sum 74
REPRODUCTION 10, reproduce 8, replication 7, self-reproduction 5, other related words 33. Sum 63
EVOLUTION 10, evolve 7, change 6, mutation 5, other related words 20. Sum 48
ENVIRONMENT 20, external 6, other related words 15. Sum 41
ENERGY 18, force 5, other related words 17. Sum 40
ABILITY 12, able 11, capable 11, capacity 5, other related words 1. Sum 40

Classification of the words (terms) in the groups is rather arbitrary, as none of the terms used is clearly defined.
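The word count behind the groups above can be sketched in a few lines. This is a minimal illustration only: it uses three of the definitions quoted in the text rather than the full collection, and the trimmed `GROUPS` table and the function names are hypothetical stand-ins for the author's (unspecified) procedure.

```python
import re
from collections import Counter

# Three definitions quoted in the text; the real analysis used 123.
DEFINITIONS = [
    "Any system capable of replication and mutation is alive",  # Oparin 1961
    "Life is a self-sustained chemical system capable of undergoing "
    "Darwinian evolution",  # NASA working definition
    "Life is self-reproduction with variations",
]

# Hypothetical, trimmed version of the groups of similar meaning.
GROUPS = {
    "LIFE": {"life", "living", "alive"},
    "SYSTEM": {"system", "systems"},
    "REPRODUCTION": {"reproduction", "reproduce", "replication",
                     "self-reproduction"},
    "EVOLUTION": {"evolution", "evolve", "mutation", "change"},
}


def word_counts(definitions):
    """Count lower-cased words (hyphenated words kept whole) over all definitions."""
    counts = Counter()
    for text in definitions:
        counts.update(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))
    return counts


def group_sums(counts, groups):
    """Sum the word counts inside each group of similar meaning."""
    return {name: sum(counts[w] for w in words) for name, words in groups.items()}


if __name__ == "__main__":
    counts = word_counts(DEFINITIONS)
    print(counts.most_common(5))
    print(group_sums(counts, GROUPS))
```

The assignment of words to groups is entered by hand here, which mirrors the point made in the text that the grouping is an intuitive choice rather than a computed one.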
Thus, the groupings are rather intuitive. To have any scientific value, the groups should also be formed from independent intuitive opinions.

Kompanichenko (groups of notions)                            Trifonov (groups of words)
1. Capable of self-reproduction                              Reproduction
2. Capable of self-replication
3. Capable of evolution…                                     Evolution
4. Performance and control of metabolism                     Chemical
5. Ability of extraction of energy and matter                Energy, Matter, Environment
   from the environment
(V. Kompanichenko, 2002)

Life (definiendum). Definientia: System, Matter, Chemical, Complexity, Reproduction, Evolution, Environment, Energy, Ability. These appear to be both necessary and sufficient for the definition of life.

Possible comprehensive definition: Life is a metabolizing material informational system with the ability of self-reproduction with changes (evolution), which requires energy and a suitable environment.

But this definition is far from being necessary and sufficient. It is clearly excessive. What would be the necessary and sufficient definientia? We thus come to the already formulated: Life is self-reproduction with variations.

     Gly    Ala  |  Val    Asp    Ser    Pro  ...
1    GGC -- GCC  |
2                |  GUC -- GAC
3    GGA --------|--------------- UCC
4    GGG --------|---------------------- CCC
     . .
(self-reproduction only: not Life yet) ↓ (self-reproduction and variations: Life)

WANTED: a self-reproducing composite replicon, consisting of the duplex of
1. 5’-GCCGCCGCCGCCGCCGCCGCC-3’ and
2. 3’-CGGCGGCGGCGGCGGCGGCGG-5’
and the heptapeptides
3. ala ala ala ala ala ala ala
4. gly gly gly gly gly gly gly

Good news: in imitation experiments similar to Miller's (but in the presence of mineral catalysts) the primary products synthesized are Alanine, Glycine, Cytosine and Guanine (R. Saladino, E.
Di Mauro, 2010-2011), in complete accordance with the predictions from the Evolutionary Chart of Codons.

LIFE STARTS AGAIN AND AGAIN

Baby talk words, perfect repeats (Russian, if not specified):
Mama, Papa, Baba (grandma), Pipi, Caca, Sisi (breast), Bobo (pain), Baibai (good night), Tiatia (father), Niania (nanny), Ham-ham (eat, Vietnamese), Ai-ai-ai (mishap), Ne-ne-ne (no, Czech), Wong-wong (drink, Vietnamese)

Adult forms, perfect repeats:
O-o (warning), Bebe, Da-da (come in), Ja-ja (yes, German), Ku-ku (crazy), Ga-ga (crazy, English), Hahaha, Nununu (warning to babies), Tuktuk (Cambodia, moto-rickshaw), Tamtam (drum), Tak-tak (all right), Ks-ks-ks (calling cat), Nuka-nuka (go ahead), Chachacha, Leat-leat (slowly, Hebrew), Tipa-tipa (little bit, Hebrew), Tilki-tilki (barely fit, Ukrainian), Trochi-trochi (little bit, Ukrainian), Rock-rock-rock (Kenya, lullaby), Langsam-langsam (slowly, Yiddish)

Mutated, imperfect repeats, babies and adults:
Mamy (mother, English), Baby, Bibika (car), Mamaya (fruit, Brazil), Papaya (similar fruit, Brazil), O-la-la (surprise, French), Coocook, To-to-je (Aliska, co to je, Czech), Ta-ra-ram (mess), Balalaika, Tarataika (type of a cart), Yin‘-yan‘ (Chinese), Siusiukat‘ (imitate baby-talk), Tsap-tsarap (catch, about cats), Villi-nilli (against will, Latin), Meli, Emelia (talking nonsense), Olgoi-horhoi (Mongolian, fairytale creature), Volens-nolens (against will, Latin), Naziuziukalsa (drunk), Futy-nuty, lapti gnuty (mishap)

Martin Luther King, 1968: “Yes, if you want to say that I was a drum major, say that I was a drum major for justice. Say that I was a drum major for peace. I was a drum major for righteousness.”
Criticized misquote: “I was a drum major for justice, for peace, for righteousness.”

Human languages, quite likely, originated from simple repetitive words, continued with their mutated forms, and even today the languages operate with simple repeats, mutated forms, and longer tandem or dispersed repeats (refrains).
EXACTLY THE SAME CAN BE SAID ABOUT BIOLOGICAL SEQUENCES (nucleic acids and proteins)

All 15-mers of human genome (sorted)
Rank    Occurrences  15-mer           Type
1       1198780      TTTTTTTTTTTTTTT  Tn
2       1190667      AAAAAAAAAAAAAAA  An
3       366285       TGTGTGTGTGTGTGT  TGn
4       362623       ACACACACACACACA  ACn
5       348215       GTGTGTGTGTGTGTG  GTn
6       344421       CACACACACACACAC  CAn
7       223424       GCTGGGATTACAGGC  Alu
8       223011       GCCTGTAATCCCAGC  Alu
9       222894       TATATATATATATAT  TAn
10      222730       ATATATATATATATA  ATn
11-67                                 Alu
68      169033       TTTTTTTTTTTTTTG  Tn
69-72                                 Alu
73      167889       CAAAAAAAAAAAAAA  An
74      167361       CTAAAAATACAAAAA  Alu
75      150349       CTTTTTTTTTTTTTT  Tn
76      149748       AAAAAAAAAAAAAAG  An
77-82                                 Alu

Three known pathologically expanding (“aggressive”) classes of triplets: GCU (GCU, CUG, UGC, AGC, GCA, CAG), GCC (GCC, CCG, CGC, GGC, GCG, CGG) and GAA (GAA, AAG, AGA, CUU, UUC, UCU).

Aggressive amino acids encoded by expanding triplets
Amino acid          Triplets
L (leucine)         CUG CUU
A (alanine)         GCU GCA GCC GCG
G (glycine)         GGC
P (proline)         CCG
S (serine)          AGC UCU
E (glutamate)       GAA
R (arginine)        CGG CGC AGA
Q (glutamine)       CAG
K (lysine)          AAG
F (phenylalanine)   UUC
C (cysteine)        UGC

Majority of homopeptides are built from aggressive amino acids (numbers in parentheses are ranks within the column)
     Tripeptide  Human 1st exons     Eukaryotes      Prokaryotes
                 (tripeptide score)  (Faux et al.)   (Faux et al.)
1.   L3          4552                1446            70 (5)
2.   A3          4046                5465 (3)        251 (3)
3.   G3          2972                5002 (5)        310 (2)
4.   P3          2258                4157 (7)        217 (4)
5.   S3          1981                5424 (4)        378 (1)
6.   E3          1630                4334 (6)        67 (6)
7.   R3          1145                462             60 (8)
8.   Q3          802                 8022 (1)        52 (9)
9.   K3          535                 1920 (9)        25
---------------------------------------
10.  V3          414                 94              9
11.  H3          273                 1049            32
12.  D3          269                 1554            34
13.  T3          267                 2492 (8)        63 (7)
14.  I3          109                 34              3
15.  F3          103                 175             1
16.  C3          92                  38              0
17.  N3          79                  6962 (2)        31
18.  M3          34                  19              0
19.  Y3          32                  39              4
20.  W3          14                  3               0
                 92%                 75%             89%
(Z.
Koren, 2011)

Sorted occurrence of the triplet repeats for different groups (the "aggressive" triplets are groups 1-3; lower case marks the stop codons)
Group  Codons                    Occurrence
1      GCC CCG CGC GGC GCG CGG   1784302
2      GCA CAG AGC UGC GCU CUG   1436660
3      GAA AAG AGA UUC UCU CUU   1131214
4      AAU AUA uaa AUU UUA UAU   932105
5      AUC UCA CAU GAU AUG uga   735397
6      ACC CCA CAC GGU GUG UGG   726443
7      AGG GGA GAG CCU CUC UCC   706484
8      AAC ACA CAA GUU UUG UGU   694387
9      ACG CGA GAC CGU GUC UCG   533888
10     ACU CUA UAC AGU GUA uag   152747

“Sandwiches” GCU abc GCU (* marks first derivatives of gcu)
Middle triplet  Occurrence
gcu             243706
ggu             125946 *
gau             115500 *
gaa             114278
guu             102550 *
gca             95493 *
gcc             92153 *
auu             89648
uuu             87861
aaa             84194
uua             80660
gga             74934
ggc             71770
gcg             68672 *
caa             64785
cuu             63404
aau             60495
gag             60308
ucu             59511 *
gug             59440

[Figure] Natural sequences of mRNA are anomalously dominated by one or another codon.
[Figure] Recognizable fossils of ancient repeats (Z. Frenkel, 2011).
[Figure] Codon frequencies (in non-repeating parts of mRNA) correlate with repeat frequencies.

Codon repertoires (codon, codon frequency, repeat frequency):
Ala: GCC 110 465; GCA 94 195; GCU 93 245; GCG 88 386
Arg: CGC 70 177; CGU 46 45; CGG 41 86; CGA 33 39
Arg: AGA 55 62; AGG 29 22
Asn: AAU 121 523; AAC 85 170
Asp: GAU 148 359; GAC 107 236
Cys: UGC 31.9 18; UGU 31.5 7
Gln: CAA 88 269; CAG 87 459
Glu: GAA 163 584; GAG 122 367
Gly: GGC 107 500; GGU 92 229; GGA 87 135; GGG 56 17
His: CAU 58 62; CAC 49 61
Ile: AUU 128 151; AUC 100 107; AUA 70 63
Leu: UUA 91 127; UUG 73 30
Leu: CUG 108 375; CUU 75 43; CUC 70 59; CUA 40 8
Lys: AAA 158 403; AAG 104 277
Met: AUG 109 117
Phe: UUU 112 68; UUC 82 85
Pro: CCA 62 89; CCG 59 169; CCU 58 59; CCC 50 11
Ser: UCU 63 81; UCA 62 90; UCC 50 67; UCG 44 54
Ser: AGC 59 147; AGU 53 36
Thr: ACC 76 138; ACA 71 126; ACU 65 45; ACG 51 59
Trp: UGG 60 22
Tyr: UAU 86 68; UAC 61 41
Val: GUG 91 187; GUU 88 92; GUC 74 103; GUA 61 23

In 17 of 21 codon repertoires the topmost codons are also the topmost repeats; that is, in non-repeating sequences the repeats are still remembered.

Low complexity (simple repeat): just appeared. Intermediates. High complexity
– used to be a simple repeat a long time ago.

Genome today … genome at the origin of life: some 4 bln yrs.

Genomes are all built from simple repeats; many of them are just already unrecognizable. The bulk of words in the genome do not reach the highest complexity, staying repetitive. They therefore still carry the memory of their primitive past.

Replication of tandem repeats with accumulation of mutations within the genomes falls under the definition of life.

Thus, life started with the replication (and expansion) and subsequent mutation of the tandemly repeating triplets GGC and GCC. Life then continued to emerge spontaneously within the primitive early genomes and further on, in the form of replication and expansion of other tandem repeats as well, and subsequent mutations.

Life never stopped emerging and evolving.

“… if (and oh what a big if) we could conceive in some warm little pond with all sort of ammonia and phosphoric salts, - light, heat, electricity etc., present, that a protein compound was chemically formed, ready to undergo still more complex changes, at the present day such matter would be instantly devoured, or absorbed, which would not have been the case before living creatures were formed.” (Darwin 1871)

With the new view on genome origin and evolution, the emerging life is not consumed by the earlier life, but rather protected by the environment within the cell.

The tandem repeats have been considered as a class of “selfish DNA” (Orgel and Crick, 1980; Doolittle and Sapienza, 1980). They are, actually, more than just parasites tolerated by the genome. They are even more than building material for the genome (Ohno, Junk DNA, 1972). The tandem repeats represent constantly emerging life, and genomes are products of their everlasting accommodation.
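The classes of triplets used in the repeat tables above (for example GCC, CCG, CGC, GGC, GCG, CGG) collect the three phase shifts of a tandem repeat together with the three shifts of its complementary strand. A minimal sketch of that grouping follows; the function names are hypothetical, and the rule itself is inferred from the listed classes rather than stated explicitly in the text.

```python
from itertools import product

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(triplet):
    """Triplet read on the complementary strand, 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(triplet))


def shifts(triplet):
    """The three reading phases of the tandem repeat (triplet)n."""
    return {triplet[i:] + triplet[:i] for i in range(3)}


def triplet_class(triplet):
    """All triplets seen in either strand of the duplex (triplet)n, in any phase."""
    return frozenset(shifts(triplet) | shifts(reverse_complement(triplet)))


def all_classes():
    """Partition of the 64 triplets into repeat classes."""
    return {triplet_class("".join(t)) for t in product("ACGU", repeat=3)}


if __name__ == "__main__":
    for cls in sorted(all_classes(), key=lambda c: sorted(c)):
        print(len(cls), " ".join(sorted(cls)))
```

Under this rule the 60 non-homopolymeric triplets fall into 10 classes of six, matching the 10 groups tabulated in the text, while AAA/UUU and CCC/GGG form two classes of two.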
Genomes are built largely by the expansion and mutational accommodation of the tandem repeats. Genomes ARE the repeats.

Linguistic complexity

ATATATATATATATATA (17 bases)
Maximal possible vocabulary for 1-mers: 5A, 4C, 4G, 4T. Actual vocabulary: 9A, 0C, 0G, 8T. Overlap: 9 bases. Complexity $C_1 = 9/17$.
Maximal possible vocabulary for 2-mers: AA, AC, AG, AT, … TA, TC, TG, TT. Actual vocabulary: AT and TA only. Complexity $C_2 = 2/16$.
$C_3 = 2/15$, $C_4 = 2/14$, …, $C_8 = 2/10$, $C_9 = 2/9$, $C_{10} = 2/8$, $C_{11} = 2/7$, $C_{12} = 2/6$, $C_{13} = 2/5$, $C_{14} = 2/4$, $C_{15} = 2/3$, $C_{16} = 2/2$.
Product: $C = \prod_{i=1}^{16} C_i = 9 \cdot 2^{15} / 17! \approx 8.3 \times 10^{-10}$ (~0).
For a maximally complex sequence (no extra repeats) $C = 1.0$.

[Figure: complexity of human 15-mers] 15-mers of the human genome are on the low sequence complexity side. High complexity words are rather avoided.
[Figure: occurrence versus complexity] Occurrences of simple sequence 15-mers are anomalously high.

Topmost 15-mers of human genome (first 10 of 1 073 741 824 words)
Rank  Occurrences  15-mer           Type
1     1198780      TTTTTTTTTTTTTTT  Tn
2     1190667      AAAAAAAAAAAAAAA  An
3     366285       TGTGTGTGTGTGTGT  TGn
4     362623       ACACACACACACACA  ACn
5     348215       GTGTGTGTGTGTGTG  GTn
6     344421       CACACACACACACAC  CAn
7     223424       GCTGGGATTACAGGC  Alu
8     223011       GCCTGTAATCCCAGC  Alu
9     222894       TATATATATATATAT  TAn
10    222730       ATATATATATATATA  ATn
Statistical expectation: ~3 occurrences each (about 3 × 10^9 genome positions spread over 4^15 = 1 073 741 824 possible 15-mers).

The amino acid repeats in prokaryotes are less frequent compared to eukaryotes. Perhaps the prokaryotic sequences are less permissive and cannot afford carrying the repeats. Eukaryotes often have many copies of the same gene and of its versions. The versions with (variable) repeats may serve for adjustments of the protein performance in different cell types.

Sequence variability in HIV proteins. Note the RNY repeats (De Grignis et al., AIDS Res. & Human Retroviruses 27, 2011).

The amino acid repeats appear primarily in the first exons of the eukaryotic genes. This is, probably, the most permissive location for the de novo emerging repeats, as other, older parts of the protein sequences already developed advanced functionality.
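The linguistic complexity calculation shown for ATATATATATATATATA can be reproduced with a short script. This is a sketch under stated assumptions: the "maximal possible vocabulary" is taken to spread the available windows as evenly as possible over all 4^k words, with any leftover slots given to the most frequent observed words (which reproduces the 5A, 4C, 4G, 4T example); the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def vocabulary_usage(seq, k, alphabet_size=4):
    """C_k: overlap between the actual and the maximal k-mer vocabulary,
    divided by the number of k-mer windows in the sequence."""
    n_windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    # Maximal vocabulary: every word gets `base` slots, `extra` words get one more.
    base, extra = divmod(n_windows, alphabet_size ** k)
    overlap = 0
    for rank, (_, count) in enumerate(counts.most_common()):
        cap = base + (1 if rank < extra else 0)  # assumption: extras go to top words
        overlap += min(count, cap)
    return Fraction(overlap, n_windows)


def linguistic_complexity(seq):
    """Product of C_k over k = 1 .. len(seq) - 1, as an exact fraction."""
    result = Fraction(1)
    for k in range(1, len(seq)):
        result *= vocabulary_usage(seq, k)
    return result


if __name__ == "__main__":
    c = linguistic_complexity("ATATATATATATATATA")
    print(c, float(c))  # exact value equals 9 * 2**15 / 17!
    print(linguistic_complexity("ACGT"))  # no repeated words at any length
```

Exact fractions are used so that the product can be compared directly with the closed form 9·2^15/17! instead of relying on floating-point rounding.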
Life, in its simplest repeating-sequence form, never stopped emerging within the genomes.

Another life before triplets

Well-organized sequences GCC GCC GCC GCC…. and GGC GGC GGC GGC…. could not appear from nowhere. Obviously, some other (simpler?) RNA molecules had to come before. This suggests that the early biomolecular life actually started earlier, before the triplet stage. Moreover, one could speculate that there were two lives, one after another.

The abiotic synthesis of RNA (homopolyribonucleotides) in water is an experimentally established fact (Di Mauro, 2009, 2010). The abiotic synthesis of 5’-AAAAA…. stops at 5-mers, because degradation starts to dominate over condensation. If, however, one starts with hexamers or longer oligonucleotides, a magic thing happens: the synthesis resumes and continues for over a hundred steps.

5’-AAAAAAAAAAAAAAA
5’-AAAAAAAAAAAAAAA
A•A complementary pairs are formed, first discovered by J. Brahms in the 70s (the strands are arranged in parallel).

Nature thus discovered complementary template synthesis, although not Watson-Crick complementarity yet.

In the above AAAAAAAAAAA… system, erroneous incorporation of bases other than A led to the formation of a spectrum of mixed-sequence RNAs.
The Watson-Crick pairing entered the scene.
Competition started between the replicating molecules.
The simple repeating sequences took over due to their ability to form slippage structures and expand.
The champions of slippage and expansion, GCC GCC GCC GCC …. and GGC GGC GGC GGC …., appeared.

This first pre-triplet life started with primitive elongating homooligonucleotides (self-reproduction), went through the heterooligonucleotide stage (self-reproduction and variation: LIFE), and ended with, again, primitive simple repeats (self-reproduction). This was the beginning of the second life, now with triplets and encoded amino acids.

Major steps of early molecular evolution
I. Life before the triplet code
1. Abiotic syntheses of monomers
2. Oligomerization, mixed sequence peptides, RNA oligonucleotides
3. Homooligonucleotides (polyA) take over, due to A•A complementarity
4. Inclusion of non-A bases, mixed sequences
5. Appearance of Watson-Crick pairs and takeover
6. Competition between RNA replicons, and appearance of simple repeats
7. GCCn•GGCn take over: first stage of the triplet code life

Short mixed sequences:
ACC CCGG UAG CUUGGG AAAA AUAUCGC AUGG GAU ..... CCUUGAG GUCUU UUU

Duplexes of oligoA. Degradation barrier by-passed. Birth of complementarity.
AAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAA

AAAAAAAAAAAAAAAAAAAAAAAUUUUUUUU
AAAAAAAAAAAAAAAAAAAAAAA
Development of Watson-Crick complementarity (↓ U): single-strand stacking, Watson-Crick bp stacking.

Variety of mixed sequence complementary duplexes:
5’-AGCUUCGAGGUAUUC
   UCGAAGCUCCAUAAG-5’
5’-GUAGAGUAGAGUACAGAUGAU
   CAUCUCAUCUCAUGUCUACUA-5’
5’-GUAAGUGCACUAGGGUA
   CAUUCACGUGAUCCCAU-5’
5’-UAUAAAACCAGUUGGCCUAUGAA
   AUAUUUUGGUCAACCGGAUACUU-5’
……………………………..

Variety of repetitive duplexes:
(GAU)n•(AUC)n (GU)n•(AC)n (UAU)n•(AUA)n (AAG)n•(CUU)n (UUCC)n•(GGAA)n (UC)n•(GA)n ............. (CUC)n•(GAG)n (AUCG)n•(CGAU)n

GGC•GCC duplexes. Triplet life started.
5’-…GGCGGCGGCGGCGGCGGC…
    CCGCCGCCGCCGCCGCCG…-5’

II. Triplet code life
1. Appearance of first codons, in addition to GCC and GGC
2. First complementary mini-genes encoding peptides of 7 Ala-family residues and of 7 Gly-family residues
3. Fusion of minigenes, alternation of Ala-family and Gly-family units
4. Completion of the assignment of 64 codons to 17 amino acids and terminators
5. Codon capture stage, completion of modern codon table
6. Formation of closed polypeptide loops, first protein modules
7. Fusion of the early modules, formation of LUCA protein repertoire
8. Fusion of the genes encoding fold-size proteins, appearance of multi-fold proteins

Example of complementary template synthesis: CCCCCCCCCCCCCCC copied into GGGGGGGGGGGGGGG (lab.
of Di Mauro, 2010) Another example (Taq polymerase assisted): Polymerase Chain Reaction (PCR) (global experience) - one step away from replication of duplexes First successful example of replication with mutations The sequence GCC GCC GCC GCC GCC is complementarily copied to GGG GGG GGG GGG GGG in presence of G only, forcing formation of non-complementary G•G pairs (Pino et al., 2011) Total 364 slides, for 5 lectures, 72 slides each Edward N. Trifonov (kakhol ve lavan) (blue and white)