Structural Bioinformatics CoLiDe: Combinatorial Library Design tool for probing protein sequence space Vyacheslav Tretyachenko1,4 , Václav Voráček2* , Radko Souček5 , Kosuke Fu- jishima3 , and Klára Hlouchová1,5* 1 Department of Cell Biology, Faculty of Science, Charles University, Biocev, Prague, Czech Republic. 2 Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague, Technicka 2, 166 27, Prague, Czech Republic 3 Earth-Life Science Institute, Tokyo Institute of Technology, Tokyo, 1528550, Japan 4 Department of Biochemistry, Faculty of Science, Charles University, Hlavova 8, 128 00, Prague 2, Czech Republic. 5 Institute of Organic Chemistry and Biochemistry IOCB Research Centre & Gilead Sciences, Academy of Sciences of the Czech Republic, Flemingovo nám. 2, 166 10, Prague, Czech Republic *To whom correspondence should be addressed. Associate Editor: Arne Elofsson Received on XXXXX; revised on XXXXX; accepted on XXXXX Abstract Motivation: Current techniques of protein engineering focus mostly on re-designing small targeted regions or defined structural scaffolds rather than constructing combinatorial libraries of versatile compositions and lengths. This is a missed opportunity because combinatorial libraries are emerging as a vital source of novel functional proteins and are of interest in diverse research areas. Results: Here, we present a computational tool for Combinatorial Library Design (CoLiDe) offering precise control over protein sequence composition, length and diversity. The algorithm uses evolutionary approach to provide solutions to combinatorial libraries of degenerate DNA templates. We demonstrate its performance and precision using 4 different input alphabet distribution on different sequence lengths. In addition, a model design and experimental pipeline for protein library expression and purification is presented, providing a proof-of-concept that our protocol can be used to prepare purified protein library samples of up to 1011-1012 unique sequences. CoLiDe presents a composition-centric approach to protein design towards different functional phe- nomena. Availability: CoLiDe is implemented in Python and freely available at https://github.com/vo- racva1/CoLiDe. Contact: klara.hlouchova@natur.cuni.cz, voracva1@fel.cvut.cz Supplementary information: Supplementary data are available at Bioinformatics online. © The Author(s) 2020. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloadedfromhttps://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa804/5909645bygueston07October2020 1 Introduction Considering the vastness of the potential protein sequence space, naturally occurring proteins are constructed from a small number of coding sequences that arrange into a limited number of structural folds. While there are 20100 possible combinations for the design of a 100-amino-acid protein within the canonical amino acid alphabet, only ~1015 sequences encode all proteins on Earth (Luisi, 2006). Furthermore, these sequences are estimated to fold into only ~2,000 distinct topologies (Govindarajan et al., 1999). These observations raise numerous questions in the fields of biotechnology, synthetic biology and evolutionary biology: How easily can a useful sequence be encountered in the unexplored sequence space? Are there protein folds and functions outside those formed by the natural sequence pool? Several recent studies have started providing answers to these questions. Both secondary and tertiary structures seem to be abundant in completely random sequences (Chiarabelli et al., 2006; Davidson and Sauer, 1994; LaBean et al., 2011; Tretyachenko et al., 2017). Novel folds and functions have been encountered in random and semi-random sequence libraries, and some researchers argue that protein function may be discovered by entirely stochastic means (Chao et al., 2013; Donnelly et al., 2018; Fisher et al., 2011; Keefe and Szostak, 2001; Ravarani et al., 2018). In addition, the bioactivity of and cellular response to random sequences has been actively discussed in association with de novo gene birth (Bornberg-Bauer and Heames, 2019; Neme et al., 2017). While it seems that protein structure and function can be encountered in random sequence space, different biological functions have been associated with specific amino acid composition and hence physicochemical properties. For example, positively charged and aromatic amino acids are known to promote protein-RNA interaction, evolutionary early amino acids promote solubility and trends in amino acid composition have been related to phenomena such as protein disorder and liquid-liquid phase separation (Blanco et al., 2018; Doi et al., 2005; Newton et al., 2019; Wang et al., 2018; Vymětal et al., 2019). Local residue composition is apparently what makes natural sequences stand out from randomness (Weidmann et al., 2019). Overall, these studies highlight the importance of developing tools to probe the protein sequence space in a rational way. Several approaches to constructing synthetic protein sequence libraries have been developed. The simplest is direct chemical synthesis of a peptide from amino acid precursors but has major restrictions in sequence length and conformational biases (reviewed in (Jaradat, 2018)). Another approach is based on construction of a degenerate DNA template with subsequent expression. The template can be designed either using triplet codon as the minimal unit, where pre-synthesized triplets are linked together, or at the single nucleotide level. Although the former method can provide a library with unbiased amino acid distribution at each template position, the cost of the trinucleotide phosphoramidite precursors limits its widespread adoption in laboratory practice (Virnekas et al., 1994). On the other hand, template synthesis at the nucleotide level is economically feasible and is offered by multiple commercial oligonucleotide synthesis companies. Using this approach, random libraries have been constructed from simple repeat of frequently used degenerate codons, such as NNN and NNK. The major drawback of NNN/NNK method for protein engineering is its high level of degeneracy (NNK codes 20 amino acids via 32 different codons). An elegant solution to reduce the degeneracy introduced by Kille et al. combines three degenerate codons in a vertical way to cover all 20 amino acids using 22 codons (so called “22c-trick”) without an introduction of STOP codons (Kille et al., 2013). Nevertheless, this solution is effective only when screening a few positions because of an increased cost of oligonucleotide synthesis (mere three mutagenized positions would demand 33 = 27 separate oligonucleotides) and the experimental effort during template assembly. Both of these methods are focused on producing the highest mutational coverage without any attention to amino acid distribution of the mutant library. While several computational algorithms for library design exist, they have been optimized to introduce as few degenerate codons as possible (Jacobs et al., 2015; Shimko et al., 2020; Tang et al., 2012). An optimal solution to amino acid distribution approximation by combinations of degenerate codons was recently introduced in SwiftLib and DeCoDe algorithms (Jacobs et al., 2015; Shimko et al., 2020). Both produce compact combinatorial libraries by as few degenerate codons as possible while DeCoDe implements complex patterns of covariation into the library design (Shimko et al., 2020). Degenerate codon positions consist of nucleotide mixtures at equimolar ratios where more than one nucleotide is found at a single position. An alternative approach is represented by use of spiked codons where nucleotides can be represented by variable ratios. Mapping of amino acid distribution into a single spiked codon was implemented by Wolf et al. and Craig et al. via numerical optimization and genetic algorithms. Unfortunately neither of these algorithms is publicly available (Wolf and Kim, 1999; Craig et al., 2009). Although these tools are particularly useful for site-specific randomization strategies, there remains a missed opportunity for the overall design of protein libraries. Specifically, the formation of combinatorial segments of versatile length with a desired amino acid composition would benefit synthetic biology practitioners. Here, we present a combinatorial library design tool (CoLiDe) for the DNA template design of versatile protein libraries. CoLiDe aids in construction of libraries with specific amino acid distributions and lengths, Figure 1. Outline of the CoLiDe algorithm. Based on the input amino acid distribution and length of the randomized library, at first an unoptimized vector of degenerate codons of given length is generated. Then the vector is optimized by single exchanges of codons until a vector of degenerate codons with minimal distance from the input distribution is ob- tained Downloadedfromhttps://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa804/5909645bygueston07October2020 CoLiDe i.e. optimization of the overall amino acid composition. Such libraries are notably in demand for investigating phenomena that are principally related to amino acid composition - protein liquid-liquid phase separation (Wang et al., 2018), intrinsic protein disorder (Vymětal et al., 2019), spatial protein localization in vivo (Cedano et al., 1997), protein degradation halflife in the cellular milieu and chain elongation rate during ribosomal synthesis (Guruprasad et al., 1990; Riba et al., 2018). In addition, our algorithm allows for incorporation of spiked trinucleotides (i.e., with variable nucleotide composition for single position) and removal of specific codons, such as for codon reassignment and incorporation of unnatural amino acids (Liu and Schultz, 2010). As a proof-of-concept, we demonstrate the use of CoLiDe by construction of a combinatorial protein library of 33 amino acids in length and composed of a 10 amino acid alphabet (A, S, D, G, L, E, T, I, P, and V). Total amino acid composition of the library and therefore each protein sequence was specified using the CoLiDe input option. Moreover, CoLiDe can be used to upgrade currently available DNA block shuffling methods to prepare combinatorial libraries that are hundreds of amino acids in length. 2 Results and discussion In this work, we present a computational tool for automated design of combinatorial libraries. CoLiDe uses evolutionary approach to find a satisfactory solution. The algorithm provides a set of degenerate codons which approximate the total amino acid distribution of protein without regard to individual degenerate positions in the coding template. The principle of the algorithm is summarized in Fig. 1. Mandatory inputs include library length, amino acid distribution, and degenerate codon type (standard or spiked, Supporting Fig. S1). Other parameters, such as organism-specific codon preference, extent of degeneracy, or codon removal/reassignment, also can be specified (Supporting Fig. S1). Once the input parameters are defined, codons are pre-selected based on the amino acid input from a total pool of 3,375 degenerate codons. The codon pre-selection removes undesired amino acid and STOP codons. This step guarantees that the combinatorial library is composed only of input amino acids and will not contain prematurely terminated templates. On the other hand, depending on input distribution, most highly degenerate codons are removed which reduces degeneracy of individual library positions. Only the pre-selected degenerate codons serve in the subsequent library construction pipeline. The pipeline starts with random sets of degenerate codons of desired library length and follows with random codon exchanges (standard codons) or a shift in nucleotide ratios (spiked codons). Exchanges and shifts are kept within the optimized codon set if the amino acid product comes closer to input distribution (evaluated by mean squared error) and rejected if not. Optimization is finished when repeated changes do not further improve the solution (specifically, after n = 1000 × [library length] rejected mutations) This threshold was selected after test runs of the optimization path which recorded the rejection rate of mutations and provided satisfactory deviation on all tested distributions (Supporting Fig. S2 A-D). The output of the algorithm is a vector of degenerate codons of given library length. In other words, CoLiDe provides a list of degenerate codons combined randomly into a single oligonucleotide tem- plate. CoLiDe offers a graphical user interface (Supporting Fig. S1) that aids input of all variables, displays statistics of the optimized solution, and allows the user to generate a report as a PDF document. CoLiDe is implemented in Python 3, and the source code is available as open source under MIT license at https://github.com/voracva1/CoLiDe. CoLiDe performance analysis We tested CoLiDe’s precision and reproducibility on the following four amino acid distributions: (i) a reduced alphabet used in protein evolution studies to approximate an early version of the genetic code (Solis, 2019), (ii) a functional distribution derived from an analysis of RNA-binding proteins (Blanco et al., 2018), (iii) a natural amino acid distribution from the UniProt database (UniProtKB/Swiss-Prot UniProt release 2019_11), and (iv) a rational selection of a reduced set of amino acids for protein engineering (Murphy et al., 2000) (Fig. 2A-D, Supporting table S1). For each amino acid distribution, optimization was performed 10 independent times for library lengths of 5, 10, 15, 20, 40, 60, 80, and 100 amino acids (Fig. 2E-H). CoLiDe was able to reliably spread all the tested distributions on a DNA template of given length. Mean squared errors in the shortest amino acid libraries ranged from 0.11 to 0.17 between individual alphabets and converged with increasing template length to values around 0.005. Variance in precision between solutions — measured as a coefficient of variation was highest in short libraries, ranging between 10-2 -10-3 , and decreased to values around 10-5 in longer templates (Supporting table S2). Our results confirmed that the algorithm consistently finds precise solutions to selected input amino acid distributions. The precision of the solution increases and the variance between solutions within each group decreases along with the increase in library template length. With reduced template length, error became dependent on the specific amino acid alphabet. Solutions using spiked codons showed better precision with similar variance within each group (Supporting table S2). CoLiDe runtimes were tested on four library templates (Fig. 2A-D) with the template sizes Figure 2. CoLiDe performance analysis. Amino acid distributions used to benchmark CoLiDe performance (A-D) and comparison of solutions generated from each (E-H). Each distribution was approximated via degenerate (red) and spiked (blue) codons. Solutions were produced in 10 replicates for various library lengths ranging from 5 to 100 amino acids Downloadedfromhttps://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa804/5909645bygueston07October2020 V.Tretyachenko et al. ranging from 5 to 400 degenerate codons. Reported runtimes range from ~3 to 600s on Intel i5-8250U laptop (Supporting Fig. S3). Diverse degenerate libraries can be produced with other available tools, even though they are designed for construction of different library types. CoLiDe, in contrast to alternative design tools (SwiftLib, DeCoDe), focuses on combinatorial library design without position-specific restraints. Designed libraries are suitable for probing the constrained sequence space rather than for screening small, rationally designed library of protein variants (Jacobs et al., 2015; Shimko et al., 2020). As an example, we compare the solutions for combinatorial libraries provided by degenerate codon optimization algorithm SwiftLib (Jacobs et al., 2015). SwiftLib outputs an optimized set of degenerate codons which cover the provided amino acid variability with as few degenerate codons as possible. Such approach faces difficulty to assure the precision of the distribution when targeting longer regions, whereas that is not the case for CoLiDe (Supporting Fig. S5). On the other hand, SwiftLib outperforms CoLiDe when very short randomized regions (of 2-3 codons) are calculated (Supporting Fig. S4). Deviations of ratios of single amino acids are reported in Supporting Tables S4 and S5. CoLiDe provides a better choice for combinatorial design of longer protein templates provided that overall amino acid distribution of sequence is preferred over the specific amino acid variations on predefined positions. Furthermore CoLiDe can be used in protein engineering applications for coarse grained yet computationally efficient vertical design (multiple degenerate oligonucleotides per one tube) of degenerate codons to approximate amino acid distributions in single protein positions, similarly to established deterministic approaches described by Jacobs and coworkers (Jacobs et al., 2015). Proof-of-concept experimental library design To identify general pitfalls and experimental bottlenecks of library preparation, we experimentally evaluated one specific CoLiDe solution from DNA to protein level. A 45 amino acid protein library was prepared with a randomized region of 33 amino acids, following the early alphabet distribution (Fig. 2A). The mean squared error of the randomized region with CoLiDe solution was 0.0022 with an error variance of 0.00011 (Fig. 3). The random 33 codon region was tagged with an 8×H+QH (i.e. octa-His + Gln-His) coding sequence (separated by a two amino acid linker, KS) on the C-terminus for subsequent purification (Supporting information, Sequence). The protein coding sequence was embedded into a linear expression cassette, and the library was transcribed as described in Materials and methods (Supporting Fig. S6). The length of the protein library was selected so that a single commercially synthesized oligonucleotide could be used for the downstream procedure. However, a larger construct could be prepared by DNA shuffling methods as previously described (Cho et al., 2000). Thus, CoLiDe algorithm can also be utilized for the construction of random protein libraries with amino acids residues up to several hundreds. Construction and characterization of the oligonucleotide library Nucleotide sequences for degenerate libraries were analyzed on the DNA and mRNA template levels by high-throughput sequencing (HTS). The in silico translated amino acid composition (from both the DNA and mRNA templates) showed good agreement with the designed construct (Fig. 3&4, Supporting table S6). While deviations of whole distributions are listed here as mean squared error calculated on (0,1) scale, we plot single amino acid occurrence as percentage of input distribution on (0,100) scale. Deviations between the CoLiDe solution and the in silico translated DNA template were observed in enrichment of valine, leucine, and isoleucine (2.9, 2.2 and 1.6 %) and depletion of proline, threonine, and alanine (3, 2.2 and 2.4 %) (Fig. 3&4, Supporting table S6). Upon analysis of nucleotide frequencies at each position, we found that deviation can be explained by the nucleotide composition bias during the oligonucleotide synthesis and have been confirmed as the current bottleneck by the provider (Supporting Fig. S7). Statistical analysis of the sequencing data provides a confirmation of library diversity and shows that vast majority (99.9 %) of all sequences are unique (Supporting table S7). Overall, mean squared error of amino acid distribution of DNA and RNA templates remained to be around ~0.02 (Supporting table S6). Hence, we found that while CoLiDe algorithm can provide low mean squared error for the library design, one should be aware of the nucleotide bias that will be introduced during the oligonucleotide synthesis of highly degenerate DNA oligonucleotides. Such nucleotide composition bias of DNA library depends on each oligonucleotide provider (unpublished observation). Construction and characterization of the protein library The combinatorial protein library was expressed using an in vitro translation system and His-tag purified for downstream analysis (Fig. 5A). Expressed proteins were assessed by mass spectrometry (Fig. 5B) and amino acid analysis (Fig. 5C, Supporting table S6). MALDI-TOF mass spectrometry revealed good agreement with expected values. The expected mass distribution was produced by analysis of 600,000 random sequences corresponding to the degenerate DNA template and by in silico translation of 600,000 sequences obtained by HTS of DNA and mRNA templates. The experimental spectrum is represented by normal weight distribution with a mean value of 5,029 Da and a standard deviation of 120.6 (Fig. 5B). This is slightly shifted from the mean value of the molecular weight distribution expected from the design (4,902 Da), partly as a result of sequence bias during the solid-state oligonucleotide synthesis. However, in silico translation of sequences obtained by HTS (producing a mean molecular weight of 4,957 Da) confirms that this Figure 3. Comparison of the amino acid distribution of the CoLiDe solution of 33 amino acid long library to its target distribution and the DNA and mRNA templates obtained from the high-throughput sequencing (HTS) data (upon in silico transla- tion) Figure 4. Preparation and analysis of DNA and RNA libraries. (left) Sequence logos generated in silico from the designed template (top), sequenced DNA template (middle), and sequenced reverse-transcribed mRNA (bottom). (right) Agarose gel electrophoresis of dsDNA library template (middle) and urea PAGE analysis of single stranded random library mRNA and (bottom). Polar and small amino acids (G, S, T, P, A) are green, hydrophobic and large amino acids are black (L, V, I) and negatively charged residues (D, E) are blue Downloadedfromhttps://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa804/5909645bygueston07October2020 CoLiDe explains only part of the shift. This result indicates that the translation and purification steps have introduced additional compositional shift into the protein library. Most notably, the purified protein library is under-represented in alanine, aspartic acid, and threonine (by 2-4 % from the desired amount) and enriched in glutamic acid and glycine (by ~5% from the input) as assessed by amino acid analysis (Fig. 5C), likely due to their impact on protein solubility and contamination by carry over protein components from the cell-free expression system in the purified library sample (Fig. 5A). While these deviations do not represent a major difference in the overall amino acid ratio profile (amino acid analysis shows an overall of 0.05 mean squared error (Supporting table S6)), it is important to be aware of the sequence biases that may be introduced into designed libraries during oligonucleotide synthesis and downstream procedures as a result of the translation and purification process or the physicochemical properties of the expressed proteins themselves. Currently, there is no satisfactory methodology to analyze the variability of the large protein sequence pool directly. One translation reaction (in a 20 µl volume) is typically primed with 1011 -1012 different template molecules. Even with the genotype-phenotype linked display methods (i.e. mRNA-display, ribosome display, etc.) number of characterized sequences is limited to the performance of HTS. Because neither DNA library preparation, RNA transcription nor the in vitro translation involve sequence amplification, a similar variability of protein sequences is expected after translation. The computational protocol therefore presents a tool for truly effective exploration of the protein sequence space. 3 Conclusions Here, we present CoLiDe, a novel tool for precise design of combinatorial protein libraries of flexible length and desired amino acid composition. We provide evidence that it performs with minimal error and variance across several different amino acid distributions and lengths. It significantly outperforms SwiftLib (that have been developed for other applications) especially when designing combinatorial libraries longer than ~10 amino acids. In addition, we present a model protocol for combinatorial library (composed of a 10 amino acid alphabet) preparation by cell-free expression. By monitoring the DNA and mRNA sequence pool during library preparation using HTS, we confirmed the desired variability (99.9% of the sequences representing unique species). While negligible error is detected between the input sequence and the CoLiDe solution, up to 3% deviations of individual amino acid ratios were detected upon in silico translation of the mRNA sequence pool. The error was primarily attributable to nucleotide compositional bias from the synthesis of the starting material. Using the template mRNA, we expressed and purified a highly variable protein library (represented by a normal weight distribution). To our knowledge, this is the first report of purification of a combinatorial protein library in an amount sufficient for biophysical characterization. The experimental procedure introduced additional detectable shifts among several amino acid compositions (up to 5% deviation), likely occurred during translation and purification steps of the library. Such an error is to be expected and may vary depending on the nature of individual amino acid alphabets. We estimate that 1011-1012 unique protein sequences can be produced in a 20-µl cell-free translation reaction using our protocol. The design and experimental strategy presented here can be used in combination with vertical library design strategies (i.e., mixing multiple degenerate templates) and DNA shuffling synthesis. This represents a powerful tool for the synthesis of combinatorial protein libraries composed of hundreds of amino acids. 4 Materials and methods 4.1 CoLiDe algorithm Basic definitions The following procedure addresses problem-solving with spiked codons (degenerate codons with variable nucleotide composition). If the domain is restricted to degenerate codons, the procedure differs slightly, as noted below. We considered spiked codon to be a 12-tuple concatenated from 4tuples representing each degenerated position of the triplet: (𝑇1, 𝐶1, 𝐴1, 𝐺1, 𝑇2, 𝐶2, 𝐴2, 𝐺2, 𝑇3, 𝐶3, 𝐴3, 𝐺3) satisfying ∀𝑖 ∈ {1,2,3}: 𝑇𝑖 + 𝐶𝑖 + 𝐴𝑖 + 𝐺𝑖 = 1 ∀𝑖 ∈ {1,2,3}∀𝑆 ∈ {𝑇, 𝐶, 𝐴, 𝐺}: 𝑆𝑖 ≥ 0 We also introduced a 12-tuple base-codon term: (𝑇1, 𝐶1, 𝐴1, 𝐺1, 𝑇2, 𝐶2, 𝐴2, 𝐺2, 𝑇3, 𝐶3, 𝐴3, 𝐺3) satisfying ∀𝑖 ∈ {1,2,3}: 𝑇𝑖 + 𝐶𝑖 + 𝐴𝑖 + 𝐺𝑖 ≥ 1 ∀𝑖 ∈ {1,2,3}∀𝑆 ∈ {𝑇, 𝐶, 𝐴, 𝐺}: 𝑆𝑖 ∈ {0,1} Base-codons serve as templates for codons. For example, the codon NNS can be represented by the 12-tuple (1,1,1,1,1,1,1,1,0,1,0,1), meaning that the first two positions can include all four bases and the last position is restricted to C or G only. By defining base-codon 𝒃, a spiked codon can be obtained by replacing 1’s in 𝒃 with non-zero numbers. Note that in Figure 5. Preparation and analysis of the protein library. (A) SDS-PAGE and Western blot analysis of library expression and purification. The library was expressed in a recombinant cell-free system PUREfrex 2.0. -/+ stands for cell free fraction without and with expressed library, FT is affinity purification flow through, and E is eluted fraction.(B) MALDITOF MS analysis of the purified library (black) compared with the theoretical mass distribution (blue) and mass distribution calculated from sequenced DNA templates (red). (C) Results of amino acid analysis deviations of variable (colored) and constant sequence regions/contaminations (grey) of the expressed and purified protein library in percentage units. Downloadedfromhttps://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa804/5909645bygueston07October2020 V.Tretyachenko et al. cases of restriction to degenerate codons, there is one-to-one mapping between degenerate codons and base-codons. The optimization problem can be formulated as follows: Given amino acid sequence length 𝒍; desired amino acid distribution 𝑫, which is a vector of 21 non-negative numbers summing up to 1, one number for each amino acid; a set of forbidden codons 𝑭; and a distance function dist, find a multiset 𝑴 cardinality 𝒍 of codons, minimizing 𝑑𝑖𝑠𝑡(𝑫, 𝑴), subject to ∀𝒎 ∈ 𝑴∀𝒇 ∈ 𝑭∃𝑝: 𝒇 𝑝 ≠ 0 ⇒ 𝒎 𝑝 = 0, where 𝒇 𝑝 is an element of 𝒇 on position 𝑝. This condition guarantees that there are no forbidden codons in 𝑴. Every codon encodes a distribution of amino acids. Hence, 𝑴 representing a multiset of degenerate codons, can be considered as a mixture distribution of amino acids encoded by its codons. The closer the mixture distribution encoded by 𝑴 is to 𝑫, the smaller 𝑑𝑖𝑠𝑡(𝑫, 𝑴) should be. We defined 𝑫 as a vector in ℝ21 , so that we could use a norm to measure the distance between two distributions. Common norms include the 𝑳 𝟏 norm, which is a sum of absolute values of elements, and the 𝑳 𝟐 norm, which is a square root of the sum of squares of elements. As square root is a strictly increasing function, minimizing the square root of a sum of squares and minimizing a sum of squares yield the same optimal argument. The third common norm is the 𝑳∞ norm, which is the greatest absolute value of elements. We used the 𝑳 𝟐 norm in our implementation, as it penalizes large differences considerably but is permissive for slight deviations. Algorithm We present the base implementation of the CoLiDe algorithm as a pseu- docode: 1. BC ← generate valid base-codons 2. M ← ∅ 3. For i = 1 to l: (a) bc ← random element from BC (b) c ← make random codon from bc (c) M ← M ∪ {c} 4. rejected ← 0 5. While rejected < 1000 · l: (a) bc ← random element from BC (b) c ← make random codon from bc (c) dold ← dist(D, M) (d) M2 ← M ∪ {c}\ (random element from M) (e) dnew ← dist(D, M2) (f) If dnew < dold i. M ← M2 ii. rejected ← 0 Else i. rejected ← rejected + 1 6. Output M In the first step, valid base-codons are generated. There are 3 independent sequences in base-codon (𝑇𝑖, 𝐶𝑖, 𝐴𝑖, 𝐺𝑖, 𝑖 ∈ {1,2,3}), and every sequence is an arbitrary binary string of length 4, excluding string 0000. There are 24 −1 such strings, so the number of base codons is (24 − 1)3 = 3,375. Along the fact that there are at most 64 forbidden codons, the time needed to execute this step is negligible with any reasonable implementation. In the third step, filling multiset 𝑴 with random codons yields an initial result. In the fifth step, the optimization is performed. Once per loop, a random codon is generated, and an attempt is made to replace a random codon in 𝑴 with this codon. If the objective improves, the change is accepted; otherwise, it is rejected. The algorithm works reasonably well and reasonably quickly (visualization of results is many times slower than the algorithm itself). The base algorithm can be easily modified, because dist can be chosen arbitrarily. In our implementation, dist is chosen as the 𝑳 𝟐 norm of the vector of differences between 𝑫 and the distribution of amino acids encoded by codons of 𝑴. This problem also could be formulated as a quadratic programming task, but it would be difficult or even impossible to add new requirements to the result. The ability of the algorithm to be easily extended to new problems offers flexibility. Library construction Preparation of DNA and RNA templates A degenerate ssDNA of 197 bases was synthesized by Integrated DNA Technologies (Suppl Sequences, library). The oligonucleotide was converted to dsDNA by Klenow extension with a 5′ complementary reverse primer (Supporting sequences, reverse). Annealing of the primer was performed by cooling down a mixture of 2 μM oligonucleotide and primer in the presence of 200 μM dNTPs in buffer NEB1 from 90 °C to 25 °C at a rate of 1 °C/min. Total 10 U Klenow polymerase was added to the annealed mixture, and extension step was carried out for 1 h at 37 °C followed by polymerase deactivation at 50 °C for 15 min. The dsDNA library product was purified with the Monarch® PCR & DNA Cleanup Kit (New England Biolabs) and used for the downstream in vitro transcription, carried out with the Ampliscribe T7-Flash kit (Lucigen) according to the manufacturer’s recommendations. The resulting mRNA was purified by ammonium acetate precipitation and dissolved in RNase free water to a final concentration of 3 µg/ul. cDNA preparation for high-throughput sequencing (HTS) Complementary DNA (cDNA) was prepared from 1 µg transcribed mRNA. cDNA was synthesized according to the SuperScript IV (Thermo Fisher Scientific) instruction manual using reverse primer (Suppl Sequences, reverse) and 20 μl reverse transcribed product was further amplified with Q5 DNA polymerase (New England Biolabs) in a 100-µl reaction volume for 11 amplification cycles with a primer annealing temperature of 68 °C. Protein expression and purification for amino acid analysis and mass spectrometry The protein library was prepared in a PUREfrex 2.0 (GeneFrontier Corporation) cell-free protein expression system. The reaction was prepared according to the manufacturer’s recommendations, supplemented with 0.05% Triton X-100 (v/v), and initiated by addition of 3 µg library mRNA. Protein expression was conducted for 4 h at 30 °C. The reaction was diluted 10 times with guanidine denaturation buffer (6 M guanidine hydrochloride, 100 mM sodium phosphate, 500 mM NaCl, 0.05% Triton X-100, pH 8) and incubated with 4 µl TALON affinity chromatography resin (Clontech) for 12 h at 25 °C. The resin was washed twice with urea denaturation buffer (8 M urea, 100 mM sodium phosphate, 500 mM NaCl, 0.05% Triton X-100, pH 8) and twice with distilled water supplemented with 0.05% Triton X-100. The library was eluted by boiling the affinity matrix in 50 µl of 2% (w/v) aqueous SDS. Eluted fractions were purified from SDS by addition of 5× volumes of ice-cold acetone. The precipitates were centrifuged, washed with 100% acetone, and air-dried. Preparation of libraries for HTS and data analysis The dsDNA library template was analyzed by HTS with an Illumina MiSeq. Prior to sequencing the library preparation, quantification was carried out on a Quantus™ Fluorometer (Promega). A total of 100 ng of DNA sample was used as an input for library preparation with the NEBNext Ultra II DNA Library Prep kit (New England Biolabs) with AMPure XP purification beads (Beckman Coulter). The length of the prepared library was determined with an Agilent 2100 Bioanalyzer (Agilent Technologies) and quantified with a Quantus Fluorometer (Promega). Samples were sequenced on a MiSeq Illumina platform using the Miseq Reagent Kit v2 for 500 cycles (2 × 250) in paired-end mode. Raw data was processed with Galaxy platform. Sequence analysis of assembled and filtered paired reads was performed with MatLab scripts developed by the Heinis lab (Afgan et al., 2018; Rebollo et al., 2014). Amino acid analysis and mass spectrometry The purified and precipitated library samples were hydrolyzed in 6 M hydrochloric acid at 110 °C for 20 hours, the hydrolysate was evaporated, and reconstituted with 0.1 M hydrochloric acid containing the internal standard. Amino acid analysis was performed on an Agilent 1260 HPLC Downloadedfromhttps://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa804/5909645bygueston07October2020 CoLiDe (Agilent Technologies) equipped with a fluorescence detector using automated o-phtalaldehyde / 2-mercaptopropionic acid (OPA / MPA) derivatization. For mass spectrometry, the purified protein library sample was resuspended in water. The spectrum was collected after addition of 2,5dihydroxybezoic acid matrix substance (Merck) using an UltrafleXtreme™ MALDI-TOF/TOF mass spectrometer (Bruker Daltonics, Germany) in linear mode. Acknowledgements We would like to thank Prof. Mirko Navara for helpful discussions, Dr. Martin Hubálek for MS analyses, Dr. Hillary Hoffman for language editing and Shota Nishikawa and Hidenori Watanabe for providing the library for comparison Funding This work was supported by the Czech Science Foundation (GA ČR) [17-10438Y]; Human Frontiers Science Program (RGY0074/2019); Charles University Grant Agency [260572 / 2020] to VT; Ministry of Education, Youth and Sports of CR [National Sustainability Program II, BIOCEVFAR, LQ1604] to KH; and ELSIFirstLogic Astrobiology Donation Program to KF. Conflict of Interest: none declared. References Afgan,E. et al. (2018) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res., 46, W537–W544. Blanco,C. et al. (2018) Analysis of Evolutionarily Independent Protein-RNA Complexes Yields a Criterion to Evaluate the Relevance of Prebiotic Scenarios. Curr. Biol., 28, 526-537.e5. Bornberg-Bauer,E. and Heames,B. (2019) Becoming a de novo gene. Nat. Ecol. Evol., 3, 524–525. Cedano,J. et al. (1997) Relation between amino acid composition and cellular location of proteins. J. Mol. Biol., 266, 594–600. Chao,F.-A. et al. (2013) Structure and dynamics of a primordial catalytic fold generated by in vitro evolution. Nat. Chem. Biol., 9, 81–83. Chiarabelli,C. et al. (2006) Investigation of de novo Totally Random Biosequences. Chem. Biodivers., 3, 827–839. Cho,G. et al. (2000) Constructing high complexity synthetic libraries of long ORFs using in vitro selection. J. Mol. Biol., 297, 309–319. Craig,R.A. et al. (2009) Optimizing nucleotide sequence ensembles for combinatorial protein libraries using a genetic algorithm. Nucleic Acids Res., 38, 1–9. Davidson,A.R. and Sauer,R.T. (1994) Folded proteins occur frequently in libraries of random amino acid sequences. Proc. Natl. Acad. Sci. U. S. A., 91, 2146– 2150. Doi,N. et al. (2005) High solubility of random-sequence proteins consisting of five kinds of primitive amino acids. Protein Eng. Des. Sel., 18, 279–284. Donnelly,A.E. et al. (2018) A de novo enzyme catalyzes a life-sustaining reaction in Escherichia coli. Nat. Chem. Biol., 14, 253–255. Fisher,M. a. et al. (2011) De novo designed proteins from a library of artificial sequences function in Escherichia Coli and enable cell growth. PLoS One, 6, e15364. Govindarajan,S. et al. (1999) Estimating the total number of protein folds. Proteins Struct. Funct. Genet., 35, 408–414. Guruprasad,K. et al. (1990) Correlation between stability of a protein and its dipeptide composition: A novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng. Des. Sel., 4, 155–161. Jacobs,T.M. et al. (2015) SwiftLib: Rapid degenerate-codon-library optimization through dynamic programming. Nucleic Acids Res., 43, 1–10. Jaradat,D.M.M. (2018) Thirteen decades of peptide synthesis: key developments in solid phase peptide synthesis and amide bond formation utilized in peptide ligation. Amino Acids, 50, 39–68. Keefe,A.D. and Szostak,J.W. (2001) Functional proteins from a random-sequence library. Nature, 410, 715–718. Kille,S. et al. (2013) Reducing codon redundancy and screening effort of combinatorial protein libraries created by saturation mutagenesis. ACS Synth. Biol., 2, 83–92. Labean,T.H. et al. (2011) Protein folding absent selection. Genes (Basel)., 2, 608– 26. Liu,C.C. and Schultz,P.G. (2010) Adding new chemistries to the genetic code. Annu. Rev. Biochem., 413–44. Luisi,P.L. (2006) The emergence of life: From chemical origins to synthetic biology 1 edition. Cambridge University Press. Murphy,L.R. et al. (2000) Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng. Des. Sel., 13, 149– 152. Neme,R. et al. (2017) Random sequences are an abundant source of bioactive RNAs or peptides. Nat. Ecol. Evol., 1, 1–7. Newton,M.S. et al. (2019) Genetic Code Evolution Investigated through the Synthesis and Characterisation of Proteins from Reduced-Alphabet Libraries. ChemBioChem, 20, 846–856. Ravarani,C.N. et al. (2018) High‐throughput discovery of functional disordered regions: investigation of transactivation domains. Mol. Syst. Biol., 14(5). Rebollo,I.R. et al. (2014) Identification of target-binding peptide motifs by highthroughput sequencing of phage-selected peptides. Nucleic Acids Res., 42, e169–e169. Riba,A. et al. (2018) Protein synthesis rates and ribosome occupancies reveal determinants of translation elongation rates. bioRxiv, 465914. Shimko,T.C. et al. (2020) DeCoDe: degenerate codon design for complete proteincoding DNA libraries. Bioinformatics, 1–7. Solis,A.D. (2019) Reduced alphabet of prebiotic amino acids optimally encodes the conformational space of diverse extant protein folds. BMC Evol. Biol., 19, 1–19. Tang,L. et al. (2012) Construction of ‘small-intelligent’ focused mutagenesis libraries using well-designed combinatorial degenerate primers. Biotechniques, 52, 149–158. Downloadedfromhttps://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa804/5909645bygueston07October2020 V.Tretyachenko et al. Tretyachenko,V. et al. (2017) Random protein sequences can form defined secondary structures and are well-tolerated in vivo. Sci. Rep., 7. Virnekas,B. et al. (1994) Trinucleotide phosphoramidites: Ideal reagents for the synthesis of mixed oligonucleotides for random mutagenesis. Nucleic Acids Res., 22, 5600–5607. Vymětal,J. et al. (2019) Sequence versus composition: What prescribes IDP biophysical properties? Entropy, 21, 1–8. Wang,J. et al. (2018) A Molecular Grammar Governing the Driving Forces for Phase Separation of Prion-like RNA Binding Proteins. Cell, 174, 688- 699.e16. Weidmann,L. et al. (2019) Where Natural Protein Sequences Stand out From Randomness. bioRxiv, 706119. Wolf,E. and Kim,P.S. (1999) Combinatorial codons: a computer program to approximate amino acid probabilities with biased nucleotide usage. Protein Sci., 8, 680–8. Downloadedfromhttps://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa804/5909645bygueston07October2020