PV248 Python
Petr Ročkai
October 22, 2020

Part 1: Object Model

Objects
• the basic 'unit' of OOP
• also known as 'instances'
• they bundle data and behaviour
• provide encapsulation
• local (object) invariants
• make code re-use easier

Classes
• each (Python) object belongs to a class
• templates for objects
• calling a class creates an instance
  ◦ my_foo = Foo()
• classes themselves are also objects

Types vs Objects
• class system is a type system
• since Python 3, types are classes
• everything is dynamic in Python
  ◦ variables are not type-constrained

Poking at Classes
• you can pass classes as function parameters
• you can create classes at runtime
• and interact with existing classes:
  ◦ {}.__class__, (0).__class__
  ◦ {}.__class__.__class__
  ◦ compare type(0), etc.
  ◦ n = numbers.Number(); n.__class__

Encapsulation
• objects hide implementation details
• classic types structure data
  ◦ objects also structure behaviour
• facilitates loose coupling

Loose Coupling
• coupling is a degree of interdependence
• more coupling makes things harder to change
  ◦ it also makes reasoning harder
• good programs are loosely coupled
• cf. modularity, composability

Polymorphism
• objects are (at least in Python) polymorphic
• different implementation, same interface
  ◦ only the interface matters for composition
• facilitates genericity and code re-use
• cf. 'duck typing'

Generic Programming
• code re-use often saves time
  ◦ not just coding but also debugging
  ◦ re-usable code often couples loosely
• but not everything that can be re-used should be
  ◦ code can be too generic
  ◦ and too hard to read

Attributes
• data members of objects
• each instance gets its own copy
  ◦ like variables scoped to object lifetime
• they get names and values

Methods
• functions (procedures) tied to objects
• implement the behaviour of the object
• they can access the object (self)
• their signatures (usually) provide the interface
• methods are also objects

Class and Instance Methods
• methods are usually tied to instances
• recall that classes are also objects
• class methods work on the class (cls)
• static methods are just namespaced functions
• decorators @classmethod, @staticmethod

Inheritance
[class diagram: ellipse and rectangle derive from shape; circle derives from ellipse, square from rectangle]
    class Ellipse( Shape ): ...
• usually encodes an is-a relationship

Multiple Inheritance
• more than one base class is possible
• many languages restrict this
• Python allows general M-I
  ◦ class Bat( Mammal, Winged ): pass
• 'true' M-I is somewhat rare
  ◦ typical use cases: mixins and interfaces

Mixins
• used to pull in implementation
  ◦ not part of the is-a relationship
  ◦ by convention, not enforced by the language
• common bits of functionality
  ◦ e.g. implement __gt__, __eq__ &c. using __lt__
  ◦ you only need to implement __lt__ in your class

Interfaces
• realized as 'abstract' classes in Python
  ◦ just raise NotImplementedError
  ◦ document the intent in a docstring
• participates in is-a relationships
• partially displaced by duck typing
  ◦ more important in other languages (think Java)

Composition
• attributes of objects can be other objects
  ◦ (also, everything is an object in Python)
• encodes a has-a relationship
  ◦ a circle has a center and a radius
  ◦ a circle is a shape

Constructors
• this is the __init__ method
• initializes the attributes of the instance
• can call superclass constructors explicitly
  ◦ not called automatically (unlike C++, Java)
  ◦ MySuperClass.__init__( self )
  ◦ super().__init__ (if unambiguous)

Class and Object Dictionaries
• most objects are basically dictionaries
• try e.g. foo.__dict__ (for a suitable foo)
• saying foo.x means foo.__dict__["x"]
  ◦ if that fails, type(foo).__dict__["x"] follows
  ◦ then superclasses of type(foo), according to MRO
• this is what makes monkey patching possible

Writing Classes
    class Person:
        def __init__( self, name ):
            self.name = name
        def greet( self ):
            print( "hello " + self.name )

    p = Person( "you" )
    p.greet()

Functions
• top-level functions/procedures are possible
• they are usually 'scoped' via the module system
• functions are also objects
  ◦ try print.__class__ (or type(print))
• some functions are built in (print, len, ...)

Modules in Python
• modules are just normal .py files
• import executes a file by name
  ◦ it will look into system-defined locations
  ◦ the search path includes the current directory
  ◦ they typically only define classes & functions
• import sys → lets you use sys.argv
• from sys import argv → you can write just argv

Part 2: Memory Management & Builtin Types

Memory
• most program data is stored in 'memory'
  ◦ an array of byte-addressable data storage
  ◦ address space managed by the OS
  ◦ 32 or 64 bit numbers as addresses
• typically backed by RAM

Language vs Computer
• programs use high-level concepts
  ◦ objects, procedures, closures
  ◦ values can be passed around
• the computer has a single array of bytes
  ◦ and a bunch of registers

Memory Management
• deciding where to store data
• high-level objects are stored in flat memory
  ◦ they have a given (usually fixed) size
  ◦ have limited lifetime

Memory Management Terminology
• object: an entity with an address and size
  ◦ can contain references to other objects
  ◦ not the same as language-level object
• lifetime: when is the object valid
  ◦ live: references exist to the object
  ◦ dead: the object is unreachable - garbage

Memory Management by Type
• manual: malloc and free in C
• static automatic
  ◦ e.g. stack variables in C and C++
• dynamic automatic
  ◦ pioneered by LISP, widely used

Automatic Memory Management
• static vs dynamic
  ◦ when do we make decisions about lifetime
  ◦ compile time vs run time
• safe vs unsafe
  ◦ can the program read unused memory?
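
The 'live' vs 'dead' terminology above can be observed from Python itself. The following is a minimal sketch, not a description of how any particular collector works: the class Node and the variable names are hypothetical, and the fact that the object disappears right after the last reference is dropped is a CPython implementation detail (other interpreters may only reclaim it during a later collection, which is why gc.collect() is called).

```python
import gc
import weakref


class Node:
    """Hypothetical class, used only for this demonstration."""

    def __init__(self, label):
        self.label = label


node = Node("a")
probe = weakref.ref(node)   # a weak reference does not keep the object alive
alias = node                # a second strong reference to the same object

del node                    # one strong reference is gone ...
still_live = probe() is alias   # ... but the object is still reachable

del alias                   # the last strong reference is gone
gc.collect()                # ask for a collection (not needed in CPython)
is_dead = probe() is None   # the weak reference no longer resolves

print(still_live, is_dead)
```

The weak reference plays the role of an observer: it lets the program ask whether the object still exists without itself counting as a reference.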
Object Lifetime
• the time between malloc and free
• another view: when is the object needed
  ◦ often impossible to tell
  ◦ can be safely over-approximated
  ◦ at the expense of memory leaks

Static Automatic
• usually binds lifetime to lexical scope
• no passing references up the call stack
  ◦ may or may not be enforced
• no lexical closures
• examples: C, C++

Dynamic Automatic
• over-approximate lifetime dynamically
• usually easiest for the programmer
  ◦ until you need to debug a space leak
• reference counting, mark & sweep collectors
• examples: Java, almost every dynamic language

Reference Counting
• attach a counter to each object
• whenever a reference is made, increase
• whenever a reference is lost, decrease
• the object is dead when the counter hits 0
• fails to reclaim reference cycles

Mark and Sweep
• start from a root set (in-scope variables)
• follow references, mark every object encountered
• sweep: throw away all unmarked memory
• usually stops the program while running
• garbage is retained until the GC runs

Memory Management in CPython
• primarily based on reference counting
• optional mark & sweep collector
  ◦ enabled by default
  ◦ configure via import gc
  ◦ reclaims cycles

Refcounting Advantages
• simple to implement in a 'managed' language
• reclaims objects quickly
• no need to pause the program
• easily made concurrent

Refcounting Problems
• significant memory overhead
• problems with cache locality
• bad performance for data shared between threads
• fails to reclaim cyclic structures

Data Structures
• an abstract description of data
• leaves out low-level details
• makes writing programs easier
• makes reading programs easier, too

Building Data Structures
• there are two kinds of types in python
  ◦ built-in, implemented in C
  ◦ user-defined (includes libraries)
• both kinds are based on objects
  ◦ but built-ins only look that way

Mutability
• some objects can be modified
  ◦ we say they are mutable
  ◦ otherwise, they are immutable
• immutability is an abstraction
  ◦ physical memory is always mutable
• in python, immutability is not 'recursive'

Built-in: int
• arbitrary precision integer
  ◦ no overflows and other nasty behaviour
• it is an object, i.e. held by reference
  ◦ uniform with any other kind of object
  ◦ immutable
• both of the above make it slow
  ◦ machine integers only in C-based modules

Additional Numeric Objects
• bool: True or False
  ◦ how much is True + True?
  ◦ is 0 true? is empty string?
• numbers.Real: floating point numbers
• numbers.Complex: a pair of above

Built-in: bytes
• a sequence of bytes (raw data)
• exists for efficiency reasons
  ◦ in the abstract is just a tuple
• models data as stored in files
  ◦ or incoming through a socket
  ◦ or as stored in raw memory

Properties of bytes
• can be indexed and iterated
  ◦ both create objects of type int
  ◦ try this sequence: id(x[1]), id(x[2])
• mutable version: bytearray
  ◦ the equivalent of C char arrays

Built-in: str
• immutable Unicode strings
  ◦ not the same as bytes
  ◦ bytes must be decoded to obtain str
  ◦ (and str encoded to obtain bytes)
• stored in a compact fixed-width form in CPython (1, 2 or 4 bytes per code point, PEP 393)
  ◦ implemented in PyCompactUnicodeObject

Built-in: tuple
• an immutable sequence type
  ◦ the number of elements is fixed
  ◦ so is the type of each element
• but elements themselves may be mutable
  ◦ x = [] then y = (x, 0)
  ◦ x.append(1) → y == ([1], 0)
• implemented as a C array of object references

Built-in: list
• a mutable version of tuple
  ◦ items can be assigned x[3] = 5
  ◦ items can be append-ed
• implemented as a dynamic array
  ◦ many operations are amortised O(1)
  ◦ insert is O(n)

Built-in: dict
• implemented as a hash table
• some of the most performance-critical code
  ◦ dictionaries appear everywhere in python
  ◦ heavily hand-tuned C code
• both keys and values are objects

Hashes and Mutability
• dictionary keys must be hashable
  ◦ this implies recursive immutability
• what would happen if a key is mutated?
  ◦ most likely the hash would change
  ◦ all hash tables with the key become invalid
  ◦ this would be very expensive to fix

Built-in: set
• implements the math concept of a set
• also a hash table, but with keys only
  ◦ a separate C implementation
• mutable - items can be added
  ◦ but they must be hashable
  ◦ hence cannot be changed

Built-in: frozenset
• an immutable version of set
• always hashable (since all items must be)
  ◦ can appear in set or another frozenset
  ◦ can be used as a key in dict
• the C implementation is shared with set

Efficient Objects: __slots__
• fixes the attribute names allowed in an object
• saves memory: consider 1-attribute object
  ◦ with __dict__: 56 + 112 bytes
  ◦ with __slots__: 48 bytes
• makes code faster: no need to hash anything
  ◦ more compact in memory → better cache efficiency

Part 3: Text, JSON and XML

Transient Data
• lives in program memory
• data structures, objects
• interpreter state
• often implicit manipulation
• more on this next week

Persistent Data
• (structured) text or binary files
• relational (SQL) databases
• object and 'flat' databases (NoSQL)
• manipulated explicitly

Persistent Storage
• 'local' file system
  ◦ stored on HDD, SSD, ...
  ◦ stored somewhere in a local network
• 'remote', using an application-level protocol
  ◦ local or remote databases
  ◦ cloud storage &c.

Reading Files
• opening files: open('file.txt', 'r')
• files can be iterated
    f = open( 'file.txt', 'r' )
    for line in f:
        print( line )

Resource Acquisition
• plain open is prone to resource leaks
  ◦ what happens during an exception?
  ◦ holding a file open is not free
• pythonic solution: with blocks
  ◦ defined in PEP 343
  ◦ binds resources to scopes

Detour: PEP
• PEP stands for Python Enhancement Proposal
• akin to RFC documents managed by IETF
• initially formalise future changes to Python
  ◦ later serve as documentation for the same

Using with
    with open('/etc/passwd', 'r') as f:
        for line in f:
            do_stuff( line )
• still safe if do_stuff raises an exception

Finalizers
• there is a __del__ method
• but it is not guaranteed to run
  ◦ it may run arbitrarily late
  ◦ or never
• not very good for resource management

Context Managers
• with has an associated protocol
• you can use with on any context manager
• which is an object with __enter__ and __exit__
• you can create your own

Part 3.1: Text and Unicode

Representing Text
• ASCII: one byte = one character
  ◦ total of 128 different characters
  ◦ not very universal
• 8-bit encodings: 256 characters
• multi-byte encodings for non-Latin scripts

Unicode
• one character encoding to rule them all
• supports all extant scripts and writing systems
  ◦ and a whole bunch of dead scripts, too
• approx. 143000 code points
• collation, segmentation, comparison, ...

Code Point
• basic unit of encoding characters
• letters, punctuation, symbols
• combining diacritical marks
• not the same thing as a character
• code points range from 0 to 10FFFF

Unicode Encodings
• deals with representing code points
• UCS = Universal Coded Character Set
  ◦ fixed-length encoding
  ◦ two variants: UCS-2 (16 bit) and UCS-4 (32 bit)
• UTF = Unicode Transformation Format
  ◦ variable-length encoding
  ◦ variants: UTF-8, UTF-16 and UTF-32

Grapheme
• technically 'extended grapheme cluster'
• a logical character, as expected by users
  ◦ encoded using 1 or more code points
• multiple encodings of the same grapheme
  ◦ e.g. composed vs decomposed
  ◦ U+0041 U+0300 vs U+00C0: À vs À

Segmentation
• breaking text into smaller units
  ◦ graphemes, words and sentences
• algorithms defined by the Unicode spec
  ◦ Unicode Standard Annex #29
  ◦ graphemes and words are quite reliable
  ◦ sentences not so much (too much ambiguity)

Normal Form
• Unicode defines 4 canonical (normal) forms
  ◦ NFC, NFD, NFKC, NFKD
  ◦ NFC = Normal Form Composed
  ◦ NFD = Normal Form Decomposed
• K variants = looser, lossy conversion
• all normalization is idempotent
• NFC does not give you 1 code point per grapheme

str vs bytes
• iterating bytes gives individual bytes
  ◦ indexing is fast - fixed-size elements
• iterating str gives code points
  ◦ slightly slower than for bytes
  ◦ does not iterate over graphemes
• going back and forth: str.encode, bytes.decode

Python vs Unicode
• no native support for Unicode segmentation
  ◦ hence no grapheme iteration or word splitting
• convert everything into NFC and hope for the best
  ◦ unicodedata.normalize()
  ◦ will sometimes break (we'll discuss regexes in a bit)
  ◦ most people don't bother
  ◦ correctness is overrated → worse is better

Regular Expressions
• compiling: r = re.compile( r"key: (.*)" )
• matching: m = r.match( "key: some value" )
• extracting captures: print( m.group( 1 ) )
  ◦ prints some value
• substitutions: s2 = re.sub( r"\s*$", "", s1 )
  ◦ strips all trailing whitespace in s1

Detour: Raw String Literals
• the r in r"..." stands for raw (not regex)
• normally, \ is magical in strings
  ◦ but \ is also magical in regexes
  ◦ nobody wants to write \\s &c.
  ◦ not to mention \\\\ to match a literal \
• not super useful outside of regexes

Detour: Other Literal Types
• byte strings: b"abc" → bytes
• formatted string literals: f"x {y}"
    x = 12
    print( f"x = {x}" )
• triple-quote literals: """x y"""

Regular Expressions vs Unicode
    import re
    s = "\u0041\u0300"  # À (decomposed)
    t = "\u00c0"        # À (composed)
    print( s, t )
    print( re.match( "..", s ), re.match( "..", t ) )
    print( re.match( r"\w+$", s ), re.match( r"\w+$", t ) )
    print( re.match( "À", s ), re.match( "À", t ) )

Regexes and Normal Forms
• some of the problems can be fixed by NFC
  ◦ some go away completely (literal Unicode matching)
  ◦ some become rarer (the "." and "\w" problems)
• most text in the wild is already in NFC
  ◦ but not all of it
  ◦ case in point: filenames on macOS (NFD)

Decomposing Strings
• recall that str is immutable
• splitting: str.split(':')
  ◦ None = split on any whitespace
• split on first delimiter: partition
• better whitespace stripping: s2 = s1.strip()
  ◦ also lstrip() and rstrip()

Searching and Matching
• startswith and endswith
  ◦ often convenient shortcuts
• find ≈ index (find returns -1 on failure, index raises ValueError)
  ◦ generic substring search

Building Strings
• format literals and str.format
• str.replace - substring search and replace
• str.join - turn lists of strings into a string

Part 3.2: Structured Text

JSON
• structured, text-based data format
• atoms: integers, strings, booleans
• objects (dictionaries), arrays (lists)
• widely used around the web &c.
• simple (compared to XML or YAML)

JSON: Example
    {
        "composer": [ "Bach, Johann Sebastian" ],
        "key": "g",
        "voices": { "1": "oboe", "2": "bassoon" }
    }

JSON: Writing
• printing JSON seems straightforward enough
• but: double quotes in strings
• strings must be properly \-escaped during output
• also pesky commas
• keeping track of indentation for human readability
• better use an existing library: import json

JSON in Python
• json.dumps = short for dump to string
• python dict/list/str/... data comes in
• a string with valid JSON comes out
Workflow
• just convert everything to dict and list
• run json.dumps or json.dump( data, file )

Python Example
    import json, sys
    d = {}
    d["composer"] = ["Bach, Johann Sebastian"]
    d["key"] = "g"
    d["voices"] = { 1: "oboe", 2: "bassoon" }
    json.dump( d, sys.stdout, indent=4 )
Beware: keys are always strings in JSON

Parsing JSON
• import json
• json.load is the counterpart to json.dump from above
  ◦ de-serialise data from an open file
  ◦ builds lists, dictionaries, etc.
• json.loads corresponds to json.dumps

XML
• meant as a lightweight and consistent redesign of SGML
  ◦ turned into a very complex format
• heaps of invalid XML floating around
  ◦ parsing real-world XML is a nightmare
  ◦ even valid XML is pretty challenging

XML: Example
[example XML document; the markup was lost, the surviving text content is: Ellen Adams, 123 Maple Street, Lawnmower, 1]
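
As a rough illustration of what reading such an order document looks like in Python, here is a minimal sketch using the standard xml.etree.ElementTree module (ElementTree is covered later in this part). The element and attribute names (Order, Address, Name, Street, Item, Quantity) are assumptions made for this sketch; only the text values and the OrderDate attribute appear in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tag names are assumed for illustration,
# only the values are taken from the slides.
DOC = """
<Order OrderDate="1999-10-20">
  <Address>
    <Name>Ellen Adams</Name>
    <Street>123 Maple Street</Street>
  </Address>
  <Item>
    <Name>Lawnmower</Name>
    <Quantity>1</Quantity>
  </Item>
</Order>
"""

root = ET.fromstring(DOC)

# Attributes are exposed as a plain dict, text content via findtext().
order_date = root.attrib["OrderDate"]
customer = root.findtext("Address/Name")
street = root.findtext("Address/Street")

# Walk over all Item children and convert the quantity to an int.
items = [(item.findtext("Name"), int(item.findtext("Quantity")))
         for item in root.findall("Item")]

print(order_date, customer, street, items)
```

Note that everything comes out of the parser as a string; converting the quantity to int is the job of the application, since XML itself carries no type information without a schema.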
XML: Another Example
[example XML document; most of the markup was lost, the surviving content is: <OBSAH>25 bodů (Czech: 'content', '25 points'), 72873, 20160111104208, 395879]

XML Features
• offers extensible, rich structure
  ◦ tags, attributes, entities
  ◦ suited for structured hierarchical data
• schemas: use XML to describe XML
  ◦ allows general-purpose validators
  ◦ self-documenting to a degree

XML vs JSON
• both work best with trees
• JSON has basically no features
  ◦ basic data structures and that's it
• JSON data is ad-hoc and usually undocumented
  ◦ but: this often happens with XML anyway

XML Parsers
• DOM = Document Object Model
• SAX = Simple API for XML
• expat = fast SAX-like parser (but not SAX)
• ElementTree = DOM-like but more pythonic

XML: DOM
• read the entire XML document into memory
• exposes the AST (Abstract Syntax Tree)
• allows things like XPath and CSS selectors
• the API is somewhat clumsy in Python

XML: SAX
• event-driven XML parsing
• much more efficient than DOM
  ◦ but often harder to use
• only useful in Python for huge XML files
  ◦ otherwise just use ElementTree

XML: ElementTree
    for child in root:
        print( child.tag, child.attrib )
    # Order { OrderDate: "1999-10-20" }
• supports tree walking, XPath
• supports serialization too

Part 4: Databases, SQL

NoSQL / Non-relational Databases
• umbrella term for a number of approaches
  ◦ flat key/value and column stores
  ◦ document and graph stores
• no or minimal schemas
• non-standard query languages

Key-Value Stores
• usually very fast and very simple
• completely unstructured values
• keys are often database-global
  ◦ workaround: prefixes for namespacing
  ◦ or: multiple databases

NoSQL & Python
• redis (redis-py) module (Redis is Key-Value)
• memcached (another Key-Value store)
• PyMongo for talking to MongoDB (document-oriented)
• CouchDB (another document-oriented store)
• neo4j or cayley (module pyley) for graph structures

SQL and RDBMS
• SQL = Structured Query Language
• RDBMS = Relational DataBase Management System
• SQL is to NoSQL what XML is to JSON
• heavily used and extremely reliable

SQL: Example
    select name, grade from student;
    select name from student where grade < 'C';
    insert into student ( name, grade )
        values ( 'Random X. Student', 'C' );
    select * from student
        join enrollment on student.id = enrollment.student
        join group on group.id = enrollment.group;

SQL: Relational Data
• JSON and XML are hierarchical
  ◦ or built from functions if you like
• SQL is relational
  ◦ relations = generalized functions
  ◦ can capture more structure
  ◦ much harder to efficiently process

SQL: Data Definition
• mandatory, unlike XML or JSON
• gives the data a rather rigid structure
• tables (relations) and columns (attributes)
• static data types for columns
• additional consistency constraints

SQL: Constraints
• help ensure consistency of the data
• foreign keys: referential integrity
  ◦ ensures there are no dangling references
  ◦ but: does not prevent accidental misuse
• unique constraints
• check constraints: arbitrary consistency checks

SQL: Query Planning
• an RDBMS makes heavy use of indexing
  ◦ using B trees, hashes and similar techniques
  ◦ indices are used automatically
• all the heavy lifting is done by the backend
  ◦ highly-optimized, low-level code
  ◦ efficient handling of large data

SQL: Reliability and Flexibility
• most RDBMS give ACID guarantees
  ◦ transparently solves a lot of problems
  ◦ basically impossible with normal files
• support for schema alterations
  ◦ alter table and similar
  ◦ nearly impossible in ad-hoc systems

SQLite
• lightweight in-process SQL engine
• the entire database is in a single file
• convenient python module, sqlite3
• stepping stone for a "real" database

Other Databases
• you can talk to most SQL DBs using python
• postgresql (psycopg2, ...)
• mysql / mariadb (mysql-python, mysql-connector, ...)
• big & expensive: Oracle (cx_oracle), DB2 (pyDB2)
• most of those are much more reliable than SQLite

SQL Injection
    sql = "SELECT * FROM t WHERE name = '" + n + "'"
• the above code is bad, never do it
• consider the following
    n = "x'; drop table students --"
    n = "x'; insert into passwd (user, pass) ..."

Avoiding SQL Injection
• use proper SQL-building APIs
  ◦ this takes care of escaping internally
• templates like insert ... values (?, ?)
  ◦ the ? get safely substituted by the module
  ◦ e.g. the execute method of a cursor

PEP 249
• informational PEP, for library writers
• describes how database modules should behave
  ◦ ideally, all SQL modules have the same interface
  ◦ makes it easy to swap a database backend
• but: SQL itself is not 100% portable

SQL Pitfalls
• sqlite does not enforce all constraints
  ◦ you need to pragma foreign_keys = on
• no portable syntax for autoincrement keys
• not all (column) types are supported everywhere
• no portable way to get the key of last insert

More Resources & Stuff to Look Up
• SQL: https://www.w3schools.com/sql/
• https://docs.python.org/3/library/sqlite3.html
• Object-Relational Mapping
• SQLAlchemy: constructing portable SQL

Part 5: Operators, Iterators and Exceptions

Callable Objects
• user-defined functions (module-level def)
• user-defined methods (instance and class)
• built-in functions and methods
• class objects
• objects with a __call__ method

User-defined Functions
• come about from a module-level def
• metadata: __doc__, __name__, __module__
• scope: __globals__, __closure__
• arguments: __defaults__, __kwdefaults__
• type annotations: __annotations__
• the code itself: __code__

Positional and Keyword Arguments
• user-defined functions have positional arguments
• and keyword arguments
  ◦ print("hello", file=sys.stderr)
  ◦ arguments are passed by name
  ◦ which style is used is up to the caller
• variadic functions: def foo(*args, **kwargs)
  ◦ args is a tuple of unmatched positional args
  ◦ kwargs is a dict of unmatched keyword args

Lambdas
• def functions must have a name
• lambdas provide anonymous functions
• the body must be an expression
• syntax: lambda x: print("hello", x)
• standard user-defined functions otherwise

Instance Methods
• comes about as object.method
  ◦ print(x.foo) → <bound method ...>
• combines the class, instance and function itself
• __func__ is a user-defined function object
• let bar = x.foo, then
  ◦ x.foo() → bar.__func__(bar.__self__)

Iterators
• objects with __next__ (since 3.x)
  ◦ iteration ends on raise StopIteration
• iterable objects provide __iter__
  ◦ sometimes, this is just return self
  ◦ any iterable can appear in for x in iterable

    class FooIter:
        def __init__(self):
            self.x = 10
        def __iter__(self): return self
        def __next__(self):
            if self.x:
                self.x -= 1
            else:
                raise StopIteration
            return self.x

Generators (PEP 255)
• written as a normal function or method
• they use yield to generate a sequence
• represented as special callable objects
  ◦ exist at the C level in CPython
    def foo(*lst):
        for i in lst:
            yield i + 1
    list(foo(1, 2)) # prints [2, 3]

yield from
• calling a generator produces a generator object
• how do we call one generator from another?
• same as for x in foo(): yield x
    def bar(*lst):
        yield from foo(*lst)
        yield from foo(*lst)
    list(bar(1, 2)) # prints [2, 3, 2, 3]

Decorators
• written as @decor before a function definition
• decor is a regular function (def decor(f))
  ◦ f is bound to the decorated function
  ◦ the decorated function becomes the result of decor
• classes can be decorated too
• you can 'create' decorators at runtime
  ◦ @mkdecor("moo") (mkdecor returns the decorator)
  ◦ you can stack decorators

    def decor(f): return lambda: print("bar")
    def mkdecor(s): return lambda g: lambda: print(s)

    @decor
    def foo(f): print("foo")

    @mkdecor("moo")
    def moo(f): print("foo")

    # foo() prints "bar", moo() prints "moo"

List Comprehension
• a concise way to build lists
• combines a filter and a map
    [ 2 * x for x in range(10) ]
    [ x for x in range(10) if x % 2 == 1 ]
    [ 2 * x for x in range(10) if x % 2 == 1 ]
    [ (x, y) for x in range(3) for y in range(2) ]

Operators
• operators are (mostly) syntactic sugar
• x < y rewrites to x.__lt__(y)
• is and is not are special
  ◦ are the operands the same object?
  ◦ also the ternary (conditional) operator

Non-Operator Builtins
• len(x) → x.__len__() (length)
• abs(x) → x.__abs__() (magnitude)
• str(x) → x.__str__() (printing)
• repr(x) → x.__repr__() (printing for eval)
• bool(x) and if x: → x.__bool__()

Arithmetic
• a standard selection of operators
• / is floating point, // is integral
• += and similar are somewhat magical
  ◦ x += y → x = x.__iadd__(y) if defined
  ◦ otherwise x = x.__add__(y)

    x = 7            # an int is immutable
    x += 3           # works, x = 10, id(x) changes
    lst = [7, 3]
    lst[0] += 3      # works too, id(lst) stays same
    tup = (7, 3)     # a tuple is immutable
    tup += (1, 1)    # still works (id changes)
    tup[0] += 3      # fails

Relational Operators
• operands can be of different types
• equality: !=, ==
  ◦ by default uses object identity
• ordering: <, <=, >, >= (TypeError by default)
• consistency is not enforced

Relational Consistency
• __eq__ must be an equivalence relation
• x.__ne__(y) must be the same as not x.__eq__(y)
• __lt__ must be an ordering relation
  ◦ compatible with __eq__
  ◦ consistent with each other
• each operator is separate (mixins can help)
  ◦ or perhaps a class decorator

Collection Operators
• in is also a membership operator (outside for)
  ◦ implemented as __contains__
• indexing and slicing operators
  ◦ del x[y] → x.__delitem__(y)
  ◦ x[y] → x.__getitem__(y)
  ◦ x[y] = z → x.__setitem__(y, z)

Conditional Operator
• also known as a ternary operator
• written x if cond else y
  ◦ in C: cond ? x : y
• forms an expression, unlike if
  ◦ can e.g. appear in a lambda
  ◦ or in function arguments, &c.
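
The relational, collection and conditional operators above can be put together in one small sketch. It uses functools.total_ordering as one possible 'class decorator' for keeping the comparison operators consistent; the classes Version and Releases and the function newer are hypothetical names invented for this illustration.

```python
from functools import total_ordering


@total_ordering                 # derives <=, > and >= from __eq__ and __lt__
class Version:
    """Hypothetical (major, minor) version number, for illustration only."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def _key(self):
        return (self.major, self.minor)

    def __eq__(self, other):
        if not isinstance(other, Version):
            return NotImplemented       # let Python try the other operand
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):                 # defining __eq__ disables the default hash
        return hash(self._key())


class Releases:
    """Hypothetical collection supporting `in`, indexing and len()."""

    def __init__(self, *versions):
        self._items = sorted(versions)  # sorting only needs __lt__

    def __contains__(self, version):    # version in releases
        return version in self._items

    def __getitem__(self, index):       # releases[index]
        return self._items[index]

    def __len__(self):                  # len(releases)
        return len(self._items)


releases = Releases(Version(1, 2), Version(0, 9), Version(1, 0))

# The conditional operator is an expression, so it fits into a lambda.
newer = lambda a, b: a if a >= b else b

print(Version(1, 0) in releases)
print(releases[0] == Version(0, 9))
print(newer(releases[0], releases[-1]) is releases[-1])
```

Deriving all comparisons from a single key tuple is a simple way to make __eq__ and __lt__ compatible with each other, which is exactly the consistency that the language itself does not enforce.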

Exceptions
• an exception interrupts normal control flow
• it's called an exception because it is exceptional
  o never mind StopIteration
• causes methods to be interrupted
  o until a matching except block is found
  o also known as stack unwinding

Life Without Exceptions
    int fd = socket( ... );
    if ( fd < 0 )
        ... /* handle errors */
    if ( bind( fd, ... ) < 0 )
        ... /* handle errors */
    if ( listen( fd, 5 ) < 0 )
        ... /* handle errors */

With Exceptions
    try:
        sock = socket.socket( ... )
        sock.bind( ... )
        sock.listen( ... )
    except ...:
        # handle errors

Exceptions vs Resources
    x = open( "file.txt" )
    # stuff
    raise SomeError
• who calls x.close()?
• this would be a resource leak

Using finally
    try:
        x = open( "file.txt" )
        # stuff
    finally:
        x.close()
• works, but tedious and error-prone

Using with
    with open( "file.txt" ) as f:
        # stuff
• with takes care of the finally and close
• with x as y sets y = x.__enter__()
  o and calls x.__exit__(...) when leaving the block

The @property decorator
• attribute syntax is the preferred one in Python
• writing useless setters and getters is boring
    class Foo:
        @property
        def x(self): return 2 * self.a
        @x.setter
        def x(self, v): self.a = v // 2

Part 6: Closures, Coroutines, Concurrency

Concurrency & Parallelism
• threading - thread-based parallelism
• multiprocessing
• concurrent - future-based programming
• subprocess
• sched, a general-purpose event scheduler
• queue, for sending objects between threads

Threading
• low-level thread support, module threading
• Thread objects represent actual threads
  o threads provide start() and join()
  o the run() method executes in a new thread
• mutexes, semaphores &c.

The Global Interpreter Lock
• memory management in CPython is not thread-safe
  o Python code runs under a global lock
  o pure Python code cannot use multiple cores
• C code usually runs without the lock
  o this includes numpy crunching

Multiprocessing
• like threading but uses processes
• works around the GIL
  o each worker process has its own interpreter
• queued/sent objects must be pickled
  o see also: the pickle module
  o this causes substantial overhead
  o functions, classes &c. are pickled by name

Futures
• like coroutine await but for subroutines
• a Future can be waited for using f.result()
• scheduled via concurrent.futures.Executor
  o Executor.map is like asyncio.gather
  o Executor.submit is like asyncio.create_task
• implemented using process or thread pools

Native Coroutines (PEP 492)
• created using async def (since Python 3.5)
• generalisation of generators
  o yield from is replaced with await
  o an __await__ magic method is required
• a coroutine can be suspended and resumed

Coroutine Scheduling
• coroutines need a scheduler
• one is available from asyncio.get_event_loop()
• along with many coroutine building blocks
• coroutines can actually run in parallel
  o via asyncio.create_task (since 3.7)
  o via asyncio.gather

Async Generators (PEP 525)
• async def + yield
• semantics like simple generators
• but also allows await
• iterated with async for
  o async for runs sequentially

Execution Stack
• made up of activation frames
• holds local variables
• and return addresses
• in dynamic languages, often lives in the heap

Variable Capture
• variables are captured lexically
• definitions are a dynamic / run-time construct
  o a nested definition is executed
  o creates a closure object
• always by reference in Python
  o but can be by-value in other languages

Using Closures
• closures can be returned, stored and called
  o they can be called multiple times, too
  o they can capture arbitrary variables
• closures naturally retain state
• this is what makes them powerful

Objects from Closures
• so closures are essentially code + state
• wait, isn't that what an object is?
• indeed, you can implement objects using closures

The Role of GC
• memory management becomes a lot more complicated
• forget C-style 'automatic' stack variables
• this is why the stack is actually in the heap
• this can go as far as forming reference cycles

Coroutines
• coroutines are a generalisation of subroutines
• they can be suspended and re-entered
• coroutines can be closures at the same time
• the code of a coroutine is like a function
• a suspended coroutine is like an activation frame

Yield
• suspends execution and 'returns' a value
• may also obtain a new value (cf. send)
• when re-entered, continues where it left off
    for i in range(5):
        yield i

Send
• with yield, we have one-way communication
• but in many cases, we would like two-way
• a suspended coroutine is an object in Python
  o with a send method which takes a value
  o send re-enters the coroutine

Yield From and Await
• yield from is mostly a generator concept
• await basically does the same thing
  o call out to another coroutine
  o when it suspends, so does the entire stack

Suspending Native Coroutines
• this is not actually possible
  o not with async-native syntax anyway
• you need a yield
  o for that, you need a generator
  o use the types.coroutine decorator

Event Loop
• not required in theory
• useful also without coroutines
• there is a synergistic effect
  o event loops make coroutines easier
  o coroutines make event loops easier

Part 7: Communication & HTTP with asyncio

Running Programs (the old way)
• os.system is about the simplest
  o also somewhat dangerous - shell injection
  o you only get the exit code
• os.popen allows you to read output of a program
  o alternatively, you can send input to the program
  o you can't do both (would likely deadlock anyway)
  o runs the command through a shell, same as os.system

Low-level Process API
• POSIX-inherited interfaces (on POSIX systems)
• os.exec: replace the current process
• os.fork: split the current process in two
• os.forkpty: same but with a PTY

Detour: bytes vs str
• strings (class str) represent text
  o that is, a sequence of Unicode code points
• files and network connections handle data
  o represented in Python as bytes
• the bytes constructor can convert from str
  o e.g. b = bytes("hello", "utf8")

Running Programs (the new way)
• you can use the subprocess module
• subprocess can handle bidirectional IO
  o it also takes care of avoiding IO deadlocks
  o set input to feed data to the subprocess
• internally, run uses a Popen object
  o if run can't do it, Popen probably can

Getting subprocess Output
• available via run (the capture_output argument was added in Python 3.7)
• the run function returns a CompletedProcess
• it has attributes stdout and stderr
• both are bytes (byte sequences) by default
• or str if text or encoding were set
• available if you enabled capture_output

Running Filters with Popen
• if you are stuck with 3.6, use Popen directly
• set stdin in the constructor to PIPE
• use the communicate method to send the input
• this gives you the outputs (as bytes)

    import subprocess
    from subprocess import PIPE
    input = bytes( "x\na\nb\ny", "utf8" )
    p = subprocess.Popen( [ "sort" ], stdin=PIPE, stdout=PIPE )
    out = p.communicate( input=input )
    # out[0] is the stdout, out[1] is None

Subprocesses with asyncio
• import asyncio.subprocess
• create_subprocess_exec, like subprocess.run
  o but it returns a Process instance
  o Process has a communicate async method
• can run things in background (via tasks)
  o also multiple processes at once

Protocol-based asyncio subprocesses
• let loop be an implementation of the asyncio event loop
• there's loop.subprocess_exec and loop.subprocess_shell
  o sets up pipes by default
• integrates into the asyncio transport layer (see later)
• allows you to obtain the data piece-wise
• https://docs.python.org/3/library/asyncio-protocol.html

Sockets
• the socket API comes from early BSD Unix
• socket represents a (possible) network connection
• sockets are more complicated than normal files
  o establishing connections is hard
  o messages get lost much more often than file data

Socket Types
• sockets can be internet or unix domain
  o internet sockets connect to other computers
  o Unix sockets live in the filesystem
• sockets can be stream or datagram
  o stream sockets are like files (TCP)
  o you can write a continuous stream of data
  o datagram sockets can send individual messages (UDP)

Sockets in Python
• the socket module is available on all major OSes
• it has a nice object-oriented API
  o failures are propagated as exceptions
  o buffer management is automatic
• useful if you need to do low-level networking
  o hard to use in non-blocking mode

Sockets and asyncio
• asyncio provides sock_* to work with socket objects
• this makes work with non-blocking sockets a lot easier
• but your program needs to be written in async style
• only use sockets when there is no other choice
  o asyncio protocols are both faster and easier to use

Hyper-Text Transfer Protocol
• originally a simple text-based, stateless protocol
• however
  o SSL/TLS, cryptography (https)
  o pipelining (somewhat stateful)
  o cookies (somewhat stateful in a different way)
• typically between client and a front-end server
• but also as a back-end protocol (web server to app server)

Request Anatomy
• request type (see below)
• header (text-based, like e-mail)
• content

Request Types
• GET - asks the server to send a resource
• HEAD - like GET but only send back headers
• POST - send data to the server

Python and HTTP
• both client and server functionality
  o import http.client
  o import http.server
• TLS/SSL wrappers are also available
  o import ssl
• synchronous by default

Serving Requests
• derive from BaseHTTPRequestHandler
• implement a do_GET method
• this gets called whenever the client does a GET
• also available: do_HEAD, do_POST, etc.
• pass the class (not an instance) to HTTPServer

Serving Requests (cont'd)
• HTTPServer creates a new instance of your Handler
• the BaseHTTPRequestHandler machinery runs
• it calls your do_GET etc. method
• request data is available in instance variables
  o self.path, self.headers

Talking to the Client
• HTTP responses start with a response code
  o self.send_response( 200, 'OK' )
• the headers follow (set at least Content-Type)
  o self.send_header( 'Connection', 'close' )
• headers and the content need to be separated
  o self.end_headers()
• finally, send the content by writing to self.wfile

Sending Content
• self.wfile is an open file
• it has a write() method which you can use
• sockets only accept byte sequences, not str
• use the bytes( string, encoding ) constructor
  o match the encoding to your Content-Type

HTTP and asyncio
• the base asyncio currently doesn't directly support HTTP
• but: you can get aiohttp from PyPI
• contains a very nice web server
  o from aiohttp import web
  o minimum boilerplate, fully asyncio-ready

Aside: The Python Package Index
• colloquially known as PyPI (or cheese shop)
  o do not confuse with PyPy (Python in almost-Python)
• both source packages and binaries
  o the latter known as wheels (PEP 427, 491)
  o previously python eggs

SSL and TLS
• you want to use the ssl module for handling HTTPS
  o this is especially true server-side
  o aiohttp and http.server are compatible
• you need to deal with certificates (loading, checking)
• this is a rather important but complex topic

Certificate Basics
• a certificate is a cryptographically signed statement
  o it ties a server to a certain public key
  o the client ensures the server knows the private key
• the server loads the certificate and its private key
• the client must validate the certificate
  o this is typically a lot harder to get right

SSL in Python
• start with import ssl
• almost everything happens in the SSLContext class
• get an instance from ssl.create_default_context()
  o you can use wrap_socket to run an SSL handshake
  o you can pass the context to aiohttp
• if httpd is a http.server.HTTPServer:
    httpd.socket = ssl.wrap_socket( httpd.socket, ... )

HTTP Clients
• there's a very basic http.client
• for a more complete library, use urllib.request
• aiohttp has client functionality
• all of the above can be used with ssl
• another 3rd party module: Python Requests

Part 8: Low-level asyncio

IO at the OS Level
• often defaults to blocking
  o read returns when data is available
  o this is usually OK for files
• but what about network code?
  o could work for a client

Threads and IO
• there may be work to do while waiting
  o waiting for IO can be wasteful
• only the calling (OS) thread is blocked
  o another thread may do the work
  o but multiple green threads may be blocked

Non-Blocking IO
• the program calls read
  o read returns immediately
  o even if there was no data
• but how do we know when to read?
  o we could poll
  o for example call read every 30ms

Polling
• trade-off between latency and throughput
  o sometimes, polling is okay
  o but is often too inefficient
• alternative: IO dispatch
  o useful when multiple IOs are pending
  o wait only if all are blocked

select
• takes a list of file descriptors
• blocks until one of them is ready
  o next read will return data immediately
• can optionally specify a timeout
• only useful for OS-level resources

Alternatives to select
• select is a rather old interface
• there are a number of more modern variants
• poll and epoll system calls
  o despite the name, they do not poll
  o epoll is more scalable
• kqueue and kevent on BSD systems

Synchronous vs Asynchronous
• the select family is synchronous
  o you call the function
  o it may wait some time
  o you proceed when it returns
• OS threads are fully asynchronous

The Thorny Issue of Disks
• a file is always 'ready' for reading
• this may still take time to complete
• there is no good solution on UNIX
• POSIX AIO exists but is sparsely supported
• OS threads are an option

IO on Windows
• select is possible (but slow)
• Windows provides real asynchronous IO
  o quite different from UNIX
  o the IO operation is directly issued
  o but the function returns immediately
• comes with a notification queue

The asyncio Event Loop
• uses the select family of syscalls
• why is it called async IO?
  o select is synchronous in principle
  o this is an implementation detail
  o the IOs are asynchronous to each other

How Does It Work
• you must use asyncio functions for IO
• an async read does not issue an OS read
• it yields back into the event loop
• the fd is put on the select list
• the coroutine is resumed when the fd is ready

Timers
• asyncio allows you to set timers
• the event loop keeps a list of those
• and uses that to set the select timeout
  o just uses the nearest timer expiry
• when a timer expires, its owner is resumed

Blocking IO vs asyncio
• all user code runs on the main thread
• you must not call any blocking IO functions
• doing so will stall the entire application
  o in a server, clients will time out
  o even if not, latency will suffer

DNS
• POSIX: getaddrinfo and getnameinfo
  o also the older API gethostbyname
• those are all blocking functions
  o and they can take a while
  o but name resolution is essential
• asyncio internally uses OS threads for DNS

Signals
• signals on UNIX are very asynchronous
• interact with OS threads in a messy way
• asyncio hides all this using C code

Native Coroutines (Reminder)
• declared using async def
    async def foo():
        await asyncio.sleep( 1 )
• calling foo() returns a suspended coroutine
• which you can await
  o or turn it into an asyncio.Task

Tasks
• asyncio.Task is a nice wrapper around coroutines
  o create with asyncio.create_task()
• can be stopped prematurely using cancel()
• has an API for asking things:
  o done() tells you if the coroutine has finished
  o result() gives you the result

Tasks and Exceptions
• what if a coroutine raises an exception?
• calling result will re-raise it
  o i.e. it continues propagating from result()
• you can also ask directly using exception()
  o returns None if the coroutine ended normally

Asynchronous Context Managers
• normally, we use with for resource acquisition
  o this internally uses the context manager protocol
• but sometimes you need to wait for a resource
  o __enter__() is a subroutine and would block
  o this won't work in async-enabled code
• we need __enter__() to be itself a coroutine

async with
• just like with but uses __aenter__(), __aexit__()
  o those are async def
• the async with behaves like an await
  o it will suspend if the context manager does
  o the coroutine which owns the resource can continue
• mainly used for locks and semaphores

Part 9: Python Pitfalls

Mixing Languages
• for many people, Python is not a first language
• some things look similar in Python and Java (C++,...)
  o sometimes they do the same thing
  o sometimes they do something very different
  o sometimes the difference is subtle

Python vs Java: Decorators
• Java has a thing called annotations
• looks very much like a Python decorator
• in Python, decorators can drastically change meaning
• in Java, they are just passive metadata
  o other code can use them for meta-programming though

Class Body Variables
    class Foo:
        some_attr = 42
• in Java/C++, this is how you create instance variables
• in Python, this creates class attributes
  o i.e. what C++/Java would call static attributes

Very Late Errors
    if a == 2:
        priiiint("a is not 2")
• no error when loading this into python
• it even works as long as a != 2
• most languages would tell you much earlier

Very Late Errors (cont'd)
    try:
        foo()
    except TyyyypeError:
        print("my mistake")
• does not even complain when running the code
• you only notice when foo() raises an exception

Late Imports
    if a == 2:
        import foo
        foo.say_hello()
• unless a == 2, foo is not loaded
• any syntax errors don't show up until a == 2
  o it may even fail to exist

Block Scope
    for i in range(10): pass
    print(i) # not a NameError
• in Python, local variables are function-scoped
• in other languages, i is confined to the loop

Assignment Pitfalls
    x = [ 1, 2 ]
    y = x
    x.append( 3 )
    print( y ) # prints [ 1, 2, 3 ]
• in Python, everything is a reference
• assignment does not make copies

Equality of Iterables
• [0, 1] == [0, 1] → True (obviously)
• range(2) == range(2) → True
• list(range(2)) == [0, 1] → True
• [0, 1] == range(2) → False

Equality of bool
• if 0: print( "yes" ) → nothing
• if 1: print( "yes" ) → yes
• False == 0 → True
• True == 1 → True
• 0 is False → False
• 1 is True → False

Equality of bool (cont'd)
• if 2: print( "yes" ) → yes
• True == 2 → False
• False == 2 → False
• if '': print( "yes" ) → nothing
• if 'x': print( "yes" ) → yes
• '' == False → False
• 'x' == True → False

Mutable Default Arguments
    def foo( x = [] ):
        x.append( 7 )
        return x
    foo() # [ 7 ]
    foo() # [ 7, 7 ]... wait, what?

Late Lexical Capture
    f = [ lambda x : i * x for i in range( 5 ) ]
    f[ 4 ]( 3 ) # 12
    f[ 0 ]( 3 ) # 12 ... ?!
    g = [ lambda x, i = i: i * x for i in range( 5 ) ]
    g[ 4 ]( 3 ) # 12
    g[ 0 ]( 3 ) # 0 ... fml
    h = [ ( lambda x : i * x )( 3 ) for i in range( 5 ) ]
    h # [ 0, 3, 6, 9, 12 ] ... i kid you not

Dictionary Iteration Order
• in python <= 3.5
  o iteration order is arbitrary (it depends on hashing)
• in python 3.6
  o CPython iterates in insertion order, but only as an implementation detail
• in python >= 3.7
  o guaranteed to iterate in insertion order

List Multiplication
    x = [ [ 1 ] * 2 ] * 3
    print( x ) # [ [ 1, 1 ], [ 1, 1 ], [ 1, 1 ] ]
    x[ 0 ][ 0 ] = 2
    print( x ) # [ [ 2, 1 ], [ 2, 1 ], [ 2, 1 ] ]

Forgotten Await
    import asyncio
    async def foo():
        print( "hello" )
    async def main():
        foo()
    asyncio.run( main() )
• gives warning: coroutine 'foo' was never awaited

Python vs Java: Closures
• captured variables are final in Java
• but they are mutable in Python
  o and of course captured by reference
• they are whatever you tell them to be in C++

Explicit super()
• Java and C++ automatically call parent constructors
• Python does not
• you have to call them yourself

Setters and Getters
    obj.attr
    obj.attr = 4
• in C++ or Java, this is an assignment
• in Python, it can run arbitrary code
  o this often makes getters/setters redundant

Part 10: Testing, Profiling

Why Testing
• reading programs is hard
• reasoning about programs is even harder
• testing is comparatively easy
• difference between an example and a proof

What is Testing
• based on trial runs
• the program is executed with some inputs
• the outputs or outcomes are checked
• almost always incomplete

Testing Levels
• unit testing
  o individual classes
  o individual functions
• functional
  o system
  o integration

Testing Automation
• manual testing
  o still widely used
  o requires a human
• semi-automated
  o requires human assistance
• fully automated
  o can run unattended

Testing Insight
• what does the test or tester know?
• black box: nothing known about internals
• gray box: limited knowledge
• white box: 'complete' knowledge

Why Unit Testing?
• allows testing small pieces of code
• the unit is likely to be used in other code
  o make sure your code works before you use it
  o the less code, the easier it is to debug
• especially easier to hit all the corner cases

Unit Tests with unittest
• from unittest import TestCase
• derive your test class from TestCase
• put test code into methods named test_*
• run with python -m unittest program.py
  o add -v for more verbose output

    from unittest import TestCase

    class TestArith(TestCase):
        def test_add(self):
            self.assertEqual(1, 4 - 3)
        def test_leq(self):
            self.assertTrue(3 <= 2 * 3)

Unit Tests with pytest
• a more pythonic alternative to unittest
  o unittest is derived from JUnit
• easier to use and less boilerplate
• you can use native python assert
• easier to run, too
  o just run pytest in your source repository

Test Auto-Discovery in pytest
• pytest finds your testcases for you
  o no need to register anything
• put your tests in test_*.py or *_test.py
• name your testcases (functions) test_*

Fixtures in pytest
• sometimes you need the same thing in many testcases
• in unittest, you have the test class
• pytest passes fixtures as parameters
  o fixtures are created by a decorator
  o they are matched based on their names

    import pytest
    import smtplib

    @pytest.fixture
    def smtp_connection():
        return smtplib.SMTP("smtp.gmail.com", 587)

    def test_ehlo(smtp_connection):
        response, msg = smtp_connection.ehlo()
        assert response == 250

Property Testing
• writing test inputs is tedious
• sometimes, we can generate them instead
• useful for general properties like
  o idempotency (e.g. serialize + deserialize)
  o invariants (output is sorted,...)
  o code does not cause exceptions

Using hypothesis
• property-based testing for Python
• has strategies to generate basic data types
  o int, str, dict, list, set,...
• compose built-in generators to get custom types
• integrated with pytest

    import hypothesis
    import hypothesis.strategies as s

    @hypothesis.given(s.lists(s.integers()))
    def test_sorted(x):
        assert sorted(x) == x # should fail

    @hypothesis.given(x=s.integers(), y=s.integers())
    def test_cancel(x, y):
        assert (x + y) - y == x # looks okay

Going Quick and Dirty
• goal: minimize time spent on testing
• manual testing usually loses
  o but it has almost 0 initial investment
• if you can write a test in 5 minutes, do it
• useful for testing small scripts

Shell 101
• shell scripts are very easy to write
• they are ideal for testing IO behaviour
• easily check for exit status: set -e
• see what is going on: set -x
• use diff -u to check expected vs actual output

Shell Test Example
    set -ex
    python script.py < test1.in | tee out
    diff -u test1.out out
    python script.py < test2.in | tee out
    diff -u test2.out out

Continuous Integration
• automated tests need to be executed
• with many tests, this gets tedious to do by hand
• CI builds and tests your project regularly
  o every time you push some commits
  o every night (e.g. more extensive tests)

CI: Travis
• runs in the cloud (CI as a service)
• trivially integrates with pytest
• virtualenv out of the box for python projects
• integrated with github
• configure in .travis.yml in your repo

CI: GitLab
• GitLab has its own CI solution (similar to travis)
• also available at FI
• runs tests when you push to your gitlab
• drop a .gitlab-ci.yml in your repository
• automatic deployment into heroku &c.

CI: Buildbot
• written in python/twisted
  o basically a framework to build a custom CI tool
• self-hosted and somewhat complicated to set up
  o more suited for complex projects
  o much more flexible than most CI tools
• distributed design

CI: Jenkins
• another self-hosted solution, this time in Java
  o widely used and well supported
• native support for python projects (including pytest)
  o provides a dashboard with test result graphs &c.
  o supports publishing sphinx-generated documentation

Print-based Debugging
• no need to be ashamed, everybody does it
• less painful in interpreted languages
• you can also use decorators for tracing
• never forget to clean your program up again

    import sys

    def debug(e):
        f = sys._getframe(1)                  # the caller's frame
        v = eval(e, f.f_globals, f.f_locals)  # evaluate in the caller's scope
        l = f.f_code.co_filename + ':'
        l += str(f.f_lineno) + ':'
        print(l, e, '=', repr(v), file=sys.stderr)

    x = 1
    debug('x + 1')

The Python Debugger
• run as python -m pdb program.py
• there's a built-in help command
• next steps through the program
• break to set a breakpoint
• cont to run until end or a breakpoint

What is Profiling
• measurement of resource consumption
• essential info for optimising programs
• answers questions about bottlenecks
  o where is my program spending most time?
  o less often: how is memory used in the program

Why Profiling
• 'blind' optimisation is often misdirected
  o it is like fixing bugs without triggering them
  o program performance is hard to reason about
• tells you exactly which point is too slow
  o allows for best speedup with least work

Profiling in Python
• provided as a library cProfile
  o alternative: profile is slower, but more flexible
• run as python -m cProfile program.py
• outputs a list of lines/functions and their cost
• use cProfile.run() to profile a single expression

    $ python -m cProfile -s time fib.py
     ncalls  tottime  percall  file:line(function)
    13638/2    0.032    0.016  fib.py:1(fib_rec)
          2    0.000    0.000  {builtins.print}
          2    0.000    0.000  fib.py:5(fib_mem)

Part 11: Linear Algebra & Symbolic Math

Numbers in Python
• recall that numbers are objects
• a tuple of real numbers has 300% overhead
  o compared to a C array of float values
  o and 350% for integers
• this causes extremely poor cache use
• integers are arbitrary-precision

Math in Python
• numeric data usually means arrays
  o this is inefficient in python
• we need a module written in C
  o but we don't want to do that ourselves
• enter the SciPy project
  o pre-made numeric and scientific packages

The SciPy Family
• numpy: data types, linear algebra
• scipy: more computational machinery
• pandas: data analysis and statistics
• matplotlib: plotting and graphing
• sympy: symbolic mathematics

Aside: External Libraries
• until now, we only used bundled packages
• for math, we will need external libraries
• you can use pip to install those
  o use pip install --user

Aside: Installing numpy
• the easiest way may be with pip
  o this would be pip3 on aisa
• linux distributions usually also have packages
• another option is getting the Anaconda bundle
• detailed instructions on https://scipy.org

Arrays in numpy
• compact, C-implemented data types
• flexible multi-dimensional arrays
• easy and efficient re-shaping
  o typically without copying the data

Entering Data
• most data is stored in numpy.array
• can be constructed from a list
  o a list of lists for 2D arrays
• or directly loaded from / stored to a file
  o binary: numpy.load, numpy.save
  o text: numpy.loadtxt, numpy.savetxt

LAPACK and BLAS
• BLAS is a low-level vector/matrix package
• LAPACK is built on top of BLAS
  o provides higher-level operations
  o tuned for modern CPUs with multiple caches
• both are written in Fortran
  o ATLAS and C-LAPACK are C implementations

Element-wise Functions
• the basic math function arsenal
• powers, roots, exponentials, logarithms
• trigonometric (sin, cos, tan,...)
• hyperbolic (sinh, cosh, tanh,...)
• cyclometric (arcsin, arccos, arctan,...)

Matrix Operations in numpy
• import numpy.linalg
• multiplication, inversion, rank
• eigenvalues and eigenvectors
• linear equation solver
• pseudo-inverses, linear least squares

Additional Linear Algebra in scipy
• import scipy.linalg
• LU, QR, polar, etc. decomposition
• matrix exponentials and logarithms
• matrix equation solvers
• special operations for banded matrices

Where is my Gaussian Elimination?
• used in lots of school linear algebra
• but not the most efficient algorithm
• a few problems with numerical stability
• not directly available in numpy

Numeric Stability
• floats are imprecise / approximate
• multiplication is not associative
• iteration amplifies the errors

    0.1**2 == 0.01            # False
    1 / ( 0.1**2 - 0.01 )     # about 5.8e17
    a = (0.1 * 0.1) * 10
    b = 0.1 * (0.1 * 10)
    1 / ( a - b )             # about 7.21e16

LU Decomposition
• decompose matrix A into simpler factors
• PA = LU where
  o P is a permutation matrix
  o L is a lower triangular matrix
  o U is an upper triangular matrix
• fast and numerically stable

Uses for LU
• equations, determinant, inversion, ...
• e.g. $\det(A) = \det(P^{-1}) \cdot \det(L) \cdot \det(U)$
  o where $\det(U) = \prod_i u_{ii}$
  o and $\det(L) = \prod_i l_{ii}$

Numeric Math
• float arithmetic is messy but incredibly fast
• measured data is approximate anyway
• stable algorithms exist for many things
  o and are available from libraries
• we often don't care about exactness
  o think computer graphics, signal analysis, ...

Symbolic Math
• numeric math sucks for 'textbook' math
• there are problems where exactness matters
  o pure math and theoretical physics
• incredibly slow computation
  o but much cleaner interpretation

Linear Algebra in sympy
• uses exact math
  o e.g. arbitrary precision rationals
  o and roots thereof
  o and many other computable numbers
• wide repertoire of functions
  o including LU, QR, etc.
decompositions

Exact Rationals in sympy

    from sympy import *
    a = QQ( 1 ) / 10   # QQ = rationals
    Matrix( [ [ sqrt( a**3 ), 0, 0 ],
              [ 0, sqrt( a**3 ), 0 ],
              [ 0, 0, 1 ] ] ).det()
    # result: 1/1000

numpy for Comparison

    import numpy as np
    import numpy.linalg as la
    a = 0.1
    la.det( [ [ np.sqrt( a**3 ), 0, 0 ],
              [ 0, np.sqrt( a**3 ), 0 ],
              [ 0, 0, 1 ] ] )
    # result: 0.0010000000000000002

General Solutions in Symbolic Math

    from sympy import *
    x = symbols( 'x' )
    Matrix( [ [ x, 0, 0 ],
              [ 0, 1, 0 ],
              [ 0, 0, x ] ] ).det()
    # result: x**2

Symbolic Differentiation

    x = symbols( 'x' )
    diff( x**2 + 2*x + log( x/2 ) )
    # result: 2*x + 2 + 1/x
    diff( x**2 * exp( x ) )
    # result: x**2 * exp( x ) + 2 * x * exp( x )

Algebraic Equations

    solve( x**2 - 7 )
    # result: [ -sqrt( 7 ), sqrt( 7 ) ]
    solve( x**2 - exp( x ) )
    # result: [ -2 * LambertW( -1/2 ) ]
    solve( x**4 - x )
    # result: [ 0, 1, -1/2 - sqrt(3) * I/2,
    #           -1/2 + sqrt(3) * I/2 ] ; I**2 = -1

Ordinary Differential Equations

    f = Function( 'f' )
    dsolve( f( x ).diff( x ) )            # f'(x) = 0
    # result: Eq( f( x ), C1 )
    dsolve( f( x ).diff( x ) - f( x ) )   # f'(x) = f(x)
    # result: Eq( f( x ), C1 * exp( x ) )
    dsolve( f( x ).diff( x ) + f( x ) )   # f'(x) = -f(x)
    # result: Eq( f( x ), C1 * exp( -x ) )

Symbolic Integration

    integrate( x**2 )
    # result: x**3 / 3
    integrate( log( x ) )
    # result: x * log( x ) - x
    integrate( cos( x ) ** 2 )
    # result: x/2 + sin( x ) * cos( x ) / 2

Numeric Sparse Matrices
• sparse = most elements are 0
• available in scipy.sparse
• special data types (not numpy arrays)
  o do not use numpy functions on those
• less general, but more compact and faster

Fourier Transform
• continuous: $\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x) \exp(-2\pi i x \xi) \, dx$
• series: $f(x) = \sum_{n=-\infty}^{\infty} c_n \exp\left(\frac{i 2\pi n x}{P}\right)$
• real series: $f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cos\left(\frac{2\pi n x}{P}\right) + b_n \sin\left(\frac{2\pi n x}{P}\right) \right)$
  o (complex) coefficients: $c_n = \frac{1}{2}(a_n - i b_n)$
  o where P is the period of f

Discrete Fourier Transform
• available in numpy.fft
• goes between time and frequency domains
• a few different variants are covered
  o real-valued input (for signals, rfft)
  o inverse transform (ifft, irfft)
  o multiple dimensions (fft2, fftn)

Polynomial Series
• the numpy.polynomial package
• Chebyshev, Hermite, Laguerre and Legendre
  o arithmetic, calculus and special-purpose operations
  o numeric integration using Gaussian quadrature
  o fitting (polynomial regression)

Part 12: Statistics

Statistics in numpy
• a basic statistical toolkit
  o averages, medians
  o variance, standard deviation
  o histograms
• random sampling and distributions

Linear Regression
• very fast model-fitting method
  o both in computational and human terms
  o quick and dirty first approximation
• widely used in data interpretation
  o biology and sociology statistics
  o finance and economics, especially prediction

Polynomial Regression
• higher-order variant of linear regression
• can capture acceleration or deceleration
• harder to use and interpret
  o also harder to compute
• usually requires a model of the data

Interpolation
• find a line or curve that approximates data
• it must pass through the data points
  o this is a major difference from regression
• more dangerous than regression
  o runs a serious risk of overfitting

Linear and Polynomial
Regression, Interpolation
• regressions using the least squares method
  o linear: numpy.linalg.lstsq
  o polynomial: numpy.polyfit
• interpolation: scipy.interpolate
  o e.g. piecewise cubic splines
  o Lagrange interpolating polynomials

Pandas: Data Analysis
• the Python equivalent of R
  o works with tabular data (CSV, SQL, Excel)
  o time series (also variable frequency)
  o primarily works with floating-point values
• partially implemented in C and Cython

Pandas Series and DataFrame
• Series is a single sequence of numbers
• DataFrame represents tabular data
  o powerful indexing operators
  o index by column -> series
  o index by condition -> filtering

Pandas Example

    import pandas as pd
    scores = [ ('Maxine', 12), ('John', 12), ('Sandra', 10) ]
    cols = [ 'name', 'score' ]
    df = pd.DataFrame( data=scores, columns=cols )
    df['score'].max()          # 12
    df[ df['score'] >= 12 ]    # Maxine and John
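
Aside: Least Squares by Hand

The least squares method behind the regression functions named earlier can be sketched for the simplest case, a straight line y = slope * x + intercept, using only the standard library. This is an illustrative sketch under that assumption, not how numpy implements numpy.linalg.lstsq; the helper name fit_line and the sample data are made up for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # These two values minimise the sum of squared residuals.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up measurements that roughly follow y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
slope, intercept = fit_line(xs, ys)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

With numpy installed, the same fit would normally be done with numpy.polyfit( xs, ys, 1 ), which returns the coefficients with the highest degree first.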