PV248 Python
Petr Ročkai

Intro

Programming vs Languages
• Python is unobtrusive (by design)
• if you can program, you can program in Python
• there are idiosyncrasies (of course)
• but you will mostly get by

Programming vs Jobs
• we all want to write beautiful programs
  − but you didn’t sleep for 2 nights
  − and this thing is going into production tomorrow
• sometimes you get a chance to clean up later
  − and sometimes you don’t

Engineering Flowchart
[Figure: flowchart with the questions “does it move?” and “should it?” (yes/no branches); the outcomes are “no problem”, “WD40” and “duct tape”]
Python makes for decent duct tape and WD40.

In This Course
• you will not learn to write beautiful programs
• we will try to do things with minimum effort
  − perfect is the enemy of good
• ugly comes in shades
  − you should always write passable code
  − there is a balance to strike

… ugly, cont’d
• there are two main schools of writing software
  − do the right thing
  − worse is better
• https://www.jwz.org/doc/worse-is-better.html

The Right Thing
• simplicity: interface first, implementation second
• correctness: required
• consistency: required
• completeness: more important than simplicity

Worse is Better
• simplicity: implementation first
• correctness: simplicity goes first
• consistency: less important than both
• completeness: least important

Design Schools
• there are pros and cons to both
• right thing is often expensive
• worse is better often wins
• which one do you think Python belongs to?

Disclaimer
• I am not a Python programmer
• please don’t ask sneaky language-lawyer questions

Goals
• learn to use Python in practical situations
• have a look at existing packages and what they can do
• code up some cool stuff, have fun

Organisation
• there are 2 standard seminar groups
  − attendance is compulsory (minus 2 absences)
  − one virtual work-at-home group
• the lecture and seminars on 2.10. are cancelled

Coursework
• there will be a set of exercises each week
• you should mostly do these within the seminar
• please make a public git (or hg) repository
  − we are all adults here – do not copy
  − I will collect the repository addresses

Exercise Grading
• exercises are binary: pass or fail
• you will get 4 chances on each to get it right
• failing is the same as missing the deadline

Exercise Deadlines
• 7 days, worth 2 points
• 14 days, worth 1.5 points
• Monday 17.12., worth 1.25 points
• Tuesday 12.2., worth 1 point

Passing the Course
• you can get
  − 24 points for exercises
  − 4 points for seminar attendance
  − 4 points for a small project
• you need 20 points to pass

Stuff We Could Try
• working with text, regular expressions
• plotting stuff with bokeh or matplotlib
• talking to SQL databases
• talking to HTTP servers
• being an HTTP server
• implementing a JSON-based REST API
• parsing YAML and/or JSON data
• … (suggestions welcome)

Some Resources
• https://docs.python.org/3/ (obviously)
• https://msivak.fedorapeople.org/python/
• study materials in IS
• help()
• Google, Stack Overflow, …

Part 1: Text & Regular Expressions

Repository Structure
• create a directory for each week
• name them 01-text and so on
  − the -text doesn’t really matter
  − scripts will be looking
for 01*
• program names must be exactly as specified

Reading Input
• opening files: open('scorelib.txt', 'r')
• files can be iterated
    f = open( 'scorelib.txt', 'r' )
    for line in f:
        print(line)

Regular Expressions
• compiling: r = re.compile( r"Composer: (.*)" )
• matching: m = r.match( "Composer: Bach, J. S." )
• extracting captures: print(m.group(1))
  − prints Bach, J. S.
• substitutions: s2 = re.sub( r"\s*$", '', s1 )
  − strips all trailing whitespace in s1

Other String Operations
• better whitespace stripping: s2 = s1.strip()
• splitting: str.split(';')

Dictionaries
• associative arrays: map (e.g.) strings to numbers
• nice syntax: dict = { 'foo': 1, 'bar': 3 }
• nice & easy to work with
• can be iterated: for k, v in dict.items()

Counters
• from collections import Counter
• like a dictionary, but the default value is 0
• ctr = Counter()
• compare ctr['baz'] += 1 with dict

Command Line
• we will often need to process command arguments
• in Python, those are available in the sys module
• import sys
• arguments are in sys.argv (a list)

Exercise 1: Input
• get yourself a git/mercurial/darcs repository
• grab input data (scorelib.txt) from study materials
• read and process the text file
• use regular expressions to extract data
• use dictionaries to collect stats
• beware! hand-written, somewhat irregular data

Exercise 1: Output
• print some interesting statistics
  − how many pieces by each composer?
  − how many pieces composed in a given century?
  − how many in the key of c minor?
• bonus if you are bored: searching
  − list all pieces in a given key
  − list pieces featuring a given instrument (say, bassoon)

Exercise 1: Invocation
• ./stat.py ./scorelib.txt composer
• ./stat.py ./scorelib.txt century

Exercise 1: Example Output
• Telemann, G. P.: 68
• Bach, J. S.: 79
• Bach, J. C.: 6
• …
For centuries:
• 16th century: 3
• 17th century: 11
• 18th century: 32

Cheat Sheet
    for line in open('file', 'r')      read lines
    dict = {}                          an empty dictionary
    dict[key] = value                  set a value in a dictionary
    r = re.compile(r"(.*):")           compile a regexp
    m = r.match("foo: bar")            match a string
    if m is None: continue             match failed, loop again
    print(m.group(1))                  extract a capture
    for k, v in dict.items()           iterate a dictionary
    print("%d, %d" % (12, 1337))       print some numbers

Part 2: Objects and Classes

Objects
• the basic “unit” of OOP
• they bundle data and behaviour
• provide encapsulation
• make code re-use easier
• also known as “instances”

Classes
• templates for objects (class Foo: pass)
• each (python) object belongs to a class
• classes themselves are also objects
• calling a class creates an instance
  − my_foo = Foo()

Poking at Classes
• {}.__class__
• {}.__class__.__class__
• (0).__class__
• [].__class__
• compare type(0), etc.
• n = numbers.Number(); n.__class__

Types vs Objects
• class system is a type system
• “duck typing”: quacks, walks like a duck
• since python 3, types are classes
• everything is dynamic in python
  − you can create new classes at runtime
  − you can pass classes as function parameters

Encapsulation
• objects hide implementation details
• classic types structure data
  − objects also structure behaviour
• facilitates weak coupling

Weak Coupling
• coupling is a degree of interdependence
• more coupling makes things hard to change
  − it also makes reasoning harder
• good programs are weakly coupled
• cf. modularity, composability

Polymorphism
• objects are (at least in Python) polymorphic
• different implementation, same interface
• only the interface matters for composition
• facilitates genericity and code re-use
• cf. “duck typing”

Generic Programming
• code re-use often saves time
  − not just coding but also debugging
  − re-usable code often couples weakly
• but not everything that can be re-used should be
  − code can be too generic
  − and too hard to read

Attributes
• data members of objects
• each instance gets its own copy
• like variables scoped to object lifetime
• they get names and values

Methods
• functions (procedures) tied to objects
• they can access the object (self)
• implement the behaviour of the object
• their signatures (usually) provide the interface
• methods are also objects

Class and Instance Methods
• methods are usually tied to instances
• recall that classes are also objects
• class methods work on the class (cls)
• static methods are just namespaced functions
• decorators @classmethod, @staticmethod

Inheritance
[Figure: class hierarchy diagram with the nodes shape, ellipse, rectangle, square and circle]
• class Ellipse( Shape ): ...
• usually encodes an is-a relationship

Multiple Inheritance
• more than one base class is possible
• many languages restrict this
• python allows general M-I
  − class Bat( Mammal, Winged ): pass
• ‘true’ M-I is somewhat rare
  − typical use cases: mixins and interfaces

Mixins
• used to pull in implementation
  − not part of the is-a relationship
  − by convention, not enforced by the language
• common bits of functionality
  − e.g. implement __gt__, __eq__ &c. using __lt__
  − you only need to implement __lt__ in your class

Interfaces
• realized as “abstract” classes in python
  − just raise a NotImplementedError exception
  − document the intent in a docstring
• participates in is-a relationships
• partially displaced by duck typing
  − more important in other languages (think Java)

Composition
• attributes of objects can be other objects
  − (also, everything is an object in python)
• encodes a has-a relationship
  − a circle has a center and a radius
  − a circle is a shape

Constructors
• this is the __init__ method
• initializes the attributes of the instance
• can call superclass constructors explicitly
  − not called automatically (unlike C++, Java)
  − MySuperClass.__init__( self )
  − super().__init__ (if unambiguous)

Class and Object Dictionaries
• most objects are basically dictionaries
• try e.g.
foo.__dict__ (for a suitable foo)
• saying foo.x means foo.__dict__["x"]
  − if that fails, type(foo).__dict__["x"] follows
  − then superclasses of type(foo), according to MRO

Writing Classes
    class Person:
        def __init__( self, name ):
            self.name = name
        def greet( self ):
            print( "hello " + self.name )

    p = Person( "you" )
    p.greet()

Modules in Python
• modules are just normal .py files
• import executes a file by name
  − it will look into system-defined locations
  − the search path includes the current directory
  − they typically only define classes & functions
• import sys → lets you use sys.argv
• from sys import argv → you can write just argv

Functions
• top-level functions/procedures are possible
• they are usually ‘scoped’ via the module system
• functions are also objects
  − try print.__class__ (or type(print))
• some functions are built in (print, len, …)

Exercise 2: Objects
• create a class hierarchy for printed scores
• define (at least) the following classes
  − Print, Edition, Composition, Voice, Person
• define suitable constructors (__init__)
• you can use additional helper classes

Prints, Editions & Compositions
• printed score belongs to an edition
• an edition has an author (an editor)
• an edition is of a particular composition
• the composition has an author (composer)
• both editors and composers are people

Voices
• compositions can have multiple voices
• each voice has a range and a name (instrument)
• one or both may be unknown
• ranges are written using a double dash (--)

The Print class
• attributes
  − edition (instance of Edition)
  − print_id (integer, from Print Number:)
  − partiture (boolean)
• method format()
  − reconstructs and prints the original stanza
• method composition() (=
edition.composition)

The Edition class
• attributes
  − composition (instance of Composition)
  − authors (a list of Person instances)
  − name (a string, from the Edition: field, or None)

The Composition class
• attributes
  − name, incipit, key and genre (strings or None)
  − year (integer if an integral year is given or None)
  − voices (a list of Voice instances)
  − authors (a list of Person instances)

Voice and Person
• Voice attributes
  − name, range (strings or None)
• Person attributes
  − name (string)
  − born, died (integers or None)

Exercise 2: Parsing
• write a load(filename) function that reads the text
  − this will be the same scorelib.txt as before
• the function returns a list of Print instances
• the list should be sorted by the print number (print_id)

Exercise 2: Module
• the classes should live in scorelib.py
• add a simple test script, test.py
  − this will take a single filename
  − invocation: ./test.py scorelib.txt
  − run load() on that filename
  − call format() on each Print, add empty lines

Part 3: Persistent Data

Transient Data
• lives in program memory
• data structures, objects
• interpreter state
• often implicit manipulation
• more on this next week

Persistent Data
• (structured) text or binary files
• relational (SQL) databases
• object and ‘flat’ databases (NoSQL)
• manipulated explicitly

Persistent Storage
• ‘local’ file system
  − stored on HDD, SSD, …
  − stored somewhere in a local network
• ‘remote’, using an application-level protocol
  − local or remote databases
  − cloud storage &c.
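
The difference between transient and persistent data described above can be sketched in a few lines: the dictionary lives only in program memory, while the text file written from it has to be created and read back explicitly. This is only an illustration; the file name notes.txt, the temporary directory and the sample record are made up for the example.

```python
import os
import tempfile

# Transient data: an ordinary object in program memory (made-up sample).
record = {"composer": "Bach, J. S.", "key": "g"}

# Persistent data is manipulated explicitly: open, write, close.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")  # hypothetical file name
with open(path, "w") as f:
    for k, v in record.items():
        f.write("%s: %s\n" % (k, v))

# Forget the in-memory copy; only the file remains.
del record

# Reading the data back is again an explicit step.
restored = {}
with open(path, "r") as f:
    for line in f:
        k, v = line.rstrip("\n").split(": ", 1)
        restored[k] = v

print(restored)
```

The ad-hoc "key: value" format used here is fragile (it breaks on values containing newlines, for instance), which is one reason to prefer an established format and library.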

JSON
• structured, text-based data format
• atoms: integers, strings, booleans
• objects (dictionaries), arrays (lists)
• widely used around the web &c.
• simple (compared to XML or YAML)

JSON: Example
    {
        "composer": [ "Bach, Johann Sebastian" ],
        "key": "g",
        "voices": {
            "1": "oboe",
            "2": "bassoon"
        }
    }

JSON: Writing
• printing JSON seems straightforward enough
• but: double quotes in strings
• strings must be properly \-escaped during output
• also pesky commas
• keeping track of indentation for human readability
• better use an existing library: import json

JSON in Python
• json.dumps = short for dump to string
• python dict/list/str/… data comes in
• a string with valid JSON comes out

Workflow
• just convert everything to dicts and lists
• run json.dumps or json.dump( data, file )

Python Example
    import json
    import sys

    d = {}
    d["composer"] = ["Bach, Johann Sebastian"]
    d["key"] = "g"
    d["voices"] = { 1: "oboe", 2: "bassoon" }
    json.dump( d, sys.stdout, indent=4 )
Beware: keys are always strings in JSON

Parsing JSON
• import json
• json.load is the counterpart to json.dump from above
  − de-serialise data from an open file
  − builds lists, dictionaries, etc.
• json.loads corresponds to json.dumps

XML
• meant as a lightweight and consistent redesign of SGML
  − turned into a very complex format
• heaps of invalid XML floating around
  − parsing real-world XML is a nightmare
  − even valid XML is pretty challenging

XML Features
• offers extensible, rich structure
  − tags, attributes, entities
  − suited for structured hierarchical data
• schemas: use XML to describe XML
  − allows general-purpose validators
  − self-documenting to a degree

XML vs JSON
• both work best with trees
• JSON has basically no features
  − basic data structures and that’s it
• JSON data is ad-hoc and usually undocumented
  − but: this often happens with XML anyway

NoSQL / Non-relational Databases
• umbrella term for a number of approaches
  − flat key/value and column stores
  − document and graph stores
• no or minimal schemas
• non-standard query languages

Key-Value Stores
• usually very fast and very simple
• completely unstructured values
• keys are often database-global
  − workaround: prefixes for namespacing
  − or: multiple databases

NoSQL & Python
• redis (redis-py) module (Redis is Key-Value)
• memcached (another Key-Value store)
• PyMongo for talking to MongoDB (document-oriented)
• CouchDB (another document-oriented store)
• neo4j or cayley (module pyley) for graph structures

SQL and RDBMS
• SQL = Structured Query Language
• RDBMS = Relational DataBase Management System
• SQL is to NoSQL what XML is to JSON
• heavily used and extremely reliable

SQLite
• lightweight in-process SQL engine
• the entire database is in a single file
• convenient python module, sqlite3
• stepping stone for a “real” database

Other Databases
• you can talk to most SQL
DBs using python
• postgresql (psycopg2, …)
• mysql / mariadb (mysql-python, mysql-connector, …)
• big & expensive: Oracle (cx_oracle), DB2 (pyDB2)
• most of those are much more reliable than SQLite

SQL Injection
    sql = "SELECT * FROM t WHERE name = '" + n + "'"
• the above code is bad, never do it
• consider the following
    n = "x'; drop table students --"
    n = "x'; insert into passwd (user, pass) ..."

Avoiding SQL Injection
• use proper SQL-building APIs
  − this takes care of escaping internally
• templates like insert ... values (?, ?)
  − the ? get safely substituted by the module
  − e.g. the execute method of a cursor

Aside: PEP
• PEP stands for Python Enhancement Proposal
• akin to RFC documents managed by IETF
• initially formalise future changes to Python
  − later serve as documentation for the same
• https://www.python.org/dev/peps/

PEP 249
• informational PEP, for library writers
• describes how database modules should behave
  − ideally, all SQL modules have the same interface
  − makes it easy to swap a database backend
• but: SQL itself is not 100% portable

SQL Pitfalls
• sqlite does not enforce all constraints
• no portable syntax for autoincrement keys
• not all (column) types are supported everywhere
• no portable way to get the key of last insert

More Resources & Stuff to Look Up
• SQL: https://www.w3schools.com/sql/
• https://docs.python.org/3/library/sqlite3.html
• Object-Relational Mapping
• SQLAlchemy: constructing portable SQL

Exercise 3: Importing Data
• create an empty scorelib.dat from scorelib.sql
• start by importing composers & editors into the database
  − then continue with scores &c.
• use the classes from the previous exercise
  − you can copy & extend them
  − you can also use inheritance or composition

Exercise 3: Database Structure
• defined in scorelib.sql (see study materials)
• test with: sqlite3 scorelib.dat < scorelib.sql
• you can rm scorelib.dat any time to start over
• consult comments in scorelib.sql
• do not store duplicate rows

Exercise 3: Requirements
• the structure in scorelib.sql is compulsory
• you must use SQLite 3
• parsing proceeds using rules from exercise 2
• each row in each table must be unique
  − special rules for people, see next slide

Exercise 3: Storing People
• the name alone must be unique
• merge born and died fields
  − NULL iff it is None in all instances
  − resolve conflicts arbitrarily

Exercise 3: Invocation
• the script should be called import.py
• ./import.py scorelib.txt scorelib.dat
• first argument is the input text file
• second argument is the output SQLite file
  − assume that this file does not exist
  − the script must also set up the schema

SQL Cheat Sheet
• INSERT INTO table (c1, c2) VALUES (v1, v2)
• SELECT c1, c2 FROM table WHERE c1 = "foo"

sqlite3 Cheats
• conn = sqlite3.connect( "scorelib.dat" )
• cur = conn.cursor()
• cur.execute( "...
values (?, ?)", (foo, bar) )
• conn.commit() (don’t forget to do this)

Part 4: Memory (Data) Model

Memory
• most program data is stored in ‘memory’
  − an array of byte-addressable data storage
  − address space managed by the OS
  − 32 or 64 bit numbers as addresses
• typically backed by RAM

Language vs Computer
• programs use high-level concepts
  − objects, procedures, closures
  − values can be passed around
• the computer has a single array of bytes
  − and, well, a bunch of registers

Memory Management
• deciding where to store data
• high-level objects are stored in flat memory
  − they have a given (usually fixed) size
  − can contain references to other objects
  − have limited lifespan

Memory Management Terminology
• object: an entity with an address and size
  − not the same as language-level object
• lifetime: when is the object valid
  − live: references exist to the object
  − dead: the object is unreachable – garbage

Memory Management by Type
• manual: malloc and free in C
• static automatic
  − e.g. stack variables in C and C++
• dynamic automatic
  − pioneered by LISP, widely used

Automatic Memory Management
• static vs dynamic
  − when do we make decisions about lifetime
  − compile time vs run time
• safe vs unsafe
  − can the program read unused memory?
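
The live/dead terminology above can be observed from Python with a weak reference, which does not keep its target alive. This is only a sketch: the class Node is a made-up example, and the explicit gc.collect() call is there so that the result does not depend on how promptly a particular interpreter reclaims garbage.

```python
import gc
import weakref

class Node:                  # hypothetical example class
    pass

obj = Node()                 # 'obj' is a reference, so the object is live
probe = weakref.ref(obj)     # a weak reference does not extend the lifetime
alive_before = probe() is obj

del obj                      # drop the only strong reference: object is dead
gc.collect()                 # make sure the garbage has been reclaimed
alive_after = probe() is not None

print(alive_before, alive_after)
```

Calling probe() returns the object while it is live and None once it has been reclaimed, so the program never gets to read the unused memory.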

Object Lifetime
• the time between malloc and free
• another view: when is the object needed
  − often impossible to tell
  − can be safely over-approximated
  − at the expense of memory leaks

Static Automatic
• usually binds lifetime to lexical scope
• no passing references up the call stack
  − may or may not be enforced
• no lexical closures

Dynamic Automatic
• over-approximate lifetime dynamically
• usually easiest for the programmer
  − until you need to debug a space leak
• reference counting, mark & sweep collectors

Reference Counting
• attach a counter to each object
• whenever a reference is made, increase
• whenever a reference is lost, decrease
• the object is dead when the counter hits 0
• fails to reclaim reference cycles

Mark and Sweep
• start from a root set (in-scope variables)
• follow references, mark every object encountered
• throw away all unmarked memory
• usually stops the program while running
• garbage is retained until the GC runs

Memory Management in CPython
• primarily based on reference counting
• optional mark & sweep collector
  − enabled by default
  − configure via import gc

Refcounting Advantages
• simple to implement in a ‘managed’ language
• reclaims objects quickly
• no need to pause the program
• easily made concurrent

Refcounting Problems
• significant memory overhead
• problems with cache locality
• bad performance for data shared between threads
• fails to reclaim cyclic structures

Data Structures
• an abstract description of data
• leaves out low-level details
• makes writing programs easier
• makes reading programs easier, too

Building
Data Structures
• there are two types in Python
  − built-in, implemented in C
  − user-defined (includes libraries)
• both types are based on objects
  − but built-ins only look that way

Mutability
• some objects can be modified
  − we say they are mutable
  − otherwise, they are immutable
• immutability is an abstraction
  − physical memory is always mutable
• in Python, immutability is not ‘recursive’

Built-in: int
• arbitrary precision integer
  − no overflows and other nasty behaviour
• it is an object, i.e. held by reference
  − uniform with any other kind of object
  − immutable
• both of the above make it slow
  − machine integers only in C-based modules

Additional Numeric Objects
• bool: True or False
  − how much is True + True?
  − is 0 true? is empty string?
• numbers.Real (float): floating point numbers
• numbers.Complex (complex): a pair of above

Built-in: bytes
• a sequence of bytes (raw data)
• exists for efficiency reasons
  − in the abstract, it is just a tuple
• models data as stored in files
  − or incoming through a socket
  − or as stored in raw memory

Properties of bytes
• can be indexed and iterated
  − both create objects of type int
  − try this sequence: id(x[1]), id(x[2])
• mutable version: bytearray
  − the equivalent of C char arrays

Built-in: str
• immutable unicode strings
  − not the same as bytes
  − bytes must be decoded to obtain str
  − (and str encoded to obtain bytes)
• stored compactly in CPython, using 1, 2 or 4 bytes per character (PEP 393)
  − implemented in PyCompactUnicodeObject

Built-in: tuple
• an immutable sequence type
  − the number of elements is fixed
  − so is the type of each element
• but elements themselves may be mutable
  − x = [] then y = (x, 0)
  − x.append(1) → y == ([1], 0)
• implemented as a C array of object references
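
A short sketch of the points above: indexing bytes gives an int, str and bytes are converted by decoding and encoding, and a tuple stays fixed while a list stored inside it can still change. The sample values are arbitrary.

```python
raw = b"abc"
first = raw[0]                   # indexing bytes yields an int
text = raw.decode("utf-8")       # bytes -> str
back = text.encode("utf-8")      # str -> bytes

x = []
y = (x, 0)                       # the tuple holds a reference to the list x
x.append(1)                      # this mutates the list, not the tuple

try:
    y[1] = 5                     # the tuple itself cannot be changed
    assignable = True
except TypeError:
    assignable = False

print(first, text, back == raw, y, assignable)
```

This is what ‘immutability is not recursive’ means in practice: y always refers to the same two objects, but one of them is a list whose content may change.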

Built-in: list
• a mutable version of tuple
  − items can be assigned x[3] = 5
  − items can be append-ed
• implemented as a dynamic array
  − many operations are amortised 𝑂(1)
  − insert is 𝑂(𝑛)

Built-in: dict
• implemented as a hash table
• some of the most performance-critical code
  − dictionaries appear everywhere in Python
  − heavily hand-tuned C code
• both keys and values are objects

Hashes and Mutability
• dictionary keys must be hashable
  − this implies recursive immutability
• what would happen if a key is mutated?
  − most likely, the hash would change
  − all hash tables with the key become invalid
  − this would be very expensive to fix

Built-in: set
• implements the math concept of a set
• also a hash table, but with keys only
  − a separate C implementation
• mutable – items can be added
  − but they must be hashable
  − hence cannot be changed

Built-in: frozenset
• an immutable version of set
• always hashable (since all items must be)
  − can appear in set or another frozenset
  − can be used as a key in dict
• the C implementation is shared with set

Efficient Objects: __slots__
• fixes the attribute names allowed in an object
• saves memory: consider 1-attribute object
  − with __dict__: 56 + 112 bytes
  − with __slots__: 48 bytes
• makes code faster: no need to hash anything
  − more compact in memory → better cache efficiency

Exercise 4: Preliminaries
• pull data from scorelib.dat using SQL
• print the results as (nicely formatted) JSON
• invocation: ./search.py Bach
  − the scorelib.dat will not be your own
  − you must not use the text data

Exercise 4: Part 1
• write a script getprint.py
  − the input is a print number (argument)
  − the output is a list of composers
(stdout)
• each composer is a dictionary
• name, born and died

Exercise 4: Part 1 Output
    $ ./getprint.py 645
    [
        { "name": "Graupner, Christoph", "born": 1683, "died": 1760 },
        { "name": "Grünewald, Gottfried" }
    ]

Exercise 4: Part 1 Hints
• you will need to use SQL joins
• select ... from person join score_author on person.id = score_author.composer ... where print.id = ?
• the result of cursor.execute is iterable

Exercise 4: Part 2
• write a script search.py
• the input is a composer name substring
• the output is a list of all matching composer names
  − along with all their prints in the database
• hint: ... where person.name like "%Bach%"

Exercise 4: Part 2 Output
    $ ./search.py Bach
    {
        "Bach, Johann Sebastian": [
            { "Print Number": 111, "Title": "Konzert für ...", ... },
            { "Print Number": 139, ... },
            ...
        ],
        "Bach, Johann Christian": ...,
        ...
    }

Part 5: Numeric Data

Numbers in Python
• recall that numbers are objects
• a tuple of real numbers has 300% overhead
  − compared to a C array of float values
  − and 350% for integers
• this causes extremely poor cache use
• integers are arbitrary-precision

Math in Python
• numeric data usually means arrays
  − this is inefficient in python
• we need a module written in C
  − but we don’t want to do that ourselves
• enter the SciPy project
  − pre-made numeric and scientific packages

The SciPy Family
• numpy: data types, linear algebra
• scipy: more computational machinery
• pandas: data analysis and statistics
• matplotlib: plotting and graphing
• sympy: symbolic mathematics

Aside: External Libraries
• until now, we only used bundled packages
• for math, we will need external libraries
• you can use pip to install those
  − use pip install --user

Aside: The Python Package Index
• colloquially known as PyPI (or cheese shop)
  − do not confuse with PyPy (Python in almost-Python)
• both source packages and binaries
  − the latter known as wheels (PEP 427, 491)
  − previously python eggs
• https://pypi.python.org

Aside: Installing numpy
• the easiest way may be with pip
  − this would be pip3 on aisa
• linux distributions usually also have packages
• another option is getting the Anaconda bundle
• detailed instructions on https://scipy.org

Arrays in numpy
• compact, C-implemented data types
• flexible multi-dimensional arrays
• easy and efficient re-shaping
  − typically without copying the data

Entering Data
• most data is stored in numpy.array
• can be constructed from a list
  − a list of lists for 2D arrays
• or directly loaded from / stored to a file
  − binary: numpy.load,
numpy.save
  − text: numpy.loadtxt, numpy.savetxt

LAPACK and BLAS
• BLAS is a low-level vector/matrix package
• LAPACK is built on top of BLAS
  − provides higher-level operations
  − tuned for modern CPUs with multiple caches
• both are written in Fortran
  − ATLAS and C-LAPACK are C implementations

Element-wise Functions
• the basic math function arsenal
• powers, roots, exponentials, logarithms
• trigonometric (sin, cos, tan, …)
• hyperbolic (sinh, cosh, tanh, …)
• cyclometric (arcsin, arccos, arctan, …)

Matrix Operations in numpy
• import numpy.linalg
• multiplication, inversion, rank
• eigenvalues and eigenvectors
• linear equation solver
• pseudo-inverses, linear least squares

Additional Linear Algebra in scipy
• import scipy.linalg
• LU, QR, polar, etc. decomposition
• matrix exponentials and logarithms
• matrix equation solvers
• special operations for banded matrices

Sparse Matrices
• sparse = most elements are 0
• available in scipy.sparse
• special data types (not numpy arrays)
  − do not use numpy functions on those
• less general, but more compact and faster

Discrete Fourier Transform
• available in numpy.fft
• goes between time and frequency domains
• a few different variants are covered
  − real-valued input (for signals, rfft)
  − inverse transform (ifft, irfft)
  − multiple dimensions (fft2, fftn)

Polynomial Series
• useful in differential problems and functional analysis
• the numpy.polynomial package
• Chebyshev, Hermite, Laguerre and Legendre
• arithmetic, calculus and special-purpose operations

Statistics in numpy
• a basic statistical toolkit
  − averages, medians
  − variance, standard deviation
  − histograms
• random sampling and distributions

Linear and
Polynomial Regression, Interpolation
• regressions using the least squares method
  − linear: numpy.linalg.lstsq
  − polynomial: numpy.polyfit
• interpolation: scipy.interpolate
  − e.g. piecewise cubic splines
  − Lagrange interpolating polynomials

Pandas: Data Analysis
• the Python equivalent of R
  − works with tabular data (CSV, SQL, Excel)
  − time series (also variable frequency)
  − primarily works with floating-point values
• partially implemented in C and Cython

Pandas Series and DataFrame
• Series is a single sequence of numbers
• DataFrame represents tabular data
  − powerful indexing operators
  − index by column → series
  − index by condition → filtering

Pandas Example
    import pandas as pd
    scores = [ ('Maxine', 12), ('John', 12), ('Sandra', 10) ]
    cols = [ 'name', 'score' ]
    df = pd.DataFrame( data=scores, columns=cols )
    df['score'].max() # 12
    df[ df['score'] >= 12 ] # Maxine and John

Exercise 5: Warm-Up 1
• create a matrix from a list of lists
• compute and print (to stdout)
  − rank and determinant
  − inverse (if applicable)
• all operations are in numpy.linalg

Exercise 5: Warm-Up 2
• a simple non-homogeneous linear equation solver
• put the coefficients in a list of lists
• put the constants in a list of numbers
• use linalg.solve from numpy
• make sure you understand what is going on

Exercise 5: Intro
• ‘nice’ equations, invocation: ./eqn.py input.txt
• parse a human-readable system of equations
• variables → single letters, coefficients → integers
• only + and − are allowed
• print the solution to stdout (using variable names)

Exercise 5: Unique Solution
• decide whether a unique solution exists
• if so, print the solution
    2x + 3y = 5
    x - y = 0
    solution: x = 1, y = 1

Exercise 5: No Solution
• print no solution if the system is
inconsistent
    x + y = 4
    x + y = 5
    no solution

Exercise 5: Multiple Solutions
• it may also be under-determined
• only print the dimension of the solution space
    x + y - z = 0
    x = 0
    solution space dimension: 1

Exercise 5: Details
• the right hand side is always a constant
  − and is the only constant term
• print the solution/result to stdout
  − solutions come in alphabetical order
• there are spaces around operators and =
  − no space between a coefficient and a variable

Exercise 5: Hints
• linalg.solve assumes a unique solution
  − you can use Rouché-Capelli to check
• you can obtain a rank with linalg.matrix_rank

Part 6: Advanced Constructs

Callable Objects
• user-defined functions (module-level def)
• user-defined methods (instance and class)
• built-in functions and methods
• class objects
• objects with a __call__ method

User-defined Functions
• come about from a module-level def
• metadata: __doc__, __name__, __module__
• scope: __globals__, __closure__
• arguments: __defaults__, __kwdefaults__
• type annotations: __annotations__
• the code itself: __code__

Positional and Keyword Arguments
• user-defined functions have positional arguments
• and keyword arguments
  − print("hello", file=sys.stderr)
  − arguments are passed by name
  − which style is used is up to the caller
• variadic functions: def foo(*args, **kwargs)
  − args is a tuple of unmatched positional args
  − kwargs is a dict of unmatched keyword args

Lambdas
• def functions must have a name
• lambdas provide anonymous functions
• the body must be an expression
• syntax: lambda x: print("hello", x)
• standard user-defined functions otherwise

Instance Methods
• come about as
object.method
  − print(x.foo) → <bound method …>
• combines the class, instance and function itself
• __func__ is a user-defined function object
• let bar = x.foo, then
  − x.foo() → bar.__func__(bar.__self__)

Iterators
• objects with __next__ (since 3.x)
  − iteration ends on raise StopIteration
• iterable objects provide __iter__
  − sometimes, this is just return self
  − any iterable can appear in for x in iterable

    class FooIter:
        def __init__(self):
            self.x = 10
        def __iter__(self): return self
        def __next__(self):
            if self.x:
                self.x -= 1
            else:
                raise StopIteration
            return self.x

Generators (PEP 255)
• written as a normal function or method
• they use yield to generate a sequence
• represented as special callable objects
  − exist at the C level in CPython

    def foo(*lst):
        for i in lst: yield i + 1
    list(foo(1, 2)) # [2, 3]

yield from
• calling a generator produces a generator object
• how do we call one generator from another?
• same as for x in foo(): yield x

    def bar(*lst):
        yield from foo(*lst)
        yield from foo(*lst)
    list(bar(1, 2)) # [2, 3, 2, 3]

Native Coroutines (PEP 492)
• created using async def (since Python 3.5)
• generalisation of generators
  − yield from is replaced with await
  − an __await__ magic method is required
• a coroutine can be suspended and resumed

Coroutine Scheduling
• coroutines need a scheduler
• one is available from asyncio.get_event_loop()
• along with many coroutine building blocks
• coroutines can actually run concurrently
  − via asyncio.create_task (since 3.7)
  − via asyncio.gather

Async Generators (PEP 525)
• async def + yield
• semantics like simple generators
• but also allows await
• iterated with async for
  − async for runs sequentially

Decorators
• written as @decor before a function definition
• decor is a regular function (def decor(f))
  − f is bound to the decorated function
  − the decorated function becomes the result of decor
• classes can be decorated too
• you can ‘create’ decorators at runtime
  − @mkdecor("moo") (mkdecor returns the decorator)
  − you can stack decorators

    def decor(f): return lambda: print("bar")
    def mkdecor(s): return lambda g: lambda: print(s)

    @decor
    def foo(f): print("foo")

    @mkdecor("moo")
    def moo(f): print("foo")

    # foo() prints "bar", moo() prints "moo"

List Comprehension
• a concise way to build lists
• combines a filter and a map

    [ 2 * x for x in range(10) ]
    [ x for x in range(10) if x % 2 == 1 ]
    [ 2 * x for x in range(10) if x % 2 == 1 ]
    [ (x, y) for x in range(3) for y in range(2) ]

Operators
• operators are (mostly) syntactic sugar
• x < y rewrites to x.__lt__(y)
• is and is not are special
  − are the operands the same object?
• also the ternary (conditional) operator

Non-Operator Builtins
• len(x) → x.__len__() (length)
• abs(x) → x.__abs__() (magnitude)
• str(x) → x.__str__() (printing)
• repr(x) → x.__repr__() (printing for eval)
• bool(x) and if x: → x.__bool__()

Arithmetic
• a standard selection of operators
• / is floating point, // is integral
• += and similar are somewhat magical
  − x += y → x = x.__iadd__(y) if defined
  − otherwise x = x.__add__(y)

    x = 7 # an int is immutable
    x += 3 # works, x = 10, id(x) changes
    lst = [7, 3]
    lst[0] += 3 # works too, id(lst) stays same
    tup = (7, 3) # a tuple is immutable
    tup += (1, 1) # still works (id changes)
    tup[0] += 3 # fails

Relational Operators
• operands can be of different types
• equality: !=, ==
  − by default uses object identity
• ordering: <, <=, >, >= (TypeError by default)
• consistency is not enforced

Relational Consistency
• __eq__ must be an equivalence relation
• x.__ne__(y) must be the same as not x.__eq__(y)
• __lt__ must be an ordering relation
  − compatible with __eq__
  − consistent with each other
• each operator is separate (mixins can help)
  − or perhaps a class decorator

Exercise 6: Fourier Transform
• continuous: $\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x) \exp(-2\pi i x \xi)\,\mathrm{d}x$
• series:
  − $f(x) = \sum_{n=-\infty}^{\infty} c_n \exp\left(\frac{2\pi i n x}{P}\right)$
• real series:
  − $f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cos\left(\frac{2\pi n x}{P}\right) + b_n \sin\left(\frac{2\pi n x}{P}\right) \right)$
  − $c_n = \frac{1}{2}(a_n - i b_n)$

Exercise 6: Signal Basics
• sample rate: number of samples per second
• we process the signal in equal-sized chunks
  − $P$ is the (time) length of the analysis window
  − $N$ is the number of samples
• use non-overlapping analysis windows

Exercise 6: FFT in numpy
• rfft gives you the $c_n$ of the real series
  − $f(x) = \sum_{n=0}^{N/2} c_n \exp\left(\frac{2\pi i n x}{P}\right)$
  − $N/2$ because of the Nyquist frequency limit
• we are only interested in
amplitudes: $|c_n|$
  − amplitude of a complex number: numpy.abs

Exercise 6: Input
• a .wav file, PCM, sample rate 8–48 kHz
  − such that it will be accepted by wave.open
  − may be stereo or mono, 16 bit samples
• average the channels for stereo input
• ignore the final (incomplete) analysis window
• you can use struct.unpack to decode the samples

Exercise 6: Output
• a peak is a frequency component with amplitude $\geq 20a$
  − where $a$ is the average amplitude in the same window
• print the highest- and lowest-frequency peak encountered
  − in the form low = 37, high = 18000
  − print no peaks if there are no peaks
  − the numbers are in Hz, precision = exactly 1Hz

Exercise 6: Invocation & Hints
• invocation: ./peaks.py audio.wav
  − the output goes to stdout
  − only a single line for the entire file
• think about how precision relates to $N$
• generate simple sine wave inputs for testing
  − also a sum of sine waves at different frequencies

Part 7: Advanced Constructs 2, Pitfalls

Collection Operators
• in is also a membership operator (outside for)
  − implemented as __contains__
• indexing and slicing operators
  − del x[y] → x.__delitem__(y)
  − x[y] → x.__getitem__(y)
  − x[y] = z → x.__setitem__(y, z)

Conditional Operator
• also known as a ternary operator
• written x if cond else y
  − in C: cond ? x : y
• forms an expression, unlike if
  − can e.g. appear in a lambda
  − or in function arguments, &c.
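
The collection and conditional operators from the last two slides can be tried out with a minimal sketch like the one below. The class Shelf and the lambda describe are made-up names for illustration only; they do not come from the course materials.

```python
class Shelf:
    """A toy container (hypothetical) that implements the collection operators."""

    def __init__(self):
        self.items = {}

    def __contains__(self, key):        # key in shelf
        return key in self.items

    def __getitem__(self, key):         # shelf[key]
        return self.items[key]

    def __setitem__(self, key, value):  # shelf[key] = value
        self.items[key] = value

    def __delitem__(self, key):         # del shelf[key]
        del self.items[key]


shelf = Shelf()
shelf["tea"] = 3                 # calls __setitem__
has_tea = "tea" in shelf         # calls __contains__
count = shelf["tea"]             # calls __getitem__
del shelf["tea"]                 # calls __delitem__
still_there = "tea" in shelf

# the conditional operator is an expression, so it fits into a lambda ...
describe = lambda n: "empty" if n == 0 else "stocked"
labels = [describe(n) for n in (0, 3)]

# ... and into function arguments
print("tea" if has_tea else "no tea", count, labels)
```

Defining only __getitem__ would already make indexing work; the other three methods are needed only for assignment, deletion and an explicit membership test.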

Concurrency & Parallelism
• threading – thread-based parallelism
• multiprocessing
• concurrent – future-based programming
• subprocess
• sched, a general-purpose event scheduler
• queue, for sending objects between threads

Threading
• low-level thread support, module threading
• Thread objects represent actual threads
  − threads provide start() and join()
  − the run() method executes in a new thread
• mutexes, semaphores &c.

The Global Interpreter Lock
• memory management in CPython is not thread-safe
  − Python code runs under a global lock
  − pure Python code cannot use multiple cores
• C code usually runs without the lock
  − this includes numpy crunching

Multiprocessing
• like threading but uses processes
• works around the GIL
  − each worker process has its own interpreter
• queued/sent objects must be pickled
  − see also: the pickle module
  − this causes substantial overhead
  − functions, classes &c. are pickled by name

Futures
• like coroutine await but for subroutines
• a Future can be waited for using f.result()
• scheduled via concurrent.futures.Executor
  − Executor.map is like asyncio.gather
  − Executor.submit is like asyncio.create_task
• implemented using process or thread pools

Exceptions
• an exception interrupts normal control flow
• it’s called an exception because it is exceptional
  − never mind StopIteration
• causes methods to be interrupted
  − until a matching except block is found
  − also known as stack unwinding

Life Without Exceptions
    int fd = socket( ... );
    if ( fd < 0 )
        ... /* handle errors */
    if ( bind( fd, ... ) < 0 )
        ... /* handle errors */
    if ( listen( fd, 5 ) < 0 )
        ...
        /* handle errors */

With Exceptions
    try:
        sock = socket.socket( ... )
        sock.bind( ... )
        sock.listen( ... )
    except ...:
        # handle errors

Exceptions vs Resources
    x = open( "file.txt" )
    # stuff
    raise SomeError
• who calls x.close()?
• this would be a resource leak

Using finally
    try:
        x = open( "file.txt" )
        # stuff
    finally:
        x.close()
• works, but tedious and error-prone

Using with
    with open( "file.txt" ) as f:
        # stuff
• with takes care of the finally and close
• with x as y sets y = x.__enter__()
  − and calls x.__exit__(...) when leaving the block

The @property decorator
• attribute syntax is the preferred one in Python
• writing useless setters and getters is boring

    class Foo:
        @property
        def x(self): return 2 * self.a
        @x.setter
        def x(self, v): self.a = v // 2

Mixing Languages
• for many people, Python is not a first language
• some things look similar in Python and Java (C++, …)
  − sometimes they do the same thing
  − sometimes they do something very different
  − sometimes the difference is subtle

Python vs Java: Decorators
• Java has a thing called annotations
• looks very much like a Python decorator
• in Python, decorators can drastically change meaning
• in Java, they are just passive metadata
  − other code can use them for meta-programming though

Class Body Variables
    class Foo:
        some_attr = 42
• in Java/C++, this is how you create instance variables
• in Python, this creates class attributes
  − i.e.
what C++/Java would call static attributes

Very Late Errors
    if a == 2:
        priiiint("a is not 2")
• no error when loading this into python
• it even works as long as a != 2
• most languages would tell you much earlier

Very Late Errors (cont’d)
    try:
        foo()
    except TyyyypeError:
        print("my mistake")
• does not even complain when running the code
• you only notice when foo() raises an exception

Late Imports
    if a == 2:
        import foo
        foo.say_hello()
• unless a == 2, foo is not loaded
• any syntax errors don’t show up until a == 2
  − the module may even fail to exist

Block Scope
    for i in range(10): pass
    print(i) # not a NameError
• in Python, local variables are function-scoped
• in other languages, i is confined to the loop

Assignment Pitfalls
    x = [ 1, 2 ]
    y = x
    x.append( 3 )
    print(y) # prints [ 1, 2, 3 ]
• in Python, everything is a reference
• assignment does not make copies

Python vs Java: Closures
• captured variables are final in Java
• but they are mutable in Python
  − and of course captured by reference
• they are whatever you tell them to be in C++

Explicit super()
• Java and C++ automatically call parent constructors
• Python does not
• you have to call them yourself

Setters and Getters
    obj.attr
    obj.attr = 4
• in C++ or Java, this is an assignment
• in Python, it can run arbitrary code
  − this often makes getters/setters redundant

Exercise 7: Music Analysis
• invocation: ./music.py 440 audio.wav
  − 440 is the frequency of the pitch a’
  − audio.wav is the same as for exercise 6
• use a sliding window for .1 second
precision
• print peak pitches instead of frequencies

Exercise 7: Output
    01.0-02.3 e+0 gis+0 b+0
    10.0-12.0 b'+10
    12.0-12.7 C+0 e-3
• consider only the 3 most prominent peaks
• print 1 line for each segment with the same peaks
  − print nothing for segments with no peaks
  − order the peaks by increasing frequency

Exercise 7: Pitch Formatting
• pitch names: c, cis, d, es, e, f, fis, g, gis, a, bes, b
• octaves (Helmholtz): A,, / A, / A / a / a’ / a’’ and so on
• pitches use a logarithmic scale
  − if a’ is 440 Hz, then a is 220 Hz and A is 110 Hz
• valid pitch examples: fis / Cis / bes’ / Es,

Exercise 7: Pitch Deviation
• not all pitches are exactly ‘right’
  − i.e. they won’t exactly match a named pitch
• cent is 1/100 the distance between semitones
  − remember that this is a logarithmic scale
• print the closest named pitch and the deviation in cents
  − if a’ = 440 Hz, then 448 Hz is a’ + 31 cents
  − likewise, 115 Hz is Bes − 23 cents

Exercise 7: Peak Clustering
• most instruments have complex spectra
  − individual notes are not pure sine waves
• this can lead to peak clustering
  − that is multiple peaks next to each other (1Hz apart)
  − consider only the strongest peak in each cluster
  − if equal, pick the one closer to the center of the cluster

Part 8: Testing, Debugging & Profiling

Why Testing
• reading programs is hard
• reasoning about programs is even harder
• testing is comparatively easy
• difference between an example and a proof

What is Testing
• based on trial runs
• the program is executed with some inputs
• the outputs or outcomes are checked
• almost always incomplete

Testing Levels
• unit testing
  − individual classes
  − individual functions
• functional
  − system
  − integration

Testing Automation
• manual testing
  − still widely used
  − requires human
• semi-automated
  − requires human assistance
• fully automated
  − can run unattended

Testing Insight
• what does the test or tester know?
• black box: nothing known about internals
• gray box: limited knowledge
• white box: ‘complete’ knowledge

Why Unit Testing?
• allows testing small pieces of code
• the unit is likely to be used in other code
  − make sure your code works before you use it
  − the less code, the easier it is to debug
• especially easier to hit all the corner cases

Unit Tests with unittest
• from unittest import TestCase
• derive your test class from TestCase
• put test code into methods named test_*
• run with python -m unittest program.py
  − add -v for more verbose output

    from unittest import TestCase

    class TestArith(TestCase):
        def test_add(self):
            self.assertEqual(1, 4 - 3)
        def test_leq(self):
            self.assertTrue(3 <= 2 * 3)

Unit Tests with pytest
• a more pythonic alternative to unittest
  − unittest is derived from JUnit
• easier to use and less boilerplate
• you can use native python assert
• easier to run, too
  − just run pytest in your source repository

Test Auto-Discovery in pytest
• pytest finds your testcases for you
  − no need to register anything
• put your tests in test_*.py or *_test.py
• name your testcases (functions) test_*

Fixtures in pytest
• sometimes you need the same thing in many testcases
• in unittest, you have the test class
• pytest passes
fixtures as parameters
  − fixtures are created by a decorator
  − they are matched based on their names

    import pytest
    import smtplib

    @pytest.fixture
    def smtp_connection():
        return smtplib.SMTP("smtp.gmail.com", 587)

    def test_ehlo(smtp_connection):
        response, msg = smtp_connection.ehlo()
        assert response == 250

Property Testing
• writing test inputs is tedious
• sometimes, we can generate them instead
• useful for general properties like
  − idempotency (e.g. serialize + deserialize)
  − invariants (output is sorted, …)
  − code does not cause exceptions

Using hypothesis
• property-based testing for Python
• has strategies to generate basic data types
  − int, str, dict, list, set, …
• compose built-in generators to get custom types
• integrated with pytest

    import hypothesis
    import hypothesis.strategies as s

    @hypothesis.given(s.lists(s.integers()))
    def test_sorted(x):
        assert sorted(x) == x # should fail

    @hypothesis.given(x=s.integers(), y=s.integers())
    def test_cancel(x, y):
        assert (x + y) - y == x # looks okay

Going Quick and Dirty
• goal: minimize time spent on testing
• manual testing usually loses
  − but it has almost 0 initial investment
• if you can write a test in 5 minutes, do it
• useful for testing small scripts

Shell 101
• shell scripts are very easy to write
• they are ideal for testing IO behaviour
• easily check for exit status: set -e
• see what is going on: set -x
• use diff -u to check expected vs actual output

Shell Test Example
    set -ex
    python script.py < test1.in | tee out
    diff -u test1.out out
    python script.py < test2.in | tee out
    diff -u test2.out out

Continuous
Integration
• automated tests need to be executed
• with many tests, this gets tedious to do by hand
• CI builds and tests your project regularly
  − every time you push some commits
  − every night (e.g. more extensive tests)

CI: Travis
• runs in the cloud (CI as a service)
• trivially integrates with pytest
• virtualenv out of the box for python projects
• integrated with github
• configure in .travis.yml in your repo

CI: GitLab
• GitLab has its own CI solution (similar to travis)
• also available at FI
• runs tests when you push to your gitlab
• drop a .gitlab-ci.yml in your repository
• automatic deployment into heroku &c.

CI: Buildbot
• written in python/twisted
  − basically a framework to build a custom CI tool
• self-hosted and somewhat complicated to set up
  − more suited for complex projects
  − much more flexible than most CI tools
• distributed design

CI: Jenkins
• another self-hosted solution, this time in Java
  − widely used and well supported
• native support for python projects (including pytest)
  − provides a dashboard with test result graphs &c.
  − supports publishing sphinx-generated documentation

Print-based Debugging
• no need to be ashamed, everybody does it
• less painful in interpreted languages
• you can also use decorators for tracing
• never forget to clean your program up again

    import sys

    def debug(e):
        f = sys._getframe(1)
        v = eval(e, f.f_globals, f.f_locals)
        l = f.f_code.co_filename + ':'
        l += str(f.f_lineno) + ':'
        print(l, e, '=', repr(v), file=sys.stderr)

    x = 1
    debug('x + 1')

The Python Debugger
• run as python -m pdb program.py
• there’s a built-in help command
• next steps through the program
• break to set a breakpoint
• cont to run until end or a breakpoint

What is Profiling
• measurement of resource consumption
• essential info for optimising programs
• answers questions about bottlenecks
  − where is my program spending most time?
  − less often: how is memory used in the program

Why Profiling
• ‘blind’ optimisation is often misdirected
  − it is like fixing bugs without triggering them
  − program performance is hard to reason about
• tells you exactly which point is too slow
  − allows for best speedup with least work

Profiling in Python
• provided as a library, cProfile
  − alternative: profile is slower, but more flexible
• run as python -m cProfile program.py
• outputs a list of lines/functions and their cost
• use cProfile.run() to profile a single expression

    # python -m cProfile -s time fib.py
    ncalls   tottime  percall  file:line(function)
    13638/2  0.032    0.016    fib.py:1(fib_rec)
    2        0.000    0.000    {builtins.print}
    2        0.000    0.000    fib.py:5(fib_mem)

Exercise 8: Statistics
• fetch points.csv from study materials
  − each column is one deadline of one exercise
  − each line is one student, cells are points
• an average student has average points in each column
• you can use pandas and/or numpy if you like

Exercise 8: Bulk Stats
• invocation: ./stat.py file.csv <mode>
  − <mode> is one of: dates, deadlines, exercises
• in each mode, list all such entities along with
  − mean, median, first and last quartile of points
  − number of students that passed (points > 0)
• the output is a JSON dictionary of dictionaries
• date YYYY-MM-DD, exercise NN, deadline YYYY-MM-DD/NN

Bulk Output (stat.py)
    { "01": { "mean": 1, "median": 1, ... },
      "02": { ..., "passed": 60, ... }, ... }
or
    { "2018-09-26": { ... "last": 2.5, ... },
      "2018-10-03": { ... "passed": 20, ... }, ...
    }

Exercise 8: Individual Stats
• invocation: ./student.py file.csv <id>
  − <id> is the student identifier or average
• output mean and median points per exercise
• a number of passed exercises and total points
• a linear regression for cumulative points in time
  − keys: regression slope (intercept is 0)
• expected date to pass the 16 and 20 point marks
  − keys: date 16 and date 20

Per-Student Output (student.py)
    { "mean": 1.66, "median": 1.5, "total": 10, "passed": 6,
      "regression slope": 0.2,
      "date 16": "2018-12-05", "date 20": "2018-12-25" }

Part 9: Communication, HTTP

Running Programs (the old way)
• os.system is about the simplest
  − also somewhat dangerous – shell injection
  − you only get the exit code
• os.popen allows you to read output of a program
  − alternatively, you can send input to the program
  − you can’t do both (would likely deadlock anyway)
  − runs the command through a shell, same as os.system

Low-level Process API
• POSIX-inherited interfaces (on POSIX systems)
• os.exec: replace the current process
• os.fork: split the current process in two
• os.forkpty: same but with a PTY

Detour: bytes vs str
• strings (class str) represent text
  − that is, a sequence of unicode code points
• files and network connections handle data
  − represented in Python as bytes
• the bytes constructor can convert from str
  − e.g.
b = bytes("hello", "utf8")

Running Programs (the new way)
• you can use the subprocess module
• subprocess can handle bidirectional IO
  − it also takes care of avoiding IO deadlocks
  − set input to feed data to the subprocess
• internally, run uses a Popen object
  − if run can’t do it, Popen probably can

Getting subprocess Output
• only available via run since Python 3.7!
• the run function returns a CompletedProcess
• it has attributes stdout and stderr
• both are bytes (byte sequences) by default
• or str if text or encoding were set
• available if you enabled capture_output

Running Filters with Popen
• if you are stuck with 3.6, use Popen directly
• set stdin in the constructor to PIPE
• use the communicate method to send the input
• this gives you the outputs (as bytes)

    import subprocess
    from subprocess import PIPE

    input = bytes( "x\na\nb\ny", "utf8")
    p = subprocess.Popen(["sort"], stdin=PIPE, stdout=PIPE)
    out = p.communicate(input=input)
    # out[0] is the stdout, out[1] is None

Subprocesses with asyncio
• import asyncio.subprocess
• create_subprocess_exec, like subprocess.run
  − but it returns a Process instance
  − Process has a communicate async method
• can run things in background (via tasks)
  − also multiple processes at once

Protocol-based asyncio subprocesses
• let loop be an implementation of the asyncio event loop
• there’s subprocess_exec and subprocess_shell
  − sets up pipes by default
• integrates into the asyncio transport layer (see later)
• allows you to obtain the data piece-wise
https://docs.python.org/3/library/asyncio-protocol.html

Sockets
• the socket API comes from early BSD Unix
• socket represents a (possible) network connection
• sockets are more complicated than normal
iles − establishing connections is hard − messages get lost much more often than ile data PV248 Python 262/354 Communica on, HTTP Socket Types • sockets can be internet or unix domain − internet sockets connect to other computers − Unix sockets live in the ilesystem • sockets can be stream or datagram − stream sockets are like iles (TCP) − you can write a continuous stream of data − datagram sockets can send individual messages (UDP) PV248 Python 263/354 Communica on, HTTP Sockets in Python • the socket module is available on all major OSes • it has a nice object-oriented API − failures are propagated as exceptions − buffer management is automatic • useful if you need to do low-level networking − hard to use in non-blocking mode PV248 Python 264/354 Communica on, HTTP Sockets and asyncio • asyncio provides sock_* to work with socket objects • this makes work with non-blocking sockets a lot easier • but your program needs to be written in async style • only use sockets when there is no other choice − asyncio protocols are both faster and easier to use PV248 Python 265/354 Communica on, HTTP Hyper-Text Transfer Protocol • originally a simple text-based, stateless protocol • however − SSL/TLS, cryptography (https) − pipelining (somewhat stateful) − cookies (somewhat stateful in a different way) • typically between client (browser) and a front-end server • butalsoasaback-endprotocol(webservertoappserver) PV248 Python 266/354 Communica on, HTTP Request Anatomy • request type (see below) • header (text-based, like e-mail) • content Request Types • GET – asks the server to send a resource • HEAD – like GET but only send back headers • POST – send data to the server PV248 Python 267/354 Communica on, HTTP Python and HTTP • both client and server functionality − import http.client − import http.server • TLS/SSL wrappers are also available − import ssl • synchronous by default PV248 Python 268/354 Communica on, HTTP Serving Requests • derive from BaseHTTPRequestHandler • 
implement a do_GET method
• this gets called whenever the client does a GET
• also available: do_HEAD, do_POST, etc.
• pass the class (not an instance) to HTTPServer

Serving Requests (cont’d)
• HTTPServer creates a new instance of your Handler
• the BaseHTTPRequestHandler machinery runs
• it calls your do_GET etc. method
• request data is available in instance variables
  − self.path, self.headers

Talking to the Client
• HTTP responses start with a response code
  − self.send_response( 200, 'OK' )
• the headers follow (set at least Content-Type)
  − self.send_header( 'Connection', 'close' )
• headers and the content need to be separated
  − self.end_headers()
• finally, send the content by writing to self.wfile

Sending Content
• self.wfile is an open file
• it has a write() method which you can use
• sockets only accept byte sequences, not str
• use the bytes( string, encoding ) constructor
  − match the encoding to your Content-Type

HTTP and asyncio
• the base asyncio currently doesn’t directly support HTTP
• but: you can get aiohttp from PyPI
• contains a very nice web server
  − from aiohttp import web
  − minimum boilerplate, fully asyncio-ready

SSL and TLS
• you want to use the ssl module for handling HTTPS
  − this is especially true server-side
  − aiohttp and http.server are compatible
• you need to deal with certificates (loading, checking)
• this is a rather important but complex topic

Certificate Basics
• certificate is a cryptographically signed statement
  − it ties a server to a certain public key
  − the client ensures the server knows the private key
• the server loads the certificate and its private key
• the client must validate the certificate
  − this is typically a lot harder to get right

SSL in
Python
• start with import ssl
• almost everything happens in the SSLContext class
• get an instance from ssl.create_default_context()
  − you can use wrap_socket to run an SSL handshake
  − you can pass the context to aiohttp
• if httpd is a http.server.HTTPServer and ctx is the SSLContext:
    httpd.socket = ctx.wrap_socket( httpd.socket, ... )
  − the module-level ssl.wrap_socket is deprecated since 3.7 and removed in 3.12

HTTP Clients
• there’s a very basic http.client
• for a more complete library, use urllib.request
• aiohttp has client functionality
• all of the above can be used with ssl
• another 3rd party module: Python Requests

Exercise 9: Forwarding HTTP
• invocation: ./http-forward.py 9001 example.com
  − listen on the specified port (9001 above) for HTTP
  − use example.com as the upstream for GET
• for GET requests:
  − forward the request as-is to the upstream
  − send back JSON to your client (see next slide)
• for POST requests
  − accept JSON data, construct request, proceed as GET
  − supply suitable default headers unless overridden

Exercise 9: GET Requests
• the reply to the client must be a valid JSON dictionary
• send the upstream response code as code
  − or "timeout" (by default after 1 second)
• send all the received headers to the client
• if the response is valid JSON, include it under json
  − include it as a string in content otherwise

Exercise 9: POST Requests
• read a JSON dictionary from the request content; keys:
  − type – string, either GET (default) or POST
  − url – string, the address to fetch
  − headers – dictionary, the headers to send
  − content – the content to send if type is POST
  − timeout – number of seconds to wait for completion
• if the JSON is invalid, set code to "invalid json"
  − also if a crucial key is missing (url, content for POST)

    # POST request content
    {
      "type": "GET",
      "url": "http://example.com",
      "headers": { "Accept-Encoding": "...", ...
      },
      "timeout": 3
    }

    # reply from http-forward.py
    {
      "code": 200,
      "headers": { "Content-Length": ... },
      "json": ...
    }

Exercise 9: Bonus
• handle SSL/TLS when connecting to your upstream
  − specified by https as a protocol in url
• include a boolean certificate valid in response JSON
  − rely on the default system trusted CA certs
  − also certificate for with a list of hostnames
• get 0.5 extra point (regardless of which deadline you pass)

Part 10: Closures, Coroutines &c.

Exercise 10: CGI
• invocation: ./serve.py 9001 dir
• listen on the specified port (9001 in this case)
• serve the content of dir over HTTP
• treat files named .cgi specially (see next slide)
• serve anything else as static content

Exercise 10: Running CGI Scripts
• if a .cgi file is requested, run it
• adhere to the CGI protocol
  − request info goes into environment variables
  − the stdout of the script goes to the client
  − refer to RFC 3875 and/or Wikipedia
• do not forget to deal with POST requests

Exercise 10: Various
• no need to auto-index directories
• you must handle concurrent connections
  − even while a CGI script is running
• you must handle arbitrarily large data
  − this applies to static files
  − but also to CGI script outputs

Execution Stack
• made up of activation frames
• holds local variables
• and return addresses
• in dynamic languages, often lives in the heap

Variable Capture
• variables are captured lexically
• definitions are a dynamic / run-time construct
  − a nested definition is executed
  − creates a closure object
• always by reference in Python
  − but can be by-value in other languages
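The capture rules on the Variable Capture slide can be sketched in a few lines. This is only an illustration: the names make_counter, increment and peek are made up for the example and do not come from the course materials. It shows that two nested definitions share one captured variable, and that a loop variable is captured by reference unless a default argument is used to freeze its value.

```python
def make_counter():
    count = 0                     # local variable of the enclosing call

    def increment():
        nonlocal count            # rebind the captured variable, not a copy
        count += 1
        return count

    def peek():
        return count              # sees the updates made by increment

    return increment, peek


increment, peek = make_counter()
increment()
increment()
first_result = peek()             # both closures share the same 'count'

# Capture is by reference: all three lambdas refer to the same variable i,
# so they all see the value it has after the loop finished.
late = [lambda: i for i in range(3)]
late_values = [f() for f in late]

# A default argument is evaluated when the lambda is defined, which
# emulates capture by value.
early = [lambda i=i: i for i in range(3)]
early_values = [f() for f in early]

print(first_result, late_values, early_values)
```

The default-argument trick is a common workaround, not a separate capture mode: the lambda simply gets its own parameter that happens to be pre-filled.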
Using Closures
• closures can be returned, stored and called
  − they can be called multiple times, too
  − they can capture arbitrary variables
• closures naturally retain state
• this is what makes them powerful

Objects from Closures
• so closures are essentially code + state
• wait, isn’t that what an object is?
• indeed, you can implement objects using closures

The Role of GC
• memory management becomes a lot more complicated
• forget C-style ‘automatic’ stack variables
• this is why the stack is actually in the heap
• this can go as far as forming reference cycles

Coroutines
• coroutines are a generalisation of subroutines
• they can be suspended and re-entered
• coroutines can be closures at the same time
• the code of a coroutine is like a function
• a suspended coroutine is like an activation frame

Yield
• suspends execution and ‘returns’ a value
• may also obtain a new value (cf. send)
• when re-entered, continue where we left off

    for i in range(5):
        yield i

Send
• with yield, we have one-way communication
• but in many cases, we would like two-way
• a suspended coroutine is an object in Python
  − with a send method which takes a value
  − send re-enters the coroutine

Yield From and Await
• yield from is mostly a generator concept
• await basically does the same thing
  − call out to another coroutine
  − when it suspends, so does the entire stack

Suspending Native Coroutines
• this is not actually possible
  − not with async-native syntax anyway
• you need a yield
  − for that, you need a generator
  − use the types.coroutine decorator
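The Yield, Send and Suspending Native Coroutines slides can be tied together with a small sketch. The names running_total, suspend and worker are hypothetical, chosen only for this example. The first half uses a plain generator with send for two-way communication; the second half wraps a generator with types.coroutine so that a native (async def) coroutine can suspend through it, and drives that coroutine by hand instead of using an event loop.

```python
import types


def running_total():
    """Generator-based coroutine: receives numbers, yields the sum so far."""
    total = 0
    while True:
        value = yield total       # suspend here; send() resumes with a value
        total += value


gen = running_total()
next(gen)                         # run up to the first yield
totals = [gen.send(n) for n in (1, 2, 3)]
gen.close()


@types.coroutine
def suspend(message):
    """A generator made awaitable, so that it can actually yield."""
    reply = yield message         # suspends the whole chain of awaits
    return reply


async def worker():
    answer = await suspend("need input")
    return answer * 2


coro = worker()
request = coro.send(None)         # start; runs until the yield in suspend()
try:
    coro.send(21)                 # resume; worker runs to completion
except StopIteration as stop:
    result = stop.value           # the return value of worker()

print(totals, request, result)
```

Driving the coroutine with send by hand is roughly the job an event loop does; in real programs that part would be left to asyncio.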
Event Loop
• not required in theory
• useful also without coroutines
• there is a synergistic effect
  − event loops make coroutines easier
  − coroutines make event loops easier

Part 11: asyncio, Projects

IO at the OS Level
• often defaults to blocking
  − read returns when data is available
  − this is usually OK for files
• but what about network code?
  − could work for a client

Threads and IO
• there may be work to do while waiting
  − waiting for IO can be wasteful
• only the calling (OS) thread is blocked
  − another thread may do the work
  − but multiple green threads may be blocked

Non-Blocking IO
• the program calls read
  − read returns immediately
  − even if there was no data
• but how do we know when to read?
  − we could poll
  − for example call read every 30ms

Polling
• trade-off between latency and throughput
  − sometimes, polling is okay
  − but is often too inefficient
• alternative: IO dispatch
  − useful when multiple IOs are pending
  − wait only if all are blocked

select
• takes a list of file descriptors
• block until one of them is ready
  − next read will return data immediately
• can optionally specify a timeout
• only useful for OS-level resources

Alternatives to select
• select is a rather old interface
• there are a number of more modern variants
• poll and epoll system calls
  − despite the name, they do not poll
  − epoll is more scalable
• kqueue and kevent on BSD systems

Synchronous vs Asynchronous
• the select family is synchronous
  − you call the function
  − it may wait some time
  − you proceed when it returns
• OS threads are fully asynchronous

The Thorny Issue of Disks
• a file is always ‘ready’ for reading
• this may still
take time to complete
• there is no good solution on UNIX
• POSIX AIO exists but is sparsely supported
• OS threads are an option

IO on Windows
• select is possible (but slow)
• Windows provides real asynchronous IO
  − quite different from UNIX
  − the IO operation is directly issued
  − but the function returns immediately
• comes with a notification queue

The asyncio Event Loop
• uses the select family of syscalls
• why is it called async IO?
  − select is synchronous in principle
  − this is an implementation detail
  − the IOs are asynchronous to each other

How Does It Work
• you must use asyncio functions for IO
• an async read does not issue an OS read
• it yields back into the event loop
• the fd is put on the select list
• the coroutine is resumed when the fd is ready

Timers
• asyncio allows you to set timers
• the event loop keeps a list of those
• and uses that to set the select timeout
  − just uses the nearest timer expiry
• when a timer expires, its owner is resumed

Blocking IO vs asyncio
• all user code runs on the main thread
• you must not call any blocking IO functions
• doing so will stall the entire application
  − in a server, clients will time out
  − even if not, latency will suffer

DNS
• POSIX: getaddrinfo and getnameinfo
  − also the older API gethostbyname
• those are all blocking functions
  − and they can take a while
  − but name resolution is essential
• asyncio internally uses OS threads for DNS

Signals
• signals on UNIX are very asynchronous
• interact with OS threads in a messy way
• asyncio hides all this using C code

Exercise 11: Tic Tac Toe
• write a game server for (3x3) tic tac toe
• invocation: ./ttt.py port
  − listen on the given port (number)
  − serve HTTP (only GET requests)
  − all responses are JSON dictionaries

Exercise 11: Start
• GET /start?name=string
• returns a numeric id
  − multiple games may run in parallel
• the game starts with an empty board
• player 1 plays first

Exercise 11: Status
• GET /status?game=id
• if the game is over:
  − set winner to 0 (draw), 1 or 2
• otherwise set:
  − board is a list of lists of numbers
  − 0 = empty, 1 and 2 indicate the player
  − next 1 or 2 (who plays next)

Exercise 11: Playing
• GET /play?game=id&player=1&x=1&y=2
• must validate the request
• set status to either "ok" or "bad"
  − if status is "bad", set message
  − message is free-form text for the user

Exercise 12: Tic Tac Toe Client
• include ttt.py from exercise 11
  − add a /list request
  − returns a JSON list of games
  − each is a dict with name and id
• invocation: client.py host port

Exercise 12: User Interface
• start by offering a list of games
  − only offer games with empty boards
• the user enters the numeric id to join
  − joining makes you player 2
• typing new starts a new game
  − you start as player 1

Exercise 12: Polling
• ask for status ~once per second
• while waiting, print (once)
  − waiting for the other player
• draw an up-to-date board
  − use _, x and o, no spaces

Exercise 12: Gameplay
• prompt with your turn (o): (or x)
  − read 𝑥 and 𝑦 (whitespace separated)
  − if invalid, print invalid input
  − then ask again (until satisfied)
• on game over, print you lose or you win

Exercise 12: Bonus
• make an interactive graphical interface
  − make the interaction mouse-based
  − use pygame or pyglet
• must be ready for the last seminar
  − you can get 1 extra point

Projects
• you
can earn 4 points
  − that’s 2 exercises worth
  − the effort should match that
• submit by the end of the exam period
• this is a fallback option
  − exercises and reviews are preferred

Project Grading
• there is only 1 automated option (see DF)
  − can be evaluated repeatedly
• everything else is evaluated manually
  − should work 100% on first try
  − you get at most one retry
  − expect latency of about a week

Project Reviews
• projects can be reviewed before submission
  − excluding the machine-corrected variant
  − you can seek multiple reviews
  − getting at least one is strongly recommended
• otherwise same rules as for exercises
  − review point limits are shared

Project Topics
• do not try to sell something you already have
• seek approval before you start working
  − put a project.txt in your repository
  − I will make a note in the IS notebook
• it is okay to come up with your own
  − but I may request changes

Project Idea: Breakout
• write a breakout clone (game)
  − or another game of similar complexity
  − do not settle for absolute bare-bones
  − add simple sound effects or animation
• you can use pygame or pyglet

Project Idea: Scorelib Redux
• write an editor for the score database
  − should be practically usable
  − work with the SQL representation
• you can use pyqt5
• alternatively flask or django
  − might need some javascript
  − you can also use aiohttp and AJAX

Project Idea: A Real Tuner
• should work in real time
• process microphone input
  − alternatively work with a recording
  − in which case, provide a slider
• visualize the outputs
  − try pygame or pyglet

Part 12: Modules and Packages

Code Modularity
• common tasks are bundled as functions
• functions can be bundled
into classes
  − often contains shared state (via attributes)
• classes are bundled into modules
  − simpler than classes: usually no data
• modules can be bundled into packages

Why Modularity
1. managing size and complexity
2. management of names
3. code re-use and sharing

Code Size
• there are natural limits on function size
  − long functions are hard to understand
  − likewise on class sizes
• this also holds for modules
  − big modules are hard to use
  − but even harder to maintain

Naming Things
• human brain is highly context-sensitive
  − same name can refer to many things
  − consider a method called open
• there is no optimal length for a name
  − wider scopes require longer names
  − long names in narrow scopes are wasteful

Namespaces
• a hierarchical approach to names
  − use a short name from within the scope
  − use a longer name from outside
• with a built-in mechanism for shortcuts
• realized by classes, modules, packages

Python Modules
• creating a single module is simple
• a collection of re-usable code
  − mainly classes (class)
  − and functions (def, async def)
• there is no special syntax
  − a file, basically the same as a script

Python Packages
• a package is a bundle of modules
• realized as a file system directory
  − it must have an __init__.py
  − but it could be empty
• this is what gives us import foo.bar

Package Mechanics
• the __init__.py has two roles
  − prevent conflicts with non-package directories
  − provide definitions
• import foo will load foo/__init__.py

More on Import
• import loads and evaluates the module
• it creates an object to represent it
• creates a variable in the current scope
• assigns the object to the variable
• import
is somewhat like def

Bytecode
• CPython is actually a bytecode interpreter
• there is a frontend which parses code
  − and emits an intermediate representation
  − which can be stored as bytecode
• bytecode is stored in .pyc files
• and for modules, it is cached under __pycache__

Modules Written in C
• those are implemented as shared libraries
  − .so on UNIX (typically ELF shared object)
  − .pyd on Windows (really a PE DLL file)
• the lookup is the same as for .py modules
• functions show up as built-in functions

The View from C
• CPython objects are of type PyObject *
• C APIs exist to create and use objects
• recall that modules are just objects
• a special function PyInit_modname()
  − say PyInit_spam() in spam.so
  − import calls this to create the object

Built-in Modules
• some modules are completely built into CPython
• internally, they are much like C modules
• may be for efficiency or for low-level system access
• the sys module is always built-in
  − sys.path is needed to load any other modules

Modules are Garbage-Collected
• sys.modules holds references to all loaded modules
• it’s possible to remove modules from there
• importing again will then reload the module
• the old version can be garbage-collected
• some C modules are excluded from this mechanism

Distributing Packages: Reminder
• python packages are distributed via PyPI
• source trees are different from installed modules
• extra metadata in the source tree
  − info about authors, links to resources
  − most importantly package dependencies

Source Trees
• python is not a compiled language
  − the source code is what is installed
• some packages also contain C code
  − think number crunching in numpy
  − this must be actually
compiled
• there are also unit tests of course

setup.py
• a script that installs your package
• it knows where to put it and how
• also knows how to build C code
• usually written using setuptools

Versioning
• so you have made a package…
  − it is probably not complete
  − and it may have some bugs in it
• you add features, fix bugs…
  − other people already use it
  − you need to make a new version

Version Numbers
• often major.minor or major.minor.patch
  − for example: python 3.6.5
• a change in major indicates incompatibility
  − like when print x no longer works in python 3
• minor is for non-breaking feature additions
• patch is for bug fixes

Dependencies
• packages are meant for re-use
• so you want to use some package
  − your users will need it too
  − maybe you need a dozen
• sure enough, packages need other packages
  − this is ripe for automation

Dependency Chasing
• setup.py could just download dependencies
  − setuptools automates this for you
  − and uses PyPI to find the packages
• it also only downloads what is missing
• pip will find you the ‘toplevel’ package

Versioned Dependencies
• so you use function bar from package foo
  − but it only appeared in version 2.4
• so you need package foo at version 2.4 or newer
• but bar was then removed in version 3
  − no time right now to deal with that
• welcome to dependency hell

Chasing Dependencies Redux
• versioning makes dependencies NP-hard
• dependencies may be impossible to satisfy
• mistakes happen with version numbers too
  − those usually affect other packages
• this is a problem in every complex software system

Versioning Strategies
• optimistic dependencies
  − maybe next foo major won’t break my code
  − if it
does, my package breaks and I must fix it
• defensive dependencies
  − next major of foo will probably break my code
  − I use baz 1.1 and foo 2.4 and depend on foo < 3
  − around comes baz 1.2 but it needs foo 3.1

Questions & (maybe) Answers