PA152: Efficient Use of DB 3. Representing Data Elements Vlastislav Dohnal PA152, Vlastislav Dohnal, FI MUNI, 2023 2 Outline ◼ Data elements and fields ◼ Records ◼ Block organization ◼ Properties and examples PA152, Vlastislav Dohnal, FI MUNI, 2023 3 Representing Data ◼ What to store? name salary date picture ◼ How to store? values → byte sequence Data Elements Records Blocks Files Data Storage PA152, Vlastislav Dohnal, FI MUNI, 2023 4 Data Element Types ◼ Integers  By range: 2, 4, 8 bytes  E.g., 35 in 16 bits  Typically sign bit or ones complement ◼ Real numbers  Floating-point numbers ◼ n bits split to mantissa and exponent (IEEE 754)  Fixed decimal point (number(p,s)) ◼ Encoding a group of 9 digits (in base 10) into 4 bytes ◼ Store as a string in base 10 00000000 00100011 PA152, Vlastislav Dohnal, FI MUNI, 2023 5 ◼ Boolean Usually as an integer True False No reason to use less than 1 byte ◼ Bit array Length + bits ◼ Typically rounded up to next multiple of 4/8 bytes Data Element Types 1111 1111 0000 0000 PA152, Vlastislav Dohnal, FI MUNI, 2023 6 Data Element Types ◼ Date  number of days since “epoch” (e.g., Jan 1, 1970) ◼ or as a packed 3-byte integer DD + MM*32 + YYYY*16*32  string YYYYMMDD (8 bytes) ◼ YYYYDDD (7 bytes) ◼ Why not YYMMDD? ◼ Time  number of seconds since midnight ◼ number of milliseconds or microseconds ◼ or as a packed 3-byte integer DD*24*3600 + HH*3600 + MM*60 + SS  fractions of second ◼ As string HHMMSSFF or as above with fractional part separately (up to 3 bytes for 6 digits)  time zones – time converted and stored in UTC ◼ so converted from given/local time zone to UTC PA152, Vlastislav Dohnal, FI MUNI, 2023 7 Data Element Types ◼ Datetime Combining date and time ◼ Year*13+month; day, hour, min, sec + fraction  5 bytes + fractional part ◼ Timestamp Seconds since epoch with sec ◼ midnight Jan 1, 1970 UTC; Jan 1, 2000 in Pg. ◼ Enumerated type Assign integers (ordinal numbers) red → 0, green → 1, blue → 2, yellow → 3, ... Pg stores the value as string total number of bytes ◼ Characters & character sets In ASCII encoding – 1 byte Multi-byte characters ◼ UCS-2 (UTF-16) – UTF-8 encoding in 16 bits  Characters with ordinal numbers from 0 to 65535 ◼ UTF-8 – variable-length encoding  Character may occupy 1-4 bytes ▪ Originally up to 6 ▪ Now it is limited to the same range as UTF-16.  Representation: 0xxxxxxx 110xxxxx 10xxxxxx 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx PA152, Vlastislav Dohnal, FI MUNI, 2023 8 Data Element Types PA152, Vlastislav Dohnal, FI MUNI, 2023 9 Data Element Types ◼ Strings Fixed length ◼ Size limited, so  shorter strings filled with space  longer strings cut off Variable length ◼ Length plus content ◼ Null-terminated  must be read completely  cannot use zero character (ord == 0) in the string Character set issues (encoding) PA152, Vlastislav Dohnal, FI MUNI, 2023 10 Storing Data Elements: Summary ◼ Each element has a “type” bit interpretation size special “unknown” value (NULL) ◼ Usually, fixed length predefined bit representation ◼ Variable length length plus content/value PA152, Vlastislav Dohnal, FI MUNI, 2023 11 Outline ◼ Data elements and fields ◼ Records ◼ Block organization ◼ Properties and examples PA152, Vlastislav Dohnal, FI MUNI, 2023 12 Record ◼ List of related data elements i.e., their values Fields, Attributes ◼ E.g. Employee ◼ name – Novák ◼ salary – 1234 ◼ start_date – Jan 1, 2000 PA152, Vlastislav Dohnal, FI MUNI, 2023 13 Record Types by Length ◼ Fixed length Each record of same size (in bytes) ◼ Variable length Saving space More complex implementation Can store large data (images, …) PA152, Vlastislav Dohnal, FI MUNI, 2023 14 Record Schema ◼ Describes record structure ◼ Information contained Number of attributes Order of attributes Data type and name of each attribute PA152, Vlastislav Dohnal, FI MUNI, 2023 15 Record Types by Schema ◼ Fixed schema Same schema for all records ◼ Stored out of record (in data dictionary) ◼ Variable schema Record itself contains schema Useful for: ◼ “sparse“ records (many NULLs) ◼ repeating attributes ◼ evolving formats  schema changes during DB lifetime PA152, Vlastislav Dohnal, FI MUNI, 2023 16 Example: Fixed Length and Schema ◼ Employee 1) id – 2 byte integer 2) name – 10 chars 3) department – 2 bytes code55 n o v á k 02 83 d l o u h ý 01 schema records ◼ Padding to “convenient“ size  Faster memory access when address is round to 4 (8) bytes 55 n o v á k 02 - - PA152, Vlastislav Dohnal, FI MUNI, 2023 17 Example: Variable Length and Schema ◼ Employee: Called „Tagged fields“ Codes identify attribute names stored elsewhere; could be strings directly, i.e., tags. 4I52 4S HCEČ46#Fields Codeidentifying fieldasid Integertype CodeforEname Stringtype Lengthofstr. PA152, Vlastislav Dohnal, FI MUNI, 2023 18 Example (cont.): Repeating Attribute ◼ Employee’s children Useful in case of arrays, etc. ◼ Repeating attribute may not mean variable length either schema Can set maximum number of values ◼ Unused space filled with NULLs 3 Name: Jan Novák Child: Tomáš Child: Pavel Novák Potápění Šachy -- Schema Changes ◼ Fixed schema every record must be updated ◼ Variable schema only changed records get updated PA152, Vlastislav Dohnal, FI MUNI, 2023 19 PA152, Vlastislav Dohnal, FI MUNI, 2023 20 „Intermediate“ Schema ◼ Compromise between fixed and variable schema ◼ Record schema “version” in record header 5 27 . . . . record type tells me what to expect (i.e., points to schema) record length PA152, Vlastislav Dohnal, FI MUNI, 2023 21 Record Header ◼ = information about the record (fixed length; no relation to attribute values) Record schema “version” (pointer) Record Length Creation / update / access timestamp OID (Object Identifier) – “record ID”, “tuple ID” Bit array of NULL value flags ◼ One bit for each attribute … PA152, Vlastislav Dohnal, FI MUNI, 2023 22 Other Issues ◼ Compression  Increase speed of accessing/updating (fewer bytes)  Within record (values independently)  Collection of records ◼ More effective (can build a dictionary, find common patters) ◼ More complex to implement ◼ In Pg, LZ4 and PGLZ  Lempel, Ziv based lossless algorithm ◼ high compression performance (>0.5GMB/s per core) ◼ extreme decompression perf. (>6 GB/s per core) PA152, Vlastislav Dohnal, FI MUNI, 2023 23 Other Issues ◼ Encryption Consequence to indexing… How to do range queries? … Solution: ◼ Encrypt buffer data during file system I/O ◼ WAL records stored in WAL buffers that get encrypted when writing to the file system PA152, Vlastislav Dohnal, FI MUNI, 2023 24 Storing Objects ◼ Current commercial DBMS support objects Extension of relational DBMS OODBMS ◼ Objects have attributes Primitive types → store as a record Collections → create a new relation ◼ Referencing using OIDs PA152, Vlastislav Dohnal, FI MUNI, 2023 25 Storing Relations ◼ Row-oriented Tackled up to now… ◼ Example of row-oriented storage:  Order(id, cust, prod, store, price, date, qty) ◼ Column-oriented Values of the same attribute stored together id1 cust1 prod1 store1 price1 date1 qty1 id2 cust2 prod2 store2 price2 date2 qty2 id3 cust3 prod3 store3 price3 date3 qty3 PA152, Vlastislav Dohnal, FI MUNI, 2023 26 Column-oriented Storage ◼ Relation Order(id, cust, prod, store, price, date, qty) id1 cust1 id2 cust2 id3 cust3 id4 cust4 ... ... id1 prod1 id2 prod2 id3 prod3 id4 prod4 ... ... Id may or may not be stored. Could exploit record ordering … PA152, Vlastislav Dohnal, FI MUNI, 2023 27 Comparison ◼ Advantage of column-oriented storage More compact (no padding to 4/8 bytes, compression, …) Efficient access (e.g., data mining) ◼ Process few attributes but all their values ◼ Advantage of row-oriented storage Record update / insertion more efficient Whole record access more efficient Mike Stonebraker, Elizabeth O'Neil, Pat O’Neil, Xuedong Chen, et al.: C-Store: A Column-oriented DBMS, VLDB Conference, 2005. http://www.cs.umb.edu/~poneil/vldb05_cstore.pdf PA152, Vlastislav Dohnal, FI MUNI, 2023 28 Outline ◼ Data elements and fields ◼ Records ◼ Block organization ◼ Properties and examples PA152, Vlastislav Dohnal, FI MUNI, 2023 29 Block Organization ◼ Records Fixed length Variable length ◼ Block of fixed length records file blocks … PA152, Vlastislav Dohnal, FI MUNI, 2023 30 Block Organization ◼ Issues for storing records in blocks: 1. Separating records 2. Spanned vs. unspanned records 3. Sequencing 4. Interlacing more relations 5. Indirection PA152, Vlastislav Dohnal, FI MUNI, 2023 31 Separating Records ◼ Fixed-length records No delimiter Store record count and point to 1st record ◼ Variable-length records Delimiter / special marker Store record lengths (or offsets) ◼ Within each record ◼ In block header R2R1 R3blok ◼ Variable-length records ◼ Organization: block header, records ◼ Header  Pointers to other blocks (overflow, index, …)  Block type (relation, overflow, index, …)  Relation ID  (Directory of record offsets)  Timestamps (creation, modification, access) Separating Records PA152, Vlastislav Dohnal, FI MUNI, 2023 32 unused space header record 1 record 2 … record nblock PA152, Vlastislav Dohnal, FI MUNI, 2023 33 Spanned vs. unspanned ◼ Unspanned each record in a block simple, but not space efficient ◼ Spanned record split across blocks required when a record exceeds block size! R1 R2 R3 R4 R5 block 1 block 2 … R1 R2 R3(a) R3(b) R6R5R4 R7(a) block 1 block 2 … PA152, Vlastislav Dohnal, FI MUNI, 2023 34 Unspanned: Example ◼ Records cannot cross block boundary 106 records, each 2 050 bytes (fixed length) Block size 4 096 bytes Space allocated: 106 * 4096 B Space utilized: 106 * 2050 B Utilization ratio: 50.05% block 1 block 2 2050 bytes unused 2046 2050 bytes unused 2046 R1 R2 PA152, Vlastislav Dohnal, FI MUNI, 2023 35 Unspanned ◼ Options for oversized attribute values The Oversized-Attribute Storage Technique ◼ TOAST or "the best thing since sliced bread”** ◼ Principle  A TOAST table is created (chunk_id, chunk_seq, value)  Value is split into “chunks” ▪ Chunks form records in TOAST table ▪ Chunk identified by (chunk_id, chunk_seq)  Original space is used to store length of the value, toast table id and chunk id. Compression Split into multiple records within the table (internally) ** [cit. dokumentace PostgreSQL] PA152, Vlastislav Dohnal, FI MUNI, 2023 36 Unspanned ◼ Large Objects (LOBs) Two types: binary / text Stored off the table ◼ in consecutive blocks (in a separate file) Typically, not indexed by DBMS ◼ i.e., cannot search in the value PA152, Vlastislav Dohnal, FI MUNI, 2023 37 Spanned ◼ Record split across blocks Blocks must be ordered or Use pointers ◼ Record split into “fragments” Bit flag “fragmented” in header Pointers to next / previous fragments R1 R2 R3(a) R3(b) R6R5R4 R7(a) block 1 block 2 … PA152, Vlastislav Dohnal, FI MUNI, 2023 38 Sequencing ◼ Ordering records in file (and blocks) by some key value => sequential file ◼ Reason: Efficient record access in the key ordering E.g., good for merge-join, order by, … ◼ Solution in DBMS Clustered index PA152, Vlastislav Dohnal, FI MUNI, 2023 39 Sequencing – sequential file ◼ Stored consecutively physically contiguous ◼ Linked preserve order! R2R1 R3 … R3R1 R2 ◼ Overflow area Records in sequence ◼ reorganization needed after record modifications Pointer to an overflow area / block PA152, Vlastislav Dohnal, FI MUNI, 2023 40 Sequencing – sequential file block R1.0 R2.0 R3.0 R4.0 R5.0 header R2.1 R1.3 R4.7 header PA152, Vlastislav Dohnal, FI MUNI, 2023 41 Relation Interlacing ◼ Records of multiple tables in one block Records of more relations accessed simultaneously Store together → Access faster More complex implementation PA152, Vlastislav Dohnal, FI MUNI, 2023 42 Relation Interlacing: Example ◼ Relations: employee (eid, name, address) deposit (eid, did, amount) ◼ Good for query Q1:  SELECT name, address, amount FROM deposit, employee WHERE deposit.eid = employee.eid AND employee.eid = 2354 (2354, ‘Joe Little’, ‘Brno’) (2354, 999001, 100) (2354, 999010, 1500) block … … PA152, Vlastislav Dohnal, FI MUNI, 2023 43 Relation Interlacing: Example ◼ Query Q2: SELECT * FROM employee ◼ Interlacing not convenient for Q2 Depends on frequency of individual queries ◼ Solution: Do not mix within one block Store block in near proximity (same disk cylinder) PA152, Vlastislav Dohnal, FI MUNI, 2023 44 Relation Interlacing (2354, ‘Joe Little’, ‘Brno’) block1 … (2356, ‘Peter Newman’, ‘Brno’) block2 (2354, 999001, 100) (2354, 999010, 1500) … (2356, 924013, 5500) PA152, Vlastislav Dohnal, FI MUNI, 2023 45 Indirection (Pointers to Records) ◼ Applications: Spanned records Referencing blocks / record (e.g., in indices) Linked blocks (e.g., in indices) OODBMS: objects referencing other objects H D M A F Adam Matěj Denisa Hynek Felix index file PA152, Vlastislav Dohnal, FI MUNI, 2023 46 Indirection ◼ Record address Memory address ◼ direct addressing ◼ 8-byte pointer in the virtual memory of a process DB address ◼ sequence of bytes describing record location in external memory ◼ direct vs. indirect addressing PA152, Vlastislav Dohnal, FI MUNI, 2023 47 Indirection in DB (DB Address) ◼ Direct addressing Physical record address ◼ Purely physical address in storage  Device ID, track, platter, block, byte-offset in block Not flexible ◼ E.g., block or records reallocation PA152, Vlastislav Dohnal, FI MUNI, 2023 48 Indirection in DB (DB Address) ◼ Indirect addressing Record / block identified by its ID ID = logical address ◼ any sequence of bits Map table: ID → physical address phys. addr.ID rec ID r address a map table …… PA152, Vlastislav Dohnal, FI MUNI, 2023 49 Indirection in DB (DB Address) ◼ Indirect addressing Disadvantage ◼ Increased costs  Accessing map table  Storing map table Advantage ◼ Very flexible  Deletion/insertion of records  Optimization of block storage PA152, Vlastislav Dohnal, FI MUNI, 2023 50 Indirection in DB (DB Address) ◼ Combination = suitable option  Phys. record address = phys. block address + position ◼ The position in a list of records within block  At this position, byte-offset to the record is stored  Advantages ◼ Can move records within block  No changes to phys. address ◼ Map table is not necessary  Disadvantage ◼ Minor: Moving a record to another block  Replace it with a pointer to new location (block + position) ◼ Major: Not flexible in moving blocks (defragmentation) PA152, Vlastislav Dohnal, FI MUNI, 2023 51 Indirection in DB (DB Address) ◼ Widely used option Record address = File ID + block number + position Blocks are organized by a file system ◼ blocks are numbered from zero within each file File ID, Physical Block # Block ID File System Map PA152, Vlastislav Dohnal, FI MUNI, 2023 52 Indirection in DB (DB Address) ◼ Indirection in block (Slotted Page Structure) Phys. record addr. = file id + blck no. + position R3 R4 R1 R2 block unused space header Byte-offset PA152, Vlastislav Dohnal, FI MUNI, 2023 53 Block Header ◼ Present in each block File ID (or RELATION ID or DB ID) Block type ◼ e.g., record of type, overflow area, TOAST table, … Block ID (this one) Record directory (points to record data) Pointer to free space (beginning, end) Pointer to other blocks (e.g., in indices) Modification timestamp/version number PA152, Vlastislav Dohnal, FI MUNI, 2023 54 Record Modifications ◼ Insertion Typically, “no problem” ◼ Deletion Unused space management ◼ Update Same size ◼ Ok Enlarging/shrinking ◼ Same issues as for insertion/deletion PA152, Vlastislav Dohnal, FI MUNI, 2023 55 Deletion ◼ Pointer to deleted records Must be invalidated Cannot point to new data Dangling pointers R1 ? Deleted record PA152, Vlastislav Dohnal, FI MUNI, 2023 56 Deletion ◼ Direct record addressing (phys. addr.) 1. Mark as deleted ◼ With a marker (tombstone) ◼ One bit  Reality: several bytes due to memory padding 2. Advertise the free space ◼ Linked list of unused areas block cannot be reused unused space PA152, Vlastislav Dohnal, FI MUNI, 2023 57 Deletion ◼ Indirect addressing Map table Deleted record is freed in the block Tombstone in map table ◼ or mapping is deleted, but no ID reuse! ID LOC … 7788 … Map Table Cannot be reused PA152, Vlastislav Dohnal, FI MUNI, 2023 58 Deletion ◼ Rec. addr. = block addr. + rec. position Free space occupied by the record Defragment space ◼ to make it contiguous Set pointer to null in record directory block unused space R3 R4 R1 R2 PA152, Vlastislav Dohnal, FI MUNI, 2023 59 Deletion ◼ Store rec. ID in the record Check ID during record access No overhead than extending the record with RecID If RecID is the pointer itself, some other identification, e.g., xmin is necessary to differentiate the records. PA152, Vlastislav Dohnal, FI MUNI, 2023 60 Insertion ◼ Unordered file Append to end of file ◼ Last block, or allocate new Insert into unused space of existing block ◼ Need to handle variable length of records PA152, Vlastislav Dohnal, FI MUNI, 2023 61 Insertion ◼ Ordered file (sequential) Unfeasible without indirect addressing nor record positions (offsets) Find free space in a “neighboring” block → reorganize ◼ Move last record in the block to the next block  Put a marker in the original place to point to the new location Use overflow block ◼ Pointer to an overflow block is in the block header PA152, Vlastislav Dohnal, FI MUNI, 2023 62 Update ◼ Record enlarged Within a block ◼ No need for tombstones ◼ Move following records Create an overflow block … ◼ Record shrunk by analogy… May free overflow blocks Memory Buffers and Pointers ◼ DB pointer in memory are inefficient ◼ Pointer swizzling Change of DB pointer to memory pointer and back PA152, Vlastislav Dohnal, FI MUNI, 2023 63 Rec A relation block 2 index block 1 Memory (buffers/cache) Disk Itm J Memory Buffers and Pointers ◼ After loading block 1 in memory no update is necessary PA152, Vlastislav Dohnal, FI MUNI, 2023 64 Rec A Itm J index block 1 relation block 2 index block 1 Memory (buffers/cache) Disk Itm J Pointer Swizzling ◼ After reading block 2, pointer updated PA152, Vlastislav Dohnal, FI MUNI, 2023 65 Rec A Itm J index block 1 relation block 2 index block 1 Memory (buffers/cache) Disk Itm J relation block 2 Rec A PA152, Vlastislav Dohnal, FI MUNI, 2023 66 Pointer Swizzling ◼ When: Automatically – immediately after reading On request – on first use/access Never – use map table instead ◼ Implementation: DB address updated to memory address ◼ Build a Translation table  store a pair (disk addr., memory addr.) for each record Flag (swizzled/unswizzled) in the pointer PA152, Vlastislav Dohnal, FI MUNI, 2023 67 Buffer Management ◼ DB features needed Keep some blocks in memory/cache ◼ Indices, join of relations, … ◼ Different strategies LRU, FIFO, pinned blocks, toss-immediate, … Buffer Management Strategies ◼ LRU Update timestamp on access to block → significant maintenance, but effective ◼ FIFO Store time of loading, no update on access → improper for highly accessed blocks ◼ e.g., root of B+ tree ◼ Pinned blocks Blocks allocated in buffers forever PA152, Vlastislav Dohnal, FI MUNI, 2023 68 ◼ “Clock” algorithm Efficient approximation of LRU Hand points to last read record Rotates to find a block to be written back to disk and replaced (flag is 0). On loading / accessing a block, set the flag to 1 Reset the flag on passing over the blocks ◼ Can implement pinned blocks. How? Buffer Management Strategies PA152, Vlastislav Dohnal, FI MUNI, 2023 69 1 1 0 0 1 1 1 00 0 1 0 Buffer Management: Example ◼ Join relations with LRU: Blocked Nested loops: LRU ineffective: blocks to process removed ◼ Need to pin blocks of relation r/s PA152, Vlastislav Dohnal, FI MUNI, 2023 70 For each bs in s do For each br in r do For each ts in bs do For each tr in br do Join tuples tr and ts r s s1 s2 r1 r2 r3 buffers Reading 1st block of s and process r Read 2nd block of s … 4 buffers s1 r3 s2 4 5 1 buffers s1 r1 r2 2 3 3 buffers s1 r3 r2 4 3 … PA152, Vlastislav Dohnal, FI MUNI, 2023 71 Outline ◼ Data elements and fields ◼ Records ◼ Block organization ◼ Properties and examples PA152, Vlastislav Dohnal, FI MUNI, 2023 72 Own Implementation ◼ Access type (operations) and their efficiency: Locating a record with given key ◼ Getting consecutive records Insert / update / delete of records Table scan File reorganization Flexibility Space requirements Complexity Efficiency PA152, Vlastislav Dohnal, FI MUNI, 2023 73 Specialized Systems ◼ BigTable Distributed storage for tuples, by Google Scalable up to petabytes (1PB=1000TB) F. Chang, J. Dean, S. Ghemawat, et al.: Bigtable: A Distributed Storage System for Structured Data, Seventh Symposium on Operating System Design and Implementation (OSDI), 2006. http://labs.google.com/papers/bigtable-osdi06.pdf ◼ HBase Distributed storage for tuples Open-source Apache projekt Hadoop http://hadoop.apache.org/ Properties of BigTable and HBase ◼ Not traditional relational database systems NoSQL databases ◼ Storage as a “key→value” map row_id, column_id, time → value Value is structured, but of variable schema ◼ Records are versioned see time component in the key ◼ Ordered by row_id PA152, Vlastislav Dohnal, FI MUNI, 2023 74 Lecture’s Takeaways ◼ Differences in storing values Handling NULL values in attributes ◼ Organization of records in blocks ◼ Pointers in DBMS Why and how; cooperation with memory pointers ◼ To revise Sequential file ◼ Record manipulation operations Index files (sparse/dense indexes) PA152, Vlastislav Dohnal, FI MUNI, 2023 75