Advanced SearchTechniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz  Classic model of algorithms ▪ You get to see the entire input, then compute some function of it ▪ In this context, “offline algorithm”  Online Algorithms ▪ You get to see the input one piece at a time, and need to make irrevocable decisions along the way ▪ Similar to the data stream model 2Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1 2 3 4 a b c dBoys Girls 4Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Nodes: Boys and Girls; Edges: Preferences Goal: Match boys to girls so that maximum number of preferences is satisfied M = {(1,a),(2,b),(3,d)} is a matching Cardinality of matching = |M| = 3 1 2 3 4 a b c dBoys Girls 5Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1 2 3 4 a b c dBoys Girls M = {(1,c),(2,b),(3,d),(4,a)} is a perfect matching 6Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Perfect matching … all vertices of the graph are matched Maximum matching … a matching that contains the largest possible number of matches  Problem: Find a maximum matching for a given bipartite graph ▪ A perfect one if it exists  There is a polynomial-time offline algorithm based on augmenting paths (Hopcroft & Karp 1973, see http://en.wikipedia.org/wiki/Hopcroft-Karp_algorithm)  But what if we do not know the entire graph upfront? 7Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Initially, we are given the set boys  In each round, one girl’s choices are revealed ▪ That is, girl’s edges are revealed  At that time, we have to decide to either: ▪ Pair the girl with a boy ▪ Do not pair the girl with any boy  Example of application: Assigning tasks to servers 8Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 9 1 2 3 4 a b c d (1,a) (2,b) (3,d)  Greedy algorithm for the online graph matching problem: ▪ Pair the new girl with any eligible boy ▪ If there is none, do not pair girl  How good is the algorithm? 10Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  For input I, suppose greedy produces matching Mgreedy while an optimal matching is Mopt Competitive ratio = minall possible inputs I (|Mgreedy|/|Mopt|) (what is greedy’s worst performance over all possible inputs I) 11Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Consider a case: Mgreedy≠ Mopt  Consider the set G of girls matched in Mopt but not in Mgreedy  Then every boy B adjacent to girls in G is already matched in Mgreedy: ▪ If there would exist such non-matched (by Mgreedy) boy adjacent to a non-matched girl then greedy would have matched them  Since boys B are already matched in Mgreedy then (1) |Mgreedy|≥ |B| Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 12 a b c d G={ }B={ } Mopt 1 2 3 4  Summary so far: ▪ Girls G matched in Mopt but not in Mgreedy ▪ (1) |Mgreedy|≥ |B|  There are at least |G| such boys (|G|  |B|) otherwise the optimal algorithm couldn’t have matched all girls in G ▪ So: |G|  |B|  |Mgreedy|  By definition of G also: |Mopt|  |Mgreedy| + |G| ▪ Worst case is when |G| = |B| = |Mgreedy|  |Mopt|  2|Mgreedy| then |Mgreedy|/|Mopt|  1/2 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 13 a b c d G={ }B={ } Mopt 1 2 3 4 1 2 3 4 a b c (1,a) (2,b) d 14Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Banner ads (1995-2001) ▪ Initial form of web advertising ▪ Popular websites charged X$ for every 1,000 “impressions” of the ad ▪ Called “CPM” rate (Cost per thousand impressions) ▪ Modeled similar to TV, magazine ads ▪ From untargeted to demographically targeted ▪ Low click-through rates ▪ Low ROI for advertisers Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 16 CPM…cost per mille Mille…thousand in Latin  Introduced by Overture around 2000 ▪ Advertisers bid on search keywords ▪ When someone searches for that keyword, the highest bidder’s ad is shown ▪ Advertiser is charged only if the ad is clicked on  Similar model adopted by Google with some changes around 2002 ▪ Called Adwords 17Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 18Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Performance-based advertising works! ▪ Multi-billion-dollar industry  Interesting problem: What ads to show for a given query? 19Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Given: ▪ 1. A set of bids by advertisers for search queries ▪ 2. A click-through rate for each advertiser-query pair ▪ 3. A budget for each advertiser (say for 1 month) ▪ 4. A limit on the number of ads to be displayed with each search query  Respond to each search query with a set of advertisers such that: ▪ 1. The size of the set is no larger than the limit on the number of ads per query ▪ 2. Each advertiser has bid on the search query ▪ 3. Each advertiser has enough budget left to pay for the ad if it is clicked upon Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 20  A stream of queries arrives at the search engine: q1, q2, …  Several advertisers bid on each query  When query qi arrives, search engine must pick a subset of advertisers whose ads are shown  Goal: Maximize search engine’s revenues ▪ Simple solution: Instead of raw bids, use the “expected revenue per click” (i.e., Bid*CTR)  Clearly we need an online algorithm! 21Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 22 Advertiser Bid CTR Bid * CTR A B C $1.00 $0.75 $0.50 1% 2% 2.5% 1 cent 1.5 cents 1.125 cents Click through rate Expected revenue  Two complications: ▪ Budget ▪ CTR of an ad is unknown  Each advertiser has a limited budget ▪ Search engine guarantees that the advertiser will not be charged more than their daily budget Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 23  CTR: Each ad has a different likelihood of being clicked ▪ Advertiser 1 bids $2, click probability = 0.1 ▪ Advertiser 2 bids $1, click probability = 0.5 ▪ Clickthrough rate (CTR) is measured historically ▪ Very hard problem: Exploration vs. exploitation Exploit: Should we keep showing an ad for which we have good estimates of click-through rate or Explore: Shall we show a brand new ad to get a better sense of its click-through rate Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 24  Our setting: Simplified environment ▪ There is 1 ad shown for each query ▪ All advertisers have the same budget B ▪ All ads are equally likely to be clicked ▪ Value of each ad is the same (=1)  Simplest algorithm is greedy: ▪ For a query pick any advertiser who has bid 1 for that query ▪ Competitive ratio of greedy is 1/2 25Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Two advertisers A1 and A2 ▪ A1 bids on query x, A2 bids on x and y ▪ Both have budgets of $4  Query stream: x x x x y y y y ▪ Worst case greedy choice: A2 A2 A2 A2 _ _ _ _ ▪ Optimal: A1 A1 A1 A1 A2 A2 A2 A2 ▪ Competitive ratio = ½  This is the worst case! ▪ Note: Greedy algorithm is deterministic – it always resolves draws in the same way 26Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  BALANCE Algorithm by Mehta, Saberi, Vazirani, and Vazirani ▪ For each query, pick the advertiser with the largest unspent budget ▪ Break ties arbitrarily (but in a deterministic way) 27Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Two advertisers A1 and A2 ▪ A1 bids on query x, A2 bids on x and y ▪ Both have budgets of $4  Query stream: x x x x y y y y  BALANCE choice: A1 A2 A1 A2 A2 A2 _ _ ▪ Optimal: A1 A1 A1 A1 A2 A2 A2 A2  In general: For BALANCE on 2 advertisers Competitive ratio = ¾ 28Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Consider simple case (w.l.o.g.): ▪ 2 advertisers, A1 and A2, each with budget B (1) ▪ Optimal solution exhausts both advertisers’ budgets  BALANCE must exhaust at least one advertiser’s budget: ▪ If not, we can allocate more queries ▪ Whenever BALANCE makes a mistake (both advertisers bid on the query), advertiser’s unspent budget only decreases ▪ Since optimal exhausts both budgets, one will for sure get exhausted ▪ Assume BALANCE exhausts A2’s budget, but allocates x queries fewer than the optimal ▪ Revenue: BAL = 2B - x Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 29 A1 A2 B xy B A1 A2 x Optimal revenue = 2B Assume Balance gives revenue = 2B-x = B+y Unassigned queries should be assigned to A2 (if we could assign to A1 we would since we still have the budget) Goal: Show we have y  x Case 1) ≤ ½ of A1’s queries got assigned to A2 then 𝒚 ≥ 𝑩/𝟐 Case 2) > ½ of A1’s queries got assigned to A2 then 𝒙 ≤ 𝑩/𝟐 and 𝒙 + 𝒚 = 𝑩 Balance revenue is minimum for 𝒙 = 𝒚 = 𝑩/𝟐 Minimum Balance revenue = 𝟑𝑩/𝟐 Competitive Ratio = 3/4 Queries allocated to A1 in the optimal solution Queries allocated to A2 in the optimal solution Not used 30Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) BALANCE exhausts A2’s budget xy B A1 A2 x Not used  In the general case with N advertisers, worst competitive ratio of BALANCE is 1–1/e = approx. 0.63 ▪ Interestingly, no online algorithm has a better competitive ratio! 31Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Customer X ▪ Buys Metallica CD ▪ Buys Megadeth CD  Customer Y ▪ Does search on Metallica ▪ Recommender system suggests Megadeth from data collected about customer X Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 33 Items Search Recommendations Products, web sites, blogs, news items, … 34Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Examples:  Shelf space is a scarce commodity for traditional retailers ▪ Also: TV networks, movie theaters,…  Web enables near-zero-cost dissemination of information about products ▪ From scarcity to abundance  More choice necessitates better filters ▪ Recommendation engines ▪ How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 35 Source: Chris Anderson (2004) 36Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Editorial and hand curated ▪ List of favorites ▪ Lists of “essential” items  Simple aggregates ▪ Top 10, Most Popular, Recent Uploads  Tailored to individual users ▪ Amazon, Netflix, … 37Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  X = set of Customers  S = set of Items  Utility function u: X × S → R ▪ R = set of ratings ▪ R is a totally ordered set ▪ e.g., 0-5 stars, real number in [0,1] 38Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 0.4 10.2 0.30.5 0.21 Avatar LOTR Matrix Pirates Alice Bob Carol David 39Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  (1) Gathering “known” ratings for matrix ▪ How to collect the data in the utility matrix  (2) Extrapolate unknown ratings from the known ones ▪ Mainly interested in high unknown ratings ▪ We are not interested in knowing what you don’t like but what you like  (3) Evaluating extrapolation methods ▪ How to measure success/performance of recommendation methods 40Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Explicit ▪ Ask people to rate items ▪ Doesn’t work well in practice – people can’t be bothered  Implicit ▪ Learn ratings from user actions ▪ E.g., purchase implies high rating ▪ What about low ratings? 41Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Key problem: Utility matrix U is sparse ▪ Most people have not rated most items ▪ Cold start: ▪ New items have no ratings ▪ New users have no history  Three approaches to recommender systems: ▪ 1) Content-based ▪ 2) Collaborative ▪ 3) Latent factor based 42Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Main idea: Recommend items to customer x similar to previous items rated highly by x Example:  Movie recommendations ▪ Recommend movies with same actor(s), director, genre, …  Websites, blogs, news ▪ Recommend other sites with “similar” content Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 44 likes Item profiles Red Circles Triangles User profile match recommend build 45J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org  User profile possibilities: ▪ Weighted average of rated item profiles ▪ Variation: weight by difference from average rating for item ▪ …  Prediction heuristic: ▪ Given user profile x and item profile i, estimate 𝑢(𝒙, 𝒊) = cos(𝒙, 𝒊) = 𝒙·𝒊 | 𝒙 |⋅| 𝒊 | J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 46 Harnessing quality judgments of other users  Consider user x  Find set N of other users whose ratings are “similar” to x’s ratings  Estimate x’s ratings based on ratings of users in N 48Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) x N