Advanced SearchTechniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2 Facebook social graph 4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011] Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 3 Connections between political blogs Polarization of the network [Adamic-Glance, 2005] Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 4 Citation networks and Maps of science [Börner et al., 2012] domain2 domain1 domain3 router InternetPavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 5  Web as a directed graph:  Nodes: Webpages  Edges: Hyperlinks Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 6 I teach a class on Networks. CS224W: Classes are in the Gates building Computer Science Department at Stanford Stanford University Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 7  How to organize the Web?  First try: Human curated Web directories  Yahoo, DMOZ, LookSmart  Second try: Web Search  Information Retrieval investigates: Find relevant docs in a small and trusted set  Newspaper articles, Patents, etc.  But: Web is huge, full of untrusted documents, random things, web spam, etc. Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 8 2 challenges of web search:  (1) Web contains many sources of information Who to “trust”?  Trick: Trustworthy pages may point to each other!  (2) What is the “best” answer to query “newspaper”?  No single right answer  Trick: Pages that actually know about newspapers might all be pointing to many newspapers Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 9  All web pages are not equally “important” www.joe-schmoe.com vs. www.stanford.edu  There is large diversity in the web-graph node connectivity. Let’s rank the pages by the link structure! Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 10  We will cover the following Link Analysis approaches for computing importances of nodes in a graph:  Page Rank  Topic-Specific (Personalized) Page Rank  Web Spam Detection Algorithms 11Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Idea: Links as votes  Page is more important if it has more links  In-coming links? Out-going links?  Think of in-links as votes:  www.stanford.edu has 23,400 in-links  www.joe-schmoe.com has 1 in-link  Are all in-links are equal?  Links from important pages count more  Recursive question! 13Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) B 38.4 C 34.3 E 8.1 F 3.9 D 3.9 A 3.3 1.6 1.6 1.6 1.6 1.6 14Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Each link’s vote is proportional to the importance of its source page  If page j with importance rj has n out-links, each link gets rj / n votes  Page j’s own importance is the sum of the votes on its in-links 15Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) j ki rj/3 rj/3rj/3 rj = ri/3+rk/4 ri/3 rk/4  A “vote” from an important page is worth more  A page is important if it is pointed to by other important pages  Define a “rank” rj for page j Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 16   ji i j r r id y ma a/2 y/2 a/2 m y/2 The web in 1839 “Flow” equations: ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2 𝒅𝒊 … out-degree of node 𝒊  3 equations, 3 unknowns, no constants  No unique solution  All solutions equivalent modulo the scale factor  Additional constraint forces uniqueness:  𝒓 𝒚 + 𝒓 𝒂 + 𝒓 𝒎 = 𝟏  Solution: 𝒓 𝒚 = 𝟐 𝟓 , 𝒓 𝒂 = 𝟐 𝟓 , 𝒓 𝒎 = 𝟏 𝟓  Gaussian elimination method works for small examples, but we need a better method for large web-size graphs  We need a new formulation! 17Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2 Flow equations:  Stochastic adjacency matrix 𝑴  Let page 𝑖 has 𝑑𝑖 out-links  If 𝑖 → 𝑗, then 𝑀𝑗𝑖 = 1 𝑑𝑖 else 𝑀𝑗𝑖 = 0  𝑴 is a column stochastic matrix  Columns sum to 1  Rank vector 𝒓: vector with an entry per page  𝑟𝑖 is the importance score of page 𝑖  σ𝑖 𝑟𝑖 = 1  The flow equations can be written 𝒓 = 𝑴 ⋅ 𝒓 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 18   ji i j r r id  Remember the flow equation:  Flow equation in the matrix form 𝑴 ⋅ 𝒓 = 𝒓  Suppose page i links to 3 pages, including j Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 19 j i M r r = rj 1/3   ji i j r r id ri . . =  The flow equations can be written 𝒓 = 𝑴 ∙ 𝒓  So the rank vector r is an eigenvector of the stochastic web matrix M  In fact, its first or principal eigenvector, with corresponding eigenvalue 1  Largest eigenvalue of M is 1 since M is column stochastic (with non-negative entries)  We know r is unit length and each column of M sums to one, so 𝑴𝒓 ≤ 𝟏  We can now efficiently solve for r! The method is called Power iteration 20Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) NOTE: x is an eigenvector with the corresponding eigenvalue λ if: 𝑨𝒙 = 𝝀𝒙 r = M∙r y ½ ½ 0 y a = ½ 0 1 a m 0 ½ 0 m 21Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) y a m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2  Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks  Power iteration: a simple iterative scheme  Suppose there are N web pages  Initialize: r(0) = [1/N,….,1/N]T  Iterate: r(t+1) = M ∙ r(t)  Stop when |r(t+1) – r(t)|1 <  22Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)    ji t it j r r i )( )1( d di …. out-degree of node i |x|1 = 1≤i≤N|xi| is the L1 norm Can use any other vector norm, e.g., Euclidean  Power Iteration:  Set 𝑟𝑗 = 1/N  1: 𝑟′𝑗 = σ𝑖→𝑗 𝑟 𝑖 𝑑 𝑖  2: 𝑟 = 𝑟′  Goto 1  Example: ry 1/3 1/3 5/12 9/24 6/15 ra = 1/3 3/6 1/3 11/24 … 6/15 rm 1/3 1/6 3/12 1/6 3/15 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) y a m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 23 Iteration 0, 1, 2, … ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2  Power Iteration:  Set 𝑟𝑗 = 1/N  1: 𝑟′𝑗 = σ𝑖→𝑗 𝑟 𝑖 𝑑 𝑖  2: 𝑟 = 𝑟′  Goto 1  Example: ry 1/3 1/3 5/12 9/24 6/15 ra = 1/3 3/6 1/3 11/24 … 6/15 rm 1/3 1/6 3/12 1/6 3/15 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) y a m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 24 Iteration 0, 1, 2, … ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2  Imagine a random web surfer:  At any time 𝒕, surfer is on some page 𝒊  At time 𝒕 + 𝟏, the surfer follows an out-link from 𝒊 uniformly at random  Ends up on some page 𝒋 linked from 𝒊  Process repeats indefinitely  Let:  𝒑(𝒕) … vector whose 𝒊th coordinate is the prob. that the surfer is at page 𝒊 at time 𝒕  So, 𝒑(𝒕) is a probability distribution over pages Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 25   ji i j r r (i)dout j i1 i2 i3  Where is the surfer at time t+1?  Follows a link uniformly at random 𝒑 𝒕 + 𝟏 = 𝑴 ⋅ 𝒑(𝒕)  Suppose the random walk reaches a state 𝒑 𝒕 + 𝟏 = 𝑴 ⋅ 𝒑(𝒕) = 𝒑(𝒕) then 𝒑(𝒕) is stationary distribution of a random walk  Our original rank vector 𝒓 satisfies 𝒓 = 𝑴 ⋅ 𝒓  So, 𝒓 is a stationary distribution for the random walk )(M)1( tptp  j i1 i2 i3 26Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  A central result from the theory of random walks (a.k.a. Markov processes): For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0 27Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Does this converge?  Does it converge to what we want?  Are results reasonable?    ji t it j r r i )( )1( d Mrr or equivalently 29Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Example: ra 1 0 1 0 rb 0 1 0 1 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 30 = ba Iteration 0, 1, 2, …    ji t it j r r i )( )1( d  Example: ra 1 0 0 0 rb 0 1 0 0 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 31 = ba Iteration 0, 1, 2, …    ji t it j r r i )( )1( d 2 problems:  (1) Some pages are dead ends (have no out-links)  Random walk has “nowhere” to go to  Such pages cause importance to “leak out”  (2) Spider traps: (all out-links are within the group)  Random walked gets “stuck” in a trap  And eventually spider traps absorb all importance Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 32 Dead end  Power Iteration:  Set 𝑟𝑗 = 1  𝑟𝑗 = σ𝑖→𝑗 𝑟 𝑖 𝑑 𝑖  And iterate  Example: ry 1/3 2/6 3/12 5/24 0 ra = 1/3 1/6 2/12 3/24 … 0 rm 1/3 3/6 7/12 16/24 1 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 33 Iteration 0, 1, 2, … y a m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 1 ry = ry /2 + ra /2 ra = ry /2 rm = ra /2 + rm m is a spider trap All the PageRank score gets “trapped” in node m.  The Google solution for spider traps: At each time step, the random surfer has two options  With prob. , follow a link at random  With prob. 1-, jump to some random page  Common values for  are in the range 0.8 to 0.9  Surfer will teleport out of spider trap within a few time steps 34Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) y a m y a m  Power Iteration:  Set 𝑟𝑗 = 1  𝑟𝑗 = σ𝑖→𝑗 𝑟 𝑖 𝑑 𝑖  And iterate  Example: ry 1/3 2/6 3/12 5/24 0 ra = 1/3 1/6 2/12 3/24 … 0 rm 1/3 1/6 1/12 2/24 0 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 35 Iteration 0, 1, 2, … y a m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 ry = ry /2 + ra /2 ra = ry /2 rm = ra /2 Here the PageRank “leaks” out since the matrix is not stochastic.  Teleports: Follow random teleport links with probability 1.0 from dead-ends  Adjust matrix accordingly Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 36 y a m y a m y ½ ½ ⅓ a ½ 0 ⅓ m 0 ½ ⅓ y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 y a m Why are dead-ends and spider traps a problem and why do teleports solve the problem?  Spider-traps are not a problem, but with traps PageRank scores are not what we want  Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps  Dead-ends are a problem  The matrix is not column stochastic so our initial assumptions are not met  Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 37  Google’s solution that does it all: At each step, random surfer has two options:  With probability , follow a link at random  With probability 1-, jump to some random page  PageRank equation [Brin-Page, 98] 𝑟𝑗 = ෍ 𝑖→𝑗 𝛽 𝑟𝑖 𝑑𝑖 + (1 − 𝛽) 1 𝑁 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 38 di … out-degree of node i This formulation assumes that 𝑴 has no dead ends. We can either preprocess matrix 𝑴 to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.  PageRank equation [Brin-Page, ‘98] 𝑟𝑗 = ෍ 𝑖→𝑗 𝛽 𝑟𝑖 𝑑𝑖 + (1 − 𝛽) 1 𝑁  The Google Matrix A: 𝐴 = 𝛽 𝑀 + 1 − 𝛽 1 𝑁 𝑁×𝑁  We have a recursive problem: 𝒓 = 𝑨 ⋅ 𝒓 And the Power method still works!  What is  ?  In practice  =0.8,0.9 (make 5 steps on avg., jump) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 39 [1/N]NxN…N by N matrix where all entries are 1/N y a = m 1/3 1/3 1/3 0.33 0.20 0.46 0.24 0.20 0.52 0.26 0.18 0.56 7/33 5/33 21/33 . . . 40Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) y a m 13/15 7/15 1/2 1/2 0 1/2 0 0 0 1/2 1 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 y 7/15 7/15 1/15 a 7/15 1/15 1/15 m 1/15 7/15 13/15 0.8 + 0.2 M [1/N]NxN A  Key step is matrix-vector multiplication  rnew = A ∙ rold  Easy if we have enough main memory to hold A, rold, rnew  Say N = 1 billion pages  We need 4 bytes for each entry (say)  2 billion entries for vectors, approx 8GB  Matrix A has N2 entries  1018 is a large number! Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 42 ½ ½ 0 ½ 0 0 0 ½ 1 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 7/15 7/15 1/15 7/15 1/15 1/15 1/15 7/15 13/15 0.8 +0.2 A = ∙M + (1-) [1/N]NxN = A =  Suppose there are N pages  Consider page i, with di out-links  We have Mji = 1/|di| when i → j and Mji = 0 otherwise  The random teleport is equivalent to:  Adding a teleport link from i to every other page and setting transition probability to (1-)/N  Reducing the probability of following each out-link from 1/|di| to /|di|  Equivalent: Tax each page a fraction (1-) of its score and redistribute evenly 43Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  𝒓 = 𝑨 ⋅ 𝒓, where 𝑨𝒋𝒊 = 𝜷 𝑴𝒋𝒊 + 𝟏−𝜷 𝑵  𝑟𝑗 = σi=1 𝑁 𝐴𝑗𝑖 ⋅ 𝑟𝑖  𝑟𝑗 = σ𝑖=1 𝑁 𝛽 𝑀𝑗𝑖 + 1−𝛽 𝑁 ⋅ 𝑟𝑖 = σi=1 𝑁 𝛽 𝑀𝑗𝑖 ⋅ 𝑟𝑖 + 1−𝛽 𝑁 σi=1 𝑁 𝑟𝑖 = σi=1 𝑁 𝛽 𝑀𝑗𝑖 ⋅ 𝑟𝑖 + 1−𝛽 𝑁 since σ𝑟𝑖 = 1  So we get: 𝒓 = 𝜷 𝑴 ⋅ 𝒓 + 𝟏−𝜷 𝑵 𝑵 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 44 [x]N … a vector of length N with all entries x Note: Here we assumed M has no dead-ends  We just rearranged the PageRank equation 𝒓 = 𝜷𝑴 ⋅ 𝒓 + 𝟏 − 𝜷 𝑵 𝑵  where [(1-)/N]N is a vector with all N entries (1-)/N  M is a sparse matrix! (with no dead-ends)  10 links per node, approx 10N entries  So in each iteration, we need to:  Compute rnew =  M ∙ rold  Add a constant value (1-)/N to each entry in rnew  Note if M contains dead-ends then σ𝒋 𝒓𝒋 𝒏𝒆𝒘 < 𝟏 and we also have to renormalize rnew so that it sums to 1 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 45  Input: Graph 𝑮 and parameter 𝜷  Directed graph 𝑮 (can have spider traps and dead ends)  Parameter 𝜷  Output: PageRank vector 𝒓 𝒏𝒆𝒘  Set: 𝑟𝑗 𝑜𝑙𝑑 = 1 𝑁  repeat until convergence: σ 𝑗 𝑟𝑗 𝑛𝑒𝑤 − 𝑟𝑗 𝑜𝑙𝑑 > 𝜀  ∀𝑗: 𝒓′𝒋 𝒏𝒆𝒘 = σ𝒊→𝒋 𝜷 𝒓 𝒊 𝒐𝒍𝒅 𝒅 𝒊 𝒓′𝒋 𝒏𝒆𝒘 = 𝟎 if in-degree of 𝒋 is 0  Now re-insert the leaked PageRank: ∀𝒋: 𝒓𝒋 𝒏𝒆𝒘 = 𝒓′ 𝒋 𝒏𝒆𝒘 + 𝟏−𝑺 𝑵  𝒓 𝒐𝒍𝒅 = 𝒓 𝒏𝒆𝒘 46 where: 𝑆 = σ 𝑗 𝑟′𝑗 𝑛𝑒𝑤 If the graph has no dead-ends then the amount of leaked PageRank is 1-β. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S. Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Encode sparse matrix using only nonzero entries  Space proportional roughly to number of links  Say 10N, or 4*10*1 billion = 40GB  Still won’t fit in memory, but will fit on disk Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 47 0 3 1, 5, 7 1 5 17, 64, 113, 117, 245 2 2 13, 23 source node degree destination nodes  Measures generic popularity of a page  Will ignore/miss topic-specific authorities  Solution: Topic-Specific PageRank (next)  Uses a single measure of importance  Other models of importance  Solution: Hubs-and-Authorities  Susceptible to Link spam  Artificial link topographies created in order to boost page rank  Solution: TrustRank Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 48  Instead of generic popularity, can we measure popularity within a topic?  Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history”  Allows search queries to be answered based on interests of the user  Example: Query “Trojan” wants different pages depending on whether you are interested in sports, history and computer security 50Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Random walker has a small probability of teleporting at any step  Teleport can go to:  Standard PageRank: Any page with equal probability  To avoid dead-end and spider-trap problems  Topic Specific PageRank: A topic-specific set of “relevant” pages (teleport set)  Idea: Bias the random walk  When walker teleports, she pick a page from a set S  S contains only pages that are relevant to the topic  E.g., Open Directory (DMOZ) pages for a given topic/query  For each teleport set S, we get a different vector rS 51Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  To make this work all we need is to update the teleportation part of the PageRank formulation: 𝑨𝒊𝒋 = 𝜷 𝑴𝒊𝒋 + (𝟏 − 𝜷)/|𝑺| if 𝒊 ∈ 𝑺 𝜷 𝑴𝒊𝒋 + 𝟎 otherwise  A is stochastic!  We weighted all pages in the teleport set S equally  Could also assign different weights to pages!  Compute as for regular PageRank:  Multiply by M, then add a vector  Maintains sparseness Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 52 1 2 3 4 Suppose S = {1},  = 0.8 Node Iteration 0 1 2 … stable 1 0.25 0.4 0.28 0.294 2 0.25 0.1 0.16 0.118 3 0.25 0.3 0.32 0.327 4 0.25 0.2 0.24 0.261 0.2 0.5 0.5 1 1 1 0.4 0.4 0.8 0.8 0.8 53Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) S={1,2,3,4}, β=0.8: r=[0.13, 0.10, 0.39, 0.36] S={1,2,3} , β=0.8: r=[0.17, 0.13, 0.38, 0.30] S={1,2} , β=0.8: r=[0.26, 0.20, 0.29, 0.23] S={1} , β=0.8: r=[0.29, 0.11, 0.32, 0.26] S={1}, β=0.90: r=[0.17, 0.07, 0.40, 0.36] S={1} , β=0.8: r=[0.29, 0.11, 0.32, 0.26] S={1}, β=0.70: r=[0.39, 0.14, 0.27, 0.19]  Spamming:  Any deliberate action to boost a web page’s position in search engine results, incommensurate with page’s real value  Spam:  Web pages that are the result of spamming  This is a very broad definition  SEO industry might disagree!  SEO = search engine optimization  Approximately 10-15% of web pages are spam 55Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Early search engines:  Crawl the Web  Index pages by the words they contained  Respond to search queries (lists of words) with the pages containing those words  Early page ranking:  Attempt to order pages matching a search query by “importance”  First search engines considered:  (1) Number of times query words appeared  (2) Prominence of word position, e.g. title, header Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 56  As people began to use search engines to find things on the Web, those with commercial interests tried to exploit search engines to bring people to their own site – whether they wanted to be there or not  Example:  Shirt-seller might pretend to be about “movies”  Techniques for achieving high relevance/importance for a web page Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 57  How do you make your page appear to be about movies?  (1) Add the word movie 1,000 times to your page  Set text color to the background color, so only search engines would see it  (2) Or, run the query “movie” on your target search engine  See what page came first in the listings  Copy it into your page, make it “invisible”  These and similar techniques are term spam Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 58  Believe what people say about you, rather than what you say about yourself  Use words in the anchor text (words that appear underlined to represent the link) and its surrounding text  PageRank as a tool to measure the “importance” of Web pages Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 59  Our hypothetical shirt-seller looses  Saying he is about movies doesn’t help, because others don’t say he is about movies  His page isn’t very important, so it won’t be ranked high for shirts or movies  Example:  Shirt-seller creates 1,000 pages, each links to his with “movie” in the anchor text  These pages have no links in, so they get little PageRank  So the shirt-seller can’t beat truly important movie pages, like IMDB Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 60  Once Google became the dominant search engine, spammers began to work out ways to fool Google  Spam farms were developed to concentrate PageRank on a single page  Link spam:  Creating link structures that boost PageRank of a particular page Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 61  Three kinds of web pages from a spammer’s point of view  Inaccessible pages  Accessible pages  e.g., blog comments pages  spammer can post links to his pages  Owned pages  Completely controlled by spammer  May span multiple domain names 62Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  Spammer’s goal:  Maximize the PageRank of target page t  Technique:  Get as many links from accessible pages as possible to target page t  Construct “link farm” to get PageRank multiplier effect 63Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Inaccessible t Accessible Owned 1 2 M One of the most common and effective organizations for a link farm 64Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Millions of farm pages  Combating term spam  Analyze text using statistical methods  Similar to email spam filtering  Also useful: Detecting approximate duplicate pages  Combating link spam  Detection and blacklisting of structures that look like spam farms  Leads to another war – hiding and detecting spam farms  TrustRank = topic-specific PageRank with a teleport set of trusted pages  Example: .edu domains, similar domains for non-US schools Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 66  Basic principle: Approximate isolation  It is rare for a “good” page to point to a “bad” (spam) page  Sample a set of seed pages from the web  Have an oracle (human) to identify the good pages and the spam pages in the seed set  Expensive task, so we must make seed set as small as possible Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 67  Trust attenuation:  The degree of trust conferred by a trusted page decreases with the distance in the graph  Trust splitting:  The larger the number of out-links from a page, the less scrutiny the page author gives each out- link  Trust is split across out-links Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 68  HITS (Hypertext-Induced Topic Selection)  Is a measure of importance of pages or documents, similar to PageRank  Proposed at around same time as PageRank (‘98)  Goal: Say we want to find good newspapers  Don’t just find newspapers. Find “experts” – people who link in a coordinated way to good newspapers  Idea: Links as votes  Page is more important if it has more links  In-coming links? Out-going links? 69Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)  PageRank and HITS are two solutions to the same problem:  What is the value of an in-link from u to v?  In the PageRank model, the value of the link depends on the links into u  In the HITS model, it depends on the value of the other links out of u  The destinies of PageRank and HITS post-1998 were very different 70Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)