Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2 http://24x7presence.files.wordpress.com/2010/12/white_noise_facebook.jpg Facebook social graph 4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011] Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 3 Connections between political blogs Polarization of the network [Adamic-Glance, 2005] Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 4 Citation networks and Maps of science [Börner et al., 2012] domain2 domain1 domain3 router Internet Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 5 ¡Web as a directed graph: §Nodes: Webpages §Edges: Hyperlinks Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 6 I teach a class on Networks. CS224W: Classes are in the Gates building Computer Science Department at Stanford Stanford University Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 7 ¡How to organize the Web? ¡First try: Human curated Web directories §Yahoo, DMOZ, LookSmart ¡Second try: Web Search §Information Retrieval investigates: Find relevant docs in a small and trusted set §Newspaper articles, Patents, etc. §But: Web is huge, full of untrusted documents, random things, web spam, etc. § Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 8 http://www.imagstudios.com/images/yahoo-old.jpg ¡2 challenges of web search: ¡(1) Web contains many sources of information Who to “trust”? §Trick: Trustworthy pages may point to each other! § ¡(2) What is the “best” answer to query “newspaper”? §No single right answer §Trick: Pages that actually know about newspapers might all be pointing to many newspapers Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 9 ¡All web pages are not equally “important” §www.joe-schmoe.com vs. www.stanford.edu ¡ ¡There is large diversity in the web-graph node connectivity. Let’s rank the pages by the link structure! Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 10 ¡We will cover the following Link Analysis approaches for computing importances of nodes in a graph: §Page Rank §Topic-Specific (Personalized) Page Rank §Web Spam Detection Algorithms 11 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 13 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) B 38.4 C 34.3 E 8.1 F 3.9 D 3.9 A 3.3 1.6 1.6 1.6 1.6 1.6 14 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 15 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) j k i rj/3 rj/3 rj/3 rj = ri/3+rk/4 ri/3 rk/4 ¡A “vote” from an important page is worth more ¡A page is important if it is pointed to by other important pages ¡Define a “rank” rj for page j Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 16 y m a a/2 y/2 a/2 m y/2 The web in 1839 “Flow” equations: ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2 17 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2 Flow equations: Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 18 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 19 j i M r r = rj 1/3 ri . . = 20 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) r = M∙r y ½ ½ 0 y a = ½ 0 1 a m 0 ½ 0 m 21 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) y a m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2 ¡Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks ¡Power iteration: a simple iterative scheme §Suppose there are N web pages §Initialize: r(0) = [1/N,….,1/N]T §Iterate: r(t+1) = M ∙ r(t) §Stop when |r(t+1) – r(t)|1 < e § 22 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) di …. out-degree of node i |x|1 = å1≤i≤N|xi| is the L1 norm Can use any other vector norm, e.g., Euclidean Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) y a m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 23 Iteration 0, 1, 2, … ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2 > Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) y a m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 24 Iteration 0, 1, 2, … ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 25 j i1 i2 i3 j i1 i2 i3 26 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ¡A central result from the theory of random walks (a.k.a. Markov processes): ¡ For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0 27 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) or equivalently 29 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ¡ ¡ ¡ ¡ ¡Example: ¡ ra 1 0 1 0 ¡ rb 0 1 0 1 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 30 = b a Iteration 0, 1, 2, … ¡ ¡ ¡ ¡ ¡Example: ¡ ra 1 0 0 0 ¡ rb 0 1 0 0 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 31 = b a Iteration 0, 1, 2, … ¡2 problems: ¡(1) Some pages are dead ends (have no out-links) §Random walk has “nowhere” to go to §Such pages cause importance to “leak out” ¡ ¡(2) Spider traps: (all out-links are within the group) §Random walked gets “stuck” in a trap §And eventually spider traps absorb all importance Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 32 Dead end Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 33 Iteration 0, 1, 2, … y a m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 1 ry = ry /2 + ra /2 ra = ry /2 rm = ra /2 + rm m is a spider trap All the PageRank score gets “trapped” in node m. ¡The Google solution for spider traps: At each time step, the random surfer has two options §With prob. b, follow a link at random §With prob. 1-b, jump to some random page §Common values for b are in the range 0.8 to 0.9 ¡Surfer will teleport out of spider trap within a few time steps 34 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) y a m y a m Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 35 Iteration 0, 1, 2, … y a m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 ry = ry /2 + ra /2 ra = ry /2 rm = ra /2 Here the PageRank “leaks” out since the matrix is not stochastic. ¡Teleports: Follow random teleport links with probability 1.0 from dead-ends §Adjust matrix accordingly ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 36 y a m y a m y ½ ½ ⅓ a ½ 0 ⅓ m 0 ½ ⅓ y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 y a m ¡Why are dead-ends and spider traps a problem and why do teleports solve the problem? ¡Spider-traps are not a problem, but with traps PageRank scores are not what we want §Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps ¡Dead-ends are a problem §The matrix is not column stochastic so our initial assumptions are not met §Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 37 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 38 di … out-degree of node i Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 39 [1/N]NxN…N by N matrix where all entries are 1/N y a = m 1/3 1/3 1/3 0.33 0.20 0.46 0.24 0.20 0.52 0.26 0.18 0.56 7/33 5/33 21/33 . . . 40 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) y a m 13/15 7/15 1/2 1/2 0 1/2 0 0 0 1/2 1 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 y 7/15 7/15 1/15 a 7/15 1/15 1/15 m 1/15 7/15 13/15 0.8 + 0.2 M [1/N]NxN A ¡Key step is matrix-vector multiplication §rnew = A ∙ rold ¡Easy if we have enough main memory to hold A, rold, rnew ¡Say N = 1 billion pages §We need 4 bytes for each entry (say) §2 billion entries for vectors, approx 8GB §Matrix A has N2 entries §1018 is a large number! Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 42 ½ ½ 0 ½ 0 0 0 ½ 1 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 7/15 7/15 1/15 7/15 1/15 1/15 1/15 7/15 13/15 0.8 +0.2 A = b∙M + (1-b) [1/N]NxN = A = ¡Suppose there are N pages ¡Consider page i, with di out-links ¡We have Mji = 1/|di| when i → j and Mji = 0 otherwise ¡The random teleport is equivalent to: §Adding a teleport link from i to every other page and setting transition probability to (1-b)/N §Reducing the probability of following each out-link from 1/|di| to b/|di| §Equivalent: Tax each page a fraction (1-b) of its score and redistribute evenly 43 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 44 [x]N … a vector of length N with all entries x Note: Here we assumed M has no dead-ends Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 45 46 If the graph has no dead-ends then the amount of leaked PageRank is 1-β. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S. Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ¡Encode sparse matrix using only nonzero entries §Space proportional roughly to number of links §Say 10N, or 4*10*1 billion = 40GB §Still won’t fit in memory, but will fit on disk Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 47 0 3 1, 5, 7 1 5 17, 64, 113, 117, 245 2 2 13, 23 source node degree destination nodes ¡Measures generic popularity of a page §Will ignore/miss topic-specific authorities §Solution: Topic-Specific PageRank (next) ¡Uses a single measure of importance §Other models of importance §Solution: Hubs-and-Authorities ¡Susceptible to Link spam §Artificial link topographies created in order to boost page rank §Solution: TrustRank § Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 48 ¡Instead of generic popularity, can we measure popularity within a topic? ¡Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history” ¡Allows search queries to be answered based on interests of the user §Example: Query “Trojan” wants different pages depending on whether you are interested in sports, history and computer security 50 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ¡Random walker has a small probability of teleporting at any step ¡Teleport can go to: §Standard PageRank: Any page with equal probability §To avoid dead-end and spider-trap problems §Topic Specific PageRank: A topic-specific set of “relevant” pages (teleport set) ¡Idea: Bias the random walk §When walker teleports, she pick a page from a set S §S contains only pages that are relevant to the topic §E.g., Open Directory (DMOZ) pages for a given topic/query §For each teleport set S, we get a different vector rS 51 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 52 1 2 3 4 Suppose S = {1}, b = 0.8 Node Iteration 0 1 2 … stable 1 0.25 0.4 0.28 0.294 2 0.25 0.1 0.16 0.118 3 0.25 0.3 0.32 0.327 4 0.25 0.2 0.24 0.261 0.2 0.5 0.5 1 1 1 0.4 0.4 0.8 0.8 0.8 53 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) S={1,2,3,4}, β=0.8: r=[0.13, 0.10, 0.39, 0.36] S={1,2,3} , β=0.8: r=[0.17, 0.13, 0.38, 0.30] S={1,2} , β=0.8: r=[0.26, 0.20, 0.29, 0.23] S={1} , β=0.8: r=[0.29, 0.11, 0.32, 0.26] S={1}, β=0.90: r=[0.17, 0.07, 0.40, 0.36] S={1} , β=0.8: r=[0.29, 0.11, 0.32, 0.26] S={1}, β=0.70: r=[0.39, 0.14, 0.27, 0.19] 55 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) ¡Early search engines: §Crawl the Web §Index pages by the words they contained §Respond to search queries (lists of words) with the pages containing those words ¡Early page ranking: §Attempt to order pages matching a search query by “importance” §First search engines considered: §(1) Number of times query words appeared §(2) Prominence of word position, e.g. title, header Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 56 ¡As people began to use search engines to find things on the Web, those with commercial interests tried to exploit search engines to bring people to their own site – whether they wanted to be there or not ¡Example: §Shirt-seller might pretend to be about “movies” ¡Techniques for achieving high relevance/importance for a web page ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 57 ¡How do you make your page appear to be about movies? §(1) Add the word movie 1,000 times to your page §Set text color to the background color, so only search engines would see it §(2) Or, run the query “movie” on your target search engine §See what page came first in the listings §Copy it into your page, make it “invisible” ¡These and similar techniques are term spam ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 58 Do not forget to tell the joke "I used to do this when I was young" when talking about point (1) :D Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 59 ¡Our hypothetical shirt-seller looses §Saying he is about movies doesn’t help, because others don’t say he is about movies §His page isn’t very important, so it won’t be ranked high for shirts or movies ¡Example: §Shirt-seller creates 1,000 pages, each links to his with “movie” in the anchor text §These pages have no links in, so they get little PageRank §So the shirt-seller can’t beat truly important movie pages, like IMDB ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 60 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 61 ¡Three kinds of web pages from a spammer’s point of view §Inaccessible pages §Accessible pages §e.g., blog comments pages §spammer can post links to his pages §Owned pages §Completely controlled by spammer §May span multiple domain names 62 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 63 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) > Inaccessible t Accessible Owned 1 2 M One of the most common and effective organizations for a link farm 64 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Millions of farm pages ¡Combating term spam §Analyze text using statistical methods §Similar to email spam filtering §Also useful: Detecting approximate duplicate pages ¡Combating link spam §Detection and blacklisting of structures that look like spam farms §Leads to another war – hiding and detecting spam farms §TrustRank = topic-specific PageRank with a teleport set of trusted pages §Example: .edu domains, similar domains for non-US schools Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 66 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 67 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 68 ¡HITS (Hypertext-Induced Topic Selection) §Is a measure of importance of pages or documents, similar to PageRank §Proposed at around same time as PageRank (‘98) ¡Goal: Say we want to find good newspapers §Don’t just find newspapers. Find “experts” – people who link in a coordinated way to good newspapers ¡Idea: Links as votes §Page is more important if it has more links §In-coming links? Out-going links? ¡ 69 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 70 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)