VB036 - English 2 Textbook _ Authors: Martin Dvořák, Kateřina Repová, Ivana Túlajová All the texts in this textbook have been taken from Encyclopedia of Database Systems (Ozsu, M. Tamer; Liu, Ling (Eds.) 2009. Approx. 4100 p. 60 illus. ISBN: 978-0-387-49616-0) and abridged. Tento projekt je spDlufinjn:nván Evropským sniiálním fondem a stát nim rozpočtem České republiky, • EVROPSKÁ UNIE IT*1* M NI£"EnSTVO S-ÍOLSTVÍ. M -ADEÍE A TĚLOVÝCHOV P" I INVESTICE DO ROZVOJE VZDĚLÁVÁNÍ Phonetics Vowels Long Vowels il sheep Short Vowels I e ship head ai farm ae hat 9 above UI coo U foot 3«, mother (US) 31 horse D sock (UK) worm (US) 31 bird cup Consonants Voiced b book Z zoo j yes Voiceless P s pen say d 3 w t S day vision we town she 9 m k give jump moon cat cheese V very I look n name f fish ô r n e the run sing think Diphthongs ei day OU nose (US) ai eye 13 ear (UK) 31 boy ea hair (UK) au mouth U3 pure (UK) 3U nose (UK) Other Symbols h [|aend] hand \ ['bA\.&] butter (US) D ['kwaes.B] croissant (UK) u [,in.flu'en.za] influenza ['haepj] happy ['lit.!] little al, 3m, 3n can be pronounced either: dl or!, etc.: ['lei.bH] = ['lei.bal] = ['lei.b!] r linking r is pronounced only before a vowel in British English: [foir] four: [foiraep.lz] four apples ' main stress [,ek.spek'tei.J~3n] expectation , secondary stress [^|'tel] retell Syllable division ['SIS.tBITl] System Cambridge Dictionary Online [Internet]. c2010 [cited 2010 Feb 24]. Phonetics. Available from: . Annotation-based Image Retrieval XIN-JING WANG, LEI ZHANG Microsoft Research Asia, Beijing, China 5 Definition Given (i) a textual query and (ii) a set of images and their annotations (phrases or keywords) annotation-10 based image retrieval systems retrieve images according to the matching score of the query and the corresponding annotations. There are three levels of queries according to Eakins [7]: 15 • Level 1: Retrieval by primitive features such as color, texture, shape or the spatial location of image elements, typically querying by an example, i.e., "find pictures like this." • Level 2: Retrieval by derived features, with 20 some degree of logical inference. For example, "find a picture of a flower." • Level 3: Retrieval by abstract attributes, involving a significant amount of high-level reasoning about the purpose of the objects or 25 scenes depicted. This includes retrieval of named events, of pictures with emotional or religious significance, etc., e.g., "find pictures of a joyful crowd." 30 Together, levels 2 and 3 are referred to as semantic image retrieval, which can also be regarded as annotation-based image retrieval. Historical Background 35 There are two frameworks of image retrieval [6]: annotation-based (or more popularly, text-based) and content-based. The annotation-based approach can be tracked back to the 1970s. In such systems, the images 40 are manually annotated by text descriptors, which are used by a database management system (DBMS) to perform image retrieval. There are two disadvantages with this approach. The first is that a considerable level of human labor is required for manual annotation. The 45 second is that because of the subjectivity of human perception, the manually labeled annotations may not converge. To overcome the aforementioned disadvantages, content-based image retrieval (CBIR) was introduced in the early 1980s. In CBIR, images are 50 indexed by their visual content, such as color, texture, shapes. In the past decade, several commercial products and experimental prototype systems were developed, such as QBIC, Photobook, Virage, VisualSEEK, Netra, SIMPLIcity. 55 However, the discrepancy between the limited descriptive power of low-level image features and the richness of user semantics, which is referred to as the "semantic gap" bounds the performance of CBIR. On the other hand, due to the explosive growth of visual 60 data (both online and offline) and the phenomenal success in Web search, there has been increasing expectation for image search technologies. For these reasons, the main challenge of image retrieval is understanding media by bridging the semantic gap 65 between the bit stream and the visual content interpretation by humans [3]. Hence, the focus is on automatic image annotation techniques. Foundations 70 The state-of-the-art image auto-annotation techniques include four main categories [3,6]: (i) using machine learning tools to map low-level features to concepts, (ii) exploring the relations between image content and 75 the textual terms in the associated metadata, (iii) generating semantic template (ST) to support high-level image retrieval, (iv) making use of both the visual content of images and the textual information obtained from the Web to learn the annotations. 80 Machine Learning Approaches A typical approach is using Support Vector Machine (SVM) as a discriminative classifier over image low-85 level features. Though straightforward, it has been shown effective in detecting a number of visual concepts. Recently there has been a surge of interest in leveraging and handling relational data, e.g., images 90 and their surrounding texts. Blei et al. [1] extend the Latent Dirichlet Allocation (LDA) model to the mix of words and images and proposed a Correlation LDA model. 95 Relation Exploring Approaches Another notable direction for annotating image visual content is exploring the relations among image content and the textual terms in the associated metadata. Such metadata are abundant, but are often incomplete and noisy. By exploring the co-occurrence relations among the images and the words, the initial labels may be filtered and propagated from initial labeled images to additional relevant ones in the same collection [3]. Jeon et al. [5] proposed a cross-media relevance model to learn the joint probabilistic distributions of the words and the visual tokens in each image, which are then used to estimate the likelihood of detecting a specific semantic concept in a new image. 100 105 110 3 Semantic Template Approaches Though it is not yet widely used in the techniques mentioned above, Semantic Template (ST) is a 115 promising approach in annotation-based image retrieval (a map between high-level concept and low-level visual features). Chang and Chen [2] show a typical example of ST, in which a visual template is a set of icons or example 120 scenes/objects denoting a personalized view of concepts such as meetings, sunset, etc. The generation of a ST is based on user definition. For a concept, the objects, their spatial and temporal constraints, and the weights of each feature of each object are specified. 125 This initial query scenario is provided to the system, and then through the interaction with users, the system finally converges to a small set of exemplar queries that "best" match (maximize the recall) the concept in the user's mind. 130 Large-Scale Web Data Supported Approaches Good scalability to a large set of concepts is required in ensuring the practicability of image annotation. On 135 the other hand, images from the Web repositories, e.g., Web search engines or photo sharing sites, come with free but less reliable labels. In [9], a novel search-based annotation framework was proposed to explore such Web-based resources. Fundamentally, it is to 140 automatically expand the text labels of an image of interest, using its initial keyword and image content. The process of [9] is shown in Fig. 1. It contains three stages: the text-based search stage, the content-based search stage, and the annotation learning stage, which 145 are differentiated using different colors (black, brown, blue) and labels (A., B., C.).When a user submits a query image as well as a query keyword, the system first uses the keyword to search a large-scale Web image database (2.4 million images crawled from 150 several Web photo forums), in which images are associated with meaningful but noisy descriptions, as tagged by "A." in Fig. 1. The intention of this step is to select a semantically relevant image subset from the original pool. 155 Visual feature-based search is then applied to further filter the subset and save only those visually similar images (the path labeled by "B." in Fig. 1). By these means, a group of image search results which are both semantically and visually similar to the query image 160 are obtained. Finally, based on the search results, the system collects their associated textual descriptions and applies the Search Result Clustering (SRC) algorithm to group the images into clusters. 165 (Abridged) Figure 1: (Lake) 170 Recommended Reading 1. Blei D. and Jordan M.I.Modeling Annotated Data. In Proc. 26th Annual Int. ACM SIGIR Conf. on 175 Research and Development in Information Retrieval, 2003, pp. 127-134. 2. Chang S.-F, Chen W., and Sundaram H. Semantic Visual Templates: Linking Visual Features to Semantics. In Proc. Int. Conf. on Image Processing, 180 Vol. 3. 1998, pp. 531-534. 3. Chang S.-F., Ma W.-Y., and Smeulders A. Recent Advances and Challenges of Semantic Image/Video Search. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2007, pp. 1205-1208. 185 4. Eakins J. and Graham M. Content-based image retrieval, Technical Report, University of Northumbria at Newcastle, 1999. 5. Jeon J., Lavrenko V., and Manmatha R. Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models, In 190 Proc. 26th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 119-126. 6. Liu Y., Zhang D., Lu G., and MaW.-Y. A survey of content-based image retrieval with high-level 195 semantics. Pattern Recognition., 40(1):262 -282, 2007. 7. Long F., Zhang H.J., and Feng D.D. Fundamentals of content-based image retrieval. In Multimedia Information Retrieval and Management, D. Feng (eds.). Springer, 2003. 200 8. Rui Y., Huang T.S., and Chang S.-F. Image retrieval: current techniques, promising directions, and open issues, J. Visual Commun. Image Represent. 10(4):39-62, 1999. 9. Wang X.-J., Zhang L., Jing F., and Ma W.-Y. 205 AnnoSearch: Image Auto-Annotation by Search, Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2006, pp. 1483-1490. 10. Zhuang Y., Liu X., and Pan Y. Apply Semantic Template to Support Content-based Image Retrieval. 4 210 In Proc. SPIE, Storage and Retrieval for Media Databases, vol. 3972, December 1999, pp. 442^149. 5 Answer the following questions: 1) Describe retrieval by primitive features. 2) What is meant by abstract attributes in the context of retrieving images'? 3) What is meant by semantic image retrieval? Why is it called semantic? 4) What is annotation and what are its disadvantages? 5) What are the machine learning tools good for in image retrieval? 6) Describe the term of relation exploring approach. 7) What is the generation of semantic template based on? Match the following terms and their definitions: 1) semantic gap 2) SVM 3) metadata 4) CBIR a) a machine learning approach b) data about data c) the discrepancy between the limited descriptive potential of low-level image features and the richness of user's description d) getting back stored data objects byinspecting their visual characteristic, suchas color, texture, shapes. Mark the following statements as true or false: 1) Retrieval by derived features can be based on color of the picture. 2) The content-based approach uses descriptors to retrieve images. 3) SVM works on low-level features. 4) Images from the Web repositories come with highly reliable labels. 5) Search Result Clustering refers to grouping information about images. 6 Vocabulary abundant [a'bAn.dant] - hojný, překypující query ['kwia.ri] ® ['kwir.i] - dotaz annotation [,aen.ai/'tei.J"3n] ©[-a-] - anotace classifier ['klae.si.faia] - klasifikátor co-occurrence [kaua'ka.rans] © [koua'ka.rans] -společný výskyt descriptor [dis'kripta] - deskriptor, popisek discriminative [dis'kriminativ] © [dis'krimanativ] -rozlišující, schopný rozlišovat exemplar [ig'zem.pla:1'] © [-plair] - typický příklad, vzor, model explosive [ik'splau.siv] ©[-'splou-] - explozivní, výbušný hence [henřs] - tudíž (formální) incomplete [,in.kam'pli:t] - nekompletní likelihood ['lai.kli.hud] - pravděpodobnost machine learning [ma'J"i:n] ['l3:.nin] © [-] [1?:-] -strojové učení metadata ['metadeita] - metadata, data popisující jiná data noisy ['nDi.zi] - obsahující šum, hlučný phenomenal [fa'nom.i.n3!] © [-'na:.ma-] - úžasný, výjimečný pool [pu:l] - úložiště, zásoba, bazén relational [ri'lei/anal] - relační, vztahový retrieval [rťtriival] - získávání, vyhledávání scenario [si'na:.ri.au] © [ sa'ner.i.ou] - scénář semantic [si'maen.tik] © [-tik-] - sémantický, významový subset ['sAb.set] - podmnožina token ['tau.k3n] © ['tou-] - znamení, znak, symbol to annotate st ['aen.aa.teit] © [-a-] - anotovat něco, vybavit anotací to bridge st [brid3] - překlenout něco to leverage st ['Ii:.var.id3] © [ 'Iev.av.1d3] - využívat k užitku to overcome st (overcome, overcame, overcome) [,au.va'ln] - klikatě, cikcak in contrast to ['kon.traist] © ['kain.traest] - naproti tomu, na rozdíl in parallel to ['paer.a.lel] © ['per-] - souběžně in particular [pa'tik.ju.lar] © [pav'tik.ja.la*] - obzvláště widely used - hojně využívaný with the aim of agreeing - s cílem shodnout se 18 Processor Cache PETER BONCZ CWI, Amsterdam, The Netherlands 5 Definition To hide the high latencies of DRAM access, modern computer architecture now features a memory 10 hierarchy that besides DRAM also includes SRAM cache memories, typically located on the CPU chip. Memory access first checks these caches, which takes only a few cycles. Only if the needed data is not found, an expensive memory access is needed. 15 Key Points CPU caches are SRAM memories located on the CPU chip, intended to hide the high latency of accessing off- 20 chip DRAM memory. Caches are organized in cache lines (typically 64 bytes). In a fully-associative cache, each memory line can be stored in any location of the cache. To make checking the cache fast, however, CPU caches tend to have limited associativity, such that 25 storage of a particular cache line is possible in only 2 or 4 locations. Thus only 2 or 4 locations need to be checked during lookup (these are called 2- resp. 4-way associative caches). The cache hit ratio is determined by the spatial and temporal locality of the memory 30 access generated by the running program(s). Cache misses can either be compulsory misses (getting the cache lines of all used memory once), capacity misses (caused by the cache being too small to keep all multiply used lines in cache), or conflict misses (due to 35 the limited associativity of the cache). Most modern CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a Translation Lookaside Buffer(TLB) 40 used to speed up virtual-to-physical address translation for both executable instructions and data. The TLB is not organized in cache lines; it simply holds pairs of (virtual, logical) page mappings, typically a fairly limited amount (e.g., 64). In practice, this means that 45 algorithms that repeatedly touch memory in more than 64 pages (whose size is often 4 KB) shortly after each other, run into TLB thrashing. This problem can sometimes be mitigated by setting a large virtual memory page size, or by using special large OS pages 50 (sometimes supported in the CPU with a separate, smaller, TLB for large pages). Another issue is the tradeoff between latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use 55 multiple levels of cache, with small fast caches backed up by larger slower caches. Multi-level caches generally operate by checking the smallest Level 1 (LI) cache first; if it hits, the processor proceeds at high speed. If the smaller cache misses, the next larger 60 cache (L2) is checked, and so on, before external memory is checked. As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache. 65 For multi-CPU and multi-core systems, the fact that some of the higher levels of cache are not shared, yet provide coherent access to shared memory, causes additional cache-coherency inter-core communication to invalid stale copies of cache lines on other cores 70 when one core modifies it. In multi-core CPUs, an important issue is that cache level is shared among all cores - this cache level is on the one hand a potential hot-spot for cache conflicts, on the other hand provides an opportunity for very fast inter-core data exchange. 75 In case of sequential data processing, the memory controller or memory chipset in modern computers often detects this access pattern and starts requesting the subsequent cache lines in advance. This is called hardware prefetching. Prefetching effectively allows 80 hiding compulsory cache misses. Without prefetching, the effective memory bandwidth would equate cache line size divided by memory latency (e.g., 64/50ns = 1.2 GB/s). Thanks to hardware prefetching, modern computer architectures reach four times that on 85 sequential access. Modern CPUs also offer explicit prefetching instructions, which a software writer can exploit to perform (non-sequential) memory access in advance, hiding their latency. In database systems, such software prefetching has successfully been used 90 in making hash-table lookup faster (e.g., in hash-join and hash aggregation). In database systems, a series of cache-conscious data storage layouts (e.g., DSM and PAX) have been proposed to improve cache line usage. Also, a number 95 of cache-conscious query processing algorithms, such as cache-partitioned hash join and hash-join using memory prefetching, have been studied. In the area of data structures and theoretical computer science, there has recently been interest in cache-oblivious 100 algorithms that regardless of the exact parameters of the memory hierarchy (number of levels, cache size, cache line sizes and latencies) perform well. (Abridged) 105 19 Answer the following questions: 1) What kinds of independent caches are mentioned in the text? 2) What is the cache hit ratio determined by? 3) What kinds of cache misses are mentioned in the text? 4) What are cache-oblivious algorithms? 5) How do multi-level caches operate? 6) What enables a faster hash-table lookup? Match the following terms with their definitions: 1) hardware prefetching 2) cache miss 3) fully-associative cache 4) translation loookaside buffer a) this type of cache speeds up virtual-to-physical address translation b) requesting the subsequent memory lines in advance c) a situation when the requested data does not occur in the cache d) in this type of cache each memory line can be stored in any location of the cache Mark the following statements as true or false: 1) Modern computer architecture does not include DRAM, only SRAM. 2) A big problem with DRAM is its high latency. 3) A translation lookaside buffer is organized in cache lines. 4) Larger caches have worse hit rates but lower latency. Vocabulary Coherent [kac 'hia.rant] ©[ kou'hir.snt] -souvislý, soudržný Compulsory [kam'pAl.ssr.i] © [ -sa*-] - povinný conscious ['kDn.ffas] © [ 'ka:n-] - vědom si něčeho fairly ['fea.li] © [ 'fer-] - celkem, dost hotspot [/?DtspDt] © [hDitspoit] - ohnisko independent [,in.dťpen.d3nt] - nezávislý invalid [in'vael.id] - neplatný latency ['lei.t3nf .si] - zpoždění oblivious [a'bliv.i.as] - zapomnětlivý, zapomínající, nedbající (něčeho) ratio ['rei.Ji.au] © [ -ou] - poměr sequential [sťkwen.Jal] - následující, postupný spatial ['spei.J3!] - prostorový stale [steil] - zde zastaralý subsequent ['sAb.si.kwant] - následující, příští table [Vei.bl] - tabulka temporal ['tem.psr.3l] © [ -pa*.al] - časový thrashing [FJraeJin] - zahlcení paměti to equate [I'kweit] - rovnat se to fetch [fetj] - přinést to mitigate ['mit.i.geit] © [ 'mít-] - zmírnit to multiply ['rriAl.ti.plai] © [ -t i-] - násobit to propose [pra'pauz] © [ -'pouz] - navrhnout to tend to [tend] - mít tendenci k to utilize [jui.ti.laiz] © [ -t al.ai-] - využít, upotřebit trade-off ['treid.of] © [ -a:f] - kompromis Phrases in advance [aďva:nřs] © [-Vaenřs] - dopředu, předběžně, v předstihu regardless of [rťga:d.las] © [-'ga:rd-] - bez ohledu na 21 Spatial Data Mining SHASHI SHEKHAR, JAMES KANG, VIJAY GANDHI 5 University of Minnesota, Minneapolis, MN, USA Definition Spatial data mining is the process of discovering 10 nontrivial, interesting, and useful patterns in large spatial datasets. The most common spatial pattern families are co-locations, spatial hotspots, spatial outliers, and location predictions. 15 Historical Background Spatial data mining research began several decades ago when practitioners and researchers noticed that critical assumptions in classical data mining and statistics were 20 violated by spatial datasets. First, whereas classical datasets often assume that data are discrete, spatial data were observed to reside in continuous space. For example, classical data mining and statistical methods may use market-basket datasets (e.g., history of 25 Walmart's transactions), where each item-type in a transaction is discrete. However, "transactions" are not natural in continuous spatial datasets, and decomposing space across transactions leads to loss of information about neighbor relationships between 30 items across transaction boundaries. In addition, spatial data often exhibits heterogeneity (i.e., no places on the Earth are identical), whereas classical data mining techniques often focus on spatially stationary global patterns (i.e., ignoring spatial variations across 35 locations). Finally, one of the common assumptions in classical statistical analysis is that data samples are independently generated. However, this assumption is generally false when analyzing spatial data, because spatial data tends to be highly self-correlated. For 40 example, people with similar characteristics, occupation and background tend to cluster together in the same neighborhoods. In spatial statistics this tendency is called spatial auto-correlation. Ignoring spatial auto-correlation when analyzing data with 45 spatial characteristics may produce hypotheses or models that are inaccurate or inconsistent with the data set. Thus, classical data mining algorithms often perform poorly when applied to spatial data sets. Better methods are needed to analyze spatial data to detect 50 spatial patterns. Foundations The spatial data mining literature has focused on four 55 main types of spatial patterns: (i) spatial outliers, which are spatial locations showing a significant difference from their neighbors; (ii) spatial co-locations, or subsets of event types that tend to be found more often together throughout space than other 60 subsets of event types; (iii) location predictions, that is, information that is inferred about locations favored by an event type based on other explanatory spatial variables; and (iv) spatial hotspots, unusual spatial groupings of events. The remainder of this section 65 presents a general overview of each of these pattern categories. A spatial outlier is a spatially referenced object whose non-spatial attribute values differ significantly from 70 those of other spatially referenced objects in its spatial neighborhood. For example, consider spatial outliers, detected in traffic measurements for sensors on highway I-35W (North bound) in the Minneapolis-St. Paul area, for a 24-h time period. Station 9 may be 75 considered a spatial outlier as it exhibits inconsistent traffic flow compared with its neighboring stations. Once a spatial outlier is identified, one may proceed with diagnosis. For example, the sensor at Station 9 may be diagnosed as malfunctioning. Spatial attributes 80 are used to characterize location, neighborhood, and distance. Non-spatial attribute dimensions are used to compare a spatially referenced object to its neighbors. Spatial statistics literature provides two kinds of bipartite multidimensional tests, namely graphical tests 85 and quantitative tests. Graphical tests, such as Variogram clouds and Moran scatterplots, are based on the visualization of spatial data and highlight spatial outliers. Quantitative methods provide a precise test to distinguish spatial outliers from the remainder of data. 90 Spatial co-location pattern discovery finds frequently co-located subsets of spatial event types given a map of their locations. Spatial co-location is a generalization of a classical data mining pattern family 95 called association rules, since transactions are not natural in spatial datasets, and partitioning space across transactions leads to loss of information about neighbor relationships between items near transaction boundaries. Additional details about co-location 100 interest measures, e.g. participation index and K functions, and mining algorithms are described in [2]. Location prediction is concerned with the discovery of a model to infer preferred locations of a spatial phenomenon from the maps of other explanatory 105 spatial features. For example, ecologists may build models to predict habitats for endangered species using maps of vegetation, water bodies, climate, and other related species. For example, consider an example of a dataset used in building a location prediction model for 110 red-winged blackbirds in the Darr and Stubble wetlands on the shores of Lake Erie in Ohio, USA. This dataset consists of nest location, distance to open 22 water, vegetation durability and water depth maps. Classical prediction methods may be ineffective in this 115 problem due to the presence of spatial auto-correlation. Spatial data mining techniques that capture the spatial autocorrelation of nest location such as the Spatial Autoregressive Model (SAR) [1] and Markov Random Fields based Bayesian Classifiers (MRF-BC) are used 120 for location prediction modeling. Spatial Hotspots are unusual spatial groupings of events that tend to be much more closely related than other events. Examples of spatial hotspots can be 125 incidents of crime in a city or outbreaks of a disease. Hotspot patterns have properties of clustering as well as anomalies from classical data mining. However, hotspot discovery [4] remains a challenging area of research due to variation in shape, size, density of 130 hotspots and underlying space (e.g., Euclidean or spatial networks such as roadmaps). Additional challenges arise from the spatio-temporal semantics such as emerging hotspots, displacement etc. 135 Key Applications Spatial data mining and the discovery of spatial patterns has applications in a number of areas. Detecting spatial outliers is useful in many applications 140 of geographic information systems and spatial databases, including the domains of public safety, public health, climatology, and location-based services. As noted earlier, for example, spatial outlier applications may be used to identify defective or out of 145 the ordinary (i.e., unusually behaving) sensors in a transportation system. Spatial co-location discovery is useful in ecology in the analysis of animal and plant habitats to identify co-locations of predator-prey species, symbiotic species, or fire events with fuel and 150 ignition sources. Location prediction may provide applications toward predicting the climatic effects of El Nino on locations around the world. Finally, identification of spatial hotspots can be used in crime prevention and reduction, as well as in epidemiological 155 tracking of disease. (Abridged) Recommended Reading 160 1. Cressie N.A. Statistics for Spatial Data (Revised Edition). Wiley, New York, NY, 1993. 2. Huang Y., Shekhar S., and Xiong H. Discovering co-location patterns from spatial datasets: a general 165 approach. IEEE Trans. Knowl. Data Eng. (TKDE), 16(12): 1472-1485, 2004. 3. Shekhar S., Schrater P., Vatsavai R., Wu W., and Chawla S. Spatial contextual classification and prediction models for mining geospatial data. IEEE 170 Trans. Multimed. (special issue on Multimedia Databases), 4(2):174-188, 2002. 4. US Department of Justice - Mapping and Analysis for Public Safety report. Mapping Crime: Understanding Hot Spots, 2005 175 (http://www.ncjrs.gov/pdffiles l/nij/209393.pdf). 5. Shekhar S. and Chawla S. A Tour of Spatial Databases. Prentice Hall, 2003. 6. Longley P.A., Goodchild M., Maquire D.J., and Rhind D.W. Geographic Information Systems and 180 Science. Wiley, 2005. 7. Shekhar S., Zhang P., Huang Y., and Vatsavai R. Trend in spatial data mining. In Data Mining: Next Generation Challenges and Future Directions, H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.). 185 AAAI/MIT Press, 2003. 8. Solberg A.H., Taxt T., and Jain A.K. A Markov random field model for classification of multisource satellite imagery. IEEE Trans. Geosci. Remote Sens., 34(1): 100-113, 1996. 190 9. Kou Y., Lu C.T., and Chen D. Algorithms for spatial outlier detection. In Proc. 2003 IEEE Int. Conf. on Data Mining, 2003, pp. 597-600. 10. Shekhar S., Lu C.T., and Zhang P. A unified approach to detecting spatial outliers. Geolnformatica, 195 7(2):139-166,2003. 11. Mamoulis N., Cao H., and Cheung D.W. Mining frequent spatio- temporal sequential patterns. In Proc. 2003 IEEE Int. Conf. on Data Mining, 2005, pp. 82-89. 23 Answer the following questions: 1) What is spatial data mining? 2) What are the basic differences between classical datasets and spatial datasets? 3) Why are classical data mining algorithms unsuitable for spatial datasets? 4) What does it mean that spatial data tends to be highly self-correlated? How would you explain spatial auto correlation? 5) Name the four main types of spatial patterns. 6) Explain a spatial outlier and give a simple example. 7) What are spatial and non-spatial attributes? 8) Which pairs are frequently co-located? 9) Give some examples of hotspots. 10) Name some areas where spatial data mining is applied. Mark the following statements as true or false: 1) Many of the relationships on spatial data are implicit. 2) Spatial autocorrelation means that nearby things are more different than distant things. 3) In spatial data mining data samples are not independent. 4) Classical data mining techniques perform well when applied to spatial data sets. 5) Spatial data mining is based on the same assumptions as classical data mining. Match the following terms with their definitions: 1) Spatial data mining (SDM) 2) Spatial outlier 3) Location prediction 4) Spatial co-location 5) Hotspot analysis 6) Spatial autocorrelation 7) Variogram clouds and Moran scatterplots a) Presence of two or more spatial objects at the same location or at significantly close distances from each other b) Process of finding unusually dense event clusters across time and space c) Process of discovering interesting, useful, non-trivial patterns from large spatial datasets d) Spatial outlier detection algorithms based on visualization e) Observation which appears to be inconsistent with its neighborhood f) Prediction of events occurring at particular geographic locations g) Spatial data values are influenced by values in their immediate vicinity 24 Vocabulary anomaly [a'rmm.a.li] © [-'nai.ma-] - odchylka assumption [a'sAm/?.J"an] - předpoklad, domněnka boundary ['baun.dar.i] [-dri] © [-daH - hranice, mez defective [dťfek.tiv] - vadný, chybný discrete [di'skriit] - nespojitý, oddělený, samostatný displacement [dťspleis.mant] - přemístění, posun endangered [in'dein.d3ad] © [-d3^d] - ohrožený grouping ['grui.pirj] - seskupení habitat ['haeb.i.taet] - přirozené prostředí, domov hypothesis (pi. hypotheses) [hai'poFj.a.sis] © [-'pa:.9a-], pi. [haťpoFj.a.siiz] - hypotéza, předpoklad, domněnka ignition [ig'nij>n] - vznícení, vzplanutí incident ['inř.si.d3nt] - případ, incident, příhoda, událost inconsistent [,in.kan'sis.tant] - neslučitelný, jsoucí v rozporu namely ['neim.li] - a to, jmenovitě nontrivial [non'triv.i.al] - podstatný, důležitý out of the ordinary [aut] [Di.di.na.ri] © [-] ['D:r.d3n.er-] - neobvyklý, zvláštní outbreak ['aut.breik] - vypuknutí outlier ['aut.lai.a] - co leží mimo pattern ['paet.3n] © ['paet .a^n] - vzorec, způsob, průběh phenomenon (pi. phenomena) [fa'nom.i.nan] © [-'nai.ma.nain], pi. [fa'nom.i.na] -jev predator ['pred.a.tar] © [-t a^] - dravec, predator prey [prei] - kořist remainder [n'mein.dar] © [-da^] - zbytek species ['spi:./i:z] - druh stationary ['stei.J"3n.ar.i] © [-/a.ner-] - neměnný, nehybný throughout space [9ru:'aut][speis] - po celém prostoru to be concerned with st [kan's3:nd] © [ -'s?:nd] -zabývat se čím to cluster ['klAs.tar] © [-ta^] - shlukovat se, seskupit to co-locate [kau.lau'keit] © [kou.lou'keit] -vyskytovat se společně to decompose [,di:.kam'pauz] © [-'pouz] -rozložit to favor ['fei.var] © [ -va^] - podporovat to highlight ['hai.lait] - zvýraznit, poukázat to infer [in'f3ir] ® [ -'f?:] - dedukovat, vyvozovat to note [naut] © [ nout] - upozornit, poukázat to partition [pa:'tij>n] © [pair-] - rozdělit to proceed [praa'siid] © [prou-] - pokračovat, postupovat to reference ['ref.3r.3nřs] © [-aH - odkazovat to tend [tend] - mít sklon, tendenci, být náchylný to track [traek] - sledovat to violate [Vaia.leit] - porušit, nedodržet Phrases due to - kvůli either - or - buď - nebo from the standpoint ['staend.pDint] - z hlediska in infant stages ['in.fant] - na počátku vývoje 25 Stemming CHRIS D. PAICE Lancaster University, Lancaster, UK 5 Definition Stemming is a process by which word endings or other affixes are removed or modified in order that word 10 forms which differ in non-relevant ways may be merged and treated as equivalent. A computer program which performs such a transformation is referred to as a stemmer or stemming algorithm. The output of a stemming algorithm is known as a stem. 15 Historical Background The need for stemming first arose in the field of information retrieval (IR), where queries containing 20 search terms need to be matched against document surrogates containing index terms. With the development of computer-based systems for IR, the problem immediately arose that a small difference in form between a search term and an index term could 25 result in a failure to retrieve some relevant documents. Thus, if a query used the term "explosion" and a document was indexed by the term "explosives," there would be no match on this term (whether or not the document would actually be retrieved would 30 depend on the logic and remaining terms of the query). The first stemmer for the English language to be fully described in the literature was developed in the late 1960s by Julie Beth Lovins [11]. This has now been largely superseded by the Porter stemmer [14], which 35 is probably the most widely used, and the Paice/Husk stemmer [12]. Stemmers have also been developed for a wide variety of other languages. Foundations 40 Definitions In an IR context, the process of taking two distinct words, phrases or other expressions and treating them as semantically equivalent is referred to as conflation. 45 The two expressions need not be precisely synonymous, but they must refer to the same core concept (compare "computed" and "computable"). In this article, the term "practically equivalent" is used to mean that, for the purposes of a particular 50 application, the words may as well be taken as equivalent. The term conflation is sometimes used as though it is equivalent to stemming, but it is in fact a much broader concept, since it includes (i) cases where the strings concerned are multi-word expressions, as in 55 "access time" and "times for access", and (ii) cases where the strings are not etymologically related, as in "index term" and "descriptor". In case (i) special string matching techniques may be used, whereas in case (ii) reference to a dictionary or thesaurus is 60 necessary. The present account deals exclusively with the conflation of single etymologically related words. There are various possible approaches to word conflation, including the following. 65 1. Direct matching. In this method, the character sequences of two words are compared directly, and a similarity value is computed. The words are then considered to match if their mutual similarity exceeds a predefined threshold. To give a simple example, the 70 first six letters of the words "exceeds" and "exceeded" are the same, so these words together contain 12 matching letters out of 15. Hence, a similarity of 12/15 = 0.80 can be computed. Use of a threshold (say, 0.70) allows a decision as to whether 75 the words can be considered equivalent. With such a method, setting the threshold is problematic. Thus, the similarity between "exceeds" and "excess" is 0.62, which is below the stated threshold. However, allowing for this by lowering the threshold to 0.60 80 would cause "excess" and "except" (similarity 0.67) to be wrongly conflated. 2. Lexical conflation. In this case a thesaurus or dictionary is used to decide whether two words are 85 equivalent. Obviously, this method can be used even for etymologically unrelated words. A problem here is obtaining a suitably comprehensive and up-to-date thesaurus, and one which explicitly lists routine variants such as plurals. 90 3. Cluster-based conflation. This method, investigated by Xu and Croft [15], involves creating clusters of practically equivalent words by analyzing the word associations in a large representative text corpus. Each 95 query word is then supplemented by adding in the other words in its cluster. In contrast to method (2), the clusters created are specific to the text collection in question. However, the creation of the clusters can be very time-consuming. 100 4. N-gram conflation. In this method, each word is decomposed into a collection of N-letter fragments (N-grams), and a similarity is computed between the N-gram collections of two words; a threshold is then 105 applied to decide whether the words are equivalent. This approach was pioneered by Adamson and Boreham[l], who used sets of bigrams, where N = 2. For example, after eliminating duplicates and sorting into order, "exceeds" can be represented by the 110 bigram set {ce, ds, ed, ee, ex, xc} and "exceeded" by {ce, de, ed, ee, ex, xc}. Out of 7 distinct bigrams here, 26 5 are shared between the two words; hence a similarity of 5/7 = 0.712 can be computed. 115 5. Stemming. Stemming refers to the removal of any suffixes (and sometimes other affixes) from an input word to produce a stem. Two words are then deemed to be equivalent if their stems are identical. This method is much favored because it is fast: all words 120 can be reduced to stems on input to the system, and simple string matching used thereafter. The remainder of this article focuses on stemming in this narrow sense. 125 Prefixes and Infixes In English, stemmers are usually designed for removing suffixes from words. The removal of "intimate" prefixes such as "intro-," "pro-" and 130 "con-" generally results in words being wrongly conflated (consider "intro-duction," "pro-duction" and "con-duction"). However, there may be a case for removing looser prefixes such as "hyper-" or "macro-." Also, prefix 135 removal may be desirable in certain domains with highly artificial vocabularies, such as chemistry and medicine. As explained below, there are some languages in which removal or replacement of prefixes, or even infixes, is in fact essential. 140 Performance and Evaluation Since stemmers were originally developed to aid the operation of information retrieval systems, it was 145 natural that they were first assessed in terms of their effect on retrieval performance, as well as on "dictionary compression" rates. Researchers were frustrated to find that the effects on retrieval performance for English language material were small 150 and often negative [10]. Removal of "-s" and other regular inflectional endings might be modestly helpful, but use of heavier stemming could easily result in a loss of performance [7]. Stemming errors are of two kinds: understemming, in 155 which a pair of practically equivalent words are not conflated, and overstemming, in which two semantically distinct words are wrongly conflated. Non-English Stemmers 160 Stemming is appropriate for most (though not all) natural languages, and appears to be especially beneficial for highly inflected languages [9]. There is neither space nor need to describe non-English 165 stemmers here, except to note that some languages exhibit much greater structural complexity, and this warrants special approaches. Thus, a typical Arabic word consists of a root verb of three (or occasionally four or five) consonants (e.g., "k-t-b" for "to write"), 170 into which various prefixes, infixes and suffixes are inserted to produce specific variant forms ("katabna": "we wrote" and "kitab": "book"). Some researchers have concentrated on extracting the correct root from a word [3], but Aljlayl and Frieder 175 have demonstrated that better retrieval performance is obtained by using a simpler "light stemming" approach, in which only the most frequent suffixes and prefixes are removed [4]. Their results showed that extraction of roots causes unacceptable levels of 180 overstemming. Key Applications As noted earlier, stemmers are routinely used in 185 information retrieval systems to control vocabulary variability. They also find use in a variety of other natural language tasks, especially when it is required to aggregate mentions of a concept within a document or set of documents. For example, stemmers may be used 190 in constructing lexical chains within a text. Stemming can also have a role to play in the standardization of data for input to a data warehouse. (Abridged) 195 Recommended Reading 1. Adamson G.W. and Boreham J. The use of an association measure based on character structure to 200 identify semantically related pairs of words and document titles. Inf. Process. Manage., 10(7/8):253-260, 1974. 2. Ahmad F., Yusoff M., and Sembok M.T. Experiments with a stemming algorithm for Malay 205 words. J. Am. Soc. Inf. Sci. Technol., 47(12):909-918, 1996. 3. Al-Sughaiyer LA. and Al-Kharashi LA. Arabic morphological analysis techniques: a comprehensive survey. J. Am. Soc. Inf. Sci. Technol., 55(3):189-213, 210 2004. 4. Aljlayl M. and Frieder O. On Arabic search: Improving the retrieval effectiveness via a light stemming approach. In , 2002, pp. 340-347. 5. Bacchin M., Ferro N., and Melluci M. A 215 probabilistic model for stemmer generation. Inf. Process. Manage., 41(1):121-137, 2005. 6. Frakes W.B. and Fox C.J. Strength and similarity of affix removal stemming algorithms. SIGIR Forum, 37(l):26-30, 2003 (Spring 2003). 220 7. Harman D. How effective is suffixing? J. Am. Soc. Inf. Sci., 42(1):7-15, 1991. 27 8. Hull D. A Stemming algorithms: a case study for detailed evaluation. J. Am. Soc. Inf. Sci., 47(l):70-84, 1996. 225 9. Krovetz R. Viewing morphology as an inference process. Artificial Intelligence, 118(l/2):277-294, 2000. 10. Lennon M., Pierce D.S., Tarry B.D., and Willett P. An evaluation of some conflation algorithms for 230 information retrieval. J. Inf. Sci., 3:177-183, 1981. 11. Lovins J.B. Development of a stemming algorithm. Mech. Transl. Comput. Linguist., 11:22-31, 1968. 12. Paice CD. Another stemmer. SIGIR Forum, 24(3):56-61, 1990. 235 13. Paice CD. A method for the evaluation of stemming algorithms based on error counting. J. Am. Soc. Inf. Sci., 47(8):632-649, 1996. 14. Porter M.F. An algorithm for suffix stripping. Program, 14(3):130-137, 1980. 240 15. Xu J. and CroftW.B. Corpus-based stemming using co-occurrence of word variants. ACM Trans. Inf. Syst., 16(1):61-81, 1998. 28 Answer the following questions: 1) How would you describe stemming? What is its purpose? 2) What often resulted in a failure to retrieve relevant documents during searches in the past? 3) What is conflation? 4) Is there any difference between conflation and stemming? 5) What tools have to be used when strings are not etymologically related? 6) Describe direct matching. 7) What does the term of threshold refer to in the text? 8) What is the disadvantage of cluster-based conflation? 9) What are bigrams? 4) Hyper- in hyperactive is a suffix. 5) The term affix covers both prefix and suffix. 6) Stemming appears beneficial for highly inflected languages. 7) The light-stemming approach is based on removing the least frequent affixes. Match the following terms and their definitions: 1) lexical conflation 2) cluster-based conflation 3) N-gram conflation 4) understemming 5) overstemming a) a method using a corpus of texts b) a method based on bigrams c) a situation where more-or-less equivalent words are not conflated d) a method using a dictionary or thesaurus e) a situation where two semantically distinct words are wrongly conflated Mark the following statements as true or false: 1) During the conflation, the expressions need to be synonymous. 2) The words mother and father are etymologically related. 3) In stemming, two words are considered equivalent provided their stems are identical. 29 Vocabulary account [a'kaunt] - výčet; účet actual ['aek.tfu.al] [-tju-] [-tjul] - vlastní actually ['aek.tfu.a.li] [-tju-] [-tju.li] - vlastně affix ['aef.iks] - affix (předpona, přípona) algorithm ['ael.ga.ri.53m] - algoritmus bigram ['baigraem] - bigram (skupina dvou písmen, slabik či slov) cluster ['klAs.tar] © [-ta*] - hrozen, skupina, klastr comprehensive [.kom.prťhenř.siv] © [,ka:m-] - komplexní, obsáhlý conflation [kan'fleif3n] - spojování compression [kam'prej>n] rate [reit] -kompresivita consonant ['kon.sa.nant] © ['ka:n-] - souhláska core [kDír] © [kDír] - jádro; jádřinec corpus ['kDi.pas] © ['kDír-] - korpus, tělo; soubor textů distinct [di'stirjkt] - různý rozdílný duplicate ['dju:.pli.kat] © ['du:-] - duplikát; duplikovaný (srovnej výslovnost s„to duplicate" ['djui.pli.keit] ®['du:-]) inflectional [in'flekfanal] - skloňovací, skloňující, skloňovatelný equivalent [ťkwiv.3l.3nt] - ekvivalentní etymological [,et.i'mDl.a.d3ik3l] © [-'ma:.la-] - etymologický, vztahující se k původu slova exclusive [ik'sklui.siv] - výhradní failure ['fei.Ijar] © [-Ija*] - neúspěch hence [henřs] - tudíž however [,hau'ev.ar] © [-&] - však, avšak identical [aťden.ti.k3!] ©[-ta-] - identický, stejný lexical ['lek.si.k3!] - lexikální lexical ['lek.si.k3!] chains [tjein] - lexikální řetězce loose [lu:s] - volný mutual ['mjui.t/u.al] - vzájemný predefined [priidi'faind] © [priida'faind] - předem definovaný prefix ['prii.fiks] - předpona query ['kwia.ri] © [ 'kwir.i] - dotaz remainder [ri'mein.dar] © [-da*] - zbytek root [ru:t] - kořen routine [ruťtiin] - obvyklý semantic [sťmaen.tik] © [-t ik] - sémantický, významový stem [stem] - kmen; stopka surrogate ['sAr.a.gat] ®['s?:-] - náhradník; náhradní suffix ['sAf.iks] - přípona synonymous [sťnon.i.mas] © [-'na:.na-] - podobného významu thereafter [,5ea'ra:f.tar] © [,5er'aef.ta*] - poté thesaurus [9a'so:ras] - tesaurus threshold ['9re/./?auld] © [-/?ould] - práh thus [5as] - tak, a tak to aggregate st ['aeg.n.geit] - (na)hromadit něco to aid st [eid] - napomáhat něčemu to arise (arise, arose, arisen) [a'raiz] - objevit se; vyvstat to assess st [a'ses] - hodnotit něco to conflate [kan'fleit] - spojit, spojovat to decompose st [,di:.kam'pauz] © [-'pouz] -rozložit něco to deem [di:m] - považovat to duplicate st ['djui.pli.keit] © ['du:-] - duplikovat něco to eliminate st [ťlim.i.neit] - eliminovat něco to exceed st [ik'siid] - překročit něco to exhibit st [ig'zib.it] - vykazovat něco to extract st [ik'straekt] - extrahovat, vytáhnout něco to favor st ['fei.var] © [-va*] - dávat něčemu přednost to investigate st [in'ves.ti.geit] - vyšetřovat něco to merge st [m3:d3] © [ m?:d3] - spojit něco, sloučit to obtain st [ab'tein] - získat něco to pioneer [,paia'niar] © [-'nir] - razit cestu to retrieve st [n'triiv] - vyhledat, vyzvednout něco 30 to supersede st [,su:.pa'si:d] © [-paH - nahradit něco to supplement st ['sAp.li.mant] - doplnit něco to treat st [triit] - zacházet s něčím to warrant st [Wr.3nt] © ['woir-] - opravňovat něco variability [.vea.ri.a'bil.i.ti] © [.ver.i.a'bil.a.t i] -variabilita variant [Vea.ri.ant] © [Ver.i-] - varianta warehouse ['wea.haus] © ['wer-] - skladiště whereas [wea'raez] ©[wer'aez] - kdežto Phrases In contrast to st ['kon.traist] © ['kain.traest] -Oproti něčemu Obviously, ... [ob.vi.a.sli] © ['a:b-] - Samozřejmě 31 Storage Devices KALADHAR VORUGANTI Network Appliance, Sunnyvale, CA, USA 5 Definition One of the goals of database, file and block storage systems is to store data persistently. There are many 10 different types of persistent storage devices technologies such as disks, tapes, DVDs, and Flash. The focus of this write-up is on the design trade-offs, from a usability standpoint, between these different types of persistent storage devices and not on the 15 component details of these different technologies. Historical Background From a historical standpoint, tapes were the first type 20 of persistent storage followed by disks, CDs, DVDs, and Flash. Newer types of memory technologies such as PRAM and MRAM are still in their infant stages. These newer non-volatile memory technologies promise DRAM access speeds and packaging 25 densities, but these technologies are still too expensive with respect to cost/gigabyte. Foundations 30 - Tapes/Tape Libraries: Tape readers/tape head, tape library, tape robot, and tape cartridge are the key components of a tape subsystem. Tapes provide the best storage packaging density in comparison to other types of persistent storage devices. Tapes do not 35 provide random access to storage. Data on tapes can be stored either in compressed or uncompressed format. Unlike disks, tape cartridges can be easily transported between sites. Most organizations typically migrate data from older tape cartridges to newer tape cartridges 40 every 5 years to prevent data loss due to material degradation. One can employ disk based caches in front of tape subsystems in order to allow for tapes to handle bursty traffic. Tapes that provide Write-Once, Read Many (WORM) characteristics are also available. 45 WORM tapes are useful in data compliance environments where regulations warrant guarantees that a piece of data has not been altered. DLT and LTO are currently the two dominant tape technologies in the market. Technology-wise both these standards have 50 minor differences. Finally, from a pure media cost standpoint, tapes are less expensive (cost per gigabyte) than disks and other forms of persistent media. - Disks/Storage Controllers/NAS Boxes: Disks are the 55 most widely used form of persistent storage media. Disks are typically accessed by enterprise level applications when they are packaged as part of the processing server box (direct attached storage model), or are part of a network attached storage box (NAS) 60 and accessed via NAS protocols or, are packaged as part of a storage controller box and accessed via storage area network protocols (SAN). The current trend is for protocol consolidation, where the same storage controller provides support for both SAN and 65 NAS protocols. Typically, the size of the storage controllers can vary from a few terabytes to hundreds of terabytes (refrigerator sized storage controllers). A storage controller typically consists of redundant processors, protocol processing network cards, and 70 RAID processing adapter cards. The disks are connected to each other via either arbitrated loop or switched networks. Storage controllers also contain multi-gigabyte volatile caches. Disks are also packaged as part of laptops. 75 There is a marked difference in the manufacturing process, and testing process between the enterprise class disks and commodity laptop class disks. Disks vary in their form factor, rotational speed, storage capacity, number of available ports, and the protocols 80 used to access them. Currently, serial SCSI, parallel SCSI, serial ATA and parallel ATA, Fiber Channel, and SSA are the different protocols in use for accessing disks. Lower RPM and disk idle mode are new disk spin-down modes that allow disks to 85 consume less power when they are not actively being used. - DVD/Juke Boxes: DVDs and CDs are optical storage media that provide random access and WORM 90 capabilities. Only recently, the multiple erase capacity of an individual CD or DVD was less than the capacity of a single disk drive or tape cartridge. DVDs can store more data than a CD, and a high definition DVD can store more data than a DVD. There are numerous 95 competing standards for CDs, DVDs and high definition DVDs, however, format agnostic DVD players and DVD writers are emerging. Usage of DVDs is more prominent in the consumer space rather than in the enterprise space. A juke box system allows 100 one to access a library of CDs or DVDs. DVDs have slower access speeds than most types of disks. - Flash/SSDs/Hybrid Disks: Flash is memory technology that has non-volatile characteristics. Flash memory has slower read times than DRAM. Moreover, 105 it has much slower write times than DRAM. One has to perform an erase operation before one can re-use a flash memory location. One can only perform a limited number of erase operations. Thus, the number of write operations determines the Flash memory life. 110 SLC and MLC are the two different NAND flash technologies. SLC can be erased a greater number of times, and it has faster access times than MLC based 32 flash. NAND flash has faster write and erase times than NOR flash. NOR flash has faster read times than 115 NAND flash. NAND flash is used to store large amounts of data, whereas NOR flash is used to store executable code. People are using MLC flash in cameras and digital gadgets, and are using SLC flash as part of solid state 120 disks (SSDs). SSDs provide block level access interface (SCSI), and they contain a controller that performs flash wear leveling and block allocation. Hybrid disks that contain a combination of disks and Flash are emerging. Hybrid disks provide a Flash 125 cache in front of the disk media. One typically can store meta-data or recently used data in the flash portion of hybrid disks to save on power consumption. That is, one does not have to spin up the disk. Flash storage provides much better random access speeds 130 than disk based storage. Key Applications Tapes are being used primarily for archival purposes 135 because they provide good sequential read/write times. Disks are the media of choice for most on-line applications. Optical media (CDs, DVDs) are popular in the consumer electronic space. Flash based SSDs are popular for those workloads that exhibit random IOs. 140 Disks are being used in laptops, desktops and storage servers (SANs, NAS, DAS). Tape based WORM media and content addressable based disk storage are providing WORM media capabilities in tape and disk technologies, and thus, these technologies can be used 145 to also store compliance/regulatory data. (Abridged) Recommended Reading 150 1. Anderson D., Dykes J., and Riedel E. More than an interface- SCSI versus ATA. In Proc. Second Annual Conf. on FAST. San Francisco, CA, 2003. 2. Toigo J. Holy Grail of Network Storage 155 Management. Prentice Hall, Englewood Cliffs, NJ, 2003. 3. Voruganti K., Menon J., and Gopisetty S. Land below a DBMS. SIGMOD Rec, 33(l):64-70, 2004. Answer the following questions: Mark the following statements as true or false: 1) Name some of the persistent storage devices 1) Portable and inexpensive to purchase, tapes are often used for backing up or archiving data. mentioned in the text. 2) What are the advantages and disadvantages of tapes? 3) Where are WORM tapes useful? 4) Name the two dominant tape technologies that 2) Flash is memory technology that has volatile characteristics. 3) WORM disks can be written only once. are currently on the market. 5) What is the most widely used form of 4) NAND has significantly lower storage capacity than NOR. use for accessing disks? 7) What spin-down modes are used to save on persistent storage medium? 6) What are the protocols which are currently in 5) MLC is used in solid state drives (SSDs) and SLC is used in consumer appliances like cameras, media players, cell phones, etc. power consumption? 8) What are the characteristics of some of the 6) An SSD (solid-state drive or solid-state disk) is a storage device that stores persistent data on solid-state flash memory. optical storage media? 9) What is flash technology? 10) Where is flash memory used? 7) In hybrid disks the flash memory handles the data most frequently written to or retrieved from storage. Match the following terms with their definitions: 1) A library 2) A disk cache 3) Fibre channel 4) A storage area network (SAN) 5) DLT (digital linear tape) 6) A jukebox a) A mechanism for improving the time it takes to read from or write to a hard disk b) A unit in which optical disk drives are mounted c) A high-speed special-purpose network (or sub-network) that interconnects different kinds of data storage devices d) A collection of physical storage media such as tapes or disks and a way to access them e) A form of magnetic tape and drive system used for computer data storage and archiving f) A type of high speed interconnection 34 Vocabulary via [vaia] © [Vii.a] - přes, prostřednictvím wise (suffix) [waiz] - pokud jde o, hovoříme-li o write time [rait] [taim] - přístupová doba zápisu access time ['aek.ses][ taim] - přístupová doba bursty traffic [baisti] [traefik] - shlukový provoz, shluky dat cache [kaef] - skrýš, úkryt competing [kam'piitin] - konkurenční data compliance ['dei.ta] [kam'plai.anf s] © [-t a] [-] - ochrana dat před přepsáním wear leveling [wear] ['lev.alin] © [wer] [-] - strategie zápisu a mazání (vyrovnání opotřebení) focus on st ['fau.kas] © ['fou-] - zaměření format agnostic ['fDi.maet] [aeg'nos.tik] © f'fDir-] [-'nai.stik] - přehrávající jakýkoli formát gadget ['gaed3.1t] - přístroj, zařízení goal [gaul] © [goul] - cíl high definition [hai] [.def.ťnif.an] - vysoké rozlišení idle ['ai.dl] - nečinný life [laif] - životnost marked [ma:kt] © [ mairkt] - výrazný non-volatile [nonVol.a.tail] © [non' vai.la.t al] - stálý, stabilní numerous ['njui.ma.ras] © [ 'nu:-] - početný, četný persistent [pa'sis.t3nt] ©[pa*-] - stálý, trvalý read time [ri:d] [taim] - přístupová doba pro čtení solid state disk(SSD) ['sol.id] [steit] [disk] © [ 'sai.lid] [-] [-] - pevný disk na bázi paměťových čipů standpoint ['staend.pDint] - stanovisko, hledisko to access st ['aek.ses] - dostat se k, mít přístup k to allocate ['ael.a.keit] - přidělit to alter ['Dl.tar] © [ 'ail.t čk] - změnit to compete [kam'piit] - soutěžit to determine [dťt3:.min] ©[ -'t?:-] - určovat, udávat to emerge [i'm3:d3] © [ -'m?:d3] - objevit se to handle ['haen.dl] - zvládnout, vypořádat se to package ['paek.id3] - zabalit to spin down [spin] [daun] - zpomalit otáčení to spin up [spin] [Ap] - zrychlit otáčení to vary [Vea.ri] © [ Ver.i] - lišit se, různit se to warrant [Wr.3nt] ®['wD:r-] - zaručit trade-off ['treid.of] © [-a:f] - kompromis 35 Storage Protection KENICHIWADA Hitachi Limited, Tokyo, Japan 5 Definition Storage protection is a kind of data protection for data stored in a storage system. The stored data can be lost 10 or becomes inaccessible due to, mainly, a failure in storage component hardware (such as a hard disk drive or controller), a disastrous event, an operator's mistake, or intentional alteration or erasure of the data. Storage protection provides the underlying foundation 15 for high availability and disaster recovery. Historical Background In 1956, IBM shipped the first commercial storage that 20 had a hard disk drive. To protect data from bit errors on disk platters, the hard disk drive commonly uses cyclic redundancy check (CRC) and an error-correcting code (ECC). CRC and ECC cannot protect data from a whole disk 25 failure in which an entire disk becomes inaccessible (for example, because of a disk head crash). The IBM 3990, which was shipped in the 1980s, had the replication functionality in which two identical copies of data were maintained on separate media. This 30 approach protected data from this kind of failure. Replication functionality can be implemented in many other layers of the computer system. Most DBMS support database replication. Some file systems and Logical Volume Managers have file or volume 35 replication functionality. Further, many storage systems and storage virtualization appliances support volume replication functionality. RAID (Redundant Array of Inexpensive Disks) is another technology for protecting data from whole disk 40 failure. D. Patterson et al. published a paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)" in June 1988 at the SIGMOD conference [6]. This paper introduced a five level data protection scheme. The term RAID was adopted from this paper, but 45 currently RAID is an acronym for Redundant Arrays of Independent Disks. It is noted that the patent covering RAID level 5 technology was issued in 1978 [5]. RAID level 1 is a kind of replication. RAID level 2 to 5 can reduce the capacity required to protect data 50 against disk drive failure than replication, but it is limited to protect disk drive failure. Replication, on the other hand, can be used to protect databases, file systems and logical volume. Further replication can be used for disaster recovery, if data are replicated 55 remotely. Foundations Hard disk drives commonly use Reed-Solomon code 60 [7] to correct bit errors. Data in hard disk drives is usually stored in fixed length blocks. Controllers in hard disk drives calculate ECC for each block and record it associated with the original data. When data are read the controller checks data integrity using ECC. 65 CRC can be used with ECC for detecting bit errors and/or reducing the possibility of correction error. Most DBMS support database replication with master/slave relation between the original and the replica. The master process updates and transfers it to 70 the slave. This type of replication can provide high availability to the client of the DBMS in case of storage system failures as well as server failures. Another type of database replication is multi-master, which is mostly used to provide high performance 75 parallel processing. Both types can be either synchronous or asynchronous replication. In synchronous replication, updates made in original are guaranteed in the replica, note there may be some delay in asynchronous replication. 80 Volume replication by storage system is also widely accepted as data protection. There are synchronous and asynchronous replications, the same as database replication. Asynchronous volume replication is often used for long distance remote replication. It may 85 prevent performance degradation caused by replication delay, but could cause some data loss in case of recovery. Synchronous replication, on the other hand, may provide no data loss recovery, but may cause performance degradation due to replication delay. 90 Volume replication is also used within a local data center for online backup. Backup servers use replica volume for backup when original volume is online. To support this, a storage system can pause update delegation from original to replica volume. 95 RAID (Redundant Array of Independent Disks) is a set of disks from one or more commonly accessible disk subsystems, combined with a body of control software, in which part of the physical storage capacity is used to store redundant information about user data stored on 100 the remainder of the storage capacity. The term RAID refers to a group of storage schemes that divide and replicate data among multiple disks, to enhance the availability of data at desired cost and performance levels. A number of standard schemes have evolved 105 which are referred to as levels. Originally, five RAID levels were introduced [6], but many more variations have evolved. Currently, there are several sublevels as well as many non-standard levels. There are trade-offs among RAID levels in terms of performance, cost and 110 reliability. 36 Key Applications 115 Storage protection is essential to achieve business continuity and legal compliance with adequate performance, cost, and reliability. (Abridged) 120 Recommended Reading 1. ANSI. NFPA1600 Standard on Disaster/Emergency Management and Business Continuity Programs. 125 2. BSI. BS25999; Business Continuity Management. 3. Houghton A. Error Coding for Engineers. Kluwer Academic Publications, Hingham, MA, 2001. 4. Keeton K., Santos C, Beyer D., Chase J., and Wilkes J. Designing for disasters. In Proc. 3rd 130 USENIX Conf. on File and Storage Technologies, 2004. 5. Ouchi N.K. (IBM Corporation). System for recovering data stored in failed memory unit. US Patent 4,092,732, 1978. 135 6. Patterson D., Gibson G, and Katz R. A case for redundant arrays of inexpensive disks (RAID). In 1988. 7. Sweeney P. Error Control Coding From Theory to Practice. Wiley, New York, 2002. 140 8. http://www.sec.gov/ Answer the following questions: 1) How can data loss occur? 2) What are CRC and ECC? 3) What does RAID stand for? 4) How many RAID levels were originally introduced? 5) What is Reed-Solomon code used for? 6) Besides disaster recovery, what is storage protection good for? 7) What is the basic difference between RAID 1 and the other RAID levels? Match the following terms with their definitions: 1) synchronous replication 2) asynchronous replication 3) RAID level 4) data integrity check a) in this approach, changes in the original can take some time to reflect in the replica b) a way to determine whether data is corrupted c) in this approach, changes in the original are immediately reflected in the replica d) a specific strategy of distributing data over multiple disks Mark the following statements as true or false: 1) When data is read, the controller checks data integrity using CRC. 2) Replication can also be used to protect databases and file systems. 3) Remote replication often employs the asynchronous volume replication approach. Vocabulary appliance [a'plai.anf s] - přístroj array [a'rei] - pole, sada compliance [kam'plai.anf s] - shoda, dodržení degradation [,deg.ra'dei./3n] - pokles, zhoršení disastrous [di'za:.stras] © [-'zaes.tras] - katastrofální, neblahý disk platter [disk] ['plaet.ar] © [-] ['plaet,.^] -disková plotna erasure [i'rei.3ar] © [-3^] - smazání failure ['fei.ljar] © [Ija^] - selhání inaccessible [.in.ak'ses.i.bl] - nepřístupný, nedostupný integrity [in'teg.ra.ti] © [-t_i] - celistvost, neporušenost intentional [in'ten./3n.al] - záměrný loss [Ids] © [la:s] - ztráta recovery [ri'kAv.ar.i] © [-»<■-] - obnova redundancy [n'dAn.danf .si] - nadbytečnost replica [Vep.li.ka] - kopie, duplikát scheme [ski:m] - schéma, soustava to evolve [i'vdIv] © [-Va:lv/] - vyvinout se volume [Vüljuim] © [Va:l-] - objem; svazek Phrases due to [dju:] © [du:] - kvůli; způsobený (čím) 39 Video YINGLI IBM T.J. Watson Research Center, Hawthorne, 5 NY, USA Definition Video, which means ' T see'' in Latin, is an electronic 10 representation of a sequence of images or frames, put together to simulate motion and interactivity. From the producer's perspective, a video delivers information created from the recording of real events to be processed simultaneously by a viewer's eyes and ears. 15 For most of time, a video also contains other forms of media such as audio. Video is also referred to as a storage format for moving pictures as compared to image, audio, graphics and animation. 20 Historical Background Video technology was first developed for television systems, but it has been further developed in many 25 formats to allow for consumer video recordings. Generally speaking, there are two main types of video: analog video and digital video. Analog videos are usually recorded as PAL (Phase Alternating Line) or NTSC (National Television System Committee) 30 electric signals following the VHS (Video Home System) standard and stored in magnetic tapes. Digital videos, on the contrary, are usually captured by digital cameras and stored in digital video formats such as DVD (Digital Versatile Disc), QuickTime and MPEG- 35 4 (Moving Picture Experts Group). Launched in September 1976, VHS became a standard format for consumer recording and viewing by the 1990s. Since then, it has dominated both home and commercial video markets. In March 1997, the DVD 40 format was introduced to American consumers, which gradually pulled consumers away from VHS in the following years due to its much better quality. In June 2003, the DVD's market share exceeded that of the VHS for the first time. Since then, it has been steadily 45 expanding its consumer market, and by July 2006, most major film studios have stopped releasing new movie titles in VHS format, opting for DVD-only releases. Now, VHS is gradually disappearing from both rental and retail stores, and DVD has dominated 50 the whole commercial market. Nevertheless, VHS is still popular for home recording of television programs, due to the large installed base and the lower cost of VHS recorders and tape. For the last few decades, as video technology quickly 55 advances and the cost of storage devices rapidly decreases, digital videos have become widely available in diverse application areas such as medicine, remote sensing, entertainment, education and online information services. This has thus led to very active 60 research in various video-related areas. Foundations The last three decades have witnessed a significant 65 amount of research efforts on various aspects of video technologies. Roughly speaking, they fall into the following three general categories: video representation, video content analysis, and video application. Specifically, video representation deals 70 with the way a video is represented, in another word, the file format. Video content analysis, on the other hand, aims to automatically structure and ultimately understand the video by analyzing its underlying content. Due to the difficult nature of this problem, 75 such process usually involves the analysis of multiple media modalities including visual, audio and text information. Finally, video application applies what it has learned from the analysis engine, and facilitates various types of content access including video 80 browsing, summarization and retrieval. A brief discussion on each of these three research domains is given below. Video Representation 85 A video sequence with accompanying sound track can occupy a vast amount of storage space when represented in digital format. As estimated in [6], a 1-min video clip could possibly occupy up to 448 MB. 90 Consequently, compression has been playing an important role in modern schemes for video representation. A wide variety of methods have been proposed to compress the video stream. Nevertheless, almost all of 95 them build their approaches upon the fact that video data contains both spatial and temporal redundancy. Specifically, to reduce the spatial redundancy, an intra-frame compression is applied which registers differences between parts of a single frame. Such a 100 task is more closely related to image compression. Likewise, to reduce the temporal redundancy, an inter-frame compression is exploited which registers differences between neighboring frames. This involves discrete Cosine transform (DCT), motion 105 compensation and other techniques. Some popular video compression mechanisms include H.261, H.263, H.264, MPEG-l,MPEG-2, MPEG-4 and MJPEG (Motion-Joint Photographic Experts Group). Specifically, H.261 is a 1990 ITU-T 110 (Telecommunication Standardization Sector of International Telecommunication Union) video coding standard originally designed for transmission over 40 ISDN lines. Later on, H.263 and H.264, which provide more capabilities and mainly target at video- 115 conferencing applications, were standardized in 1995 and 2003, respectively. In 1998, the Moving Picture Experts Group (MPEG) was formed to establish an international standard for the coded representation of moving pictures and associated audio on digital storage 120 media. Currently, there have been three established MPEG standards from this effort: MPEG-1, MPEG-2, and MPEG-4. Each of them targets at different commercial applications. For instance, MPEG-1 is usually used as the Video CD (VCD) format, MPEG-2 125 for High Definition Television (HDTV), and MPEG-4 for streaming video applications. Finally, to facilitate mobile appliances such as digital cameras, MJPEG was developed in the 1990s which uses intra-frame coding technology that is very similar to those used in MPEG- 130 1 or MPEG-2. However, it does not use inter-frame prediction, which on the one hand, results in a loss of compression capability, yet on the other hand, it makes the degree of compression capability independent of the amount of motion in the scene. Moreover, it also 135 eases video editing as simple editing can now be performed at any frame. Video Content Analysis 140 Video is a type of rich media as it often consists of other media types such as audio and text. Consequently, research on video content analysis can be grouped into three classes: visual content analysis, audio content analysis, and audiovisual content 145 analysis. A general goal of video content analysis is to extract the underlying video structure so as to facilitate convenient and nonlinear content access. Yet a more aggressive goal is to automatically understand video semantics so as to support applications such as video 150 summarization and retrieval that require an in-depth understanding of the video content. Video Application 155 Besides the large amount of research efforts on video content analysis, there are also many attentions on studying various video applications. After all, making the bulky and unstructured video content convenient and efficient to access, present, share, search and 160 deliver is the ultimate goal of the entire research community in this area. (Abridged) 165 Recommended Reading 1. Cheung S. and Zakhor A. Efficient video similarity measurement with video signature. IEEE Trans. Circ. Syst. Video Tech., 13(l):59-74, 2003. 170 2. Li Y. and Dorai C. SVM-based audio classification for instructional video analysis. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2004. 3. Li Y. and Kuo C.-C. Video Content Analysis Using 175 Multimodal Information: for Movie Content Extraction, Indexing and Representation. Kluwer, MA, USA, 2003. 4. Li Y., Narayanan S., and Kuo C.-C. Content-based movie analysis and indexing based on audiovisual 180 cues. IEEE Trans. Circ. Syst. Video Tech., 14(8): 1073-1085, 2004. 5. Mahmood T.S. and Srinivasan S. Detecting topical events in digital video. In Proc. 8th ACM Int. Conf. on Multimedia, 2000, pp. 85-94. 185 6. Mitchell J., Pennebaker W., Fogg C, and LeGall D. MPEG Video Compression Standard. Chapman & Hall, New York, NY, USA, 1992. 7. MPEG Requirements Group, MPEG-7 Applications Document v.8, ISO/MPEG N2860, MPEG Vancouver 190 Meeting, July 1999. 8. MPEG Requirements Group, MPEG-7 Context, Objectives and Technical Roadmap, ISO/MPEG N2861, MPEG Vancouver Meeting, July 1999. 9. MPEG Requirements Group, MPEG-7 195 Requirements Document V.15, ISO/MPEG N4317, MPEG Sydney Meeting, July 2001. 10. Nock H, Adams W., Iyengar G., Lin C, Naphade M., Neti C, Tseng B., and Smith J. User-trainable video annotation using multimodal cues. In Proc. 26th 200 Annu. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 403-404. 11. Oh J. and Hua K. Efficient and cost-effective techniques for browsing and indexing large video 205 databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 415^-26. 12. Pfeiffer S., Lienhart R., Fischer S., and Effelsberg W. Abstracting digital movies automatically. J. Vis. Comm. Image Represent., 7(4):345-353, 1996. 210 13. Yan R., Hauptmann A., and Jin R. Negative paeudo-relevance feedback in content-based video retrieval. In Proc. 11th ACM Int. Conf. on Multimedia, 2003, pp. 343-346. 14. Yeung M., Yeo B., and Liu B. Extracting story 215 units from long programs for video browsing and navigation. In 1996, pp. 296-305. 15. Zhang T. and Kuo C.-C. Audio content analysis for on-line audiovisual data segmentation. IEEE Trans. Speech Audio Process., 9(4):441^157, 2001. 220 16. Zheng W., Li J., Si Z., Lin F., and Zhang B. and Using highlevel semantic features in video retrieval. In Image and Video Retrieval. Springer, Berlin Heidelberg, New York, 2006, pp. 370-379. 41 Answer the following questions: 1) What is video ? 2) What is the difference between analog and digital video? 3) What makes VHS still popular for home recording? 4) What is meant by remote sensing? 5) What do the terms video representation, video content analysis, and video application refer to? 6) What do most compression formats build on? 7) What is MPEG-1? 8) What does the term "rich media" refer to? Match the following terms and their definitions: 1) video representation 2) video content analysis 3) intra-frame compression 4) inter-frame compression 5) likewise a) a way of reducing spatial redundancy b) deals with the file format c) a way to reduce temporal redundancy d) involves structuring the video e) means the same as "similarly" Mark the following statements as true or false: 1) Video was first developed for home use. 2) H.261 is a video coding standard originally designed for transmission over ISDN lines. 3) MPEG-4 is used as a high definition television standard. 4) MJPEG has been designed for use in mobile appliances. 5) MJPEG has nothing in common with MPEG-1 and MPEG-2 formats 6) A general goal of video content analysis is to facilitate convenient and linear content access. Vocabulary analysis [a'nael.a.sis] - analýza, pi. analyses appliance [a'plai.anfs] - zařízení audio ['d:.di.au] © ['ai.di.ou] - audio, zvuk audiovisual [,D:.di.au'vi3.u.al] © [,a:.di.ou-] - audiovizuální capability [.kei.pa'bil.i.ti] ©[-a.ti] - schopnost compression [kam'prej>n] - komprese consumer [kan'sjui.ma1'] © [-'sui.ma*] -spotřebitel discrete [di'skri:t] cosine ['kau.sain] ®['kou-] transform [traens'fDim] © [-'fDirm] - diskrétní kosinová transformace diverse [daťv3:s] © [ dťv?:s] - rozdílný domain [dai/'mein] ©[dou-] - doména due to st [dju:] © [ du:] - kvůli něčemu format ['fDi.maet] © f'fDir-] - formát frame [freim] - rám, rámec linear ['lin.i.ar] © [-&] - lineární market share ['ma:.kit] [fear] © ['ma:r-] [Jer] - podíl na trhu modality - modalita motion ['mau.f3n] © ['mou-] - pohyb nevertheless [,nev.a.5a'les] ©[-a*-] - nicméně non-linear [non'lin.i.a] - nelineární present ['prezent] - současný (compare the pronunciation with that of verb to present [pri'zent]) research [ri's3:tf] © [ Vi:.s3:tJ] - výzkum retrieval [n'triival] - vyhledávání, vyzvedávání sequence ['sii.kwanřs] - sekvence spatial ['spei.J3!] and temporal ['tem.psr.3l] © [-pa^.al] redundancy [ri'dAn.d3nt.si] - prostorová a časová nadbytečnost (redundance) steady ['sted.i] - stálý thus [5as] - tak, a tak to aim to do st [eim] - snažit se něco dělat, být zaměřen na dělání něčeho to browse st [brauz] - listovat, procházet něčím to capture st ['kaep.tJV] © [-tj"a*] - zachytit něco to deal with st [dial] - zabývat se něčím to develop [di'vel.ap] - vyvinout, vyvíjet to disappear [.dis.a'pia1-] © [-'pír] - zmizet, mizet to dominate ['dom.i.neit] © ['dai.ma-] - dominovat to ease st [i:z] - udělat něco jednodušší, zjednodušit to exceed st [ik'siid] - přesahovat, překročit něco to extract st [ik'straekt] - extrahovat, vytáhnout něco to facilitate st [fa'sil.i.teit] - zjednodušit něco, umožnit něco to fall into st [fDiI] © [ fail] - spadat do něčeho to involve st [inVolv] ®[-'va:lv] - zahrnovat něco to launch st [binřj] © [ lainřj] - vypustit něco, vydat něco to occupy ['ok.ju.pai] ®['a:.kju-] - zabírat, okupovat to opt for st [opt] © [a:pt] - rozhodnout se pro něco, zvolit něco to present st [pri'zent] - prezentovat něco to refer to st [ri'fa:] - odkazovat na něco, označovat něco to release st [ri'liis] - vypustit, vydat něco to simulate ['sim.ju.leit] © [-t id] - simulovat, napodobovat to witness st ['wit.nas] - být něčemu svědkem ultimate ['Al.ti.mat] © [-ta-] - konečný, nejzazší underlying [,An.da'lai.in] ®[-da*-] - základní Phrases consequently ['kDnř.si.kwant.li] © ['ka:nř-] -následně generally speaking - obecně řečeno 43 independent of st [,in.di'pen.dant] - nezávislý n něčem likewise [laik.waiz] - podobně on the contrary ['kon.tra.ri] © ['kain.tre-] - naopak on the one hand on the other hand ... - na jedné straně na druhé straně ... to play an important role in st - hrát důležitou úlohu v něčem