Contents:
1. Introduction to data mining: basic concepts, CRISP-DM, SEMMA.
2. Overview of data mining software. Introduction to the SAS system.
3. Data organization, introduction to SQL.
4. Data preparation: cleaning, transformation (WOE). SAS DATA step.
5. SAS DATA step: conditional code, loops, arrays.
6. SAS DATA step: joining tables, transposing tables.
7. Exploratory analysis: basic description of data, tables.
8. Data visualization, SAS/GRAPH, introduction to SAS EM.
9. Regression, logistic regression I.
10. Decision trees, neural networks.
11. Evaluation of a predictive model: LC (ROC), Gini, KS, Lift.
12. Introduction to the SAS macro language.
13. Customizing SAS output/reports, exporting from SAS.
14. References.
1. Introduction to data mining
What is data mining?
• Data mining (DM; in Czech also "dolování z dat" or "vytěžování dat") is an analytical methodology for extracting non-trivial, hidden, and potentially useful information from data.
Applications
• Banking: approval of loans/credit cards
  • Prediction of good customers.
• Insurance: approval of insurance contracts
  • Estimation of the probability of an insurance claim / the size of the loss.
• CRM (marketing):
  • Identification of customers who intend to switch to a competitor.
  • Cross-selling.
  • Up-selling.
• Targeted marketing:
  • Identification of likely respondents to an offer.
• Fraud detection: telecommunications, financial transactions, insurance fraud
  • Online/offline identification of fraudulent behavior.
• Medicine: effectiveness of medical care
  • Analysis of a patient's history (previous diseases and their course): finding relationships between diseases.
• Pharmacy: identification of new drugs
• Scientific data analysis:
  • Identification of new galaxies.
• Web design:
  • Finding the relationship between a visitor and the site, and changing the appearance of the pages accordingly.
• Recognition of handwritten text, speech, images.
• Supermarkets:
  • Identification of goods that are purchased together.
• Industry:
  • Automatic readjustment of controls when process parameters change.
• Sports:
  • NBA: optimization of game strategy.
• and more…
Application: product placement in supermarkets
• Goal: identify goods that are purchased together by a sufficient number of customers.
• Result: if a customer buys diapers and milk, they will very likely buy beer as well.
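A rule such as "diapers and milk, therefore beer" is usually judged by its support (share of all baskets containing all three items) and its confidence (share of diapers-and-milk baskets that also contain beer). The sketch below is only an illustration: the table work.baskets, its variable names, and its six rows are invented for this example and do not come from real store data.

```sas
/* Hypothetical toy data: one row per basket, 1 = item was bought */
data work.baskets;
   input basket_id diapers milk beer;
   datalines;
1 1 1 1
2 1 1 1
3 1 1 0
4 0 1 0
5 1 0 1
6 0 0 1
;
run;

/* Support and confidence of the rule {diapers, milk} -> {beer}.     */
/* In PROC SQL a comparison evaluates to 1 (true) or 0 (false),      */
/* so SUM() over a condition counts the rows that satisfy it.        */
proc sql;
   select
      sum(diapers=1 and milk=1 and beer=1) / count(*)
         as support,
      sum(diapers=1 and milk=1 and beer=1) / sum(diapers=1 and milk=1)
         as confidence
   from work.baskets;
quit;
```

In this toy table two of the six baskets contain all three items and three baskets contain diapers and milk, so the rule has support 2/6 and confidence 2/3.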
• Only an experienced analyst can interpret the results of an analysis correctly.
• One possible interpretation:
Sample job postings (2011):

Risk management analyst
http://kariera.homecredit.cz/cz/analytik-rizeni-rizik/job.html?id=493
Responsibilities:
• responsibility for the development of risk indicators on the assigned product portfolio (credit cards and cash loans)
• identification of risks in the portfolio of the given product
• expert assessment of non-standard loan applications
Requirements:
• university or possibly secondary education (a focus on mathematics or economics is an advantage)
• very good knowledge of MS Office (especially MS Excel)
• experience in the financial sector is an advantage
• analytical thinking
• knowledge of statistical methods and mathematics and experience with scoring models are an advantage
• very good communication, negotiation, and argumentation skills
• assertiveness, a sense for accuracy, resistance to stress

Specialist: collections strategy analyst
http://kariera.homecredit.cz/cz/specialista-analytik-strategie-vymahani/job.html?id=487
Responsibilities:
• measuring the collections process
• analyses and support for collections
• testing and support of changes
• reporting
Requirements:
• university degree, ideally in mathematics, economics, or computer science
• advanced knowledge of MS Excel and SQL
• knowledge of SAS and VBA is an advantage
• intermediate English
• experience in analysis and reporting and familiarity with collections are a big advantage
• proactive approach
• willingness to apply your experience and get involved in new things
• analytical thinking and logical reasoning
• communication skills, independence in solving tasks, reliability, diligence, resistance to stress, flexibility
Sample job postings (2011):

Business Intelligence analyst
https://www.recrutement-societegenerale.com/jpapps/kbLocal/jobs/jobview.jsp?TOKEN=ff7ec84a3a440fd12c7c783ebf&requestno=RQ00056954
Responsibilities:
• Carry out activities related to the support and development of Business Intelligence systems.
• Within development: analyze user requirements, define processes, design controlling procedures, perform data modeling, design transformation processes, prepare the user interface, and prepare the methodology for deploying new tools.
Requirements:
• University degree, or secondary education with experience in the financial sector
• Knowledge of financial mathematics
• Knowledge of banking applications is an advantage; knowledge of the KBI core banking system is welcome
• Excellent knowledge of MS Office
• Knowledge of SQL
• Knowledge of accounting is welcome
• Knowledge of controlling / MIS / performance management
• Knowledge of English
• Diligence, reliability, communication and teamwork skills
• Independence, resistance to stress, flexibility

Scoring analyst
https://www.recrutement-societegenerale.com/jpapps/kbLocal/jobs/jobview.jsp?TOKEN=ff7ec84a3a440fd12c7c783ebf&requestno=RQ00054440
Responsibilities:
• Development of scoring models, creation of the model development methodology
• Back-testing of scoring functions and creation of the methodology for their use
• Statistical reporting of portfolio risk parameters and cooperation on its development
Requirements:
• University degree in mathematics or economics with knowledge of statistical methods
• Analytical skills, creativity, independence, and responsibility
• Ability to communicate in English (especially in writing)
• Experience with script-based statistical software (SAS and S-Plus are an advantage) and with databases (SQL)
• Knowledge of basic economic concepts (time value of money, provisions…)
• Knowledge of the Basel II methodology (specifically modeling of LGD, CCF, EAD) is an advantage
• Basic knowledge of banking products is an advantage
Sample job postings (2011):

Data mining analyst
http://www.onrea.com/pd/176680227?brand=g2&rcm=24045507&sourcebrand=g2&source=3&exportRCM=24045507&trackingBrand=www.koop.cz
Responsibilities:
• Building predictive models and preparing data for them
• Presenting results to expert and lay users
• Organizing data inputs for models, including assigning tasks to other departments
• Helping to build the company data warehouse on the user side
• Continuously applying new methods of data mining, reporting, and data cleaning for the company
Requirements:
• University degree in econometrics, mathematical statistics, actuarial mathematics, or similar (postgraduate students are also eligible)
• Knowledge of multivariate statistical methods
• Knowledge of at least one of the statistical packages SPSS, KXEN, RapidMiner, SAS (SAS is an advantage)
• At least basic knowledge of relational databases and SQL
• Knowledge of MS Excel at least at the level of macros / VBA (VBA is an advantage)
• Analytical and communication skills
• English at least at a technical level
• Experience in building predictive models / data mining / BI is an advantage
• Experience in insurance, telecommunications, or the financial sector is an advantage
• Experience with CRM / campaign management is an advantage

Customer analytics specialist
http://careers.peopleclick.com/careerscp/client_tmobile/external/cs/jobDetails.do?functionName=getJobDetail&jobPostId=107551&localeCode=cs#
Responsibilities:
• Analyze and interpret DWH data.
• Communicate with the people who commission analyses.
• Improve the structure of source data in line with the needs of data analyses.
• Use analytical tools with regard to the needs and development of analyses.
• Look for new approaches in data analysis.
Requirements:
• University/secondary education in economics, engineering, or mathematics
• Experience with analytical software, e.g. SPSS, SAS, Access, SQL
• Excel: excellent knowledge (database functions, macros, forms)
• Reliability, diligence
• Analytical thinking
• Communication
Sample current job postings (17 January 2012):
www.jobs.cz
Keyword "SAS":
• Risk analyst / programmer
• Actuarial and data analyst, motor insurance division
• Risk Data Analyst
• Pricing Specialist
• Fraud analyst, consumer finance products
• Statistician (Brno)
• ...
Keyword "matematika" (mathematics):
• Actuary, junior
• DWH/BI specialist
• ALM specialist / actuary
• …

ČSOB
Required qualifications and other requirements:
• experience with data mining, statistical predictive methods, or neural networks
• experience with fraud detection tools and with tools for designing and managing strategies is an advantage
• university/secondary education in mathematics, engineering, or economics
• good knowledge of MS Office, Excel, Access, SQL; excellent computer skills
• knowledge of statistical systems or data mining systems (SAS, SPSS/Clementine, …)
• communicative English is essential
• knowledge of the bank's internal information systems is an advantage

AXA
We require:
• university degree in statistics or mathematics
• English language fluently written and spoken
• excellent skills in MS Office (Excel, SQL programmer for data mining)
• experience in insurance and ability to use Pretium or SAS is advantage
• analytical and statistical skills
Sample current job postings (17 January 2012):
www.linkedin.com
• Modeling, Scoring, & Analysis Sr. Manager - CBNA Risk Management (Long Island City, NY)
• Head of level – Decision Science / Modelling (London, UK)
• Senior Credit Risk Analyst, Basel II Modeling (Detroit)
• Statistician (Dallas)
• + THOUSANDS more vacancies requiring a mathematics/economics education and knowledge of statistics, SQL, and SAS

CITY
• Master Degree with specialization in Statistics, Economics, Finance, Engineering or other quantitative fields, PhD preferred.
• 10+ years hands-on statistical risk modeling experience in financial industry with demonstrated proficiency in scorecard development.
• Diversified modeling experience in Fraud and/or Mortgage modeling strongly preferred.
• In-depth understanding of regulatory requirements, and proven experience in interacting with regulators and internal auditors.
• Strong communication and project management skills.

Santander
• Graduate degree in Statistics, Economics, Operations Research or other quantitative discipline required.
• Familiarity with logistic regression models, segmentation and variable reduction techniques, hypothesis testing, non-parametric testing, design of experiments, ANOVA, CHAID analysis and linear regression.
• SAS: SAS base, SAS/STAT, PROC SQL, SAS Macro programming, using SQL and SAS to extract data from different data sources. Ability to merge, concatenate, import/export datasets, clean data and check for data consistency and accuracy.
Data mining and the principle of induction
• Deduction preserves valid relationships:
  1. Horses are mammals.
  2. All mammals have lungs.
  3. Therefore, all horses have lungs.
• Induction adds information:
  1. All horses observed so far have lungs.
  2. Therefore, all horses have lungs.
The problem with induction
• From true facts we can derive a false statement (model).
• Example:
  • European swans are white.
  • Induction: "Swans are white" as a general rule.
  • With the discovery of Australia, black swans were discovered as well…
• Problem: the set of observations was not random and therefore not representative.
http://cs.wikipedia.org/wiki/Labu%C5%A5_%C4%8Dern%C3%A1
Data mining: support for business decisions
[Figure: a pyramid of layers; the potential to support business decisions increases toward the top.]
Layers, from top to bottom:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
History of the name
1960 Data Fishing, Data Dredging:
• used by statisticians
1989 Knowledge Discovery (KD, KDD):
• used by the artificial intelligence and machine learning community
1990 Data Mining (DM):
• used in the commercial sphere and by the database community
Other names: Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Data mining: a necessity?
The world's largest databases in 2005:
• Max Planck Inst. for Meteorology ~ 222 TB
• Yahoo ~ 100 TB
• AT&T ~ 94 TB
In 2008:
• Max Planck Inst. for Meteorology ~ 6000 TB
• Yahoo ~ 2000 TB
Other large databases:
• CIA, Amazon, Google, YouTube, AT&T, …
More at, e.g., http://www.focus.com/fyi/10-largest-databases-in-the-world/
Data mining: a necessity?
• Terabytes (10^12 bytes): data of retail chains, banks, …
• Petabytes (10^15 bytes): geographic data
• Exabytes (10^18 bytes): national databases of health records
• Zettabytes (10^21 bytes): databases of meteorological images
• Yottabytes (10^24 bytes): video databases
Data mining: a necessity?
Why data mining? Why now?
• Data are being produced.
• Data are being stored.
• Computing power is available.
• Computing power is affordable.
• Competitive pressure is very strong.
• Commercial products (DM software) are available.
Data mining vs. statistical analysis
• Data mining
  • Originally developed for expert systems that solve given problems automatically.
  • Places less emphasis on a precise understanding of the method used.
  • If something makes sense, let's use it!
  • No assumptions about the data.
  • Works even for very large data.
  • Requires understanding the problem from both the data and the business point of view.
• Statistical analysis
  • The statistical correctness of the model is tested.
    • Are the statistical assumptions of the model met?
  • Hypothesis testing.
  • Interval estimates.
  • Works with a sample of values.
  • Standard methods are not optimized for large data.
  • Requires advanced statistical knowledge.
Data mining
• A process of (semi-)automatic analysis of (large) databases to identify relationships that are:
  • valid: they hold on new data with a certain degree of confidence
  • novel: previously unknown
  • useful: they can be put to some practical use
  • understandable: the relationship found can (always) be explained in some way
Data mining is not:
• Brute-force bulk processing of data.
• Blind application of algorithms.
• Searching for relationships where none exist.
Known ≠ Interesting
• Interesting relationships are those that differ from general expectations.
• Data mining pays off precisely because it uncovers previously unknown and surprising relationships.
[Cartoon: two panels labeled 1995 and 1998, each repeating the advice "Sell milk and cereal together!"; one of them is met with "Zzzz..."]
Relationship to other disciplines
[Figure: Data Mining at the center, surrounded by database technology, statistics, machine learning, visualization, information technology, and other scientific disciplines.]
Data mining: the process
[Figure: databases -> data cleaning and data integration -> data warehouse -> data selection -> relevant data -> data transformation -> data mining -> verification of relationships]
Data Mining Methodology (2007)
Which methodology do you use for data mining?

Methodology          Votes   Share
CRISP-DM             63      42%
My own               29      19%
SEMMA                19      13%
KDD Process          11      7%
My organization's    8       5%
Other                20      14%
CRISP-DM
(CRoss Industry Standard Process for Data Mining)
1. business understanding
2. data understanding
3. data preparation
4. modeling
5. model evaluation
6. deployment of the model into the business process
http://community.udayton.edu/provost/it/training/documents/SPSS_CRISPWPlr.pdf
SEMMA
(Sample, Explore, Modify, Model, Assess)
• Sample: identify suitable training data and determine an appropriate amount of data, both in terms of the time window and the number of cases. It is also recommended to split the data into three groups:
  • Training: used to develop the model.
  • Validation: used to evaluate the model and to guard against overfitting.
  • Test: used for the final evaluation of the model. The main interest is how well the model behaves on data disjoint from the data on which it was developed.
• Explore: prepare descriptive statistics that give a basic idea of the content and quality of the underlying data. Use visualization techniques to reveal hidden trends and dependencies in the data.
• Modify: based on the previous step, consolidate the data and derive new variables. Then transform the data into a form suitable for modeling.
• Model: build the model. Commonly used techniques include neural networks, decision trees, and logistic models.
• Assess: evaluate the performance of the model and, where appropriate, put the model into practice.
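The three-way split from the Sample step can be sketched in a DATA step. This is only one possible approach: the 60/20/20 proportions, the seed, the variable name part, and the use of SASHELP.CLASS as a stand-in for a real modeling table are all assumptions made for the illustration.

```sas
/* Randomly assign each observation to one of three partitions.      */
/* SASHELP.CLASS is only a stand-in for the real modeling table.     */
data work.partitioned;
   set sashelp.class;
   call streaminit(2012);          /* fixed seed so the split is reproducible */
   u = rand('uniform');            /* random number in (0, 1) */
   length part $8;
   if u < 0.6 then part = 'train';           /* about 60 % */
   else if u < 0.8 then part = 'validate';   /* about 20 % */
   else part = 'test';                       /* about 20 % */
   drop u;
run;

/* Check how many observations fell into each partition */
proc freq data=work.partitioned;
   tables part;
run;
```

With a table this small the realized proportions will only roughly match the targets; on real data a stratified split (for example by the target variable) is often preferable.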
Phases of the DM process (1 & 2)
• Business understanding:
  • Setting business goals.
  • Setting data mining goals.
  • Setting success criteria.
• Data understanding:
  • Exploring the data and verifying their quality.
  • Finding outliers.
Phases of the DM process (3)
• Data preparation:
  • Usually takes over 90% of the total time.
  • Data collection
  • Consolidation and cleaning
    • Join tables, aggregation, missing values, …
  • Selection
    • Ignoring useless data?
    • Outliers?
    • Data sampling?
  • Visualization tools.
  • Transformation: creating new derived variables
Phases of the DM process (4)
• Modeling (model building)
  • The choice of suitable modeling techniques depends on the data mining goals that were set.
  • Modeling is mostly an iterative process that is tied to data preparation.
  • The approach differs for supervised and unsupervised learning.
Basic approaches to modeling
• Predictive: a mathematical model that predicts (with a certain accuracy) the future value/behavior of some quantity (entity).
  • Regression / classification
  • Time series analysis
• Descriptive: a mathematical model that describes historical events and the assumed or real relationships between them.
  • Cluster analysis
  • Association rules
  • Detection of deviations/change points
  • Factor analysis / principal component analysis
Classification
• Based on known data about "old" customers and their payment behavior, we are to predict the creditworthiness of a new loan applicant.
[Figure: data on previous customers (age, income, occupation, residence, customer type) are fed to a classifier, which produces a decision rule (e.g. income > x, profession = y); the rule is applied to the data of a new applicant and returns good/bad.]
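Once a decision rule of the form "income > x and profession = y" has been learned, applying it to new applicants is a simple DATA step. The sketch below is purely illustrative: the table work.applicants, its values, the threshold 25000, and the list of professions are hypothetical and stand in for whatever the classifier would actually produce.

```sas
/* Hypothetical new applicants; names and values are illustrative only */
data work.applicants;
   input applicant_id income profession $;
   datalines;
1 45000 clerk
2 18000 student
3 30000 teacher
;
run;

/* Apply a decision rule of the form "income > x and profession = y" */
data work.scored;
   set work.applicants;
   length decision $4;
   if income > 25000 and profession in ('clerk', 'teacher')
      then decision = 'good';
   else decision = 'bad';
run;

proc print data=work.scored noobs;
run;
```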
Classification methods
• Goal: predict the class Ci = f(x1, x2, ..., xn)
• Regression (linear or polynomial):
  • a*x1 + b*x2 + c = Ci
• Nearest neighbor methods (KNN)
• Decision trees
• Probabilistic models (GLM), e.g. logistic regression.
• Discriminant analysis (LDA, …)
• Neural networks
• Support vector machines (SVM)
• Bayesian models
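As an illustration of one method from the list, a logistic regression can be fitted with PROC LOGISTIC. The sketch assumes that SAS/STAT is licensed and that the sample table SASHELP.HEART is installed; the table and the three predictors are chosen only as a stand-in for a table of past customers with a binary outcome, not as a credit-scoring example from the course.

```sas
/* Logistic regression on a sample table shipped with SAS.           */
/* Status takes the values 'Alive' and 'Dead'; we model P(Dead).     */
proc logistic data=sashelp.heart;
   model Status(event='Dead') = AgeAtStart Cholesterol Systolic;
run;
```

The EVENT= option states explicitly which level of the response is being modeled, which avoids relying on the default ordering of the levels.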
Descriptive modeling
• The basic goal is to obtain comprehensive and easily understandable information from the available data.
• Sometimes it is part of the exploratory analysis that precedes predictive modeling; sometimes building a descriptive model is the main goal of the DM project.
Cluster analysis
• We are to find groups/clusters of existing customers based on their payment history so that similar clients are in the same group/cluster.
• Basic requirement: a good similarity measure (http://cs.wikipedia.org/wiki/Shluková_analýza).
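A k-means style clustering can be sketched with PROC FASTCLUS, which uses Euclidean distance as its (dis)similarity measure. The example assumes SAS/STAT and the sample table SASHELP.IRIS; the flower measurements merely stand in for customer attributes, and the choice of three clusters is an assumption.

```sas
/* Partition the observations into 3 clusters by Euclidean distance. */
/* OUT= adds the variables CLUSTER and DISTANCE to each observation. */
proc fastclus data=sashelp.iris maxclusters=3 out=work.clusters;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Compare the clusters found with the known species */
proc freq data=work.clusters;
   tables cluster*Species / norow nocol nopercent;
run;
```

Because Euclidean distance is scale-dependent, variables measured in very different units are usually standardized first (for example with PROC STDIZE).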
Source: NEPIL, M. Data mining v praxi. Brno: MU v Brně, 2007, pp. 25-38.
Supervised vs. unsupervised learning
• Supervised learning:
  • Supervision: the data (observations, measurements, etc.) are labeled with predefined/known classes.
  • New/test data are then assigned to these classes.
  • In terms of causality, the model defines a relationship between input data and output data.
• Unsupervised learning:
  • No classes are defined in advance.
  • The goal is to establish whether some classes exist in the given data.
  • In terms of causality, all data are treated as outputs. We model the dependence of the data on some unknown hidden variables.
Phases of the DM process (5)
• Model evaluation:
  • Evaluation of the model: how it behaves on test data.
  • Methods and criteria depend on the type of model:
    • E.g. a confusion matrix for classification models, mean error for regression models, …
  • Interpretation of the model: the importance and difficulty of interpretation depend strongly on the chosen modeling algorithm.
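A confusion matrix for a classification model is just a cross-tabulation of actual against predicted classes. The sketch below uses a hypothetical table work.predictions with six invented rows; in practice this table would hold the scored test data.

```sas
/* Hypothetical scored test data: actual vs. predicted class */
data work.predictions;
   input actual $ predicted $;
   datalines;
good good
good good
good bad
bad bad
bad good
bad bad
;
run;

/* Confusion matrix: counts of actual (rows) by predicted (columns) */
proc freq data=work.predictions;
   tables actual*predicted / norow nocol nopercent;
run;
```

In this toy table four of the six cases lie on the diagonal, so the accuracy is 4/6.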
Phases of the DM process (6)
• Deployment
  • It must be decided how the results are to be used.
    • Who will use them?
    • How often will they be used?
  • Deploying data mining results by means of:
    • Scoring a database.
    • Using the results through business rules.
    • Interactive online scoring.
    • …
SAS: a brief introduction
• Two basic SAS interfaces:
  • SAS windowing environment
  • SAS Enterprise Guide (GUI)
[Screenshot of the SAS windowing environment: SAS Explorer window, Program Editor window, SAS Output; tabs: Output, Log, Editor.]
[Screenshot of SAS Enterprise Guide: Process Flow, Task List, SAS Output.]
• In Enterprise Guide, the process flow is built by clicking and dragging with the mouse.
SAS Enterprise Guide (EG) Interface
• EG automatically generates code, which can then be edited further.
SAS Help
Use the SAS Enterprise Guide Help facility or SAS OnlineDoc for additional direction on SAS Enterprise Guide or the SAS programming language. Go to support.sas.com and select Product Documentation > Base SAS.
Reproduced with permission of SAS Institute Inc., Cary, NC, USA.
• For examples of organizations that use SAS, see http://www.sas.com/offices/europe/czech/reference/
SAS on the web
• Michal Kulich: Malý manuál uživatele SASu (A short SAS user's manual)
  http://www.karlin.mff.cuni.cz/~kulich/sas/SASMain.html
• http://en.wikipedia.org/wiki/SAS_%28software%29
• Phil Spector: An Introduction to the SAS System
  http://www.stat.berkeley.edu/classes/s100/sas.pdf
• Patric McLeod: Introduction to SAS 9
  http://www.unt.edu/rss/class/sas1/
2. Software, basics of working in SAS
Data mining software
• About 20 to 30 vendors
• Main players on the market:
  • Clementine (now IBM SPSS Modeler, formerly PASW Modeler),
  • IBM's Intelligent Miner,
  • SGI's MineSet,
  • SAS's Enterprise Miner.
More at, e.g.:
• http://www.kdnuggets.com/software/
• http://www.dmoz.org/Computers/Software/Databases/Data_Mining/Public_Domain_Software/
• http://dir.yahoo.com/Business_and_Economy/Business_to_Business/Computers/Software/Databases/Data_Mining/
Software (other)
AcaStat, ADaMSoft, Analyse-it, ASReml, Auguri, BioStat, BrightStat, Dataplot, EasyReg, Epi Info, EViews, Excel,
GAUSS, GenStat, Golden Helix, gretl, JMP, MacAnova, Mathematica, Matlab, MedCalc, modelQED, Minitab,
MRDCL, NCSS, OpenEpi, Origin, Ox programming language, OxMetrics, Partek, Primer, PSPP, R, R Commander,
RATS, RKWard, SalStat, SAS, SOCR, Stata, Statgraphics, STATISTICA, StatIt, StatPlus, SPlus, SPSS,
StatsDirect, Statistix, SYSTAT, The Unscrambler, UNISTAT, VisualStat, Winpepi, WinSPC, XLStat, XploRe
Software: SAS
• www.sas.com
SAS
• The company SAS Institute
  • Founded in 1976 in a university environment
  • Today: the largest privately held software company in the world (more than 11,000 employees)
  • over 45,000 installations
  • about 9 million users in 118 countries
  • around 1,000 academic customers in the USA (SAS is used by most colleges, universities, and research institutions)
SAS
Competition for the best student thesis or paper
• A bachelor's, master's, or doctoral thesis, or a semester or year-long project that uses SAS, can be entered.
• 1st place: airline tickets of the winner's choice worth CZK 15,000.
2010 edition:
• 1st place: attendance at SAS Global Forum in Las Vegas. The winner's airfare, accommodation, and registration fee were covered.
http://www.sas.com/offices/europe/czech/academic/soutez.html
http://www.sas.com/offices/europe/czech/academic/poster.html
SAS
• Statistical analysis:
  • Descriptive statistics
  • Analysis of contingency (frequency) tables
  • Regression, correlation, and covariance analysis
  • Logistic regression
  • Analysis of variance
  • Hypothesis testing
  • Discriminant analysis
  • Cluster analysis
  • Survival analysis
  • …
• Time series analysis:
  • Regression models
  • Models with seasonal factors
  • Autoregressive models
  • ARIMA
  • Exponential smoothing methods
  • …
SAS
• More about SAS: http://www.sas.com/offices/europe/czech/
• (incomplete) list of commercial companies that use SAS:
  http://www.sas.com/offices/europe/czech/reference/list.html
• about the academic program:
  http://www.sas.com/offices/europe/czech/academic/index.html
• about the SAS Forum conference:
  http://www.sas.com/reg/offer/cz/2010_sas_forum_2010
  http://www.sas.com/reg/offer/cz/2011_sasforum
Software: SPSS
• www.spss.cz
SPSS
• More about IBM SPSS / PASW Modeler 13 (formerly Clementine): http://www.spss.cz/ibmspss_modeler.htm
• (incomplete) list of customers: http://www.spss.cz/zakaznici.htm
• Academic program: http://www.spss.com/academic/
Software: Statistica
• www.statistica.cz
Statistica
• More about Statistica Data Miner: http://www.statistica.cz/produkty/5-dataminingove-nastroje/21-statistica-data-miner/detail/
• (incomplete) list of customers: http://www.statsoft.com/customers/
• Academic program: http://www.statsoft.com/academic/
• Petra Beranová, a brief manual for operating the STATISTICA program: http://www.statsoft.cz/download/soubory/STATISTICA_manual.pdf
SAS Programs
• A SAS program is a sequence of steps that the user submits for execution.
• DATA steps are typically used to create SAS data sets.
• PROC steps are typically used to process SAS data sets (that is, generate reports and graphs, edit data, and sort data).
[Diagram: Raw Data -> DATA Step -> SAS Data Set; SAS Data Set -> PROC Step -> Report]
SAS Programs

   /* DATA step */
   data work.clubmembers work.nonclub;
      set orion.customer;
      if Customer_Type_ID = 3010
         then output work.nonclub;
      else output work.clubmembers;
   run;

   /* PROC step */
   proc print data=work.nonclub;
      title "Non Club Members";
      var Country Gender Customer_Name;
   run;
Step Boundaries
SAS steps begin with either of the following:
• DATA statement
• PROC statement
SAS detects the end of a step when it encounters one of the following:
• a RUN statement (for most steps)
• a QUIT statement (for some procedures)
• the beginning of another step (DATA statement or PROC statement)
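PROC SQL is one example of a procedure that ends with QUIT and not RUN. The short sketch below assumes the sample table SASHELP.CLASS is available; it is not part of the course data.

```sas
proc sql;
   select Name, Age
   from sashelp.class
   where Age > 14;
quit;   /* QUIT ends the PROC SQL step */
```

Each SQL statement is executed as soon as it is submitted, so the step stays open for further statements until QUIT (or the start of another step) is reached.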
Step Boundaries

   data work.clubmembers work.nonclub;
      set orion.customer;
      if Customer_Type_ID = 3010
         then output work.nonclub;
      else output work.clubmembers;
   run;

   proc print data=work.clubmembers;

   proc print data=work.nonclub;
      title "Non Club Members";
      var Country Gender Customer_Name;
   run;

The first PROC PRINT step has no RUN statement; it ends where the next PROC statement begins.
Submitting a SAS Program
• When you execute a SAS program, the results generated by SAS are divided into two major parts:
  • SAS log: contains information about the processing of the SAS program, including any warning and error messages.
  • SAS output: contains reports generated by SAS procedures and DATA steps.
• The Workspace includes tabs containing both the log and output, while the Process Flow, by default, displays icons only for the output.
[Screenshots: SAS Log; PROC PRINT Output]
SAS Terminology
• SAS documentation and text in the SAS windowing environment use the following terms interchangeably:

   SAS Data Set   =   SAS Table
   Variable       =   Column
   Observation    =   Row
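The terminology can be checked on any data set. A minimal sketch, assuming the sample table SASHELP.CLASS is installed: PROC CONTENTS reports the number of observations (rows) and lists the variables (columns), and PROC PRINT shows the first few rows.

```sas
/* Describe the data set: number of observations, list of variables */
proc contents data=sashelp.class;
run;

/* Print the first 5 observations (rows) */
proc print data=sashelp.class(obs=5);
run;
```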
SAS Syntax Rules
SAS statements have these characteristics:
• usually begin with an identifying keyword
• always end with a semicolon

   data work.clubmembers work.nonclub;
      set orion.customer;
      if Customer_Type_ID = 3010
         then output work.nonclub;
      else output work.clubmembers;
   run;

   proc print data=work.nonclub;
      title "Non Club Members";
      var Country Gender Customer_Name;
   run;
SAS Syntax Rules
SAS statements are free-format.
• One or more blanks or special characters can be used to separate words.
• Statements can begin and end in any column.
• A single statement can span multiple lines.
• Several statements can be on the same line.
Unconventional Spacing
data work.clubmembers work.nonclub;
set orion.customer;
if Customer_Type_ID = 3010
then output work.nonclub;
else output work.clubmembers;run;
proc print data=work.nonclub; run;
SAS Comments
SAS comments consist of text that SAS ignores during processing. You can use comments anywhere in a SAS program to
• document the purpose of the program
• explain segments of the program
• mark SAS code as non-executing text.
Two methods of commenting are shown below:

   /* comment */
   * comment ;
SAS Comments: Examples

   /* Split data based on membership */
   data work.clubmembers work.nonclub;
      set orion.customer;
      if Customer_Type_ID = 3010
         then output work.nonclub;
      else output work.clubmembers;
   run;

   proc print data=work.nonclub;
      title "Non Club Members";
      *var Country Gender Customer_Name;
   run;
Syntax Errors
Syntax errors occur when program statements do not conform to the rules of the SAS language.
Examples of syntax errors:
• misspelled keywords
• unmatched quotation marks
• missing semicolons
• invalid options
When SAS encounters a syntax error, SAS prints a warning or an error message to the log.

   ERROR 22-322: Syntax error, expecting one of the following:
                 a name, a quoted string, (, /, ;, _DATA_, _LAST_,
                 _NULL_.
How Do You Include Data in a Project?
Selecting File > Open > Data adds a shortcut to a SAS data source in the project.
How Do You Include Data in a Program?
One possibility is to include the full path and filename each time that a SAS data set is referenced.

   data "s:\workshop\cust_age.sas7bdat";
      set "s:\workshop\customer.sas7bdat";
      /*Calculate each customer's age*/
      Age=int(yrdif(Birth_Date,today(),"actual"));
   run;

   proc print data="s:\workshop\cust_age.sas7bdat";
      var Customer_Name Gender Country Age;
      title "Customer Listing";
   run;

(ep02d03.sas)
SAS Libraries
You can think of a SAS library as a drawer in a filing cabinet and a SAS data set as one of the file folders in the drawer.
[Figure: a filing cabinet; the drawers are labeled Libraries and the folders Files.]
Assigning a Libref
Regardless of which host operating system you use, you identify SAS libraries by assigning a library reference name (libref) to each library.
This libref can serve as a shortcut in SAS programs in place of the full path or filename.
SAS Libraries
When a SAS session starts, SAS automatically creates one temporary and at least one permanent SAS library that you can access:
• work - temporary library (contents are deleted when SAS closes)
• sasuser - permanent library (contents are permanently saved)
SAS Libraries
• You can also create and access your own permanent libraries.
• orion - permanent library
Assigning a Libref
You can use the LIBNAME statement to assign a libref
to a SAS library. The LIBNAME statement is a global statement.
General form of the LIBNAME statement:
LIBNAME libref 'SAS-data-library' <options>;
The rules for naming a libref are as follows:
• must be 8 or fewer characters
• must begin with a letter or underscore
• remaining characters are letters, numbers, or underscores
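A minimal sketch of how a libref is typically assigned, inspected, and released. The libref orion and the path s:\workshop are the ones used elsewhere in these slides; the folder must exist on your system for the statements to succeed.

```sas
/* Assign the libref (the folder must already exist) */
libname orion "s:\workshop";

/* Write the attributes of the library to the log */
libname orion list;

/* List the members of the library without per-table detail */
proc contents data=orion._all_ nods;
run;

/* Release the libref when it is no longer needed */
libname orion clear;
```

Because LIBNAME is a global statement, it stays in effect until it is cleared, reassigned, or the SAS session ends.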
Two-Level SAS Filenames
Every SAS file has a two-level name: libref.filename
• The first name (libref) refers to the library.
• The second name (filename) refers to the file in the library.
The data set orion.sales is a SAS file in the orion library.
How Do You Include Data in a Program?
• Use libraries.
libname orion "s:\workshop";

data work.cust_age;
   set orion.customer;
   /* Calculate each customer's age */
   Age=int(yrdif(Birth_Date,today(),"actual"));
run;

proc print data=work.cust_age;
   var Customer_Name Gender Country Age;
   title "Customer Listing";
run;
Temporary SAS Filename
The default libref is work if the libref is omitted, so work.cust_age and cust_age refer to the same data set.
libname orion "s:\workshop";

data work.cust_age;
   set orion.customer;
   /* Calculate each customer's age */
   Age=int(yrdif(Birth_Date,today(),"actual"));
run;

proc print data=cust_age;
   var Customer_Name Gender Country Age;
   title "Customer Listing";
run;
Data import
*.sas7bdat
Five basic ways to import data:
1. Import in SAS EG
2. Import Wizard
3. PROC IMPORT
4. DATA step
5. PROC SQL
Import Wizard
• The Import Wizard is a point-and-click graphical interface that enables you to create a SAS data set from several types of external files, including the following:
  • dBASE files (*.DBF)
  • Excel spreadsheets (*.XLS)
  • Microsoft Access tables (*.MDB)
  • delimited files (*.*)
  • comma-separated values (*.CSV)
  • …
PROC IMPORT
PROC IMPORT OUT=WORK.sales
      DATAFILE="S:\Workshop\sales.xls"
      DBMS=EXCEL REPLACE;
   RANGE="Australia$";
   GETNAMES=YES;
   MIXED=NO;
   SCANTEXT=YES;
   USEDATE=YES;
   SCANTIME=YES;
RUN;
PROC IMPORT
GETNAMES=YES | NO
• determines whether SAS will use the first row of data in a Microsoft Excel worksheet or range as column names.
  YES specifies to use the first row of data in an Excel worksheet or range as column names.
  NO specifies not to use the first row of data in an Excel worksheet or range as column names. SAS generates and uses the variable names F1, F2, F3, and so on.
• The default is YES.
PROC IMPORT
MIXED=YES | NO
• specifies whether to import data with both character and numeric values and convert all data to character.
  YES specifies that all data values will be converted to character.
  NO specifies that numeric data will be missing when a character type is assigned. Character data will be missing when a numeric data type is assigned.
• The default is NO.
PROC IMPORT
SCANTEXT=YES | NO
• specifies whether to read the entire data column and use the length of the longest string found as the SAS column width.
  YES scans the entire data column and uses the longest string value to determine the SAS column width.
  NO does not scan the column and defaults to a width of 255.
• The default is YES.
PROC IMPORT
SCANTIME=YES | NO
• specifies whether to scan all row values in a date/time column and automatically determine the TIME data type if only time values exist.
  YES specifies that a column with only time values be assigned the TIME8. format.
  NO specifies that a column with only time values be assigned the DATE9. format.
• The default is NO.
PROC IMPORT
USEDATE=YES | NO
• specifies whether to use the DATE9. format for date/time values in Excel workbooks.
  YES specifies that date/time values be assigned the DATE9. format.
  NO specifies that date/time values be assigned the DATETIME16. format.
• The default is YES.
Proc import vs. Data step
PROC IMPORT OUT=WORK.MDATA1
      DATAFILE="G:\dukumenty\diplomka-data.txt"
      DBMS=CSV REPLACE;
   GETNAMES=YES;
   DATAROW=2;
RUN;

data work.mdata2;
   length
      BIRTHPLACE $ 25
      AGE $ 25
      .
      .
      .
      EDUCATION $ 25
   ;
   infile 'G:\dukumenty\diplomka-data.csv' delimiter = ';'
      DSD lrecl=3276 firstobs=2 ;
   input
      BIRTHPLACE
      AGE
      .
      .
      .
      EDUCATION
   ;
run;
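The DATA step variant above cannot be run as printed, because the variable list is elided. The following self-contained sketch shows the same technique on inline data. The table name, the variables, and the values are assumptions made for illustration, and a comma is used as the delimiter (the DSD default) instead of the semicolon in the slide.

```sas
/* Hypothetical sample data; names and values are illustrative only */
data work.people;
   /* DSD: comma-delimited, consecutive delimiters mean a missing value,
      quotation marks around a value are removed */
   infile datalines dsd truncover;
   /* Declare character lengths before the INPUT statement */
   length Birthplace $ 25 Education $ 25;
   input Birthplace $ Age Education $;
   datalines;
Praha,34,university
Brno,,secondary
"Usti nad Labem",51,primary
;
run;

proc print data=work.people;
run;
```

Compared with PROC IMPORT, the DATA step requires more typing but gives explicit control over variable types and lengths, which PROC IMPORT otherwise guesses from the data.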
Import from an SQL database
libname my_data 'C:\Scoring\SASdata\';

proc sql;
   connect to odbc as mssql
      (complete="DRIVER=SQL Server;SERVER=sqlserv;Trusted_connection=Yes ");
   create view my_data.wset_of_segments as
      select * from connection to mssql
         (select * from db1.rezac.segmenty);
   disconnect from mssql;
quit;

proc sql;
   create table my_data.set_segments as
      select *
      from my_data.wset_of_segments;
quit;
Formats (Informats)
• An informat is an instruction that SAS uses to read data values.
• A format is an instruction that SAS uses to write data values.
• SAS (in)formats have the following form:
<$>(in)format-namew.<d>
  $                 indicates a character (in)format
  (in)format-name   name of the (in)format
  w                 total width of the field to read
  .                 required delimiter
  d                 number of decimal places
Formats (Informats)
Informats by Category:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001239776.htm
Category: Description
• Character: instructs SAS to read character data values into character variables.
• Column Binary: instructs SAS to read data stored in column-binary or multipunched form into character and numeric variables.
• Date and Time: instructs SAS to read date values into variables that represent dates, times, and datetimes.
• ISO 8601: instructs SAS to read date, time, and datetime values that are written in the ISO 8601 standard into either numeric or character variables.
• Numeric: instructs SAS to read numeric data values into numeric variables.
Formats (Informats)
Formats by Category:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001263753.htm
Category: Description
• Character: instructs SAS to write character data values from character variables.
• Date and Time: instructs SAS to write data values from variables that represent dates, times, and datetimes.
• ISO 8601: instructs SAS to write date, time, and datetime values using the ISO 8601 standard.
• Numeric: instructs SAS to write numeric data values from numeric variables.
Selected Informats
8. or 8.0 reads eight columns of numeric data.
Raw Data Value   Informat   SAS Data Value
1234567          8.0        1234567
1234.567         8.0        1234.567
8.2 reads eight columns of numeric data and may insert a decimal point in the value.
Raw Data Value   Informat   SAS Data Value
1234567          8.2        12345.67
1234.567         8.2        1234.567
Selected Informats
$8. reads eight columns of character data and removes leading blanks.
Raw Data Value              Informat   SAS Data Value
JAMES with leading blanks   $8.        JAMES (leading blanks removed)
$CHAR8. reads eight columns of character data and preserves leading blanks.
Raw Data Value              Informat   SAS Data Value
JAMES with leading blanks   $CHAR8.    JAMES (leading blanks preserved)
Selected Informats
COMMA7. reads seven columns of numeric data and removes selected nonnumeric characters such as dollar signs and commas.
Raw Data Value   Informat   SAS Data Value
$12,567          COMMA7.0   12567
MMDDYY8. reads dates of the form 10/29/01.
Raw Data Value   Informat   SAS Data Value
10/29/01         MMDDYY8.   15277
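The informats described on the preceding slides can be tried without an external file by applying them with the INPUT function. This is a minimal sketch; the variable names are arbitrary, and the values written to the log should correspond to the "SAS Data Value" column of the tables above.

```sas
data _null_;
   /* INPUT(text, informat) reads a text value the way an informat
      would read it from a raw data file */
   n1  = input('1234567',    8.2);        /* no decimal point in the text: two decimals are implied */
   n2  = input('1234.567',   8.2);        /* a decimal point in the text takes precedence */
   amt = input('$12,567',    comma7.);    /* dollar sign and comma are removed */
   day = input('10/29/2001', mmddyy10.);  /* stored as a SAS date value */
   put n1= n2= amt= day=;                 /* write the stored values to the log */
run;
```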
Date formats
• Date values that are stored as SAS dates are special numeric values.
• A SAS date value is interpreted as the number of days between January 1, 1960, and a specific date.
Written date (read with an informat)      01JAN1959    01JAN1960    01JAN1961
SAS date value                            -365         0            366
Displayed date (written with a format)    01/01/1959   01/01/1960   01/01/1961
Date formats
• SAS uses date informats to read and convert dates to SAS date values.
Examples:
Raw Data Value   Informat    Converted Value
10/29/2001       MMDDYY10.   15277
10/29/01         MMDDYY8.    15277
29OCT2001        DATE9.      15277
29/10/2001       DDMMYY10.   15277
The converted value is the number of days between 01JAN1960 and 29OCT2001.
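A short sketch of the distinction between a stored date value and its displayed form. It uses only the date from the table above; the variable names are arbitrary.

```sas
data _null_;
   d1 = input('29/10/2001', ddmmyy10.);   /* read text with an informat */
   d2 = '29OCT2001'd;                     /* the same day as a date literal */
   d3 = mdy(10, 29, 2001);                /* MDY(month, day, year) */
   days_since_1960 = d1 - '01JAN1960'd;   /* date values support ordinary arithmetic */

   put d1= d2= d3= days_since_1960=;      /* unformatted: plain numbers */
   put d1=date9. d1=ddmmyy10. d1=weekdate.; /* one value, three display formats */
run;
```

A format changes only how the value is displayed; the stored number stays the same, which is why date differences can be computed by simple subtraction.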
Optimizing work with data in SAS
• For (very) large data files, it is advisable to compress and index SAS tables.
Example:
data lib1.tab2 (compress=binary index=(var1 var2));
   set lib1.tab1;
   …
run;
More at:
http://www2.sas.com/proceedings/sugi27/p023-27.pdf
http://www2.sas.com/proceedings/sugi28/003-28.pdf
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001288760.htm
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000131138.htm
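A runnable sketch of the same data set options on a small table. It assumes the sashelp.class sample table that ships with SAS; the output table name and the choice of the indexed variable are illustrative. On a table this small, compression and indexing bring no benefit and SAS may ignore the index; the point is only to show the syntax and where the result can be inspected.

```sas
/* Create a compressed copy with a simple index on Name */
data work.class_idx (compress=yes index=(Name));
   set sashelp.class;
run;

/* PROC CONTENTS reports whether the table is compressed and lists its indexes */
proc contents data=work.class_idx;
run;

/* A WHERE condition on the indexed variable is a candidate for index use */
proc print data=work.class_idx;
   where Name = 'Alice';
run;
```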
ODS – The Output Delivery System
The Output Delivery System (ODS) enables you to produce
output in a variety of formats, including HTML, RTF, PDF, and the
default SAS listing.
The ODS statements above create an HTML file, salesrep.html,
using the output produced by the PROC PRINT step.
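The ODS statements the slide refers to are not reproduced in this text. The following is a minimal sketch of what such statements typically look like; the file name salesrep.html comes from the slide, while sashelp.class is an assumed stand-in for the original input table.

```sas
ods html file='salesrep.html';   /* open the HTML destination */

proc print data=sashelp.class;   /* any procedure output is routed to the open destination */
run;

ods html close;                  /* close the destination so that the file is completed */
```

Without a path, the file is written to the current working directory of the SAS session.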
The PRINT Procedure
The PRINT procedure prints the observations in a SAS data set
and uses all or some of the variables.
The PRINT procedure above includes TITLE and FOOTNOTE
statements, which are global statements and do not need to be
enclosed in a DATA or PROC step.
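The PROC PRINT step the slide refers to is not reproduced in this text. A minimal sketch of a PRINT step with TITLE and FOOTNOTE statements follows; it assumes the sashelp.class sample table, and the title and footnote texts are illustrative.

```sas
/* Global statements: they may appear outside any step */
title    'Students Aged 14 and Over';
footnote 'Source: sashelp.class';

proc print data=sashelp.class noobs;   /* NOOBS suppresses the observation number column */
   where Age >= 14;                    /* print only some of the observations */
   var Name Sex Age;                   /* print only some of the variables */
run;

/* Titles and footnotes stay in effect until they are cleared or replaced */
title;
footnote;
```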
Program Output
Partial PROC PRINT Output (SAS Output window)

Quarter 1 Orion Sales Reps
Males Only
------------------------ Country=AU ------------------------
             First      Month of
Last Name    Name       Bonus      Bonus
Wills        Matsuoka   1          $300
Surawski     Marinus    1          $300
Shannan      Sian       1          $300
Scordia      Randal     2          $300
Pretorius    Tadashi    3          $300
Nowd         Fadi       1          $300
Magrath      Brett      1          $300

Partial PROC PRINT Output (HTML format)
3. Data organization, introduction to SQL
History of data storage
• In the past, data were stored in one large file that was accessed by indexed sequential methods. The file was indexed according to the expected types of queries. A major drawback was that information was repeated across records and the types of queries were predetermined.
datum   jmeno  prijmeni  adresa_ulice  adresa_mesto  cislo_uctu  platba    zustatek
980103  Jan    Novak     Dlouha 5      Praha 1       9945371     100,00    100,00
980105  Jan    Novak     Dlouha 5      Praha 1       9945371     1500,00   1600,00
980106  Jan    Novak     Dlouha 5      Praha 1       9945371     -1500,00  50,00
980106  Karel  Nemec     Lucni 4       Praha 2       24867134    3000,00   6000,00
980107  Karel  Nemec     Lucni 4       Praha 2       24867134    -4000,00  2000,00
980108  Jan    Novak     Dlouha 5      Praha 1       9945371     -150,00   -100,00
980111  Karel  Nemec     Lucni 4       Praha 2       24867134    5000,00   7000,00
Relational database
Table klient: id_klient, jmeno, prijmeni, adresa_ulice, adresa_mesto, …
Table ucet: id_ucet, id_klient, …
Table transakce: id_transakce, id_ucet, datum, platba, zustatek, …

SELECT klient.jmeno, klient.prijmeni, klient.adresa_ulice,
       klient.adresa_mesto, ucet.cislo_uctu, transakce.zustatek
FROM klient, ucet, transakce
WHERE klient.id_klient = ucet.id_klient
  AND transakce.id_ucet = ucet.id_ucet
  AND transakce.zustatek < 100
ORDER BY klient.adresa_mesto;
Relational database
• A relational database is a database based on the relational model. The term is often used not only for the database itself but also for its concrete software implementation.
• A relational database is built on tables whose rows are usually understood as records; some of their columns (so-called foreign keys) store information about relations between individual records, in the mathematical sense of the word.
• The term relational database was defined by Edgar F. Codd in 1970.
• Ways of posing queries:
  • QBE (query by example)
  • SQL (structured query language)
• According to relational theory, all operations on data can be carried out with a small set of basic operations (union, Cartesian product, difference, selection, projection, and join); the remaining operations are merely combinations of these.
Relational database
• Relational databases are built on database tables. Their columns are called attributes or fields, and the rows of a table are records. Each attribute has a specific data type, its domain. A row is a cut across the columns of the table and holds the actual data. A particular table thus realizes a subset of the Cartesian product of the possible values of all its columns, that is, a relation.
• Primary key
  • The primary key is a unique identifier of a record (a table row). It may be a single column or a combination of several columns chosen so that uniqueness is guaranteed. Key fields must contain a value, i.e. the undefined empty value NULL must not occur in them. In practice, artificial keys are often used today: numeric or alphabetic identifiers, where every new record receives an identifier different from those of all previous records (the uniqueness requirement). These are usually integer sequences in which each new record receives, typically fully automatically, a number one higher than that of the last inserted record (so the numbering of records increases over time).
• Foreign key
  • Another important concept is the foreign key. It expresses relationships between database tables. It is a field or group of fields that lets us identify which records from different tables belong together.
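The primary and foreign keys described above can be declared as integrity constraints in PROC SQL. The sketch below uses the table names klient and ucet from the earlier schema; the column lengths, the constraint names, and the inserted rows are assumptions made for illustration.

```sas
proc sql;
   /* Parent table: one row per client, identified by its primary key */
   create table work.klient
      (id_klient num,
       jmeno     char(20),
       prijmeni  char(20),
       constraint pk_klient primary key (id_klient));

   /* Child table: every account must point to an existing client */
   create table work.ucet
      (id_ucet   num,
       id_klient num,
       constraint pk_ucet primary key (id_ucet),
       constraint fk_ucet_klient foreign key (id_klient)
          references work.klient);

   insert into work.klient values (1, 'Jan', 'Novak');
   insert into work.ucet   values (10, 1);

   /* The following insert would be rejected, because no client 99 exists:
      insert into work.ucet values (11, 99); */

   /* Show the constraints defined on the child table in the log */
   describe table constraints work.ucet;
quit;
```

The sketch is meant to be run once in a fresh session; rerunning it fails on the CREATE TABLE statements because the constrained tables already exist.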
Relational database - relationships between tables
• Relationships link data that belong together and are stored in different database tables. In principle we distinguish four types of relationship.
  • None: there is no connection between the data in the tables, so no relationship is defined.
  • 1:1 is used when a record corresponds to exactly one record in another database table and vice versa. This relationship is used only rarely, because there is usually no compelling reason not to place such records in a single table. One of the few uses is to make very large tables easier to read. As an illustration, consider the relationship driver - car: at any one moment (a discrete point in time) one driver drives exactly one car, and one car is driven by exactly one driver.
Relational database - relationships between tables
  • 1:N assigns several records from another table to one record. It is the most widely used type of relationship, because it matches many real-life situations. A real example is the relationship bus - passenger: at any one moment a passenger travels on exactly one bus, while one bus can carry several passengers at once.
  • M:N is less common. It allows several records from one table to be associated with several records from the other table. In database practice this relationship is usually implemented as a combination of two relationships, 1:N and 1:M, that point to an auxiliary table composed of the two keys involved (a third, so-called junction table). A real-life example is the relationship product - property: a product can have several properties, and one property can belong to several products. In real life, however, there are a great many M:N relationships, among other reasons because there is often a practical need to keep the history of such relationships over time (over a longer period one driver drives several different cars, and one car may have several different drivers).
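A minimal sketch of the M:N relationship product - property implemented through a junction table, as described above. All table names, column names, and values are assumptions made for illustration.

```sas
data work.product;
   input id_product name $;
   datalines;
1 table
2 chair
;
run;

data work.property;
   input id_property descr $;
   datalines;
1 wooden
2 folding
;
run;

/* Junction table: one row per (product, property) pair */
data work.product_property;
   input id_product id_property;
   datalines;
1 1
2 1
2 2
;
run;

/* Resolve the M:N relationship by joining through the junction table */
proc sql;
   select p.name, r.descr
   from work.product as p,
        work.product_property as pp,
        work.property as r
   where p.id_product = pp.id_product
     and pp.id_property = r.id_property
   order by p.name, r.descr;
quit;
```

Each of the two outer tables is in a 1:N relationship with the junction table, which is exactly the decomposition described in the text.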
Glossary
 ODS Operational Data Store
 DWH DataWareHouse
 DataMart
 Meta Data
 BI Business Intelligence
 OLAP On Line Analytical Processing
 OLTP On Line Transaction Processing
 ETL Extract, Transform, Load
 ELT Extract, Load, Transform
 EAI Enterprise Application Integration
 ERP Enterprise Resource Planning
 DBMS Database Management System
 SQL Structured Query Language
ODS: Short for operational data store, a type of database that
serves as an interim area for a data warehouse in order to store
time-sensitive operational data that can be accessed quickly and
efficiently. In contrast to a data warehouse, which contains large
amounts of static data, an ODS contains small amounts of
information that is updated through the course of business
transactions. An ODS will perform numerous quick and simple
queries on small amounts of data, such as acquiring an account
balance or finding the status of a customer order, whereas a data
warehouse will perform complex queries on large amounts of data.
An ODS contains only current operational data while a data
warehouse contains both current and historical data.
DWH: Abbreviated DW, a collection of data designed to support management
decision making. Data warehouses contain a wide variety of data that present a
coherent picture of business conditions at a single point in time.
Development of a data warehouse includes development of systems to extract
data from operating systems plus installation of a warehouse database system
that provides managers flexible access to the data.
The term data warehousing generally refers to the combination of many
different databases across an entire enterprise. Contrast with data mart.
DataMart: A database, or collection of databases, designed to help
managers make strategic decisions about their business. Whereas a
data warehouse combines databases across an entire enterprise,
data marts are usually smaller and focus on a particular subject or
department. Some data marts, called dependent data marts, are
subsets of larger data warehouses.
BI: Most companies collect a large amount of data from their business
operations. To keep track of that information, a business would need to
use a wide range of software programs, such as Excel, Access and different
database applications for various departments throughout their organization.
Using multiple software programs makes it difficult to retrieve information in a
timely manner and to perform analysis of the data.
The term Business Intelligence (BI) represents the tools and systems that play
a key role in the strategic planning process of the corporation. These systems
allow a company to gather, store, access and analyze corporate data to aid in
decision-making. Generally these systems will illustrate business intelligence in
the areas of customer profiling, customer support, market research, market
segmentation, product profitability, statistical analysis, and inventory and
distribution analysis to name a few.
Meta Data: Data about data. Metadata describes how and when
and by whom a particular set of data was collected, and how the
data is formatted. Metadata is essential for understanding
information stored in data warehouses and has become increasingly
important in XML-based Web applications.
A Database Management System (DBMS) is a set of computer programs
that controls the creation, maintenance, and use of a database. Details on
http://en.wikipedia.org/wiki/Database_management_system
SQL (sometimes pronounced "es-kyoo-el", sometimes "sequel") is a
standardized query language used for working with data in
relational databases. SQL stands for Structured Query Language.
ETL: Short for extract, transform, load, three database functions
that are combined into one tool to pull data out of one database
and place it into another database.
• Extract: the process of reading data from a database.
• Transform: the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.
• Load: the process of writing the data into the target database.
ETL is used to migrate data from one database to another, to form
data marts and data warehouses and also to convert databases
from one format or type to another.
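The three ETL steps can be sketched in SAS with two DATA steps and PROC APPEND. This is only an illustration of the idea on a toy scale: it assumes the sashelp.class sample table as the source, and the target table work.class_mart, the derived variable names, and the transformation rules are hypothetical.

```sas
/* Extract: read the rows from the source table */
data work.stage;
   set sashelp.class;
run;

/* Transform: convert units and recode a value by simple rules */
data work.transformed;
   set work.stage;
   Height_cm = round(Height * 2.54, 0.1);        /* inches to centimetres */
   Weight_kg = round(Weight * 0.45359237, 0.1);  /* pounds to kilograms */
   length Sex_label $ 6;
   if Sex = 'F' then Sex_label = 'Female';
   else if Sex = 'M' then Sex_label = 'Male';
   drop Height Weight Sex;
run;

/* Load: append the transformed rows to the target table
   (PROC APPEND creates the target if it does not exist yet) */
proc append base=work.class_mart data=work.transformed;
run;
```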
EAI: Acronym for enterprise application integration. EAI is the unrestricted
sharing of data and business processes throughout the networked applications
or data sources in an organization. Early software programs in areas such as
inventory control, human resources, sales automation and database
management were designed to run independently, with no interaction between
the systems. They were custom built in the technology of the day for a specific
need being addressed and were often proprietary systems. As enterprises grow
and recognize the need for their information and applications to have the ability
to be transferred across and shared between systems, companies are investing
in EAI in order to streamline processes and keep all the elements of the
enterprise interconnected.
OLAP: Short for Online Analytical Processing, a category of software
tools that provides analysis of data stored in a database. OLAP tools
enable users to analyze different dimensions of multidimensional data.
For example, it provides time series and trend analysis views. OLAP
often is used in data mining.
The chief component of OLAP is the OLAP server, which sits between a
client and a database management system (DBMS). The OLAP server
understands how data is organized in the database and has special
functions for analyzing the data. There are OLAP servers available for
nearly all the major database systems.
OLTP: Short for On-Line Transaction Processing. Same as transaction
processing.
Transaction processing: A type of computer processing in which the
computer responds immediately to user requests. Each request is considered to
be a transaction. Automatic teller machines for banks are an example of
transaction processing.
The opposite of transaction processing is batch processing, in which a batch of
requests is stored and then executed all at one time. Transaction processing
requires interaction with a user, whereas batch processing can take place
without a user being present.
ERP: Short for enterprise resource planning, a business management system
that integrates all facets of the business, including planning, manufacturing,
sales, and marketing. As the ERP methodology has become more popular,
software applications have emerged to help business managers implement ERP
in business activities such as inventory control, order tracking, customer
service, finance and human resources.
Data warehouse
• Definition (W. H. Inmon, 1996):
A data warehouse is a
  • subject-oriented
  • integrated
  • time-variant
  • non-volatile
collection of data that serves to support decision making.
Data warehouse
• the initial concept dates from the early 1980s
• arose from the need for easy access to a structured store of high-quality data
• helps obtain answers for better decision making
• enables data to be used for querying, reporting, and analysis
Structure of a data warehouse
• three-tier architecture:
  • data warehouse
  • application layer
  • presentation layer
• physically centralized or distributed
Data warehouse
[Figure: data warehouse architecture. Data sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed (monitor & integrator, metadata) into data storage (data warehouse, data marts), which serves the OLAP server (OLAP engine) and the front-end tools (analysis, query, reports, data mining).]
Data models
• Star
• Snowflake
• Starflake
• Constellation
Example of a star schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension time: time_key, day, day_of_the_week, month, quarter, year
Dimension item: item_key, item_name, brand, type, supplier_type
Dimension branch: branch_key, branch_name, branch_type
Dimension location: location_key, street, city, province_or_street, country
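A typical query against a star schema joins the fact table to the dimensions it needs and aggregates the measures. The sketch below builds a drastically reduced version of the schema above (two dimensions, two measures); the table names and all data values are assumptions made for illustration.

```sas
/* Dimension tables (reduced to a few columns) */
data work.time_dim;
   input time_key quarter $ year;
   datalines;
1 Q1 2010
2 Q2 2010
;
run;

data work.item_dim;
   input item_key item_name $ brand $;
   datalines;
1 TV Acme
2 PC Acme
;
run;

/* Fact table: foreign keys to the dimensions plus the measures */
data work.sales_fact;
   input time_key item_key units_sold dollars_sold;
   datalines;
1 1 10 5000
1 2 4 3200
2 1 7 3500
2 2 6 4800
;
run;

/* Join the fact table to both dimensions and aggregate the measures */
proc sql;
   select t.year, t.quarter, i.item_name,
          sum(f.units_sold)   as total_units,
          sum(f.dollars_sold) as total_dollars format=dollar10.
   from work.sales_fact as f,
        work.time_dim   as t,
        work.item_dim   as i
   where f.time_key = t.time_key
     and f.item_key = i.item_key
   group by t.year, t.quarter, i.item_name;
quit;
```

In a snowflake schema the same query needs additional joins, because some dimension attributes are moved to separate tables.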
Example of a snowflake schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension time: time_key, day, day_of_the_week, month, quarter, year
Dimension item: item_key, item_name, brand, type, supplier_key
  Dimension supplier: supplier_key, supplier_type
Dimension branch: branch_key, branch_name, branch_type
Dimension location: location_key, street, city_key
  Dimension city: city_key, city, province_or_street, country
Example of a constellation schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
Dimension time: time_key, day, day_of_the_week, month, quarter, year
Dimension item: item_key, item_name, brand, type, supplier_type
Dimension branch: branch_key, branch_name, branch_type
Dimension location: location_key, street, city, province_or_street, country
Dimension shipper: shipper_key, shipper_name, location_key, shipper_type
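Example of a data cube
[Figure: data cube with the dimensions Product (TV, VCR, PC, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (USA, Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in the USA.]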
Data cuboids corresponding to the data cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product, date; product, country; date, country
3-D (base) cuboid: product, date, country
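In SAS, the whole lattice of cuboids can be produced in one pass with PROC SUMMARY: when the NWAY option is omitted, the output data set contains one group of rows for every combination of the CLASS variables. This is a sketch on made-up data; the table and variable names and all values are assumptions, and the quarter stands in for the date dimension.

```sas
data work.sales;
   input product $ quarter $ country $ units;
   datalines;
TV Q1 USA 10
TV Q2 USA 12
TV Q1 Canada 4
PC Q1 USA 7
PC Q2 Canada 5
VCR Q2 Mexico 3
;
run;

/* Without NWAY, every combination of the CLASS variables is summarized.
   _TYPE_ identifies the combination: with three CLASS variables,
   _TYPE_=0 is the grand total (apex cuboid) and _TYPE_=7 is the
   full three-way breakdown (base cuboid). */
proc summary data=work.sales;
   class product quarter country;
   var units;
   output out=work.cube (drop=_freq_) sum=units_sold;
run;

proc print data=work.cube;
run;
```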
Typical OLAP operations
• Roll up (drill up): summarization of data
  • Moving one level up in a hierarchy, or reducing a dimension (e.g. from a cube to a square).
• Drill down (roll down): the opposite of roll up; we are interested in greater detail
  • Moving from a higher level of summarization to a lower one, or introducing new data dimensions.
• Slice and dice:
  • Selecting a data subspace.
• Other operations:
  • drill across: involving more than one data table (cube)
  • drill through: going through the base level of the data cube back to the underlying relational tables (using SQL)
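On relational data, the operations above correspond to changing the GROUP BY and WHERE clauses of a query. The following sketch expresses them in PROC SQL on made-up data; the table and variable names and all values are assumptions made for illustration.

```sas
data work.sales;
   input product $ quarter $ country $ units;
   datalines;
TV Q1 USA 10
TV Q2 USA 12
TV Q1 Canada 4
PC Q1 USA 7
PC Q2 Canada 5
VCR Q2 Mexico 3
;
run;

proc sql;
   title 'Detailed level: units by country and quarter';
   select country, quarter, sum(units) as total_units
   from work.sales
   group by country, quarter;

   title 'Roll up: the quarter dimension is aggregated away';
   select country, sum(units) as total_units
   from work.sales
   group by country;

   title 'Slice: one value of one dimension (country = USA)';
   select product, quarter, sum(units) as total_units
   from work.sales
   where country = 'USA'
   group by product, quarter;

   title 'Dice: a sub-cube selected on two dimensions';
   select product, quarter, country, units
   from work.sales
   where country in ('USA', 'Canada') and quarter = 'Q1';
quit;
title;
```

Drill down is the reverse of the roll up shown here: going from the second query back to the first adds the quarter to the GROUP BY clause.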
Architecture of OLAP servers
• Relational OLAP (ROLAP)
  • Uses a relational or extended relational DBMS to store and manage the data of the data warehouse, and an OLAP middle layer to supply the missing pieces.
  • Includes the optimization capabilities of the DBMS, implementation of aggregation navigation logic, and additional tools and services.
• Multidimensional OLAP (MOLAP)
  • Technology based on multidimensional data arrays (including techniques for sparse matrices).
  • Fast indexing of precomputed summarized data.
• Hybrid OLAP (HOLAP)
  • Flexible for the user, i.e. low level: relational, high level: arrays.
• Specialized SQL servers
  • Specialized support for SQL queries over star/snowflake schemas.
ROLAP
• Data are stored in a relational database; they are not duplicated, but they cannot be accessed without a connection to the source database.
• OLAP queries are translated into ordinary SQL queries, which can be a disadvantage (limited capabilities of SQL, slower response).
• Suitable only for a limited amount of data.
MOLAP
• "Traditional" OLAP.
• Data are stored in multidimensional cubes outside the relational database. They are therefore duplicated, and they can be accessed even without a connection to the original data source.
• The main advantage is fast query response. Everything is precomputed and stored when the cubes are built.
HOLAP
• keeps the original data in relational tables and stores aggregations in a multidimensional format
• provides a link to the large volumes of data in relational tables
• has the advantage of the faster performance of multidimensionally stored aggregations
Building a data warehouse
• the "big bang" method:
  • analysis of the enterprise's requirements
  • creation of the enterprise data warehouse
  • creation of data marts
• the incremental (evolutionary) method
Loading a data warehouse
• initial load + regular updates
• loading by means of data pumps
• ETL procedures:
  • extraction
  • transformation
  • loading
What is SQL?
The SQL procedure uses Structured Query
Language to perform the following tasks:
• retrieve and manipulate SAS data sets
• create and delete SAS data sets
• generate reports
• add or modify values in a SAS data set
• add, modify, or drop columns in a SAS data set
Introduction to SQL
• General form of an SQL procedure query to generate output:
PROC SQL;
   SELECT variables
   FROM SAS-data-set;
Introduction to SQL
• Create a listing report of product activity.
• Step 1: Invoke the SQL procedure.
proc sql;
• Step 2: Identify the variables to display on the report. Separate the variables with commas.
proc sql;
   select CustomerID, CustomerFirstName,
          CustomerLastName
Introduction to SQL
• Step 3: Identify the input data set.
proc sql;
   select CustomerID, CustomerFirstName,
          CustomerLastName
   from univ.mastercustomers;
• Step 4: End the procedure with a QUIT statement.
proc sql;
   select CustomerID, CustomerFirstName,
          CustomerLastName
   from univ.mastercustomers;
quit;
Introduction to SQL
• SQL joins have the following characteristics:
  • They do not require sorted data.
  • They can be performed on up to 32 data sets at one time.
  • They allow complex matching criteria using the WHERE clause.
Introduction to SQL
• General form of an SQL procedure join to generate output:
PROC SQL;
   SELECT variables
   FROM SAS-data-set1 AS alias1,
        SAS-data-set2 AS alias2
   WHERE alias1.variable=alias2.variable;
Introduction to SQL
• Create a listing report by joining data sets univ.mastercustomers and univ.customerorders by CustomerID.
• Step 1: Invoke the SQL procedure and list the variables to display.
proc sql;
   select CustomerID, CustomerFirstName,
          CustomerLastName, OrderID,
          UnitPrice, Quantity
Introduction to SQL
• Step 2: Identify the data sets to join and provide a table alias for each.
• Because CustomerID exists in both data sets, identify which CustomerID to use.
proc sql;
   select m.CustomerID, CustomerFirstName,
          CustomerLastName, OrderID,
          UnitPrice, Quantity
   from univ.mastercustomers as m,
        univ.customerorders as c
Introduction to SQL
• Step 3: State the condition on which observations are matched and terminate the query.
proc sql;
   select m.CustomerID, CustomerFirstName,
          CustomerLastName, OrderID,
          UnitPrice, Quantity
   from univ.mastercustomers as m,
        univ.customerorders as c
   where m.CustomerID=c.CustomerID;
quit;
153
Create a new variable named TotSale by multiplying
Quantity by UnitPrice. Name the new variable
TotSale.
proc sql;
select m.CustomerID, CustomerFirstName,
CustomerLastName, OrderID,
UnitPrice, Quantity,
Quantity * UnitPrice as TotSale
from univ.mastercustomers as m,
univ.customerorders as c
where m.CustomerID=c.CustomerID;
quit;
 General form of a PROC SQL query to create
a SAS data set:
PROC SQL;
CREATE TABLE SAS-data-set AS
SELECT ...
other SQL clauses;
 Join the tables univ.mastercustomers and
univ.customerorders to create a new data set.
proc sql;
create table work.ordertotals as
select m.CustomerID,
CustomerFirstName,
CustomerLastName, OrderID,
UnitPrice, Quantity,
Quantity*UnitPrice as TotSale
from univ.mastercustomers as m,
univ.customerorders as c
where m.CustomerID=c.CustomerID;
quit;
 General form of an SQL procedure query using
labels and formats:
PROC SQL;
SELECT variable LABEL='column-header'
FORMAT=format.
FROM SAS-data-set ;
 Enhance the previous report.
proc sql;
select m.CustomerID,
CustomerFirstName format=$10.,
CustomerLastName format=$15.,
OrderID,
UnitPrice format=dollar7.2,
Quantity,
Quantity * UnitPrice as TotSale
format=dollar8.2
label='Total Sale Amount'
from univ.mastercustomers as m,
univ.customerorders as c
where m.CustomerID=c.CustomerID;
quit;
• Partial Output
Customer  Customer    Customer                   Unit                Sale
ID        First Name  Last Name   OrderID       Price  Quantity    Amount
-------------------------------------------------------------------------
062096    Craig       Knapmeyer   1240062267   $36.00         3   $108.00
062096    Craig       Knapmeyer   1240832690   $27.00         4   $108.00
062284    Robert      Britt       1238409388   $15.00         1    $15.00
062284    Robert      Britt       1238409388   $33.00         1    $33.00
064810    Randall     Goodman     1238248877  $175.00         4   $700.00
064810    Randall     Goodman     1238248877  $283.00         1   $283.00
064810    Randall     Goodman     1238273875  $220.00         1   $220.00
064810    Randall     Goodman     1238768955   $52.00         1    $52.00
064810    Randall     Goodman     1238842450   $24.00         1    $24.00
064810    Randall     Goodman     1239353817   $59.00         2   $118.00
064810    Randall     Goodman     1239489696   $11.00         2    $22.00
064810    Randall     Goodman     1239608721   $22.00         3    $66.00
064810    Randall     Goodman     1239608721   $46.00         3   $138.00
064810    Randall     Goodman     1240590287   $21.00         2    $42.00
 General form of an SQL procedure query to
generate summary output:
 If a summary function is used in the SELECT clause
with only one argument, then an overall statistic is
calculated down the column.
PROC SQL;
SELECT group-variable,
SUM(analysis-variable)
FROM SAS-data-set
GROUP BY group-variable;
 Step 1: Identify the variables to display, the input data
sets, and the matching criteria.
proc sql;
select m.CustomerID,
CustomerFirstName format=$10.,
CustomerLastName format=$15.,
sum(Quantity) label= 'Total Quantity',
sum(Quantity*UnitPrice) as TotSale
format=dollar12.2
label='Total Sale Amount'
from univ.mastercustomers as m,
univ.customerorders as c
where m.CustomerID=c.CustomerID;
 Step 2: Identify the grouping variable(s).
proc sql;
select m.CustomerID,
CustomerFirstName format=$10.,
CustomerLastName format=$15.,
sum(Quantity) label='Total Quantity',
sum(Quantity*UnitPrice) as TotSale
format=dollar12.2
label='Total Amount Purchased'
from univ.mastercustomers as m,
univ.customerorders as c
where m.CustomerID=c.CustomerID
group by m.CustomerID, CustomerFirstName,
CustomerLastName;
quit;
 General form of an SQL procedure query to
generate ordered output:
 The default is ascending order.
PROC SQL;
SELECT group-variable,
SUM(analysis-variable)
FROM SAS-data-set
GROUP BY group-variable
ORDER BY variable1 <, variable2> ;
 Order the report by total sale.
proc sql;
select m.CustomerID,
CustomerFirstName format=$10.,
CustomerLastName format=$15.,
sum(Quantity) label='Total Quantity',
sum(Quantity*UnitPrice) as TotSale
format=dollar12.2
label='Total Amount Purchased'
from univ.mastercustomers as m,
univ.customerorders as c
where m.CustomerID=c.CustomerID
group by m.CustomerID, CustomerFirstName,
CustomerLastName
order by TotSale;
quit;
• Order the report by total sale, in descending order.
proc sql;
select m.CustomerID,
CustomerFirstName format=$10.,
CustomerLastName format=$15.,
sum(Quantity) label='Total Quantity',
sum(Quantity*UnitPrice) as TotSale
format=dollar12.2
label='Total Amount Purchased'
from univ.mastercustomers as m,
univ.customerorders as c
where m.CustomerID=c.CustomerID
group by m.CustomerID, CustomerFirstName,
CustomerLastName
order by TotSale desc;
quit;
Inner JOIN
• The INNER JOIN keywords can be used to join tables. The ON clause replaces
the WHERE clause for specifying columns to join. PROC SQL provides these
keywords primarily for compatibility with the other joins (OUTER, RIGHT, and
LEFT JOIN). Using INNER JOIN with an ON clause provides the same
functionality as listing tables in the FROM clause and specifying join columns
with a WHERE clause.
The query
proc sql outobs=6;
title 'Oil Production/Reserves of Countries';
select p.country, barrelsperday 'Production', barrels 'Reserves'
from sql.oilprod p, sql.oilrsrvs r
where p.country = r.country
order by barrelsperday desc;
is equivalent to
proc sql ;
select p.country, barrelsperday 'Production', barrels 'Reserves'
from sql.oilprod p inner join sql.oilrsrvs r
on p.country = r.country
order by barrelsperday desc;
 Outer joins are inner joins that are augmented with rows from one table that do
not match any row from the other table in the join. The resulting output includes
rows that match and rows that do not match from the join’s source tables.
Nonmatching rows have null values in the columns from the unmatched table.
Use the ON clause instead of the WHERE clause to specify the column or
columns on which you are joining the tables. However, you can continue to use
the WHERE clause to subset the query result.
Left JOIN
• A left outer join lists matching rows and rows from the left-hand
table (the first table listed in the FROM clause) that do
not match any row in the right-hand table. A left join is
specified with the keywords LEFT JOIN and ON.
proc sql;
select Capital format=$20., Name 'Country' format=$20.,
Latitude, Longitude
from sql.countries a left join sql.worldcitycoords b
on a.Capital = b.City and
a.Name = b.Country;
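The behaviour of the inner and left joins described above can also be tried outside SAS. The following is a minimal sketch that uses Python's built-in sqlite3 module instead of PROC SQL; the two small tables and all their values are invented for illustration (they are not the SAS sample data), and SQLite's dialect differs from PROC SQL in details such as labels and formats.

```python
import sqlite3

# Toy stand-ins for sql.oilprod and sql.oilrsrvs; every value here is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE oilprod  (country TEXT, barrelsperday INTEGER);
    CREATE TABLE oilrsrvs (country TEXT, barrels INTEGER);
    INSERT INTO oilprod  VALUES ('A', 300), ('B', 200), ('C', 100);
    INSERT INTO oilrsrvs VALUES ('A', 9000), ('B', 5000), ('D', 1000);
""")

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Inner join written with a WHERE clause.
where_form = run("""
    SELECT p.country, barrelsperday, barrels
    FROM oilprod p, oilrsrvs r
    WHERE p.country = r.country
    ORDER BY barrelsperday DESC""")

# The same inner join written with INNER JOIN ... ON.
join_form = run("""
    SELECT p.country, barrelsperday, barrels
    FROM oilprod p INNER JOIN oilrsrvs r
    ON p.country = r.country
    ORDER BY barrelsperday DESC""")

# Left join: country 'C' has no reserves row, so its barrels value is NULL (None).
left_form = run("""
    SELECT p.country, barrelsperday, barrels
    FROM oilprod p LEFT JOIN oilrsrvs r
    ON p.country = r.country
    ORDER BY barrelsperday DESC""")

print(where_form == join_form)
print(left_form)
```

Country 'D' appears only in the right-hand table, so it is absent from all three results; a right or full join would be needed to keep it.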
Right JOIN
 A right join, specified with the keywords RIGHT JOIN and ON, is
the opposite of a left join: nonmatching rows from the right-hand
table (the second table listed in the FROM clause) are included
with all matching rows in the output.
proc sql outobs=10;
select City format=$20., Country
'Country' format=$20., Population
from sql.countries right join
sql.worldcitycoords
on Capital = City and
Name = Country
order by City;
Inner/Full Outer/Left/Right JOIN
• A full outer join, specified with the keywords FULL JOIN
and ON, selects all matching and nonmatching rows.
proc sql outobs=10;
select City '#City#(WORLDCITYCOORDS)'
format=$20.,
Capital '#Capital#(COUNTRIES)'
format=$20.,
Population, Latitude, Longitude
from sql.countries full join
sql.worldcitycoords
on Capital = City and
Name = Country;
Special types of JOIN
• Cross Join
  • A cross join is a Cartesian product; it returns the product of two tables.
• Union Join
  • A union join combines two tables without attempting to match rows. All
columns and rows from both tables are included.
• Natural Join
  • A natural join automatically selects columns from each table to use in
determining matching rows. With a natural join, PROC SQL identifies
columns in each table that have the same name and type; rows in which the
values of these columns are equal are returned as matching rows. The ON
clause is implied.
• Further filtering of the grouped output with HAVING:
proc sql;
select CustomerGroup,
sum(Quantity*UnitPrice) as TotSale
format=dollar12.2
label='Total Amount Purchased'
from customers as m left join
customerorders as c
on m.CustomerID=c.CustomerID
group by m.CustomerGroup
/* "ge 10000": greater than or equal to 10000 */
/* "ne .": different from missing, i.e. not a missing value */
having TotSale ge 10000 and TotSale ne .
order by TotSale desc
;
quit;
• Counting the distinct clients in each group:
proc sql;
select CustomerGroup,
count(distinct m.CustomerID) as pocet
from customers as m left join
customerorders as c
on m.CustomerID=c.CustomerID
group by m.CustomerGroup
order by pocet desc
;
quit;
• Listing the first ten (in general, n) rows and all columns:
proc sql outobs=10;
select
*
from customers
;
quit;
Functions and CALL Routines
• approx. 500 "functions"
  • text
  • date
  • mathematical
  • statistical
  • probability
  • financial
  • …
More at:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245860.htm
Math operators:
• Arithmetic operators:
Symbol  Definition      Example        Result
**      exponentiation  a**3           raise A to the third power
*       multiplication  2*y            multiply 2 by the value of Y
/       division        var/5          divide the value of VAR by 5
+       addition        num+3          add 3 to the value of NUM
-       subtraction     sale-discount  subtract the value of DISCOUNT from the value of SALE
• Comparison operators:
Symbol  Mnemonic  Definition                Example
=       EQ        equal to                  a=3
^=      NE        not equal to              a ne 3
¬=      NE        not equal to
~=      NE        not equal to
>       GT        greater than              num>5
<       LT        less than                 num<8
>=      GE        greater than or equal to  sales>=300
<=      LE        less than or equal to     sales<=100
        IN        equal to one of a list    num in (3, 4, 5)
• Logical operators:
Symbol  Mnemonic  Example
&       AND       (a>b & c>d)
|       OR        (a>b or c>d)
!       OR
¦       OR
¬       NOT       not(a>b)
^       NOT
~       NOT
More at:
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000780367.htm
Special WHERE Operators
Special WHERE operators are operators that can only
be used in a WHERE expression.
Symbol  Mnemonic     Definition
        BETWEEN-AND  an inclusive range
        IS NULL      missing value
        IS MISSING   missing value
?       CONTAINS     a character string
        LIKE         a character pattern
BETWEEN-AND Operator
The BETWEEN-AND operator selects observations
in which the value of a variable falls within an inclusive range of values.
Examples:
where salary between 50000 and 100000;
where salary not between 50000 and 100000;
Equivalent Expressions:
where salary between 50000 and 100000;
where 50000 <= salary <= 100000;
IS NULL and IS MISSING Operators
The IS NULL and IS MISSING operators select
observations in which the value of a variable is missing.
• The operator can be used for both character and numeric
variables.
• You can combine the NOT logical operator with
IS NULL or IS MISSING to select nonmissing values.
Examples:
where Employee_ID is null;
where Employee_ID is missing;
CONTAINS Operator
The CONTAINS (?) operator selects observations that
include the specified substring.
• The position of the substring within the variable's values
does not matter.
• The operator is case sensitive when comparisons are made.
Example:
where Job_Title contains 'Rep';
LIKE Operator
The LIKE operator selects observations by comparing
character values to specified patterns.
There are two special characters available for specifying a
pattern:
• A percent sign (%) replaces any number of characters.
• An underscore (_) replaces one character.
Consecutive underscores can be specified.
A percent sign and an underscore can be specified in the
same pattern.
The operator is case sensitive.
Examples:
where Name like '%N';
This WHERE statement selects observations that begin
with any number of characters and end with an N.
where Name like 'T_m%';
This WHERE statement selects observations that begin
with a T, followed by a single character, followed by an m,
and followed by any number of characters.
Possible values include Tom and Tammy.
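The two wildcard rules of LIKE can be sketched by translating a pattern into a regular expression. This is only an illustration in Python, not how SAS implements the operator; the helper name `like` is hypothetical, and SAS-specific details such as the handling of trailing blanks or an escape character are not modelled.

```python
import re

def like(value: str, pattern: str) -> bool:
    """Case-sensitive LIKE-style match: % = any run of characters, _ = one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")            # any number of characters, including none
        elif ch == "_":
            parts.append(".")             # exactly one character
        else:
            parts.append(re.escape(ch))   # every other character matches literally
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None

for name in ["Tom", "Tammy", "TOM", "Tim"]:
    print(name, like(name, "T_m%"))
```

Because the match is case sensitive, the pattern 'T_m%' with a lowercase m accepts Tom and Tammy but rejects TOM.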
Which WHERE statement will return the observations that have
a first name starting with the letter M for the given values
(names stored in the form "last name, first name")?
a. where Name like '_, M_';
b. where Name like '%, M%';
c. where Name like '_, M%';
d. where Name like '%, M_';
Answer: the pattern '%, M%' (a last name of any length, a comma and a blank,
then a first name that starts with M and continues with any characters).
Referring to newly calculated columns
• When you use a column alias to refer to a calculated value, you
must use the CALCULATED keyword with the alias to inform
PROC SQL that the value is calculated within the query. The
following example uses two calculated values, LowC and HighC,
to calculate a third value, Range:
proc sql outobs=12;
title 'Range of High and Low Temperatures in Celsius';
select City, (AvgHigh - 32) * 5/9
as HighC format=5.1,
(AvgLow - 32) * 5/9 as LowC
format=5.1,
(calculated HighC - calculated LowC)
as Range format=4.1
from sql.worldtemps;
• You can specify a calculated column only in a SELECT clause or a
WHERE clause.
Conditional value assignment
• You can use conditional logic within a query by using a CASE expression to
conditionally assign a value. You can use a CASE expression anywhere that you
can use a column name. You must close the CASE logic with the END keyword.
proc sql outobs=12;
title 'Climate Zones of World Cities';
select City, Country, Latitude,
case
when Latitude gt 67 then 'North Frigid'
when 67 ge Latitude ge 23 then 'North Temperate'
when 23 gt Latitude gt -23 then 'Torrid'
when -23 ge Latitude ge -67 then 'South Temperate'
else 'South Frigid'
end as ClimateZone
from sql.worldcitycoords
order by City;
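The branch order of the CASE expression above matters: the first WHEN condition that holds determines the value. A small Python sketch of the same decision logic, with a hypothetical function name and without any treatment of missing latitudes, makes the boundaries explicit:

```python
def climate_zone(latitude: float) -> str:
    """Mirror the CASE expression: the first condition that holds wins."""
    if latitude > 67:
        return "North Frigid"
    if latitude >= 23:        # 67 ge Latitude ge 23
        return "North Temperate"
    if latitude > -23:        # 23 gt Latitude gt -23
        return "Torrid"
    if latitude >= -67:       # -23 ge Latitude ge -67
        return "South Temperate"
    return "South Frigid"     # the ELSE branch

for lat in (70, 67, 23, 0, -23, -67, -68):
    print(lat, climate_zone(lat))
```

The boundary latitudes 67, 23, -23 and -67 all fall into the temperate zones, exactly as the GE/GT comparisons in the query specify.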
More about the SQL procedure
http://support.sas.com/documentation/cdl/en/sqlproc/63043/PDF/default/sqlproc.pdf
http://support.sas.com/documentation/cdl/en/sqlproc/63043/HTML/default/viewer.htm#titlepage.htm
Data matrix
• The data format required for modelling.
• A two-dimensional n x p matrix.
• Rows represent the n statistical units (clients).
• Columns represent the p statistical variables.
Principles of building the data matrix
• Replicability of the data matrix construction
  • no manual adjustments of the data
• Comprehensibility of the data matrix construction
  • detailed comments
• Backward connectivity of the data matrix
  • primary keys (ids) of all underlying data tables
4. Data preparation: cleaning, categorization,
aggregation, data transformation (WOE),
introduction to the SAS DATA step
Data cleaning: practical experience
• If your new data contain more than 30 numbers, they almost certainly
contain an error.
• Data cleaning and preparation usually take 80 to 90 % of the
analyst's time.
• If you are VERY careful at this stage, you will save far more
time and nerves later; otherwise you are building a house on sand.
• GIGO: garbage in, garbage out.
  • Even the best model (process) cannot turn garbage into anything but
garbage.
What poor-quality data cause
• Maintenance of poor-quality/redundant data
• Undelivered mail (marketing, invoicing)
• Incorrect processing results (reporting, analyses, data mining)
• System malfunction (incompatibility)
• Loss of image, dissatisfied clients
Data cleaning: verifying the file
Verifying the data file / data sources
• Are these the right data (time of origin, survey, …)?
• Are they complete, free of duplicates, and readable?
Examining the cases
• Do they have identifiers?
• Are these IDs correct?
• Are they unique (no duplicates)?
  • There are also "near" duplicates: two similar but not exactly identical records about
the same subject.
• Have any cases been left out?
Data cleaning: verifying the variables
Examining the metadata about variables
• Are all variables present and correctly labelled?
• Is it clear what they mean (codebooks, definitions, …)? Is the documentation
OK?
  • Beware of international studies, products of agency consortia, and repeated survey
waves. Subtle nuances of method can cause gross inconsistency!
• Is any variable repeated more than once?
Data cleaning: exploring the variables
• Does the variable take admissible values (x out of range)?
• "Strange" codes ("xxx", "9999", …)
• Duplicate codes for the same thing ("Ž", "ž", "žena", "zena", …)
• Encoding of Czech/Russian/… characters
• Typos etc.
  • Edit distances (Levenshtein (Владимир Иосифович Левенштейн), ...)
help to detect typos.
  • Edit distance = the number of elementary editing steps needed
to change one string into the other. See
http://www.merriampark.com/ld.htm on the Levenshtein distance.
  • That page includes an applet that computes it.
  • Clustering of strings by edit distance (ED)
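The edit distance mentioned above can be computed with the standard dynamic-programming recurrence for the Levenshtein distance (insertions, deletions and substitutions, each costing 1). The sketch below is a plain Python illustration; the function name is chosen here for the example and is not tied to any SAS routine.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # turning a[:i] into the empty string
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if the characters agree)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("žena", "zena"))
print(levenshtein("kitten", "sitting"))
```

Codes that lie within a small distance of each other, such as "žena" and "zena", are natural candidates for being the same category typed differently.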
197
Čištění dat: Průzkum proměnných
 Slučování podobných kategorií (prodavač – prodejce –
prodavačka);
 Málo četné kategorie (národnost brazilská…) – je třeba
sloučit/přiřadit k nějaké(kým) více četné(ným)
kategorii(ím) na základě nějakého vhodného kriteria.
 Je distribuce přiměřená našemu očekávání (interval
hodnot, rozptyl, šikmost, špičatost, modální
hodnoty…)? Není např. příliš „ořezaná“ či naopak
„roztažená“?
 Někdy se obtížně poznává: Např. věk v části dat může být kódován jako
poslední dvojčíslí roku narození, a v jiné části dat jako 2007 – rok
narození.
198
Clumps (clumping), typically around rounded values
• Income: people like to round upwards.
• Or, for instance, around the boundaries of age quotas, arising when interviewers
"adjust" respondents' ages so that they fit the quotas.
Missing values (causes, share, …)!
Beware of time codes (American vs. European convention), region codes,
etc.!
Data cleaning: relationships between data
• Several variables
  • Contingency tables, box plots by category, scatter plots and their
matrices, correlation coefficients
• Logical relationships (e.g. a 10-year-old cannot be married, a 30-year-old cannot
have worked for 20 years, …)
  • Searching by program/code: express the conditions by means of
mathematical logic and let the computer find the cases where they are not
satisfied.
• Extreme values of a multivariate distribution
  • Scatter plot
  • Mahalanobis distance from the centroid: $\sqrt{(x-t)^{T} S^{-1} (x-t)}$, where t is the centroid vector,
x the examined point, and S the covariance matrix
  • e.g. P. Filzmoser (2004), A multivariate outlier detection method,
http://www.statistik.tuwien.ac.at/public/filz/papers/minsk04.pdf
• Other properties; e.g. do the expected correlations exist?
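As a small worked illustration of the Mahalanobis distance from the centroid, the sketch below evaluates the square root of (x-t)' S^-1 (x-t) for the two-dimensional case only, using the closed-form inverse of a 2 x 2 matrix. The function name and the sample numbers are assumptions made for this example; a real analysis would estimate t and S from the data and use a linear-algebra library.

```python
import math

def mahalanobis_2d(x, t, S):
    """Mahalanobis distance of point x from centroid t for a 2 x 2 covariance matrix S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2 x 2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - t[0], x[1] - t[1])
    # Quadratic form dx' * inv * dx.
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# With the identity matrix the distance reduces to the Euclidean distance.
print(mahalanobis_2d((3, 4), (0, 0), ((1, 0), (0, 1))))
# A variance of 4 in the first coordinate shrinks distances along that axis.
print(mahalanobis_2d((2, 0), (0, 0), ((4, 0), (0, 1))))
```

Points with an unusually large distance are candidates for multivariate outliers even when each coordinate on its own looks unremarkable.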
Data cleaning: relationships between data
[Figure: chart titled "Sulpak"; vertical axis 0% to 100%; horizontal axis months 3, 4, …, 12, 1, 2; series VT, OT, NA, MT, FK, CT, BT]
• Correct entry of data into the DB
  • free-text field with the goods name vs. drop-down list
with the goods type
  • order of the values in the drop-down list:
the problem of the first (default) value
Data cleaning: outliers
[Figure: box plot with the labels: extreme value; outlier; upper inner fence or max. value; upper quartile; median; lower quartile; lower inner fence or min. value]
• interquartile range: $q = x_{0.75} - x_{0.25}$
• inner fences: $x_{0.25} - 1.5q$, $x_{0.75} + 1.5q$
• outer fences: $x_{0.25} - 3q$, $x_{0.75} + 3q$
• An outlier lies between the inner and the outer fences, i.e. in the interval
$(x_{0.75} + 1.5q,\ x_{0.75} + 3q)$ or in the interval $(x_{0.25} - 3q,\ x_{0.25} - 1.5q)$.
• An extreme value lies beyond the outer fences, i.e. in the interval $(x_{0.75} + 3q,\ \infty)$
or in the interval $(-\infty,\ x_{0.25} - 3q)$.
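The fence rules above translate directly into a small classifier. The following Python sketch takes the two quartiles as given (how they are estimated from the data is left open); the function name and the labels are illustrative, and a value lying exactly on an outer fence, which the open intervals above leave unassigned, is treated here as an outlier by assumption.

```python
def classify(x: float, q1: float, q3: float) -> str:
    """Label x as 'regular', 'outlier' or 'extreme' from the lower/upper quartiles."""
    q = q3 - q1                                   # interquartile range
    inner_lo, inner_hi = q1 - 1.5 * q, q3 + 1.5 * q
    outer_lo, outer_hi = q1 - 3 * q, q3 + 3 * q
    if x < outer_lo or x > outer_hi:
        return "extreme"                          # beyond the outer fences
    if x < inner_lo or x > inner_hi:
        return "outlier"                          # between inner and outer fences
    return "regular"

# Hypothetical quartiles 10 and 20: inner fences -5 and 35, outer fences -20 and 50.
for value in (12, 36, 51, -6, -21):
    print(value, classify(value, q1=10, q3=20))
```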
Data cleaning: correcting errors
Back to the sources!
Removing suspicious cases:
• Deliberate fraud, e.g. unreliable interviewers (cluster analysis!).
• Unverifiable data.
Removing suspicious values.
Recoding to the correct values (imputation of values):
• imputation by the mean, median, max./min. value, or by means of
a model (ZET algorithm, trees, …).
• For more see e.g.:
  • http://nces.ed.gov/pubs2001/200117.pdf
  • http://people.oregonstate.edu/~acock/missing/King.pdf
  • http://link.springer.com/article/10.1023%2FA%3A1002814429456
  • http://en.wikipedia.org/wiki/Imputation_(statistics)
Data cleaning: correcting errors with the ZET algorithm
• The algorithm was published by N. G. Zagoruiko and V. N. Elkina in 1982.
• It is intended primarily for ratio-scale variables.
• The algorithm proceeds as follows:
  • Normalize all columns of the data matrix to the interval [0, 1] by a linear transformation.
  • Select the j-th column (one that contains missing values).
  • For each pair of columns j and k ($k \neq j$) determine the fill measure $L_{jk}$ and the similarity measure $r_{jk}$, where $L_{jk}$ is the number of rows in which neither column j nor column k has a missing value, and $r_{jk}$ is the absolute value of the correlation coefficient between columns j and k.
  • Estimate the values of column j from the values of column k (by linear regression). For the l-th row (working only with the rows whose value is known, i.e. not missing) we obtain
$$\hat{a}^{k}_{lj} = b_{jk}\, a_{lk} + c_{jk},$$
where $b_{jk}$ and $c_{jk}$ are the regression parameters.
  • Compute the resulting estimate of the known value $a_{lj}$ as the weighted average of the values $\hat{a}^{k}_{lj}$, i.e.
$$\tilde{a}_{lj} = \frac{\sum_{k=1}^{P} L_{jk}\, r_{jk}^{\alpha}\, \hat{a}^{k}_{lj}}{\sum_{k=1}^{P} L_{jk}\, r_{jk}^{\alpha}},$$
where $\alpha$ is chosen so as to minimize the sum of squares between $a_{lj}$ and $\tilde{a}_{lj}$.
  • Estimate the missing values $a_{ij}$ by $\tilde{a}_{ij}$: set l = i in the formula for $\tilde{a}_{lj}$, with $\hat{a}^{k}_{ij}$ estimated by the linear regression described above for all columns k that have no missing value in row i.
  • Compute the sought values by the inverse linear transformation.
Data transformation
• Binarization (dummy variables)
Dummy variables are a technique that uses
dichotomous variables (coded 0 or 1) to
express the individual values of nominal
variables.
The name "dummy" points to the fact that the presence
of the attribute coded 1 represents a factor,
or a set of factors, that cannot be measured in any
better way within the given analysis.
Dummy variables
• A dummy variable takes the value 1 for observations in a given category of the selected variable and
the value 0 otherwise.
• For sex (2 categories), for example, it assigns 1 to a woman and 0 to a man. In this case
exactly one dummy variable suffices.
• For race (4 categories), more dummy variables have to be created.
P1=1 if race="white", and 0 otherwise.
P2=1 if race="black", and 0 otherwise.
P3=1 if race="Asian", and 0 otherwise.
P4=1 if race="other", and 0 otherwise.
• Important: not all 4 variables are included in the regression (that would cause
perfect multicollinearity, P4=1-P3-P2-P1).
• Number of dummy variables = number of categories - 1.
• The omitted variable is the "reference" variable.
• The constant carries the information about this reference variable.
• The coefficients of the included variables are interpreted relative to the constant.
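The rule "number of dummy variables = number of categories - 1" can be sketched as follows. The example is plain Python with a hypothetical nominal variable (a region with four levels) and a hypothetical helper name; it only illustrates the coding scheme, not any particular SAS procedure.

```python
def dummy_encode(values, categories, reference):
    """Return one dict of 0/1 indicators per value, omitting the reference category."""
    kept = [c for c in categories if c != reference]
    return [{c: int(v == c) for c in kept} for v in values]

# Hypothetical variable with 4 categories; "west" is the reference category.
categories = ["north", "south", "east", "west"]
rows = dummy_encode(["north", "west", "east"], categories, reference="west")
for row in rows:
    print(row)
```

An observation in the reference category gets zeros in all three indicators, which is why its effect is absorbed by the constant of the regression.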
Data transformation
• Categorization of continuous variables
  • necessary if we want to use WOE
  • deciles, but the categories can also be formed so as to maximize, e.g.,
the information value
• Aggregation
  • of similar values (similarity in substance, similarity of values
in relation to the target variable, …)
  • of rare values
• Segmentation
  • not essential, but it can substantially increase the predictive power
of the model, or it may be appropriate for business reasons.
Categorization of predictors
• Every variable should be categorized (divided into a reasonable number of
categories) with regard to:
- Best separation (default rates within the categories differ as much as
possible)
- Time stability (the ordering of the categories by default rate is the same in different
periods of the development sample)
• We want to find real statistical dependencies, not random
differences in default rates.
Data transformation: WOE
• $Good$: the total number of good clients in the sample
• $Bad$: the total number of bad clients in the sample
• $good_i^s$, $bad_i^s$: the number of good and bad clients, respectively, in the i-th
category of the s-th variable
• overall odds: $odds\_all = \dfrac{Good}{Bad}$
• odds of the i-th category of the s-th variable: $odds_i^s = \dfrac{good_i^s}{bad_i^s}$
• odds ratio (OR): $odds\_ratio_i^s = \dfrac{odds_i^s}{odds\_all}$
• WOE (weights of evidence):
$$WOE_i^s = \ln\left(odds\_ratio_i^s\right) = \ln\frac{good_i^s / bad_i^s}{Good / Bad} = \ln\frac{good_i^s / Good}{bad_i^s / Bad}$$
Data transformation: WOE (example)
cat.  # bad  # good  Def_rate   odds    OR  % bad [1]  % good [2]  [3]=[2]/[1]  WOE=ln[3]
1         4       1     80.0%   0.25  0.03      40.0%        1.1%         0.03      -3.58
2         2       6     25.0%   3.00  0.33      20.0%        6.7%         0.33      -1.10
3         2      18     10.0%   9.00  1.00      20.0%       20.0%         1.00       0.00
4         1      12      7.7%  12.00  1.33      10.0%       13.3%         1.33       0.29
5         1      53      1.9%  53.00  5.89      10.0%       58.9%         5.89       1.77
All      10      90     10.0%   9.00
Total number of clients: 100
How the entries for category 1 are obtained: Def_rate 80% = 4/(4+1); odds 0.25 = 1/4; OR 0.03 = 0.25/9; % bad 40% = 4/10; % good 1.1% = 1/90.
[Figure: Def_rate (axis 0% to 100%) and WOE (axis -4 to 4) plotted for categories 1 to 5]
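The WOE column of the example can be recomputed from the bad/good counts alone. The sketch below is plain Python with an illustrative function name; it uses the form WOE = ln((good_i / Good) / (bad_i / Bad)) and assumes every category has at least one good and one bad client (otherwise the logarithm is undefined and the categories would have to be merged first).

```python
import math

def woe(counts):
    """counts: list of (bad, good) pairs per category; returns the WOE of each category."""
    total_bad = sum(bad for bad, _ in counts)
    total_good = sum(good for _, good in counts)
    return [
        math.log((good / total_good) / (bad / total_bad))
        for bad, good in counts
    ]

# (bad, good) counts of the five categories in the example table.
example = [(4, 1), (2, 6), (2, 18), (1, 12), (1, 53)]
for category, value in enumerate(woe(example), start=1):
    print(category, round(value, 2))
```

Negative WOE marks categories with a higher share of bad clients than the sample as a whole, and positive WOE marks the safer categories.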
Sensitivity of WOE
Baseline:
cat.  # bad  # good  Def_rate   odds    OR  % all  % bad [1]  % good [2]  [3]=[2]/[1]  WOE=ln[3]
1       400     100     80.0%   0.25  0.03   5.0%      40.0%        1.1%         0.03      -3.58
2       200     600     25.0%   3.00  0.33   8.0%      20.0%        6.7%         0.33      -1.10
3       200    1800     10.0%   9.00  1.00  20.0%      20.0%       20.0%         1.00       0.00
4       100    1200      7.7%  12.00  1.33  13.0%      10.0%       13.3%         1.33       0.29
5       100    5300      1.9%  53.00  5.89  54.0%      10.0%       58.9%         5.89       1.77
All    1000    9000     10.0%   9.00
Total number of clients: 10000
1. Def_rate preserved, but the frequency of category 1 reduced (change in % all):
cat.  # bad  # good  Def_rate   odds    OR  % all  % bad [1]  % good [2]  [3]=[2]/[1]  WOE=ln[3]
1       200      50     80.0%   0.25  0.03   2.6%      25.0%        0.6%         0.02      -3.80
2       200     600     25.0%   3.00  0.33   8.2%      25.0%        6.7%         0.27      -1.32
3       200    1800     10.0%   9.00  1.00  20.5%      25.0%       20.1%         0.80      -0.22
4       100    1200      7.7%  12.00  1.33  13.3%      12.5%       13.4%         1.07       0.07
5       100    5300      1.9%  53.00  5.89  55.4%      12.5%       59.2%         4.74       1.56
All     800    8950      8.2%  11.19
Total number of clients: 9750
2. Def_rate changed while % all is preserved:
cat.  # bad  # good  Def_rate   odds    OR  % all  % bad [1]  % good [2]  [3]=[2]/[1]  WOE=ln[3]
1       300     200     60.0%   0.67  0.07   5.0%      33.3%        2.2%         0.07      -2.72
2       200     600     25.0%   3.00  0.33   8.0%      22.2%        6.6%         0.30      -1.22
3       200    1800     10.0%   9.00  1.00  20.0%      22.2%       19.8%         0.89      -0.12
4       100    1200      7.7%  12.00  1.33  13.0%      11.1%       13.2%         1.19       0.17
5       100    5300      1.9%  53.00  5.89  54.0%      11.1%       58.2%         5.24       1.66
All     900    9100      9.0%  10.11
Total number of clients: 10000
3. As in 2, plus a change of Def_rate in one more category:
cat.  # bad  # good  Def_rate   odds    OR  % all  % bad [1]  % good [2]  [3]=[2]/[1]  WOE=ln[3]
1       300     200     60.0%   0.67  0.07   5.0%      30.0%        2.2%         0.07      -2.60
2       300     500     37.5%   1.67  0.19   8.0%      30.0%        5.6%         0.19      -1.69
3       200    1800     10.0%   9.00  1.00  20.0%      20.0%       19.8%         0.99      -0.01
4       100    1200      7.7%  12.00  1.33  13.0%      10.0%       13.2%         1.32       0.28
5       100    5300      1.9%  53.00  5.89  54.0%      10.0%       58.2%         5.82       1.76
All    1000    9000     10.0%   9.00
Total number of clients: 10000
Note: in the modified tables the OR column is still computed relative to the baseline overall odds of 9.00, whereas columns [3] and WOE use the modified totals. In table 3 the % good values of categories 3 to 5 appear to be carried over from table 2 (e.g. 1800/9000 = 20.0%, not 19.8%).
The SORT Procedure
The SORT procedure rearranges the observations
in work.qtr1salesrep and places them in order
by descending Last_Name within Country.
The OUT= option in the SORT procedure can be used
to create an output data set, instead of overwriting the input data
set.
The FORMAT Procedure
The FORMAT procedure creates user-defined formats and
informats, and stores them in the SAS catalog work.formats by
default.
• More at: http://www2.sas.com/proceedings/sugi27/p056-27.pdf
Character User-Defined Format
proc format;
value $ctryfmt 'AU' = 'Australia'
'US' = 'United States'
other = 'Miscoded';
run;
In this VALUE statement, $ctryfmt is the character format name, 'AU' and 'US' are
discrete character values, OTHER is a keyword, and the strings to the right of the
equal signs are the labels.
Range(s) can be
• single values
• ranges of values
• lists of values.
Labels
• can be up to 32,767 characters in length
• are typically enclosed in quotation
marks, although it is not required.
The OTHER keyword matches all values that do not
match any other value or range.
/* Part 1: define the format */
proc format;
value $ctryfmt 'AU' = 'Australia'
'US' = 'United States'
other = 'Miscoded';
run;
/* Part 2: apply the format */
proc print data=orion.sales label;
var Employee_ID Job_Title Salary
Country Birth_Date Hire_Date;
label Employee_ID='Sales ID'
Job_Title='Job Title'
Salary='Annual Salary'
Birth_Date='Date of Birth'
Hire_Date='Date of Hire';
format Salary dollar10.0
Birth_Date Hire_Date monyy7.
Country $ctryfmt.;
run;
Partial PROC PRINT Output
                                      Annual                 Date of  Date of
Obs  Sales ID  Job Title              Salary  Country        Birth    Hire
 60    120178  Sales Rep. II         $26,165  Australia      NOV1954  APR1974
 61    120179  Sales Rep. III        $28,510  Australia      MAR1974  JAN2004
 62    120180  Sales Rep. II         $26,970  Australia      JUN1954  DEC1978
 63    120198  Sales Rep. III        $28,025  Australia      JAN1988  DEC2006
 64    120261  Chief Sales Officer  $243,190  United States  FEB1969  AUG1987
 65    121018  Sales Rep. II         $27,560  United States  JAN1944  JAN1974
 66    121019  Sales Rep. IV         $31,320  United States  JUN1986  JUN2004
 67    121020  Sales Rep. IV         $31,750  United States  FEB1984  MAY2002
 68    121021  Sales Rep. IV         $32,985  United States  DEC1974  MAR1994
 69    121022  Sales Rep. IV         $32,210  United States  OCT1979  FEB2002
 70    121023  Sales Rep. I          $26,010  United States  MAR1964  MAY1989
 71    121024  Sales Rep. II         $26,600  United States  SEP1984  MAY2004
 72    121025  Sales Rep. II         $28,295  United States  OCT1949  SEP1975
Numeric User-Defined Format
proc format;
value tiers 20000-49999 = 'Tier 1'
50000-99999 = 'Tier 2'
100000-250000 = 'Tier 3';
run;
Here tiers is the numeric format name, the values to the left of the equal signs are
numeric ranges, and the quoted strings are the labels.
Numeric User-Defined Formats
proc format;
value tiers low-<50000 = 'Tier 1'
50000- 100000 = 'Tier 2'
100000<-high = 'Tier 3';
run;
LOW encompasses the lowest possible value.
HIGH encompasses the highest possible value.
The less than (<) symbol excludes values from ranges.
• Put < after the value if you want to exclude the first value in
a range.
• Put < before the value if you want to exclude the last value
in a range.
Range                Lower bound     Upper bound
50000 - 100000       Includes 50000  Includes 100000
50000 - < 100000     Includes 50000  Excludes 100000
50000 < - 100000     Excludes 50000  Includes 100000
50000 < - < 100000   Excludes 50000  Excludes 100000
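The inclusive and exclusive endpoints of the tiers format with LOW and HIGH can be checked against an equivalent chain of comparisons. The following Python sketch is only an illustration of the range semantics (the function name is hypothetical, and missing values, which a SAS format would treat separately, are not handled):

```python
def tier(salary: float) -> str:
    """Same ranges as: low-<50000, 50000-100000, 100000<-high."""
    if salary < 50000:         # low -< 50000 : 50000 itself is excluded
        return "Tier 1"
    if salary <= 100000:       # 50000 - 100000 : both endpoints included
        return "Tier 2"
    return "Tier 3"            # 100000 <- high : 100000 itself is excluded

for value in (49999.99, 50000, 100000, 100000.01):
    print(value, tier(value))
```

Written this way, every value falls into exactly one tier, which is the point of using < at the range boundaries instead of leaving gaps such as the one between 49999 and 50000.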
Multiple User-Defined Formats
Multiple VALUE statements can be in a single
PROC FORMAT step.
proc format;
value $ctryfmt 'AU' = 'Australia'
'US' = 'United States'
other = 'Miscoded';
value tiers low-<50000 = 'Tier 1'
50000- 100000 = 'Tier 2'
100000<-high = 'Tier 3';
run;
Reprodukováno se svolením společnosti SAS Institute Inc., Cary, NC, USA.
The FORMAT Procedure
proc format;
   value $goods_t
      'BT' = 'A'
      'BZ' = 'D'
      ' '  = 'missing'   /* '' and ' ' are the same blank value, so it is listed once */
      '.'  = 'missing'
   ;
run;
proc tabulate data=lib1.tab1 missing;
   title "D vs. goods_type";
   class goods_type D;
   table (goods_type all), (D all)*(n colpctn='c%' rowpctn='r%');
   format goods_type $goods_t.;
run;
The FORMAT Procedure
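proc format;
   value good_typ
      1 = 1
      2 = 3
      3 = 10
   ;
run;
data lib1.tab1;
   set lib1.tab1;
   goods_type3 = goods_type2;
   format goods_type3 good_typ.;   /* the FORMAT statement must name the variable created above */
run;
proc format;
   invalue good_t2e
      'BT'  = 4
      'BZ'  = 5
      'CK'  = 5
      other = -1
   ;
run;
data lib1.tab1;
   set lib1.tab1;
   goods_type1  = upcase(goods_type);
   goods_type3n = input(goods_type1, good_t2e.);
   /* PUT returns a character string; if evid_id is numeric, assigning the  */
   /* result back to it converts the string to a number again, so use a new */
   /* character variable when the leading zeros must be kept                */
   evid_id = put(evid_id, z10.);
run;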
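To see what the informat and the Z format return, the two conversions can be tried on a few literal values. This is a small sketch only; the codes, the number 123, and the variable names code, num, id, and id_char are made up for illustration.

```sas
proc format;
   invalue good_t2e 'BT' = 4
                    'BZ' = 5
                    'CK' = 5
                    other = -1;
run;

data _null_;
   length code $ 2;
   do code = 'BT', 'CK', 'XX';
      num = input(code, good_t2e.);   /* character code -> number */
      put code= num=;
   end;
   id = 123;
   id_char = put(id, z10.);           /* number -> 10-character string with leading zeros */
   put id_char=;
run;
```

The log should list num=4 for BT, num=5 for CK, num=-1 for the unlisted code XX, and id_char=0000000123.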
Replacing Missing Values
The COALESCE function enables you to replace missing values in a column with a
new value that you specify. For every row that the query processes, the COALESCE
function checks each of its arguments until it finds a nonmissing value, then returns
that value. If all of the arguments are missing values, then the COALESCE function
returns a missing value. For example, the following query replaces missing values in
the LowPoint column in the SQL.CONTINENTS table with the words Not Available:
proc sql;
   title 'Continental Low Points';
   select Name,
          coalesce(LowPoint, 'Not Available') as LowPoint
      from sql.continents;
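If the SQL.CONTINENTS table is not at hand, the same behavior can be reproduced on a tiny table. The sketch below builds a hypothetical work.points table with one missing LowPoint value; the table name and its three rows are assumptions for illustration.

```sas
data work.points;
   length Name $ 12 LowPoint $ 20;
   infile datalines dsd truncover;   /* comma-delimited, empty field = missing */
   input Name $ LowPoint $;
   datalines;
Africa,Lake Assal
Antarctica,
Asia,Dead Sea
;
run;

proc sql;
   select Name,
          coalesce(LowPoint, 'Not Available') as LowPoint
      from work.points;
quit;
```

Only the Antarctica row should show Not Available; the other rows keep their stored values. Both arguments of COALESCE are character here, since the arguments must have the same type.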
The DATA Step
The SAS DATA step
 is the original SAS programming language for data
manipulation
 can be used as a complete programming language
 is generated by SAS Enterprise Guide when data is imported
or in support of other tasks.
Advantages of the DATA Step over SQL
DATA Step | SQL
Can read data from many different sources | Can only read from SAS or database tables
Can create multiple tables in a single pass of the data | Can only output one table at a time
Has comprehensive conditional processing | Only has the CASE clause
Can deal with repetitive programming using loops and arrays | Does not support loops or arrays
Advantages of SQL over the DATA Step
SQL | DATA Step
Is very flexible when joining multiple tables with non-common key variables | Can require several steps to join multiple tables with different key variables
Can, in some cases, replace multiple SAS steps | Can require several steps
Is the native language of databases | Might need to generate SQL to get to data that is not SAS data
Choose the right tool for the task to be completed.
The DATA Statement
The DATA statement begins a DATA step and provides the name of the SAS data set being created.
General form of the DATA statement:
DATA output-SAS-data-set;
   SET input-SAS-data-set;
   <additional SAS statements>
RUN;
The DATA statement can create temporary or permanent data sets.
The SET Statement
The SET statement reads observations from a SAS data set for further processing in the DATA step.
General form of the SET statement:
DATA output-SAS-data-set;
   SET input-SAS-data-set;
   <additional SAS statements>
RUN;
The SET statement does the following:
 names the SAS data set(s) to be read
 reads all observations and all variables from the input data set by default
 can read temporary or permanent data sets
Business Scenario: Reading a SAS Data Set
data work.comp;
set orion.sales;
run;
This program does the following:
 reads all the rows and all the
columns from the sales data set in
the orion library
 writes all the rows and all the
columns to a data set named
comp in the Work library
Partial Listing of comp
Selecting Variables
You can control the variables written out to SAS data
sets using the following:
 the DROP statement to specify the variables that
you want excluded
 the KEEP statement to specify the variables that
you want included
General form of DROP and KEEP statements:
DROP variable1 variable2 …;
KEEP variable1 variable2 …;
Business Scenario: Selecting Variables
data work.comp;
set orion.sales;
drop Gender Salary Job_Title
Country Birth_Date Hire_Date;
run;
This program can do these tasks:
 read all the rows and columns from
orion.sales
 write all the rows and the three
columns not excluded via the DROP
statement to a data set called comp in
the Work library
Partial Listing of comp
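The same table could also be produced with a KEEP statement. The sketch below assumes that the three columns left after the DROP are Employee_ID, First_Name, and Last_Name; these names are inferred from the labels assigned to orion.sales elsewhere in this section and should be checked against the actual table.

```sas
data work.comp;
   set orion.sales;
   /* list the columns to write instead of the columns to exclude */
   keep Employee_ID First_Name Last_Name;
run;
```

KEEP tends to be the shorter choice when only a few columns are wanted, and DROP when only a few are unwanted.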
Selecting Rows
Orion wants to subset the data to only include Australian
employees with a salary greater than $30,000.
Partial Listing of austemp
Selecting Rows with the WHERE Statement
You can control which rows are read from a SAS data set by using the WHERE statement.
General form of the WHERE statement:
WHERE expression;
 Only one WHERE statement can be included in a DATA step.
 The expressions that can be used are the same as expressions built in the Filter Data tab using either the Edit Filter window or the Advanced Expression Editor.
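For the Orion scenario (Australian employees with a salary greater than $30,000), a WHERE statement along the following lines would be one way to build austemp. This is a sketch that assumes Country is stored as the two-letter code 'AU' and Salary as a number in orion.sales.

```sas
data work.austemp;
   set orion.sales;
   /* only rows that satisfy both conditions are read */
   where Country = 'AU' and Salary > 30000;
run;
```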
Comparison Operators - examples
where Gender = 'M';
where Gender eq ' ';
where Salary >= 50000;
where Salary ne .;
where Country in ('AU','US');
where Country in ('AU' 'US');
Values in an IN list must be separated by commas or blanks.
Arithmetic Operators - examples
where Salary / 12 < 6000;
where (Salary / 12 ) * 1.10 >= 7500;
where Salary + Bonus <= 10000;
Logical Operators - examples
where Gender ne 'M' and Salary >=50000;
where Gender ne 'M' or Salary >= 50000;
where Country = 'AU' or Country = 'US';
where Country not in ('AU' 'US');
Multiple Choice Poll – Correct Answer
Which WHERE statement correctly subsets for numeric months May, June, or July and character names with a missing value?
a. where Months in (5 - 7) and Names = . ;
b. where Months in (5 , 6 , 7) and Names = ' ';
c. where Months in ('5' , '6' , '7') and Names = '.';
Correct answer: b. Numeric values are written without quotation marks, and a missing character value is written as a quoted blank.
Creating New Variables
Assignment statements are used in the DATA step to update existing
variables or create new variables.
An assignment statement does the following:
 evaluates an expression
 assigns the resulting value to a variable
General form of an assignment statement:
variable=expression;
DATA output-SAS-data-set;
SET input-SAS-data-set;
variable = expression;
RUN;
SAS Expressions
 An expression contains operands and operators that form a set of instructions that produce a value.
Operands are
 variable names
 constants.
Operators are
 symbols that request arithmetic calculations
 SAS functions.
 An expression entered in an assignment statement is identical to an expression built using the SAS Enterprise Guide Advanced Expression Editor.
Operands
Operands are constants (character, numeric, or date)
and variables (character or numeric).
Examples:
Bonus = 500;                 /* numeric constant */
Gender = 'M';                /* character constant */
NewSalary = 1.1 * Salary;    /* numeric constant and variable */
Hire_Date = '01APR2008'd;    /* date constant */
SAS Date Constants
The constant 'ddMMMyyyy'd (example: '14dec2000'd) creates a
SAS date value from the date enclosed in quotation marks.
dd is a one- or two-digit value for the day.
MMM is a three-letter abbreviation for the month
(JAN, FEB, MAR, and so on).
yyyy is a four-digit value for the year.
d is required to convert the quoted string to a SAS
date.
Operators
Operators are symbols that represent an arithmetic
calculation and SAS functions.
Examples:
Revenue = Quantity * Price;
NewCountry = upcase(Country);
Arithmetic Operators
Arithmetic operators indicate that an arithmetic calculation is
performed.
If a missing value is an operand for an arithmetic operator, the
result is a missing value.
Symbol Definition Priority
** exponentiation I
- negative prefix I
* multiplication II
/ division II
+ addition III
- subtraction III
Multiple Choice Poll – Correct Answer
What is the result of the assignment statement given the values of var1 and var2?
num = var1 + var2 / 2;
var1  var2
.     10
a. . (missing)
b. 0
c. 5
d. 10
Correct answer: a. If an operand is missing for an arithmetic operator, the result is missing.
Using SAS Functions
SAS functions can do the following:
 perform arithmetic operations
 compute sample statistics (for example: sum, mean, and
standard deviation)
 manipulate SAS dates
 process character values
 perform many other tasks
Sample statistics functions ignore missing values.
 SAS functions can be used in the DATA step or in the
Advanced Expression Editor of the Query Builder to create new
columns or filter data.
Multiple Choice Poll – Correct Answer
What is the result of the assignment statement given the values of var1, var2, and var3?
Average = mean(Var1,Var2,Var3);
Var1  Var2  Var3
9     .     3
a. . (missing)
b. 0
c. 4
d. 6
Correct answer: d. The MEAN function ignores the missing value, so the result is (9 + 3) / 2 = 6.
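The two polls can be replayed in a single DATA step that contrasts the arithmetic operator with the SUM and MEAN functions. This is a minimal sketch; the variable names num_plus and num_sum are illustrative.

```sas
data _null_;
   var1 = .;
   var2 = 10;
   num_plus = var1 + var2 / 2;       /* missing operand -> missing result */
   num_sum  = sum(var1, var2 / 2);   /* SUM ignores the missing argument */
   Average  = mean(9, ., 3);         /* MEAN ignores the missing argument */
   put num_plus= num_sum= Average=;
run;
```

The log should show num_plus=. num_sum=5 Average=6, together with a note that missing values were generated by the arithmetic operation.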
Using Date Functions
You can use SAS date functions to do the following:
 create SAS date values
 extract information from SAS date values
Calendar Date     01JAN1959   01JAN1960   01JAN1961
SAS Date Value    -365        0           366
A SAS date value is the number of days between January 1, 1960, and the given date.
Date Functions: Creating SAS Dates
TODAY()               obtains the date value from the system clock.
MDY(month,day,year)   uses numeric month, day, and year values to return the corresponding SAS date value.
Example:
Days_Since_Order = today() - Order_Date;
Date Functions: Extracting Information
YEAR(SAS-date)      extracts the year from a SAS date and returns a four-digit value for year.
QTR(SAS-date)       extracts the quarter from a SAS date and returns a number from 1 to 4.
MONTH(SAS-date)     extracts the month from a SAS date and returns a number from 1 to 12.
DAY(SAS-date)       extracts the day of the month from a SAS date and returns a number from 1 to 31.
WEEKDAY(SAS-date)   extracts the day of the week from a SAS date and returns a number from 1 to 7, where 1 represents Sunday, and so on.
Example:
BonusMonth = month(Hire_Date);
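The date functions can be tried on one fixed date. The sketch below uses 14 December 2000 (the date from the date-constant example) and short illustrative variable names; none of these names come from the Orion tables.

```sas
data _null_;
   Order_Date = mdy(12, 14, 2000);        /* build a SAS date from numbers */
   same = (Order_Date = '14DEC2000'd);    /* 1 if it equals the date constant */
   y = year(Order_Date);
   q = qtr(Order_Date);
   m = month(Order_Date);
   d = day(Order_Date);
   w = weekday(Order_Date);               /* 1 = Sunday, ..., 7 = Saturday */
   put Order_Date= date9. same= y= q= m= d= w=;
run;
```

The log should show same=1, y=2000, q=4, m=12, d=14, and w=5, because 14 December 2000 was a Thursday.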
The LABEL Statement
Permanent labels can also be assigned in the DATA step.
General form of the LABEL statement:
LABEL variable = 'label'
      variable = 'label'
      variable = 'label';
 A label can be up to 256 characters.
 Any number of variables can be associated with labels in a single LABEL statement.
 Using a LABEL statement in a DATA step permanently associates labels with variables by storing the label in the descriptor portion of the SAS data set.
Business Scenario: Formats and Labels
data work.comp;
set orion.sales;
Bonus=500;
Compensation=sum(Salary,Bonus);
BonusMonth=month(Hire_Date);
drop Gender Salary Job_Title Country
Birth_Date;
format Bonus Compensation dollar8.
Hire_Date date9.;
label Employee_ID="Employee ID"
First_Name="First Name"
Last_Name="Last Name"
BonusMonth="Month of Bonus"
Hire_Date="Hire Date";
run;
5. SAS Data Step – conditional code, loops, arrays
Business Scenario
Customer_Type_ID indicates
the type of club member. A value
of 3010 indicates non-club
members.
Orion wants to create two data
sets, one for club members and
one for other customers.
Listing of clubmembers
Listing of customer
Listing of nonclub
The DATA Statement
The DATA statement lists the data sets to be created.
General form of the DATA statement:
DATA <SAS-data-set(s)> <SAS-data-set(s)>;
By default, the same rows are written to every listed data set.
data work.clubmembers work.nonclub;
   set orion.customer;
run;
The OUTPUT Statement
 The OUTPUT statement controls when the values are
written to the output SAS data set.
General form of the OUTPUT statement:
OUTPUT <SAS-data-set(s)>;
 If no data set is specified, then output goes to all of the data sets listed in the DATA statement.
Conditional Execution
General form of IF-THEN and ELSE statements:
IF expression THEN statement;
ELSE statement;
Only one executable statement is allowed in an IF-THEN or ELSE statement.
An expression contains operands and operators that form a set of instructions that produce a value.
Operators are
 symbols that request
– a comparison
– a logical operation
– an arithmetic calculation
 SAS functions.
Operands are
 variable names
 constants.
Program with Conditional Output
When Customer_Type_ID is equal to 3010, this
indicates that the customer is not a club member. Rows
where this expression is true are output to the
work.nonclub SAS data set.
data work.clubmembers work.nonclub;
set orion.customer;
if Customer_Type_ID = 3010
then output work.nonclub;
else output work.clubmembers;
run;
ep03d05.sas
Otherwise, rows are written to the work.clubmembers data set.
Business Scenario
Orion management wants to add an additional column to the
clubmembers data set to indicate the membership type.
Customer_Type_ID numbers between 1000 and 2000 are
Members, and between 2000 and 3000 are Gold Members.
Partial Listing of clubmembers
Executing Multiple Statements Conditionally
If Customer_Type_ID does not equal 3010, then these three
statements should be executed:
if Customer_Type_ID < 2000
then Type="Club Member";
else Type="Gold Club Member";
output clubmembers;
However, only one executable statement is allowed after THEN or ELSE.
data work.clubmembers work.nonclub;
set orion.customer;
if Customer_Type_ID = 3010
then output work.nonclub;
else output work.clubmembers;
run;
Executing Multiple Statements Conditionally
You can use the DO and END statements to execute
a group of statements based on a condition.
General form of the DO and END statements:
IF expression THEN DO;
   executable statements
END;
ELSE DO;
   executable statements
END;
Program with Conditional Output
data work.clubmembers work.nonclub;
   set orion.customer;
   if Customer_Type_ID = 3010
      then output nonclub;
   else do;
      if Customer_Type_ID < 2000
         then Type="Club Member";
      else Type="Gold Club Member";
      output clubmembers;
   end;
run;
ep03d06.sas
Business Scenario: Results
Why are some values of Type truncated?
Partial Listing of clubmembers
Program with Conditional Output
When a program is compiled, each variable must
be defined with a name, type, and length. The length
of Type is set to 11 based on the first occurrence
in the program.
data work.clubmembers work.nonclub;
   set orion.customer;
   if Customer_Type_ID = 3010
      then output nonclub;
   else do;
      if Customer_Type_ID < 2000
         then Type="Club Member";   /* the length of Type is set to 11 characters here */
      else Type="Gold Club Member";
      output clubmembers;
   end;
run;
Conditional Statements
 Conditional statements can create values for a new
variable based on whether a condition is true or false.
 General form of the IF-THEN and ELSE statements:
IF expression THEN statement;
ELSE IF expression THEN statement;
. . .
Conditionally Executing Statements
Quantity                    Level
1 or 2                      Level I
3                           Level II
4 or more                   Level III
0, Missing or Bad Data      Miscoded
data univ.totalorders;
   set univ.customerorders;
   TotalSale=UnitPrice*Quantity;
   if Quantity in (1,2) then Level='Level I';
   else if Quantity=3 then Level='Level II';
   else if Quantity ge 4 then Level='Level III';
   else Level='Miscoded';
run;
c03s3d1.sas
Listing Output
Partial Output
Quantity  Level    Total Sale
1         Level I  $20
3         Level I  $234
1         Level I  $50
5         Level I  $560
1         Level I  $12
2         Level I  $84
2         Level I  $46
Where are the Level II and Level III values? The length of Level is set to 7 characters by the first assignment ('Level I'), so the longer values are truncated.
The LENGTH Statement
 You can use the LENGTH statement to define the length of a variable explicitly.
 General form of the LENGTH statement:
LENGTH variable(s) $ length;
Example:
length Level $ 9;
The LENGTH Statement
data univ.totalorders;
   set univ.customerorders;
   length Level $ 9;
   TotalSale=UnitPrice*Quantity;
   if Quantity in (1,2) then Level='Level I';
   else if Quantity=3 then Level='Level II';
   else if Quantity ge 4 then Level='Level III';
   else Level='Miscoded';
run;
Quantity  Level      Total Sale
1         Level I    $20
3         Level II   $234
1         Level I    $50
5         Level III  $560
1         Level I    $12
2         Level I    $84
2         Level I    $46
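The same remedy applies to the truncated Type values in the clubmembers program earlier in this section. The following sketch adds a LENGTH statement before the first assignment; the length of 16 is chosen here because "Gold Club Member" has 16 characters, and is not taken from the course files.

```sas
data work.clubmembers work.nonclub;
   set orion.customer;
   length Type $ 16;   /* long enough for "Gold Club Member" */
   if Customer_Type_ID = 3010 then output work.nonclub;
   else do;
      if Customer_Type_ID < 2000 then Type = "Club Member";
      else Type = "Gold Club Member";
      output work.clubmembers;
   end;
run;
```

Placing LENGTH before the IF block matters, because the attributes of a variable are fixed by its first occurrence when the step is compiled.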
First. and Last. Variables
 If you use a by statement along with a set statement in a
data step then SAS creates two automatic variables,
FIRST.variable and LAST.variable, where variable is the name
of the by variable. FIRST.variable has a value 1 for the first
observation in the by group and 0 for all other observations
in the by group. LAST.variable has a value 1 for the last
observation in the by group and 0 for all other observations
in the by group.
Source: http://www.pauldickman.com/teaching/sas/set_by.php
First. and Last. Variables
data temp;
input group x;
datalines;
1 23
1 34
1 .
1 45
2 78
2 92
2 45
2 89
2 34
2 76
3 31
4 23
4 12
;
run;
/**************************************************
The automatic variables first.group and last.group are not saved with the data set.
Here we write them to data set variables to show their contents.
**************************************************/
data new;
set temp;
by group;
first=first.group;
last=last.group;
run;
proc print;
title 'Raw data along with first.group and last.group';
run;
Obs group x first last
1 1 23 1 0
2 1 34 0 0
3 1 . 0 0
4 1 45 0 1
5 2 78 1 0
6 2 92 0 0
7 2 45 0 0
8 2 89 0 0
9 2 34 0 0
10 2 76 0 1
11 3 31 1 1
12 4 23 1 0
13 4 12 0 1
First. and Last. Variables
/**************************************************
A common task in data cleaning is to identify
observations with a duplicate ID number. If we set
the data set by ID, then the observations which
are not duplicated will be both the first and the
last with that ID number. We can therefore write
any observations which are not both first.id and
last.id to a separate data set and examine them.
**************************************************/
data single dup;
set temp;
by group;
if first.group and last.group then output single;
else output dup;
run;
proc print data=dup;
run;
proc print data=single;
run;
Obs group x
1 1 23
2 1 34
3 1 .
4 1 45
5 2 78
6 2 92
7 2 45
8 2 89
9 2 34
10 2 76
11 4 23
12 4 12
Obs group x
1 3 31
First. and Last. Variables
/**************************************************
We may also want to do data set processing within
each by group. In this example we construct the
cumulative sum of the variable X within each group.
**************************************************/
data cusum(keep=group sum);
set temp;
by group;
if first.group then sum=0;
sum+x;
if last.group then output;
run;
proc print data=cusum noobs;
title 'Sum of X within each group';
run;
Sum of X within each group
group sum
1 102
2 414
3 31
4 35
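If the running total is wanted on every row rather than only on the last row of each group, the conditional OUTPUT can simply be left out. The sketch below uses a shortened copy of the example data; the data set names demo and running and the variable cum_x are illustrative.

```sas
data demo;
   input group x;
   datalines;
1 23
1 34
1 .
1 45
2 78
2 92
;
run;

proc sort data=demo;      /* BY-group processing needs sorted input */
   by group;
run;

data running;
   set demo;
   by group;
   if first.group then cum_x = 0;   /* restart the total in each group */
   cum_x + x;                       /* sum statement: retains, ignores missing */
run;

proc print data=running noobs;
run;
```

For this input, cum_x should read 23, 57, 57, 102 in group 1 and 78, 170 in group 2; the missing x leaves the total unchanged.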
First. and Last. Variables
/**************************************************
As an aside, if you simply want the sum of X within
each group, one of the many ways of obtaining this
is with PROC PRINT.
**************************************************/
proc print data=temp;
title 'All data with X summed within each group';
by group;
sum x;
sumby group;
run;
All data with X summed within each group
------------------------------------ group=1 ---------------
Obs x
1 23
2 34
3 .
4 45
----- ---
group 102
------------------------------------ group=2 ----------------
Obs x
5 78
6 92
7 45
8 89
9 34
10 76
----- ---
group 414
----------------------------------- group=3 -----------------
Obs x
11 31
----------------------------------- group=4 ----------------
Obs x
12 23
13 12
----- ---
group 35
===
582
First. and Last. Variables and Sum Statement
 The First. and Last. variables along with the sum statement
can be used to create the values for the summary data set.
data orders(keep=Customer_Name Quantity
Product_ID Total_Retail_Price)
noorders(keep=Customer_Name Birth_Date)
summary(keep=Customer_Name NumOrders);
merge orion.customer
work.order_fact(in=order);
by Customer_ID;
if order=1 then do;
output orders;
if first.Customer_ID then NumOrders=0;
NumOrders+1;
if last.Customer_ID then output summary;
end;
else output noorders;
run;
DO loops - introduction
 DO loops can be used to do the following:
 perform repetitive calculations
 generate data
 eliminate redundant code
 execute SAS code conditionally
 read data
 There are two basic forms of DO loops:
 Iterative: the number of iterations is fixed.
 Conditional iterative: the loop runs while (or until) a specified condition is met.
Business Need
 Orion Star wants to encourage employees to save for
retirement.
 If 6% of an employee’s salary is invested in a 401(k) plan
each year (not to exceed $11,000), how much money
could be saved after 30, 40, and 50 years?
 (Assume that the average return of the plan is 11%.)
Repetitive Processing
data retire(keep = EmployeeID Salary Investment
Value_after_30_years
Value_after_40_years
Value_after_50_years);
set univ.usemps;
Retirement = 0;
Investment = 0.06 * Salary;
if Investment gt 11000 then Investment = 11000;
*Year 1;
Retirement = Retirement + Investment;
Retirement = Retirement * 1.11;
*Year 2;
Retirement = Retirement + Investment;
Retirement = Retirement * 1.11;
*Year 3;
.
.
.
*Year 30;
Retirement = Retirement + Investment;
Retirement = Retirement * 1.11;
Value_after_30_years = Retirement;
Partial DATA Step
DO Loop Syntax
 General form of a simple iterative DO loop:
DO index-variable=start TO stop <BY increment>;
SAS statements
END;
 The values of start, stop, and increment
• must be numbers or expressions that yield numbers
• are established before executing the loop
• if omitted, increment defaults to 1.
The Iterative DO Statement
Index-variable details:
 The index-variable is written to the output data set
by default.
 At the termination of the loop, the value of index-variable is
one increment beyond the stop value.
Modifying the value of index-variable affects the
number of iterations, and might cause infinite looping
or early loop termination.
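The statement about the final value of the index variable can be checked with two empty loops. This is a minimal sketch with illustrative index names i and j.

```sas
data _null_;
   do i = 1 to 5;          /* default increment of 1 */
   end;
   put i=;                 /* value after the loop has ended */

   do j = 10 to 2 by -2;   /* negative increment counts down */
   end;
   put j=;
run;
```

The log should show i=6 and j=0, in each case one increment beyond the stop value.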
DO Loop Processing
DO index=start TO stop BY increment;
   SAS statements
END;
1. Define start, stop, and increment values.
2. Set index=start.
3. Is index out of range? If yes, leave the loop and continue with the other SAS statements.
4. If no, execute the statements in the loop.
5. Set index = index + increment and return to step 3.
DO Loop Syntax
You can define the values used to increment the loop with start TO stop <BY increment>.
do i=2 to 10 by 2;         values of i: 2 4 6 8 10 12
do i=10 to 2 by -2;        values of i: 10 8 6 4 2 0
do k=3.6 to 3.8 by .05;    values of k: 3.6 3.65 3.70 3.75 3.80 3.85
The last value in each list is out of range and ends the loop.
DO Loop Syntax
 Expressions:
do z=k to n/10;
 Dates:
do date='01JAN2003'd to '31JAN2003'd;
DO Loop Syntax
 General form of a DO loop with a value list:
DO index-variable=value1, value2, value3…;
   SAS statements
END;
 The values in the list can be numeric or character.
 Discrete numeric values separated by commas:
do n=1,5,15,30,60;
 Character values enclosed in quotes and separated by commas:
do month='JAN','FEB','MAR';
DO Loop Code
data retire(keep = EmployeeID Salary Investment
Value_after_30_years
Value_after_40_years
Value_after_50_years);
set univ.usemps;
Retirement = 0;
Investment = 0.06 * Salary;
if Investment gt 11000 then Investment = 11000;
do year = 1 to 50;
/* Add the Investment amount each year */
Retirement = Retirement + Investment;
Retirement= Retirement * 1.11;
/* Retirement value after 30, 40 and 50 years */
if Year = 30 then
Value_after_30_years = Retirement;
else if Year = 40 then
Value_after_40_years = Retirement;
else if Year = 50 then
Value_after_50_years = Retirement;
end;
run;
c03s4d1.sas
Conditional Iterative Processing
 You can use DO WHILE and DO UNTIL statements to stop the loop when a condition is met rather than after a fixed number of iterations.
To avoid infinite loops, be sure that the specified condition will be met.
The DO WHILE Statement
 The DO WHILE statement executes statements
in a DO loop repetitively while a condition is true.
 General form of the DO WHILE loop:
DO WHILE (expression);
   <additional SAS statements>
END;
 The value of expression is evaluated at the top of the loop.
 The statements in the loop never execute if expression is initially false.
The DO UNTIL Statement
 The DO UNTIL statement executes statements
in a DO loop repetitively until a condition is true.
 General form of the DO UNTIL loop:
DO UNTIL (expression);
   <additional SAS statements>
END;
 The value of expression is evaluated at the bottom of the loop.
 The statements in the loop are executed at least once.
Although the condition is placed at the top of the loop, it is evaluated at the bottom of the loop.
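The difference between the two loops shows most clearly when the stopping condition already holds before the loop starts. The following sketch counts the iterations of each loop for a starting value of n = 10; the variable names are illustrative.

```sas
data _null_;
   n = 10;
   count_while = 0;
   do while (n < 5);        /* tested at the top: false at once */
      count_while = count_while + 1;
      n = n + 1;
   end;

   n = 10;
   count_until = 0;
   do until (n >= 5);       /* tested at the bottom: body runs once */
      count_until = count_until + 1;
      n = n + 1;
   end;

   put count_while= count_until=;
run;
```

The log should show count_while=0 and count_until=1.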
Business Scenario
 Determine the number of
years that it would take for an
account to exceed $1,000,000 if
$5,000 is invested annually at
4.5 percent.
Using the DO UNTIL Statement
data invest;
   do until(Capital>1000000);
      Year+1;
      Capital+5000;
      Capital+(Capital*.045);
   end;
run;
proc print data=invest noobs;
   format Capital dollar14.2;
run;
PROC PRINT Output
Capital          Year
$1,029,193.17    52
Iterative DO Loop with a Conditional Clause
 You can combine DO WHILE and DO UNTIL statements with the iterative DO statement.
 General form of the iterative DO loop with a conditional clause:
DO index-variable=start TO stop <BY increment>
   WHILE | UNTIL (expression);
   <additional SAS statements>
END;
 This is one method of avoiding an infinite loop in DO WHILE or DO UNTIL statements.
Using DO UNTIL with an Iterative DO Loop
Determine the value of the account again. Stop the loop if 30 years is reached or more than $250,000 is accumulated.
data invest;
   do Year=1 to 30 until(Capital>250000);
      Capital+5000;
      Capital+(Capital*.045);
   end;
run;
proc print data=invest noobs;
   format capital dollar14.2;
run;
PROC PRINT Output
Year  Capital
27    $264,966.67
In a DO UNTIL loop, the condition is checked before the index variable is incremented.
Using DO WHILE with an Iterative DO Loop
Determine the value of the account again, but this time use a DO WHILE statement.
data invest;
   do Year=1 to 30 while(Capital<=250000);
      Capital+5000;
      Capital+(Capital*.045);
   end;
run;
proc print data=invest noobs;
   format capital dollar14.2;
run;
PROC PRINT Output
Year  Capital
28    $264,966.67
In a DO WHILE loop, the condition is checked after the index variable is incremented.
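The two variants can be run side by side in one step to see why Year differs while Capital does not. This is a sketch that restates the slide code with ordinary assignments instead of sum statements, so that Capital can be reset between the loops.

```sas
data _null_;
   Capital = 0;
   do Year = 1 to 30 until (Capital > 250000);
      Capital = Capital + 5000;
      Capital = Capital + Capital * .045;
   end;
   put 'UNTIL: ' Year= Capital= dollar14.2;   /* tested before Year is incremented */

   Capital = 0;
   do Year = 1 to 30 while (Capital <= 250000);
      Capital = Capital + 5000;
      Capital = Capital + Capital * .045;
   end;
   put 'WHILE: ' Year= Capital= dollar14.2;   /* Year already incremented at the test */
run;
```

Both loops execute the body 27 times and reach the same Capital. The UNTIL loop exits with Year=27, whereas the WHILE loop first increments Year to 28 and only then finds its condition false.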
Nested DO Loops
Nested DO loops are loops within loops.
 Be sure to use different index variables for each loop.
 Each DO statement must have a corresponding END
statement.
 The inner loop executes completely for each iteration of the
outer loop.
DO index-variable-1=start TO stop <BY increment>;
DO index-variable-2=start TO stop <BY increment>;
<additional SAS statements>
END;
END;
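A small nested loop makes the order of execution visible. The sketch below writes one row per combination of the two index variables; the data set name work.grid and the variables row, col, and cell are illustrative.

```sas
data work.grid;
   do row = 1 to 3;            /* outer loop */
      do col = 1 to 2;         /* inner loop runs completely for each row */
         cell = row * col;
         output;               /* one observation per (row, col) pair */
      end;
   end;
run;

proc print data=work.grid noobs;
run;
```

The resulting table should have 3 x 2 = 6 rows, with col cycling through 1 and 2 for each value of row.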
LEAVE Statement
 Stops processing the current loop and resumes with the next statement in the sequence.
data week;
   input name $ idno start_yr status $ dept $;
   bonus=0;
   do year= start_yr to 1991;
      if bonus ge 500 then leave;
      bonus+50;
   end;
   datalines;
Jones 9011 1990 PT PUB
Thomas 876 1976 PT HR
Barnes 7899 1991 FT TECH
Harrell 1250 1975 FT HR
Richards 1002 1990 FT DEV
Kelly 85 1981 PT PUB
Stone 091 1990 PT MAIT
;
run;
For more information, see
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202782.htm
Array Processing
 You can use arrays to simplify programs that
 perform repetitive calculations
 create many variables with the same attributes
 read data
 rotate SAS data sets by making variables into observations or
observations into variables
 compare variables
 perform a table lookup
What Is a SAS Array?
 A SAS array
 is a temporary grouping of SAS variables that are arranged
in a particular order
 is identified by an array name
 exists only for the duration of the current DATA step
 is not a variable
 Each value in an array is
 called an element
 identified by a subscript that represents the position
of the element in the array.
 When you use an array reference, the corresponding
value is substituted for the reference.
What Is a SAS Array?
Array name: CONTRIB (the variable ID is not part of the array)
Element          Variable   Array reference
First element    QTR1       CONTRIB{1}
Second element   QTR2       CONTRIB{2}
Third element    QTR3       CONTRIB{3}
Fourth element   QTR4       CONTRIB{4}
The ARRAY Statement
 The ARRAY statement defines the elements in an array.
These elements are processed as a group. You refer to
elements of the array by the array name and subscript.
ARRAY array-name {subscript} <$> <length>
<array-elements> <(initial-value-list)>;
{subscript}      the number of elements
$                indicates character elements
length           the length of elements
array-elements   the names of elements
The ARRAY Statement
 The ARRAY statement
 must contain all numeric or all character elements
 must be used to define an array before the array name can be
referenced
 creates variables if they do not already exist
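The sketch below illustrates the points above, including the optional initial-value list from the ARRAY syntax. The data set names, the Target1-Target4 variables and their values are hypothetical; only the Qtr1-Qtr4 layout follows the course data.

```sas
/* Sketch: arrays over existing variables, new variables, and initial values. */
/* donations is a small stand-in data set; the target amounts are made up.    */
data donations;
   input Employee_ID Qtr1-Qtr4;
   datalines;
120265 . . . 25
120267 15 15 15 15
120269 20 20 20 20
;
run;

data compare(drop=i Target1-Target4);
   set donations;
   array Contrib{4} Qtr1-Qtr4;                      /* existing numeric variables */
   array Target{4} Target1-Target4 (10 20 20 15);   /* created here with initial values */
   array Diff{4};                                   /* creates Diff1-Diff4 */
   do i=1 to 4;
      Diff{i}=Contrib{i}-Target{i};                 /* missing if the contribution is missing */
   end;
run;

proc print data=compare noobs;
run;
```

Used this way, the Target array works as a small lookup table indexed by quarter.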
Defining an Array
 Write an ARRAY statement that defines the four
quarterly contribution variables as elements of an array.
array Contrib{4} Qtr1 Qtr2 Qtr3 Qtr4;
Array name: CONTRIB (the variable ID is not part of the array)
Element          Variable
First element    QTR1
Second element   QTR2
Third element    QTR3
Fourth element   QTR4
Defining an Array
 Variables that are elements of an array do not need to
have similar, related, or numbered names.
array Contrib2{4} Q1 Qrtr2 ThrdQ Qtr4;
Array name: CONTRIB2 (the variable ID is not part of the array)
Element          Variable
First element    Q1
Second element   QRTR2
Third element    THRDQ
Fourth element   QTR4
Defining an Array
The subscript and element-list must agree.
data charity(keep=employee_id qtr1-qtr4);
   set orion.employee_donations;
   array Contrib1{3} qtr1-qtr4;   /* subscript smaller than the variable list: error */
   array Contrib2{5} qtr:;        /* subscript larger than the variable list: error */
   /* additional SAS statements */
run;

SAS log:
177 array Contrib1{3} qtr1-qtr4;
ERROR: Too many variables defined for the dimension(s) specified for the array Contrib1.
178 array Contrib2{5} qtr:;
ERROR: Too few variables defined for the dimension(s) specified for the array Contrib2.
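One way to avoid both errors is to make the subscript match the list, or to let SAS count the elements with {*}. The following is a sketch under the assumption that the data set contains exactly Qtr1-Qtr4; employee_donations here is a tiny stand-in for orion.employee_donations, and n1 and n2 are helper variables added only to show the resulting dimensions.

```sas
/* Sketch: keeping the subscript and the element list in agreement. */
data employee_donations;
   input Employee_ID Qtr1-Qtr4;
   datalines;
120265 . . . 25
120267 15 15 15 15
;
run;

data charity(keep=Employee_ID Qtr1-Qtr4 n1 n2);
   set employee_donations;
   array Contrib1{4} Qtr1-Qtr4;   /* explicit subscript matches the four variables */
   array Contrib2{*} Qtr:;        /* {*} lets SAS determine the number of elements */
   n1=dim(Contrib1);
   n2=dim(Contrib2);
run;

proc print data=charity noobs;
run;
```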
Processing an Array
 Array processing often occurs within DO loops. An iterative
DO loop that processes an array has the following form:
 To execute the loop as many times as there are elements in
the array, specify that the values of index-variable range from 1
to number-of-elements-in-array.
DO index-variable=1 TO number-of-elements-in-array;
additional SAS statements
using array-name{index-variable}…
END;
Processing an Array
array Contrib{4} Qtr1 Qtr2 Qtr3 Qtr4;
do Qtr = 1 to 4;
   Contrib{Qtr} = Contrib{Qtr}*1.25;
end;

Value of index variable Qtr   Array reference CONTRIB{QTR}   Element
1                             CONTRIB{1}                     First element (QTR1)
2                             CONTRIB{2}                     Second element (QTR2)
3                             CONTRIB{3}                     Third element (QTR3)
4                             CONTRIB{4}                     Fourth element (QTR4)
Performing Repetitive Calculations
data charity(drop = Qtr);
set univ.donate;
array Contrib{4} Qtr1 Qtr2 Qtr3 Qtr4;
do Qtr = 1 to 4;
Contrib{Qtr} = Contrib{Qtr}*1.25;
end;
run;
c03s5d1.sas
Performing Repetitive Calculations
proc print data = charity noobs;
run;

• Partial PROC PRINT Output
ID       Qtr1    Qtr2    Qtr3     Qtr4
E00224   15.00   41.25   27.50    .
E00367   43.75   60.00   50.00    37.50
E00441   .       78.75   111.25   112.50
E00587   20.00   23.75   37.50    36.25
E00598   5.00    10.00   7.50     1.25
Using an Array as a Function Argument
The program below passes an array to the SUM function. The array is passed as if it were a variable list.
data test;
   set orion.employee_donations;
   array val{4} qtr1-qtr4;
   Tot1=sum(of qtr1-qtr4);
   Tot2=sum(of val{*});
run;
proc print data=test;
   var employee_id tot1 tot2;
run;

Partial PROC PRINT Output
Obs   Employee_ID   Tot1   Tot2
1     120265        25     25
2     120267        60     60
3     120269        80     80
The DIM Function
 The DIM function returns the number of elements
in an array. This value is often used as the stop value
in a DO loop.
 General form of the DIM function: DIM(array_name)
array Contrib{*} qtr:;
num_elements=dim(Contrib);
do i=1 to num_elements;
Contrib{i}=Contrib{i}*1.25;
end;
run;
data charity;
set orion.employee_donations;
keep employee_id qtr1-qtr4;
array Contrib{*} qtr:;
do i=1 to dim(Contrib);
Contrib{i}=Contrib{i}*1.25;
end;
run;
Creating Variables with Arrays
The Contrib array refers to existing variables. The Diff array
creates three variables: Diff1, Diff2, and Diff3.
data change;
set orion.employee_donations;
drop i;
array Contrib{4} Qtr1-Qtr4;
array Diff{3};
do i=1 to 3;
Diff{i}=Contrib{i+1}-Contrib{i};
end;
run;
Array with an "undefined" dimension
%let list= Qtr1 Qtr2 Qtr3 Qtr4;
data change;
   set orion.employee_donations;
   array Contrib{*} &list;
   /* more SAS statements */
run;
data tab2;
   set tab1;
   array AllNums{*} _numeric_;
   do i = 1 to dim(AllNums);
      /* more SAS statements */
   end;
run;
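As one possible use of the {*} form with _NUMERIC_, the sketch below replaces missing values with zero in every numeric variable. The data set tab1 and its columns are hypothetical, and zero-filling is only an illustration of the pattern, not a recommendation from the course.

```sas
/* Sketch: processing every numeric variable with {*} and DIM. */
data tab1;
   input id $ x y z;
   datalines;
a 1 . 3
b . 5 .
;
run;

data tab2(drop=i);
   set tab1;
   /* _NUMERIC_ covers x, y, z; i is not defined yet, so it is not an element */
   array AllNums{*} _numeric_;
   do i = 1 to dim(AllNums);
      if AllNums{i} = . then AllNums{i} = 0;
   end;
run;

proc print data=tab2 noobs;
run;
```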
6. SAS Data Step - joining and rotating tables
Concatenation
 A concatenation
 combines two or more data sets, one after the other, into a
single data set
 uses the SET statement.
 The new data set
 contains all observations from the original data sets in
sequential order
 by default, contains all variables from the original data sets.
Concatenation
Data Set 1
Month   Sales
JAN     25354
FEB     28999
MAR     26489

Data Set 2
Month   Sales
APR     23541
MAY     24877
JUN     24653

Combined (Data Set 1 + Data Set 2)
Month   Sales
JAN     25354
FEB     28999
MAR     26489
APR     23541
MAY     24877
JUN     24653
Concatenation
Data Set 1
Month   Sales   Goal
JAN     25354   26500
FEB     28999   27000
MAR     26489   27500

Data Set 2
Month   Sales
APR     23541
MAY     24877
JUN     24653

Combined (Data Set 1 + Data Set 2)
Month   Sales   Goal
JAN     25354   26500
FEB     28999   27000
MAR     26489   27500
APR     23541   .
MAY     24877   .
JUN     24653   .
Coding for Concatenation
• Use the SET statement in a DATA step to concatenate SAS data sets.
• General form of a DATA step concatenation:
DATA SAS-data-set;
   SET SAS-data-set-1 SAS-data-set-2;
RUN;
• Example:
data univ.mastercustomers;
   set univ.uscustomers
       univ.usnewcustomers;
run;
c04s1d1.sas
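A self-contained sketch of concatenation when the input data sets do not share all variables. The values mirror the slide figures; the data set names q1, q2 and half_year are assumptions made for illustration.

```sas
/* Sketch: concatenating two data sets whose variables differ. */
data q1;
   input Month $ Sales Goal;
   datalines;
JAN 25354 26500
FEB 28999 27000
MAR 26489 27500
;
run;

data q2;
   input Month $ Sales;
   datalines;
APR 23541
MAY 24877
JUN 24653
;
run;

data half_year;
   set q1 q2;   /* all observations of q1, then all observations of q2 */
run;

proc print data=half_year noobs;
run;
```

Goal is missing in the rows that come from q2, because q2 has no such variable.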
Interleaving
 Interleaving uses a SET statement and a BY statement
to combine two or more SAS data sets.
 The data set created through interleaving
 contains all observations from the original data sets
 is arranged by the values of the BY variable(s)
 by default, contains all variables from the original
data sets.
Interleaving
Data Set 1
Month   Sales
JAN     26510
FEB     24530
MAR     20122
MAR     14258

Data Set 2
Month   Sales
JAN     21654
JAN     19873
FEB     22306
FEB     24003
MAR     19855
APR     23502

Interleaved (Data Set 1 + Data Set 2)
Month   Sales
JAN     26510
JAN     21654
JAN     19873
FEB     24530
FEB     22306
FEB     24003
MAR     20122
MAR     14258
MAR     19855
APR     23502
Sorting a SAS Data Set
 Before you interleave SAS data sets, all data sets must be
sorted on the variable(s) that determine(s) the order of
observations in the final data set.
 You can use PROC SORT to sort data.
 General form of PROC SORT:
PROC SORT DATA=input-SAS-data-set
OUT=output-SAS-data-set;
BY <DESCENDING> by-variable(s);
RUN;
Sorting a SAS Data Set
 The SORT procedure
 rearranges the observations in a SAS data set
 can create a new SAS data set that contains the rearranged
observations
 can sort on multiple variables
 can sort in ascending (default) or descending order
 does not generate printed output
 treats missing values as the smallest possible value.
PROC SORT Example
proc sort data=univ.uscustomers
out=uscustomers;
by CustomerID;
run;
proc sort data=univ.usnewcustomers
out=usnewcustomers;
by CustomerID;
run;
c04s1d2.sas
Coding to Interleave SAS Data Sets
 After the original data sets are properly sorted, the DATA
step with a SET statement and a BY statement is used to
interleave the sorted data sets.
 General form of the DATA step:
DATA SAS-data-set;
SET SAS-data-set-1 SAS-data-set-2;
BY variable(s);
RUN;
Interleaving Example
data univ.mastercustomers;
set uscustomers usnewcustomers;
by CustomerID;
run;
c04s1d2.sas
Match-Merging
 Match-merging
 combines observations from two or more SAS data sets into a
single observation in a new data set according to the values of
a common variable
 can be used to combine observations having a
one-to-one, one-to-many, or many-to-many relationship.
Business Scenario
• The data set univ.customerorders contains sales order information.
CustomerID   OrderDate   OrderID      ProductID      Quantity   Price
029858       15128       1239347234   230100600005   4          130
029858       15171       1239686972   240800100020   1          122
029858       15171       1239686972   240800100036   1          468
• The data set univ.mastercustomers contains mailing address and other information about customers.
CustomerID   CustomerFirstName   CustomerLastName   CustomerAddress       CustomerGroup
000492       David               Dulin              147 Bowling Farm Ct   Orion Club
000551       Blu                 Peachey            85 Lake Boone Trl     Internet/...
000738       Jerry               Krejci             700 Fernwood Dr       Orion Club Gold
Match-Merging
Create a data set named shipping that contains the names and addresses of customers who have placed orders by merging the univ.customerorders data set and the univ.mastercustomers data set.
Partial SAS Output
CustomerID   OrderDate   ...   CustomerFirstName   CustomerLastName   CustomerAddress
029858       15128       ...   Alice               Maxam              81 Flagstone Pl
029858       15171       ...   Alice               Maxam              81 Flagstone Pl
029858       15171       ...   Alice               Maxam              81 Flagstone Pl
Coding the Match-Merge
 Data sets combined with a match-merge must be sorted by the
common variable. Use PROC SORT to prepare data if necessary.
 The DATA step with a MERGE statement and a BY statement is
used to match-merge two or more SAS
data sets.
 General form of the DATA step:
DATA SAS-data-set ;
MERGE SAS-data-set-1 SAS-data-set-2;
BY variable(s);
RUN;
Match-Merging with Nonmatches
proc sort data=univ.customerorders
          out=customerorders;
   by CustomerID;
run;
proc sort data=univ.mastercustomers
          out=mastercustomers;
   by CustomerID;
run;
data shipping;
   merge customerorders
         mastercustomers;
   by CustomerID;
run;
c04s2d1.sas
Match-Merging with Nonmatches
• By default, SAS writes all observations, matches and nonmatches, to the output data set.

Partial customerorders data set
CustomerID   OrderDate
029858       15265
029858       15278
030643       15129

Partial mastercustomers data set
CustomerID   CustomerAddress
029858       81 Flagstone Pl
030596       582 Guffy Drive
030643       13 Highfalls Court

Partial work.shipping data set
CustomerID   OrderDate   CustomerAddress
029858       15265       81 Flagstone Pl
029858       15278       81 Flagstone Pl
030596       .           582 Guffy Drive
030643       15129       13 Highfalls Court
Controlling Nonmatches
• The IN= data set option identifies whether a SAS data set contributed data to the current observation.
• A subsetting IF statement continues processing the current observation only when its condition is true.
• Combined use of the two techniques writes out only matches to the final data set.
General form of the IN= data set option:
SAS-data-set (IN = variable)
variable is a temporary numeric variable that has two possible values:
0   indicates that the data set did not contribute to the current observation.
1   indicates that the data set did contribute to the current observation.
The IN= Data Set Option
• Use when merging tables:
DATA SAS-data-set;
   MERGE SAS-data-set-1 (IN=IN1)
         SAS-data-set-2 (IN=IN2);
   BY variable(s);
RUN;
Eliminating Nonmatches
proc sort data=univ.customerorders
          out=customerorders;
   by CustomerID;
run;
proc sort data=univ.mastercustomers
          out=mastercustomers;
   by CustomerID;
run;
data shipping;
   merge customerorders (in=inorders)
         mastercustomers (in=inmaster);
   by CustomerID;
   if inorders=1 and inmaster=1;
run;
c04s2d2.sas
Eliminating Nonmatches
• The observations that do not appear in both data sets are not written to the new data set.

Partial customerorders data set
CustomerID   OrderDate
029858       15265
029858       15278
030643       15129

Partial mastercustomers data set
CustomerID   CustomerAddress
029858       81 Flagstone Pl
030596       582 Guffy Drive
030643       13 Highfalls Court

Partial work.shipping data set
CustomerID   OrderDate   CustomerAddress
029858       15265       81 Flagstone Pl
029858       15278       81 Flagstone Pl
030643       15129       13 Highfalls Court
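The same IN= flags can also route nonmatches to a separate data set. The sketch below assumes two small input data sets that mirror the slide figures; the output data set name nomatch and the comma-delimited input layout are illustrative choices, not part of the course files. With these inputs, shipping should receive the three matching order rows and nomatch the single customer without orders (030596).

```sas
/* Sketch: keeping matches and nonmatches in separate data sets. */
data customerorders;
   input CustomerID $ OrderDate;
   datalines;
029858 15265
029858 15278
030643 15129
;
run;

data mastercustomers;
   infile datalines dlm=',';
   input CustomerID $ CustomerAddress :$20.;
   datalines;
029858,81 Flagstone Pl
030596,582 Guffy Drive
030643,13 Highfalls Court
;
run;

/* Both inputs are already in CustomerID order, so no PROC SORT is needed here. */
data shipping nomatch;
   merge customerorders (in=inorders)
         mastercustomers (in=inmaster);
   by CustomerID;
   if inorders and inmaster then output shipping;   /* present in both data sets */
   else if inmaster then output nomatch;            /* customer without orders */
run;

proc print data=shipping noobs;
run;
proc print data=nomatch noobs;
run;
```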
The RENAME= Data Set Option
 The RENAME= data set option can be used to change
the name of a variable from an input data set.
 General form of the RENAME= data set option:
SAS-data-set(RENAME = (old-name = new-name))
Using the RENAME= Option
• If the key variable in the last example were named differently in each data set, then the RENAME= option would need to be used.

Partial Orders Information
CustNum   OrderDate
029858    15265
029858    15278
030643    15129

Partial Customer Information
CustomerID   CustomerAddress
029858       81 Flagstone Pl
030596       582 Guffy Drive
030643       13 Highfalls Court
Using the RENAME= Option
proc sort data=univ.customerorders
          out=customerorders;
   by CustNum;   /* the orders data set still uses the old key name here */
run;
proc sort data=univ.mastercustomers
          out=mastercustomers;
   by CustomerID;
run;
data shipping;
   merge customerorders (in=inorders
                         rename=(CustNum=CustomerID))
         mastercustomers (in=inmaster);
   by CustomerID;
   if inorders=1 and inmaster=1;
run;
Rotating a SAS Data Set
• Restructure the input data set, and create a separate observation for each nonmissing quarterly contribution. The output data set, rotate, should contain only Employee_ID, Period, and Amount.

Input data set (partial listing)
Employee_ID   Qtr1   Qtr2   Qtr3   Qtr4   Paid_By
120265        .      .      .      25     Cash or Check
120267        15     15     15     15     Payroll Deduction
120269        20     20     20     20     Payroll Deduction

Output data set rotate
Employee_ID   Period   Amount
120265        Qtr4     25
120267        Qtr1     15
120267        Qtr2     15
120267        Qtr3     15
120267        Qtr4     15
120269        Qtr1     20
120269        Qtr2     20
120269        Qtr3     20
120269        Qtr4     20
Rotating a SAS Data Set
• The DATA step below rotates the input data set. An output observation will be written if a contribution was made in a given quarter.
data rotate (keep=Employee_Id Period Amount);
   set orion.employee_donations
       (drop=recipients paid_by);
   array contrib{4} qtr1-qtr4;
   do i=1 to 4;
      if contrib{i} ne . then do;   /* only include nonmissing values */
         Period=cats("Qtr",i);
         Amount=contrib{i};
         output;
      end;
   end;
run;
The TRANSPOSE Procedure
PROC TRANSPOSE DATA=input-data-set
<OUT=output-data-set>
<NAME = variable-name>;
<BY <DESCENDING> variable-1
<...<DESCENDING> variable-n> <NOTSORTED>;>
<VAR variable(s);>
<ID variable;>
RUN;
NAME= specifies a new name for the _NAME_ column. The values in this column identify the
variable that supplied the values in the row.
BY specifies the variable(s) to use to form BY groups.
VAR specifies the variable(s) to transpose.
ID specifies the variable whose values will become the new variables.
The TRANSPOSE Procedure
The TRANSPOSE procedure
 transposes selected variables into observations
 transposes numeric variables by default
 transposes character variables only if explicitly
listed in a VAR statement.
Using the Transpose Procedure
Start with a simple PROC TRANSPOSE step:
proc transpose
   data=orion.employee_donations
   out=rotate2;
run;

Partial Listing of rotate2
_NAME_        _LABEL_       COL1     COL2     COL3     ...   COL124
Employee_ID   Employee ID   120265   120267   120269         121147
Qtr1                        .        15       20             10
Qtr2                        .        15       20             10
Qtr3                        .        15       20             10
Qtr4                        25       15       20             10

The output is very different from the desired results. A row was created for each variable. A column was created for each of the 124 observations.
Results of a Simple Transposition
Compare PROC TRANSPOSE output to the original data:

Partial Listing of orion.employee_donations
Employee_ID   Qtr1   Qtr2   Qtr3   Qtr4   Paid_By
120265        .      .      .      25     Cash or Check
120267        15     15     15     15     Payroll Deduction
120269        20     20     20     20     Payroll Deduction
120270        20     10     5      .      Cash or Check
120271        20     20     20     20     Payroll Deduction

Partial Listing of rotate2
_NAME_        _LABEL_       COL1     COL2     COL3     ...   COL124
Employee_ID   Employee ID   120265   120267   120269         121147
Qtr1                        .        15       20             10
Qtr2                        .        15       20             10
Qtr3                        .        15       20             10
Qtr4                        25       15       20             10

• All the numeric variables were transposed by default. Paid_By, a character variable, was not transposed.
• Each observation (row) in the input data set becomes a variable (column) in the output data set.
PROC TRANSPOSE Results
• The data should be grouped by Employee_ID with a separate observation for each transposed variable.

Partial Listing of rotate2
_NAME_        _LABEL_       COL1     COL2     COL3     ...   COL124
Employee_ID   Employee ID   120265   120267   120269         120271
Qtr1                        .        15       20             20
Qtr2                        .        15       20             20
Qtr3                        .        15       20             20
Qtr4                        25       15       20             20

Grouped by Employee_ID
Employee_ID   _NAME_   COL1
120265        Qtr1     .
120265        Qtr2     .
120265        Qtr3     .
120265        Qtr4     25
120267        Qtr1     15
120267        Qtr2     15
...
The BY Statement
Use a BY statement to group the output by Employee_ID.
All numeric variables other than the BY variable are transposed.
proc transpose
data=orion.employee_donations
out=rotate2;
by Employee_ID;
run;
proc print data=rotate2 noobs;
run;
Improved PROC TRANSPOSE Results
Use of the BY statement results in one observation for each transposed variable per Employee_ID, and includes missing values.

Partial PROC PRINT Output
Employee_ID   _NAME_   COL1
120265        Qtr1     .
120265        Qtr2     .
120265        Qtr3     .
120265        Qtr4     25
120267        Qtr1     15
120267        Qtr2     15
120267        Qtr3     15
120267        Qtr4     15

If there were additional numeric variables, an observation would be created for each.
The VAR Statement
 The VAR statement is used to specify which variables
to transpose. It can include character and numeric variables.
The VAR statement has no effect in this example
because Qtr1-Qtr4 will be transposed by default.
proc transpose
data=orion.employee_donations
out=rotate2;
by Employee_ID;
var Qtr1-Qtr4;
run;
proc print data=rotate2 noobs;
run;
Renaming Variables in PROC TRANSPOSE
The RENAME= data set option is used to change the name of COL1. The PROC TRANSPOSE option, NAME=, is used to rename _NAME_.
proc transpose
   data=orion.employee_donations
   out=rotate2(rename=(col1=Amount))
   name=Period;
   by employee_id;
run;
proc print data=rotate2 noobs;
run;

Partial Listing of rotate2
Employee_ID   Period   Amount
120265        Qtr1     .
120265        Qtr2     .
120265        Qtr3     .
120265        Qtr4     25
120267        Qtr1     15
120267        Qtr2     15
The WHERE= Data Set Option
 There is no option or statement in PROC TRANSPOSE
to eliminate observations with missing values for the transposed
variable. However, this can be achieved
using a WHERE= data set option in the output data set.
proc transpose
data=orion.employee_donations
out=rotate2(rename=(col1=Amount)
where=(Amount ne .))
name=Period;
by employee_id;
run;
proc print data=rotate2 noobs;
run;
proc freq data=rotate2;
tables Period/nocum nopct;
label Period=" ";
run;
No Missing Values
Partial PROC PRINT Output:
Employee_ID Period Amount
120265 Qtr4 25
120267 Qtr1 15
120267 Qtr2 15
120267 Qtr3 15
120267 Qtr4 15
120269 Qtr1 20
120269 Qtr2 20
120269 Qtr3 20
120269 Qtr4 20
120270 Qtr1 20
120270 Qtr2 10
120270 Qtr3 5
The resulting data set has no missing values.
Business Scenario
 The manager of Sales asked for a report showing
monthly sales and a total for each customer.
Monthly Sales by Customer
Customer_ID Month1 Month2 … Month12 Total
1 1000 . 500 2000
2 . . 200 750
3 1200 . . 2200
4 500 150 350 1000
5 . 1000 . 2500
Sketch of the Desired Report
Business Scenario Considerations
The data set orion.order_summary contains an
observation for each month in which a customer placed
an order (101 total observations). The data set is sorted
by Customer_ID and has no missing values.
Partial Listing of orion.order_summary
Customer_ID   Order_Month   Sale_Amt
5             5             478.00
5             6             126.80
5             9             52.50
5             12            33.80
10            3             32.60
10            4             250.80
10            5             79.80
10            6             12.20
10            7             163.29
The number of observations per customer varies.
Business Scenario Considerations
The report requires rotating rows into columns. Use PROC TRANSPOSE again to restructure the data set, this time from narrow to wide.

Input (partial listing)
Customer_ID   Order_Month   Sale_Amt
5             5             478.00
5             6             126.80
5             9             52.50
5             12            33.80
10            3             32.60

Desired Output
Customer_ID   Month1   ...   Month5   Month6   ...   Month9   ...   Month12
5             .              478.00   126.80         52.50          33.80
Using PROC TRANSPOSE
Start with a simple PROC TRANSPOSE.
proc transpose data=orion.order_summary
               out=annual_orders;
run;
proc print data=annual_orders noobs;
run;

Partial Listing of orion.order_summary (101 observations)
Customer_ID   Order_Month   Sale_Amt
5             5             478.00
5             6             126.80
5             9             52.50
5             12            33.80
10            3             32.60
10            4             250.80
10            5             79.80
Using PROC TRANSPOSE
• The resulting data set has three observations, one for each numeric variable in the input data set: Customer_ID, Order_Month, and Sale_Amt. The variables COL1-COL101 represent the 101 observations in the input data set.

_NAME_        _LABEL_       COL1   COL2    COL3   COL4   COL5   ...   COL101
Customer_ID   Customer ID   5      5.0     5.0    5.0    10.0         70201.0
Order_Month                 5      6.0     9.0    12.0   3.0          8.0
Sale_Amt                    478    126.8   52.5   33.8   32.6         1075.5
(COL1-COL4 belong to customer 5.)

Group the output by Customer_ID.
The BY Statement
• The BY statement groups by Customer_ID and produces an observation for each transposed variable, Order_Month and Sale_Amt.
proc transpose data=orion.order_summary
               out=annual_orders;
   by Customer_ID;
run;

Customer_ID   _NAME_        COL1    COL2    COL3   COL4   COL5     COL6    COL7     COL8    COL9
5             Order_Month   5.0     6.0     9.0    12.0   .        .       .        .       .
5             Sale_Amt      478.0   126.8   52.5   33.8   .        .       .        .       .
10            Order_Month   3.0     4.0     5.0    6.0    7.00     8.0     11.0     12.0    .
10            Sale_Amt      32.6    250.8   79.8   12.2   163.29   902.5   1894.6   143.3   .
11            Order_Month   9.0     .       .      .      .        .       .        .       .
11            Sale_Amt      78.2    .       .      .      .        .       .        .       .

Notice the varying number of nonmissing columns for each customer.
Creating Columns Based on a Variable
• Instead of transposing Order_Month, use its values to create new variables. In the Order_Month rows of the grouped output, a value of 5.0 represents orders placed in May, 6.0 represents orders placed in June, and so on.
Add an ID statement.
The ID Statement
 The ID statement identifies the variable whose values
will become the names of the new columns.
proc transpose data=orion.order_summary
out=annual_orders;
by Customer_ID;
id Order_Month;
run;
Customer_ID _NAME_ _5 _6 _9 _12 ...
5 Sale_Amt 478.0 126.80 52.5 33.80
10 Sale_Amt 79.8 12.20 . 143.30
11 Sale_Amt . . 78.2 .
12 Sale_Amt . 48.40 87.2 .
18 Sale_Amt . . . .
The values of Order_Month
(1, 2, 3, … 12) are used to create
variable names _1 through _12.
The remaining variable, Sale_Amt, is transposed.
Changing the Variable Names
 The PREFIX= option is used to set a prefix for each
new variable name. The prefix replaces the underscore.
Customer_ID _NAME_ Month5 Month6 Month9 ...
5 Sale_Amt 478.0 126.80 52.5
10 Sale_Amt 79.8 12.20 .
11 Sale_Amt . . 78.2
12 Sale_Amt . 48.40 87.2
18 Sale_Amt . . .
proc transpose data=orion.order_summary
out=annual_orders
prefix=Month;
by Customer_ID;
id Order_Month;
run;
Print the Transposed Data Set
 A VAR statement in the PRINT procedure specifies
the desired order of the variables.
proc print data=annual_orders noobs;
var Customer_ID Month1-Month12;
run;
Customer_ID Month1 Month2 Month3 Month4 Month5 ...
5 . . . . 478.0
10 . . 32.6 250.8 79.8
11 . . . . .
12 . 117.6 . . .
18 . 29.4 . . .
24 195.6 . 46.9 . .
27 174.4 . 140.7 205.0 .
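The report sketch in the business scenario also asks for a Total column, which the transposed data set does not contain yet. One way to add it is a DATA step with the SUM function over Month1-Month12, as sketched below. The data set order_summary is a small stand-in built from the partial listing of orion.order_summary, and the names report and m are assumptions made for illustration.

```sas
/* Sketch: narrow-to-wide transpose followed by a row total. */
data order_summary;
   input Customer_ID Order_Month Sale_Amt;
   datalines;
5 5 478.00
5 6 126.80
5 9 52.50
5 12 33.80
10 3 32.60
10 4 250.80
;
run;

proc transpose data=order_summary
               out=annual_orders(drop=_NAME_)
               prefix=Month;
   by Customer_ID;
   id Order_Month;
run;

data report;
   set annual_orders;
   array m{12} Month1-Month12;   /* creates any month that never occurred in the data */
   Total=sum(of m{*});           /* SUM ignores missing months */
run;

proc print data=report noobs;
   var Customer_ID Month1-Month12 Total;
run;
```

The array is used here mainly so that all twelve month variables exist even when some months have no orders in the input.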
Further details on PROC TRANSPOSE
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#transpose-overview.htm
http://www.google.cz/url?sa=t&source=web&cd=10&ved=0CHsQFjAJ&url=http%3A%2F%2Fwww.hasug.org%2Fnewsletters%2Fhasug200408%2Fproc_transpose.ppt&rct=j&q=sas%20proc%20transpose&ei=y1uOTe_xNcbcsgaN07WVCg&usg=AFQjCNEZzkLpkXcDMLRl8kmQdFYRkM6MIA&cad=rja
http://support.sas.com/resources/papers/proceedings09/060-2009.pdf
http://www2.sas.com/proceedings/sugi27/p016-27.pdf
7. Exploratory analysis - basic data description, tables
Exploratory analysis: WHY?
• We need to understand the data:
  • find errors in the data
  • find patterns in the data
  • find violations of statistical assumptions, test hypotheses
  • ...and above all because, if we skip it, we will run into serious problems later.
Data exploration - univariate
• Frequency tables, histograms:

pohlavi (sex)   count     share    bad rate
Muz (male)      248,768   55.0%    13.08%
Zena (female)   203,194   45.0%    7.69%
Total           451,962   100.0%   10.66%

delka_zamestnani (length of employment)   count     share    bad rate
0                                         20,825    4.6%     4.69%
1                                         163,144   36.1%    13.43%
2                                         67,462    14.9%    12.80%
3                                         43,778    9.7%     10.97%
4                                         26,256    5.8%     10.01%
5                                         27,526    6.1%     9.32%
6                                         15,893    3.5%     8.16%
8                                         18,036    4.0%     8.39%
10                                        17,195    3.8%     6.72%
20                                        33,641    7.4%     5.60%
24                                        5,176     1.1%     4.48%
48                                        12,934    2.9%     4.28%
666                                       96        0.0%     3.13%
Total                                     451,962   100.0%   10.66%

[Figures: share (podil) and bad rate (badrate) by pohlavi; share and bad rate by delka_zamestnani]
Data exploration - univariate
• Loan amount vs. the target variable (bad rate).
• Every "non-standard" dependency needs to be explained.
• A full understanding of the data leads to interpretable models with high predictive power.
[Figure: share (podil) and bad rate (badrate) by loan amount (vyse_uveru), 0 to 5000 in steps of 500]
OK? Or is this caused by another factor?
Data exploration - univariate
• Continuous variables:
  • mean
  • mode
  • quantiles
  • variance
  • minimum/maximum value
  • relationship to the target variable
  • categorization is often appropriate (followed by frequency tables and the relationship to the target variable)
• For a dichotomous target variable (0/1), the quantity of interest is the relative frequency of a selected category (e.g. the bad rate) over suitable intervals of the examined variable. The intervals can be:
  • fixed, e.g. 0-10, 10-20, ...
  • deciles/percentiles
  • a sliding window
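The descriptive statistics and the decile-based bad rate described above can be sketched in SAS as follows. The data set loans is simulated purely for illustration (variable names amount and bad, the seed, and the 10% bad rate are all assumptions), so the printed numbers carry no meaning beyond showing the mechanics.

```sas
/* Sketch: descriptive statistics and bad rate by decile of a continuous variable. */
data loans;
   call streaminit(2024);                         /* fixed seed, arbitrary value */
   do id=1 to 1000;
      amount=round(500+4500*rand('uniform'));     /* simulated loan amount */
      bad=rand('bernoulli',0.1);                  /* simulated 0/1 target */
      output;
   end;
run;

/* Mean, quantiles, spread and range of the continuous variable */
proc means data=loans n mean median std min max q1 q3;
   var amount;
run;

/* Assign each observation to a decile of amount (values 0 to 9) */
proc rank data=loans groups=10 out=ranked;
   var amount;
   ranks decile;
run;

/* The mean of a 0/1 target within each decile is the bad rate */
proc means data=ranked n mean maxdec=3;
   class decile;
   var bad;
run;
```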
Data exploration - univariate
• Histograms, box plots
• Stability over time
[Figures: Number of contract proposals by goods type (BT, CT, FK, MT, NA, OT, VT), weekly from 27.2. - 5.3. to 3.4. - 9.4.; one chart with absolute counts (0 to 60,000) and one with shares (0% to 100%)]
Data exploration - multivariate
Contingency tables
• Absolute frequencies are used to check that no combination of values is too rare.
• Relative frequencies (row- and column-conditional) are used to reveal relationships between variables.

Absolute frequencies
       up to 5,000   5,000 - 10,000   10,000 - 15,000   more than 15,000
BT     4,291         8,581            9,176             9,044
CT     7,587         12,493           6,500             7,236
FK     258           1,017            851               557
MT     27,191        39,551           16,524            5,992
NA     426           1,088            1,114             2,737
OT     2,478         3,689            2,103             3,475
VT     384           1,001            963               9,086

row%   up to 5,000   5,000 - 10,000   10,000 - 15,000   more than 15,000
BT     13.8%         27.6%            29.5%             29.1%
CT     22.4%         36.9%            19.2%             21.4%
FK     9.6%          37.9%            31.7%             20.8%
MT     30.5%         44.3%            18.5%             6.7%
NA     7.9%          20.3%            20.8%             51.0%
OT     21.1%         31.4%            17.9%             29.6%
VT     3.4%          8.8%             8.4%              79.5%

col%   up to 5,000   5,000 - 10,000   10,000 - 15,000   more than 15,000
BT     10.1%         12.7%            24.6%             23.7%
CT     17.8%         18.5%            17.5%             19.0%
FK     0.6%          1.5%             2.3%              1.5%
MT     63.8%         58.7%            44.4%             15.7%
NA     1.0%          1.6%             3.0%              7.2%
OT     5.8%          5.5%             5.6%              9.1%
VT     0.9%          1.5%             2.6%              23.8%
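Tables of this kind can be produced with PROC FREQ. The sketch below enters the absolute frequencies from the slide as pre-aggregated counts and uses a WEIGHT statement; the data set name contracts, the variable names, and the coding of the amount bands as 1 to 4 are assumptions made for illustration.

```sas
/* Sketch: contingency table with row and column percentages from cell counts. */
data contracts(keep=Goods_Type Amount_Band Count);
   input Goods_Type $ c1-c4;
   array cnt{4} c1-c4;
   do Amount_Band=1 to 4;       /* 1 = up to 5,000 ... 4 = more than 15,000 */
      Count=cnt{Amount_Band};
      output;                   /* one row per cell of the table */
   end;
   datalines;
BT 4291 8581 9176 9044
CT 7587 12493 6500 7236
FK 258 1017 851 557
MT 27191 39551 16524 5992
NA 426 1088 1114 2737
OT 2478 3689 2103 3475
VT 384 1001 963 9086
;
run;

proc freq data=contracts;
   weight Count;                                   /* each row stands for Count observations */
   tables Goods_Type*Amount_Band / nopercent;      /* keeps frequency, row % and column % */
run;
```

With raw, unaggregated data the WEIGHT statement would simply be dropped.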
Data exploration - multivariate
[Figures: Number of contract proposals by goods type (BT, CT, FK, MT, NA, OT, VT), four 100% stacked charts: goods type split by the bands 4-5, 6-7, 8-9, 10-11, 12-16, 17 and more (variable not named on the slide); those bands split by goods type; goods type split by amount band (up to 5,000, 5,000 - 10,000, 10,000 - 15,000, more than 15,000); amount band split by goods type]
 Věk vs. délka zaměstnání
Explorace dat - vícerozměrná
5 let …defaultní
hodnota???
 Věk vs. délka zaměstnání vs. default
Pomocí tohoto typu grafu lze
pohodlně zobrazit vztahy mezi
3 proměnnými.
Vztah mezi spojitými proměnnými –
bodové grafy, korelace
369
Correlation Coefficient
[Figure: scale of the correlation coefficient from -1 (strong negative) through 0 (weak) to +1 (strong positive).]
Extreme Data Values
[Figure: two scatter plots, one with r = 0.02 and one with r = 0.82.]
Outlying (extreme) values can completely distort the results of an analysis.
Reproduced with permission of SAS Institute Inc., Cary, NC, USA.
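The effect of a single extreme value on the correlation coefficient can be tried out with a small sketch like the one below. The data are made up for illustration (they are not the data behind the slide), and the data set names are hypothetical.

```sas
/* Hypothetical toy data: ten points without a clear trend. */
data work.no_outlier;
   input x y;
   datalines;
1 5
2 3
3 6
4 2
5 5
6 3
7 6
8 2
9 5
10 4
;
run;

/* The same ten points plus one extreme observation. */
data work.with_outlier;
   set work.no_outlier end=last;
   output;
   if last then do;
      x = 50;
      y = 60;
      output;
   end;
run;

/* Compare the Pearson correlation with and without the extreme point. */
proc corr data=work.no_outlier pearson;
   var x y;
run;

proc corr data=work.with_outlier pearson;
   var x y;
run;
```

With the extreme point included, the reported correlation should move much closer to 1, even though the other ten points are unchanged.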
Discriminatory power of variables for predictive models
Weight of evidence, information value
r ... number of levels (categories) of the categorical variable
g_i ... number of "goods" in the i-th category
b_i ... number of "bads" in the i-th category
G := Σ g_i ... total number of "goods"
B := Σ b_i ... total number of "bads"
Weight of evidence for the i-th category: woe_i = ln(g_i / G) − ln(b_i / B)
Information value for the i-th category: inf_val_i = [(g_i / G) − (b_i / B)] · woe_i
Total information value for the corresponding variable: Inf_val = Σ inf_val_i (sum over i = 1, …, r)
Interpretation of the total information value:
 < 0.02 unpredictive
 0.02 – 0.1 weak
 0.1 – 0.3 medium
 0.3 – 0.5 strong
 > 0.5 too high ... needs checking, something is probably wrong
Discriminatory power of variables
Incorporation Date
Raw      RegVar  Percent    B     G    TOT  G/B Odds  %Good  %Bad  Bad Rate     WoE        IV
0 & NOI  inc_1       12%  139   952   1091         7    11%   19%     12,7%  -0,557  0,046116
1        inc_2       13%  133  1073   1206         8    12%   19%     11,0%  -0,394  0,023731
2-7      miss        42%  299  3601   3900        12    42%   42%      7,7%   0,007  2,04E-05
8-15     inc_3       22%  108  1942   2050        18    23%   15%      5,3%   0,408  0,030887
16+      inc_4       11%   39  1019   1058        26    12%    5%      3,7%   0,781  0,050288
Total                     718  8587   9305        12                   7,7%           0,151
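The weight of evidence and information value formulas can be checked against the Incorporation Date table with a short sketch. The B and G counts are copied from the table; the data set names, variable names and the category label 0_NOI (standing in for "0 & NOI") are hypothetical.

```sas
/* Bads (b) and goods (g) per category, from the Incorporation Date table. */
data work.inc_counts;
   length category $8;
   input category $ b g;
   datalines;
0_NOI 139 952
1 133 1073
2-7 299 3601
8-15 108 1942
16+ 39 1019
;
run;

/* Totals B and G, stored in macro variables. */
proc sql noprint;
   select sum(b), sum(g) into :tot_b, :tot_g
   from work.inc_counts;
quit;

/* woe_i = ln(g_i/G) - ln(b_i/B); iv_i = (g_i/G - b_i/B) * woe_i */
data work.inc_woe;
   set work.inc_counts end=last;
   pct_good = g / &tot_g;
   pct_bad  = b / &tot_b;
   woe      = log(pct_good / pct_bad);
   iv       = (pct_good - pct_bad) * woe;
   total_iv + iv;                       /* running total, retained */
   if last then put 'Total information value: ' total_iv 8.4;
run;

proc print data=work.inc_woe;
   var category b g woe iv;
run;
```

The woe and iv columns should agree with the WoE and IV columns of the table, and the total written to the log should be close to the 0,151 shown in the Total row.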
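 Lorenz curve, Gini index
[Figure: Lorenz curve with the areas A and B marked.]
\[ \mathrm{Gini} = \frac{A}{A+B} = 2A \]
The Lorenz curve is given by the points
\[ x = F_{m.\mathrm{BAD}}(a), \quad y = F_{n.\mathrm{GOOD}}(a), \quad a \in [L, H], \]
and the Gini index can be computed as
\[ \mathrm{Gini} = 1 - \sum_{k=2}^{n+m} \left( F_{m.\mathrm{BAD}_k} - F_{m.\mathrm{BAD}_{k-1}} \right) \left( F_{n.\mathrm{GOOD}_k} + F_{n.\mathrm{GOOD}_{k-1}} \right) \]
Interpretation of the Gini index:
 <0.05 unpredictive
 0.05 – 0.1 weak
 0.1 – 0.2 medium
 0.2 – 0.5 strong
 > 0.5 too high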
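A grouped version of the Gini sum can be sketched on the Incorporation Date counts. This is an illustration under stated assumptions, not the course's own implementation: the categories are taken in order of decreasing bad rate, the cumulative shares start from the point (0, 0), and all data set and variable names are hypothetical.

```sas
/* Bads (b) and goods (g) per category, ordered by decreasing bad rate. */
data work.inc_gini_in;
   length category $8;
   input category $ b g;
   datalines;
0_NOI 139 952
1 133 1073
2-7 299 3601
8-15 108 1942
16+ 39 1019
;
run;

proc sql noprint;
   select sum(b), sum(g) into :tot_b, :tot_g
   from work.inc_gini_in;
quit;

data work.inc_gini;
   set work.inc_gini_in end=last;
   retain f_bad 0 f_good 0;           /* cumulative shares so far */
   prev_bad  = f_bad;
   prev_good = f_good;
   cum_b + b;                         /* cumulative counts */
   cum_g + g;
   f_bad  = cum_b / &tot_b;
   f_good = cum_g / &tot_g;
   /* one term of the sum: (F_bad,k - F_bad,k-1) * (F_good,k + F_good,k-1) */
   area + (f_bad - prev_bad) * (f_good + prev_good);
   if last then do;
      gini = 1 - area;
      put 'Gini index: ' gini 8.4;
   end;
run;
```

For these counts the value written to the log should be about 0.21. Reversing the category order flips the sign, so the ordering assumption matters.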
 Lorenz curve ... a check that the response variable (default rate) is monotone in the given explanatory variable
Discriminatory power of variables: categorization (WOE)
pohlavi (gender)   Gini: 0,1401   Info.Value: 0,0828
          count     share   bad rate
Muz     248 768     55,0%     13,08%
Zena    203 194     45,0%      7,69%
Total   451 962    100,0%     10,66%
delka_zamestnani_hrube (length of employment, coarse)   Gini: 0,1611   Info.Value: 0,1100
          count     share   bad rate
0        20 825      4,6%      4,69%
1       163 144     36,1%     13,43%
5       165 022     36,5%     11,29%
666     102 971     22,8%      6,45%
Total   451 962    100,0%     10,66%
delka_zamestnani_jemne (length of employment, fine)   Gini: 0,1762   Info.Value: 0,1285
delka_zamestnani     count     share   bad rate
0                   20 825      4,6%      4,69%
1                  163 144     36,1%     13,43%
2                   67 462     14,9%     12,80%
3                   43 778      9,7%     10,97%
4                   26 256      5,8%     10,01%
5                   27 526      6,1%      9,32%
6                   15 893      3,5%      8,16%
8                   18 036      4,0%      8,39%
10                  17 195      3,8%      6,72%
20                  33 641      7,4%      5,60%
24                   5 176      1,1%      4,48%
48                  12 934      2,9%      4,28%
666                     96      0,0%      3,13%
Total              451 962    100,0%     10,66%
[Figure: for each of the three variables (pohlavi, delka_zamestnani_hrube, delka_zamestnani), a chart of share (podil) and bad rate (badrate) by category.]
 It is advisable to create a summary overview (frequency tables, histograms, Gini, information value, …) for all candidate variables.
The FREQ Procedure
 The FREQ procedure can do the following:
 produce one-way to n-way frequency and crosstabulation (contingency)
tables
 compute chi-square tests for one-way to n-way tables and measures of
association and agreement for contingency tables
 automatically display the output in a report and save the output in a SAS
data set
 General form of the FREQ procedure:
PROC FREQ DATA=SAS-data-set <option(s)>;
TABLES variable(s) </ option(s)>;
RUN;
A FREQ procedure with no TABLES statement generates one-way frequency tables for all data set variables.
The TABLES Statement
A one-way frequency table produces frequencies, cumulative
frequencies, percentages, and cumulative percentages.
proc freq data=orion.sales;
tables Gender Country;
run;
One-way frequency tables:
The FREQ Procedure
                                  Cumulative    Cumulative
Gender    Frequency    Percent     Frequency       Percent
----------------------------------------------------------
F                68      41.21            68         41.21
M                97      58.79           165        100.00
                                  Cumulative    Cumulative
Country   Frequency    Percent     Frequency       Percent
----------------------------------------------------------
AU               63      38.18            63         38.18
US              102      61.82           165        100.00
The TABLES Statement
An n-way frequency table produces cell frequencies, cell
percentages, cell percentages of row frequencies, and cell
percentages of column frequencies, plus total frequency and
percent.
proc freq data=orion.sales;
tables Gender*Country;
run;
In Gender*Country, Gender defines the rows and Country the columns.
Two-way frequency table:
The FREQ Procedure
Table of Gender by Country
Gender     Country
Frequency|
Percent  |
Row Pct  |
Col Pct  |AU      |US      |  Total
---------+--------+--------+
F        |     27 |     41 |     68
         |  16.36 |  24.85 |  41.21
         |  39.71 |  60.29 |
         |  42.86 |  40.20 |
---------+--------+--------+
M        |     36 |     61 |     97
         |  21.82 |  36.97 |  58.79
         |  37.11 |  62.89 |
         |  57.14 |  59.80 |
---------+--------+--------+
Total          63      102      165
            38.18    61.82   100.00
Additional SAS Statements
Additional statements can be added to enhance the report.
proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States';
run;
options nodate pageno=1;
ods html file='p112d01.html';
proc freq data=orion.sales;
tables Gender*Country;
where Job_Title contains 'Rep';
format Country $ctryfmt.;
title 'Sales Rep Frequency Report';
run;
ods html close;
Options to Suppress Display of Statistics
Options can be placed in the TABLES statement
after a forward slash to suppress the display of the default statistics.
Option      Description
NOCUM       suppresses the display of cumulative frequency and cumulative percentage.
NOPERCENT   suppresses the display of percentage, cumulative percentage, and total percentage.
NOFREQ      suppresses the display of the cell frequency and total frequency.
NOROW       suppresses the display of the row percentage.
NOCOL       suppresses the display of the column percentage.
LIST and CROSSLIST Options
Option      Description
LIST        displays n-way tables in list format.
CROSSLIST   displays n-way tables in column format.
FORMAT=     formats the frequencies in n-way tables.
tables Gender*Country / list;
                                                  Cumulative    Cumulative
Gender    Country          Frequency    Percent    Frequency       Percent
--------------------------------------------------------------------------
F         Australia               27      16.36           27         16.36
F         United States           41      24.85           68         41.21
M         Australia               36      21.82          104         63.03
M         United States           61      36.97          165        100.00
tables Gender*Country / crosslist;
Table of Gender by Country
                                                         Row     Column
Gender    Country          Frequency    Percent      Percent    Percent
-----------------------------------------------------------------------
F         Australia               27      16.36        39.71      42.86
          United States           41      24.85        60.29      40.20
          Total                   68      41.21       100.00
-----------------------------------------------------------------------
M         Australia               36      21.82        37.11      57.14
          United States           61      36.97        62.89      59.80
          Total                   97      58.79       100.00
-----------------------------------------------------------------------
Total     Australia               63      38.18                  100.00
          United States          102      61.82                  100.00
          Total                  165     100.00
-----------------------------------------------------------------------
PROC FREQ Statement Options
Options can also be placed in the PROC FREQ statement.
Option     Description
NLEVELS    displays a table that provides the number of levels for each variable named in the TABLES statement.
PAGE       displays only one table per page.
COMPRESS   begins the display of the next one-way frequency table on the same page as the preceding one-way table if there is enough space to begin the table.
proc freq data=orion.sales nlevels;
tables Gender Country Employee_ID;
run;
The FREQ Procedure
Number of Variable Levels
Variable Levels
-----------------------
Gender 2
Country 2
Employee_ID 165
Output Data Sets
PROC FREQ produces output data sets using two different methods.
 The TABLES statement with an OUT= option is used to create a data set with frequencies and percentages.
TABLES variables / OUT=SAS-data-set <options>;
 The OUTPUT statement with an OUT= option is used to create a data set with specified statistics such as the chi-square statistic.
OUTPUT OUT=SAS-data-set <options>;
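The two methods can be combined in one step, as in the following sketch. It assumes the course data set orion.sales is available; the output data set names work.freq_counts and work.freq_stats are hypothetical.

```sas
/* NOPRINT suppresses the report; only the two data sets are created. */
proc freq data=orion.sales noprint;
   /* CHISQ must be requested in TABLES so that OUTPUT can store it. */
   tables Gender*Country / chisq out=work.freq_counts;
   output out=work.freq_stats chisq;
run;

/* Cell frequencies and percentages from the TABLES statement. */
proc print data=work.freq_counts;
run;

/* Chi-square statistics from the OUTPUT statement. */
proc print data=work.freq_stats;
run;
```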
The MEANS Procedure
 The MEANS procedure provides data summarization tools to compute
descriptive statistics for variables across all observations and within groups
of observations.
 General form of the MEANS procedure:
PROC MEANS DATA=SAS-data-set <statistic(s)> <option(s)>;
VAR analysis-variable(s);
CLASS classification-variable(s);
RUN;
By default, the MEANS procedure reports the number of nonmissing observations, the mean, the standard deviation, the minimum value, and the maximum value of all numeric variables.
proc means data=orion.sales;
run;
The MEANS Procedure
Variable         N         Mean       Std Dev      Minimum      Maximum
-----------------------------------------------------------------------
Employee_ID    165    120713.90   450.0866939    120102.00    121145.00
Salary         165     31160.12      20082.67     22710.00    243190.00
Birth_Date     165      3622.58       5456.29     -5842.00     10490.00
Hire_Date      165     12054.28       4619.94      5114.00     17167.00
-----------------------------------------------------------------------
The VAR Statement
The VAR statement identifies the analysis variables
and their order in the results.
proc means data=orion.sales;
var Salary;
run;
The MEANS Procedure
Analysis Variable : Salary
  N        Mean     Std Dev     Minimum      Maximum
----------------------------------------------------
165    31160.12    20082.67    22710.00    243190.00
----------------------------------------------------
The CLASS Statement
The CLASS statement identifies variables whose values define
subgroups for the analysis.
proc means data=orion.sales;
var Salary;
class Gender Country;
run;
The MEANS Procedure
Analysis Variable : Salary
                     N
Gender    Country  Obs    N        Mean     Std Dev     Minimum     Maximum
---------------------------------------------------------------------------
F         AU        27   27    27702.41     1728.23    25185.00    30890.00
          US        41   41    29460.98     8847.03    25390.00    83505.00
M         AU        36   36    32001.39    16592.45    25745.00   108255.00
          US        61   61    33336.15    29592.69    22710.00   243190.00
---------------------------------------------------------------------------
In this output, Salary is the analysis variable, Gender and Country are the classification variables, and the columns N through Maximum are the statistics for the analysis variable.
The CLASS statement adds the N Obs column, which is the number of observations for each unique combination of the class variables.
PROC MEANS Statistics
The statistics to compute and the order to display them can be specified in the PROC MEANS statement.
proc means data=orion.sales sum mean range;
var Salary;
class Country;
run;
The MEANS Procedure
Analysis Variable : Salary
             N
Country    Obs           Sum        Mean        Range
-----------------------------------------------------
AU          63    1900015.00    30158.97     83070.00
US         102    3241405.00    31778.48    220480.00
-----------------------------------------------------
 Other available statistics:
Descriptive Statistic Keywords
CLM       CSS     CV        LCLM    MAX
MEAN      MIN     MODE      N       NMISS
KURTOSIS  RANGE   SKEWNESS  STDDEV  STDERR
SUM       SUMWGT  UCLM      USS     VAR
Quantile Statistic Keywords
MEDIAN | P50  P1   P5   P10  Q1 | P25
Q3 | P75      P90  P95  P99  QRANGE
Hypothesis Testing Keywords
PROBT  T
PROC MEANS Statement Options
 Options can also be placed in the PROC MEANS statement.
Option    Description
MAXDEC=   specifies the number of decimal places to use in printing the statistics.
FW=       specifies the field width to use in displaying the statistics.
NONOBS    suppresses reporting the total number of observations for each unique combination of the class variables.
proc means data=orion.sales maxdec=0;
The MEANS Procedure
Analysis Variable : Salary
             N
Country    Obs      N       Mean    Std Dev    Minimum    Maximum
-----------------------------------------------------------------
AU          63     63      30159      12699      25185     108255
US         102    102      31778      23556      22710     243190
-----------------------------------------------------------------
proc means data=orion.sales maxdec=1;
The MEANS Procedure
Analysis Variable : Salary
             N
Country    Obs      N       Mean    Std Dev    Minimum    Maximum
-----------------------------------------------------------------
AU          63     63    30159.0    12699.1    25185.0   108255.0
US         102    102    31778.5    23555.8    22710.0   243190.0
-----------------------------------------------------------------
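A complete step that combines these options with a choice of statistics might look like the sketch below. It assumes the course data set orion.sales; the particular combination of statistics is only an example.

```sas
/* MAXDEC=1 rounds the printed statistics to one decimal place,
   NONOBS drops the N Obs column that CLASS would otherwise add,
   and N MEAN MEDIAN replace the default set of statistics. */
proc means data=orion.sales maxdec=1 nonobs n mean median;
   var Salary;
   class Country;
run;
```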
Output Data Sets
PROC MEANS produces output data sets using the following method:
OUTPUT OUT=SAS-data-set <options>;
The output data set contains the following variables:
 BY variables
 class variables
 the automatic variables _TYPE_ and _FREQ_
 the variables requested in the OUTPUT statement
OUTPUT Statement OUT= Option
proc means data=orion.sales sum mean range;
var Salary;
class Gender Country;
output out=work.means1;
run;
proc print data=work.means1;
run;
The statistics in the PROC statement impact only the MEANS report, not the data set. The data set contains the default statistics (partial PROC PRINT output):
Obs  Gender  Country  _TYPE_  _FREQ_  _STAT_     Salary
  1                        0     165  N          165.00
  2                        0     165  MIN      22710.00
  3                        0     165  MAX     243190.00
  4                        0     165  MEAN     31160.12
  5                        0     165  STD      20082.67
  6          AU            1      63  N           63.00
  7          AU            1      63  MIN      25185.00
  8          AU            1      63  MAX     108255.00
  9          AU            1      63  MEAN     30158.97
 10          AU            1      63  STD      12699.14
 11          US            1     102  N          102.00
 12          US            1     102  MIN      22710.00
 13          US            1     102  MAX     243190.00
 14          US            1     102  MEAN     31778.48
 15          US            1     102  STD      23555.84
 16  F                     2      68  N           68.00
 17  F                     2      68  MIN      25185.00
 18  F                     2      68  MAX      83505.00
 19  F                     2      68  MEAN     28762.72
 20  F                     2      68  STD       6974.15
OUTPUT Statement OUT= Option
The OUTPUT statement can also do the following:
 specify the statistics for the output data set
 select and name variables
The NOPRINT option suppresses the display of all output.
proc means data=orion.sales noprint;
var Salary;
class Gender Country;
output out=work.means2
min=minSalary max=maxSalary
sum=sumSalary mean=aveSalary;
run;
proc print data=work.means2;run;
                                          min     max      sum       ave
Obs  Gender  Country  _TYPE_  _FREQ_   Salary  Salary   Salary    Salary
  1                        0     165    22710  243190  5141420  31160.12
  2          AU            1      63    25185  108255  1900015  30158.97
  3          US            1     102    22710  243190  3241405  31778.48
  4  F                     2      68    25185   83505  1955865  28762.72
  5  M                     2      97    22710  243190  3185555  32840.77
  6  F       AU            3      27    25185   30890   747965  27702.41
  7  F       US            3      41    25390   83505  1207900  29460.98
  8  M       AU            3      36    25745  108255  1152050  32001.39
  9  M       US            3      61    22710  243190  2033505  33336.15
OUTPUT Statement OUT= Option
_TYPE_ is a numeric variable that shows which combination of class variables produced the summary statistics in that observation.
For this example (the PROC PRINT output of work.means2 above):
_TYPE_  Type of Summary                 _FREQ_
0       overall summary                 165
1       summary by Country only         63 AU + 102 US = 165
2       summary by Gender only          68 F + 97 M = 165
3       summary by Country and Gender   27 F AU + 41 F US + 36 M AU + 61 M US = 165
OUTPUT Statement OUT= Option
 Options can be added to the PROC MEANS statement to control the output data set.
Option         Description
NWAY           specifies that the output data set contain only statistics for the observations with the highest _TYPE_ value.
DESCENDTYPES   orders the output data set by descending _TYPE_ value.
CHARTYPE       specifies that the _TYPE_ variable in the output data set is a character representation of the binary value of _TYPE_.
OUTPUT Statement OUT= Option
Without options, the output data set is the one in the PROC PRINT output of work.means2 shown earlier (_TYPE_ ascending from 0 to 3).
With DESCENDTYPES:
                                          min     max      sum       ave
Obs  Gender  Country  _TYPE_  _FREQ_   Salary  Salary   Salary    Salary
  1  F       AU            3      27    25185   30890   747965  27702.41
  2  F       US            3      41    25390   83505  1207900  29460.98
  3  M       AU            3      36    25745  108255  1152050  32001.39
  4  M       US            3      61    22710  243190  2033505  33336.15
  5  F                     2      68    25185   83505  1955865  28762.72
  6  M                     2      97    22710  243190  3185555  32840.77
  7          AU            1      63    25185  108255  1900015  30158.97
  8          US            1     102    22710  243190  3241405  31778.48
  9                        0     165    22710  243190  5141420  31160.12
With NWAY:
                                          min     max      sum       ave
Obs  Gender  Country  _TYPE_  _FREQ_   Salary  Salary   Salary    Salary
  1  F       AU            3      27    25185   30890   747965  27702.41
  2  F       US            3      41    25390   83505  1207900  29460.98
  3  M       AU            3      36    25745  108255  1152050  32001.39
  4  M       US            3      61    22710  243190  2033505  33336.15
With CHARTYPE:
                                          min     max      sum       ave
Obs  Gender  Country  _TYPE_  _FREQ_   Salary  Salary   Salary    Salary
  1                       00     165    22710  243190  5141420  31160.12
  2          AU           01      63    25185  108255  1900015  30158.97
  3          US           01     102    22710  243190  3241405  31778.48
  4  F                    10      68    25185   83505  1955865  28762.72
  5  M                    10      97    22710  243190  3185555  32840.77
  6  F       AU           11      27    25185   30890   747965  27702.41
  7  F       US           11      41    25390   83505  1207900  29460.98
  8  M       AU           11      36    25745  108255  1152050  32001.39
  9  M       US           11      61    22710  243190  2033505  33336.15
The SUMMARY Procedure
The SUMMARY procedure provides data summarization tools
to compute descriptive statistics for variables across all
observations and within groups of observations.
General form of the SUMMARY procedure:
PROC SUMMARY DATA=SAS-data-set <statistic(s)> <option(s)>;
VAR analysis-variable(s);
CLASS classification-variable(s);
RUN;
The SUMMARY procedure uses the same syntax as the MEANS procedure.
The only differences between the two procedures are the following:
PROC MEANS:
 The PRINT option is set by default, which displays output.
 Omitting the VAR statement analyzes all the numeric variables.
PROC SUMMARY:
 The NOPRINT option is set by default, which displays no output.
 Omitting the VAR statement produces a simple count of observations.
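Because PROC SUMMARY prints nothing by default, it is typically paired with an OUTPUT statement. The following sketch assumes the course data set orion.sales; the output data set name work.summary1 and the variable names aveSalary and maxSalary are hypothetical.

```sas
/* NWAY keeps only the rows for the full Gender*Country combination
   (the highest _TYPE_ value). No report is printed by the procedure. */
proc summary data=orion.sales nway;
   class Gender Country;
   var Salary;
   output out=work.summary1 mean=aveSalary max=maxSalary;
run;

/* Display the resulting data set explicitly. */
proc print data=work.summary1;
run;
```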
The TABULATE Procedure
The TABULATE procedure displays descriptive statistics
in tabular format.
General form of the TABULATE procedure:
PROC TABULATE DATA=SAS-data-set <options>;
CLASS classification-variable(s);
VAR analysis-variable(s);
TABLE page-expression,
row-expression,
column-expression </ option(s)>;
RUN;
Dimensional Tables
The TABULATE procedure produces one-, two-, or three-dimensional tables.
                    page dimension   row dimension   column dimension
one-dimensional                                      x
two-dimensional                      x               x
three-dimensional   x                x               x
The TABLE Statement
 The TABLE statement describes the structure of the table.
 Commas separate the dimension expressions: page expression, row expression, column expression.
 Every variable that is part of a dimension expression must be specified as a classification variable (CLASS statement) or an analysis variable (VAR statement).
 Examples:
table Country;                      (column expression only)
table Gender , Country;             (row expression, column expression)
table Job_Title , Gender , Country; (page expression, row expression, column expression)
The CLASS Statement
The CLASS statement identifies variables to be used
as classification, or grouping, variables.
General form of the CLASS statement:
CLASS classification-variable(s);
 N, the number of nonmissing values, is the default statistic for classification variables.
 Examples of classification variables: Job_Title, Gender, and Country
The VAR Statement
The VAR statement identifies the numeric variables
for which statistics are calculated.
General form of the VAR statement:
VAR analysis-variable(s);
 SUM is the default statistic for analysis variables.
 Examples of analysis variables: Salary and Bonus
One/two-Dimensional Table
proc tabulate data=orion.sales;
class Country;
table Country;
run;
+-------------------------+
|         Country         |
+------------+------------+
|     AU     |     US     |
+------------+------------+
|     N      |     N      |
+------------+------------+
|       63.00|      102.00|
+------------+------------+
proc tabulate data=orion.sales;
class Gender Country;
table Gender, Country;
run;
+--------------+-------------------------+
|              |         Country         |
|              +------------+------------+
|              |     AU     |     US     |
|              +------------+------------+
|              |     N      |     N      |
+--------------+------------+------------+
|Gender        |            |            |
+--------------+            |            |
|F             |       27.00|       41.00|
+--------------+------------+------------+
|M             |       36.00|       61.00|
+--------------+------------+------------+
Three-Dimensional Table
proc tabulate data=orion.sales;
class Job_Title Gender Country;
table Job_Title, Gender, Country;
run;
Job_Title Sales Rep. I
+--------------+-------------------------+
|              |         Country         |
|              +------------+------------+
|              |     AU     |     US     |
|              +------------+------------+
|              |     N      |     N      |
+--------------+------------+------------+
|Gender        |            |            |
+--------------+            |            |
|F             |        8.00|       13.00|
+--------------+------------+------------+
|M             |       13.00|       29.00|
+--------------+------------+------------+
Job_Title Sales Rep. II
+--------------+-------------------------+
|              |         Country         |
|              +------------+------------+
|              |     AU     |     US     |
|              +------------+------------+
|              |     N      |     N      |
+--------------+------------+------------+
|Gender        |            |            |
+--------------+            |            |
|F             |       10.00|       14.00|
+--------------+------------+------------+
|M             |        8.00|       14.00|
+--------------+------------+------------+
Dimension Expression
Elements that can be used in a dimension expression:
 classification variables
 analysis variables
 the universal class variable ALL
 keywords for statistics
Operators that can be used in a dimension expression:
 blank, which concatenates table information
 asterisk *, which crosses table information
 parentheses (), which group elements
Dimension Expression
proc tabulate data=orion.sales;
class Gender Country;
var Salary;
table Gender all, Country*Salary;
run;
+--------------+-------------------------+
|              |         Country         |
|              +------------+------------+
|              |     AU     |     US     |
|              +------------+------------+
|              |   Salary   |   Salary   |
|              +------------+------------+
|              |    Sum     |    Sum     |
+--------------+------------+------------+
|Gender        |            |            |
+--------------+            |            |
|F             |   747965.00|  1207900.00|
+--------------+------------+------------+
|M             |  1152050.00|  2033505.00|
+--------------+------------+------------+
|All           |  1900015.00|  3241405.00|
+--------------+------------+------------+
PROC TABULATE Statistics
Descriptive Statistic Keywords
CSS CV LCLM MAX
MEAN MIN MODE N NMISS
KURTOSIS RANGE SKEWNESS STDDEV STDERR
SUM SUMWGT UCLM USS VAR
PCTN REPPCTN PAGEPCTN ROWPCTN COLPCTN
PCTSUM REPPCTSUM PAGEPCTSUM ROWPCTSUM COLPCTSUM
Quantile Statistic Keywords
MEDIAN | P50 P1 P5 P10 Q1 | P25
Q3 | P75 P90 P95 P99 QRANGE
Hypothesis Testing Keywords
PROBT T
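The percentage keywords in this list can be crossed with class variables in the same way as other statistics. The sketch below assumes the course data set orion.sales and is only one possible layout.

```sas
/* For each Gender (and for ALL), show the count, the percentage of
   the row total (ROWPCTN) and the percentage of the column total
   (COLPCTN) per Country, followed by an overall count column. */
proc tabulate data=orion.sales;
   class Gender Country;
   table Gender all, Country*(n rowpctn colpctn) all*n;
run;
```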
PROC TABULATE Statistics
proc tabulate data=orion.sales;
class Gender Country;
var Salary;
table Gender all, Country*Salary*(min max);
run;
+-----------------------+---------------------------------------------------+
|                       |                      Country                      |
|                       +-------------------------+-------------------------+
|                       |           AU            |           US            |
|                       +-------------------------+-------------------------+
|                       |         Salary          |         Salary          |
|                       +------------+------------+------------+------------+
|                       |    Min     |    Max     |    Min     |    Max     |
+-----------------------+------------+------------+------------+------------+
|Gender                 |            |            |            |            |
+-----------------------+            |            |            |            |
|F                      |    25185.00|    30890.00|    25390.00|    83505.00|
+-----------------------+------------+------------+------------+------------+
|M                      |    25745.00|   108255.00|    22710.00|   243190.00|
+-----------------------+------------+------------+------------+------------+
|All                    |    25185.00|   108255.00|    22710.00|   243190.00|
+-----------------------+------------+------------+------------+------------+
Additional SAS Statements
Additional statements can be added to enhance the report.
proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States';
run;
options nodate pageno=1;
ods html file='p112d08.html';
proc tabulate data=orion.sales;
class Gender Country;
var Salary;
table Gender all, Country*Salary*(min max);
where Job_Title contains 'Rep';
label Salary='Annual Salary';
format Country $ctryfmt.;
title 'Sales Rep Tabular Report';
run;
ods html close;
Output Data Sets
PROC TABULATE produces output data sets using the following method:
PROC TABULATE DATA=SAS-data-set
OUT=SAS-data-set <options>;
The output data set contains the following variables:
 BY variables
 class variables
 the automatic variables _TYPE_, _PAGE_, and _TABLE_
 calculated statistics
PROC Statement OUT= Option
proc tabulate data=orion.sales
out=work.tabulate;
where Job_Title contains 'Rep';
class Job_Title Gender Country;
table Country;
table Gender, Country;
table Job_Title, Gender, Country;
run;
proc print data=work.tabulate;
run;
Obs  Job_Title       Gender  Country  _TYPE_  _PAGE_  _TABLE_   N
  1                          AU       001          1        1  61
  2                          US       001          1        1  98
  3                  F       AU       011          1        2  27
  4                  F       US       011          1        2  40
  5                  M       AU       011          1        2  34
  6                  M       US       011          1        2  58
  7  Sales Rep. I    F       AU       111          1        3   8
  8  Sales Rep. I    F       US       111          1        3  13
  9  Sales Rep. I    M       AU       111          1        3  13
 10  Sales Rep. I    M       US       111          1        3  29
 11  Sales Rep. II   F       AU       111          2        3  10
 12  Sales Rep. II   F       US       111          2        3  14
 13  Sales Rep. II   M       AU       111          2        3   8
 14  Sales Rep. II   M       US       111          2        3  14
 15  Sales Rep. III  F       AU       111          3        3   7
 16  Sales Rep. III  F       US       111          3        3   8
 17  Sales Rep. III  M       AU       111          3        3  10
 18  Sales Rep. III  M       US       111          3        3   9
PROC Statement OUT= Option
_TYPE_ is a character variable that shows which combination of class variables produced the summary statistics in that observation.
For example, in observations 3 to 6 of the PROC PRINT output, _TYPE_ = 011 means 0 for Job_Title, 1 for Gender, and 1 for Country.
_PAGE_ is a numeric variable that shows the logical page number that contains that observation.
In observations 7 to 18 of the PROC PRINT output, _PAGE_ = 1 is the page for Sales Rep. I, _PAGE_ = 2 the page for Sales Rep. II, and _PAGE_ = 3 the page for Sales Rep. III.
_TABLE_ is a numeric variable that shows the number of the TABLE statement that contains that observation.
In the PROC PRINT output, _TABLE_ = 1 marks the first TABLE statement (observations 1 to 2), _TABLE_ = 2 the second TABLE statement (observations 3 to 6), and _TABLE_ = 3 the third TABLE statement (observations 7 to 18).
More about PROC TABULATE:
 In the SUGI 28 proceedings:
 “The Simplicity and Power of the TABULATE Procedure”,
by Dan Bruns
http://www2.sas.com/proceedings/sugi28/197-28.pdf
 Online (from the SUGI 27 proceedings):
 “Anyone Can Learn PROC TABULATE”,
by Lauren Haworth,
http://www2.sas.com/proceedings/sugi27/p060-27.pdf
The UNIVARIATE Procedure
The UNIVARIATE procedure produces summary reports
that display descriptive statistics.
General form of the UNIVARIATE procedure:
PROC UNIVARIATE DATA=SAS-data-set;
VAR variable(s);
RUN;
The VAR statement specifies the analysis variables and their order in the results.
The UNIVARIATE Procedure
The following PROC UNIVARIATE step shows default
descriptive statistics for Salary.
Without the VAR statement, SAS will analyze all
numeric variables.
proc univariate data=orion.nonsales;
var Salary;
run;
The UNIVARIATE Procedure
The UNIVARIATE procedure can produce the following
sections of output:
 Moments
 Basic Statistical Measures
• Tests for Location
 Quantiles
 Extreme Observations
 Missing Values
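Each of the sections listed above is delivered as an ODS table, so the report can be limited to the sections of interest. The following is a minimal sketch, not part of the original course material: it assumes the sample table sashelp.class that ships with SAS, and it uses the ODS table names Moments, BasicMeasures, Quantiles, and ExtremeObs.

```sas
/* Sketch: print only selected sections of the PROC UNIVARIATE output. */
/* sashelp.class is a sample table used for illustration only; it is   */
/* not one of the course data sets.                                    */
ods select Moments BasicMeasures Quantiles ExtremeObs;
proc univariate data=sashelp.class;
   var Height;
run;
```

The ODS SELECT statement applies to the next procedure step only, so later steps print their full default output again.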
8. Data Visualization, SAS/GRAPH, Introduction to SAS Enterprise Miner
Visualization – sources
• The books of Prof. Tufte are usually cited first, e.g. Tufte E.R. (1983) The Visual Display of Quantitative Information, Graphics Press, Cheshire, Conn.
• Websites on visualization, e.g.
  - http://www.datavis.ca/gallery/ - a gallery with instructive commentary and examples, including failed or misleading charts
  - http://www.agocg.ac.uk/ - John Lansdown (1992) Aspects of Design in Computer Graphics: Some Notes – http://www.agocg.ac.uk/train/hitch/hitch.htm
• Other websites, e.g. the pages of various visualization programs and organizations
  - http://www.cybergeography.org/atlas/atlas.html or http://miner3d.com/products/gallery.html
Visualization – history
• William Playfair, 1786: the first published presentation graphics
• Dr. John Snow, 1854: cholera epidemic in London
• Florence Nightingale, 1858: causes of death during the Crimean War (1853-1856)
• Harry Beck, 1931: diagram of the London Underground
Visualization – investigative analysis
 http://www.i2inc.com/
Law Enforcement Government Commercial
» Counterterrorism
» Narcotics investigations
» Organized crime
» Intelligence analysis
» Fraud
» Missing persons
» Major investigations
» Counterfeiting
» Immigration control
» Major event security
» Money laundering
» Gang investigations
» Criminal prosecutions
» National security
» Military intelligence
» Embassy security
» Postal inspection and fraud
» Prison investigations
» Park and wildlife services
» Antitrust investigations
» Tax fraud investigations
» Customs investigations
» Forensic accounting
» Money laundering
» Insider trading violations
» Corporate security
» Anti-pirating investigations
» Entertainment copyright violations
» Competitive intelligence
» Civil lawsuits
» Fraud:
» Credit card
» Insurance
» Retail
» Health care
» Commercial
» Telephone
Visualization – investigative analysis
• Personal contacts, insurance fraud
• Money laundering, criminal gangs
Visualization – risk management
Visualization – dendrogram
Visualization – economics
Weather visualization
Choropleth map
• Municipalities with 500 or more inhabitants that have high-speed internet access, by district (%), as of 31 December 2006
Diagram map (cartodiagram)
Charts – other types
Chart scale
• Which line rises more steeply?
• The chart author's view:
  - Emphasizing the trend: positive results.
  - Suppressing the trend: negative results.
• The chart reader's view:
  - Charts without a stated scale are highly suspect.
  - Do not be swayed by the rise or fall that the chart suggests.
Cautionary examples of visualization:
Source: http://www.datavis.ca/gallery/
What Is SAS/GRAPH Software?
SAS/GRAPH software is a component of SAS software
that enables you to create the following types of graphs:
 bar, block, and pie charts
 two-dimensional scatter plots and line plots
• three-dimensional scatter and surface plots
 contour plots
 maps
 text slides
 custom graphs
Basic chart types
• Bar charts (GCHART procedure)
• Pie charts (GCHART procedure)
• Scatter and line plots (GPLOT procedure)
• Bar charts with line plot overlay (GBARLINE procedure)
Three-Dimensional Surface and Scatter Plots, Maps
• Procedures G3D, G3GRID, SGRENDER … more at support.sas.com
• Maps (GMAP procedure)
• Multiple graphs on a page (GREPLAY procedure)
Producing Bar and Pie Charts with the
GCHART Procedure
General form of the PROC GCHART statement:
PROC GCHART DATA=SAS-data-set;
Use one of these statements to specify the chart type:
HBAR chart-variable . . . </ options>;
HBAR3D chart-variable . . . </ options>;
VBAR chart-variable . . . </ options>;
VBAR3D chart-variable . . . </ options>;
PIE chart-variable . . . </ options>;
PIE3D chart-variable . . . </ options>;
Vertical/horizontal Bar Chart
• Produce a vertical/horizontal bar chart that displays the
number of employees in each department.
proc gchart data=univ.employees;
vbar dept;
run;
proc gchart data=univ.employees;
hbar dept;
run;
dept is the chart variable.
Pie Chart
• Produce a pie chart that displays the number of employees in each department.
proc gchart data=univ.employees;
pie dept;
run;
dept is the chart variable.
Character/Numeric Chart Variable
 If the chart variable is character, then a bar or slice is
created for each unique variable value.
 For numeric chart variables, the variables are assumed
to be continuous unless otherwise specified.
 The GCHART procedure creates the equivalent
of a histogram from the data.
 Intervals are automatically calculated and identified
by midpoints.
 One bar or slice is constructed for each midpoint.
Numeric Chart Variable
 Produce a vertical bar chart on the numeric variable YearsOnJob.
proc gchart data=univ.employees;
vbar YearsOnJob;
run;
YearsOnJob is the chart variable.
The DISCRETE Option
 To override the default behavior for numeric chart variables, use
the DISCRETE option in the HBAR, VBAR, or PIE statement.
 The DISCRETE option produces a bar or slice for each unique
numeric variable value; the values are no longer treated as
intervals.
proc gchart data=univ.employees;
vbar YearsOnJob / discrete;
run;
YearsOnJob is the chart variable, but the DISCRETE option modifies how SAS displays the values.
Summary Statistic
 By default, the statistic that determines the length or height
of each bar or size of pie slice is a frequency count (N).
proc gchart data=univ.employees;
vbar dept / sumvar=salary type=mean;
run;
Analysis Variable
• To override the default frequency count, you can use the following HBAR, VBAR, or PIE statement options:
SUMVAR= identifies the analysis variable to use for the sum or mean calculation.
TYPE= specifies that the height or length of the bar or size of the slice represents a mean or sum of the analysis-variable values.
• If an analysis variable is
  - specified, the default value of TYPE is SUM
  - not specified, the default value of TYPE is FREQ.
Bar Chart Using Formats
• Produce a bar chart that displays the average salary of employees in each department.
proc gchart data=univ.employees;
vbar dept / sumvar=Salary type=mean;
format Salary dollar8.;
run;
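The same pattern can be tried without the course library. The sketch below is an illustration under assumptions: it swaps in the sample table sashelp.cars (with its variables Type and MSRP) for univ.employees, and it adds a QUIT statement to end the procedure.

```sas
/* Sketch: mean MSRP by vehicle type. sashelp.cars is an assumed      */
/* stand-in for univ.employees, which is not available outside the    */
/* course environment.                                                */
proc gchart data=sashelp.cars;
   vbar Type / sumvar=MSRP type=mean;   /* bar height = mean of MSRP */
   format MSRP dollar8.;                /* response axis in dollars  */
run;
quit;
```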
More PROC GCHART options
• A gallery of possible chart types (including code!) can be found at http://support.sas.com/sassamples/graphgallery/PROC_GCHART.html.
• And a few more types…
Producing Plots with the GPLOT
Procedure
You can use the GPLOT procedure to plot one variable
against another within a set of coordinate axes.
General form of a PROC GPLOT step:
PROC GPLOT DATA=SAS-data-set;
PLOT vertical-variable*horizontal-variable </ options>;
RUN;
QUIT;
The GPLOT Procedure
Produce a plot of salary versus bonus for each employee.
proc gplot data=univ.employees;
plot Salary*Bonus;
title 'Relationship of Salary and Bonus';
run;
SYMBOL Statement
 You can use the SYMBOL statement to do the following:
 define plotting symbols
 draw lines through the data points
 specify the color of the plotting symbols and lines
• General form of the SYMBOL statement:
SYMBOLn options;
• The value of n can range from 1 to 255.
• If n is omitted, the default is 1.
• The SYMBOL statement is global and additive:
global: After being defined, the statements remain in effect until changed or until the end of the SAS session.
additive: Specifying the value of one option does not affect the values of other options.
SYMBOL Statement Options
• You can specify the plotting symbol you want with the VALUE= option in the SYMBOL statement:
VALUE=symbol | V=symbol
• Selected symbol values are shown below:
PLUS (default), DIAMOND, STAR, TRIANGLE, SQUARE, NONE (no plotting symbol)
• You can use the I= option in the SYMBOL statement to draw lines between the data points:
I=interpolation
• Selected interpolation values:
JOIN joins the points with straight lines.
SPLINE joins the points with a smooth line.
NEEDLE draws vertical lines from the points to the horizontal axis.
R overlays a simple linear regression line on the plot.
SYMBOL Statement Options
 Use a star as the plotting symbol and superimpose
a regression line on the plot.
proc gplot data=univ.employees;
plot Salary*Bonus;
symbol value=star i=r;
run;
Additional SYMBOL Statement Options
 You can enhance the appearance of the plots with the
following selected options:
WIDTH=width | W=width specifies the thickness of the line.
COLOR=color | C=color specifies the color of the line and plot symbols.
proc gplot data=univ.employees;
plot Salary*Bonus;
symbol c=green w=3;
run;
Canceling SYMBOL Statements
• You can cancel a SYMBOL statement by submitting a null SYMBOL statement:
symbol1;
• To cancel all SYMBOL statements, submit the following statement:
goptions reset=symbol;
• To cancel all previous settings (return to the defaults):
goptions reset=global;
Controlling the Axis Appearance
 You can modify the appearance of the axes that
PROC GPLOT produces with the following:
 PLOT statement options
 the LABEL statement
 the FORMAT statement
 You can use PLOT statement options to control the scaling
and color of the axes, and the color of the axis text.
 Selected PLOT statement options for axis control:
HAXIS=values scales the horizontal axis.
VAXIS=values scales the vertical axis.
CAXIS=color specifies the color of both axes.
CTEXT=color specifies the color of the text on both axes.
PLOT Statement Options, LABEL Statement
proc gplot data=univ.employees;
plot Salary*Bonus
/ vaxis=0 to 200000 by 50000
haxis=0 to 10000 by 2000
ctext=blue;
run;
proc gplot data=univ.employees;
plot Salary*Bonus /
vaxis=0 to 200000 by 50000
haxis=0 to 10000 by 2000 ctext=blue;
label Salary='Annual Salary'
Bonus='2002 Bonus';
run;
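Putting the pieces together, a complete GPLOT step might look as follows. This is a sketch under assumptions rather than the course's own example: sashelp.class stands in for univ.employees, and the axis ranges were chosen by hand to cover that table's Height and Weight values.

```sas
/* Sketch: SYMBOL settings, axis scaling and labels in one GPLOT step. */
/* sashelp.class is an assumed stand-in for univ.employees.            */
goptions reset=symbol;                /* start from default symbols    */
symbol1 value=star color=blue i=r;    /* stars plus a regression line  */
proc gplot data=sashelp.class;
   plot Weight*Height
        / vaxis=40 to 160 by 20
          haxis=50 to 75 by 5
          ctext=blue;
   label Weight='Weight (lb)'
         Height='Height (in)';
run;
quit;
```

Resetting the symbols first matters because SYMBOL statements are global and additive: settings left over from an earlier step would otherwise carry into this plot.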
GPLOT options – further possibilities
PLOT <Y variable>*<X variable> / <options>;
• Options for plotting
  • Plot options
    • LEGEND= or NOLEGEND: specifies figure legend options
    • OVERLAY: allows overlay of more than one Y variable
    • SKIPMISS: breaks the plotting line where Y values are missing
  • Appearance options
    • Axis: specifies axis label and value options
    • Symbol: specifies symbol options
    • HREF, VREF: draws vertical or horizontal reference lines on the plot
    • FRAME/FR or NOFRAME/NOFR: specifies whether or not to frame the plot
    • CAXIS/CA, CFRAME/CFR, CHREF/CH, CVREF/CV, CTEXT/C: specifies the colors used for the axes, frame, text, or reference lines
GPLOT options – further possibilities
• AXIS<1..99> <options>;
  • Label option
    • Angle/a=degrees (0-359)
    • Color/c=text color
    • Font/f=font
    • Height/h=text height (default=1)
    • Justify=(left/center/right)
    • Label="text string"
  • Order option
    • Order=(a to b by c): major tick marks will show up at intervals based on c.
    • Example: order=(0 to 3 by 1);
  • Value option
    • value=("" "" ""): applies a text label to each major tick.
    • Example: value=("Start" "Middle" "End")
• axis1 label=(a=90 c=black f="arial" h=1.2 "time" a=90 c=black f="arial" h=1.0 "hours");
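AXIS definitions take effect only when a PLOT statement refers to them. The sketch below shows that link; the data set sashelp.class and the label texts are assumptions made for illustration.

```sas
/* Sketch: AXIS statements are global definitions that a PLOT          */
/* statement picks up through VAXIS= and HAXIS=.                       */
axis1 label=(angle=90 "Weight (lb)") order=(40 to 160 by 20);
axis2 label=("Height (in)") order=(50 to 75 by 5);
proc gplot data=sashelp.class;
   plot Weight*Height / vaxis=axis1 haxis=axis2;
run;
quit;
```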
More PROC GPLOT options
A gallery of possible plot types (including code!) can be found at http://support.sas.com/sassamples/graphgallery/PROC_GPLOT.html. Further tutorials and examples, including code, can be found at http://ebookbrowse.com/sas-gplot-slides-1-26-2011-ppt-d138883835.
ODS Graphics
• Recall that the Output Delivery System (ODS), added in Version 8, manages all tabular output created by procedures and enables you to display it in a variety of destinations, such as HTML and RTF. SAS 9.1 introduced an extension to ODS, referred to as ODS Graphics, which, together with corresponding modifications to statistical procedures, equips these procedures to create graphics as automatically as tables. This eliminates the need for additional programming.
For more, see:
1) http://support.sas.com/rnd/base/topics/statgraph/
2) http://support.sas.com/resources/papers/76822_ODSGraph2011.pdf
3) http://susanslaughter.files.wordpress.com/2011/06/svsug-2011-handout_for_getting_started_with_ods_statistical_graphics.pdf
4) http://www2.sas.com/proceedings/sugi31/192-31.pdf
• The following procedures can create graphical output automatically (valid for SAS 9.3):
ODS Graphics
ods listing style=statistical;
ods graphics on;
proc lifetest data=grouped plots=survival(cb=hw test atrisk=0 to 1500 by 250);
time Time*Cens(0);
strata Treatment;
by Site;
run;
ods listing style=statistical;
ods graphics on;
proc reg data=sashelp.Class;
model Weight=Height;
quit;
ODS Graphics
data Data1;
input disease n age;
datalines;
0 14 25
0 20 35
0 19 45
7 18 55
6 12 65
17 17 75
;
run;
ods graphics on;
proc logistic data=Data1 plots(only)=(roc(id=obs) effect);
model disease/n=age / scale=none clparm=wald clodds=pl rsquare;
units age=10;
run;
ods graphics off;
For more, see the PLOTS= options of PROC LOGISTIC:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_logistic_sect004.htm#statug.logistic.logiploteffect
or:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_logistic_sect050.htm
ODS Graphics – what's new in SAS 9.3
For more, see:
http://support.sas.com/documentation/cdl/en/grstatproc/65235/HTML/default/viewer.htm#grstatprocwhatsnew93.htm
 inclusion with Base SAS and name change
 changes to the default ODS output
 new plot statements are available for the SGPLOT and SGPANEL procedures.
 new options and enhancements are available for the PROC SGPLOT, PROC SGPANEL, and
PROC SGSCATTER statements.
 new options and enhancements are available for the existing plot statements in the SGPLOT
and SGPANEL procedures.
 new options and enhancements are available for the axis statements in the SGPLOT and
SGPANEL procedures.
 new options and enhancements are available for the SGRENDER procedure.
 enhancements are available for the SGDESIGN procedure.
 a new attribute map feature provides a mechanism for controlling the visual attributes that are
applied to specific group data values in your graphs.
 a new annotation feature provides a mechanism for adding shapes, images, and annotations to
graph output.
ODS Graphics – components of a graph
1. Graph
a visual representation of data. The graph can contain titles,
footnotes, legends, and one or more cells that have one or more
plots.
2. Cell
a distinct rectangular subregion of a graph that can contain plots,
text, and legends.
3. Title
descriptive text that is displayed above any cell or plot areas in
the graph.
4. Plot
a visual representation of data such as a scatter plot, a series line,
a bar chart, or a histogram. Multiple plots can be overlaid in a cell
to create a graph.
5. Legend
refers collectively to the legend border, one or more legend
entries (where each entry has a symbol and a corresponding
label) and an optional legend title.
6. Axis
refers collectively to the axis line, the major and minor
tick marks, the major tick mark values, and the axis
label. Each cell has a set of axes that are shared by all the
plots in the cell. In multi-cell graphs, the columns and rows of
cells can share common axes if the cells have the same data
type.
7. Footnote
descriptive text that is displayed below any cell or plot areas in
the graph.
PROC SGPLOT
proc sgplot data=sashelp.stocks (where=(date
>= "01jan2000"d and stock = "IBM"));
title "Stock Trend";
series x=date y=close;
series x=date y=low;
series x=date y=high;
run;
proc sgplot data=sashelp.classfit;
title "Fit and Confidence Band from Precomputed Data";
band x=height lower=lower upper=upper / legendlabel="95% CLI" name="band1";
band x=height lower=lowermean upper=uppermean / fillattrs=GraphConfidence2
legendlabel="95% CLM" name="band2";
scatter x=height y=weight;
series x=height y=predict / lineattrs=GraphPrediction legendlabel="Predicted Fit"
name="series";
keylegend "series" "band1" "band2" / location=inside
position=bottomright;
run;
proc sgplot data=sashelp.heart;
title "Cholesterol Distribution";
histogram cholesterol;
density cholesterol;
density cholesterol / type=kernel;
keylegend / location=inside
position=topright;
run;
proc sgplot data=sashelp.stocks
(where=(date >= "01jan2000"d and date
<= "01jan2001"d and stock = "IBM"));
title "Stock Volume vs. Close";
vbar date / response=volume;
vline date / response=close y2axis;
run;
PROC SGPANEL
For more, see:
http://support.sas.com/documentation/cdl/en/grstatproc/65235/HTML/default/viewer.htm#p00mgdlxbij4v3n0zewfb9cpfxu1.htm
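As a rough illustration of what PROC SGPANEL adds over PROC SGPLOT, the sketch below draws the cholesterol histogram from the SGPLOT examples once per level of a classification variable. The choice of sashelp.heart and of Sex as the panel variable is an assumption made for this example.

```sas
/* Sketch: one histogram cell per level of Sex, with shared axes.      */
proc sgpanel data=sashelp.heart;
   panelby sex / columns=2;     /* one cell per value of Sex          */
   histogram cholesterol;
   density cholesterol;         /* normal density overlay per cell    */
run;
```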
SAS macros – M. Friendly
• Michael Friendly, York University:
SAS Graphic Programs and Macros
 Univariate displays
 Bivariate displays
 Multivariate displays
 Cluster analysis
 Maps
http://www.datavis.ca/sasmac/
1.1 Introduction to SAS Enterprise Miner
SAS EM – brief description
For more, see:
http://support.sas.com/documentation/onlinedoc/miner/
http://support.sas.com/documentation/cdl/en/emgsj/65354/PDF/default/emgsj.pdf
 In SAS Enterprise Miner, the data mining process is driven by a process flow
diagram that you create by dragging nodes from a toolbar that is organized by SEMMA
categories and dropping them onto a diagram workspace.
 The graphical user interface (GUI) is designed in such a way that the business
analyst who has little statistical expertise can navigate through the data mining
methodology, and the quantitative expert can explore each node in depth to fine-tune
the analytical process.
 SAS Enterprise Miner automates the scoring process and supplies complete scoring
code for all stages of model development in SAS, C, Java, and PMML. The scoring code
can be deployed in a variety of real-time or batch environments within SAS, on the
Web, or directly in relational databases.
Analysis Element Organization
[Figure: organization of analysis elements in SAS Enterprise Miner. Labels: Projects; Libraries and Diagrams; Process Flows; Nodes; Datasources; Reports; Workspaces; System; EMWS, EMWS1, …; em_dgraph; IDs; Part, …; My Library.]
SAS Enterprise Miner – Interface Tour
• Menu bar and shortcut buttons
• Project panel
• Properties panel
• Help panel
• Diagram workspace
• Process flow
• Node
• SEMMA tools palette
SEMMA – Sample Tab
• Append
• Data Partition
• File Import
• Filter
• Input Data
• Merge
• Sample
• Time Series
SEMMA – Explore Tab
• Association
• Cluster
• DMDB
• Graph Explore
• Market Basket
• Multiplot
• Path Analysis
• SOM/Kohonen
• StatExplore
• Variable Clustering
• Variable Selection
SEMMA – Modify Tab
• Drop
• Impute
• Interactive Binning
• Principal Components
• Replacement
• Rules Builder
• Transform Variables
SEMMA – Model Tab
• AutoNeural
• Decision Tree
• Dmine Regression
• DMNeural
• Ensemble
• Gradient Boosting
• Least Angle Regression
• MBR
• Model Import
• Neural Network
• Partial Least Squares
• Regression
• Rule Induction
• Support Vector Machines
• Two Stage
SEMMA – Assess Tab
• Cutoff
• Decisions
• Model Comparison
• Score
• Segment Profile
Beyond SEMMA – Utility Tab
• Control Point
• End Groups
• Ext Demo
• Metadata
• Reporter
• SAS Code
• Start Groups
Credit Scoring Tab (Optional)
• Credit Exchange
• Interactive Grouping
• Reject Inference
• Scorecard
9. Regression. Logistic Regression
[Figure: scatter plot showing the unknown relationship $Y = b_0 + b_1 X$, the regression best-fit line $\hat{Y} = \hat{b}_0 + \hat{b}_1 X$, and a residual $Y - \hat{Y}$.]
Overview
                   Type of Predictors
Type of Response   Categorical              Continuous            Categorical and Continuous
Continuous         Analysis of Variance     Linear Regression     Analysis of Covariance (Regression with dummy variables)
Categorical        Logistic Regression or   Logistic Regression   Logistic Regression
                   Contingency Tables
Overview of SAS procedures for regression
• SAS/STAT: CATMOD, GAM, GENMOD, GLIMMIX, GLM, LIFEREG, LOESS, LOGISTIC, MIXED, NLIN, NLMIXED, ORTHOREG, PHREG, PLS, PROBIT, REG, ROBUSTREG, RSREG, SURVEYLOGISTIC, SURVEYPHREG, SURVEYREG, TRANSREG.
• SAS/ETS: AUTOREG, COUNTREG, MODEL, PANEL, PDLREG, SYSLIN.
• Slide callouts: "classical" linear regression; logistic regression.
Simple Linear Regression Model
[Figure: regression line relating Weight (lb.) and Height (in.), with intercept b0 and a slope of b1 units per 1 unit.]
The CORR Procedure
• Correlation analysis is closely related to regression analysis.
• If for no other reason, it is worth examining the data with the CORR procedure as part of exploratory analysis.
General form of the CORR procedure:
PROC CORR DATA=SAS-data-set <options>;
VAR variables;
WITH variables;
ID variables;
RUN;
The CORR Procedure
 Scatter plots and scatter plot matrices are available through ODS
Graphics.
 ID statement enables you to specify additional variables to
identify observations in scatter plots and scatter plot matrices.
 Selected options:
 PLOTS <(ONLY)> <= plot-request>
 PLOTS <(ONLY)> <= (plot-request < plot-request >) >
 ALL
 MATRIX <( matrix-options )>
 SCATTER <( scatter-options )>
 HIST | HISTOGRAM
 NVAR=ALL | n
 ELLIPSE=PREDICTION | CONFIDENCE | NO
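The options above can be combined as in the following sketch. It is only an illustration: the sample table sashelp.class and the three variables are assumptions, not course data.

```sas
/* Sketch: correlation table plus a scatter plot matrix with           */
/* histograms on the diagonal, produced through ODS Graphics.          */
ods graphics on;
proc corr data=sashelp.class plots=matrix(histogram);
   var Height Weight Age;
run;
```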
PROC CORR – example output
Simple Linear Regression Model
$Y = b_0 + b_1 X_1 + \varepsilon$
[Figure: scatter plot showing the unknown relationship $Y = b_0 + b_1 X$, the regression best-fit line $\hat{Y} = \hat{b}_0 + \hat{b}_1 X$, and a residual $Y - \hat{Y}$.]
Assumptions:
• The mean of the Ys is accurately modeled by a linear function of the Xs.
• The random error term, $\varepsilon$, is assumed to have a normal distribution with a mean of zero.
• The random error term, $\varepsilon$, is assumed to have a constant variance, $\sigma^2$.
• The errors are independent.
Violation of Model Assumptions
 Normality – does not affect the parameter estimates, but it
affects the test results.
 Constant Variance – does not affect the parameter estimates,
but the standard errors are compromised.
 Independent observations – does not affect the parameter
estimates, but the standard errors are compromised.
 Linear in the parameters – indicates a misspecified model,
and therefore the results are not meaningful.
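The assumptions are usually checked graphically. A minimal sketch, assuming the sample table sashelp.class and the ODS Graphics plot requests DIAGNOSTICS and RESIDUALS of PROC REG:

```sas
/* Sketch: residual diagnostics for a simple linear regression.        */
/* The diagnostics panel includes residual-versus-predicted plots      */
/* (constant variance, linearity) and a Q-Q plot (normality).          */
ods graphics on;
proc reg data=sashelp.class plots(only)=(diagnostics residuals);
   model Weight = Height;
run;
quit;
```

Independence of the observations cannot be read off these plots alone; it follows mainly from how the data were collected.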
Multiple Linear Regression with Two Variables
Consider the two-variable model
$Y = b_0 + b_1 X_1 + b_2 X_2 + \varepsilon$
where
Y is the dependent variable.
X1 and X2 are the independent or predictor variables.
$\varepsilon$ is the error term.
b0, b1, and b2 are unknown parameters.
[Figures: two three-dimensional scatter plots of Y against X1 and X2, one showing no relationship and one showing a relationship.]
The Multiple Linear Regression Model
In general, you model the dependent variable Y as a linear function of k independent variables (the Xs) as
$Y = b_0 + b_1 X_1 + \dots + b_k X_k + \varepsilon$
Model hypothesis test:
Null hypothesis:
• The regression model does not fit the data better than the baseline model.
• $b_1 = b_2 = \dots = b_k = 0$
Alternative hypothesis:
• The regression model does fit the data better than the baseline model.
• Not all $b_i$ equal zero.
Analytical Analysis vs. Prediction
Prediction:
• The terms in the model, the values of their coefficients, and their statistical significance are of secondary importance.
• The focus is on producing a model that is the best at predicting future values of Y as a function of the Xs. The predicted value of Y is given by
$\hat{Y} = \hat{b}_0 + \hat{b}_1 X_1 + \dots + \hat{b}_k X_k$
Analytical analysis:
• The focus is on understanding the relationship between the dependent variable and the independent variables.
• Consequently, the statistical significance of the coefficients is important, as well as the magnitudes and signs of the coefficients.
Model Selection Options
The SELECTION= option in the MODEL statement of PROC REG
supports these model selection techniques:
All-possible regressions ranked using
 RSQUARE, ADJRSQ or CP
Stepwise selection methods
 STEPWISE, FORWARD, or BACKWARD
SELECTION=NONE is the default.
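Both families of techniques can be requested in one PROC REG step by labeling the MODEL statements. The sketch below is illustrative only: the table sashelp.cars, the candidate regressors, the labels ranked and stepw, and the 0.05 significance levels are all assumptions.

```sas
/* Sketch: two selection techniques on the same candidate regressors.  */
proc reg data=sashelp.cars;
   /* all-possible regressions: the 5 best models ranked by adjusted   */
   /* R-square, with Mallows' Cp printed alongside                     */
   ranked: model MSRP = Horsepower EngineSize Weight MPG_City Length
           / selection=adjrsq cp best=5;
   /* stepwise selection with explicit entry and stay levels           */
   stepw:  model MSRP = Horsepower EngineSize Weight MPG_City Length
           / selection=stepwise slentry=0.05 slstay=0.05;
run;
quit;
```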
Model Selection Statistics
• Coefficient of determination ($R^2$):
$R^2 = 1 - \frac{SSE}{SST}$
• Adjusted coefficient of determination (adjusted $R^2$):
$AdjR^2 = 1 - \frac{SSE/df_E}{SST/df_T} = 1 - \frac{SSE/(n-p)}{SST/(n-1)} = 1 - \frac{(n-1)(1-R^2)}{n-p}$
• Mallows' $C_p$ statistic
• Akaike's information criterion (AIC)
• Schwarz's Bayesian criterion (SBC)
where $SSR = \sum (\hat{y} - \bar{y})^2$ and $SSE = \sum (y - \hat{y})^2$.
[Figure: two plots of the dependent variable against the independent variable (x), illustrating SSR and SSE relative to the line labeled "Population mean".]
Mallows’ Cp Statistic
 p is the number of parameters in the model being
evaluated, including the intercept.
 n is the total number of observations.
 Models with Cp > p are underspecified.
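• Mallows recommends choosing the first model where $C_p$ approaches p.
$C_p = p + \frac{(MSE_p - MSE_{full})(n - p)}{MSE_{full}}$
where $MSE_p$ is the mean squared error of the model with p parameters and $MSE_{full}$ is the mean squared error of the full model.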
Information Criteria
 Akaike’s information criteria (AIC)
 Schwarz’s Bayesian criteria (SBC)
Smaller values indicate a better model.
$AIC = n \ln\left(\frac{SSE}{n}\right) + 2p$
$SBC = n \ln\left(\frac{SSE}{n}\right) + p \ln(n)$
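A small worked example may help to show how the two criteria can disagree. The numbers are invented for illustration: n = 100 observations, a model A with p = 3 parameters and SSE = 250, and a larger model B with p = 4 parameters and SSE = 245.

```latex
\begin{align*}
AIC_A &= 100 \ln\!\left(\tfrac{250}{100}\right) + 2 \cdot 3 \approx 91.63 + 6 = 97.63 \\
AIC_B &= 100 \ln\!\left(\tfrac{245}{100}\right) + 2 \cdot 4 \approx 89.61 + 8 = 97.61 \\
SBC_A &= 100 \ln\!\left(\tfrac{250}{100}\right) + 3 \ln(100) \approx 91.63 + 13.82 = 105.44 \\
SBC_B &= 100 \ln\!\left(\tfrac{245}{100}\right) + 4 \ln(100) \approx 89.61 + 18.42 = 108.03
\end{align*}
```

Here AIC marginally prefers the larger model B, while SBC clearly prefers the smaller model A, because its penalty per parameter, ln(100) ≈ 4.61, is larger than the AIC penalty of 2.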
Select Candidate Models
Candidate models can be identified by using
 your subject-matter knowledge
 information gathered from data exploration
 automatic selection criteria available in the
REG procedure
 all possible models ranked by
 R2, adjusted R2, or Mallows’ Cp
 stepwise selection
  - forward, backward, stepwise, MAXR, or MINR
 other statistics such as AIC and SBC
 residual plots to evaluate model fit and model assumptions.
Conservative Significance Levels
Sample Size
Evidence 30 50 100 1000
Weak .076 .053 .032 .009
Positive .028 .019 .010 .003
Strong .005 .003 .001 .0003
Very Strong .001 .0005 .0001 .00004
The REG Procedure
General form of the REG procedure:
PROC REG DATA=SAS-data-set <options>;
MODEL dependent(s)=regressor(s) </ options>;
RUN;
Description and a simple example:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_reg_sect003.htm
Influential Observations versus Outliers
[Figure: two scatter plots of Price against Horsepower, one marking an outlier and one marking an influential observation.]
Studentized Residual
Studentized residuals (SR) are obtained by dividing the
residuals by their standard errors.
Suggested cutoffs are as follows:
 |SR| > 2 for data sets with a relatively small number
of observations
 |SR| > 3 for data sets with a relatively large number
of observations
Cook’s D Statistic
Cook’s D statistic is a measure of the simultaneous change in the
parameter estimates when an observation is deleted from the
analysis.
A suggested cutoff is $D_i > \frac{4}{n}$, where n is the sample size.
If the above condition is true, then the observation might have an adverse effect on the analysis.
DFFITS
$DFFITS_i$ measures the impact that the ith observation has on the predicted value.
$DFFITS_i = \frac{\hat{Y}_i - \hat{Y}_{(i)}}{s(\hat{Y}_i)}$
where
$\hat{Y}_i$ is the ith predicted value.
$\hat{Y}_{(i)}$ is the ith predicted value when the ith observation is deleted.
$s(\hat{Y}_i)$ is the standard error of the ith predicted value.
Identifying Influential Observations – DFBETAs
$DFBETA_{ij}$ measures the change in each parameter estimate when an observation is deleted from the model.
$DFBETA_{ij} = \frac{b_j - b_{j(i)}}{\hat{\sigma}(b_j)}$
• $b_j$ is the parameter estimate for the jth independent variable
• $b_{j(i)}$ is the parameter estimate for the jth independent variable with the ith observation deleted from the analysis
• $\hat{\sigma}(b_j)$ is the standard error of the jth parameter estimate when all observations are included in the analysis
Identifying Influential Observations – The Covariance Ratio
$COVRATIO_i$ measures the change in the precision of the parameter estimates when an observation is deleted from the model.
$COVRATIO_i = \frac{\det\left( s_{(i)}^2 \left( X_{(i)}' X_{(i)} \right)^{-1} \right)}{\det\left( s^2 \left( X'X \right)^{-1} \right)}$
Identifying Influential Observations – Summary of Suggested Cutoffs
Influential Statistics    Cutoff Values
RSTUDENT residuals        |RSTUDENT| > 2
LEVERAGE                  LEVERAGE > $\frac{2p}{n}$
Cook's D                  CooksD > $\frac{4}{n}$
DFFITS                    |DFFITS| > $2\sqrt{\frac{p}{n}}$
DFBETAS                   |DFBETAS| > $\frac{2}{\sqrt{n}}$
COVRATIO                  COVRATIO < $1 - \frac{3p}{n}$ or COVRATIO > $1 + \frac{3p}{n}$
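The cutoffs above can be applied programmatically. The following is a sketch under stated assumptions, not the course's own code: it uses the sample table sashelp.class, a model with p = 2 parameters (intercept and Height), and hypothetical names for the output data sets (work.influence, work.flagged) and for the flag variables.

```sas
/* Sketch: compute influence statistics with the OUTPUT statement of   */
/* PROC REG and keep the observations that exceed a suggested cutoff.  */
proc reg data=sashelp.class noprint;
   model Weight = Height;
   output out=work.influence
          rstudent=rstud h=lev cookd=cooksd dffits=dfts covratio=covr;
run;
quit;

data work.flagged;
   set work.influence nobs=n;           /* n = number of observations */
   p = 2;                               /* intercept + one regressor  */
   flag_rstudent = (abs(rstud) > 2);
   flag_leverage = (lev > 2*p/n);
   flag_cooksd   = (cooksd > 4/n);
   flag_dffits   = (abs(dfts) > 2*sqrt(p/n));
   flag_covratio = (covr < 1 - 3*p/n or covr > 1 + 3*p/n);
   if sum(of flag_:) > 0;               /* keep flagged rows only     */
run;

proc print data=work.flagged;
   var Name Weight Height rstud lev cooksd dfts covr;
run;
```

DFBETAS are not available through the OUTPUT statement; they are printed when the INFLUENCE option is added to the MODEL statement.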
Linear regression – PROC REG
PROC REG <options> ;
<label:>MODEL dependents=<regressors> </ options> ;
BY variables ;
FREQ variable ;
ID variables ;
VAR variables ;
WEIGHT variable ;
ADD variables ;
DELETE variables ;
<label:>MTEST <equation, ...,equation> </ options> ;
OUTPUT <OUT=SAS-data-set>< keyword=names> <...keyword=names> ;
PAINT <condition | ALLOBS> </ options > | < STATUS | UNDO> ;
PLOT <yvariable*xvariable> <=symbol> <...yvariable*xvariable> <=symbol> </ options> ;
PRINT <options> <ANOVA> <MODELDATA> ;
REFIT ;
RESTRICT equation, ...,equation ;
REWEIGHT <condition | ALLOBS> </ options > | < STATUS | UNDO> ;
<label:>TEST equation,<,...,equation> </ option> ;
More at: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_reg_sect001.htm
Modeling a categorical response
• Will a default occur?
St. X Y
1 2.6 1
2 1.4 0
3 .65 1
4 4.1 1
5 .25 0
6 1.9 0
• "Classical" regression, $Y_i = b_0 + b_1 X_{1i} + \varepsilon_i$, is not suitable here; logistic regression is used instead.
• If the response variable is categorical, then how do you code the response numerically?
• If the response is coded (1=Yes and 0=No) and your regression equation predicts 0.5 or 1.1 or -0.4, what does that mean practically?
• If there are only two (or a few) possible response levels, is it reasonable to assume constant variance and normality?
Generalized Linear Models
• The distribution of the observations can come from the exponential family of distributions.
• The variance of the response variable is a specified function of its mean.
• $X\beta$ is fit to a function of E(y) (called a link function) suggested by the distribution of the observations: $g(E(y)) = g(\mu) = X\beta$
$$g(E(y_i)) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}$$
Here g is the link function.
Examples of Generalized Linear Models
* Models often use the LOG link in practice.
Types of Logistic Regression
Logit Transformation
Logistic regression models transformed probabilities, called logits*:
$$\operatorname{logit}(p_i) = \ln\left( \frac{p_i}{1 - p_i} \right)$$
where
i indexes all cases (observations)
$p_i$ is the probability that the event (a default, for example) occurs in the ith case
ln is the natural log (to the base e).
* The logit is the natural log of the odds.
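As a small illustration of the transformation, the sketch below implements the logit and its inverse. The function names are hypothetical, and the input check is an added assumption, since the logit is undefined at exactly 0 and 1.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p); defined only for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse transformation: maps a logit score back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


for p in (0.1, 0.5, 0.75, 0.9):
    z = logit(p)
    print(f"p = {p:.2f}  logit = {z:+.4f}  back-transformed = {inv_logit(z):.2f}")
```

A probability of 0.5 corresponds to a logit of 0, and probabilities symmetric around 0.5 give logits of equal size and opposite sign.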
Logit link function
[Figure: the logit transform maps the probability axis (0 to 1) to the logit axis (shown from -5 to 5).]
The logit link function transforms probabilities (between 0 and 1) to logit scores (between −∞ and +∞).
Logistic Regression Model
$$\operatorname{logit}(p_i) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$$
where
• $\operatorname{logit}(p_i)$ = logit of the probability of the event
• $\beta_0$ = intercept of the regression equation
• $\beta_k$ = parameter estimate of the kth predictor variable
$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)}}$$
The Fitted Surface, discrimination
[Figure: fitted surfaces of logit(p) and of p over the inputs x1 and x2, and the resulting split of the (x1, x2) plane into an "above" and a "below" region.]
Parameter estimation
• The maximum likelihood method leads to a system of nonlinear equations.
• This system is solved with the Newton-Raphson iterative method.
• More at:
  - http://www.stat.cmu.edu/~cshalizi/402/lectures/14-logistic-regression/lecture-14.pdf
  - http://czep.net/stat/mlelr.pdf
  - http://www.stat.psu.edu/~jiali/course/stat597e/notes2/logit.pdf
Maximum likelihood estimate (MLE)
Source: http://www2.imperial.ac.uk/~abellott/Credit%20Scoring%202.pdf
Maximum likelihood estimate
[Figure: log-likelihood surface over the parameter estimates $\hat{\beta}_1$ and $\hat{\beta}_2$.]
Standard errors on the MLE
MLE – hypothesis testing
MLE – confidence intervals
Likelihood Ratio Test
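Newton-Raphson method
• Basic principle of the method:
$$p(x, \beta) = \frac{1}{1 + e^{-\beta^T x}}$$
$$L(\beta) = \sum_{i=1}^{n} \left[ y_i \beta^T x_i - \log\left( 1 + e^{\beta^T x_i} \right) \right]$$
$$\beta^{new} = \beta^{old} - \left( \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial L(\beta)}{\partial \beta}$$
• Matrix form:
$$\beta^{new} = (X^T W X)^{-1} X^T W \left( X \beta^{old} + W^{-1} (y - p) \right)$$
  - y ... vector of observations of the response variable, $y_i$
  - X ... design matrix, of type $n \times p$
  - p ... vector of probabilities, $p(x_i, \beta^{old})$
  - W ... $n \times n$ diagonal matrix of weights, with diagonal elements $p(x_i, \beta^{old})(1 - p(x_i, \beta^{old}))$
• This is a numerical iterative method, so it is necessary to check that the convergence criterion was met (that the method has converged to the optimal solution).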
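The iteration can be sketched for the simplest case of an intercept and a single predictor, where $X^T W X$ is a 2 x 2 matrix that can be inverted by hand. The data are the six (X, Y) rows from the default table earlier in this section. The function names, the zero starting values and the stopping tolerance are assumptions made for illustration, and the update is written in the equivalent form $\beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p)$.

```python
import math

# The six (X, Y) observations from the default table earlier in the section.
x = [2.6, 1.4, 0.65, 4.1, 0.25, 1.9]
y = [1, 0, 1, 1, 0, 0]


def prob(b0, b1, xi):
    """p(x, beta) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))


def newton_raphson(x, y, tol=1e-10, max_iter=50):
    """Fit intercept b0 and slope b1; returns (b0, b1, iterations, converged)."""
    b0, b1 = 0.0, 0.0  # assumed starting values
    for iteration in range(1, max_iter + 1):
        p = [prob(b0, b1, xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]  # diagonal of W
        # Score vector X^T (y - p)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix X^T W X (symmetric 2 x 2)
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: (X^T W X)^{-1} X^T (y - p)
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (-h01 * g0 + h00 * g1) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            return b0, b1, iteration, True
    return b0, b1, max_iter, False


b0, b1, n_iter, converged = newton_raphson(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, iterations = {n_iter}, converged = {converged}")
```

The `converged` flag corresponds to the convergence check that the last bullet above asks for; a run that stops only because `max_iter` was reached should not be trusted.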
Advantages of logistic regression
• Few parameters
• Easy to use and to interpret
• Discrete predictors are easy to include
• Works well even on data that differ considerably from Gaussian mixtures
• Above all, it usually works well when adequate attention is paid to data preparation
  - practical experience: in four cases out of five, logistic regression on the data I analyze is either the best method or roughly as good as the other methods.
Interpretation, differences from OLS
• Regression coefficients $\beta$: a positive coefficient means that an increase in the variable raises the odds of membership in the group coded 1; a negative coefficient indicates a decrease in these odds
• $\exp(\beta_i)$ is often used: it is the factor by which the odds p/(1–p) are multiplied when $x_i$ increases by one unit and the other $x_k$ stay fixed
• Beware of the different scales in which the $x_i$ may be measured
• Instead of the F-test of overall validity, there is now a chi-square test for the same purpose
• Instead of the t-test of the significance of variables in the model, there are Wald statistics; they are essentially the same thing and are read in the same way
• Instead of $R^2$ there are only pseudo-$R^2$ measures
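The interpretation of exp(b) as a multiplier of the odds can be checked numerically. The coefficients below are hypothetical values chosen for illustration, not estimates from any model in these slides.

```python
import math

# Hypothetical coefficients of a one-predictor logistic model.
b0, b1 = -1.0, 0.4


def odds(x):
    """Odds p / (1 - p) implied by the model at predictor value x."""
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
    return p / (1.0 - p)


# A one-unit increase in x multiplies the odds by exp(b1), wherever it starts.
ratio_at_2 = odds(3.0) / odds(2.0)
ratio_at_7 = odds(8.0) / odds(7.0)

# A change of scale changes the multiplier: 10 units give exp(10 * b1).
ratio_10_units = odds(12.0) / odds(2.0)

print(ratio_at_2, ratio_at_7, math.exp(b1))
print(ratio_10_units, math.exp(10 * b1))
```

The last line illustrates the warning about scales: the same variable measured in units ten times larger has a coefficient ten times larger and a very different exp(b).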
Example
Source: http://www2.imperial.ac.uk/~abellott/Credit%20Scoring%202.pdf
Logistic Regression with Sequential Steps
• Forward regression
  - starts with a baseline model (intercept-only)
  - searches all variables and finds the strongest one
  - keeps adding variables in order of strength until no significant improvement is achieved in the model.
• Backwards regression
  - starts with a full model using all variables
  - removes the weakest input variable provided that taking it out does not cause a significant reduction in the fit of the model
  - continues removing the weakest input variables in order until a removal would cause a significant reduction in the fit of the model, at which point the algorithm stops.
• Stepwise regression
  - is a combination of forward and backward regression
  - begins the same way as forward
  - re-evaluates the statistical significance of all included variables after each new variable is added.
    · If a previously included variable becomes statistically insignificant when a new variable is added, that variable is then removed.
    · The algorithm stops when no more variables can be found that add significantly to the fit of the model and all variables remaining in the model are statistically significant.
Scalability in PROC LOGISTIC
[Figure: computation time versus number of variables (25 to 200) for all-subsets and stepwise selection.]
The Logistic Regression Task
[Screenshot of the task dialog with two callouts:]
• Specify the level of the response variable that you want to model. For example, do you want to model the probability of a 0 or a 1?
• Choice of the link function.
LOGISTIC Procedure
General form of the LOGISTIC procedure:
PROC LOGISTIC <options>;
  CLASS variable </v-options>;
  MODEL response = <effects> </options>;
  ODDSRATIO <'label'> variable </ options>;
  ROC <'label'> <specification> </ options>;
  ROCCONTRAST <'label'> <contrast> </ options>;
  SCORE <options>;
  UNITS predictor1=list1 </option>;
  OUTPUT <OUT=SAS-data-set> <keyword=name ... keyword=name> </option>;
RUN;
More, for example, at:
http://www.okstate.edu/sas/v8/saspdf/stat/chap39.pdf
http://www.okstate.edu/sas/v8/sashtml/onldoc.htm
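LOGISTIC Procedure - examples
/* Fit a model on the development sample and write the report to HTML. */
ods html file="logistic_vyvoj.html" style=sasweb;
/* DESCENDING reverses the response order, so the higher level is modeled. */
proc logistic data=dm1.data_vyvoj descending;
  model good4=goods_type_w phone_w a_uver_w
        fam_state_w income_w credit_w vek_w
  ;
run;
ods html close;

/* Forward selection with classification inputs; OUTEST= stores the */
/* estimates and SCORE writes the scored data set.                  */
proc logistic data=dm1.score_base
  outest=work.model_def;
  CLASS AGE_d EDUCATION_d CAR_AGE_d / param=glm;
  MODEL def_bad = AGE_d EDUCATION_d CAR_AGE_d
        total_income_d(init_pay_by_INCOME_d)
        / SELECTION=FORWARD HIERARCHY=MULTIPLECLASS;
  score out=work.tab_scored_def;
run;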
LOGISTIC Procedure - examples
/* Stepwise selection on new clients only, with interactions of the */
/* four classification inputs up to order 4.                        */
proc logistic
  data=dm1.score_base outest=work.model_def namelen=200;
  where client_type="1-Novy";
  CLASS sex_k child_num_k fam_state_k age_k;
  MODEL def_bad = AGE_w EDUCATION_w AGE_w*EDUCATION_w
        sex_k|child_num_k|fam_state_k|age_k@4
        /selection=stepwise slentry=0.6 slstay=0.1 details corrb
  ;
run;

/* Scoring with stored parameters: INEST= supplies the initial estimates */
/* and MAXITER=0 keeps them unchanged, so OUTPUT only scores the data.   */
proc logistic
  data=dm1.score_base inest=hc.modelSU namelen=200;
  CLASS sex_k child_num_k fam_state_k age_k;
  MODEL def_bad = AGE_w EDUCATION_w AGE_w*EDUCATION_w
        sex_k|child_num_k|fam_state_k|age_k@4
        /selection=none maxiter=0;
  output out=dm1.data_all_scr (keep=id_credit score def_bad compress=yes)
         prob=score;
run;
What Happens to Classification Variables?
 The Logistic Regression task assumes a linear relationship
between predictors and the logit for the response.
 For categorical variables, that assumption cannot be met.
 Specification as a Classification variable creates “design
variables” representing the information in the categorical
variables.
 The design variables are the ones actually used in model
calculations.
 There are many possible “parameterizations” of the design
variables.
Effects (default) Coding: An Example
$$\operatorname{logit}(p) = \beta_0 + \beta_1 D_{\text{Low income}} + \beta_2 D_{\text{Medium income}}$$
$\beta_0$ = the average value of the logit across all categories
$\beta_1$ = the difference between the logit for Low income and the average logit
$\beta_2$ = the difference between the logit for Medium income and the average logit

Design Variables
CLASS      Value   Label            1    2
IncLevel   1       Low Income       1    0
           2       Medium Income    0    1
           3       High Income     -1   -1

Analysis of Maximum Likelihood Estimates
Parameter     DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept     1    -0.5363    0.1015           27.9143           <.0001
IncLevel 1    1    -0.2259    0.1481           2.3247            0.1273
IncLevel 2    1    -0.2200    0.1447           2.3111            0.1285
Reference Cell Coding: An Example
$$\operatorname{logit}(p) = \beta_0 + \beta_1 D_{\text{Low income}} + \beta_2 D_{\text{Medium income}}$$
$\beta_0$ = the value of the logit when income is High
$\beta_1$ = the difference between the logits for Low and High income
$\beta_2$ = the difference between the logits for Medium and High income

Design Variables
CLASS      Value   Label            1    2
IncLevel   1       Low Income       1    0
           2       Medium Income    0    1
           3       High Income      0    0

Analysis of Maximum Likelihood Estimates
Parameter     DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept     1    -0.0904    0.1608           0.3159            0.5741
IncLevel 1    1    -0.6717    0.2465           7.4242            0.0064
IncLevel 2    1    -0.6659    0.2404           7.6722            0.0056
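The two parameterizations describe the same fitted model, so one set of estimates can be converted into the other. The sketch below starts from the reference-cell estimates printed above and recovers the effects-coding estimates; the variable names are hypothetical, and small differences in the last digit come from rounding in the printed tables.

```python
# Reference-cell estimates from the table above (High income is the reference).
ref = {"intercept": -0.0904, "low": -0.6717, "medium": -0.6659}

# Logit of each income level implied by reference-cell coding.
logits = {
    "low": ref["intercept"] + ref["low"],
    "medium": ref["intercept"] + ref["medium"],
    "high": ref["intercept"],
}

# Effects coding: the intercept is the average logit, and each parameter is
# the difference between a level's logit and that average.
average_logit = sum(logits.values()) / len(logits)
effects = {
    "intercept": average_logit,
    "low": logits["low"] - average_logit,
    "medium": logits["medium"] - average_logit,
}

# The High income effect is not estimated separately; with the design row
# (-1, -1) it equals minus the sum of the other two effects.
high_effect = -(effects["low"] + effects["medium"])

for name, value in effects.items():
    print(f"{name:10s} {value:+.4f}")
print(f"{'high':10s} {high_effect:+.4f}")
```

The printed values agree with the effects-coding table (-0.5363, -0.2259, -0.2200) to within rounding.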
Odds Ratio Calculation from the Current Logistic Regression Model
Logistic regression model:
$$\operatorname{logit}(p) = \log(\text{odds}) = \beta_0 + \beta_1 \cdot (\text{gender})$$
Odds ratio (females to males):
$$\text{odds}_{\text{females}} = e^{\beta_0 + \beta_1}, \qquad \text{odds}_{\text{males}} = e^{\beta_0}$$
$$\text{odds ratio} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$$
OR – nominal variables
OR – continuous variables
Model Fit versus Complexity
[Figure: model fit statistic versus model complexity (1 to 6) for the training and validation data.]
Choose simplest optimal model.
Model Assessment: Comparing Pairs
• Counting concordant, discordant, and tied pairs is a way to assess how well the model predicts its own data and therefore how well the model fits.
• In general, you want a high percentage of concordant pairs and low percentages of discordant and tied pairs.
• The following example identifies these pairs for a model that predicts whether a given person will buy goods worth more than $100.
Comparing Pairs
Each pair combines one customer who actually spent < $100 with one who spent $100 +.

Spent < $100     Spent $100 +     Result
P(100+) = .32    P(100+) = .42    The actual sorting agrees with the model. This is a concordant pair.
P(100+) = .42    P(100+) = .32    The actual sorting disagrees with the model. This is a discordant pair.
P(100+) = .42    P(100+) = .42    The model cannot distinguish between the two. This is a tied pair.

• By default, PROC LOGISTIC reports the (relative) frequencies of the individual types of pairs and the model quality statistics derived from them.
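The pair counting can be sketched directly. The predicted probabilities below are hypothetical and reuse the values .32 and .42 from the example. The function name is an assumption, and so is the exact comparison of probabilities: PROC LOGISTIC may group nearly equal probabilities before declaring a tie. The c statistic and Somers' D are two of the derived measures.

```python
def compare_pairs(p_event, p_nonevent):
    """Count concordant, discordant and tied pairs.

    p_event    -- predicted P(100+) for customers who actually spent $100 +
    p_nonevent -- predicted P(100+) for customers who actually spent < $100
    """
    concordant = discordant = tied = 0
    for pe in p_event:
        for pn in p_nonevent:
            if pe > pn:
                concordant += 1   # model ranks the actual buyer higher
            elif pe < pn:
                discordant += 1   # model ranks the actual buyer lower
            else:
                tied += 1         # model cannot distinguish the two
    total = concordant + discordant + tied
    return {
        "concordant": concordant,
        "discordant": discordant,
        "tied": tied,
        "c": (concordant + 0.5 * tied) / total,
        "somers_d": (concordant - discordant) / total,
    }


# Hypothetical scored data.
result = compare_pairs(p_event=[0.42, 0.32, 0.42], p_nonevent=[0.32, 0.42])
print(result)
```

With 3 events and 2 non-events there are 3 x 2 = 6 pairs in total.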
10. Decision trees, neural networks
Principle of decision trees
• Divide and conquer: I split the objects under study into suitable groups...
• and in each group I proceed in the same way again (recursion)...
• until I arrive at small groups for which a very simple model suffices.
DIVIDE ET IMPERA!
Jan Spousta: Přednášky k data miningu.
History of the method
• DIVIDE ET IMPERA is an old Roman saying, but…
• its use in data analysis in the sense of decision trees was not proposed until 1959, by W. A. Belson
  - W. A. Belson: British sociologist and methodologist, concerned mainly with juvenile delinquency
• Original reference: William A. Belson: Matching and Prediction on the Principle of Biological Classification, Applied Stat., VIII:65-75, 1959.
  - Even earlier (at least since the 1930s), however, statisticians dealt with the problems of categorizing continuous variables and of partitioning populations, although in a different context (Yule, Fisher…)
History of the method (cont.)
• The first algorithm implemented on a computer was called AID; it dates from 1963
• Reference: James N. Morgan, John A. Sonquist: Problems in the Analysis of Survey Data, and a Proposal, Journal of the American Statistical Association, 58:415-435, 1963.
• AID was based on the analysis of variance (sums of squares), as an aid for preparing ANOVA => the basis of the statistical branch of decision tree theory (CHAID, SEARCH, etc.)
History of the method (cont.)
• The term "decision tree" probably dates from 1966 (Experiments in Induction by E. B. Hunt, J. Marin and P. J. Stone) => a branch rooted in the theory of artificial intelligence.
• Its starting point was information theory: splitting into subgroups should bring an "information gain", i.e. reduce entropy (implemented e.g. in the algorithms ID3, C4.5 and C5, which are in use today).
• The growth of applications and use outside theoretical science came with the arrival of fast PCs and the rise of data mining (around the mid-1990s)
Why do we speak of trees?
• The successive splitting of groups of cases can be drawn as a tree diagram
• Root, branches, leaves: the terminology of graph theory
www.elseware.fr/storm
Why "decision"?
• A tree can be expressed as a set of if-then rules
• It can easily be applied in decision processes
Regression trees
• If the target variable is continuous, we speak of regression trees.
• Each leaf determines the predicted value of the target variable.
• All observations belonging to a given node have the same predicted value.
[Figure: regression tree with splits on RM (RM<6.9, RM<6.5, RM<7.4) and NOX (NOX<.51, NOX<.63, NOX<.66, NOX<.67) and leaf predictions 14, 16, 19, 22, 27, 27, 33, 46.]
Which is better: trees, regression...?
• There is no general rule for when to choose which type of algorithm; it is usually best to try several of them
• No data are outright suited to one type (and outright unsuited to another)
• In practice, however, all methods often reach similar accuracy => the choice is decided by interpretability, ease of use, stability of the results, the volume of input data required...
Which is better? (cont.)
• An Empirical Comparison of Decision Trees and Other Classification Methods (Tjen-Sien Lim, Wei-Yin Loh, Yu-Shan Shih, 1998): a comparison of 33 different methods on 32 data sets
• Main conclusion: the mean error rates of most classifiers do not differ from one another statistically significantly. There are, however, considerable differences in the computing time that the individual classifiers need.
• The best methods with acceptable time: polytomous logistic regression and the QUEST decision tree
• It should be added that both were programmed by the authors of the paper
Binary or general trees?
Binary trees
• E.g. CART, C5, QUEST
• Always 2 branches from a node
• Faster computation (fewer options)
• More nodes are needed
• Usually more accurate
=> data mining, scores
General trees
• E.g. CHAID, Exhaustive CHAID
• Arbitrary number of branches
• Easier for a human to interpret
• The tree is smaller
• Usually more logical
=> segmentation, marketing
Classical presentation: dendrogram
Alternative presentation: box chart
[Figure: box chart over the axes x and y.]
Suitable for a small number of predictors used (here x and y)
Alternative presentation: sectors
• The share of the individual branches in the total number of cases is easy to see.
• Color indicates the share of the category of interest or the degree of homogeneity of the node.
• Less suitable when we are interested in decision rules.
Alternative presentation: text
Simple, but harder to read and not very expressive
How a tree is "built"
In particular, it is necessary to determine
• the variable on which the tree is split
• a categorization of the values of this variable such that the split is optimal
It is essential whether we consider:
• an ordinal/nominal/interval variable
• binary/multiway splitting
The splitting criterion is also important:
• impurity reduction
• chi-square test

Variable   Values
X10        0.5
X10        1.8
X10        11, 46
X1         2.4
X1         1, 4, 61
...        ...
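Partitioning on an Ordinal Input
Splits of an ordinal input with L = 4 levels:
2-way splits: 1|234, 12|34, 123|4         $\binom{3}{1} = 3$
3-way splits: 1|2|34, 1|23|4, 12|3|4      $\binom{3}{2} = 3$
4-way split:  1|2|3|4                     $\binom{3}{3} = 1$
Number of B-way splits of L ordered levels:
$$\binom{L-1}{B-1} = \frac{(L-1)!}{(B-1)! \, (L-B)!}$$
Total number of splits:
$$\sum_{l=2}^{L} \binom{L-1}{l-1} = 2^{L-1} - 1$$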
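Partitioning on a Nominal Input
Splits of a nominal input with L = 4 levels:
2-way splits: 1|234, 2|134, 3|124, 4|123, 12|34, 13|24, 14|23
3-way splits: 1|2|34, 1|3|24, 1|4|23, 2|3|14, 2|4|13, 3|4|12
4-way split:  1|2|3|4

Number of B-way splits S(L, B); the total runs over all B from 2 to L:
L    B=2    B=3     B=4     total
2    1                      1
3    3      1               4
4    7      6       1       14
5    15     25      10      51
6    31     90      65      202
7    63     301     350     876
8    127    966     1701    4139
9    255    3025    7770    21146
$$S(L, B) = B \cdot S(L-1, B) + S(L-1, B-1)$$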
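The recursion for nominal inputs and the binomial count for ordinal inputs can be sketched and checked against the table. The function names are hypothetical, and the boundary cases S(L, 1) = S(L, L) = 1 are the standard ones for this recursion, stated here as an assumption because the slide shows only the recursive step.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def nominal_splits(L, B):
    """Number of ways to split L unordered levels into B non-empty branches."""
    if B < 1 or B > L:
        return 0
    if B == 1 or B == L:
        return 1
    # S(L, B) = B * S(L-1, B) + S(L-1, B-1)
    return B * nominal_splits(L - 1, B) + nominal_splits(L - 1, B - 1)


def ordinal_splits(L, B):
    """Number of ways to split L ordered levels into B contiguous branches."""
    return comb(L - 1, B - 1)


def total_splits(count, L):
    """Total number of splits with at least two branches."""
    return sum(count(L, B) for B in range(2, L + 1))


for L in range(2, 10):
    print(L, total_splits(ordinal_splits, L), total_splits(nominal_splits, L))
```

For 9 levels an ordinal input allows 255 splits and a nominal input 21146, which is why exhaustive search over nominal splits quickly becomes expensive.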
Impurity Reduction
[Figure: a parent node (impurity i(0), $n_0$ cases) split into four child nodes (impurities i(1) to i(4), with $n_1$ to $n_4$ cases).]
$$\Delta i = i(0) - \left( \frac{n_1}{n_0} i(1) + \frac{n_2}{n_0} i(2) + \frac{n_3}{n_0} i(3) + \frac{n_4}{n_0} i(4) \right)$$
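Gini Impurity
$$1 - \sum_{j=1}^{r} p_j^2 = 2 \sum_{j<k} p_j p_k$$
Entropy
$$H(p_1, p_2, \dots, p_r) = -\sum_{i=1}^{r} p_i \log_2 p_i$$
[Figure: impurity for r = 2 classes plotted against $p_1$ on the interval from 0.0 to 1.0, including the Gini curve $2 p_1 (1 - p_1)$.]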
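Chi-Squared Test
Split on X1 at 38.5, n = 1064, target classes 1, 7, 9.

Observed
Class          X1 < 38.5   X1 ≥ 38.5   Row share
1              293         71          .342
7              363         1           .342
9              42          294         .316
Column share   .656        .344

Expected
Class          X1 < 38.5   X1 ≥ 38.5
1              239         125
7              239         125
9              225         116

(O − E)² / E
Class          X1 < 38.5   X1 ≥ 38.5
1              12          23
7              64          123
9              149         273

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 644, \qquad \nu = (3 - 1)(2 - 1) = 2$$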
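p-Value Adjustments
Candidate splits (target classes 1, 7, 9):

X1: 38.5
Class   < 38.5   ≥ 38.5
1       293      71
7       363      1
9       42       294

X1: 17.5, 36.5
Class   < 17.5   17.5 to 36.5   ≥ 36.5
1       249      42             73
7       338      25             1
9       26       16             294

X10: 0.5, 41.5, 51.5
Class   < 0.5   0.5 to 41.5   41.5 to 51.5   ≥ 51.5
1       9       143           65             147
7       221     88            1              54
9       1       4             16             315

Split                  $\chi^2$   df   $-\log_{10}(P)$   m        $-\log_{10}(mP)$
X1: 38.5               644        2    140               96       138
X1: 17.5, 36.5         660        4    141               4560     137
X10: 0.5, 41.5, 51.5   814        6    172               156849   167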
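The first candidate split (X1 at 38.5) can be recomputed from its observed counts. The sketch below is an illustration under two stated assumptions: the p-value uses the closed form $P = e^{-\chi^2/2}$, which holds only for 2 degrees of freedom, and the adjustment simply multiplies the p-value by the multiplier m = 96 shown for this split. Working on the log scale avoids numerical underflow of such a small p-value.

```python
import math

# Observed counts for the split X1 < 38.5 / X1 >= 38.5, target classes 1, 7, 9.
observed = [[293, 71], [363, 1], [42, 294]]


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count
            stat += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


chi2, df = chi_square(observed)

# For df = 2 only: P = exp(-chi2 / 2), hence -log10(P) = (chi2 / 2) / ln(10).
logworth = (chi2 / 2.0) / math.log(10.0)

# Adjustment with the multiplier m from the table: -log10(m * P).
m = 96
adjusted_logworth = logworth - math.log10(m)

print(f"chi2 = {chi2:.2f}, df = {df}")
print(f"-log10(P) = {logworth:.1f}, -log10(mP) = {adjusted_logworth:.1f}")
```

Computed from unrounded expected counts, the statistic is about 643.5, close to the 644 on the slide, which sums rounded cell contributions; the two log-scale values round to the 140 and 138 shown in the table.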
The CHAID algorithm – introduction
• CHi-squared Automatic Interaction Detector
• One of the most widely used decision trees in the commercial sphere (alongside QUEST and C4.5 / C5)
• Kass, Gordon V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, Vol. 29, pp. 119-127.
• Based on the author's dissertation at the University of Witwatersrand (South Africa)
• Predecessors: AID (Morgan and Sonquist, 1963); THAID (Morgan and Messenger, 1973)
Reminder: the χ2 test of independence
• Independence is tested using the expression
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
  summed over all r × s cells of the contingency table
• If x and y are independent, this expression has Pearson's chi-square distribution with df = (r – 1)(s – 1)
• Test: if the area under the density curve "above" the observed value (the significance p) is < α, I reject the hypothesis that x and y are independent
• Testing several hypotheses at once => α must be adjusted (Bonferroni)
The CHAID algorithm: idea
• It starts with the whole data set
• The data set is split step by step (a split into any number of branches leaving one node is allowed)
• The algorithm is recursive: every node is split by the same rule
• It stops when no statistically significant split exists => a leaf is created
• Usually there is an additional condition on the minimum number of cases in a node and/or a leaf, or on the maximum depth of the tree
• http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/TREE-CHAID.pdf
CHAID: procedure within a node
• For every predictor
  - Build the contingency table target x predictor (dimension k x l)
  - For every pair of predictor values, compute the chi-square test of the subtable (k x 2)
  - Merge "similar" (= not significantly different) pairs step by step (starting with the lowest chi-square values) and recompute the contingency table. Stop when all remaining pairs differ significantly at the specified level.
  - Remember the merged categories and the p-value of the chi-square test of the resulting table with reduced dimension
• Select the predictor for which this p-value is lowest
• If the splitting conditions are met, split the cases in the node according to the already "merged" categories
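The merging step for one predictor can be sketched as follows. This is a simplified illustration, not the full CHAID procedure: it assumes a binary target (so every pairwise subtable is 2 x 2 with one degree of freedom, where the p-value is erfc(sqrt(chi2 / 2))), a nominal predictor, no Bonferroni adjustment, and a hypothetical merge threshold `alpha_merge`. The data and function names are invented for the example.

```python
import math
from itertools import combinations


def pvalue_2x2(a, b):
    """p-value of the chi-square test for two groups with a binary target.

    a, b -- (count of target 0, count of target 1) for each group.
    """
    col_totals = (a[0] + b[0], a[1] + b[1])
    n = col_totals[0] + col_totals[1]
    stat = 0.0
    for row in (a, b):
        row_total = row[0] + row[1]
        for j in range(2):
            e = row_total * col_totals[j] / n
            if e > 0:
                stat += (row[j] - e) ** 2 / e
    return math.erfc(math.sqrt(stat / 2.0))  # upper tail for df = 1


def chaid_merge(counts, alpha_merge=0.05):
    """Repeatedly merge the most similar pair of categories."""
    groups = {(name,): tuple(c) for name, c in counts.items()}
    while len(groups) > 1:
        # The most similar pair is the one with the highest p-value.
        g1, g2 = max(combinations(groups, 2),
                     key=lambda pair: pvalue_2x2(groups[pair[0]], groups[pair[1]]))
        if pvalue_2x2(groups[g1], groups[g2]) <= alpha_merge:
            break  # all remaining pairs differ significantly
        merged = (groups[g1][0] + groups[g2][0], groups[g1][1] + groups[g2][1])
        del groups[g1], groups[g2]
        groups[tuple(sorted(g1 + g2))] = merged
    return groups


# Hypothetical predictor with four categories: (target 0, target 1) counts.
counts = {"A": (90, 10), "B": (88, 12), "C": (50, 50), "D": (52, 48)}
groups = chaid_merge(counts)
print(groups)
```

Categories A and B end up in one branch and C and D in another, because only the difference between these two merged groups is significant.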
CHAID: assessment
• If the predictor has n categories, only on the order of $n^2$ tests are needed
• If all possible partitions were tested, the number of tests would grow exponentially with n
• CHAID thus saves computing time, but it is not guaranteed to find the optimal solution for the node (greedy search within the node)
• Ordinal variable: only adjacent categories can be merged
• Continuous variable: categorization is necessary
  - Here both better and worse implementations exist
Other algorithms
• There are dozens of related algorithms, often quite similar to one another
• Here we only outline the properties of a few of them (frequently used and/or interesting)
  - CART
  - ID3 and C5
  - QUEST
  - TreeNet
CART / C&RT
• Classification And Regression Tree
• The algorithm is based on computing a measure of the diversity ("impurity") of a node: I want to maximize
  div(parent) – (div(child A) + div(child B))
  where div(parent) is a constant and the terms being added are weighted by the share of cases in the nodes
• Gini diversity measure (inspired by economics, where inequalities in the distribution of wealth and income are measured in a similar way)
$$\mathrm{div}_{Gini} = 1 - \sum_i p_i^2$$
  - in our case the $p_i$ are the relative frequencies in the nodes
ID3, C4.5, C5 (See5)
• Instead of the Gini measure they use entropy
$$\mathrm{div}_{entrop} = -\sum_i p_i \log_2 p_i$$
  = the mean number of bits needed to encode a case in the given node
• Binary trees
• Built-in algorithm for simplifying the set of derived rules, which gives better interpretability
• Ross Quinlan: Induction of decision trees (1986); same author: C4.5: Programs for Machine Learning (1993); same author: C5.0 Decision Tree Software (1999)
• http://www.rulequest.com/see5-info.html
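Both diversity measures, and the weighted reduction that a split achieves, can be sketched in a few lines. The class counts are hypothetical (a parent node with 50 cases in each of two classes), and the function names are assumptions made for the example.

```python
import math


def gini(counts):
    """Gini diversity 1 - sum(p_i^2) from class counts in a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Entropy -sum(p_i * log2(p_i)) in bits; empty classes contribute 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def diversity_reduction(parent, children, measure):
    """div(parent) minus the child diversities weighted by share of cases."""
    n0 = sum(parent)
    weighted = sum(sum(child) / n0 * measure(child) for child in children)
    return measure(parent) - weighted


# Hypothetical split of a node with two classes.
parent = [50, 50]
children = [[40, 10], [10, 40]]

gini_gain = diversity_reduction(parent, children, gini)
entropy_gain = diversity_reduction(parent, children, entropy)
print(f"Gini reduction = {gini_gain:.4f}, entropy reduction = {entropy_gain:.4f} bits")
```

A split is better the larger the reduction; a split that leaves the class shares unchanged in every child gives a reduction of zero under both measures.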
QUEST
• Quick, Unbiased and Efficient Statistical Tree
• Loh, W.-Y. and Shih, Y.-S. (1997), Split selection methods for classification trees, Statistica Sinica, vol. 7, pp. 815-840
• The splitting variable is selected by a statistical test of independence predictor x target => slightly suboptimal, but fast; in addition, the selection of the splitting variable is unbiased
• Nominal target (= dependent variable) only
• Binary tree, pruning
• Missing values are imputed
TreeNet
• Friedman, J. H. (1999): Greedy Function Approximation: A Gradient Boosting Machine, Technical report, Dept. of Statistics, Stanford Univ.
• Instead of one large tree, a "forest" of small ones
• The final prediction is a weighted sum of the predictions of the individual components
• Analogy with the Taylor expansion: an expansion into trees
• Hard to interpret (black box), but robust and accurate; lower demands on data quality and preparation than a neural network or boosting of ordinary trees
• Commercial, www.salford-systems.com
Trees in SAS EM
• For further settings, see the help
• a possible result…
Decision Tree in SAS EM
Neural Networks
• The name Artificial Neural Networks (ANN) is also sometimes used.
• They are based on the observed functionality of the human brain.
• Compared with the brain, however, they are a greatly simplified mathematical model.
• A NN is often an adaptive system that changes its structure on the basis of external or internal information obtained during the learning phase.
• They are used e.g. for finding patterns in data, for speech recognition, or for classification problems.
• http://en.wikipedia.org/wiki/Artificial_neural_network
Example of a neural network
[Figure: a network with the inputs (height) and (sex), one hidden layer, and the outputs (short), (medium) and (tall).]
The Neuron
 Excitatory (+) and inhibitory (-) inputs, arriving at the
dendrites, are weighted by adaptable synapses.
 The weighted inputs are added together.
 If the sum is greater than an adaptable threshold (bias)
value, the neuron sends activation down its axon.
The McCulloch-Pitts Neuron
• A McCulloch-Pitts neuron with d inputs is formally defined by the following equation:
$$E(y) = f\left( w_0 + \sum_{i=1}^{d} w_i x_i \right)$$
• The step function, f(.), turns each McCulloch-Pitts neuron into a linear classifier/discriminator.
[Figure: inputs $x_1, \dots, x_d$ with weights $w_1, \dots, w_d$ and bias $w_0$ feeding one unit with output E(y).]
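A single neuron of this kind can be sketched directly from the equation. The convention that the step function fires at a net input of exactly zero, and the particular weights chosen for the two logic gates, are assumptions made for illustration.

```python
def step(z):
    """Step activation: 1 if the net input reaches the threshold, else 0."""
    return 1 if z >= 0 else 0


def mp_neuron(x, w, w0):
    """E(y) = f(w0 + sum_i w_i * x_i) with a step function f."""
    return step(w0 + sum(wi * xi for wi, xi in zip(w, x)))


def and_gate(x1, x2):
    # Fires only when both inputs are on: 1 + 1 - 1.5 >= 0.
    return mp_neuron([x1, x2], w=[1.0, 1.0], w0=-1.5)


def or_gate(x1, x2):
    # Fires when at least one input is on: 1 - 0.5 >= 0.
    return mp_neuron([x1, x2], w=[1.0, 1.0], w0=-0.5)


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [and_gate(a, b) for a, b in inputs]
or_outputs = [or_gate(a, b) for a, b in inputs]
print("AND:", and_outputs)
print("OR: ", or_outputs)
```

Both gates use the same weights and differ only in the bias, that is, in where the separating line is placed.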
The Hebb Rule
• The strength of the connection between neurons i and j should be adjusted in accordance with the equation:
$$\Delta w_{ij} = \eta \, \hat{y}_i x_j$$
• The eta ($\eta$) term is the neuron's learning rate, which scales the amount of weight adjustment.
• Permitted learning rate values range from 0 to 1.
• Large learning rate values risk divergence.
The Widrow-Hoff Delta Rule
• Hebb's learning rule is unstable.
• Widrow and Hoff proposed a variant of Hebb's rule, one that is stable under a range of learning rates:
$$\Delta w_{ij} = \eta \, (y_i - \hat{y}_i) \, x_j$$
• They called their learning model the delta rule.
• Because the delta rule reduces the sum of squared error, it is also known as the least mean squares rule.
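The delta rule can be sketched for a single linear unit with one input and a bias. The training data (points on the line y = 1 + 2x), the learning rate and the number of epochs are hypothetical, and the weights start at zero rather than at small random values so that the run is reproducible.

```python
def train_delta_rule(samples, eta=0.1, epochs=2000):
    """Train y_hat = w0 + w1 * x with the update  w += eta * (y - y_hat) * x."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            y_hat = w0 + w1 * x
            error = y - y_hat
            w0 += eta * error        # the bias sees a constant input of 1
            w1 += eta * error * x
    return w0, w1


# Hypothetical noise-free training data on the line y = 1 + 2x.
samples = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
w0, w1 = train_delta_rule(samples)
print(f"w0 = {w0:.6f}, w1 = {w1:.6f}")
```

Each update moves the weights in proportion to the current error, so the adjustments shrink as the predictions approach the targets.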
The Perceptron
 The perceptron is a pattern-recognition machine invented in
the 1950s for optical character recognition.
 Each processing unit is a McCulloch-Pitts neuron.
 A perceptron with n outputs is a discriminator function that
divides the input space into n distinct regions.
The Limitations of a Simple Perceptron
• The simple (linear) perceptron can only solve linearly separable problems.
• The EXCLUSIVE OR truth table (below) is an example of a problem that is not linearly separable.

Inputs       Output
x1    x2
F     F      F
T     F      T
F     T      T
T     T      F

[Figure: the four cases plotted in the (x1, x2) plane; no single straight line separates the T outputs from the F outputs.]
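The limitation can be illustrated with step-function neurons. The sketch below searches a small grid of weights for a single neuron that reproduces EXCLUSIVE OR and finds none (a grid search is only an illustration, not a proof), and then shows that one hidden layer is enough. The weights, the grid and the function names are assumptions made for the example.

```python
from itertools import product


def neuron(x, w, w0):
    """Step-function neuron: fires when w0 + sum(w_i * x_i) >= 0."""
    return 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0


inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]
xor_target = [0, 1, 1, 0]

# 1) Try every single neuron with weights on a grid from -2 to 2.
grid = [i * 0.5 for i in range(-4, 5)]
single_neuron_solutions = sum(
    1
    for w0, w1, w2 in product(grid, repeat=3)
    if [neuron(x, (w1, w2), w0) for x in inputs] == xor_target
)


# 2) One hidden layer solves the problem: XOR = OR and not AND.
def xor_network(x1, x2):
    h_or = neuron((x1, x2), (1.0, 1.0), -0.5)
    h_and = neuron((x1, x2), (1.0, 1.0), -1.5)
    return neuron((h_or, h_and), (1.0, -1.0), -0.5)


xor_outputs = [xor_network(x1, x2) for x1, x2 in inputs]
print("single-neuron solutions on the grid:", single_neuron_solutions)
print("two-layer network outputs:", xor_outputs)
```

The hidden units transform the inputs into a representation in which the two output classes become linearly separable.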
Advantages and disadvantages of NN
Advantages:
• Ability to learn.
• Easy parametrization.
• Robustness.
• They solve many problems.
Disadvantages:
• Hard to understand/interpret.
• They can suffer from overfitting.
• The inputs must be numeric.
• Difficult verification.
The Impact of Noisy Data
[Figure: two panels comparing the fits of a neural network and of a regression on noisy data.]
Types of neural networks
• There are many types of neural networks, and each of them suits a different class of problems.
• According to the presence of a "teacher", neural networks are divided into
  - supervised networks (the output is compared with the desired output)
  - unsupervised networks (without an external arbiter).
Types of neural networks by signal processing

Symbol   Signal processing
–        layer absent
*        none
L        linear combination
V        distance
Z        sign
S        sigmoid
G        Gaussian function
E        exponential
MIN      smallest wins
MAX      largest wins

Network type   Input layer   Hidden layer   Output layer   Authors
OLAM           *             –              L(+Z)          Haykin
HEBB           *             –              L+Z            Hopfield
HAMM           *             L+MAX          L+Z            Lippmann
MLP1           *             L+Z            L+Z            Widrow, Hoff
MLP2           *             L+S            L+S            Rumelhart
SOM            *             –              V+MIN          Kohonen
RBF            *             V+G            L              Poggio, Girosi
MOD            *             L+E            L              Jacobs, Jordan
COUNT          *             V+MIN          L              Hecht-Nielsen

• The table lists nine basic types of networks:
  - Optimum Linear Associative Memory (OLAM),
  - Hebbian network (HEBB),
  - Hamming network (HAMM),
  - multilayer network with bipolar neurons (Multi Layer Perceptron 1, MLP1),
  - multilayer network with continuous behavior (MLP2),
  - Kohonen maps (SOM),
  - radial basis function network (RBF),
  - modular network (MOD), and
  - counterpropagation network (COUNT).
• Further networks can be created by combining them.
Associative neural networks
• In an associative memory, the relevant information is recalled on the basis of partial knowledge of it (association).
• We distinguish networks with
  - autoassociative memory (refining or completing the input information on the basis of what has already been learned)
  - heteroassociative memory (recalling associated information on the basis of the input association)
Training neural networks
• The learning algorithm varies, but in general it has these steps:
  - initialization of the weights (small random values)
  - presentation of a new pattern (a vector of real values X)
  - computation of the current output (using the activation function f)
  - adaptation of the weights (the weights are recomputed according to the observed error)
  - repetition of the learning process (until the weights $w_i$ stabilize)
• The learning phase of the network is called the adaptive phase; after training, the network is in the recall phase (active phase).
Applications of neural networks

Task                            Suitable neural networks
logic circuits                  HEBB, HAMM, MLP1
noise removal                   MLP1, MLP2, RBF, MOD
speech and pronunciation        MLP2, SOM
compression                     COUNT
data mining                     OLAM, HEBB, SOM
optical character recognition   HEBB, OLAM, HAMM, MLP1, MLP2, RBF, SOM
Linear Perceptron
$$g^{-1}(E(y)) = w_0 + \sum_{i=1}^{d} w_i x_i$$
[Figure: inputs $x_1, x_2, \dots, x_d$ with weights $w_1, w_2, \dots, w_d$ and bias $w_0$ feeding the output unit $g^{-1}(E(y))$.]
Activation Functions
[Figure: activation versus net input for the Elliott, arctan, logistic and tanh functions.]
Multilayer Perceptron
$$g^{-1}(E(y)) = w_0 + \sum_{i=1}^{h} w_i \, g_i\left( w_{0i} + \sum_{j=1}^{d} w_{ij} x_j \right)$$
[Figure: inputs $x_1, \dots, x_d$ connected through one hidden layer to the output unit $g^{-1}(E(y))$.]
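The forward pass of a perceptron with one hidden layer follows the formula term by term. The sketch below assumes tanh as the activation of every hidden unit; the function name and the small weight values are hypothetical.

```python
import math


def mlp_forward(x, hidden_bias, hidden_weights, out_bias, out_weights,
                activation=math.tanh):
    """Return w0 + sum_i w_i * g(w_0i + sum_j w_ij * x_j).

    hidden_bias[i]       -- w_0i, bias of hidden unit i
    hidden_weights[i][j] -- w_ij, weight from input j to hidden unit i
    out_bias             -- w0
    out_weights[i]       -- w_i, weight from hidden unit i to the output
    """
    hidden = [
        activation(b + sum(w * xj for w, xj in zip(ws, x)))
        for b, ws in zip(hidden_bias, hidden_weights)
    ]
    return out_bias + sum(w * h for w, h in zip(out_weights, hidden))


# Hypothetical network: d = 2 inputs, h = 3 hidden units.
eta_value = mlp_forward(
    x=[0.5, -1.0],
    hidden_bias=[0.1, -0.2, 0.0],
    hidden_weights=[[0.4, 0.3], [-0.6, 0.2], [0.1, -0.5]],
    out_bias=0.2,
    out_weights=[1.0, -0.5, 0.8],
)
# With a logit link, the expected response is the inverse logit of the result.
expected_response = 1.0 / (1.0 + math.exp(-eta_value))
print(f"g^-1(E(y)) = {eta_value:.4f}, E(y) = {expected_response:.4f}")
```

The function returns the quantity on the left-hand side of the formula; the last step shows how a link function, here the logit as an example, turns it into E(y).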
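Shaping the Sigmoid
$$w_0 + w_1 \tanh(w_{01} + w_{11} x)$$
[Figure: the curve plotted against x. It is centered at $x = -w_{01} / w_{11}$, where its value is $w_0$; its asymptotes are $w_0 - w_1$ and $w_0 + w_1$; the sign and size of $w_{11}$ set its direction and steepness.]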
Sigmoidal Basis Functions
Skip-Layer Perceptron
$$g^{-1}(E(y)) = w_0 + \sum_{i=1}^{h} w_i \, g_i\left( w_{0i} + \sum_{j=1}^{d} w_{ij} x_j \right) + \sum_{k=1}^{d} w_k x_k$$
[Figure: inputs $x_1, \dots, x_d$ connected to the output unit both through the hidden layer and directly through the skip layer.]
MLP with Two Hidden Layers
$$g^{-1}(E(y)) = w_0 + \sum_{k=1}^{m} w_k \, g_k\left( w_{0k} + \sum_{j=1}^{n} w_{jk} \, g_j\left( w_{0jk} + \sum_{i=1}^{d} w_{ijk} x_i \right) \right)$$
[Figure: inputs $x_1, \dots, x_d$ connected through two nested hidden layers to the output unit $g^{-1}(E(y))$.]
How Many?
 A single hidden layer network models any continuous
relationship between the inputs and outputs.
 Two hidden layers model discontinuous relationships.
 The number of hidden units that will be required in each
defined hidden layer is problem specific.
Overview of Radial Basis Functions
• Ordinary Radial Basis Functions (ORBF).
• Normalized Radial Basis Functions (NRBF).
$$g^{-1}(E(y)) = w_0 + \sum_{i=1}^{h} w_i \exp\left( -w_{0i} \sum_{j=1}^{d} (x_j - w_{ij})^2 \right)$$
[Figure: inputs $x_1, \dots, x_d$ connected through radial hidden units to the output unit $g^{-1}(E(y))$.]
RBF Combination Functions
 XRADIAL Unequal Heights and Widths.
 EQRADIAL Equal Heights and Widths.
 EWRADIAL Equal Widths.
 EHRADIAL Equal Heights.
 EVRADIAL Equal Volumes.
Normalized Radial Basis Functions
$$g^{-1}(E(y)) = w_0 + \sum_{i=1}^{h} w_i \frac{e_i}{\sum_{j=1}^{h} e_j}, \quad \text{where} \quad e_i = \exp\left( f \cdot \ln(a_i) - w_{0i}^2 \sum_{j=1}^{d} (x_j - w_{ij})^2 \right)$$
[Figure: inputs $x_1, \dots, x_d$ connected through normalized radial hidden units to the output unit $g^{-1}(E(y))$.]
Constructing Custom Neural Networks
PROC NEURAL
 underlies the Neural Network node
 enables you to construct virtually any feed-forward neural
network architecture.
PROC NEURAL DATA=<data> DMDBCAT=<catalog>;
INPUT <inputs> / LEVEL=<input level>;
TARGET <targets> / LEVEL=<target level>;
ARCHI <architecture-name>;
PRELIM <starts> MAXITER=<iterations>;
TRAIN;
RUN;
The PROC NEURAL/Architecture Statement
 The PROC NEURAL statement invokes the neural network procedure.
 Options include the ability to read in saved networks and to assign validation and test
data sets.
 The SAS data set must already have been cataloged by means of the DMDB procedure.
PROC NEURAL DATA=<libref.>SAS-data-set
DMDBCAT=catalog <option-list>;
ARCHI architecture-name <HIDDEN=n> <DIRECT>;
 The ARCHITECTURE statement constructs a network with either zero hidden layers (a
linear model) or one hidden layer.
 The statement sets the appropriate COMBINE=
and ACT= functions, based on the specified architecture-name.
The TARGET/TRAIN Statement
 The TARGET statement identifies
the target variables.
 It is also used to specify the target
layer activation and error functions.
TARGET | OUTPUT variable-list /
<ACT=activation-function>
<BIAS|NOBIAS >
<COMBINE=combination-function>
<ERROR=keyword>
<ID=name>
<LEVEL=value>
<MESTA=number>
<MESTCON=number>
<SIGMA=number>
<STD=method>;
TRAIN OUT=<libref.>SAS-data-set
OUTEST=<libref.>SAS-data-set
OUTFIT=<libref.>SAS-data-set
<ACCEL|ACCELERATE=number>
<DECEL|DECELERATE=number>
<DUMMIES | NODUMMIES>
<ESTITER=i>
<LEARN=number>
<MAX|MAXMOMENTUM=number>
<MAXITER=integer>
<MAXLEARN=number>
<MAXTIME =number>
<MINLEARN=number>
<MOM|MOMENTUM=number>
<TECHNIQUE=name>;
Neural Network in SAS EM
11. Model evaluation – LC (ROC), CAP, Gini, KS, Lift, IV
Measuring model quality
 Predictive models cannot be used effectively without knowing their
quality/discriminatory power.
 Usually a whole range of models is available and only one, the best, has to be chosen.
 We consider two basic groups of quality indexes. The first is based on the distribution
function. The most widely used indexes are:
 Kolmogorov-Smirnov statistic (KS)
 Gini index (Somers' D, Kendall's τ_a, Goodman-Kruskal γ)
 C-statistic
 Lift.
 The second group of indexes is based on the probability density. The best-known
indexes are:
 Mean difference (Mahalanobis distance)
 Information statistic/value (I_val).
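Indexes based on the distribution function
Number of good clients: $n$
Number of bad clients: $m$
Proportions of good/bad clients: $p_G = \frac{n}{n+m}$, $p_B = \frac{m}{n+m}$
Indicators:
$$D_K = \begin{cases} 1, & \text{client is good} \\ 0, & \text{otherwise,} \end{cases} \qquad I(A) = \begin{cases} 1, & A \text{ holds} \\ 0, & \text{otherwise} \end{cases}$$
 Empirical distribution functions of the score $s_i$, with $N = n + m$ and $a \in [L, H]$:
$$F_{n.GOOD}(a) = \frac{1}{n} \sum_{i=1}^{N} I(s_i \le a \wedge D_K = 1)$$
$$F_{m.BAD}(a) = \frac{1}{m} \sum_{i=1}^{N} I(s_i \le a \wedge D_K = 0)$$
$$F_{N.ALL}(a) = \frac{1}{N} \sum_{i=1}^{N} I(s_i \le a)$$
 Kolmogorov-Smirnov statistic (KS):
$$KS = \max_{a \in [L,H]} \left| F_{m.BAD}(a) - F_{n.GOOD}(a) \right|$$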
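The empirical distribution functions and the KS statistic can be checked by hand on a tiny sample. The following is a minimal Python sketch, not the course's SAS implementation; the ten scores and good/bad flags are a hypothetical toy sample, and the names `ecdf` and `ks_statistic` are illustrative only.

```python
from bisect import bisect_right


def ecdf(sorted_scores, a):
    """Empirical distribution function: share of scores <= a."""
    return bisect_right(sorted_scores, a) / len(sorted_scores)


def ks_statistic(scores, good_flags):
    """Largest absolute gap between the EDFs of bad and good clients."""
    good = sorted(s for s, d in zip(scores, good_flags) if d == 1)
    bad = sorted(s for s, d in zip(scores, good_flags) if d == 0)
    # Both EDFs are step functions, so the maximum gap is attained
    # at one of the observed score values.
    return max(abs(ecdf(bad, a) - ecdf(good, a)) for a in set(scores))


# Hypothetical toy sample (not from the slides): 4 bad and 6 good clients.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
good_flags = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

ks = ks_statistic(scores, good_flags)
print(f"KS = {ks:.4f}")
```

In this sample the gap is largest at score 6, where all bad clients but only two of the six good clients have been reached.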
KS – computation in SAS
ods graphics on;
proc npar1way edf plots=edfplot
data=st192.sales_score;
class purchase;
var score;
run;
ods graphics off;
For more, see:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_npar1way_a0000000202.htm
 Lorenz curve (LC):
$$x = F_{m.BAD}(a), \quad y = F_{n.GOOD}(a), \quad a \in [L, H].$$
 This definition and the name LC are consistent with Müller, M.,
Rönz, B. (2000). The same definition of the curve, but under the name
ROC, can be found in Thomas et al. (2002). Siddiqi (2006)
uses the name ROC for the curve with swapped axes and LC
for the curve with $F_{m.BAD}(a)$ on the vertical axis and $F_{N.ALL}(a)$ on the
horizontal axis.
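Lorenz curve, Gini index
 Lorenz curve (LC):
$$x = F_{m.BAD}(a), \quad y = F_{n.GOOD}(a), \quad a \in [L, H].$$
 Gini index:
$$Gini = \frac{A}{A+B} = 2A,$$
where A and B are the areas marked in the Lorenz curve figure ($A + B = 1/2$). Numerically,
$$Gini = 1 - \sum_{k=2}^{n+m} \left(F_{m.BAD_k} - F_{m.BAD_{k-1}}\right)\left(F_{n.GOOD_k} + F_{n.GOOD_{k-1}}\right),$$
where $F_{m.BAD_k}$ ($F_{n.GOOD_k}$) is the k-th value of the vector of the empirical distribution function of bad (good) clients.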
Somers' D, Kendall's τ
 The Gini index is a special case of Somers' D (Somers (1962)),
an ordinal association measure defined as
$$D_{YX} = \frac{\tau_{XY}}{\tau_{XX}},$$
where $\tau_{XY}$ is Kendall's $\tau_a$ defined as
$$\tau_{XY} = E\left[\operatorname{sign}(X_1 - X_2)\,\operatorname{sign}(Y_1 - Y_2)\right],$$
where $(X_1, Y_1)$, $(X_2, Y_2)$ are bivariate, stochastically independent random vectors over
the same population and $E$ denotes the expected value. In our case Y=1 if the
client is good and Y=0 if the client is bad. The variable X represents the score.
 Thomas (2009) states that Somers' D assessing the performance of a given credit
scoring model can be computed as
$$D_S = \frac{\sum_i g_i \sum_{j<i} b_j - \sum_i g_i \sum_{j>i} b_j}{n \cdot m},$$
where $g_i$ ($b_j$) is the number of good (bad) clients in the i-th (j-th) score interval.
Somers' D, Mann-Whitney U
 Furthermore, $D_S$ can be expressed by means of the Mann-Whitney
U-statistic.
 Sort the data sample in ascending order of score and sum the ranks
of the good clients in the resulting sequence. Denote this sum by $R_G$.
Then
$$D_S = \frac{2U}{n \cdot m} - 1, \qquad U = R_G - \frac{1}{2}\, n (n+1).$$
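The rank-sum route to Somers' D can be traced on a small sample. This is a minimal Python sketch under stated assumptions: the ten-client sample is hypothetical, the function name is illustrative, and tied scores receive average ranks (the slides do not say how ties are handled).

```python
def somers_d_from_ranks(scores, good_flags):
    """D_S = 2U/(n*m) - 1 with U = R_G - n(n+1)/2, using 1-based ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score is tied with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # average rank of the tied block (assumption)
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    n = sum(good_flags)            # number of good clients
    m = len(good_flags) - n        # number of bad clients
    r_g = sum(r for r, d in zip(ranks, good_flags) if d == 1)
    u = r_g - n * (n + 1) / 2
    return 2 * u / (n * m) - 1


# Hypothetical toy sample: 6 good and 4 bad clients.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
good_flags = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
d_s = somers_d_from_ranks(scores, good_flags)
print(f"D_S = {d_s:.2f}")
```

Here the good clients hold ranks 3, 5, 7, 8, 9 and 10, so $R_G = 42$, $U = 21$ and $D_S = 42/24 - 1$, which agrees with counting 21 concordant and 3 discordant good/bad pairs directly.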
Concordant, discordant pairs
 Concordant pair $(X_1, Y_1)$, $(X_2, Y_2)$:
$$\operatorname{sgn}(X_2 - X_1) = \operatorname{sgn}(Y_2 - Y_1)$$
 Discordant pair:
$$\operatorname{sgn}(X_2 - X_1) = -\operatorname{sgn}(Y_2 - Y_1)$$
 In our case X is the score and Y is the good-client indicator ($D_K$). Since a good
client has Y=1 and a bad client Y=0, it is clear that in a concordant pair the good
client has a higher score than the bad client.
Somers' D, Goodman-Kruskal gamma
 Consider two randomly chosen clients, one good ($Y_1=1$) and
the other bad ($Y_2=0$); denote the score of the first by $s_1$ and of the second by $s_2$. Then:
 Concordant pair: $s_1 > s_2$
 Discordant pair: $s_1 < s_2$
 Tied pair: $s_1 = s_2$
 Somers' D:
$$D_S = \frac{\#Concordant - \#Discordant}{\#Concordant + \#Discordant + \#Tied}$$
 Goodman-Kruskal gamma:
$$\gamma = \frac{\#Concordant - \#Discordant}{\#Concordant + \#Discordant}$$
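Pair counts can be obtained from a table of score intervals without enumerating clients. The sketch below is a Python illustration, not the course's code; it uses the decile counts of the Lift example in this section (16, 12, 8, 5, 3, 2, 1, 1, 1, 1 bad clients out of 100 per decile) and treats good/bad pairs inside one decile as tied, which is an assumption of this sketch.

```python
def pair_counts(good_counts, bad_counts):
    """Concordant, discordant and tied good/bad pairs from interval counts.

    Intervals are ordered from the lowest to the highest score.
    """
    total_bad = sum(bad_counts)
    concordant = discordant = tied = 0
    bad_below = 0
    for g, b in zip(good_counts, bad_counts):
        concordant += g * bad_below                    # bad client in a lower interval
        discordant += g * (total_bad - bad_below - b)  # bad client in a higher interval
        tied += g * b                                  # same interval (assumed tied)
        bad_below += b
    return concordant, discordant, tied


bad = [16, 12, 8, 5, 3, 2, 1, 1, 1, 1]
good = [100 - b for b in bad]

c, d, t = pair_counts(good, bad)
somers_d = (c - d) / (c + d + t)
gamma = (c - d) / (c + d)
c_stat = (c + t / 2) / (c + d + t)   # ties counted as one half
print(f"D_S = {somers_d:.4f}, gamma = {gamma:.4f}, c-stat = {c_stat:.4f}")
```

The three pair counts add up to $n \cdot m = 950 \cdot 50$, and Somers' D comes out at about 0.55, the Gini value reported for that table.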
 C-statistic:
$$c\text{-}stat = A + C = \frac{1 + Gini}{2},$$
where A and C are areas marked in the figure.
This statistic equals the probability that a randomly chosen good
client has a higher score than a randomly chosen bad client, i.e.
$$c\text{-}stat = P\left(s_1 > s_2 \mid D_{K_1} = 1 \wedge D_{K_2} = 0\right)$$
CAP – AR index
CAP (Lift chart):
In this case the x-axis shows the proportion of all clients ($F_{ALL}$) and the
y-axis the proportion of bad clients ($F_{BAD}$).
The ideal model is now represented by the broken line from the point
[0, 0] through [$p_B$, 1] to the point [1, 1].
The advantage of this chart is that the proportion of rejected bad clients
can be read off against the overall proportion of rejected clients. For example,
to reject 70% of bad clients we must reject approximately 40% of all applicants.
AR (Accuracy Ratio):
$$AR = \frac{\text{area between CAP and diagonal}}{\text{area between ideal-model CAP and diagonal}} = \frac{\text{area between CAP and diagonal}}{0.5\,(1 - p_B)} = Gini$$
 Another possible measure of the quality of a scoring model is Lift, which says how many
times the given model, at a given rejection level, is better than a random model. More
precisely, it is the ratio of the proportion of bad clients with a score less than or equal to
a given score $a$, $a \in [L, H]$, to the proportion of bad clients in the whole population.
Formally:
$$absLift(a) = \frac{BadRate(a)}{BadRate}$$
$$Lift(a) = \frac{CumBadRate(a)}{BadRate} = \frac{\sum_{i=1}^{n+m} I(s_i \le a \wedge Y = 0) \Big/ \sum_{i=1}^{n+m} I(s_i \le a)}{\sum_{i=1}^{n+m} I(Y = 0) \Big/ \sum_{i=1}^{n+m} I(Y = 0 \vee Y = 1)} = \frac{\sum_{i=1}^{n+m} I(s_i \le a \wedge Y = 0) \Big/ \sum_{i=1}^{n+m} I(s_i \le a)}{m / N}$$
Lift in SAS
%macro lift(data=,score=,response=); /* Print the lift table and plot it */
  /* Order the observations by score */
  proc sort data=&data;
    by &score;
  run;
  data work.score;
    set &data;
    rank+1;
  run;
  /* Split into deciles by score */
  proc rank data=score groups=10 out=score;
    var rank;
    ranks decile;
  run;
  data score;
    set score;
    decile=decile+1;
  run;
  /* Build the lift table */
  proc sql;
    create table work.lift1 as
    select decile, count(*) as N, sum(&response) as N_of_bad,
           (calculated N_of_bad)/(calculated N)*100 as bad_rate,
           (calculated bad_rate)/(select 100*sum(&response)/count(*) from score) as abs_lift
    from score
    group by decile;
  quit;
  /* Compute the cumulative lift */
  proc sql;
    create table work.lift2 as
    select *, (select sum(N_of_bad)/sum(N) from work.lift1 as a
               where a.decile<=b.decile)
              /(select sum(&response)/count(*) from work.score) as cum_lift
    from work.lift1 as b;
  quit;
  /* Print the lift table */
  title "Lift";
  proc print data=work.lift2 noobs;
    format bad_rate 4.2
           abs_lift 5.3
           cum_lift 5.3;
  run;
  /* Plot the lift */
  proc gplot data=work.lift2;
    title 'Absolute and cumulative lift';
    plot (abs_lift cum_lift)*decile / overlay legend grid;
    symbol interpol=join w=5;
  run;
  quit;
%mend lift;
%lift(data=comp1.score_age,score=score,response=SeriousDlqin2yrs);
                    absolutely                           cumulatively
decile  # clients   # bad clients  Bad rate  abs. Lift   # bad clients  Bad rate  cum. Lift
1       100         16             16.0%     3.20        16             16.0%     3.20
2       100         12             12.0%     2.40        28             14.0%     2.80
3       100         8              8.0%      1.60        36             12.0%     2.40
4       100         5              5.0%      1.00        41             10.3%     2.05
5       100         3              3.0%      0.60        44             8.8%      1.76
6       100         2              2.0%      0.40        46             7.7%      1.53
7       100         1              1.0%      0.20        47             6.7%      1.34
8       100         1              1.0%      0.20        48             6.0%      1.20
9       100         1              1.0%      0.20        49             5.4%      1.09
10      100         1              1.0%      0.20        50             5.0%      1.00
All     1000        50             5.0%
[Figures: abs. Lift and cum. Lift by decile; Lorenz curve against the base line, Gini = 0.55]
 The computation can use a table with the counts of all
and bad clients in the given score intervals
(e.g. deciles).
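The Gini values quoted with the decile tables can be reproduced from the counts alone. The following Python sketch applies the summation formula for the Gini index to interval-level points of the Lorenz curve, starting from (0, 0); using decile points instead of individual clients is an assumption of this sketch, and the function name is illustrative.

```python
def gini_from_table(bad_counts, client_counts):
    """Gini = 1 - sum (F_bad_k - F_bad_{k-1}) * (F_good_k + F_good_{k-1})."""
    good_counts = [c - b for b, c in zip(bad_counts, client_counts)]
    m, n = sum(bad_counts), sum(good_counts)
    f_bad_prev = f_good_prev = 0.0
    cum_bad = cum_good = 0
    area_sum = 0.0
    for b, g in zip(bad_counts, good_counts):
        cum_bad += b
        cum_good += g
        f_bad, f_good = cum_bad / m, cum_good / n
        area_sum += (f_bad - f_bad_prev) * (f_good + f_good_prev)
        f_bad_prev, f_good_prev = f_bad, f_good
    return 1 - area_sum


clients = [100] * 10
monotone = [16, 12, 8, 5, 3, 2, 1, 1, 1, 1]      # bad counts, lowest scores first
non_monotone = [8, 12, 16, 5, 3, 2, 1, 1, 1, 1]  # first three deciles swapped

gini_monotone = gini_from_table(monotone, clients)
gini_non_monotone = gini_from_table(non_monotone, clients)
gini_reversed = gini_from_table(monotone[::-1], clients)
print(f"{gini_monotone:.2f} {gini_non_monotone:.2f} {gini_reversed:.2f}")
```

Reversing the order of the deciles only flips the sign of the result, which matches the slide on a score with the opposite meaning.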
                    absolutely                           cumulatively
decile  # clients   # bad clients  Bad rate  abs. Lift   # bad clients  Bad rate  cum. Lift
1       100         8              8.0%      1.60        8              8.0%      1.60
2       100         12             12.0%     2.40        20             10.0%     2.00
3       100         16             16.0%     3.20        36             12.0%     2.40
4       100         5              5.0%      1.00        41             10.3%     2.05
5       100         3              3.0%      0.60        44             8.8%      1.76
6       100         2              2.0%      0.40        46             7.7%      1.53
7       100         1              1.0%      0.20        47             6.7%      1.34
8       100         1              1.0%      0.20        48             6.0%      1.20
9       100         1              1.0%      0.20        49             5.4%      1.09
10      100         1              1.0%      0.20        50             5.0%      1.00
All     1000        50             5.0%
[Figures: abs. Lift and cum. Lift by decile; Lorenz curve against the base line, Gini = 0.48]
 If the bad rate is not monotone:
 LC looks OK
 Gini decreases slightly
 Lift, however, looks odd
 If the score has exactly the opposite meaning, we obtain "opposite" pictures.
Reversing the decile order of the table with Gini = 0.55 (16, 12, 8, 5, 3, 2, 1, 1, 1, 1 bad clients) gives:
                    absolutely                           cumulatively
decile  # clients   # bad clients  Bad rate  abs. Lift   # bad clients  Bad rate  cum. Lift
1       100         1              1.0%      0.20        1              1.0%      0.20
2       100         1              1.0%      0.20        2              1.0%      0.20
3       100         1              1.0%      0.20        3              1.0%      0.20
4       100         1              1.0%      0.20        4              1.0%      0.20
5       100         2              2.0%      0.40        6              1.2%      0.24
6       100         3              3.0%      0.60        9              1.5%      0.30
7       100         5              5.0%      1.00        14             2.0%      0.40
8       100         8              8.0%      1.60        22             2.8%      0.55
9       100         12             12.0%     2.40        34             3.8%      0.76
10      100         16             16.0%     3.20        50             5.0%      1.00
All     1000        50             5.0%
[Figures: Lorenz curve against the base line, Gini = -0.55; abs. Lift and cum. Lift by decile]
 SC 1: Gini = 0.42, K-S = 0.34
decile  # clients   # bad clients  Bad rate
1       100         35             35.0%
2       100         16             16.0%
3       100         8              8.0%
4       100         8              8.0%
5       100         7              7.0%
6       100         6              6.0%
7       100         6              6.0%
8       100         5              5.0%
9       100         5              5.0%
10      100         4              4.0%
All     1000        100            10.0%
 SC 2: Gini = 0.42, K-S = 0.36
decile  # clients   # bad clients  Bad rate
1       100         20             20.0%
2       100         18             18.0%
3       100         17             17.0%
4       100         15             15.0%
5       100         12             12.0%
6       100         6              6.0%
7       100         4              4.0%
8       100         3              3.0%
9       100         3              3.0%
10      100         2              2.0%
All     1000        100            10.0%
[Figures for each scorecard: Lorenz curve against the base line; empirical distribution functions of good and bad clients]
It is evident that Gini and KS alone are not enough!
[Figures: abs. Lift and cum. Lift by decile for SC 1 and SC 2]
 SC 1: Lift20% = 2.55, Lift50% = 1.48
 SC 2: Lift20% = 1.90, Lift50% = 1.64
At 20% SC 1 has the higher Lift (2.55 > 1.90); at 50% SC 2 does (1.48 < 1.64).
SC 2 is better if the expected reject rate is approximately 50%.
SC 1 is significantly better if the expected reject rate is approximately 20%.
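The Lift20% and Lift50% figures for the two scorecards follow directly from the decile counts of bad clients (35, 16, 8, ... for SC 1 and 20, 18, 17, ... for SC 2). The following is a small Python sketch of that arithmetic; the function name is illustrative and not part of the course material.

```python
def cumulative_lift(bad_counts, client_counts):
    """Cumulative lift per score band, bands ordered from the worst score."""
    overall_bad_rate = sum(bad_counts) / sum(client_counts)
    lifts = []
    cum_bad = cum_clients = 0
    for bad, clients in zip(bad_counts, client_counts):
        cum_bad += bad
        cum_clients += clients
        lifts.append((cum_bad / cum_clients) / overall_bad_rate)
    return lifts


clients = [100] * 10
sc1 = cumulative_lift([35, 16, 8, 8, 7, 6, 6, 5, 5, 4], clients)
sc2 = cumulative_lift([20, 18, 17, 15, 12, 6, 4, 3, 3, 2], clients)

# Index 1 is the 20% cut-off, index 4 the 50% cut-off.
print(f"SC 1: Lift20% = {sc1[1]:.2f}, Lift50% = {sc1[4]:.2f}")
print(f"SC 2: Lift20% = {sc2[1]:.2f}, Lift50% = {sc2[4]:.2f}")
```

Which scorecard is preferable therefore depends on the cut-off: the ranking at 20% is the opposite of the ranking at 50%.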
Lift, QLift
 Lift can be expressed and computed by the formula:
$$Lift(a) = \frac{F_{m.BAD}(a)}{F_{N.ALL}(a)}, \quad a \in [L, H]$$
 In practice, Lift is computed for 10%, 20%, . . . , 100% of
clients with the worst score. Hence we define:
$$QLift(q) = \frac{F_{m.BAD}\left(F_{N.ALL}^{-1}(q)\right)}{F_{N.ALL}\left(F_{N.ALL}^{-1}(q)\right)} = \frac{1}{q}\, F_{m.BAD}\left(F_{N.ALL}^{-1}(q)\right), \quad q \in (0, 1],$$
$$F_{N.ALL}^{-1}(q) = \min\{a \in [L, H] : F_{N.ALL}(a) \ge q\}$$
 A typical value of q is 0.1. Then we have
$$QLift_{10\%} = QLift(0.1) = 10 \cdot F_{m.BAD}\left(F_{N.ALL}^{-1}(0.1)\right)$$
Lift and QLift for ideal model
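 It is natural to ask what Lift and QLift look like for the ideal model.
Hence we derived the following formulas.
 QLift for ideal model:
$$QLift_{ideal}(q) = \begin{cases} \frac{1}{p_B}, & 0 < q \le p_B \\ \frac{1}{q}, & p_B < q \le 1 \end{cases}$$
 Lift for ideal model:
$$Lift_{ideal}(a) = \begin{cases} \frac{1}{p_B}, & F_{N.ALL}(a) \le p_B \\ \frac{1}{F_{N.ALL}(a)}, & F_{N.ALL}(a) > p_B \end{cases}$$
[Figure: QLift value of the ideal model against $F_{N.ALL}$, with $1/p_B$ marked on the vertical axis and $p_B$ on the horizontal axis]
We can see that the upper limit of Lift and QLift is equal to $\frac{1}{p_B}$.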
Lift Ratio (LR)
 Once we know the form of QLift for the ideal model, we can define the Lift
Ratio (LR) as an analogy to the Gini index.
[Figure: QLift value against $F_{N.ALL}$ for the actual, ideal and random models, with the areas A and B and the values $1/p_B$ and $p_B$ marked]
 LR is a global measure of the model's quality and takes values from 0 to
1. The value 0 corresponds to the random model, the value 1 to the ideal model.
The meaning of this index is simple: the higher, the better. An important
feature is that the Lift Ratio allows us to fairly
compare two models developed on different
data samples, which is not possible with Lift.
RLift, IRL
 Since the Lift Ratio compares the areas under the Lift function for the actual and ideal models, the next
concept focuses on comparing the Lift functions themselves. We define the Relative Lift
function by
$$RLift(q) = \frac{QLift(q)}{QLift_{ideal}(q)}, \quad q \in (0, 1],$$
where $QLift_{ideal}$ is the QLift of the ideal model.
[Figure: RLift against $F_{N.ALL}$ for the actual, ideal and random models]
 In connection with RLift we define the Integrated
Relative Lift (IRL):
$$IRL = \int_0^1 RLift(q)\, dq$$
 It takes values from $0.5 + \frac{p_B^2}{2}$, for the random model, to 1, for the ideal model. The following
simulation study shows an interesting connection to the c-statistic.
Example
 We consider two scoring models with the score distribution given in the table below.
 We consider the standard meaning of scores, i.e. a higher score band means better clients (clients with
the lowest scores, i.e. clients in score band 1, have the highest probability of default).
 The Gini indexes are equal for both models.
 It is evident from the Lorenz curves that the first model is stronger for higher score bands and
the second one is better for lower score bands.
 The same can be read from the values of QLift.
                             Scoring Model 1                              Scoring Model 2
score band  # clients  q     # bad  cumul. bad  cumul. bad rate  QLift    # bad  cumul. bad  cumul. bad rate  QLift
1           100        0.1   20     20          20.0%            2.00     35     35          35.0%            3.50
2           100        0.2   18     38          19.0%            1.90     16     51          25.5%            2.55
3           100        0.3   17     55          18.3%            1.83     8      59          19.7%            1.97
4           100        0.4   15     70          17.5%            1.75     8      67          16.8%            1.68
5           100        0.5   12     82          16.4%            1.64     7      74          14.8%            1.48
6           100        0.6   6      88          14.7%            1.47     6      80          13.3%            1.33
7           100        0.7   4      92          13.1%            1.31     6      86          12.3%            1.23
8           100        0.8   3      95          11.9%            1.19     5      91          11.4%            1.14
9           100        0.9   3      98          10.9%            1.09     5      96          10.7%            1.07
10          100        1.0   2      100         10.0%            1.00     4      100         10.0%            1.00
All         1000             100                                          100
Gini = 0.42 for both scoring models.
 Since QLift is not defined for q=0, we extrapolated the value by
$$QLift(0) = 3 \cdot QLift(0.1) - 3 \cdot QLift(0.2) + QLift(0.3)$$
According to both the QLift and RLift
curves we can state that:
 If the expected reject rate is up to
40%, then model 2 is better.
 If the expected reject rate is more
than 40%, then model 1 is better.
 Now, we consider the indexes LR and IRL:
$$LR = \frac{A}{A + B}$$
             scoring model 1   scoring model 2
GINI         0.420             0.420
QLift(0.1)   2.000             3.500
LR           0.242             0.372
IRL          0.699             0.713
Using LR and IRL we can state
that model 2 is better than model
1 although their Gini coefficients
are equal.
Indexes based on the density
 Mean difference
(Mahalanobis distance):
$$D = \frac{M_g - M_b}{S},$$
where S is the pooled standard deviation:
$$S = \left(\frac{n S_g^2 + m S_b^2}{n + m}\right)^{\frac{1}{2}}$$
$M_g$, $M_b$ are the mean scores of good (bad) clients,
$S_g$, $S_b$ are the corresponding standard deviations.
 Information value (I_val) – continuous case (divergence):
$$I_{val} = \int_{-\infty}^{\infty} \left(f_{GOOD}(x) - f_{BAD}(x)\right) \ln\left(\frac{f_{GOOD}(x)}{f_{BAD}(x)}\right) dx$$
With $f_{diff}(x) = f_{GOOD}(x) - f_{BAD}(x)$ and $f_{LR}(x) = \ln\left(\frac{f_{GOOD}(x)}{f_{BAD}(x)}\right)$, the integrand is $f_{diff}(x)\, f_{LR}(x)$.
• This is the symmetrized Kullback-Leibler divergence, also known as the
J-divergence.
 Information statistic/value (I_val) – discrete case:
 Create score intervals, typically deciles. Denote the number of good (bad) clients in the i-th
interval by $g_i$ ($b_i$).
 It must hold that $g_i > 0$, $b_i > 0$ for all $i$.
 Then we get
$$I_{val} = \sum_i \left(\frac{g_i}{n} - \frac{b_i}{m}\right) \ln\left(\frac{g_i\, m}{b_i\, n}\right)$$
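The discrete information value is a short sum over the score intervals. The following Python sketch shows one way to compute it; the two-interval counts are hypothetical and chosen only so that the result can be checked by hand, and the function name is illustrative.

```python
from math import log


def information_value(good_counts, bad_counts):
    """I_val = sum_i (g_i/n - b_i/m) * ln((g_i * m) / (b_i * n))."""
    if any(g <= 0 or b <= 0 for g, b in zip(good_counts, bad_counts)):
        raise ValueError("every interval needs g_i > 0 and b_i > 0")
    n, m = sum(good_counts), sum(bad_counts)
    return sum((g / n - b / m) * log((g * m) / (b * n))
               for g, b in zip(good_counts, bad_counts))


# Hypothetical example: two score intervals, 100 good and 100 bad clients.
ival = information_value([20, 80], [60, 40])
print(f"I_val = {ival:.4f}")

# Identical distributions of good and bad clients carry no information.
ival_same = information_value([30, 70], [3, 7])
```

Each term is non-negative, because the difference of the shares and the logarithm of their ratio always have the same sign; the requirement $g_i > 0$, $b_i > 0$ keeps the logarithm finite.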
Normally distributed scores
 Assume that the scores of good and bad clients are normally distributed, i.e. their
probability densities have the form
$$f_{GOOD}(x) = \frac{1}{\sigma_g \sqrt{2\pi}}\, e^{-\frac{(x - \mu_g)^2}{2\sigma_g^2}}, \qquad f_{BAD}(x) = \frac{1}{\sigma_b \sqrt{2\pi}}\, e^{-\frac{(x - \mu_b)^2}{2\sigma_b^2}}$$
 Estimates of the parameters $\mu_g$, $\mu_b$, $\sigma_g$, $\sigma_b$:
$M_g$, $M_b$ are the arithmetic means of the scores of good (bad) clients,
$S_g$, $S_b$ are the standard deviations of the scores of good (bad) clients.
 Pooled standard deviation:
$$S = \left(\frac{n S_g^2 + m S_b^2}{n + m}\right)^{\frac{1}{2}}$$
 Estimates of the mean and standard deviation of the scores of all clients, $\mu_{ALL}$, $\sigma_{ALL}$:
$$M = M_{ALL} = \frac{n M_g + m M_b}{n + m}$$
$$S_{ALL} = \left(\frac{n S_g^2 + m S_b^2 + n (M_g - M)^2 + m (M_b - M)^2}{n + m}\right)^{\frac{1}{2}}$$
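 Assume that the standard deviations of both scores are equal to the value σ. Then:
$$D = \frac{\mu_g - \mu_b}{\sigma}, \quad \text{estimated by} \quad D = \frac{M_g - M_b}{S}$$
$$KS = \Phi\left(\frac{D}{2}\right) - \Phi\left(-\frac{D}{2}\right) = 2\,\Phi\left(\frac{D}{2}\right) - 1$$
$$Gini = 2\,\Phi\left(\frac{D}{\sqrt{2}}\right) - 1$$
$$I_{val} = D^2$$
$$QLift(q) = \frac{1}{q}\,\Phi\left(\frac{\sigma_{ALL}}{\sigma}\,\Phi^{-1}(q) + p_G\, D\right), \quad \text{estimated by} \quad QLift(q) = \frac{1}{q}\,\Phi\left(\frac{S_{ALL}}{S}\,\Phi^{-1}(q) + p_G\, D\right)$$
where $\Phi$ is the distribution function of the standard normal distribution, $\Phi_{\mu,\sigma^2}$ is the distribution
function of the normal distribution with parameters $\mu$, $\sigma^2$, and $\Phi^{-1}$ is the standard normal quantile function.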
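The closed-form indexes for normally distributed scores with a common standard deviation are easy to evaluate numerically. The following is a minimal Python sketch using the standard library's `statistics.NormalDist`; the parameter values are hypothetical, the function names are illustrative, and $\sigma_{ALL}$ is computed from the population form of the $S_{ALL}$ formula.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def equal_sigma_indexes(mu_g, mu_b, sigma):
    """D, KS, Gini and I_val when both scores share the standard deviation sigma."""
    d = (mu_g - mu_b) / sigma
    ks = 2 * PHI.cdf(d / 2) - 1
    gini = 2 * PHI.cdf(d / sqrt(2)) - 1
    ival = d ** 2
    return d, ks, gini, ival


def qlift(q, mu_g, mu_b, sigma, p_g):
    """QLift(q) = Phi(sigma_ALL / sigma * Phi^-1(q) + p_G * D) / q, for 0 < q < 1."""
    p_b = 1 - p_g
    mu_all = p_g * mu_g + p_b * mu_b
    # Within-group variance plus between-group variance.
    sigma_all = sqrt(sigma ** 2 + p_g * (mu_g - mu_all) ** 2 + p_b * (mu_b - mu_all) ** 2)
    d = (mu_g - mu_b) / sigma
    return PHI.cdf(sigma_all / sigma * PHI.inv_cdf(q) + p_g * d) / q


# Hypothetical parameters: good clients score one standard deviation higher.
d, ks, gini, ival = equal_sigma_indexes(mu_g=1.0, mu_b=0.0, sigma=1.0)
lift_10 = qlift(0.1, mu_g=1.0, mu_b=0.0, sigma=1.0, p_g=0.9)
print(f"D = {d:.2f}, KS = {ks:.4f}, Gini = {gini:.4f}, I_val = {ival:.2f}")
```

With $D = 1$ the Gini index is larger than KS, and QLift at 10% stays between 1 (random model) and $1/p_B = 10$ (ideal model).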
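 In general, i.e. without assuming equal standard deviations of the scores:
$$KS = \Phi\left(\frac{a}{b}\,\sigma_b D^* - \frac{1}{b}\,\sigma_g \sqrt{a^2 D^{*2} + 2bc}\right) - \Phi\left(\frac{a}{b}\,\sigma_g D^* - \frac{1}{b}\,\sigma_b \sqrt{a^2 D^{*2} + 2bc}\right),$$
where
$$a = \sqrt{\sigma_g^2 + \sigma_b^2}, \quad b = \sigma_b^2 - \sigma_g^2, \quad c = \ln\left(\frac{\sigma_b}{\sigma_g}\right), \quad D^* = \frac{\mu_g - \mu_b}{\sqrt{\sigma_g^2 + \sigma_b^2}}.$$
The estimate uses $D^* = \frac{M_g - M_b}{\sqrt{S_g^2 + S_b^2}}$ and replaces $\sigma_g$, $\sigma_b$ by $S_g$, $S_b$:
$$KS = \Phi\left(\frac{S_b \sqrt{S_g^2 + S_b^2}\, D^* - S_g \sqrt{(S_g^2 + S_b^2) D^{*2} + 2 (S_b^2 - S_g^2) \ln\frac{S_b}{S_g}}}{S_b^2 - S_g^2}\right) - \Phi\left(\frac{S_g \sqrt{S_g^2 + S_b^2}\, D^* - S_b \sqrt{(S_g^2 + S_b^2) D^{*2} + 2 (S_b^2 - S_g^2) \ln\frac{S_b}{S_g}}}{S_b^2 - S_g^2}\right)$$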
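 In general, i.e. without assuming equal standard deviations of the scores:
$$Gini = 2\,\Phi(D^*) - 1$$
$$Lift_q = \frac{1}{q}\,\Phi_{\mu_b,\sigma_b^2}\left(\Phi^{-1}_{\mu_{ALL},\sigma_{ALL}^2}(q)\right) = \frac{1}{q}\,\Phi\left(\frac{\sigma_{ALL}\,\Phi^{-1}(q) + \mu_{ALL} - \mu_b}{\sigma_b}\right), \quad \text{estimated by} \quad QLift(q) = \frac{1}{q}\,\Phi\left(\frac{S_{ALL}\,\Phi^{-1}(q) + M - M_b}{S_b}\right)$$
$$I_{val} = (A + 1)\, D^{*2} + A - 1, \quad A = \frac{1}{2}\left(\frac{\sigma_g^2}{\sigma_b^2} + \frac{\sigma_b^2}{\sigma_g^2}\right), \quad \text{estimated with} \quad A = \frac{1}{2}\left(\frac{S_g^2}{S_b^2} + \frac{S_b^2}{S_g^2}\right)$$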
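For unequal standard deviations the KS value is the gap between the two normal distribution functions at the point where the densities cross. The Python sketch below evaluates a closed form of that gap, written with $a = \sqrt{\sigma_g^2 + \sigma_b^2}$, $b = \sigma_b^2 - \sigma_g^2$, $c = \ln(\sigma_b/\sigma_g)$, and compares it with a brute-force search over a grid. The parameter values are hypothetical, the function names are illustrative, and the closed form as coded assumes $\sigma_g \ne \sigma_b$ and $\mu_g > \mu_b$.

```python
from math import log, sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def general_indexes(mu_g, sigma_g, mu_b, sigma_b):
    """KS, Gini and I_val for normal scores with sigma_g != sigma_b."""
    a = sqrt(sigma_g ** 2 + sigma_b ** 2)
    b = sigma_b ** 2 - sigma_g ** 2          # non-zero by assumption
    c = log(sigma_b / sigma_g)
    d_star = (mu_g - mu_b) / a
    root = sqrt(a ** 2 * d_star ** 2 + 2 * b * c)
    ks = (PHI.cdf(a / b * sigma_b * d_star - sigma_g * root / b)
          - PHI.cdf(a / b * sigma_g * d_star - sigma_b * root / b))
    gini = 2 * PHI.cdf(d_star) - 1
    big_a = 0.5 * (sigma_g ** 2 / sigma_b ** 2 + sigma_b ** 2 / sigma_g ** 2)
    ival = (big_a + 1) * d_star ** 2 + big_a - 1
    return ks, gini, ival


def ks_brute_force(mu_g, sigma_g, mu_b, sigma_b, steps=20000):
    """Maximum gap between the two distribution functions on a fine grid."""
    good, bad = NormalDist(mu_g, sigma_g), NormalDist(mu_b, sigma_b)
    lo = min(mu_g, mu_b) - 6 * max(sigma_g, sigma_b)
    hi = max(mu_g, mu_b) + 6 * max(sigma_g, sigma_b)
    grid = (lo + (hi - lo) * k / steps for k in range(steps + 1))
    return max(abs(bad.cdf(x) - good.cdf(x)) for x in grid)


# Hypothetical parameters with unequal standard deviations.
ks, gini, ival = general_indexes(mu_g=1.0, sigma_g=2.0, mu_b=0.0, sigma_b=1.0)
ks_check = ks_brute_force(mu_g=1.0, sigma_g=2.0, mu_b=0.0, sigma_b=1.0)
print(f"KS = {ks:.4f} (grid search {ks_check:.4f}), Gini = {gini:.4f}, I_val = {ival:.2f}")
```

The same check can be repeated with the roles of the two standard deviations swapped; the closed form and the grid search agree in both cases.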
[Figures: Gini and KS as functions of $\mu_g$ and $\sigma_g^2$, for $\mu_b = 0$, $\sigma_b^2 = 1$]
• Gini > KS
 Both KS and Gini react very
strongly to a change in $\mu_g$, but
remain almost unchanged
in the direction of $\sigma_g^2$.
[Figures: Lift10% and I_val as functions of $\mu_g$ and $\sigma_g^2$, for $\mu_b = 0$, $\sigma_b^2 = 1$]
 For the Lift10% index, a strong dependence on $\mu_g$ is evident,
together with a markedly stronger dependence on $\sigma_g^2$ than
for KS and Gini.
 For I_val there is again a very strong dependence on $\mu_g$. In addition,
the value of I_val tends to infinity very quickly as $\sigma_g^2$ approaches zero.
ROC (Receiver operating characteristic)
ROC – TPR, FPR
TPR = TP / P = TP / (TP + FN)
FPR = FP / N = FP / (FP + TN)
ROC curve
$$x = FPR = 1 - Specificity, \qquad y = TPR = Sensitivity$$
ROC - ACC
Accuracy:
ACC = (TP + TN) / (P + N)
       A      B      C      C'
TPR    0.63   0.77   0.24   0.88
FPR    0.28   0.77   0.88   0.24
ACC    0.68   0.50   0.18   0.82
Confusion matrices (row totals on the right, column totals 100, 100, 200 in each case):
A:   TP=63  FP=28  |  91      FN=37  TN=72  |  109
B:   TP=77  FP=77  |  154     FN=23  TN=23  |  46
C:   TP=24  FP=88  |  112     FN=76  TN=12  |  88
C':  TP=88  FP=24  |  112     FN=12  TN=76  |  88
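The rates for the four classifiers follow from their confusion matrices by the definitions TPR = TP / (TP + FN), FPR = FP / (FP + TN) and ACC = (TP + TN) / (P + N). This is a short Python sketch of that arithmetic; the function and dictionary names are illustrative.

```python
def roc_rates(tp, fp, fn, tn):
    """Return (TPR, FPR, ACC) for one confusion matrix."""
    p = tp + fn  # actual positives
    n = fp + tn  # actual negatives
    return tp / p, fp / n, (tp + tn) / (p + n)


# Confusion matrices of the classifiers A, B, C and C' as (TP, FP, FN, TN).
classifiers = {
    "A": (63, 28, 37, 72),
    "B": (77, 77, 23, 23),
    "C": (24, 88, 76, 12),
    "C'": (88, 24, 12, 76),
}

results = {name: roc_rates(*cells) for name, cells in classifiers.items()}
for name, (tpr, fpr, acc) in results.items():
    print(f"{name}: TPR = {tpr:.2f}, FPR = {fpr:.2f}, ACC = {acc:.3f}")
```

C' is C with its predictions inverted, so its TPR and FPR are the FPR and TPR of C swapped, and its accuracy is one minus the accuracy of C.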
ROC – AUC, Gini
 AUC (area under curve, i.e. the area under the ROC curve) equals
the probability that the model assigns a randomly chosen good client
a higher score than a randomly chosen bad client.
It can be shown that the area under the ROC curve can be
expressed by means of the Mann-Whitney U, which tests the difference
in medians between two groups of continuous scores. AUC can also be
expressed through the Gini coefficient by the formula
Gini + 1 = 2 × AUC
Further evaluation charts
[Figures: boxplot, histogram, PD – absolute, PD – cumulative, each for SC1 and SC2]
Evaluation procedures
 evaluation on training data
Evaluating a model on the training data used to fit it is not suitable
for assessing model quality and has little informative value,
because the model may often be overfitted. An estimate of the
model's predictive quality on the training data is called a
resubstitution or internal estimate. Estimates of quality indicators
computed on training data are overestimated; test data, set aside
for this purpose during data preparation, are therefore used instead.
 evaluation on test data
Evaluation on test data does have adequate informative
value, since these data were not used to build the model.
Certain requirements are placed on test data. The test set
should contain a sufficient amount of data and should
represent the characteristics of the training data.
The empirically recommended split is 75% of cases for training
and 25% for testing. Adequate representativeness is
ensured by stratified random sampling.
 cross-validation
When there are too few observations to split the data set into training and test data
for model evaluation, cross-validation is appropriate. Its advantage over splitting
the data set is that every case is used to build the model
and every case is used at least once for testing. The procedure is as follows:
• The data set is randomly divided into n disjoint subsets so that each subset
contains approximately the same number of records. The samples are stratified by class
so that the class proportions in each subset are roughly the same as in
the whole data set.
• Of these n disjoint subsets, n-1 are set aside for building the model (construction
subset) and the remaining subset (validation subset) is used to evaluate it.
The model is thus evaluated on data from which it was not built, and its
predictive quality is estimated on that subset.
• The whole procedure is repeated n times and the partial estimates of the quality indicators are averaged. The size of a validation
subset is approximately the number of cases divided by the number of validation subsets.
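The cross-validation procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's SAS implementation: `stratified_folds` deals the records of each class round-robin into the subsets, and `evaluate` is a hypothetical callback that would fit a model on the construction records and return a quality indicator for the validation records (here it merely returns the bad rate of the validation subset).

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Split record indices into n_folds disjoint, class-stratified subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)  # deal the class round-robin
    return folds


def cross_validate(labels, n_folds, evaluate):
    """Average the quality estimates over the n_folds validation subsets."""
    folds = stratified_folds(labels, n_folds)
    estimates = []
    for k, validation in enumerate(folds):
        construction = [i for j, fold in enumerate(folds) if j != k for i in fold]
        estimates.append(evaluate(construction, validation))
    return sum(estimates) / n_folds


# Hypothetical data: 90 good (1) and 10 bad (0) clients, 5 subsets.
labels = [1] * 90 + [0] * 10
folds = stratified_folds(labels, n_folds=5)


def bad_rate(construction, validation):
    """Stand-in for model fitting and scoring: bad rate of the validation subset."""
    return sum(1 for i in validation if labels[i] == 0) / len(validation)


average = cross_validate(labels, 5, bad_rate)
print([len(f) for f in folds], average)
```

Each of the five subsets receives 18 good and 2 bad clients, so the class proportions of the whole data set are preserved in every subset.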
 bootstrap method
The bootstrap method examines the characteristics of individual resampled samples drawn from
the empirical sample. If the original sample contains m elements, each of them has a chance of appearing in
a resampled sample. In complete resampling with sample size n, all possible
samples are considered, so there are $m^n$ possible samples. Complete resampling is theoretically feasible but
would take a great deal of time. The alternative is Monte Carlo simulation, which approximates complete
resampling by drawing B random samples (usually 500 – 10000) with replacement
(each drawn element is returned to the urn). Given data X = {X1, …, Xn} and
a required estimate of the parameter θ, B samples are drawn from the original data and an estimate
of θ is computed for each sample. The bootstrap estimate of the parameter is the mean of the partial estimates. For
model evaluation, the parameter θ is the chosen indicator of predictive quality.
 jackknife
This method is based on a sequential strategy of removing and returning elements of a sample of size n. For
a data set of n elements, the procedure generates n samples of n-1 elements each. The parameter value
is estimated for each reduced sample of size n-1. The partial estimates are then
averaged, as in the bootstrap method.
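Both resampling schemes reduce to "re-estimate on many modified samples and average". The following Python sketch illustrates this with the sample mean as the parameter θ; the data values, the number of resamples and the function names are hypothetical choices for illustration.

```python
import random
from statistics import mean


def bootstrap_estimate(data, estimator, n_resamples=1000, seed=0):
    """Mean of the estimator over B samples drawn with replacement."""
    rng = random.Random(seed)
    estimates = [estimator(rng.choices(data, k=len(data)))
                 for _ in range(n_resamples)]
    return mean(estimates)


def jackknife_estimate(data, estimator):
    """Mean of the estimator over the n leave-one-out samples of size n-1."""
    estimates = [estimator(data[:i] + data[i + 1:]) for i in range(len(data))]
    return mean(estimates)


# Hypothetical sample; theta is the mean, so both estimates should be near 6.
data = [2.0, 4.0, 6.0, 8.0, 10.0]
theta_boot = bootstrap_estimate(data, mean)
theta_jack = jackknife_estimate(data, mean)
print(theta_boot, theta_jack)
```

For model evaluation, `estimator` would instead refit the model on the resampled data and return the chosen quality indicator, for example Gini or KS.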
Bootstrap
 Basic procedure of the method (n observations):
 Generate k (usually 20 or more) independent samples of n elements, drawn with replacement from
the original data
 These samples from the sample of n observations serve as an approximate substitute for
independent samples from the whole population; this is the idea of the method
 On each of the k samples we estimate the model in the same way as on the original
sample
 The population of k different results lets us estimate the stability of the
estimated parameters
 Generalizability can be estimated using the elements that were not selected in a given
round, the OOB (out-of-bag) estimates, i.e. by using them as test sets
 On average, these OOB elements make up 36.8% of the observations
Portrait: Bradley Efron, who published this version of the Monte Carlo method in 1979
Jan Spousta: Přednášky k data miningu.
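The 36.8% share of out-of-bag elements follows from the probability $(1 - 1/n)^n \approx e^{-1}$ that a given observation is never drawn in n draws with replacement. The following is a small Python sketch that compares this value with a simulation; the sample size, the number of rounds and the seed are arbitrary choices for illustration.

```python
import random


def oob_fraction(n, n_rounds=200, seed=0):
    """Average share of the n observations not drawn in a bootstrap sample of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        drawn = {rng.randrange(n) for _ in range(n)}  # distinct indices drawn
        total += (n - len(drawn)) / n
    return total / n_rounds


n = 1000
expected = (1 - 1 / n) ** n   # probability that one observation is never drawn
simulated = oob_fraction(n)
print(f"expected {expected:.4f}, simulated {simulated:.4f}")
```

The out-of-bag observations of each round can then serve as a test set for the model estimated on that round's bootstrap sample.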
12. Introduction to the SAS macro language.
%MACRO macro-name(keyword=value, …, keyword=value);
macro text
%MEND <macro-name>;
Purpose of the Macro Facility
The macro facility is a text processing facility for automating
and customizing SAS code. The macro facility helps
minimize the amount of SAS code you must type to perform
common tasks.
The macro facility supports the following:
 symbolic substitution within SAS code
 automated production of SAS code
 dynamic generation of SAS code
 conditional construction of SAS code
Purpose of the Macro Facility
The macro facility enables you to do the following:
 create and resolve macro variables anywhere within
a SAS program
 write and call macro programs (macro definitions or
macros) that generate custom SAS code
Substituting System Values
Example: Include system values within SAS footnotes.
proc print data=orion.customer;
title "Customer List";
footnote1 "Created 10:24 Monday, 31MAR2008";
footnote2 "on the WIN System Using SAS 9.2";
run;
Automatic macro variables store system values that can be
used to avoid hardcoding.
Substituting User-Defined Values
 Example: Reference the same value repeatedly
throughout a program.
User-defined macro variables enable you to define a value once
and substitute that value repeatedly within a program.
proc freq data=orion.order_fact;
where year(order_date)=2008;
table order_type;
title "Order Types for 2008";
run;
proc means data=orion.order_fact;
where year(order_date)=2008;
class order_type;
var Total_Retail_Price;
title "Price Statistics for 2008";
run;
Conditional Processing
 Example: Generate a detailed report on a daily basis. Generate an
additional report every Friday, summarizing data on a weekly basis.
Daily report:
proc print data=orion.orders;
run;
Is it Friday? If yes, additionally:
proc means data=orion.orders;
run;
A macro program can conditionally execute selected portions of a
SAS program based on user-defined conditions.
Repetitive Processing
Example: Generate a similar report each year from 2008 to 2010.
A macro program can generate SAS code repetitively,
substituting different values with each iteration.
proc print data=orion.year2008;
run;
proc print data=orion.year2009;
run;
proc print data=orion.year2010;
run;
Data-Driven Applications
Example: Create separate subsets of a selected data set for each unique
value of a selected variable.
A macro program can generate data-driven code.
data AU CA DE IL TR US ZA;
set orion.customer;
select(country);
when("AU") output AU;
when("CA") output CA;
when("DE") output DE;
when("IL") output IL;
when("TR") output TR;
when("US") output US;
when("ZA") output ZA;
otherwise;
end;
run;
Global Macro Variables
 When SAS is invoked, a global symbol table is created and initialized with
automatic or system-defined macro variables.
You can also create user-defined global macro variables with the %LET
statement:
%let city=Denver;
%let date=05JAN2001;
Global Symbol Table
Automatic variables:
. .
SYSTIME 09:47
SYSVER 9.1
. .
User-defined variables:
CITY Denver
DATE 05JAN2001
Referencing a Macro Variable
 To substitute the value of a macro variable in your
program, you must reference it.
A macro variable reference
 is made by preceding the macro variable name with
an ampersand (&)
 causes the macro processor to search for the named
macro variable and return its value from the symbol
table if the variable exists.
Referencing a Macro Variable
Example: Use a macro variable reference to make a
substitution in a SAS program statement.
where OrderCity="&city";
generates
WHERE ORDERCITY="Denver";
Global Symbol Table
CITY Denver
DATE 05JAN2001
Referencing a Macro Variable
Referencing a nonexistent macro variable results
in a warning message.
title "Orders from &cityst";
generates
WARNING: Apparent symbolic reference CITYST not resolved.
 When the macro processor
cannot act upon a macro
variable reference, a message
is printed in the SAS log.
Global Symbol Table
CITY Denver
DATE 05JAN2001
Referencing a Macro Variable
Referencing an invalid macro variable name results
in an error message.
title "Orders from &THE_CITY_IN_WHICH_THE_ORDER_ORIGINATED";
generates
ERROR: Symbolic variable name
THE_CITY_IN_WHICH_THE_ORDER_ORIGINATED
must be 32 or fewer characters long.
Global Symbol Table
CITY Denver
DATE 05JAN2001
Displaying Macro Variable Values
 Use the SYMBOLGEN system option to monitor the value that
is substituted for a macro variable referenced.
General form of the SYMBOLGEN system option:
OPTIONS SYMBOLGEN;
This system option displays the results of resolving macro variable
references in the SAS log.
 The default option setting is NOSYMBOLGEN.
Displaying Macro Variable Values
Partial SAS Log
Why is no message displayed for the final example?
where CustomerAddress2 contains "&city";
SYMBOLGEN: Macro variable CITY resolves to Denver
where CustomerAddress2 contains '&city';
Global Symbol Table
CITY Denver
DATE 05JAN2001
Displaying Macro Variable Values
 To verify the values of macro variables, you may want
to write your own messages to the SAS log. The %PUT statement
writes text to the SAS log.
General form of the %PUT statement:
%PUT text ;
Displaying Macro Variable Values
 The %PUT statement
 writes to the SAS log only
 always writes to a new log line starting in column one
 writes a blank line if text is not specified
 does not require quotation marks around text
 resolves references to macro variables in text before text is written
 removes leading and trailing blanks from text unless
a macro quoting function is used
 wraps lines when the length of text is greater than
the current line size setting
 can be used inside or outside a macro definition.
Displaying Macro Variable Values
Example: Write a message to the SAS log to verify
the value of the macro variable CITY.
Partial SAS Log
%put The value of the macro variable CITY is: &city;
The value of the macro variable CITY is: Denver
System-Defined Automatic Macro Variables
 Some automatic macro variables have fixed values that are set at SAS invocation:
Name      Value
SYSDATE   date of SAS invocation (DATE7.)
SYSDATE9  date of SAS invocation (DATE9.)
SYSDAY    day of the week of SAS invocation
SYSTIME   time of SAS invocation
SYSENV    FORE (interactive execution) or BACK (noninteractive or batch execution)
SYSSCP    abbreviation for the operating system used, such as OpenVMS, WIN, HP 300
SYSVER    release of SAS software being used
SYSJOBID  identifier of the current SAS session or batch job
          (mainframe systems: the userid or job name; other systems: the process ID (PID))
 Some automatic macro variables have values that change
automatically, based on submitted SAS statements:
Name      Value
SYSLAST   name of the most recently created SAS data set in the form libref.name.
          If no data set was created, the value is _NULL_.
SYSPARM   text specified at program invocation.
Displaying Automatic Macro Variables
 The values of automatic macro variables can be displayed in
the SAS log by specifying the _AUTOMATIC_ argument in the
%PUT statement.
%PUT _AUTOMATIC_;
 The values of the macro variables
SYSDATE, SYSDATE9, and SYSTIME are character strings,
not SAS date or time values.
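Because these values are text, they must be converted before they can be used in date arithmetic. A minimal sketch, assuming the usual date-literal idiom (the variable names start_date and days_since are illustrative):

```sas
/* &SYSDATE9 resolves to text such as 05JAN2001, not to a number. */
%put Session started on &sysdate9 (&sysday) at &systime;

/* Wrapping the text in a date literal (double quotation marks, */
/* so that the reference is resolved) yields a SAS date value.  */
data _null_;
   start_date = "&sysdate9"d;           /* numeric SAS date                */
   days_since = today() - start_date;   /* days since the session started  */
   put start_date= date9. days_since=;
run;
```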
The %LET Statement
 The %LET statement enables you to define a macro variable and
assign it a value.
General form of the %LET statement:
%LET variable=value;
 Rules for the %LET statement:
• Variable can be any name following the SAS naming
convention.
• Value can be any string.
• If variable already exists in the symbol table, value replaces the
current value.
• If either variable or value contains a macro statement or macro
variable reference, the trigger is evaluated before the
assignment is made.
 The maximum length is 64K characters.
 The minimum length is 0 characters (null value).
 Numeric tokens are stored as character strings.
 Mathematical expressions are not evaluated.
 The case of value is preserved.
 Quotes bounding literals are stored as part of value.
 Leading and trailing blanks are removed from value
before the assignment is made.
%LET Statement Examples
• Use the rules above to determine the values
assigned to macro variables by these %LET statements:
Statement                     Value
%let name=Ed Norton;          Ed Norton
%let name2=' Ed Norton ';     ' Ed Norton '
%let title="Joan's Report";   "Joan's Report"
%let start=;                  (null value)
%let total=0;                 0
%let sum=3+4;                 3+4
%let total=&total+&sum;       0+3+4
%let x=varlist;               varlist
Deleting User-Defined Macro Variables
 The %SYMDEL statement deletes one or more user-defined macro
variables from the global symbol table.
To release memory, delete macro variables from the global symbol table
when they are no longer needed.
General form of the %SYMDEL statement:
%SYMDEL macro-variables;
Example: Delete the macro variables OFFICE and UNITS.
%symdel office units;
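To check that the deletion worked, one option is the %SYMEXIST function, which is not covered on the slide. It returns 1 if a macro variable exists and 0 otherwise. A minimal sketch, with illustrative values for OFFICE and UNITS:

```sas
%let office=Sydney;
%let units=4;

/* 1 = the macro variable exists */
%put OFFICE exists before deletion: %symexist(office);

%symdel office units;

/* 0 = the macro variable no longer exists */
%put OFFICE exists after deletion: %symexist(office);
```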
Referencing Macro Variables
 You can reference macro variables anywhere in your
program, including these special situations:
Macro variable references adjacent to leading and/or trailing
text:
text&variable
&variabletext
text&variabletext
Adjacent macro variable references:
&variable&variable
Combining Macro Variables with Text
%let month=jan;
proc chart data=orion.y2000&month;
hbar week / sumvar=sale;
run;
proc plot data=orion.y2000&month;
plot sale*day;
run;
generates
proc chart data=orion.y2000jan;
hbar week / sumvar=sale;
run;
proc plot data=orion.y2000jan;
plot sale*day;
run;
 This example illustrates adjacent macro variable references.
Example: Modify the previous program to allow
both the month and the year to be
substituted.
%let year=2000;
%let month=jan;
proc chart data=orion.y&year&month;
hbar week / sumvar=sale;
run;
proc plot data=orion.y&year&month;
plot sale*day;
run;
 You can place text immediately after a macro variable reference if it does not
change the reference.
Example: Modify the previous program to
substitute the name of an analysis
variable.
%let year=2000;
%let month=jan;
%let var=sale;
proc chart data=orion.y&year&month;
hbar week / sumvar=&var;
run;
proc plot data=orion.y&year&month;
plot &var*day;
run;
Macro Variable Name Delimiter
The word scanner recognizes the end of a macro variable
reference when it encounters a character that cannot be part
of the reference.
A period (.) is a special delimiter that ends a macro variable
reference. The period does not appear as text when the macro
variable is resolved.
%let graphics=g;
%let year=2000;
%let month=jan;
%let var=sale;
proc &graphics.chart data=orion.y&year&month;
hbar week / sumvar=&var;
run;
proc &graphics.plot data=orion.y&year&month;
plot &var*day;
run;
Use another period after the delimiter period to supply the
needed token.
%let lib=orion;
...
libname &lib 'SAS-data-library';
proc &graphics.chart data=&lib..y&year&month;
...
proc &graphics.plot data=&lib..y&year&month;
The first period is treated as a delimiter, the second
as text.
proc &graphics.chart data=&lib..y&year&month;
The compiler receives this:
proc gchart data=orion.y2000jan;
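The effect of one versus two periods can be checked in the log with %PUT. This is a small sketch that reuses the macro variable values from the slides:

```sas
%let lib=orion;
%let year=2000;
%let month=jan;

/* One period: it only ends the reference, so no period remains. */
/* Writes: oriony2000jan                                          */
%put &lib.y&year&month;

/* Two periods: the first ends the reference, the second is text. */
/* Writes: orion.y2000jan                                          */
%put &lib..y&year&month;
```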
Macro Functions
Selected character string manipulation functions:
%UPCASE translates letters from lowercase to uppercase.
%SUBSTR extracts a substring from a character string.
%SCAN extracts a word from a character string.
%INDEX searches a character string for specified text.
Other functions:
%EVAL performs arithmetic and logical operations.
%SYSFUNC executes SAS functions.
%STR quotes special characters.
%NRSTR quotes special characters, including macro triggers.
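A brief sketch of %UPCASE, %INDEX, %STR, and %NRSTR follows. The macro variable names and values are illustrative only.

```sas
%let city=Denver;

/* Writes: DENVER */
%put %upcase(&city);

/* Position of the text "ver" within Denver. Writes: 4 */
%put %index(&city,ver);

/* %STR stores the semicolons as text instead of ending the %LET statement. */
%let step=%str(proc print; run;);
%put &step;

/* %NRSTR also hides the macro trigger &, so the first reference is not resolved. */
/* Writes: &city resolves to Denver                                              */
%put %nrstr(&city) resolves to &city;
```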
The %SUBSTR Function
General form of the %SUBSTR function:
%SUBSTR(argument, position <,n>)
• The %SUBSTR function returns the portion of argument
beginning at position for a length of n characters.
• When n is not supplied, the %SUBSTR function returns the
portion of argument beginning at position to the end of argument.
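For illustration, a minimal sketch applied to a date string in DATE9. form (the macro variable DATE is the one shown in the symbol table on earlier slides):

```sas
%let date=05JAN2001;

/* Three characters starting at position 3. Writes: JAN */
%put %substr(&date,3,3);

/* From position 6 to the end of the string. Writes: 2001 */
%put %substr(&date,6);
```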
The %SCAN Function
General form of the %SCAN function:
%SCAN(argument, n <,delimiters>)
The %SCAN function does the following:
• returns the nth word of argument, where words
are strings of characters separated by delimiters
• uses a default set of delimiters if none are specified
• returns a null string if there are fewer than n words
in argument
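As a small sketch, %SCAN can split a two-level data set name at the period. The macro variable name DSN is illustrative.

```sas
%let dsn=orion.y2000jan;

/* First word, using the period as the delimiter. Writes: orion */
%put %scan(&dsn,1,.);

/* Second word. Writes: y2000jan */
%put %scan(&dsn,2,.);

/* There is no third word, so a null string is returned. Writes: [] */
%put [%scan(&dsn,3,.)];
```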
The %EVAL Function
General form of the %EVAL function:
%EVAL(expression)
The %EVAL function does the following:
• performs arithmetic and logical operations
• truncates non-integer results
• returns a text result
• returns 1 (true) or 0 (false) for logical operations
• returns a null value and issues an error message when noninteger
values are used in arithmetic operations
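The points above can be checked with a few %PUT statements. The %SYSEVALF function at the end is not covered on this slide; it is shown only as the usual alternative when floating-point arithmetic is needed.

```sas
/* Integer arithmetic. Writes: 14 */
%put %eval(2+3*4);

/* The non-integer result is truncated. Writes: 3 */
%put %eval(10/3);

/* Logical operations return 1 (true) or 0 (false). Writes: 1 */
%put %eval(5>2);

/* A noninteger operand causes an error, so this line is left commented out. */
/* %put %eval(3.5+1); */

/* %SYSEVALF evaluates with floating-point arithmetic. Writes: 4.5 */
%put %sysevalf(3.5+1);
```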
 Example: Compute the first year of a range based on the
current date.
%let thisyr=%substr(&sysdate9,6);
%let lastyr=%eval(&thisyr-1);
proc means data=orion.order_fact maxdec=2 min max mean;
class order_type;
var total_retail_price;
where year(order_date) between &lastyr and &thisyr;
title1 "Orders for &lastyr and &thisyr";
title2 "(as of &sysdate9)";
run;
The %SYSFUNC Function
The %SYSFUNC macro function executes SAS functions.
General form of the %SYSFUNC function:
%SYSFUNC(SAS function(argument(s)) <,format>)
• SAS function(argument(s)) is the name of a SAS function
and its corresponding arguments.
• The second argument is an optional format for
the value returned by the first argument.
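A minimal sketch of %SYSFUNC with the TODAY and YEAR functions follows. Unlike &SYSDATE9, which holds the date the session started, TODAY() returns the current date. The macro variable name CURYEAR is illustrative.

```sas
/* Format the value returned by TODAY() with the WEEKDATE. format. */
%put Today is %sysfunc(today(), weekdate.);

/* Each SAS function needs its own %SYSFUNC when calls are nested. */
%let curyear=%sysfunc(year(%sysfunc(today())));
%put Current year: &curyear;

/* %SYSFUNC can also be used inside a double-quoted title. */
title "Report created on %sysfunc(today(), date9.)";
```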
Defining a Macro
A macro or macro definition enables you to write macro programs.
General form of a macro definition:
%MACRO macro-name;
macro-text
%MEND <macro-name>;
macro-name follows SAS naming conventions.
macro-text can include the following:
• any text
• SAS statements or steps
• macro variable references
• macro statements, expressions, or calls
• any combination of the above
Calling a Macro
A macro call
 causes the macro to execute
 is specified by placing a percent sign before the name of the macro
 can be made anywhere in a program (similar to a macro variable
reference)
 represents a macro trigger
 is not a statement (no semicolon required).
General form of a macro call:
%macro-name
Simple Macro
 A macro can generate SAS code.
 Example: Write a macro that generates a PROC MEANS
step. Reference macro variables within the macro.
• This macro contains no macro language statements.
%macro calc;
proc means data=orion.order_item &stats;
var &vars;
run;
%mend calc;
Example: Call the CALC macro. Precede the call with
%LET statements that populate macro variables
referenced within the macro.
%let stats=min max;
%let vars=quantity;
%calc
Macro Parameters
Example: Define a macro with a parameter list of macro
variables referenced within the macro.
%macro calc(stats,vars);
proc means data=orion.order_item &stats;
var &vars;
run;
%mend calc;
Positional Parameters
Positional parameters use a one-to-one correspondence
between the following:
 parameter names supplied on the macro definition
 parameter values supplied on the macro call
%macro calc(stats,vars);
proc means data=orion.order_item &stats;
var &vars;
run;
%mend calc;
%calc(min max,quantity)
Keyword Parameters
 A parameter list can include keyword parameters.
 General form of a macro definition with keyword parameters:
%MACRO macro-name(keyword=value, …, keyword=value);
macro text
%MEND <macro-name>;
Keyword parameters are assigned a default value after
an equal sign (=).
General form of a macro call with keyword parameters:
%macro-name(keyword=value, …, keyword=value)
keyword=value combinations can be
• specified in any order
• omitted from the call without placeholders.
If omitted from the call, a keyword parameter receives
its default value.
Example: Assign default parameter values by defining
the macro with keyword parameters.
%macro count(opts=,start=01jan04,stop=31dec04);
proc freq data=orion.orders;
where order_date between
"&start"d and "&stop"d;
table order_type / &opts;
title1 "Orders from &start to &stop";
run;
%mend count;
options mprint;
%count(opts=nocum)
%count(stop=01jul04,opts=nocum nopercent)
%count()
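Positional and keyword parameters can also be combined in one definition, provided that the positional parameters come first. The sketch below is a hypothetical variant of the CALC macro. It uses SASHELP.CLASS, a sample table shipped with SAS, so that it runs without the ORION library.

```sas
/* dsn is positional; stats and vars are keyword parameters with defaults. */
%macro calc(dsn, stats=min max, vars=_numeric_);
   proc means data=&dsn &stats;
      var &vars;
   run;
%mend calc;

/* Both defaults are used. */
%calc(sashelp.class)

/* Keyword parameters can be overridden in any order. */
%calc(sashelp.class, vars=height weight, stats=mean std)
```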
Roland’s SAS Macros:
http://www.datasavantconsulting.com/roland/
Macros by SAS users worldwide:
http://schick.tripod.com/p-index.html
For more, see e.g.:
http://support.sas.com/documentation/cdl/en/mcrolref/61885/HTML/default/viewer.htm#mcrolrefwhatsnew902.htm
http://support.sas.com/documentation/cdl/en/mcrolref/61885/HTML/default/viewer.htm#macro-stmt.htm
http://www.nesug.org/proceedings/nesug03/bt/bt009.pdf
http://www2.sas.com/proceedings/sugi24/Handson/p149-24.pdf
13. Customizing SAS Output/Reports, Exporting from SAS
PROC Print
 The PRINT procedure prints the observations in a SAS data
set, using all or some of the variables. You can create a
variety of reports ranging from a simple listing to a highly
customized report that groups the data and calculates
totals and subtotals for numeric variables.
PROC PRINT <option(s)>;
BY <DESCENDING> variable-1 <...<DESCENDING> variable-n><NOTSORTED>;
PAGEBY BY-variable;
SUMBY BY-variable;
ID variable(s) <option>;
SUM variable(s) <option>;
VAR variable(s) <option>;
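The statements above might be combined as in the following sketch. It uses SASHELP.CLASS, a sample table shipped with SAS, not data from the course. The table is sorted first because the BY statement expects sorted input.

```sas
/* BY-group processing requires data sorted by the BY variable. */
proc sort data=sashelp.class out=work.class;
   by sex;
run;

proc print data=work.class;
   by sex;                 /* one listing section per value of SEX   */
   id name;                /* NAME replaces the Obs column           */
   var age height weight;  /* variables to print, in this order      */
   sum height weight;      /* subtotals per BY group and grand total */
   title 'Class Listing by Sex';
run;

title;                     /* cancel the title afterwards */
```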
More at:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000057825.htm
ods rtf file='your_file.rtf';
proc print data=empdata(obs=12);
id idnumber / style(DATA) = {background = red foreground = white}
style(HEADER) = {background = blue foreground = white};
title 'Personnel Data'; run;
ods rtf close;
PROC Report
 PROC REPORT combines features of PROC PRINT, PROC
SUMMARY, PROC SORT, and the data step – it can sort and
summarize data and perform calculations.
proc sql;
create table smallprod as
select country,
region,
prodtype,
product,
actual label='' format=comma10.2,
predict label='' format=comma10.2,
month
from sashelp.prdsale
where mod(monotonic(), 75) = 0
order by ranuni(94612);
quit;
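The SMALLPROD table created above could then be summarized with PROC REPORT. The following is only a sketch of one possible report, not the report shown in the original slides; the column headings are illustrative.

```sas
proc report data=smallprod nowd;
   column country region actual predict;
   define country / group 'Country';             /* group rows by country  */
   define region  / group 'Region';              /* then by region         */
   define actual  / analysis sum 'Actual Sales'
                    format=comma10.2;            /* summed within groups   */
   define predict / analysis sum 'Predicted Sales'
                    format=comma10.2;
   break after country / summarize;              /* subtotal per country   */
   rbreak after / summarize;                     /* grand total at the end */
run;
```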
More at:
www.excursive.com/sas/ProcReportSlides.pdf
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473620.htm
http://www.sas.com/offices/NA/canada/downloads/presentations/Vancouver_Spring_2007/Craig.pdf
Global Statements
The following are global statements that enhance reports:
 OPTIONS
 TITLE
 FOOTNOTE
 ODS
Global statements are specified anywhere in your
SAS program and they remain in effect until canceled,
changed, or your SAS session ends.
The OPTIONS Statement
The OPTIONS statement changes the value of one
or more SAS system options.
General form of the OPTIONS statement:
OPTIONS option(s);
• Some SAS system options change the appearance
of a report.
• The OPTIONS statement is not usually included
in a PROC or DATA step.
SAS System Options for Reporting
Selected SAS System Options:
DATE (default)
displays the date and time that the SAS session began
at the top of each page of SAS output.
NODATE
does not display the date and time that the SAS session
began at the top of each page of SAS output.
NUMBER (default)
prints page numbers on the first line of each page
of SAS output.
NONUMBER
does not print page numbers on the first line of each page
of SAS output.
PAGENO=n
defines a beginning page number (n) for the next page of
SAS output.
CENTER (default) centers SAS output.
NOCENTER left-aligns SAS output.
PAGESIZE=n
PS=n
defines the number of lines (n) that can be printed
per page of SAS output.
LINESIZE=width
LS=width
defines the line size (width) for the SAS log and
SAS output.
options ls=80 date number;
proc means data=orion.sales;
var Salary;
run;
09:11 Monday, January 14, 2008 35
The MEANS Procedure
Analysis Variable : Salary
  N          Mean       Std Dev       Minimum       Maximum
-------------------------------------------------------------------
165      31160.12      20082.67      22710.00     243190.00
-------------------------------------------------------------------
(output is 80 characters wide)
options nodate pageno=1;
proc freq data=orion.sales;
tables Country;
run;
1
The FREQ Procedure
                                    Cumulative    Cumulative
Country    Frequency     Percent     Frequency       Percent
------------------------------------------------------------
AU                63       38.18            63         38.18
US               102       61.82           165        100.00
(output is 80 characters wide)
The TITLE Statement
The TITLE statement specifies title lines for SAS output.
General form of the TITLE statement:
TITLEn 'text';
• Titles appear at the top of the page.
• The default title is The SAS System.
• The value of n can be from 1 to 10.
• An unnumbered TITLE is equivalent to TITLE1.
• Titles remain in effect until they are changed, canceled, or
you end your SAS session.
The FOOTNOTE Statement
The FOOTNOTE statement specifies footnote lines
for SAS output.
General form of the FOOTNOTE statement:
FOOTNOTEn 'text';
• Footnotes appear at the bottom of the page.
• No footnote is printed unless one is specified.
• The value of n can be from 1 to 10.
• An unnumbered FOOTNOTE is equivalent to FOOTNOTE1.
• Footnotes remain in effect until they are changed, canceled, or
you end your SAS session.
The TITLE and FOOTNOTE Statements
footnote1 'By Human Resource Department';
footnote3 'Confidential';
proc means data=orion.sales;
var Salary;
title 'Orion Star Sales Employees';
run;
Orion Star Sales Employees
The MEANS Procedure
Analysis Variable : Salary
  N          Mean       Std Dev       Minimum       Maximum
---------------------------------------------------------------
165      31160.12      20082.67      22710.00     243190.00
---------------------------------------------------------------
By Human Resource Department
Confidential
Changing Titles and Footnotes
TITLEn or FOOTNOTEn
 replaces a previous title or footnote with the same number
 cancels all titles or footnotes with higher numbers.
Canceling All Titles and Footnotes
• The null TITLE statement cancels all titles.
• The null FOOTNOTE statement cancels all footnotes.
title;
footnote;
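The replacement rule for numbered titles can be sketched as follows. The title texts are illustrative, and SASHELP.CLASS is a sample table shipped with SAS.

```sas
title1 'Orion Star';
title2 'Sales Employees';
title3 'Confidential';

/* This output carries all three title lines. */
proc print data=sashelp.class(obs=3);
run;

/* Replaces TITLE2 and cancels TITLE3; TITLE1 stays in effect. */
title2 'All Employees';

/* This output carries only two title lines. */
proc print data=sashelp.class(obs=3);
run;

/* The null TITLE statement cancels all titles. */
title;
```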
Output Delivery System
Output can be sent to a variety of destinations by using ODS
statements.
[Figure: procedure output routed to the LISTING, HTML, RTF, and PDF destinations]
Default ODS Destination
 The LISTING destination is
the default ODS destination.
 The LISTING destination
directs output to the
OUTPUT window and the
GRAPH window.
ods listing;
proc freq data=orion.sales;
tables Country;
run;
proc gchart
data=orion.sales;
hbar Country / nostats;
run;
The ODS LISTING CLOSE statement stops sending output to the OUTPUT and
GRAPH windows.
A warning will appear in the SAS log if the LISTING destination is closed and no
other destinations are active.
ods listing close;
proc freq data=orion.sales;
tables Country;
run;
proc gchart
data=orion.sales;
hbar Country / nostats;
run;
Partial SAS Log
23 ods listing close;
24
25 proc freq data=orion.sales;
26 tables Country;
27 run;
WARNING: No output destinations active.
NOTE: There were 165 observations read from the data set ORION.SALES.
HTML, PDF, and RTF Destinations
ODS destinations such as HTML, PDF, and RTF
are opened and closed in the following manner:
ODS destination FILE='filename.ext' <options>;
SAS code to generate report(s)
ODS destination CLOSE;
Single Destination
Output can be sent to only one destination.
ods listing close;
ods html file='example.html';
proc freq data=orion.sales;
tables Country;
run;
ods html close;
ods listing;
It is a good habit to open the LISTING destination at the end of
a program to guarantee an open destination for the next submission.
Multiple Destinations
Output can be sent to many destinations.
To view the results, all destinations except the LISTING
destination must be closed.
ods listing;
ods pdf file='example.pdf';
ods rtf file='example.rtf';
proc freq data=orion.sales;
tables Country;
run;
ods pdf close;
ods rtf close;
Use _ALL_ in the ODS CLOSE statement to close all open
destinations including the LISTING destination.
ods listing;
ods pdf file='example.pdf';
ods rtf file='example.rtf';
proc freq data=orion.sales;
tables Country;
run;
ods _all_ close;
ods listing;
Multiple Procedures
Output from many procedures can be sent to ODS destinations.
ods listing;
ods pdf file='example.pdf';
ods rtf file='example.rtf';
proc freq data=orion.sales;
tables Country;
run;
proc means data=orion.sales;
var Salary;
run;
ods _all_ close;
ods listing;
File Location
A path can be specified to control where the file is stored.
If no path is specified, the file is saved in the current
default directory.
ods html file='s:\workshop\example.html';
proc freq data=orion.sales;
tables Country;
run;
proc means data=orion.sales;
var Salary;
run;
ods html close;
STYLE= Option
Use a STYLE= option in the ODS destination statement
to specify a style definition.
ODS destination FILE='filename.ext' STYLE=style-definition;
• A style definition describes how to display the presentation
aspects, such as colors and fonts, of SAS output.
• STYLE= cannot be used with the LISTING destination.
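As a sketch, the same report can be written to two destinations with different styles. The file names are hypothetical, and SASHELP.CLASS is a sample table shipped with SAS. PROC TEMPLATE can list the style definitions available in a given installation.

```sas
/* List the style definitions available in this installation. */
proc template;
   list styles;
run;

/* Same report, two destinations, two different styles. */
ods html file='example_sasweb.html' style=sasweb;
ods pdf  file='example_journal.pdf' style=journal;

proc freq data=sashelp.class;
   tables sex;
run;

ods html close;
ods pdf close;
```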
SAS Supplied Style Definitions
Analysis Astronomy Banker BarrettsBlue
Beige blockPrint Brick Brown
Curve D3d Default Education
EGDefault Electronics fancyPrinter Festival
FestivalPrinter Gears Journal Magnify
Meadow MeadowPrinter Minimal Money
NoFontDefault Normal NormalPrinter Printer
Rsvp Rtf sansPrinter sasdocPrinter
Sasweb Science Seaside SeasidePrinter
serifPrinter Sketch Statdoc Statistical
Theme Torn Watercolor
The following style definitions are new to SAS 9.2:
grayscalePrinter Harvest HighContrast
Journal2 Journal3 Listing
monochromePrinter Ocean Solutions
Examples
[Figures: sample output rendered with STYLE=DEFAULT, STYLE=SASWEB, STYLE=PRINTER, and STYLE=JOURNAL]
Destinations Used with Excel
The following destinations create files that can be opened in
Excel.
Destination   Type of File                 Viewed In
CSVALL        Comma-Separated Value        Editor or Microsoft Excel
MSOFFICE2K    Hypertext Markup Language    Web Browser, Microsoft Word, or Microsoft Excel
EXCELXP       Extensible Markup Language   Microsoft Excel
CSVALL Destination
CSVALL does not include any style information.
ods csvall file='myexcel.csv';
proc freq data=orion.sales;
tables Country;
run;
proc means data=orion.sales;
var Salary;
run;
ods csvall close;
MSOFFICE2K Destination
ods msoffice2k file='myexcel.html';
proc freq data=orion.sales;
tables Country;
run;
proc means data=orion.sales;
var Salary;
run;
ods msoffice2k close;
MSOFFICE2K keeps the style information including spanning headers.
EXCELXP Destination
ods tagsets.excelxp file='myexcel.xml';
proc freq data=orion.sales;
tables Country;
run;
proc means data=orion.sales;
var Salary;
run;
ods tagsets.excelxp close;
EXCELXP keeps the style information and each procedure is a separate sheet.
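The worksheet names can be controlled with the SHEET_NAME suboption of the ExcelXP tagset. The sketch below assumes a tagset version that supports this suboption; the file name and sheet names are hypothetical, and SASHELP.CLASS is a sample table shipped with SAS.

```sas
/* Name the first worksheet when the destination is opened. */
ods tagsets.excelxp file='class_report.xml' style=sasweb
    options(sheet_name='Sex Frequencies');

proc freq data=sashelp.class;
   tables sex;
run;

/* Change the name used for the next worksheet. */
ods tagsets.excelxp options(sheet_name='Height Statistics');

proc means data=sashelp.class;
   var height;
run;

ods tagsets.excelxp close;
```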
Note: the file you are creating is not a native Excel file. CSVALL writes a CSV file,
MSOFFICE2K an HTML file, and EXCELXP an XML file.
Data _null_
/*WRITE RAW DATA SEPARATED
BY BLANKS*/
data _null_;
set iris;
file "c:\temp\iris.dat";
put sepallen
sepalwid
petallen
petalwid
species;
run;
/*WRITE RAW DATA
SEPARATED BY TABS*/
data _null_;
set iris;
file "c:\temp\iris.txt"
dlm="09"X;
put sepallen
sepalwid
petallen
petalwid
species;
run;
/*WRITE RAW DATA
SEPARATED BY COMMAS*/
data _null_;
set iris;
file "c:\temp\iris.csv" dlm=",";
put sepallen
sepalwid
petallen
petalwid
species;
run;
/*WRITE RAW DATA INTO SPECIFIED COLUMNS*/
data _null_;
set iris;
file "c:\temp\iris_column.dat";
put species 1-10
sepallen 12-15
sepalwid 17-20
petallen 22-25
petalwid 27-30;
run;
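A common variation on the comma-separated example adds a header row and the DSD option, which quotes values that themselves contain a comma. This is a minimal sketch; the output file name is hypothetical, and SASHELP.CLASS (a sample table shipped with SAS) stands in for the IRIS table.

```sas
/* WRITE A CSV FILE WITH A HEADER ROW */
data _null_;
   set sashelp.class;
   /* DSD: comma-delimited output; values containing commas are quoted */
   file "class.csv" dsd;
   /* write the column names once, before the first data row */
   if _n_ = 1 then put "name,sex,age,height,weight";
   put name sex age height weight;
run;
```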
Proc Export
PROC EXPORT DATA= WORK.IRIS
OUTFILE= "C:\temp\iris.xls"
DBMS=EXCEL REPLACE;
SHEET="sheet1";
RUN;
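DBMS=EXCEL typically requires the SAS/ACCESS Interface to PC Files. Where that is not available, exporting to CSV is one alternative. A minimal sketch, with a hypothetical output file name and the SASHELP.CLASS sample table:

```sas
/* Export a table to a CSV file; REPLACE overwrites an existing file. */
proc export data=sashelp.class
     outfile="class_export.csv"
     dbms=csv replace;
run;
```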
In general:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000316288.htm
[Figure: Export Wizard]
14. References
Other interesting sources of information
http://www.cs.uiuc.edu/homes/hanj/
http://www-users.cs.umn.edu/~kumar/
 http://www.kdnuggets.com/
 http://www.kdnuggets.com/datasets/competitions.html
 http://www.crc.man.ed.ac.uk/conference/
 http://www.crc.man.ed.ac.uk/conference/archive/
 http://www.kmining.com/info_conferences.html
 http://en.wikipedia.org/wiki/Data_mining
 http://cs.wikipedia.org/wiki/Data_mining
 http://en.wikipedia.org/wiki/Credit_scorecards
Useful data sources
http://archive.ics.uci.edu/ml/
http://kdd.ics.uci.edu/
http://sede.neurotech.com.br:443/PAKDD2009/
http://www.dataminingbook.com/
http://www.stat.uni-muenchen.de/service/datenarchiv/welcome_e.html