2. Input data characteristics
Process of data visualization
• Generating data
– Measuring, simulation, modeling
• Can be lengthy (measuring, simulation) and costly
(simulation, modeling)
• Visualization (the rest of the visualization pipeline)
– Visual mapping, rendering
• Can be fast or slow, depending on hardware and
implementation
• Interaction (user feedback)
– How the user can interact with the visualization
Passive visualization
• The following three steps are strictly separated
– Generating data – the next step starts only after this phase has finished
– Off-line visualization
• Displaying the generated data
• Result is a video or animation
– Passive visualization
• Exploration of the results of the previous phase
Interactive visualization
• Here only the data generation phase is separated
– Off-line data generation
– Interactive visualization
• Generated data is available for interactive
visualization
• Options: selection, parametrization of visualization
• Currently very popular technique
Interactive steering
• All three steps are connected
– Generating data on the fly
– Interactive visualization enabling real-time view
onto data
– Extended interaction
• The user can control the simulation process, change
design when modeling, etc.
• Very complicated and costly process
Comparison
Data
• Central topic of visualization
• Data influences the selection of appropriate
visualization technique (along with the user)
• Important questions:
– Where data „lives“ (what is the data space)
– Type of data
– Which representation is meaningful
Data space
• Different properties
– Dimensionality of data space
– Coordinate system
– Region of influence (local or global impact)
Data definition
• Raw data vs. preprocessed data
• Data set = list of n records (r1, r2, …, rn)
• Each record ri contains m variables (v1, v2, …, vm)
• A variable vi is often called an observation
Definition of variables
• Independent variable ivi
– not influenced by any other variable (e.g., time)
• Dependent variable dvj
– is influenced by one or more independent variables
(e.g., temperature)
• Record can be represented as
ri = (iv1, iv2, …, ivmi, dv1, dv2, …, dvmd)
where m = mi + md
Data generated by function
• Independent variables = domain
• Dependent variables = range
Types of variables
• Physical types
– Characterized by the input format
– Characterized by the type of possible operations
– Example: bool, string, int, float,…
• Abstract types
– Data description
– Characterized by methods/attributes
– Can be hierarchical
– Example: plants, animals, …
Data types
• Ordinal (numeric)
– Binary
– Discrete
– Continuous
• Nominal (non-numeric)
– Categorical
– Ranked (sorted)
– Arbitrary (no inherent order)
Scale
• 3 basic attributes:
– Ordering relation on data
– Distance metric
– Existence of absolute zero
• Fixing the minimal value of
variable
Data representation
• Depends on:
– The presence of spatial domain
• If it is not inherent in the data, which domain should we choose?
– How are the dimensions used?
• Data characteristics
• Available visualization space (2D/3D)
• Which part of the data is in focus?
• In which parts can we use a more abstract representation?
Data space vs. data properties
Examples
• Discrete data – set of values, visualization
using bar charts, pie charts, …
Examples
• Continuous data – function, visualization using
graphs
Examples
• 2D real numbers
– Function of two variables, visualization using 2D
height maps, contours in 2D, …
Examples
• 2D vector fields, visualization using hedgehog
plots, LIC (line integral convolution),
streamlets, …
Examples
• Spatial data + time
– 3D flow, visualization using streamlines,
streamsurfaces
Examples
• Spatial data
– 3D density, visualization using isosurfaces, volume
rendering
Examples
• Multidimensional data
– Set of n dimensions, visualization using parallel
coordinates, glyphs, icons, …
Structure inside and between records
• Data sets consist of:
– Syntax – data representation (so called data model)
– Semantics – relationships within one record or
between records (so called conceptual model)
• Types of structures:
– Scalars, vectors, tensors
– Geometry and grids
– Other forms
Scalars, vectors, and tensors
• Scalar = individual number in record
– e.g., age
• Vector = composition of several variables into
one item
– e.g., point in 2D space, RGB
• Tensor = defined by its order (rank) and the dimension
of the space. Represented by an array or matrix.
– e.g., transformation matrix in 3D
Geometry and grids
• Geometry is represented using coordinates of
records
• Grid – geometry can be derived from the
starting position, orientation, and step size in
horizontal and vertical direction
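A minimal sketch of how grid geometry can be derived instead of stored. It assumes an axis-aligned grid (the orientation is fixed here for simplicity), and the function name `grid_coordinates` and its parameters are illustrative, not taken from the slides.

```python
def grid_coordinates(origin, step, shape):
    """Return the (x, y) position of every node of a uniform 2D grid, row by row.

    origin: (x0, y0) starting position
    step:   (dx, dy) step size in horizontal and vertical direction
    shape:  (columns, rows) number of grid nodes
    """
    x0, y0 = origin
    dx, dy = step
    cols, rows = shape
    return [(x0 + i * dx, y0 + j * dy) for j in range(rows) for i in range(cols)]


# Only three small tuples are stored; all six node positions are computed.
coords = grid_coordinates(origin=(10.0, 20.0), step=(0.5, 2.0), shape=(3, 2))
print(coords)
```

For non-uniform geometry this shortcut is not available and every coordinate has to be stored explicitly.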
Non-uniform geometry
• We need to store coordinates of all records –
they cannot be derived
• Non-uniform grid
Other types of structures
• Another important structure is topology
• It defines so called connectivity
• Important in resampling and interpolation
Time
• Enormous range of values
(picoseconds vs. millennia)
• Expressed absolutely or relatively
• Data sets containing time can have regular
distribution (regular sampling) or irregular one
(e.g., transaction processing – according to the
time stamp of its execution)
Other examples of structured data
• Magnetic Resonance Imaging (MRI)
• Computational fluid dynamics (CFD)
• Finance
• CAD systems
• Census (counting people)
• Social networks
Taxonomy – 7 types of data
• 1D (linear sets and sequences)
• 2D (maps)
• 3D (objects, shapes)
• nD (relations)
• Trees (hierarchy)
• Networks (graphs)
• Temporal data
Linear data
• Long lists of items
– Menu items
– Source code
• Fisheye displays
http://ds.cc.yamaguchi-u.ac.jp/~ichikay/pfp7/iv/pics/SeeSoft-line.jpg
http://www.cs.umd.edu/hcil/fisheyemenu/
2D data - maps
• GIS (Geographical Information Systems)
– Maps (e.g., Google Earth)
http://www.wimp.com/unusualplaces/
– Spatial queries
– Spatial data analysis
3D data
• Different types of 3D data visualization
• Scientific visualization
Multidimensional data
• Records in relational database
• Two solutions:
– Drawing all possible pairs of variables in 2D graph
• Simple but unusable for general overview of the data
– „Parallel coordinates“
• Method for displaying multidimensional data (Alfred
Inselberg)
Parallel coordinates
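A small sketch of the idea behind parallel coordinates: each variable gets its own vertical axis, and each record becomes a polyline crossing every axis at the height of its (normalized) value. The function name and the choice to place constant columns at mid-height are assumptions made for this illustration.

```python
def parallel_coordinates(records):
    """Map each record to a polyline with one (axis_index, height) vertex per variable.

    Heights are scaled to [0, 1] independently for every axis.
    """
    columns = list(zip(*records))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    polylines = []
    for record in records:
        line = []
        for axis, value in enumerate(record):
            span = highs[axis] - lows[axis]
            # A constant column has no range; draw it at mid-height (assumption).
            height = (value - lows[axis]) / span if span else 0.5
            line.append((axis, height))
        polylines.append(line)
    return polylines


records = [(1.0, 200.0, 3.0), (2.0, 100.0, 3.0), (3.0, 150.0, 3.0)]
lines = parallel_coordinates(records)
for line in lines:
    print(line)
```

The resulting vertex lists can be passed to any line-drawing routine; the rendering itself is left out here.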
Trees
• Displays not only the data itself but also its
structure
– e.g., genetic trees, file systems
• The number of items increases significantly in the lower levels of the tree
Trees
• Tree Maps
– Displaying the tree data as nested rectangles
– Example: a tree with a million records (figure from www.cs.umd.edu)
Networks
• Similar to trees – we aim to display the data
structure
• Networks = nodes + edges
• A good layout should have:
– Minimal edge crossings
– Minimal edge length
– Minimal edge bending
Temporal data
• Displaying data dependent on time
– Trend and seasonal graphs
Temporal data - LifeLines
http://www.cs.umd.edu/hcil/lifelines/
Data preprocessing
• Displaying raw data is precise and supports identification of
outliers, missing data, …
• Sometimes preprocessing is required
Preprocessing – techniques
• Metadata and statistics
• Missing values and data “cleaning”
• Normalization
• Segmentation
• Sampling and interpolation
• Dimension reduction
• Data aggregation
• Smoothing and filtering
• Raster to vector
Metadata and statistics
• Metadata – information for preprocessing
– Reference point for measurement
– Unit of measurement
– Symbol for missing values
– Resolution
• Statistical analysis
– Detection of missing records
– Cluster analysis
– Correlation analysis
Missing values and data “cleaning”
• Removing bad records
• Assigning a given value
• Assigning an average value
• Assigning a value derived from the nearest
neighbor value
• Calculating the value (imputation)
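The strategies above can be sketched for a 1D list in which missing values are marked with `None`. The function name `fill_missing`, the strategy names, and the tie rule for the nearest neighbor (the earlier neighbor wins) are assumptions of this sketch; real imputation methods are usually more elaborate.

```python
def fill_missing(values, strategy="mean", fill_value=0.0):
    """Return a copy of values with None entries handled according to strategy."""
    known = [(i, v) for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one known value is required")
    if strategy == "drop":
        # Removing the bad records entirely.
        return [v for _, v in known]
    mean = sum(v for _, v in known) / len(known)
    result = []
    for i, v in enumerate(values):
        if v is not None:
            result.append(v)
        elif strategy == "constant":
            result.append(fill_value)  # assigning a given value
        elif strategy == "mean":
            result.append(mean)  # assigning an average value
        elif strategy == "nearest":
            # Value of the closest known neighbor; on a tie the earlier one wins.
            _, nearest_value = min(known, key=lambda item: abs(item[0] - i))
            result.append(nearest_value)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return result


temperatures = [20.0, None, 22.0, 24.0, None]
print(fill_missing(temperatures, "drop"))
print(fill_missing(temperatures, "constant", fill_value=-1.0))
print(fill_missing(temperatures, "mean"))
print(fill_missing(temperatures, "nearest"))
```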
Normalization
• Transformation of the input dataset
• Adjusting values measured on different scales to
a notionally common scale
• Normalization to interval [0.0, 1.0]:
$d_{normalized} = \frac{d_{original} - d_{min}}{d_{max} - d_{min}}$
• Clamping according to the threshold values
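A minimal sketch of both operations on a list of numbers. The function names and the handling of a constant dataset (all values mapped to 0.0) are assumptions of this example.

```python
def normalize(values):
    """Min-max normalization of values to the interval [0.0, 1.0]."""
    d_min, d_max = min(values), max(values)
    if d_max == d_min:
        # No range to scale; map everything to 0.0 (assumption of this sketch).
        return [0.0 for _ in values]
    return [(d - d_min) / (d_max - d_min) for d in values]


def clamp(values, low, high):
    """Clamp every value to the thresholds [low, high]."""
    return [min(max(d, low), high) for d in values]


print(normalize([10.0, 15.0, 20.0]))      # [0.0, 0.5, 1.0]
print(clamp([-5.0, 3.0, 12.0], 0.0, 10.0))  # [0.0, 3.0, 10.0]
```

Clamping before normalization keeps a single outlier from compressing all other values into a tiny part of the interval.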
Segmentation
• Classification of input data into given
categories
• Split-and-merge
iterative algorithm
Split-and-merge
▪ similarThresh = threshold on the similarity of two regions with given characteristics
▪ homogeneousThresh = threshold on the region homogeneity (uniformity)
do {
    changeCount = 0;
    for each region {
        compare the region with its neighbors and find the most similar one;
        if the most similar one is within similarThresh of the current region {
            merge these two regions;
            changeCount++;
        }
        evaluate the homogeneity of the region;
        if the homogeneity of the region is smaller than homogeneousThresh {
            split the region into two parts;
            changeCount++;
        }
    }
} until changeCount == 0
Complex parts of the algorithm
• Determining the similarity of two regions
• Evaluating the homogeneity of a region (e.g., using its histogram)
• Splitting the region
Possible problem
• Infinite loop caused by repeatedly splitting and merging
the same region
• Solution:
– Changing the threshold value for similarity or
homogeneity
– Taking into account other region properties (e.g.,
size and shape of regions)
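A simplified, runnable sketch of split-and-merge on 1D data. Everything specific here is an assumption made for illustration: regions are contiguous index ranges, similarity is the difference of region means, inhomogeneity is the value range (max minus min) so a region is split when that range exceeds `max_spread`, and a merge is allowed only if the merged region stays homogeneous. That last condition is one way of taking an additional region property into account so that the same region is not split and merged forever; `max_iterations` is an extra safety guard.

```python
def mean(values):
    return sum(values) / len(values)


def split_and_merge(data, similar_thresh, max_spread, max_iterations=100):
    """Segment a 1D sequence into homogeneous regions given as (start, end) ranges."""
    regions = [(0, len(data))]
    for _ in range(max_iterations):
        changed = False

        # Split step: halve every region whose value range is too large.
        after_split = []
        for start, end in regions:
            chunk = data[start:end]
            if end - start > 1 and max(chunk) - min(chunk) > max_spread:
                middle = (start + end) // 2
                after_split += [(start, middle), (middle, end)]
                changed = True
            else:
                after_split.append((start, end))

        # Merge step: join a region with its left neighbor if their means are
        # similar and the merged region would still be homogeneous.
        merged = [after_split[0]]
        for start, end in after_split[1:]:
            prev_start, prev_end = merged[-1]
            combined = data[prev_start:end]
            similar = abs(mean(data[prev_start:prev_end]) - mean(data[start:end])) <= similar_thresh
            if similar and max(combined) - min(combined) <= max_spread:
                merged[-1] = (prev_start, end)
                changed = True
            else:
                merged.append((start, end))

        regions = merged
        if not changed:
            break
    return regions


signal = [1, 1, 1, 1, 1, 9, 9, 9]
print(split_and_merge(signal, similar_thresh=1.0, max_spread=2.0))
```

In this example the first split cuts the signal at index 4, which is not the true boundary; later passes split further and then merge the pieces back, ending with the regions (0, 5) and (5, 8).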
Sampling and interpolation
• Transformation of input data
• Interpolation = resampling method
– Linear interpolation
– Bilinear interpolation
– Non-linear interpolation
Bilinear interpolation
• Uniform grid
• Horizontal + vertical interpolation
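A short sketch of bilinear interpolation on a uniform grid with unit spacing: first two horizontal linear interpolations, then one vertical interpolation between their results. The function name, the `grid[row][column]` layout, and the clamping of positions to the grid are assumptions of this example; the grid must have at least 2x2 nodes.

```python
import math


def bilinear(grid, x, y):
    """Interpolate grid[row][column] at the fractional position (x = column, y = row)."""
    rows, cols = len(grid), len(grid[0])
    # Lower-left node of the cell containing (x, y), clamped to the grid.
    x0 = min(max(math.floor(x), 0), cols - 2)
    y0 = min(max(math.floor(y), 0), rows - 2)
    tx, ty = x - x0, y - y0
    # Horizontal interpolation along the two rows of the cell.
    upper = grid[y0][x0] * (1 - tx) + grid[y0][x0 + 1] * tx
    lower = grid[y0 + 1][x0] * (1 - tx) + grid[y0 + 1][x0 + 1] * tx
    # Vertical interpolation between the two intermediate values.
    return upper * (1 - ty) + lower * ty


grid = [[0.0, 10.0],
        [20.0, 30.0]]
print(bilinear(grid, 0.5, 0.5))  # center of the cell
print(bilinear(grid, 1.0, 0.0))  # exactly on a grid node
```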
Non-linear interpolation
• Problem with linear interpolation – only zero-order (C0)
continuity at grid points
• Solution = using quadratic and cubic splines
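As one example of a cubic spline, the sketch below uses the Catmull-Rom spline, which passes through the samples and has a continuous first derivative at them. The function names and the repetition of the end samples at the boundaries are assumptions of this illustration.

```python
import math


def catmull_rom(p0, p1, p2, p3, t):
    """Catmull-Rom value between p1 (t = 0) and p2 (t = 1)."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3
    )


def interpolate(samples, x):
    """Interpolate uniformly spaced samples (spacing 1) at position x."""
    i = min(max(math.floor(x), 0), len(samples) - 2)
    t = x - i
    # End samples are repeated where a neighbor is missing (assumption).
    p0 = samples[max(i - 1, 0)]
    p3 = samples[min(i + 2, len(samples) - 1)]
    return catmull_rom(p0, samples[i], samples[i + 1], p3, t)


print(interpolate([0.0, 1.0, 2.0, 3.0], 1.5))  # linear data stays linear
print(interpolate([0.0, 0.0, 1.0, 1.0], 1.5))  # smooth step between samples
```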
Result
• Figure panels: original image (24x24 pixels), cubic B-spline filter, Catmull-Rom
Resampling
• Pixel replication
• Neighbor averaging
• Data subsetting
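The three resampling approaches can be sketched on a 1D list of samples; for images the same idea is applied along both axes. The function names and the integer `factor` / `step` parameters are assumptions of this example.

```python
def replicate(values, factor):
    """Upsampling by pixel replication: repeat every sample `factor` times."""
    return [v for v in values for _ in range(factor)]


def average_neighbors(values, factor):
    """Downsampling by neighbor averaging: one mean per block of `factor` samples."""
    blocks = [values[i:i + factor] for i in range(0, len(values), factor)]
    return [sum(block) / len(block) for block in blocks]


def subset(values, step):
    """Downsampling by data subsetting: keep every `step`-th sample."""
    return values[::step]


samples = [1, 3, 5, 7, 9]
print(replicate([1, 2], 2))           # [1, 1, 2, 2]
print(average_neighbors(samples, 2))  # [2.0, 6.0, 9.0]
print(subset(samples, 2))             # [1, 5, 9]
```

Averaging uses every input sample, while subsetting simply discards the samples in between.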
Dimension reduction
• Preparing multidimensional data for displaying
• Keep as much original information as possible
• Techniques:
– PCA (principal component analysis)
– MDS (multidimensional scaling)
– SOMs (Kohonen self-organizing maps)
Interactive Data Visualization - Foundations, Techniques and Applications. Matthew Ward
PCA intuitively
1. We select a line in the space of the n-dimensional
data that covers the largest spread of the input
data items. This line is called the first
principal component (PC).
2. We select a second line perpendicular to the
first PC that covers the largest remaining spread; this forms the second PC.
3. We repeat this until we have processed all PC dimensions
or until we reach the desired number of principal
components.
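A minimal pure-Python sketch of this procedure. It assumes the usual formulation in which the "line covering the largest spread" is the dominant eigenvector of the covariance matrix, found here by power iteration; after each component is found, its variance is removed (deflation) so that the next iteration yields a perpendicular direction. The function names and the fixed iteration count are illustrative choices, and a numerical library would normally be used instead.

```python
import math


def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def principal_components(records, count, iterations=200):
    """Return `count` unit vectors: the first principal components of records."""
    n, m = len(records), len(records[0])
    means = [sum(column) / n for column in zip(*records)]
    centered = [[v - mu for v, mu in zip(record, means)] for record in records]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    components = []
    for _ in range(count):
        # Asymmetric start vector, so it is unlikely to be orthogonal to the PC.
        vec = [1.0 / (k + 1) for k in range(m)]
        for _ in range(iterations):
            nxt = mat_vec(cov, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # no variance left in the data
                break
            vec = [x / norm for x in nxt]
        eigenvalue = sum(x * y for x, y in zip(vec, mat_vec(cov, vec)))
        components.append(vec)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[a][b] - eigenvalue * vec[a] * vec[b] for b in range(m)]
               for a in range(m)]
    return components


records = [(0.0, 0.0), (1.0, 1.2), (2.0, 1.8), (3.0, 3.0)]
first, second = principal_components(records, 2)
print(first)   # roughly along the diagonal of the point cloud
print(second)  # perpendicular to the first component
```

Projecting each centered record onto the first few components gives the reduced-dimension coordinates.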
PCA – principal component analysis
[Figure: PCA illustrated in four steps, 1) to 4). Source: http://ordination.okstate.edu/PCA.htm]
MDS – multidimensional scaling
• Based on comparing the distances between
individual data items in original and reduced
space
MDS – multidimensional scaling
1) We calculate the distances between all pairs of data points in the original space. If
we have n points as an input, this step requires n(n – 1)/2 operations.
2) We place all input data points in the reduced-dimension space (often at random
positions).
3) We calculate the stress, i.e., the difference in distance between points in the original and
reduced space. This can be done using different approaches.
4) If the average or cumulative stress value is smaller than the user-defined
threshold, the algorithm ends and returns the result.
5) If the stress value is higher than the threshold, for each point we calculate a
direction vector pointing in the direction in which the point should move to reduce the stress
between this point and the other points. This vector is the weighted
average of the vectors between this point and its neighbors, with weights derived
from the stress values of the individual pairs. A positive stress value repels
the points, a negative one attracts them. The higher the absolute value of the stress, the
larger the movement of the point.
6) Based on these calculations we move the data points in the reduced
space according to the calculated vectors. Return to step 3 of the algorithm.
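The six steps can be sketched in pure Python as follows. Several details are assumptions of this sketch rather than part of the description: the stress of a pair is taken as the original distance minus the reduced-space distance (so positive stress pushes points apart), the termination test uses the average absolute stress, and `rate`, `max_iterations`, and `seed` are illustrative parameters.

```python
import math
import random


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mds(points, dims=2, threshold=1e-3, rate=0.5, max_iterations=5000, seed=0):
    """Iteratively place points in `dims` dimensions so pairwise distances are kept."""
    rng = random.Random(seed)
    n = len(points)
    # Step 1: pairwise distances in the original space.
    target = [[distance(p, q) for q in points] for p in points]
    # Step 2: random initial positions in the reduced space.
    layout = [[rng.random() for _ in range(dims)] for _ in range(n)]
    for _ in range(max_iterations):
        shifts = [[0.0] * dims for _ in range(n)]
        total = 0.0
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                current = distance(layout[i], layout[j]) or 1e-9
                # Step 3: stress of the pair (positive = too close, repel).
                stress = target[i][j] - current
                total += abs(stress)
                # Step 5: accumulate the stress-weighted unit vector away from j.
                for k in range(dims):
                    shifts[i][k] += stress * (layout[i][k] - layout[j][k]) / current
        # Step 4: stop when the average stress is below the threshold.
        if total / (n * (n - 1)) < threshold:
            break
        # Step 6: move every point along its averaged shift vector.
        for i in range(n):
            for k in range(dims):
                layout[i][k] += rate * shifts[i][k] / (n - 1)
    return layout


# A 3-4-5 triangle embedded in 3D, reduced to 2D.
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 0.0, 4.0)]
layout = mds(points)
for i in range(3):
    for j in range(i + 1, 3):
        print(i, j, round(distance(layout[i], layout[j]), 2))
```

The absolute positions in the result are arbitrary (the layout may be rotated, shifted, or mirrored); only the pairwise distances are meaningful.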
Data aggregation
• Aggregation = clustering of similar data into
groups.
Smoothing and filtering
• Signal processing techniques – noise removal
• Convolution in 1D:
$p_i' = \frac{p_{i-1}}{4} + \frac{p_i}{2} + \frac{p_{i+1}}{4}$
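A minimal sketch of 1D smoothing by convolution with the kernel (1/4, 1/2, 1/4): each new value is a weighted average of the sample and its two neighbors. Leaving the two boundary samples unchanged is an assumption of this example; other boundary treatments are possible.

```python
def smooth(values):
    """Convolve values with the kernel (1/4, 1/2, 1/4); boundary samples are kept."""
    result = list(values)
    for i in range(1, len(values) - 1):
        result[i] = values[i - 1] / 4 + values[i] / 2 + values[i + 1] / 4
    return result


# A single spike is spread over its neighbors while the sum stays the same.
print(smooth([0.0, 0.0, 4.0, 0.0, 0.0]))  # [0.0, 1.0, 2.0, 1.0, 0.0]
```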
Converting rasters to vectors
• Used for:
– Data compression
– Image comparison
– Data transformation
• Methods:
– Thresholding
– Region growing
– Edge detection
– …
Conclusion
• The techniques mentioned improve the
efficiency of visualization
• We have to inform the user that the data has
been transformed!!!