Integrating Dialogue Systems with Images
Jaromír Plhák, Ivan Kopeček and Radek Ošlejšek
{xplhak, kopecek, oslejsek}@fi.muni.cz
Faculty of Informatics
Masaryk University, Brno
Czech Republic
© J. Plhák, I. Kopeček, R. Ošlejšek, FI MU Brno
TSD 2012, Brno

Motivation
- Looking at this photo from a holiday approximately three years ago: It is somewhere in the Czech
Republic, but where exactly? And what is the name of that castle? And who is the man behind your
girlfriend? ...
Figure 1: Photo from a holiday
Communicative image
An “intelligent” image that is able to discuss its content with the user

-Pernstejn castle
- Such pieces of information are virtually inaccessible from the photo alone. However, some relevant
information can be retrieved using current information technologies. GPS coordinates tell us where
the photo was taken. Face recognition may help reveal the identities of the people. The orientation
of the camera may help identify some objects in the picture (e.g. the peak in the background).
- The idea of communicative images presented in this paper is to let people communicate with
images: users can easily get relevant pieces of information from an image, and at the same time
the image can gather relevant information about itself from the users.

Key Concepts
- Dialogue-based communication in natural language
- Ontology-based knowledge base
  - Web Ontology Language (OWL)
- Scalable Vector Graphics (SVG)
  - Raster image and object annotation
  - owl:imports
- Information sources and algorithms
  - EXIF data; face detection; similarity search algorithms
  - Information from users
  - Crowdsourcing

- First, we need to build a communicative image from a classical photo or image.
- SVG supports structured annotation data.
   * Either 2D points lying in the middle of the relevant objects or invisible (transparent)
polygons determining their approximate silhouettes can be used.
   * owl:imports: the ontologies are imported by reference, so the knowledge base can be shared by
many pictures. The concrete annotation data, i.e. the concrete values of the properties prescribed
by the ontologies, are stored directly in the SVG file in the form of XML elements.
- Once an image is transformed into the SVG format, the system tries to acquire as much information
about the image as possible, using auto-detection and image recognition techniques, e.g. face
detection and recognition algorithms, similarity search algorithms searching in large collections
of tagged pictures, EXIF data extraction from photos, etc.
- After this initial stage, the user is informed about the estimated content and invited to confirm
or refute the information and to continue with questions.
- New pieces of acquired information are stored in the image ontology and reused in subsequent
interactions.
-Points in the middle of the relevant objects
-Transparent polygons for approximate silhouettes
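The annotation step described in these notes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the file name `holiday.jpg`, the object id `castle` and the helper `wrap_photo` are hypothetical, and the ontology-linked annotation values mentioned above are not modelled here.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("", SVG_NS)
ET.register_namespace("xlink", XLINK_NS)


def wrap_photo(photo_href, width, height, objects):
    """Wrap a raster photo in SVG and add one invisible region per object.

    `objects` maps a hypothetical object id to a list of (x, y) outline points.
    """
    svg = ET.Element(f"{{{SVG_NS}}}svg", {"width": str(width), "height": str(height)})
    # The original raster photo is embedded by reference.
    ET.SubElement(svg, f"{{{SVG_NS}}}image", {
        f"{{{XLINK_NS}}}href": photo_href,
        "width": str(width),
        "height": str(height),
    })
    for object_id, outline in objects.items():
        points = " ".join(f"{x},{y}" for x, y in outline)
        # Transparent polygon: not drawn, but addressable by its id.
        ET.SubElement(svg, f"{{{SVG_NS}}}polygon", {
            "id": object_id,
            "points": points,
            "fill": "none",
            "stroke": "none",
        })
    return svg


# Hypothetical photo with one approximate silhouette.
svg = wrap_photo(
    "holiday.jpg", 800, 600,
    {"castle": [(300, 80), (520, 80), (520, 330), (300, 330)]},
)
svg_text = ET.tostring(svg, encoding="unicode")
```

A point annotation could be stored the same way, for instance as a zero-radius circle with an id; the polygon variant is shown because it also carries the approximate silhouette.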

Ontologies
- Graphical ontology
  - General visual characteristics
    - Unusual size
    - Typical shape
    - Dominant color
  - Navigation
    - General and mutual position
- Domain-specific ontologies
  - Family and friends
  - Sights
  - Nature
<Hair rdf:ID="head">
    <hasColor rdf:resource="#brown" />
</Hair>
ICCHP-ULD 2012, Linz
Figure 2: Part of graphical ontology

* We have developed a basic graphical ontology which restricts abstraction to the aspects that are
suitable and usable for dialogue-based investigation of graphical content.
- This ontology focuses on generic visual characteristics of graphical objects and makes it possible
to define several typical or unusual aspects, e.g. unusual size, typical shape, dominant color, etc.
Using this ontology, the annotator can say that some object is ”mostly red, oval and unusually
big”, for instance.
- The ontology includes a navigational part as well, which makes it possible to describe either the
general or the mutual position of objects in the picture.
* Although the graphical ontology covers basic visual characteristics that are useful for generic
dialogue interactions, verbal descriptions of domain-specific pictures require specialized
ontology extensions that help to generate domain-specific dialogues. For example, a domain model,
Family, provides vocabulary and background knowledge to classify people by their relationships,
similarly to the popular ”circles” known from social networks. People in a photo are identified
and assigned to ”circles” (e.g. my family, friends, colleagues, etc.).
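How annotated visual characteristics could be turned into a phrase such as ”mostly red, oval and unusually big” can be sketched as follows. The property names `dominant_color`, `typical_shape` and `unusual_size`, as well as the function `describe`, are assumptions made for this illustration; the actual ontology vocabulary is not given in these slides.

```python
# Hypothetical property names; the slides only name the three aspects as examples.
PHRASES = {
    "dominant_color": "mostly {}",
    "typical_shape": "{}",
    "unusual_size": "unusually {}",
}


def describe(label, annotation):
    """Build a short verbal description from annotated visual properties."""
    parts = [PHRASES[key].format(annotation[key]) for key in PHRASES if key in annotation]
    if not parts:
        return f"The {label} has no visual annotation yet."
    if len(parts) == 1:
        listing = parts[0]
    else:
        # Join as "a, b and c".
        listing = ", ".join(parts[:-1]) + " and " + parts[-1]
    return f"The {label} is {listing}."


sentence = describe(
    "balloon",
    {"dominant_color": "red", "typical_shape": "oval", "unusual_size": "big"},
)
```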

Dialogues with Images
- Communication modes
  - Information retrieval mode
  - Image information supplementing mode
  - Free communication mode
- Communication analysis
  - Small fragment of natural language
    - Relatively simple grammars
    - Frames technology
    - Standard techniques for resolving misunderstandings
How far is it from this hotel to the nearest beach?
How far is it from <SLOT1> to <SLOT2>?

1. The information retrieval mode (the mode with user initiative). In this mode the user just
asks questions and gets answers.
2. The image information supplementing mode (the mode with image initiative). This mode is used
for supplying the image with missing pieces of information.
3. The free communication mode (the mode with mixed initiative). In this mode, there are no a priori
limitations on the user or the image. Both the user and the image can ask questions and give answers.
In each mode, the user can switch to another mode.
Typically, we can restrict ourselves to a small fragment of natural language, so that the engine
can be based on relatively simple grammars in combination with the frames technology and standard
techniques for resolving misunderstandings. For instance, the question ”How far is it from this
hotel to the nearest beach?” is resolved using the template ”How far is it from SLOT1 to SLOT2?”.
The system expects both SLOT1 and SLOT2 to be filled with specific “object” entries.
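The template mechanism can be illustrated with a minimal Python sketch. It assumes that a template is matched with a regular expression and that a slot is valid only if its value is one of the objects annotated in the image; the set `KNOWN_OBJECTS` and the functions `compile_template` and `fill_slots` are hypothetical and do not come from the described system.

```python
import re

# Hypothetical set of annotated objects in the current image.
KNOWN_OBJECTS = {"this hotel", "the nearest beach", "the castle"}

TEMPLATE = "How far is it from <SLOT1> to <SLOT2>?"


def compile_template(template):
    """Turn a template with <SLOTn> placeholders into a regular expression."""
    # re.split with a capturing group alternates literal text and slot names.
    parts = re.split(r"<(SLOT\d+)>", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 1:
            pattern += f"(?P<{part}>.+?)"   # named group for the slot
        else:
            pattern += re.escape(part)      # literal text of the template
    return re.compile(f"^{pattern}$", re.IGNORECASE)


def fill_slots(template, utterance, known_objects):
    """Return the slot values, or None if the utterance does not fit."""
    match = compile_template(template).match(utterance.strip())
    if match is None:
        return None
    slots = {name: value.strip().lower() for name, value in match.groupdict().items()}
    # Every slot must be filled with a known "object" entry.
    if not all(value in known_objects for value in slots.values()):
        return None
    return slots


slots = fill_slots(
    TEMPLATE, "How far is it from this hotel to the nearest beach?", KNOWN_OBJECTS
)
```

When a slot value is not a known object, the sketch simply returns `None`; a real dialogue manager would more likely start a clarification subdialogue at that point.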

Dialogues with Images
- Grammar generator
  - General ontology structure (concepts and relations)
  - Set of objects in the image
  - Restrictions on the given slots (cardinality or value type)
  - Current dialogue strategy
  - Dialogue context
  - History of the communication
Figure 3: Example of application structure

The dialogue manager (DM) prunes redundant patterns and resolves conflicts.
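A rough idea of how question patterns could be generated from the ontology structure and the objects in an image is sketched below. The relations, templates, concepts and object names are all hypothetical, only the value-type restriction on slots is modelled, and the dialogue strategy, context and history listed on the slide are left out.

```python
from itertools import product

# Hypothetical ontology fragment: relation -> (question template, slot value types).
RELATIONS = {
    "distance": ("How far is it from {0} to {1}?", ("Place", "Place")),
    "identity": ("Who is {0}?", ("Person",)),
}

# Hypothetical annotated objects in one image: name -> concept.
OBJECTS = {"the hotel": "Place", "the beach": "Place", "George": "Person"}


def generate_patterns(relations, objects):
    """List every question whose slots can be filled by objects of the right type."""
    patterns = []
    for template, slot_types in relations.values():
        # For each slot, collect the objects that satisfy its value-type restriction.
        candidates = [
            [name for name, concept in objects.items() if concept == slot_type]
            for slot_type in slot_types
        ]
        for combo in product(*candidates):
            # Skip degenerate questions such as "from X to X".
            if len(set(combo)) == len(combo):
                patterns.append(template.format(*combo))
    return patterns


patterns = generate_patterns(RELATIONS, OBJECTS)
```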

Dialogues with Images
Figure 4:  Example of a dialogue communication in Information retrieval mode

S: This is a photo of a group of your friends standing in front of the Pernstejn castle. Now, you
can open and view the photo or start the communication with the image.
U: Open the image, please.
S: /opens the photo/ File is opened. Now, you can start the communication with the image.
U: What is the resolution of this picture?
S: The resolution of this picture is 8 MPix. Enter your next query, please.
U: Who is in the picture?
S: There is a group of your friends: George, John and Elisabeth. There are also four other groups
of tourists.
U: Give me the detailed information about this castle.
S: Pernstejn Castle is a castle located on a rock above the village of Nedvedice and the rivers
Svratka …
U: Thank you, bye.
S: Good bye!


Implemented Modules
- GateDialogueBean
  - Dialogue manager
- GateOwlBean
  - Processing of ontologies
  - OWL reasoner Pellet
- GateSvgBean
  - SVG processing
  - Batik library
- Client scripts
  - Web browser plug-in


Application Domains
- Personal photos and albums
  - Organize
  - Search
- E-learning study materials linked to a domain-specific knowledge base
S: In the picture there is the chemical structure of antioxidant resveratrol.
U: What is antioxidant resveratrol?
S: Antioxidant resveratrol consists of two benzene nuclei and three hydroxyl groups.
U: What is a benzene nucleus?
S: Benzene nucleus …
/The system uses definition from the chemical ontology/
Fig. 2: Antioxidant resveratrol

- Chai et al. proposed an intelligent photo album that makes it possible to organize and search a
collection of family photos by means of ontologies and SWRL queries. If communicative images are
integrated into this scheme, we see a straightforward way to enhance its functionality. Because the
photos in the album are organized by means of an OWL ontology, it might be possible to employ the
mechanism of generating dialogues from domain ontologies. In this way, the user could organize
photos via dialogue as well.
- E-learning is another field which could benefit from communicative images. A very important
feature of the presented concept is that, because it is based on formal ontologies, it is fully
compatible with the Semantic Web paradigm and at the same time fully supports multilinguality.

Application Domains
- Internet graphics exploration
  - Social networks
  - Electronic news
  - Web page editing
- Applications for people with special needs
  - Visually impaired people
  - Older adults
  - People with any constraint in communication

- The concept of communicative images seems to be a promising approach to making graphics accessible
to visually impaired people, older adults or people with any constraint in communication.
- It provides users with special needs with another way to explore graphics on the Internet and
gives them more detailed information about photos in social networks or electronic news.
* The communicative images paradigm depends on users cooperating and sharing data. Annotations
and knowledge gained during the interaction of one user are used to enhance and simplify the
interaction of another user. Crowdsourcing is one of the Web 2.0 technologies that could support
this paradigm. If content cannot be recognized by auto-detection techniques, a blind user can ask
a community of sighted users for help, e.g. to describe ”what is that building in this picture”.
Their answers are used to annotate the picture and extend the knowledge base in the same way as
the results of machine-based recognition.
* Moreover, the communicative images paradigm makes it possible to build other useful applications.
Integrating communicative images with dialogue-based web editing would enable visually impaired
people to create their personal web pages with graphical content by means of dialogues.

Conclusion and Future Work
- Communication with an image by means of natural language
- Efficient image investigation using frame-based dialogue management
  - Supported by the ontologies
- Automated generation of grammars from the knowledge base
- Testing of the prototype
- Abstraction of ontologies
  - Semantic terms, attributes and relations valuable for the system
- Modules that search for additional knowledge
  - Internet, picture analysis

- Acquiring additional knowledge (Internet, picture analysis), e.g. ”Who was the fastest in that
race?” or ”What type of photograph is this?”

Thank you for your attention!
xplhak@fi.muni.cz