831708907
22
augustus 2003
This document is the Master Thesis made as a part of
my Master Degree in Computer Science Education (Software Systems) at the Open
University of the Netherlands. The work has been done in collaboration with the
research department of the company Agfa in Mortsel Belgium.
Name Guido Naudts
Student number 831708907
Address Secretarisdreef 5
Telephone work 0030-2-542.76.01
Home 0030-14-51.32.43
E-mail Naudts_Vannoten@yahoo.com
Chairman: dr J.T. Jeuring, professor at the Open
University
Secretary : ir. F.J. Wester, senior lecturer at the Open
University
Coach: ir. J. De Roo, researcher at Agfa
Gevaert.
I want to thank Ir.J.De Roo for giving me the opportunity for making a thesis about a tremendous subject and his guidance in the matter.
Pr.J.Jeuring and Ir. F.Wester are thanked for their efforts to help me produce a readable and valuable text.
I thank M.P.Jones for his Prolog demo in the Hugs distribution that has very much helped me.
I thank all the people from the OU for their efforts during the years without which this work would not have been possible.
I thank my wife and children for supporting a father seated behind his computer during many years.
Coaching and graduation
committee
1.2.3.Conclusions of the case study
1.6. Relation with the current research
Chapter
2. The building blocks
2.4.Resource Description Framework RDF
2.7.2. The model theory of RDF
2.8. A global view of the Semantic Web
2.8.2. The layers of the Semantic Web
Chapter
3. RDF, RDFEngine and inferencing.
3.2.Recapitulation and definitions
3.4.Languages needed for inferencing
3.5.The model interpretation of a rule
3.14.Comparison of resolution and closure
paths
3.17.A formal theory of graph resolution
3.18.An extension of the theory to variable
arities
3.19.An
extension of the theory to typed nodes and arcs
Chapter
4. Existing software systems
Chapter 5. RDF, inferencing
and logic.
5.2.3. A problem with reification
5.3.Unification
and the RDF graph model
5.4.Completeness and soundness of the engine
5.7. Comparison with Prolog and Otter
5.7.1. Comparison of Prolog with RDFProlog
5.8.Differences
between RDF and FOL
5.11. RDF and constructive logic
5.12.
The Semantic Web and logic
5.15.The
World Wide Web and neural networks
6.3. An optimization technique
7.3. OWL Lite and inconsistencies
7.5. Reactions to inconsistencies and trust
7.5.1.Reactions to inconsistency
7.5.2.Reactions to a lack of trust
8.4. Searching paths in a graph
8.6. Simulation of logic gates.
8.8. A simulation of a flight reservation
9.2.Characteristics
of an inference engine
9.3. An evaluation of the RDFEngine
9.4. Forward versus backwards reasoning
Appendix
1: Excuting the engine
Appendix 2. Example
of a closure path.
Simulation
of a flight reservation
Appendix
4. Theorem provers an overview
2.2 Reasoning using resolution techniques: see
the chapter on resolution.
2.5. The matrix connection method
Higher order interactive provers
Overview of different logic systems
3. First order predicate logic
5.
Modal logic (temporal logic)
6. Intuistionistic or constructive logic
Appendix
5. The source code of RDFEngine
Appendix
6. The handling of variables
The Semantic Web is
an initiative of the World Wide Web
Consortium (W3C). The goal of this project is to define a series of
standards. These standards will create a framework for the automation of
activities on the World Wide Web
(WWW) that still need manual intervention by human beings at this moment.
A certain number of basic standards have been
developed already. Among them XML is without doubt the most known. Less known
is the Resource Description Framework
(RDF). This is a standard for the creation of meta information. This is the
information that has to be given to computers to permit them to handle tasks
previously done by human beings.
Information alone however is not sufficient. In order
to take a decision it is often necessary to follow a certain reasoning.
Reasoning is connected with logics. The consequence is that computers have to
be fed with programs and rules that can take decisions, following a certain
logic, based on available information.
These programs and reasonings are executed by inference engines. The definition of
inference engines for the Semantic Web is at the moment an active field of
experimental research.
These engines have
to use information that is expressed following the standard RDF. The
consequences of this requirement for the definition of the engine are explored
in this thesis.
An inference engine
uses information and, given a query,
reasons with this information with a solution
as a result. It is important to be able to show that the solutions, obtained in
this manner, have certain characteristics. These are, between others, the
characteristics completeness and validity. This proof is delivered in
this thesis in two ways:
1)
by means of a
reasoning based on the graph theoretical properties of RDF
2)
by means of a
model based on First Order Logics.
On the other hand these engines must take into account
the specificity of the World Wide Web. One of the properties of the web is the
fact that sets are not closed. This implies that not all elements of a set are
known. Reasoning then has to happen in an open
world. This fact has far reaching logical
implications.
During the twentieth century there has been an
impetous development of logic systems. More than 2000 kinds of logic were
developed. Notably First Order Logic (FOL)
has been under heavy attack, beside others, by the dutch Brouwer with the constructive or intuistionistic logic. Research is also done into kinds of logic
that are suitable for usage by machines.
In this thesis I argue that an inference engine
for the World Wide Web should follow a kind of constructive logic and that RDF,
with a logical implication added,
satisfies this requirement. It is shown that a constructive approach is
important for the verifiability of the results of inferencing.
The structure of the World Wide Web has a
number of consequences for the properties that an inference engine should
satisfy. A list of these properties is established and
discussed.
A query on the World Wide Web can extend over more
than one server. This means that a process of distributed inferencing must be
defined. This concept is introduced in an experimental inference engine,
RDFEngine.
An important characteristic for an engine that has to
be active in the World Wide Web is efficiency.
Therefore it is important to do research about the methods that can be used to
make an efficient engine. The structure has to be such that it is possible to
work with huge volumes of data. Eventually these data are kept in relational
databases.
On the World Wide Web there is information available
in a lot of places and in a lot of different forms. Joining parts of this
information and reasoning about the result can easily lead to the existence of contradictions and inconsistencies. These are inherent to the used logic or are
dependable on the application. A special place is taken by inconsistencies that
are inherent to the used ontology. For the Semantic Web the ontology is
determined by the standards rdfs and OWL.
An ontology introduces a classification of data and apllies restrictions to those data. An inference engine for the Semantic Web needs to have such characteristics that it is compatible with rdfs and OWL.
An executable specification of an inference engine in Haskell was constructed. This permitted to test different aspects of inferencing and the logic connected to it. This engine has the name RDFEngine. A large number of existing testcases was executed with the engine. A certain number of testcases was constructed, some of them for inspecting the logic characteristics of RDF.
Het Semantisch
Web is een initiatief van het World Wide
Web Consortium (W3C). Dit project beoogt het uitbouwen van een reeks van
standaarden. Deze standaarden zullen een framework scheppen voor de
automatisering van activiteiten op het World
Wide Web (WWW) die thans nog manuele tussenkomst vereisen.
Een bepaald aantal basisstandaarden zijn reeds ontwikkeld waaronder XML
ongetwijfeld de meest gekende is. Iets minder bekend is het Resource Description Framework (RDF).
Dit is een standaard voor het creëeren van meta-informatie. Dit is de
informatie die aan de machines moet verschaft worden om hen toe te laten de
taken van de mens over te nemen.
Informatie alleen is echter niet voldoende. Om een beslissing te kunnen
nemen moet dikwijls een bepaalde redenering gevolgd worden. Redeneren heeft te
maken met logica. Het gevolg is dat computers gevoed moeten worden met
programmas en regels die, volgens een bepaalde logica, besluiten kunnen nemen
aan de hand van beschikbare informatie.
Deze programmas en redeneringen worden uitgevoerd door zogenaamde inference engines. Dit is op het huidig
ogenblik een actief terrein van experimenteel onderzoek.
De engines
moeten gebruik maken van de informatie die volgens de standaard RDF wordt
weergegeven. De gevolgen hiervan voor de structuur van de engine worden in deze
thesis nader onderzocht.
Een inference
engine gebruikt informatie en, aan de hand van een vraagstelling, worden redeneringen doorgevoerd op deze informatie
met een oplossing als resultaat. Het
is belangrijk te kunnen aantonen dat de oplossingen die aldus bekomen worden
bepaalde eigenschappen bezitten. Dit zijn onder meer de eigenschappen volledigheid en juistheid. Dit bewijs wordt op twee manieren geleverd: enerzijds
bij middel van een redenering gebaseerd op de graaf theoretische eigenschappen
van RDF; anders bij middel van een op First Order Logica gebaseerd model.
Anderzijds dienen zij rekening te houden met de specificiteit van het
World Wide Web. Een van de eigenschappen van het web is het open zijn van verzamelingen.
Dit houdt in dat van een verzameling niet alle elementen gekend zijn. Het
redeneren dient dan te gebeuren in een open
wereld. Dit heeft verstrekkende logische implicaties.
In de loop van de twintigste eeuw heeft de logica een stormachtige
ontwikkeling gekend. Meer dan 2000 soorten logica werden ontwikkeld. Aanvallen
werden doorgevoerd op het bolwerk van de First
Order Logica (FOL), onder meer door de nederlander Brouwer met de constructieve of intuistionistische logica. Gezocht wordt ook naar soorten logica
die geschikt zijn voor het gebruik door machines.
In deze thesis wordt betoogd dat een inference engine voor het WWW een
constructieve logica dient te volgen enerzijds, en anderzijds, dat RDF
aangevuld met een logische implicatie
aan dit soort logica voldoet. Het wordt aangetoond that een constructieve
benadering belangrijk is voor de verifieerbaarheid van de resultaten van de
inferencing.
De structuur van het World Wide Web heeft een aantal gevolgen voor de
eigenschappen waaraan een inference engine dient te voldoen. Een lijst van deze
eigenschappen wordt opgesteld en besproken.
Een query op het World Wide Web kan zich uitstrekken over meerdere
servers. Dit houdt in dat een proces van gedistribueerde inferencing moet
gedefinieerd worden. Dit concept wordt geïntroduceerd in een experimentele
inference engine, RDFEngine.
Een belangrijke eigenschap voor een engine die actief dient te zijn in
het World Wide Web is efficientie.
Het is daarom belangrijk te onderzoeken op welke wijze een efficiente engine
kan gemaakt worden. De structuur dient zodanig te zijn dat het mogelijk is om
met grote volumes aan data te werken, die eventueel in relationele databases
opgeslagen zijn.
Op het World Wide Web is informatie op vele plaatsen en in allerlei
vormen aanwezig. Het samen voegen van deze informatie en het houden van
redeneringen die erop betrekking hebben kan gemakkelijk leiden tot het ontstaan
van contradicties en inconsistenties. Deze zijn inherent aan
de gebruikte logica of zijn afhankelijk van de applicatie. Een speciale plaats
nemen de inconsistenties in die inherent zijn aan de gebruikt ontologie. Voor
het Semantisch Web wordt de ontologie bepaald door de standaarden rdfs en OWL. Een ontologie voert een classifiering in van gegevens en legt
ook beperkingen aan deze gegevens op. Een inference engine voor het Semantisch
Web dient zulkdanige eigenschappen te bezitten dat hij compatibel is met rdfs
en OWL.
Een uitvoerbare specificatie van een inference
engine in Haskell werd gemaakt. Dit liet toe om allerlei aspecten van
inferencing en de ermee verbonden logica te testen. De engine werd RDFEngine
gedoopt. Een groot aantal bestaande testcases werd onderzocht. Een aantal
testcases werd bijgemaakt, sommige speciaal met het oogmerk om de logische
eigenschappen van RDF te onderzoeken.
In chapter 1 a case study is given and then the
goals of the research, that is the subject of the thesis, are exposed with the
case study as illustration. The methods used are explained and an overview is
given of the fundamental knowledge upon which the research is based. The
relation with current research is indicated. Then the structure of the thesis
is outlined.
A case study can serve as a guidance to what we want
to achieve with the Semantic Web. Whenever standards are approved they should
be such that important case studies remain possible to implement. Discussing all possible application fields here
would be out of scope. However one case study will help to clarify the goals of
the research.
A travel agent in Antwerp has a client who wants to go
to St.Tropez in France. There are rather a lot of possibilities for composing
such a voyage. The client can take the train to France, or he can take a bus or
train to Brussels and then the airplane to Nice in France, or the train to
Paris then the airplane or another train to Nice. The travel agent explains the
client that there are a lot of possibilities. During his explanation he gets an
impression of what the client really wants. Fig.1.1. gives a schematic view of
the case study.

Fig.1.1. A Semantic
Web case study
He agrees with the client about the itinerary: by
train from Antwerp to Brussels, by airplane from Brussels to Nice and by train
from Nice to St. Tropez. This still leaves room for some alternatives. The
client will come back to make a final decision once the travel agent has said
him by mail that he has worked out some alternative solutions like price for
first class vs second class etc...
Remark that the decision for the itinerary that has
been taken is not very well founded; only very crude price comparisons have
been done based on some internet sites that the travel agent consulted during his
conversation with the client. A very cheap flight from Antwerp to Cannes has
escaped the attention of the travel agent.
The travel agent will now further consult the internet
sites of the Belgium railways, the Brussels airport and the France railways to
get some alternative prices, departure times and total travel times.
Now let’s compare this with the hypothetical situation
that a full blown Semantic Web would exist. In the computer of the travel agent
resides a Semantic Web agent that has at its disposal all the necessary
standard tools. The travel agent has a specialised interface to the general
Semantic Web agent. He fills in a query in his specialised screen. This query
is translated to a standardised query format for the Semantic Web agent. The agent
consult his rule database. This database of course contains a lot of rules
about travelling as well as facts like e.g. facts about internet sites where
information can be obtained. There are a lot of ‘path’ rules: rules for
composing an itinerary (for an example of what such rules could look like see:
[GRAPH]. The agent contacts different other agents like the agent of the
Belgium railways, the agents of the French railways, the agent of the airports
of Antwerp, Brussels, Paris, Cannes, Nice etc...
With the information recieved its inference rules
about scheduling a trip are consulted. This is all done while the travel agent
is chatting with the client to detect his preferences. After some 5 minutes the
Semantic Web agent gives the travel agent a list of alternatives for the trip;
now the travel agent can immediately discuss this with his client. When a
decision has been reached, the travel agent immediately gives his Semantic Web
agent the order for making the reservations and ordering the tickets. Now the
client only will have to come back once for getting his tickets and not twice.
The travel agent not only has been able to propose a cheaper trip as in the
case above but has also saved an important amount of his time.
That a realisation of such a system is interesting is
evident. Clearly, the standard tools do have to be very flexible and powerful
to be able to put into rules the reasoning of this case study (path
determination, itinerary scheduling). All this rules have then to be made by
someone. This can of course be a common effort for a lot of travel agencies.
What exists now? A quick survey learns that there are
web portals where a client can make reservations (for hotel rooms). However the
portal has to be fed with data by the travel agent. There also exist software
that permits the client to manage his travel needs. But all that software has
to be fed with information obtained by a variety of means, practically always
manually.
A partial simulation namely the command of a ticket
for a flight can be found in chapter 10 paragraph 8.
In the case study different information sites have to
communicate one with another. If this has to be done in an automatic way a
standard has to be used. The standard that is used for expressing information
is RDF. RDF is a standard for expressing meta-information developed by the W3C.
The primary purpose was adding meta-information to web pages so that these
become more usable for computers [TBL]. Later the Semantic Web initiative was
launched by Berners-Lee with the purpose of developing standards for automating
the exchange of information between computers. The information standard is
again RDF [DESIGN]. However other standards are added on top of RDF for the
realisation of the Semantic Web (chapter 2).
The following research tasks were defined:
1)
define an inference process on top of RDF and give a
specification of this process in a
functional language.
In the case study the servers dispose of
information files that contain facts and rules. The facts are expressed in RDF.
The syntax of the rules is RDF also, but the semantic interpretation is
different. Using rules implies inferencing. On top of RDF a standardized inference
layer has to be defined. A functional language is very well suited for a
declarative specification. At certain points the inference process has to be
interrupted to direct queries to other sites on the World Wide Web. This
implies a process of distributed
inferencing where the inference process does not take place in one site
(one computer) but is distributed over several sites (see also fig. 1.1).
2)
Investigate which kind of logic is best suited for the
Semantic Web.
Using rules; making queries; finding
solutions; this is the subject of logic. So the relevance of logic for an
inferencing standard based on top of RDF has to be investigated.
3)
Give a proof of the validity and the completeness of
the specified inference process
It is without
question of course that the answers to a query have to be valid. By
completeness is meant that all answers to a query are found. It is not always necessary that all answers
are found but often it is.
4)
Investigate what can be done for augmenting the efficiency
of the inference process
A basic
internet engine should be fast because the amount of data can be high; it is
possible that data need to be collected from several sites over the internet.
This will already imply an accumulation of transmission times. A basic engine
should be optimized as much as possible.
5)
Investigate how inconsistencies can be handled by the
inference process
In a closed
world (reasoning on one computer system) inconsistencies can be mastered; on
the internet, when inferencing becomes a distributed process, inconsistencies
will be commonplace.
As of yet no standard is defined for inferencing on top of RDF. As will
be shown in this thesis there are a lot of issues involved and a lot of choices
to be made. The definition of such a standard is a necessary, but difficult
step in the progress of the Semantic Web. Therefore, in the research the accent
was put on points 1-3 where points 4 and 5 were treated less in depth.
In order to be able to define a standard a definition has to be provided
for following notions that represent minimal extensions on top of RDF: rules,
queries, solutions and proofs.
The
primary research method consists of writing a specification of an RDF inference
engine in Haskell. The engine is given the name RDFEngine. The executable
specification is tested using test cases where the collection of test cases of
De Roo [DEROO] plays an important role. Elements of logic, efficiency and
inconsistence can be tested by writing special test cases in interaction with a
study of the relevant literature.

Fig.1.2. The inference engine RDFEngine in a server
with its inputs and outputs.
In fig.1.2 the global structure of the inference
engine is drawn. The inputs and outputs are in Notation 3 (N3). Notation 3 is
choosen because it is easier to work with than RDF and it is convertible to RDF
syntax. However, during the research, another syntax was developed, called
RDFProlog.
The engine is given a modular structure and different
versions exist for different testing purposes.
What has to be standardized is the input and output of
the block marked ‘Basic engine’ in fig.1.2. For the Semantic Web the precise
contents of this block will not make a part of the standard.
For making the Haskell specification the choice was made for a resolution based engine. A forward reasoning engine could have been chosen as well but a resolution based engine has better characteristics for distributed inferencing. Other choices of engine are also possible. The choice of a particular engine is not the subject of this thesis but rather the definition of the interface to the engine. However, interesting results that were obtained about resolution based inferencing are certainly worth mentioning.
c) The RDF rules mail list [RDFRULES]
d) The classic resolution theory
[VANBENTHEM] [STANFORD]
e) Logic theories and mainly first order logic, constructive logic and
paraconsistent logic. [VANBENTHEM, STANFORDP]
The mail list RDF rules [RDFRULES] has as a
purpose to discuss the extension of RDF with inferencing features, mainly
rules, queries and answers. There are a lot of interesting discussions but no
clear picture is emerging at the moment.
Berners-Lee [DESIGN] seeks inspiration in first
order logic. He defines logic predicates (chapter 2), builtins, test cases and
a reference engine CWM. However the semantics of input and output formats for a
RDF engine are not clearly defined.
A common approach is to use existing logic
systems. RDF is then expressed in terms of these systems and logic (mostly
rules) is added on top e.g. Horn logic or frame logic. An example is found in
[DECKER]. In this case the semantics are those of the used logic.
The implementation of an ontology like OWL
(Ontology Web Language) implies also inferencing facilities that are necessary
for implementing the semantics of the ontology [OWL features].
Euler [DEROO] and CWM take a graph oriented
approach for the construction of the inference program. The semantics are
mostly defined by means of test cases.
The approach taken by my research for defining
the semantics of the inference input and output has two aspects:
1) It is graph oriented and, in fact, the theory is generally applicable to
labeled graphs.
2) From the logic viewpoint a constructive approach is taken for reasons of
verifiability (chapter 5).
3) A constructive graph oriented definition is given for rules, queries,
solutions and proofs.
Chapter two gives an
introduction into the basic standards that are used by RDF and into the syntax
and model theory of RDF. Readers who are familiar with some of these matters can
skip these.
Chapter three exposes the meaning of inferencing using
an RDF facts and rules database. The Haskell specification, RDFEngine, is
explained and a theory of resolution based on reasoning on a graph is
presented.
Chapter four discusses existing software systems.
After the introduction to RDF and inferencing in chapter three, it is now
possible to highlight the most important aspects of existing softwares. The
Haskell specification RDFEngine and the theory presented in the thesis were not
based upon these softwares but upon the points mentionned in 1.5.
Chapter five explores in depth the relationship of RDF
inferencing with logic.
Here the accent is put on a constructive logic
interpretation of the graph theory exposed in chapter three.
Chapter six discusses optimalization techniques.
Especially important is a technique based on the theory exposed in chapter
three.
Chapter seven disusses inconsistencies that can arise
during the course of inferencing processes on the Semantic Web.
Chapter eight gives a brief overview of some use cases
and applications.
Chapter nine finally gives a conclusion and indicates
possible ways for further research.
In this chapter a number of techniques are briefly
explained. These techniques are necessary building blocks for the Semantic Web.
They are also a necessary basis for inferencing on the web.
Readers
who are familiar with some topics can skip the relevant explanations.
XML (Extensible Markup Language) is a subset of SGML (Standard General Markup
Language). In its original signification a markup language is a language which
is intended for adding information (“markup” information) to an existing
document. This information must stay
separate from the original hence the presence of separation characters. In SGML
and XML ‘tags’ are used.
There are two kinds of tags: opening and closing tags.
The opening tags are keywords enclosed between the signs “<” and “>”. An
example: <author>. A closing tag
is practically the same, only the sign “/” is added e.g. </author>. With
these elements alone very interesting datastructures can be built. An example
of a book description:
<book>
<title>
The Semantic Web
</title>
<author>
Tim Berners-Lee
</author>
</book>
As can be seen it is quite easy to build hierarchical datastructures with these elements alone. A tag can have content too: in th