These slides were made for the presentation of the Master Thesis of Guido Naudts with the aim of obtaining the degree of engineer in software science. The coaching and graduation committee was composed as follows:
Chairman: dr. J.T. Jeuring, professor at the Open University
Secretary: ir. F.J. Wester, senior lecturer at the Open University
Coach: ir. J. De Roo, researcher at Agfa Gevaert.
The main goal of the thesis was to study inferencing in connection with RDF, the Resource Description Framework, a standard of the W3C (World Wide Web Consortium).
The Semantic Web will be composed of a  series of standards. These standards have to be organised into a certain structure that is an expression of their interrelationships. A suitable structure is a hierarchical, layered structure.
The slide illustrates the different parts of the Semantic Web in the vision of Tim Berners-Lee. The notions are explained in an elementary manner here.
Layer 1. Unicode and URI
At the bottom there is Unicode and URI. Unicode is the universal character encoding: it encodes the characters of all the major languages in use today.
Layer 2. XML, namespaces and XML Schema
Layer 3. RDF and RDF Schema
The first two layers consist of basic internet technologies. With layer 3 the Semantic Web begins.
RDF Schema (rdfs) introduces some basic ontological notions, for example the notions “Class” and “subClassOf”.
Layer 4. The ontology layer
The definitions of rdfs are not sufficient. A more extensive ontological vocabulary is needed. This is the task of the Web Ontology Working Group of the W3C, which has already defined OWL (the Web Ontology Language) and OWL Lite (a subset of OWL).
Layer 5. The logic layer
In the case study the use of rules was mentioned. For expressing rules a logic layer is needed. An experimental logic layer exists.
Layer 6. The proof layer
In the vision of Tim Berners-Lee the production of proofs is not part of the Semantic Web. The reason is that the production of proofs is still a very active area of research and it is by no means possible to standardise it yet. A Semantic Web engine should only need to verify proofs. Someone sends to site A a proof that he is authorised to use the site; site A must then be able to verify that proof. This is done by a suitable inference engine. Three inference engines that use the rules that can be defined with this layer are CWM, Euler and RDFEngine, the engine developed as part of this thesis.
Layer 7. The trust layer
Without trust the Semantic Web is unthinkable. If company B sends information to company A, but there is no way that A can be sure that this information really comes from B or that B can be trusted, then there remains nothing else to do but throw that information away. The same holds for exchanges between citizens. The trust has to be provided by a web of trust that is based on cryptographic principles. The cryptography is necessary so that everybody can be sure that his communication partners are who they claim to be and that what they send really originates from them.
The trust policy is laid down in a “facts and rules” database. This database is used by an inference engine like RDFEngine. A user defines his policy using a GUI that produces a policy database. A policy rule might be, e.g.: if the virus checker says ‘OK’, the format is .exe and it is signed by ‘TrustWorthy’, then accept this input.
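In the Prolog-like rule notation that is introduced further on in this presentation, such a policy rule could look roughly as follows (the predicate names are invented for the illustration):
virusCheck(File, ok), format(File, exe), signedBy(File, trustWorthy) :> accept(File).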
The slide might create the impression that the sole purpose of this whole layered building is to implement trust on the internet. It is indeed necessary for implementing trust, but once the pyramid of the slide comes into existence, all kinds of applications can be built on top of it.
Layer 8. The applications
This layer is not in the figure; it is the application layer that makes use of the technologies of the underlying 7 layers. An example might be two companies A and B exchanging information where A is placing an order with B.
Before discussing the specific research goals, the global context of the research is discussed using a case study. A travel agent in Antwerp has a client who wants to go to St. Tropez in France. There are rather a lot of possibilities for composing such a voyage. The client can take the train to France, or he can take a bus or train to Brussels and then the airplane to Nice in France, or the train to Paris and then the airplane or another train to Nice. The travel agent explains to the client that there are a lot of possibilities. During his explanation he gets an impression of what the client really wants. The travel agent agrees with the client about the itinerary: by train from Antwerp to Brussels, by airplane from Brussels to Nice and by train from Nice to St. Tropez. This still leaves room for some alternatives. The client will come back to make a final decision once the travel agent has told him by mail that he has worked out some alternative solutions, like the price of first class vs. second class, etc.
Note that the decision about the itinerary is not very well founded; only very crude price comparisons have been done, based on some internet sites that the travel agent consulted during his conversation with the client. A very cheap flight from Antwerp to Cannes has escaped the attention of the travel agent. The travel agent will now further consult the internet sites of the Belgian railways, the Brussels airport and the French railways to get some alternative prices, departure times and total travel times.
Now let’s compare this with the hypothetical situation that a full-blown Semantic Web would exist. In the computer of the travel agent resides a Semantic Web agent that has at its disposal all the necessary standard tools. The travel agent has a specialised interface to the general Semantic Web agent. He fills in a query in his specialised screen. This query is translated to a standardised query format for the Semantic Web agent. The agent consults its rule database. This database of course contains a lot of rules about travelling, as well as facts, e.g. facts about internet sites where information can be obtained. There are a lot of ‘path’ rules: rules for composing an itinerary. The agent contacts various other agents, like the agent of the Belgian railways, the agent of the French railways, and the agents of the airports of Antwerp, Brussels, Paris, Cannes, Nice etc. With the information received, its inference rules about scheduling a trip are consulted. This is all done while the travel agent is chatting with the client to detect his preferences. After some 5 minutes the Semantic Web agent gives the travel agent a list of alternatives for the trip; now the travel agent can immediately discuss this with his client. When a decision has been reached, the travel agent immediately gives his Semantic Web agent the order to make the reservations and order the tickets. Now the client only has to come back once to get his tickets, not twice. The travel agent has not only been able to propose a cheaper trip than in the scenario above, but has also saved an important amount of time.
The following research questions were defined:
1) Define an inference process on top of RDF and give a specification of  this process in a functional language.
In the case study the servers have at their disposal information files that contain facts and rules. The facts are expressed in RDF. The rules are also written in RDF syntax, but their semantic interpretation is different. Using rules implies inferencing. On top of RDF a standardized inference layer has to be defined. A functional language is very well suited for a declarative specification. At certain points the inference process has to be interrupted to direct queries to other sites on the World Wide Web. This implies a process of subquerying, where the inference process does not take place at one site (one computer) but is distributed over several sites.
2) Is the specified inference process sound and complete?
It goes without saying that the answers to a query have to be valid. By completeness is meant that all answers to a query are found. It is not always necessary that all answers are found, but often it is.
3) Which kind of logic is best suited for the Semantic Web?
Using rules, making queries, finding solutions: this is the subject of logic. So the relevance of logic for an inferencing standard built on top of RDF has to be investigated.
4) What can be done for augmenting the efficiency of the inference process?
A basic internet engine should be fast because the amount of data can be high; it is possible that data need to be collected from several sites over the internet. This will already imply an accumulation of transmission times. A basic engine should be optimized as much as possible.
5) How can inconsistencies be handled by the inference process?
In a closed world (reasoning on one computer system) inconsistencies can be mastered; on the internet, when inferencing becomes a distributed process, inconsistencies will be commonplace.
An RDF file is a set of triples. A triple is composed of a subject, a predicate and an object. A triple can be a fact or a rule.
A fact: child(christine, elza).
child is the predicate, christine is the subject and elza is the object of the triple.
The notation that is used is a Prolog-like syntax where variables are indicated by capitals and constants start with a lowercase letter.
A rule is also a triple. The predicate is ‘:>’ which is the implication sign.
In the rule:
child(X,Y), child(Y,Z) :> grandparent(Z,X).
the subject is: child(X,Y), child(Y,Z)
and the object is: grandparent(Z,X)
This rule says that, if X is a child of Y and Y is a child of Z then Z is a grandparent of X.
And of course a query can be done:
give all Z that are grandmother of X.
In this slide some aspects are discussed that are typical for the Semantic Web and that a reasoning engine for RDF has to take into account.
 subqueries:
as was shown in the case study, it must be possible for the inferencing program to direct queries to other sites and get answers in return.
 closed/open world: this will be explained in a following slide
 verifiability/trust -> origin
databases and results must be verifiable, i.e. it must be possible to check whether an RDF triple can be accepted or not. The link with the trust system is obvious. Verifiability/trust also points to the necessity of keeping track of the origin of all triples and, within a triple, of its labels (not all labels of a triple necessarily originate from the same site).
 inconsistencies
When receiving information from many different sites, the danger of inconsistency is higher than in a stand-alone application.
This slide represents the structure of the reasoning program. The upper blocks represent the input. All input is put in a concrete RDF syntax like the one shown in a previous slide; other syntaxes are e.g. Notation 3 and the graph syntax. The first block contains metarules: these are general rules that can be used by a set of applications. This block might also contain rules for ontologies.
The second block represents the application dependent data.
The third block represents a question asked to the system.
The first of the middle blocks represents the reasoning kernel of the program. It uses an abstract RDF syntax (for this thesis specified in Haskell). This block can direct a query to another site (second block in the middle) and receive a response. The lowest block represents the output (solutions and proofs of those solutions) of the program. The output is again in a concrete RDF syntax.
Several versions were made of the program but there are two main versions:
1) The first version has a rather classic structure (Prolog-like) and was thoroughly tested using the test cases of De Roo.
2) The second version has a structure that is better adapted to the needs of the Semantic Web; it is more modular; it can perform subqueries and respond to subqueries, which makes it suited for a client-server configuration. It is extensible (builtins, i.e. extra functions like e.g. print or math functions, can be added). It uses a hashtable for storing the RDF triples so that all triples that match with a certain goal can be retrieved at once.
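As an illustration of this last point, here is a minimal Haskell sketch of such an index; Data.Map stands in for the hashtable and the types are simplified, so this is not the engine's actual code:

import qualified Data.Map as Map

type Triple      = (String, String, String)     -- (predicate, subject, object)
type TripleStore = Map.Map String [Triple]      -- triples keyed on their predicate

insertTriple :: Triple -> TripleStore -> TripleStore
insertTriple t@(p, _, _) = Map.insertWith (++) p [t]

-- all stored triples that could match a goal with the given predicate,
-- retrieved in one lookup
matchCandidates :: String -> TripleStore -> [Triple]
matchCandidates p = Map.findWithDefault [] p

For example, matchCandidates "child" store returns every child triple at once, which is exactly what the engine needs when it tries to match a goal.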
This slide demonstrates the inner workings of the inferencing kernel. The mechanism used is ‘resolution based inferencing’ also called ‘backward reasoning’.
I repeat the database here:
Rules:
child(X, Y), child(Y, Z) :> grandparent(Z, X).
grandparent(Z, X), gender(Z, female) :> grandmother(Z, X).
Facts:
child(christine, elza).
child(wim, christine).
gender(elza, female).
The query is: grandmother(Z,X). The grandmother rule says that somebody is a grandmother when the person is a grandparent and her gender is female. We now have to find somebody who is a grandparent with gender female. Two new nodes are drawn in the search tree:
grandparent(Z, X) and gender(Z, female).
Now some Z is a grandparent when (a certain) X is a child of (a certain) Y and Y is a child of Z. Two nodes are added again: child(X, Y) and child(Y, Z). Now no more rules can be applied. But the triple child(X,Y) can match with two facts. Let’s first substitute X by christine and Y by elza to get: child(christine, elza). In the figure we see that the node child(Y,Z) now becomes: child(elza, Z). This means we have to find the parent Z of elza; but no such fact is in the database and therefore a cross is drawn over this goal. So the substitution that was done (X by christine and Y by elza) was no good. A backtrack must be done and another alternative has to be tried. Now we substitute X by wim and Y by christine. The other child node now becomes: child(christine, Z) and we see that Z can now be substituted by elza, and the node gender(Z, female) now becomes gender(elza, female), which is confirmed by a fact in the database. So now we have found the solution: grandmother(elza, wim). The system will still try to find other solutions, but none are found and inferencing terminates.
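For readers who want to see this mechanism in code, the following is a small self-contained Haskell sketch of backward reasoning over the example database. The names are illustrative and not those of RDFEngine, and the renaming of rule variables (discussed at the end of these notes) is omitted, which happens to be harmless for this small example:

data Term   = Var String | Con String deriving (Eq, Show)
data Triple = Triple String Term Term deriving (Eq, Show)   -- predicate subject object
data Rule   = Rule [Triple] Triple                          -- antecedents :> consequent

type Subst = [(String, Term)]

walk :: Subst -> Term -> Term
walk s (Var v) = maybe (Var v) (walk s) (lookup v s)
walk _ t       = t

unifyTerm :: Subst -> Term -> Term -> Maybe Subst
unifyTerm s a b = case (walk s a, walk s b) of
  (x, y) | x == y -> Just s
  (Var v, t)      -> Just ((v, t) : s)
  (t, Var v)      -> Just ((v, t) : s)
  _               -> Nothing

unify :: Subst -> Triple -> Triple -> Maybe Subst
unify s (Triple p a b) (Triple q c d)
  | p /= q    = Nothing
  | otherwise = do s' <- unifyTerm s a c; unifyTerm s' b d

facts :: [Triple]
facts = [ Triple "child"  (Con "christine") (Con "elza")
        , Triple "child"  (Con "wim")       (Con "christine")
        , Triple "gender" (Con "elza")      (Con "female") ]

rules :: [Rule]
rules = [ Rule [ Triple "child" (Var "X") (Var "Y"), Triple "child" (Var "Y") (Var "Z") ]
               ( Triple "grandparent" (Var "Z") (Var "X") )
        , Rule [ Triple "grandparent" (Var "Z") (Var "X"), Triple "gender" (Var "Z") (Con "female") ]
               ( Triple "grandmother" (Var "Z") (Var "X") ) ]

-- prove a list of goals; the result list makes the backtracking implicit:
-- the failed christine/elza branch simply contributes no solutions
solve :: Subst -> [Triple] -> [Subst]
solve s []     = [s]
solve s (g:gs) =
     [ r | f <- facts,           Just s' <- [unify s g f], r <- solve s' gs ]
  ++ [ r | Rule body c <- rules, Just s' <- [unify s g c], r <- solve s' (body ++ gs) ]

Evaluating solve [] [Triple "grandmother" (Var "Z") (Var "X")] yields one substitution, binding Z to elza and X to wim, exactly as in the search tree above.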
This slide shows that the resolution-based inference process can be represented in the form of a Finite State Machine (FSM). The ovals in the slide indicate the states; the rectangles are functions that are executed when passing from one state to another. Here follows an overview of all state changes.
1 -> 2
The process starts in state 1, where the menu is displayed. Besides functions that are not shown in the figure, an inferencing process can be started from the menu by selecting a database and a query.
 
2 -> 3
In state 2 the data structures for the inferencing process are prepared and the query is entered in the goallist. Then the main state, 3, of the process is entered.
 
3 -> 3
Depending on the next triple in the goallist there are three possibilities:
a) the triple contains a builtin, e.g. (Txxx, io:print, ”Message to the user”). Some following triples might also be used. After executing the builtin, the FSM will return to state 3.
b) the triple and a certain number of following triples define a subquery. The subquery is directed to another server. The returned results are integrated into the database and the FSM returns to state 3.
c) the engine will take a goal and will try to match it with the database. One alternative is selected and added to the goallist and the other alternatives are pushed on the stack. If there were alternatives the FSM returns to state 3. If there were not, it goes to state 6.
 
3 -> 4
The program detects an empty goallist. This means all goals have been ‘proved’ and the resolution refutation process is complete. A solution has been found and the FSM passes to state 4.
 
4 -> 5
A solution has been found. The FSM passes control to the backtrack function that looks if alternatives are present on the stack for further investigation. If this is the case, then control is passed to state 5.
 
3 -> 6
The solve function tries to unify a goal with the database; however the unification process fails. The choose function detects the lack of alternatives and the FSM passes to state 6.
 
6 -> 5
The current search path is aborted.  The backtrack function looks for further alternatives; if this is the case, the FSM passes control to state 5.
 
5 -> 3
A backtrack has been done. An alternative will be chosen from the stack.
 
4 -> 7
6 -> 7
No more alternatives are present on the stack. The inferencing process is finished. The FSM passes to state 7.
 
7 -> 1
The inferencing process is finished. Results are displayed and/or used and control is passed to state 1.
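The FSM just described can be summarised in a compact Haskell sketch. The constructor names mirror the states 1 to 7 of the slide; the real engine's data structures are of course richer, so take this as an illustration only:

data State = Menu        -- 1: display the menu, select a database and a query
           | Prepare     -- 2: build the data structures, put the query in the goallist
           | Inference   -- 3: main state: process the next goal
           | Solution    -- 4: the goallist is empty, a solution has been found
           | Backtrack   -- 5: an alternative is taken from the stack
           | Failure     -- 6: the current search path is abandoned
           | Finished    -- 7: no alternatives are left, show the results
           deriving (Eq, Show)

-- what the engine finds when it looks at the current goal in state 3
data GoalEvent = BuiltinDone | SubqueryDone | AlternativesFound
               | NoAlternatives | GoalListEmpty

step3 :: GoalEvent -> State
step3 BuiltinDone       = Inference   -- 3 -> 3
step3 SubqueryDone      = Inference   -- 3 -> 3
step3 AlternativesFound = Inference   -- 3 -> 3
step3 NoAlternatives    = Failure     -- 3 -> 6
step3 GoalListEmpty     = Solution    -- 3 -> 4

-- from state 4 or 6: backtrack if the stack still holds alternatives
-- (5, and from there back to 3), otherwise the process is finished (7, then 1)
afterSolutionOrFailure :: Bool -> State
afterSolutionOrFailure stackHasAlternatives
  | stackHasAlternatives = Backtrack
  | otherwise            = Finished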
This slide explores the consequences of the open/closed world problem.
A difference is made between open and closed sets.
An example of an open set is the set of all large internet sites, e.g. all sites with more than 100,000 HTML pages. Characteristics of this set are: it has no known limits (there exists of course a limit at any point in time, but it is not known; for a computer this is the same as no limit).
 its complement i.e. the set of all small internet sites is also an open set
 given a certain site there are three possibilities:
 a) the site is a large site
 b) the site is a small site
 c) the size of the site is not known (will happen frequently)
 So the ‘law of excluded middle’ does not apply.
 if the existence of a universal quantifier is defined by the possibility of iteratively naming all elements of the set, then such a quantifier does not exist for an open set. A quantifier ‘for_all_those_known’ does exist.
In contrast, a closed set will be considered, e.g. the set of all members of a club.
 a limit exists of course for this set: it is the number of members of the club.
 the complement (all non-members) still is an open set
 the ‘law of excluded middle’ does apply: a person is either a member of the club or he is not.
 it is possible to name all members of the club iteratively, so a universal quantifier does exist.
Other kinds of sets exist e.g. the natural numbers.
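The open-world situation can be made concrete with a tiny Haskell illustration (the 100,000-page threshold is the one from the example above; the function name is made up): a question about an open set has three possible answers, so the law of excluded middle is of no help.

data Answer = Yes | No | Unknown deriving (Eq, Show)

-- Nothing means: the number of pages of the site is not known
isLargeSite :: Maybe Integer -> Answer
isLargeSite Nothing = Unknown
isLargeSite (Just pages)
  | pages > 100000 = Yes
  | otherwise      = No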
2) Is the specified inference process sound and complete?
It goes without saying that the answers to a query have to be valid. By completeness is meant that all answers to a query are found. It is not always necessary that all answers are found, but often it is.
Before discussing the characteristics soundness and completeness, it is necessary to say a word about the graph syntax of RDF. Recall that an RDF triple is composed of a subject, a predicate and an object. In the graph syntax the predicate is the label of an arc; subject and object form the labels of nodes of the graph. One node can represent at the same time the subject of one triple and the object of a different triple. The second graph in the slide represents the transitive closure for the predicate ‘taller’. It is only illustrative here.
The characteristics soundness and completeness of the program were proved by establishing a series of lemmas. These lemmas use reasoning on a graph as proof mechanism. I refer to my master thesis for more details on these lemmas. Besides giving a proof of the soundness and completeness of the program, I was able to deduce from this theory:
 an elegant proof of the found solutions (shown in the demo)
 under certain conditions it is possible to deduce a general rule (= procedure)  from a solution and its proof (shown in the demo).
3) Which kind of logic is best suited for the Semantic Web?
Using rules, making queries, finding solutions: this is the subject of logic. So the relevance of logic for an inferencing standard built on top of RDF has to be investigated.
The goal here is to compare ‘constructive logic’ with ‘first order logic’. However, instead of discussing the logic systems themselves, the two systems are compared by means of an analogy in the three following slides. The constructive aspect here is that each triple in a proof (or a database) must be an existing triple, i.e. the namespaces of the labels should exist, the notions expressed by the labels should be defined by the namespaces, the sites should be trusted and the origin of the triple should be a trusted site.
Thus constructive logic is used for reasons of verifiability.
This famous drawing by Escher represents a building, but a building that is impossible to build. In the analogy it represents first order logic. With first order logic it is possible to make definitions that are impossible to construct.
An example: if the result of a query is:
T1 or T2
where T1 and T2 are triples. T1 might be true or T2 might be true, but which one is true?
Two points however:
*  it is an analogy and an analogy never fits completely
*  this is not an attack on first order logic: it is possible to make fine buildings too with first order logic; besides first order logic is a monument of science with huge merits, certainly in the field of mathematics.
This plan represents a constructive database. It is less impressive than the ‘Escher’ building, but it can be built.
And this is the result of building the plan from the previous slide. It represents the answer to a query consisting of ‘real’ triples.
This slide shows an RDF graph consisting of a single triple. However, this is not a valid RDF graph. Why not?
Because all labels in a valid RDF graph need to have a ‘namespace’.
A namespace is a URI (Uniform Resource Identifier) like the URI in the upper part of the slide:
http://www.daml.org/2001/01/gedcom/gedcom#
Often we will want to abbreviate those namespaces e.g. gd is the abbreviation for the URI above.
gd:zoon(def:Frank, def:Christine)
is then a valid triple. ‘def’ might be the abbreviation of a namespace of my own (everybody is entitled to create (unique) namespaces). When no namespace is indicated, the default namespace applies. In the examples of this slide show, namespaces were omitted for reasons of clarity.
What is the importance of namespaces?
A namespace indicates the origin of a label. The trust system might eventually reject a triple because a label from a non-trusted site is present. Also, the notions occurring in the triple, like ‘zoon’, ‘Frank’ and ‘Christine’, must be defined in the corresponding namespace. Note that the origin of a triple and the origins of its labels are not necessarily the same.
This system thus plays a role in the verifiability of a triple.
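A toy Haskell sketch of this verifiability idea (the field names and the list of trusted namespaces are invented for the illustration): a triple is only accepted when the namespaces of all its labels belong to trusted sites.

data Label  = Label  { namespace :: String, localName :: String }
data Triple = Triple { subj :: Label, pred :: Label, obj :: Label }

trustedNamespaces :: [String]
trustedNamespaces = [ "http://www.daml.org/2001/01/gedcom/gedcom#" ]

acceptTriple :: Triple -> Bool
acceptTriple t = all ((`elem` trustedNamespaces) . namespace) [subj t, pred t, obj t]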
4) What can be done for augmenting the efficiency of the inference process?
A basic internet engine should be fast because the amount of data can be high; it is possible that data need to be collected from several sites over the internet. This will already imply an accumulation of transmission times. A basic engine should be optimized as much as possible.
This slide shows how a general rule can be deduced from a solution and its proof. The precise technique is explained in the thesis text.
Suppose a query:
grandmother(Z,X)
As was shown in the slide with the search tree, a solution is found after following certain paths in a search tree.
The solution is:
grandmother(elza, wim)
From the solution and its proof (as will be shown in the demo) it is possible to deduce a general rule with the form:
t1, t2, …, tn --> grandmother(X,Y)
where t1, t2, …, tn are triples.
For our simple example the general rule is:
child(X, Y), child(Y, Z), gender(Z, female) :> grandmother(Z, X)
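The following Haskell fragment gives a very rough impression of the idea: collect the facts used in a proof (the leaves of the proof tree) and abstract them into the antecedents of a new rule for the proved goal. It is only a sketch under simplifying assumptions; the precise technique, in particular which constants may be abstracted to variables, is in the thesis text.

data Term   = Var String | Con String deriving (Eq, Show)
data Triple = Triple String Term Term deriving (Eq, Show)
data Proof  = Leaf Triple             -- a fact from the database
            | Node Triple [Proof]     -- a rule application proving this triple

leaves :: Proof -> [Triple]
leaves (Leaf t)    = [t]
leaves (Node _ ps) = concatMap leaves ps

-- simplification: abstract every constant to a variable of the same name;
-- in general only the constants introduced by substitutions may be abstracted
abstractT :: Triple -> Triple
abstractT (Triple p s o) = Triple p (toVar s) (toVar o)
  where toVar (Con c) = Var c
        toVar t       = t

generalRule :: Proof -> ([Triple], Triple)    -- (antecedents, consequent)
generalRule pr@(Node goal _) = (map abstractT (leaves pr), abstractT goal)
generalRule (Leaf t)         = ([], abstractT t)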
A final remark: many other optimization techniques exist. Studying and testing some of them in relation to an RDF inference engine could be the subject of a future thesis.
5) How can inconsistencies be handled by the inference process?
In a closed world (reasoning on one computer system) inconsistencies can be mastered; on the internet, when inferencing becomes a distributed process, inconsistencies will be commonplace.
For the treatment of inconsistencies the used logic system is important; remember that constructive logic leads to verifiable triples and the possibility of using the trust system.
In first order logic the ECQ principle holds:
Ex Contradictione Quodlibet
or: from a contradiction everything can be deduced.
Though this principle is valid for some constructive logic systems, for the WWW it is not wanted. For the treatment of inconsistencies (and errors) the trust system is important. In case of conflict, triples from non-trusted sites can be thrown away (this can of course better be done preventively). At the present moment only theoretical conclusions can be drawn. The advent of an operational Semantic Web is necessary to see which inconsistencies will present themselves.
The accent of the research has been on questions 1, 2 and 3. A lot of puzzling was necessary before a convenient structure for the inferencing engine was found. The search for the proof of the soundness and completeness of the results of the engine also required a good deal of work. The study and comparison of different logics in relation to RDF involved a fair amount of theoretical work as well.
I will again review the questions and mention what was accomplished during the research:
     1) define an inference process on top of RDF and give a specification of  this process in a functional language.
      A program with different versions was written and tested.
    2) is the specified inference process sound and complete?
      A graph theory was established that proves the soundness and completeness of the engine and leads to two applications (deduction of a proof of a solution and deduction of a general rule).
    3) which kind of logic is best suited for the Semantic Web?
       It is shown that RDF with an implication as defined in the thesis uses constructive logic. The use of constructive logic is defended for the Semantic Web.
    4) what can be done for augmenting the efficiency of the inference process?
      Experiments were done with two optimization techniques and some others were discussed.
     5) how can inconsistencies be handled by the inference process?
       The importance of logic for the handling and avoidance of inconsistencies is pointed out. Some theoretical aspects of inconsistency are highlighted.
 
I return now to the case study of the beginning to make an evaluation of the program that was implemented for this thesis. The second version of the program can send a subquery and receive an answer. It can also receive a subquery and send an answer. So in order to have the program working on two sites that exchange queries, all that must be added is a (standard) IP client-server interface, i.e. two modules.
In the case study the program works with large databases. I will suppose that those are relational databases. Then a module must be added that transforms an RDF query into an SQL query, executes the SQL query and transforms the result back to RDF. The transformation of a relational DB to RDF is straightforward:
column-name -> predicate
row-name -> subject
value -> object
This is then an extra module.
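A minimal Haskell sketch of that mapping (illustrative only; a real module would also translate the query to SQL and convert the result set back):

type Triple = (String, String, String)    -- (predicate, subject, object)

rowToTriples :: String             -- the row key (row name)
             -> [(String, String)] -- (column name, value) pairs of one row
             -> [Triple]
rowToTriples rowKey columns = [ (col, rowKey, val) | (col, val) <- columns ]

For example, rowToTriples "train123" [("departure", "Antwerp"), ("arrival", "Brussels")] yields the triples ("departure", "train123", "Antwerp") and ("arrival", "train123", "Brussels").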
In order to do the inferencing that is necessary for this case study, a minimal number of builtins has to be added to the program, e.g. builtins for math, strings, input and output.
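One possible way to organise such builtins is a table from predicate name to a Haskell handler; io:print appeared earlier in these notes, while the math builtin and the exact interface are invented for the illustration:

type Builtin = [String] -> IO Bool    -- arguments of the goal -> success?

builtins :: [(String, Builtin)]
builtins =
  [ ("io:print", \args -> do putStrLn (unwords args); return True)
  , ("math:greaterThan", \args -> case args of
        [a, b] -> return ((read a :: Double) > (read b :: Double))
        _      -> return False) ]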
After the slide ‘Questions’ the demo slides will start.
The program was written using the Hugs interpreter, which is compatible with Haskell 98. After starting WinHugs and loading the program, the ‘start’ command is given.
This slide shows the ‘help’ function.
‘m’ or the ‘menu’ command shows a small menu with some applications.
The demo application is started. When all files are read, the ‘s’ (= step) command is given and the first step in the inferencing process is executed. After each step a certain amount of information is shown. After ‘goal:’ the current goal is shown, i.e. the current RDF triple that the engine tries to resolve. After ‘goallist:’ the current goallist is shown, i.e. the list of the outstanding goals (informally, the triples that still have to be proved). After ‘substitution:’ an empty list is shown in this slide, because no substitutions have been done up till now. After ‘history:’ nothing appears, because there is no history yet. The history will, piece by piece, build the proof of a solution. The variables carry a prefix, e.g. _1?X. This is a renaming of the variables with a unique name, because there are variables with a local scope that can have the same name in different scopes. Though having the same name, they are really different variables. For understanding the events in this trace it is best to ignore these prefixes.
In order to solve the goal grandmother(Z,X), this triple is matched with the consequent of the ‘grandmother’ rule. This rule says that somebody is a grandmother when the person is a grandparent and her gender is female. You can see in the history that this rule has been applied. Two substitutions (of variables by variables) are added to the list of substitutions. The goals grandparent(Z,X) and gender(Z, female) have been added to the goallist. The current goal is always the first goal in the goallist. This is so in this program but could be done differently. Some variables have also received a longer prefix, e.g. 1$_$2?Z. The first number is the so-called level numbering. This number is raised by one at each step. This numbering is necessary for making a difference between substitutions of the same variables during different steps. (This I found a difficult feature to understand when I was making the program.)
In the history you can see that the ‘grandparent’ rule has been applied for resolving the goal grandparent(Z, X). This goal is no longer in the goallist; instead the two goals child(X,Y) and child(Y, Z) have been added to the goallist.
The goal /0 is an empty goal. This is a purely technical aspect. You can ignore this. In the history you see that child(X,Y) is transformed into child(christine, elza) with the substitution X -> christine and Y -> elza, as you can see in the list of substitutions.
The current goal is now:
child(Y,Z) where Y has to be substituted (or better, 2$_$1?Y) by elza, as you can see in the list of substitutions. So really, what is looked for is the mother of elza. But such a fact is not in the database, as will appear from the next slide.
The engine was not able to match the triple child(elza, Z) with a fact in the database so this path in the search tree has to be abandoned. The engine will now perform a backtrack in order to find an alternative path in the search tree. Such an alternative will be retrieved from the stack (if there is one, which is the case now).
You can see now in the history that the engine has now used a different substitution for the triple child(X,Y) namely (see list of substitutions):
X -> wim, Y -> christine.
The history has been changed too compared with the slide ‘demo 7’.
Now the current goal is:
child(Y, Z) where Y must be substituted by christine.
The system has now found that christine is the child of elza.
Now the system still has to prove that:
gender(elza, female).
The last goal has been ‘proved’.
A solution has been found.
The list of substitutions gives all substitutions that were used in order to arrive at a solution.
A proof of the solution is given in a ‘forward reasoning style’.
The facts that were found are shown and the used rules are instantiated (the variables are replaced with labels). Because the substitutions contain the level numbering the search tree can be reconstructed. A trace of the program is of course also a proof of a solution, but in this proof all failure paths of the search tree are missing and only the relevant (successful) paths are given.
Also a general rule is deduced that makes it possible to find a grandmother using only one rule.
In the Prolog-like notation used for RDF, the variables are universal variables with local scope. This means the scope is limited to one rule. If there are two rules, e.g.
son(X,Y) :> parent(Y,X).
brand(X,Y) :> makes(Y,X).
It is clear that the ‘X’ and ‘Y’ of the first rule will not denote the same thing as the ‘X’ and ‘Y’ of the second rule. In order not to confuse variables all variables with local scope are transformed into variables with global scope. This is done by giving them a unique name. The unique name is made by prefixing the variables with ‘_n’ where n is a unique number for each rule.
In the slide above you can see that variables are prefixed with numbers followed by ‘$_$’. This is the ‘level numbering’. Suppose there is a rule:
son(X, Y) :> father(Y, X).
At one level in the inferencing process X is substituted by ‘John’ and at another level by ‘Peter’. Now there are two substitutions:
X -> John
and
X -> Peter.
If then there is a goal: grandfather(X, Y); by what should X be replaced?
Therefore, at each step in the inferencing process variables are renamed by prefixing them with ‘n$_$’, where n is the number of the current level.
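In Haskell the two renamings can be sketched as follows (the exact prefix format of the engine differs slightly; this only illustrates the idea):

data Term = Var String | Con String deriving (Eq, Show)

-- give the variables of rule n a globally unique name, e.g. X becomes _3?X for rule 3
globalise :: Int -> Term -> Term
globalise n (Var v) = Var ("_" ++ show n ++ "?" ++ v)
globalise _ t       = t

-- prefix with the current inference level, e.g. _3?X becomes 2$_$_3?X at level 2
levelise :: Int -> Term -> Term
levelise lvl (Var v) = Var (show lvl ++ "$_$" ++ v)
levelise _   t       = t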