Achieving interoperability in e-government services with two modes of semantic bridging: SRS and SWRL

Data heterogeneity in the public sector is a serious problem and remains a key issue, as different naming conventions are used to represent similar data labels. E-government efforts in many countries have provided platforms for government entities and their business partners to exchange data through Information and Communication Technologies (ICT) and standards such as RosettaNet (a B2B data exchange standard), EDIFACT (Electronic Data Interchange for Administration, Commerce, and Transport), XML (Extensible Markup Language) and EDI (Electronic Data Interchange). However, these efforts have not significantly resolved data heterogeneity problems, owing to limitations of the standards themselves. One such limitation is the lack of support for data inheritance. To address this problem, with emphasis on Service Oriented Architectures (SOA) and Web Services, a semantically enriched web service for the public sector is needed. We therefore propose an ontology-based solution that supports data inheritance and polymorphism. The goal of this paper is to show how heterogeneous e-government documents can be semantically matched. We propose a shared hierarchical knowledge repository and a detailed process methodology for semantic mediation. A two-part semantic mediation approach using SRS (Semantic Relatedness Scores) and SWRL (Semantic Web Rule Language) is highlighted; the two measures are complementary and provide the semantics necessary for resolving schema heterogeneity. Our approach incorporates a rule-based engine (RacerPro) that reads and executes SWRL rules. We also adopted several tools for proof-of-concept, such as Protege (an ontology editor) and JESS (Java Expert System Shell).


Introduction
The stovepipe phenomenon creates a barrier to the free exchange of data between government bodies. In most e-government initiatives today, the main goal is for government entities to seamlessly exchange data via parsing mechanisms such as EDI and XML. Data from different domains, such as health records and national registration, would require manual data transformations to achieve data interoperability. The SOAP (Simple Object Access Protocol), UDDI (Universal Description, Discovery and Integration) and WSDL (Web Services Description Language) standards provide only syntactic interoperability, whereas semantic mediation provides semantic interoperability, which is more reliable. Without semantics, a shared conceptualisation would not be possible. Semantics, i.e. meanings, provide machine-understandable data that enable machines to reason over data, draw inferences and perform semantic mediation on-the-fly. Currently, only syntactic matching and data parsing are carried out for public agencies to share and exchange data. Without semantics in place, a human domain expert has to mix and match services between agencies manually [13], [16].
With a multitude of e-service requests between government agencies, a human processing these requests has to understand the implicit semantics and choreograph them to establish syntactic interoperability, which evidently causes delay, unavoidable human error and, most importantly, inaccuracy due to the lack of semantic processing. Semantic mediation is based on meanings, unlike syntactic matching, which relies only on string, prefix and suffix matching of schemas [15]. Thus, a less labour-intensive approach to schema integration is needed. The Semantic Web provides several components necessary for realising this vision: ontologies, the Web Ontology Language (OWL), the Resource Description Framework (RDF), description logics and reasoning capabilities. It also provides metadata and data interchange formats such as N3 (Notation 3), Turtle (Terse RDF Triple Language) and N-Triples. Ontologies support formal logics that allow data inference and are more powerful than XML schemas; they are defined as a shared conceptualisation of a domain of knowledge [5].
Currently, e-government public services utilize highly specialized applications that are available only to certain agencies rather than to all agencies participating in the consortium. To ensure interoperability, some countries have implemented XML schemas with Web Service interfaces, e.g. the Danish e-government [21]. The effort is similar to maintaining a shared repository that ensures interoperability for all government systems by using the same schema language, avoiding the reusability problems of syntax-specific definitions [3], [4], [7]. Schema interoperability guidelines are issued for this purpose and in most cases made mandatory, to ensure all government agencies adhere to the same naming conventions. For example, in England schemas had to be registered with UK GovTalk (Site 1); examples of such address schemas are CorrespondenceAddress, HomeAddress, BusinessAddress and ElectoralAddress. In some cases interoperability is limited to a boundary under central control, such as the UN/CEFACT initiative (Site 2).
This creates a barrier for inter-organizational services between public agencies of different domains outside that boundary, and the lack of semantics can make data exchange impossible. Figure 1 illustrates a data heterogeneity problem for inter-organizational services between public agencies. The scenario is a customer who wants to renew his driver's license online. He first logs into the DMV (Department of Motor Vehicles) portal and selects the type of service, then provides essential information such as full_name, DOB (10-10-1965), DMV customer ID (A33-05-7156) and address (1234 Oakton Circle Rd Arlington VA 22202). These details are passed to a license renewal inspector, who verifies the details the customer provided. At the same time, a mode of payment is selected, which is verified by the records inspector before the licence renewal is performed. The data is then passed to the DMV Licence Renewal server, which validates it and passes the details to the license renewal clerk, who updates and verifies the renewal data. When payment is validated with updates from the DMV Licence Renewal server, the records inspector receives this information along with updates exchanged from the DMV Records server. The bank is then notified of the charges and the customer's account is debited; an option to print a receipt is also provided. The customer then waits for his renewed licence to arrive in the mail.
Based on the process above, let us analyze the causes of the data heterogeneity problems in this environment. Data heterogeneity is caused by the separate data definitions, or naming conventions, maintained by the two DMV agencies depicted above, i.e. DMV Licence Renewal and DMV Records, for their customer records. For example, DMV Records maintains first_name, middle_name and last_name, whereas DMV Licence Renewal maintains a complex string called full_name that concatenates first_name, middle_name and last_name. Likewise, address in DMV Licence Renewal is treated as a complex string, while in DMV Records it is divided into street name, city, state and zipcode: for example, ("Oakton Circle Rd Arlington VA 22202") compared with ("Oakton Circle Rd", "Arlington", "VA", "22202"). Since public agencies develop their systems independently of each other, the granularity with which information is expressed can differ greatly. As mentioned earlier, having all agencies adhere to one naming convention and making it mandatory is not practical; this would be analogous to developing a global ontology with a global schema [1], [3]. A more practical approach is to create a semantic bridge between domain-specific local ontologies [14], [15], [16], [17], which provides the semantic interoperability needed for interagency data exchanges.
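To make the granularity mismatch concrete, the two representations of the same customer could be sketched as follows (an illustration of ours, not the paper's implementation; the name value is hypothetical, while the address comes from the scenario above):

```python
# Illustrative only: the same customer seen through each agency's schema.
# Field names follow the paper's example; "John Q Public" is a made-up value.

dmv_records = {                      # fine-grained fields
    "first_name": "John", "middle_name": "Q", "last_name": "Public",
    "street_name": "1234 Oakton Circle Rd", "city": "Arlington",
    "state": "VA", "zipcode": "22202",
}

dmv_licence_renewal = {              # coarse-grained complex strings
    "full_name": "John Q Public",
    "address": "1234 Oakton Circle Rd Arlington VA 22202",
}

# A purely syntactic comparison of the field names finds no overlap at all,
# even though both records describe the same person -- the heterogeneity
# problem in miniature.
shared_keys = dmv_records.keys() & dmv_licence_renewal.keys()
print(shared_keys)   # set()
```

Neither schema is "wrong"; they simply chose different granularities, which is exactly what syntactic parsing alone cannot reconcile.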

Shared Hierarchical Ontology Structures
We propose that domain-specific local ontologies in public agencies, e.g. DMV Licence Renewal and DMV Records, maintain their own naming conventions for their customer records while borrowing general concepts from an upper ontology. This facilitates knowledge reuse and allows domain experts to express their knowledge even when they do not completely agree with each other. This type of structure is referred to as a shared hierarchical ontology [2]. In this structure, knowledge is organized in levels, each inheriting knowledge from upper-level or parent ontologies. For multiple inheritance relationships it is important that knowledge inherited into local ontologies is consistent with the upper ontologies, with no naming clashes. Ontologies are hand-crafted by domain experts in their respective public agencies, and it is impossible to find a perfect ontology that covers all aspects of a shared domain of knowledge, i.e. public services for the DMV. In order to maintain rich definitions, a shared hierarchical structure is crucial [2], [20]. Although the hierarchical structure helps knowledge reuse, it is not realistic to assume that all developed ontologies will be under one central control and available at all times [19]. As such, a distributed model of a hierarchical repository is more appropriate (see figure 3). There are three distributed servers (server 1, server 2 and server 3), which could be maintained by other agencies within the DMV consortium or even by agencies such as the National Registration Department (NRD) or the Internal Revenue Service (IRS). To solve the availability problem, when a new ontology inherits knowledge from an upper ontology, a copy of the inherited knowledge is made available locally. The parent ontology then does not have to be available online at all times for its child ontologies to function properly.
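The copy-on-inherit idea can be sketched minimally as follows (class and concept names are illustrative, not from any real DMV system; a real implementation would store OWL definitions, not strings):

```python
# Sketch of copy-on-inherit: when a local ontology inherits concepts from an
# upper ontology, it keeps a local copy so it keeps functioning even if the
# parent server becomes unreachable.

class Ontology:
    def __init__(self, name, concepts=None):
        self.name = name
        self.concepts = dict(concepts or {})   # concept name -> definition

    def inherit_from(self, parent, concept_names):
        """Copy selected parent concepts into this ontology's local store,
        refusing inconsistent redefinitions (naming clashes)."""
        for c in concept_names:
            if c in self.concepts and self.concepts[c] != parent.concepts[c]:
                raise ValueError(f"naming clash on concept {c!r}")
            self.concepts[c] = parent.concepts[c]

upper = Ontology("GovUpper", {"Person": "agent with legal identity",
                              "Address": "postal location"})
local = Ontology("DMVRecords")
local.inherit_from(upper, ["Person", "Address"])

# The parent can now go offline; the local copy still answers lookups.
print(local.concepts["Person"])
```

The clash check mirrors the consistency requirement stated above for multiple inheritance: an inherited concept must not silently redefine one the local ontology already holds.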

Semantic Bridging Process Methodology
In this section we present the process methodology for semantic bridging (see figure 4) [13], [15]. The first step, ontology development, involves creating or selecting the source ontology (SO) and target ontology (TO). As mentioned earlier, DMV Licence Renewal may maintain a set of definitions different from that of DMV Records. This step is important because the domain-specific local ontologies of the public agencies, e.g. DMV Licence Renewal and DMV Records, must be determined before mediation can be done; if DMV Licence Renewal is selected as SO, then TO is DMV Records. In the second step, the ontologies are checked for equality (E) and inclusiveness (IC), and all disjoint (D) concepts are negated. For simplicity we use the following symbols: 1) C for a concept or class, 2) c for attributes or slots and 3) O for an ontology [15]. The tests for E, IC, CN and D are based on the definitions in [11], as follows:

• Equality (E)
Two classes (C) of DMV Licence Renewal and DMV Records are equal if they: 1) have semantically equivalent data labels, 2) are synonyms or 3) have the same slot or attribute names. For example: 1) C1 = Customer & C2 = Customer, 2) C1 = Customer & C2 = Client and 3) C1 and C2 have the same slot (c) names, e.g. c1 = <CustID, name, address, DOB> and c2 = <CustID, name, address, DOB>.

• Inclusiveness (IC)
Two classes (C) of DMV Licence Renewal and DMV Records are inclusive if the attribute (c) of one is included in that of the other. For example, if ci = StreetAddress and cj = Address, then ci is a type of cj; in other words, StreetAddress is included in Address (ci ≤ cj). This applies to hyponyms.

The Semantic Relatedness Score (SRS) is determined by a hybrid matching technique that combines syntactic and semantic matching. The scores populate a similarity matrix, which serves as the basis for matching the different schemas that the public agencies have defined in their ontologies. In the third step, the respective ontologies are tested for consistency, to ensure that the concepts that have been mapped are in fact consistent and that there are no conflicting concepts. We use a reasoning engine (RacerPro) to check for these inconsistencies; any that are discovered are resolved immediately. Consistency is defined as follows:

• Consistency (CN)
(As defined in [11].)

In step four, SO and TO are merged and integrated. To ensure that schemas are correctly matched, the ontologist performing the merge selects the data labels whose scores exceed a threshold. SRS produces scores between 0 and 1: a score of 0 means the data labels have no match and 1 indicates a perfect match. For instance, if DMV Licence Renewal had the schema element </Address> and DMV Records also had </Address>, the result would be a score of 1. For data labels scoring between 0 and 1, the ontologist is presented with the scores above a threshold of 0.5 (i.e. t > 0.5); all scores below the threshold are kept in a log for the ontologist to refer to later. A detailed matching algorithm that illustrates and explains this process was introduced in [15]. In step five, we use the same reasoning engine (RacerPro) to check for post-matching inconsistencies.
This ensures consistency is maintained even after data labels are matched, and preserves the integrity of the matched data labels. Lastly, in step six, a log report is produced and the results are published. The data is also annotated to document all changes, so that other ontologists can trace the lineage of the data behind any changes made during the mapping process.

We now describe how SRS is determined. SRS is a hybrid measure comprising syntactic matching (SYN) and semantic matching (SEM) [15]. SYN uses approximate string matching, integrating data labels based on the number of deletions, insertions and substitutions needed to transform a source string into a target string. Suppose we had two concepts, </client> and </customer>: if "client" is chosen as the source string, "customer" is the target string. SYN gives the syntactic distance (d) of the two strings on a scale of 0 to 1 [20]; we convert distance (d) to similarity (s) by taking its inverse (s = 1 - d). SYN does not measure similarity based on meanings and does not consider the ontological structure or taxonomy of the concepts being matched, but it gives a gross match, which is why we combine it with SEM. SEM uses representations of meaning to measure similarity, considering word senses through linguistic and cognitive measures. Although many linguistic algorithms exist, we use Lin, Gloss Vector, WordNet Vector and LSA (Latent Semantic Analysis) to determine SRS; scores from these measures are aggregated and normalized to produce SRS. Our experiments have shown that the combination of these four measures provides higher reliability and precision [15]. Figure 5 shows that SRS had higher precision and relevance scores when matched against actual feedback received from human experts, i.e. human cognitive responses (HCR). The idea was to see how closely SRS agreed with a human expert's judgement.
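The SYN side of this computation can be sketched as follows (a minimal illustration of ours: normalized edit distance, the inverse-distance similarity s = 1 - d, and the step-four threshold of 0.5; the four SEM measures and their aggregation into SRS are omitted):

```python
# SYN sketch: Levenshtein distance between two labels, normalized to [0, 1],
# converted to similarity s = 1 - d, then filtered by the 0.5 threshold.

def edit_distance(a, b):
    """Classic Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def syn_similarity(a, b):
    d = edit_distance(a, b) / max(len(a), len(b), 1)  # normalize to [0, 1]
    return 1 - d

pairs = [("client", "customer"), ("address", "address"), ("zipcode", "city")]
matches = {(a, b): syn_similarity(a, b) for a, b in pairs}
accepted = {p: s for p, s in matches.items() if s > 0.5}   # above threshold
print(accepted)   # only the identical labels survive the filter
```

Note how ("client", "customer") scores poorly here despite being synonyms; this is precisely the gap the SEM measures are meant to close.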
This was how we validated our finding that SRS produces a better match than SYN matching alone for the ontology mediation process. An experiment was conducted with 30 word-pairs (see figure 6 and Appendix I) based on a study done at Princeton [6]. In terms of relevance, SRS had a 96.67% match with scores obtained from human subjects, and a 40% match in terms of precision; both were higher than how pure SYN scores fared against human responses [15]. As figure 5 shows, SYN scores have lower relevance (73.33%) and precision (16.67%). Figure 6 shows that the correlation of SRS scores with HCR was positive, r = 0.919 (91.9%). The SRS scores also had a smaller variance against HCR scores than pure SYN scores did. A matching agent runs matches based on SRS and provides the results to the ontologist, making ontology matching far less laborious than performing it manually.

In this section we present how semantic bridging with SRS can be further extended by using rules. SRS provides the ontologist with all schemas likely to be matched, with high reliability and precision [15]. Rules, on the other hand, are cardinality constraints that can be used for matching data labels, schemas and concepts. Rules can be predefined ahead of time so that frequently appearing schemas are matched automatically [10]. To match schemas on names, for instance, we can write a rule that matches </first_name>, </middle_name> and </last_name> with </full_name>. If we had the schemas </street_name>, </city>, </state> and </zipcode> defining an address in one domain ontology, defined as just </address> in another, a simple rule can be executed to match them on-the-fly. As mentioned in sections one and two, DMV Licence Renewal and DMV Records may have domain-specific schema definitions that differ.
Since establishing services between them will be an ongoing task, rules can provide an automatic means of creating homogeneity among the heterogeneous schemas they use [10], [18].
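Such predefined, rule-driven matching can be illustrated as follows (a plain-Python sketch of ours; the paper expresses these rules in SWRL, and the rule table and field names merely follow its full_name/address example):

```python
# Each "rule" names the fine-grained source fields and the coarse target field
# they compose into. Once agreed upon, the rules fire automatically for any
# record that carries all the required parts.

RULES = {
    "full_name": ["first_name", "middle_name", "last_name"],
    "address":   ["street_name", "city", "state", "zipcode"],
}

def apply_rules(record, rules=RULES):
    """Build the coarse-grained view of a record wherever all parts exist."""
    out = {}
    for target, parts in rules.items():
        if all(p in record for p in parts):
            out[target] = " ".join(record[p] for p in parts)
    return out

# Hypothetical DMV Records entry (the name values are made up).
src = {"first_name": "John", "middle_name": "Q", "last_name": "Public",
       "street_name": "Oakton Circle Rd", "city": "Arlington",
       "state": "VA", "zipcode": "22202"}
print(apply_rules(src))
```

Because the rule table is data rather than code, new schema correspondences can be added without touching the matching logic, which is what makes on-the-fly matching of frequently appearing schemas practical.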
In view of the reasoning capabilities available in the Semantic Web, which underpins web services, we propose an approach where rules are used to match concepts. Reasoning is the means by which agents in a knowledge system perform tasks by inference [8], [9]. Given the statement "If X has a son Y, and X has a brother Z, given that X is a male", the agent is able to infer that Y has an uncle, Z. The agent does not need to be explicitly told about the relationship between Y and Z; as long as uncle is defined beforehand, an agent can infer it quickly. SWRL is based on OWL and RuleML (Rule Markup Language). It enables OWL axioms to include Horn-logic rules that can be executed in a knowledge base like the ones public agencies will need to share (see section 2). SWRL rules express an implication between an antecedent (body) and a consequent (head): if the antecedent holds true, then the consequent must hold true as well. In our earlier example, if the antecedents "X has a son Y, X has a brother Z, and X is male" are all true, then the consequent "Y has an uncle, Z" must also hold. With rules in place we can easily automate schema matching on-the-fly, which would otherwise be very labour intensive. As such, SWRL rules and SRS are complementary efforts towards semantic bridging.
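The uncle example above can be sketched as a toy forward-chaining step (our illustration; a real system would use an OWL/SWRL reasoner such as RacerPro rather than hand-rolled inference):

```python
# Toy rule-based inference: from son and brother facts, an agent derives the
# uncle relation without being told it explicitly.
# Facts are (relation, subject, object) triples.

facts = {("son", "X", "Y"), ("brother", "X", "Z"), ("male", "X", "X")}

def infer_uncles(facts):
    """Rule: son(x, y) and brother(x, z) imply uncle(y, z)."""
    derived = set()
    for rel1, x1, y in facts:
        for rel2, x2, z in facts:
            if rel1 == "son" and rel2 == "brother" and x1 == x2:
                derived.add(("uncle", y, z))
    return derived

print(infer_uncles(facts))   # {('uncle', 'Y', 'Z')}
```

The antecedent (the two matched facts) and consequent (the derived triple) mirror the body/head structure of a SWRL rule.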
Based on the example given in section 1, DMV Records maintains first_name, middle_name and last_name, whereas DMV Licence Renewal maintains a complex string called full_name. The address in DMV Licence Renewal is likewise treated as a complex string, while in DMV Records it is divided into street_name, city, state and zipcode. Figure 7 depicts the customer ontology for DMV Records, where customer name and address are expressed with greater granularity, and the client ontology for DMV Licence Renewal, where they are expressed with less granularity. Dotted lines indicate the semantic bridging performed via SWRL rules. We demonstrate how the rules are written in the next section.

Writing rules in SWRL
Based on the process of semantically bridging the ontologies in figure 7, in this section we show how SWRL rules are written to accomplish that task. The first rule associates </Street_Name>, </City>, </State> and </Zipcode> from DMV Records with </Address> in DMV Licence Renewal. Rule 1 is expressed as follows: the antecedent is (hasStreet_Name (I-variable(x1) I-variable(x2)), hasZipcode (I-variable(x2) I-variable(x3))), and thus the consequent is (hasStreetZipAddress (I-variable(x1) I-variable(x3))).
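In SWRL's human-readable syntax, the same rule might be written as follows (a sketch assuming the property names from the expression above; ontology prefixes and IRIs are omitted):

```
hasStreet_Name(?x1, ?x2) ∧ hasZipcode(?x2, ?x3) → hasStreetZipAddress(?x1, ?x3)
```

Everything left of the arrow is the antecedent (body) and everything right of it the consequent (head), matching the implication semantics described earlier.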
Rule 3 combines rules 1 and 2 to determine the address and is expressed as:
• Once the definitions are agreed upon among public agencies, as in rules 1 through 5, future data exchanges are triggered automatically, since they are predefined, allowing data exchange and knowledge inheritance to happen on-the-fly.