Query Processing in Partly Inaccessible Databases

Survey

One important characteristic of distributed database management systems (DBMS) is that due to network or machine failure the environment may become partitioned into sub-environments that cannot communicate with each other. However, there are various application areas where the individual sub-environments should remain operable in such situations . In particular queries to the database should be processed in an appropriate way.
To this end, the final and all intermediate results of queries in distributed DBMS must be regarded as potentially vague or incomplete. As a consequence, we have to deal with vague values and vague collections during query processing. Whereas vague values have been addressed in various papers dealing with null values, vague collections stemming from inaccessibility represent a novel research topic. To define a closed approach for vague collections, adequate representations are needed. Furthermore the semantics of the operators usually provided by query languages have to be redefined for these representations.

To become more precise, let us consider an example. Assume a distributed DBMS supporting links to represent relationships between objects. Further assume the example database sketched in the figure below. The solid arrows between the rivers and the countries depict flows_through relationships, while the dashed arrows depict has_capital relationships.

[example database]

The objects (rivers, countries, and capitals) are distributed over three segments, segment 2 being currently inaccessible. If we have to evaluate a query in this situation the answer to this query can be based only on the information stored on the accessible segments 1 and 3. However, the potential error in the result of the query should be as small as possible.
The correct answer to the query Q: Select those capitals of countries crossed by the river Rhine or the river Rhone which have more than 3 million inhabitants! would be the set {Berlin, Paris}.
Due to inaccessibility, the usual inside-out evaluation of the query, starting from the two specified rivers and then navigating via the countries to the capitals, reaches only Berlin as a result object. Although Paris itself is accessible, it is - due to unreachability - missing in the result! My approach uses a logical predicate to describe those missing elements. The usefulness of this logical predicate can be demonstrated as follows.
Assume a second query R had collected the capitals Amsterdam, Paris, and Bern. If we compute the intersection S of Q and R, we check the three capitals of R against the logical predicate of Q. Because both Amsterdam and Bern have too few inhabitants they cannot be in Q and hence not in S. For Paris we cannot decide, if it was in Q and so it belongs to the so-called explicit part of the upper bound of S.

To sum up, my representation of a vague set consists of three parts:

  1. the explicit lower bound,
  2. the explicit upper bound, and
  3. the logical predicate.
The logical predicate is useful, for in general there are missing elements which are nevertheless accessible. If we had not used a logical predicate in the above example, Amsterdam and Bern would have had to be inserted into S because of the incompleteness of Q.

Vague multisets are handled quite analogously. In addition, in a vague multiset the number of occurrences of an element can be vague.
Vague lists are more complicated, because when sorting a vague set or a vague multiset, even the arising order can be vague. My approach comprises an adequate representation for vague lists, too.

Although my concepts for the dealing with partial inaccessibility are independent of a particular query language and even of a certain datamodel, they are validated by being implemented into the query language NTT, which is based upon PCTE, the ISO standard for an open repository.


Literature relevant to the topic:


For further information, comments, etc. please contact me by email or have a look at my homepage.