Friday, November 23, 2007
Distributed Database Transparency Features
Distributed databases offer transparent access to data. Beyond the authorization mechanism, there is no control over what a program does to the data: it can delete all the data, zero it, or insert random new data. In addition, no comprehensible audit trail records who did what to the data. This interface is convenient for programmers, but it is a real problem for application designers and administrators.
The simplest way to explain the negative aspects of a distributed database is to compare refrigerators to grocery stores. My refrigerator operates like a distributed database. Anyone with a key to my house is welcome to take things from the refrigerator or put them in. There is a rule that whoever takes the last beer should get more at the grocery store. I only give keys to people who follow this rule.
A grocery store could operate like a distributed database. It could hand out keys to trusted customers who agree to pay for any groceries they take. This would be much cheaper than having a lot of clerks standing around collecting money from customers. Why don't any grocery stores operate in this way? Why are they different from refrigerators? Well, it's because refrigerators are convenient for the users but are unmanageable. The clerks manage access to the store's inventory.
Requester-server designs provide an administrative mechanism much like the store clerks. They provide defined, enforceable, auditable interfaces that control access to an organization's data. Rather than publishing its database design and providing transparent access to it, an organization publishes the CALL and RETURN messages of its server procedures. These servers perform requests according to the procedures specified by the site owner; they are the site's standard operating procedures. Requesters send messages to servers, which in turn execute these procedures, much as clerks perform and enforce the store's operating procedures.
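A minimal sketch of such an interface in C may make this concrete. The request codes, field names, and single-account "database" below are hypothetical illustrations, not any vendor's actual interface; the point is that requesters see only the published message formats, never the database design, and every access runs through an enforceable procedure.

```c
#include <stdio.h>

/* Published request/reply message formats -- hypothetical examples. */
/* Requesters see only these structs, never the database design.     */
enum request_code { REQ_GET_BALANCE = 1, REQ_POST_PAYMENT = 2 };

struct request {
    enum request_code code;
    char account[9];     /* 8-character account number + NUL */
    long amount_cents;   /* used by REQ_POST_PAYMENT only    */
};

struct reply {
    int  status;         /* 0 = ok, nonzero = error code     */
    long balance_cents;
};

/* Stand-in for the site's local database. How balances are stored  */
/* is the server's own business and can change at any time.          */
static long balance_cents = 125000;

/* The server enforces the site's standard operating procedures.     */
struct reply serve(const struct request *req)
{
    struct reply rep = { 0, 0 };
    switch (req->code) {
    case REQ_GET_BALANCE:
        rep.balance_cents = balance_cents;
        break;
    case REQ_POST_PAYMENT:
        if (req->amount_cents <= 0) { rep.status = 1; break; } /* rule enforced here */
        balance_cents -= req->amount_cents;
        rep.balance_cents = balance_cents;
        break;
    default:
        rep.status = 2;  /* unknown request: rejected, and auditable */
    }
    return rep;
}

int main(void)
{
    struct request req = { REQ_POST_PAYMENT, "10023456", 2500 };
    struct reply rep = serve(&req);
    printf("status=%d balance=%ld\n", rep.status, rep.balance_cents);
    return 0;
}
```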
Requester-server designs are more modular than distributed databases. A site can change its database design and operating procedures without impacting any requesters. This gives each site considerable local autonomy. The only things a site cannot easily change are the request and reply message formats. In the parlance of programming languages, distributed databases offer transparent types; servers offer opaque types, sometimes called abstract or encapsulated types.
Requester-server designs are more efficient: they send fewer and shorter messages. Consider the example of adding an invoice to a remote node's database. A distributed database implementation would send an update to the account file, insert a record in the invoice file, and then insert several records in the invoice-detail file, adding up to a dozen or more messages. A requester-server design would send a single message to a server, and the server would perform the updates as local operations (see Figure 1). If the communication net is SUE (slow, unreliable, and expensive), then sending a single message is a big savings over multi-message designs.
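A sketch of the single-message alternative, with made-up field names and sizes: the entire invoice travels in one request, and the server applies what would otherwise be a dozen remote updates as local operations.

```c
#include <stdio.h>

#define MAX_DETAILS 20

/* One request message carries the entire invoice. */
struct invoice_detail { int part_no; int quantity; long price_cents; };

struct add_invoice_request {
    char account[9];
    int  n_details;
    struct invoice_detail details[MAX_DETAILS];
};

/* Server side: the account update, the invoice insert, and every    */
/* invoice-detail insert happen locally inside one call -- each of   */
/* these would be a separate remote message under transparent access.*/
int add_invoice(const struct add_invoice_request *req)
{
    long total = 0;
    for (int i = 0; i < req->n_details; i++)
        total += req->details[i].quantity * req->details[i].price_cents;
    /* ... update account record locally ...         */
    /* ... insert invoice record locally ...         */
    /* ... insert each invoice-detail row locally ...*/
    printf("invoiced %s for %ld cents in one message\n", req->account, total);
    return 0;
}

int main(void)
{
    struct add_invoice_request req = { "10023456", 2,
        { { 501, 3, 1999 }, { 77, 1, 45000 } } };
    return add_invoice(&req);
}
```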
To summarize the negatives, applications coded with transparent access to geographically distributed databases have:
• Poor manageability,
• Poor modularity, and
• Poor message performance.
when compared to a requester-server design.
The lunatic fringe of distributed databases promises transparent access to heterogeneous databases (say, an ASCII system accessing an EBCDIC system). These folks promise to hide all the nasties of networking, security, performance, and semantics under the veil of transparency.
This is a wonderful promise, but the prospect of getting people who cannot agree on how to represent the letter "A" to agree to share their raw data is far-fetched. Heterogeneous systems are a very good argument for requesters and servers: the systems need only agree on a network protocol and a requester-server interface. Even that is a little far-fetched unless a standard network and requester-server model emerges.
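A sketch of why the requester-server boundary tames heterogeneity: translation is confined to the message layer, so nobody shares raw files across code pages. The conversion table below is deliberately partial, covering only uppercase letters and digits; a real gateway would need the full code pages.

```c
#include <stdio.h>

/* Convert one ASCII character to EBCDIC -- partial table covering  */
/* uppercase letters and digits only, enough to show the idea.      */
unsigned char ascii_to_ebcdic(unsigned char c)
{
    if (c >= 'A' && c <= 'I') return 0xC1 + (c - 'A');
    if (c >= 'J' && c <= 'R') return 0xD1 + (c - 'J');
    if (c >= 'S' && c <= 'Z') return 0xE2 + (c - 'S');
    if (c >= '0' && c <= '9') return 0xF0 + (c - '0');
    return 0x40;  /* EBCDIC space for anything else */
}

int main(void)
{
    /* The gateway converts only the agreed message fields, once,   */
    /* at the requester-server boundary.                            */
    const char *field = "A10";
    for (const char *p = field; *p; p++)
        printf("ASCII 0x%02X -> EBCDIC 0x%02X\n",
               (unsigned char)*p, ascii_to_ebcdic((unsigned char)*p));
    return 0;
}
```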
Manageability of a Distributed Database
Manageability is the key problem in distributed systems. Let's suppose for the moment that some genius has solved all the technical problems. Let's suppose that there is cheap, fast, reliable communication among all points of the globe. Suppose that everyone agrees to run the same hardware and software. Suppose that everyone trusts everyone else completely and that there are no auditors insisting that we explain how our system works.
Now suppose that we have to design and manage a distributed application in this ideal world. Will we use transparent access to geographically remote data? Probably not.
Why not? Well, a distributed system is a big and complex thing. We will want to change and grow it over time. We may want to add nodes, move data about, redesign the database, change the format or meaning of certain data items, and do other things which are likely to invalidate some programs using the data.
If everyone in the world knows what our database looks like, and we change the design, then their programs will stop working. Some changes may not break programs, but others certainly will. To install a change, we would have to change all the programs that use our data. This might be possible, but at some point, change control will consume all the system's resources.
Modularity is the solution to this. If we only tell people about the interface to our servers, we can change a lot about our database without letting anyone else know. We can support "old" server interfaces when we go to a new design and gradually inform our users about the new interface. They can convert at their leisure.
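A sketch of how a site might keep an "old" interface alive during a redesign. The version field and both request structs are hypothetical: old-format requests are mapped onto the new database design inside the server, so requesters can convert at their leisure.

```c
#include <stdio.h>
#include <string.h>

/* The one thing a site cannot easily change is the message format, */
/* so the format itself carries a version number.                   */
struct req_v1 { char account[9]; };                  /* old interface */
struct req_v2 { char account[13]; char region[3]; }; /* new design    */

struct request {
    int version;
    union { struct req_v1 v1; struct req_v2 v2; } u;
};

void serve(const struct request *req)
{
    struct req_v2 canonical;
    if (req->version == 1) {
        /* Map the old message onto the new design: widen the short */
        /* account number and default the new field.                */
        memset(&canonical, 0, sizeof canonical);
        snprintf(canonical.account, sizeof canonical.account,
                 "0000%s", req->u.v1.account);
        strcpy(canonical.region, "00");
    } else {
        canonical = req->u.v2;
    }
    printf("serving account %s region %s\n",
           canonical.account, canonical.region);
}

int main(void)
{
    struct request old_style = { .version = 1, .u.v1 = { "10023456" } };
    serve(&old_style);   /* old requesters keep working unchanged */
    return 0;
}
```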
So, even in the programmer's ideal world, manageable distributed systems must be structured as modules communicating via messages rather than as programs transparently accessing an integrated distributed database.
The Case For Transparent Access to Cluster Data
What is the proper place for transparent access to distributed databases? Transparent access is very convenient for programmers -- it makes it easy to bring up distributed applications. Coding requesters and servers and making an application modular is extra work. Unfortunately, system administrators control the security of their data and generally do everything in their power to prevent ad hoc queries from running on it. If I want to run an ad hoc query on someone else's data, I have to get permission. That turns out to be not very ad hoc after all.
Clustering is the real application of transparent access to distributed data. To appreciate clusters, you have to appreciate the quandary of computer vendors. Almost all vendors have standardized on a single architecture: IBM wishes it had only System/360, DEC wishes it had only VAX, and so on. The vendors then build 1, 2, 3, 4, 5, ... MIP engines for that architecture. Most vendors are limited to about 15MIPS per processor right now. To go beyond that, they must combine several processors and convince the customer that the resulting price/performance adds up to more than 15MIPS.
Clusters offer an approach to this problem. The vendor builds a slow, cheap cpu (say 1MIP) and a fast, expensive cpu (say 10MIPS). The vendor does the same for disks and communication controllers -- making a cheap box and a high-performance box. He then offers software that lets the customer use between 1 and 100 processors clustered as a single system. This gives the customer a 1MIP to 100MIP range with the cheap engines and a 10MIP to 1000MIP range with the expensive boxes (see Figure 3).
Clustering offers both the customer and the vendor significant advantages. The customer can buy just what he needs and grow in small increments as he needs more. The vendor has two advantages: first, it need only design and support a few module types (disks, cpus, communications, ...); second, it can build systems which far exceed the power of the non-clustered vendors. Apollo, DEC, Teradata, Tandem, and Sun have each taken this approach. Of course, if the vendor or customer programs in a bottleneck, then the cluster cannot grow beyond that bottleneck. Successful vendors and customers have avoided such bottlenecks -- it is possible, but the many failures indicate that it is not easy.
Figure 3. A graph showing the growth in throughput as processors are added to a cluster. The graph shows two families of processors, one capable of unit throughput per cpu and the other capable of 10 units per cpu. Distributed database software provides this kind of linear growth for clustered systems.
The arguments against geographically distributed databases do not apply to clustered systems. A cluster and its operators are typically in a single room. SUE is not a problem. The people have face-to-face contact and the computers have duplexed, high-speed buses among them.
A cluster is like a centralized system, so it can be managed as one. For small clusters the local autonomy derived from modularity may be moot. Reviewers of this paper took strong exception to that statement. They argue that any cluster which supports several applications will operate as a requester-server system just to enforce the modularity. Large applications must be decomposed into independent subsystems each of which is managed independently. The centralized cluster example in [3] is actually managed as five cooperating applications each with its own server interfaces to the others.
A distributed database serves cluster applications nicely, allowing data to be partitioned among any disks in the cluster and allowing servers to run on any cpus in the cluster. Because intra-cluster communication is fast and cheap, the cost of distributing data in the cluster is negligible -- well, perhaps not negligible, but at least acceptable. Based on Tandem's experience, the message-based design required for clustered systems uses about twice as many data moves and instructions as a "conventional" design. So clustered systems "waste" about half the MIPS in order to get a software design that supports modular growth within a cluster without bottlenecks. Tandem did this because it offers mirrored disks, duplexed data paths, and so on for fault tolerance -- other vendors have been reluctant to sacrifice a factor of two. And yet, Tandem and Teradata systems are competitive with those of other vendors, and they are the only vendors building functional 100MIP clusters.
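A sketch of the kind of partitioning the cluster's distributed database provides. The modulo-hash scheme below is one common choice, assumed here for illustration rather than taken from any vendor's design: records are spread across disks by key, and any cpu's server can compute where a record lives without a central directory.

```c
#include <stdio.h>

#define N_DISKS 16   /* disks in the cluster -- an assumed configuration */

/* Map a record key to the disk that holds it. Every cpu in the     */
/* cluster runs the same function, so any server can find any       */
/* record, and adding requests spread evenly across the disks.      */
unsigned partition(unsigned long key)
{
    return (unsigned)(key % N_DISKS);
}

int main(void)
{
    unsigned long keys[] = { 10023456, 10023457, 77000019 };
    for (int i = 0; i < 3; i++)
        printf("record %lu lives on disk %u\n", keys[i], partition(keys[i]));
    return 0;
}
```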