What is the Optimizer, and What Does it Do?
Database systems operate in a client-server model. This model is best described in terms of the function of each client and the server. A client is used to interact with and present data from the server. The server performs all of the database processing and transmits results to the client. In this model, there are usually many clients per a single server. In the context of a database system operating in this model, the database server is responsible for processing the queries presented by the client and returning the results accordingly. This has been termed query shipping where the query is shipped to the server and a payload (data) is returned. The benefits of query shipping are a reduction of communication time for queries and the ability to exploit server resources rather than using the more limited resources of the client to conduct the query. This model also permits a separation of how the data is stored and retrieved on the server from the way the data is used on the client. In other words, the client-server model supports data independence.
So, What is a Query Optimizer?
One of the principal advantages of the relational model introduced by Codd in 1970 is data independence . i.e. the separation of the physical implementation from the logical model. Codd said, “Users of large data banks must be protected from having to know how the data is organized in the machine … Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed.” This separation allowed a powerful set of logical semantics to be developed, independent of a particular physical implementation. The goal of data independence, is that each of the logical elements is independent of all of the physical elements. For example, the logical layout of the data into relations (tables) with attributes (columns) arranged by tuples (rows) is completely independent of how the data is stored on the storage medium.
How an Optimizer Works
One of the challenges of data independence is that database programming becomes a two-part process. First, there is the writing of the logical query -- describing what the query is supposed to do. Second, there is the writing of the physical plan -- which shows how to implement the logical query. The logical query can be written, in general, in many different forms such as a high-level language like structured query language (SQL) [Cham81] or as an algebraic query tree [Tuck04]. For example, in the traditional relational model, a logical query can be described in relational calculus or relational algebra. The relational calculus is better in terms of focusing on what needs to be computed [Date04]. The relational algebra is more concrete, but still leaves out many details involved in the evaluation of a query [Date04]. The physical plan is a query tree [Wern01] in a physical algebra that can be understood by the database system's query execution engine. A query tree is a tree structure in which each node contains a query operator and has a number of children that corresponds to the arity of the operation. The query tree can be transformed via the optimizer into a plan for execution. This plan can be thought of as a program that the query execution engine can execute. There are several phases that a query statement goes through before it is executed; parsing, validation, optimization, plan generation/compilation, and execution [Grae93a]. Figure 6-1 depicts the query processing steps that a typical database system would employ . Each query statement is parsed for validity and checked for correct syntax and for identification of the query operations. The parser then outputs the query in an intermediate form to allow the optimizer to form an efficient query execution plan. The execution engine then executes the query and the results are returned to the client. This progression is depicted in figure 6-1 where once parsing is completed, the query is validated for errors, then optimized, a plan chosen and compiled, and finally the query is executed.
Query Processing Steps # Parsing # Query Validation # Optimization # Query Plan Compilation # Execution
The first step in this process is to translate the logical query from SQL into a query tree in logical algebra. This step is done by the parser. The next step is to translate the query tree in logical algebra into a physical plan. There are generally a large number of plans that could implement the query tree. The process of finding the best execution plan is called query optimization. That is, for some query execution performance measure (e.g. execution time), we want to find the plan with the best execution performance. The goal is that the plan be optimal or near optimal within the search space of the optimizer. The optimizer starts by copying the relational algebra query tree into its search space. The optimizer then expands the search space and finds the best plan. At this level of generality, the optimizer can be viewed as the code generation part of a query compiler for the SQL language. It produces code to be interpreted by the query execution engine, except that the optimizer's emphasis is on producing "very efficient" code. For example, the optimizer uses the database system's catalog to get information (e.g. number of tuples) about the stored relations referenced by the query, something traditional programming language compilers normally do not do [Much97]. Finally, the optimizer copies the optimal physical plan out of its memory structure and sends it to the query execution engine. The query execution engine executes the plan using the relations in the stored database as input, and produces the table of rows that match the query criteria as output. All of this activity requires additional processing time and places a greater burden on the process by forcing database implementers to consider the performance of the query optimizer and execution engine as a factor in their overall efficiency. Ioannidis states this best, “Relational query optimization is an expensive process, primarily because the number of alternative access plans for a query grows at least exponentially with the number of relations participating in the query.”
One of the primary reasons for the large number of query plans is that optimization will be required for many different values of important run-time parameters whose actual values are unknown at optimization time. Database systems make certain assumptions about the database contents (e.g., value distribution in relation attributes), the physical schema (e.g., index types), the values of the system parameters (e.g., number of available buffers), and the values of the query constants.