(cn)2

Chapter 2

Advanced Data Selection in Oracle

Displaying Data from Multiple Tables *

Select Statements That Join Data from More than One Table *

Exercises *

Creating Outer Joins *

Exercises *

Joining a Table to Itself *

Exercises *

Group Functions and Their Uses *

Identifying Available Group Functions *

Exercises *

Using Group Functions *

Exercises *

Using the GROUP BY Clause *

Exercises *

Excluding GROUP Data with HAVING *

Exercises *

Using Subqueries *

Nested Subqueries *

Exercises *

Subqueries in Other Situations *

Exercises *

Putting Data in Order with Subqueries *

Exercises *

Using Runtime Variables *

Entering Variables at Run Time *

Exercises *

Automatic Definition of Runtime Variables *

Exercises *

ACCEPT: Another Way to Define Variables *

Exercises *

Chapter Summary *

Two-Minute Drill *

In this chapter, you will understand and demonstrate knowledge in the following areas:

Displaying data from multiple tables
Group functions and their uses
Using subqueries
Using runtime variables

This chapter covers the advanced topics of Oracle data selection. The first area of understanding discussed in this chapter is the table join. The chapter will cover how you can write select statements to access data from more than one table. The discussion will also cover how you can create joins that display data from different tables even when the information in the two tables does not correspond completely. Finally, the use of table self joins will be discussed. The chapter also introduces the group by clause used in select statements and group functions. Use of the subquery is another area covered in this chapter as well. Finally, specification and use of variables is presented. The material in this chapter will complete the user’s knowledge of data selection and comprises 22 percent of OCP Exam 1.

Displaying Data from Multiple Tables

In this section, you will cover the following areas related to displaying data from multiple tables:

Select statements to join data from more than one table
Creating outer joins
Joining a table to itself

The typical database contains many tables. Some smaller databases may have only a dozen or so tables, while other databases may have hundreds. The common factor, however, is that no database has just one table that contains all the data you need. Oracle recognizes you may want data that resides in multiple tables drawn together in some meaningful way. In order to show data from multiple tables in one query, Oracle allows you to perform table joins. A table join is when data from one table is associated with data from another table according to a common column in both tables.

TIP: There must be at least one column shared between two tables in order to join the two tables in a select statement.

If a column value appears in two tables, a relationship can be defined between them if one of the columns appears as part of a primary key in one of the tables. A primary key is used in a table to identify the uniqueness of each row in a table. The table in which the column appears as a primary key is referred to as the parent table, while the column that references the other table in the relationship is often called the child table. The column in the child table relates to the parent table as something called a foreign key. Figure 2-1 demonstrates how the relationship may work in a database.

Figure 1: Primary- and foreign-key relationship between tables

Select Statements That Join Data from More than One Table

When a primary- or foreign-key relationship exists between several tables, then it is possible to join their data. As described in the last chapter, a select statement can have three parts: the select clause, the from clause, and the where clause. The select clause is where you list the column names you want to view data from, along with any single-row functions and/or column aliases. The from clause gives the names of the tables where the data is stored. In a table join, two or more tables are named as sources for data. The final clause is the where clause, which contains comparison operations that will filter out the unwanted data from what you want to see. The comparison operations in a table join statement also have another purpose—to describe how the data between two tables should be joined together, as shown in the following code:

SELECT a.antique_name, a.antique_cost,
a.storage_box_number, b.box_name, b.box_location
FROM antique a, storage_box b
WHERE a.antique_name in ('VICTROLA','CAMERA','RADIO')
AND a.storage_box_number = b.storage_box_number;

A.ANTIQUE_N A.ANTIQ A.STOR B.BOX_NAME B.BOX_LOCATION
----------- ------- ------ ---------- --------------
VICTROLA 150.00 3 ALPHA-3 ALPHA BLDG
CAMERA 75.00 4 ALPHA-4 ALPHA BLDG
RADIO 200.00 4 ALPHA-4 ALPHA BLDG

Many important things are happening in this sample statement, the most fundamental of which is the table join. The from clause in this statement is the clearest indication that a table join statement is taking place. In this statement, the from clause contains two table names, each of which is followed by a letter. Table ANTIQUE in this example is followed by the letter a, while table STORAGE_BOX is followed by the letter b. This display demonstrates an interesting concept in Oracle—not only can the columns in a select statement have aliases, but the tables named in the from clause can have aliases as well.

In most cases, tables with columns in common should have the same name for those columns, because then it becomes easier to identify that they contain the same data. However, this common name can lead to ambiguity when the Oracle SQL processing mechanism attempts to parse the statement and resolve all database object names. If each column isn’t linked to the particular tables identified in the from clause, Oracle will return an error. By specifying an alias for each table in the from clause, and then prefixing each column name in the select statement with the alias, you avoid ambiguity in the SQL statements while also avoiding the need to type out a table name each time a column is specified. The following code block illustrates the extra coding necessary when referencing columns if table aliases aren’t used:

SELECT antique_name, antique_cost,
antique.storage_box_number, box_name, box_location
FROM antique, storage_box
WHERE antique_name in ('VICTROLA','CAMERA','RADIO')
AND antique.storage_box_number = storage_box.storage_box_number;

ANTIQUE_NAM ANTIQUE ANTIQU BOX_NAME BOX_LOCATION
----------- ------- ------ -------- ------------
VICTROLA 150.00 3 ALPHA-3 ALPHA BLDG
CAMERA 75.00 4 ALPHA-4 ALPHA BLDG
RADIO 200.00 4 ALPHA-4 ALPHA BLDG

Notices something else. Neither the alias nor the full table name need be specified before a column that appears in only one table specified by the from clause. Ambiguity is only produced when the column appears in two or more of the tables specified in the from clause.

The next area to cover in creating queries that join data from one table to data from another table is the use of comparison operations in the where clause of the statement. The where clause must include one comparison that links the data of one table to the data in the other table. Without this link, all data from both tables is generated as output from the query in something referred to as a Cartesian product. A Cartesian product is the product of all data coming from one table multiplied by the product of the other.

There are two possibilities available in order to link the data from one table to another: equality comparisons or inequality comparisons. Tables joined on equality statements in the where clause are referred to as an "inner" join, or equijoin. An equijoin will return data where the value in one column in one table equals the value in the column of the other table. In the situation where the tables are being joined based on an inequality statement in the where clause, typically the data returned may have less meaning unless a range of data is specified and the actual link between the two tables is an equality statement, as shown below:

SELECT antique_name, antique_cost,
antique.storage_box_number, box_name, box_location
FROM antique, storage_box
WHERE antique_name IN ('VICTROLA','CAMERA','RADIO')
AND antique.storage_box_number < storage_box.storage_box_number;

ANTIQUE_NAM ANTIQUE ANTIQU BOX_NAME BOX_LOCATION
----------- ------- ------ -------- ------------
VICTROLA 150.00 3 ALPHA-1 ALPHA BLDG
VICTROLA 150.00 3 ALPHA-2 ALPHA BLDG
VICTROLA 150.00 3 ALPHA-3 ALPHA BLDG
VICTROLA 150.00 3 ALPHA-4 ALPHA BLDG
CAMERA 75.00 4 ALPHA-1 ALPHA BLDG
CAMERA 75.00 4 ALPHA-2 ALPHA BLDG
CAMERA 75.00 4 ALPHA-3 ALPHA BLDG
CAMERA 75.00 4 ALPHA-4 ALPHA BLDG
RADIO 200.00 4 ALPHA-1 ALPHA BLDG
RADIO 200.00 4 ALPHA-2 ALPHA BLDG
RADIO 200.00 4 ALPHA-3 ALPHA BLDG
RADIO 200.00 4 ALPHA-4 ALPHA BLDG

This is junk data. It illustrates that when an inequality operation is specified as part of the where clause joining data from one table to another, there is no way to guarantee that the inequality operation will be satisfied for all values in the column for both tables. There is also a high possibility that the data returned by an inequality join will look suspiciously like a Cartesian product. A better alternative for drawing data from a table that satisfies an inequality operation but does not produce a Cartesian product is to specify the inequality operation outside the comparison that produces the join, as shown here:

SELECT antique_name, antique_cost,
antique.storage_box_number, box_name, box_location
FROM antique, storage_box
WHERE box_location in ('VICTROLA','CAMERA','RADIO')
AND antique.storage_box_number = storage_box.storage_box_number
AND antique.storage_box_number > 3;

ANTIQUE_NAM ANTIQUE ANTIQU BOX_NAME BOX_LOCATION
----------- ------- ------ -------- ------------
CAMERA 75.00 4 ALPHA-4 ALPHA BLDG
RADIO 200.00 4 ALPHA-4 ALPHA BLDG

This select statement will produce all results joined properly using the equality operation to link the rows of two tables in an inner join, while also satisfying the comparison needed to obtain data for only those storage boxes greater than box #3. In general, it is best to specify an equality operation for the two columns linking the tables for the join, followed by an inequality operation on the same column in one of the tables to filter the number of rows that will be linked in the join.

Generally speaking, the query used to produce a table join must contain the right number of equality operations to avoid a Cartesian product. If the number of tables to be joined equals N, the user should remember to include at least N-1 equality conditions in the select statement so that each column in each table that exists in another table is referenced at least once.

TIP: For N joined tables, you need at least N-1 join conditions in the select statement in order to avoid a Cartesian product.

Exercises

What is a table join? How is a table join produced?

Why is it important to use equality operations when creating a table join?

How many equality conditions are required to join three tables? Six tables? Twenty tables?

Creating Outer Joins

In some cases, however, you need some measure of inequality on the joined columns of a table join operation in order to produce the data required in the return set. Say, for example, that you want to see all Victrolas not in storage boxes as well as those that are boxed. One limitation of "inner" join or equijoin statements is that they will not return data from either table unless there is a common value in both columns for both tables on which to make the join.

SELECT antique_name, antique_cost,
antique.storage_box_number, box_name, box_location
FROM antique, storage_box
WHERE box_location = 'VICTROLA'
AND antique.storage_box_number = storage_box.storage_box_number;

ANTIQUE_NAM ANTIQUE ANTIQU BOX_NAME BOX_LOCATION
----------- ------- ------ -------- ------------
VICTROLA 150.00 3 ALPHA-3 ALPHA BLDG

Notice only Victrolas that have corresponding storage box entries in the STORAGE_BOX table are included in the return set. In an attempt to obtain a list of Victrolas that are not boxed, the user then issues the following nonjoin query:

SELECT antique_name, antique_cost
FROM antique;

ANTIQUE_NAM ANTIQUE
----------- -------
VICTROLA 150.00
VICTROLA 90.00
VICTROLA 45.00

This query is a little closer to the mark, returning data on antique Victrolas regardless of whether or not they are boxed, but the user still needs to see storage box information for those Victrolas that are boxed. In order to force the join to return data from one table even if there is no corresponding record in the other table, the user can specify an outer join operation. The previous inner join statement can be modified in the following way to show records in the ANTIQUE table that have no corresponding record in the STORAGE_BOX table:

SELECT antique_name, antique_cost,
antique.storage_box_number, box_name, box_location
FROM antique, storage_box
WHERE box_location = 'VICTROLA'
AND antique.storage_box_number = storage_box.storage_box_number (+);

ANTIQUE_NAM ANTIQUE ANTIQU BOX_NAME BOX_LOCATION
----------- ------- ------ -------- ------------
VICTROLA 150.00 3 ALPHA-3 ALPHA BLDG
VICTROLA 90.00
VICTROLA 75.00

Outer join statements such as these produce result sets that are "outside" the join criteria as well as inside it. Notice the special (+) character string called the outer join operator at the end of the comparison that forms the join. This marker denotes which column can have NULL data corresponding to the non-NULL values in the other table. In the previous example, the outer join marker is on the side of the STORAGE_BOX table, meaning that data in the ANTIQUE table can correspond either to values in STORAGE_BOX or to NULL if there is no corresponding value in STORAGE_BOX.

TIP: For "inner" joins, there must be shared values in the common column in order for the row in either table to be returned by the select statement.

Exercises

How does an outer join remedy the situation where a lack of corresponding values in the shared column of two tables causes rows from neither table to be selected?

What is the special character used to denote outer joins?

Joining a Table to Itself

In special situations, it may be necessary for you to perform a join using only one table. Well, you really are using two copies of the tables – you join the table to yourself. This task can be useful in certain cases where there is a possibility that some slight difference exists between two rows that would otherwise be duplicate records. If you want to perform a self join on a table, you should utilize the table alias method described earlier in the chapter to specify the same table for Oracle to understand that a self join is being performed.

The following example of a self join shows how to use this technique properly. For the example, assume that there is a table called TEST_RESULTS on which users at various locations administer a test for employees of a large corporation. The test is designed to determine if a given employee is ready for promotion. If an employee fails the test, he or she must wait a full year before taking the test again. It is discovered that there is a bug in the system that allowed some employees to circumvent the rule by taking the test at a different location. Now, management wants to find out which employees have taken the test more than once in the past year. The columns in the TEST_RESULTS table are listed as follows: EMPID, LOCATION, DATE, and SCORE. In order to determine if an employee has taken the test twice in the last year, you could issue the following SQL select that uses self join techniques:

SELECT a.empid, a.location, a.date, b.location, b.date
FROM test_results a, test_results b
WHERE a.empid = b.empid
AND a.location <> b.location
AND a.date > sysdate-365
AND b.date > sysdate-365;

A.EMPID A.LOCATION A.DATE B.LOCATION B.DATE
------- ---------- --------- ---------- ---------
94839 St. John 04-NOV-97 Wendt 03-JAN-98
04030 Stridberg 27-JUN-97 Wendt 03-AUG-97
59393 St. John 20-SEP-97 Wendt 04-OCT-97

The output from this self join shows that three employees took the test in different locations within the last 12 months. The clause used to determine DATE highlights the flexibility inherent in Oracle’s internal method for storing DATE datatypes and sysdate as numbers representing the number of days since the beginning of the Julian calendar. The storage method Oracle uses allows you to perform simple mathematical operations on dates to obtain other dates without worrying about taking into account factors like the number of days in months between the old date and new, whether the year in question is a leap year, etc.

Those users who must perform self joins on tables should be extremely cautious about doing so in order to avoid performance issues or Cartesian products. Usually, the required number of equality operations is usually at least two in the situation of self joins, simply because only using one equality condition usually does not limit the output of a self join to the degree that the output must be limited in order to obtain meaningful information.

TIP: The number of equality operations usually needed in the where clause of a self join should be greater than or equal to 2.

It should be stated that a self join typically requires a long time to execute, because Oracle must necessarily read all data for each table twice sequentially. Ordinarily, Oracle will read data from two different tables to perform the join, but since the operation in this case is a self join, all data comes from one table. Without a proper comparison operation set up in the where clause, you may wind up with many copies of every row in the table returned, which will certainly run for a long time and produce a lot of unnecessary output.

Exercises

What is a self join? How might a self join be used?

How many equality operations should be used to create a self join?

What performance issues do self joins present?

Group Functions and Their Uses

In this section, you will cover the following areas related to group functions and their uses:

Identifying available group functions
Using group functions
Using the group by clause
Excluding group data with the having clause

A group function allows you to perform a data operation on several values in a column of data as though the column was one collective group of data. These functions are called group functions also, because they are often used in a special clause of select statements called a group by clause. A more complete discussion of the group by clause appears in the second discussion of this section.

Identifying Available Group Functions

An important difference group functions have with single-row functions is that group functions can operate on several rows at a time. This advantage allows functions to be used that calculate figures like averages and standard deviation. The list of available group functions appears here:

avg(x)	Average for all x column values returned by the select statement
count(x)	A total number of non-NULL values returned by the select statement for column x
max(x)	The maximum value in column x for all rows returned by the select statement
min(x)	The minimum value in column x for all rows returned by the select statement
stddev(x)	The standard deviation for all values in column x in all rows returned by select statements
sum(x)	The sum of all values in column x in all rows returned by the select statement
variance(x)	The variance for all values in column x in all rows returned by select statements

Exercises

What is a group function? How do they differ from single-row functions?

Name several group functions.

Using Group Functions

Examples of output from each of the following group functions appear over the next several pages. Since these functions require the use of several rows of data, the EMP table from the previous chapter will be used frequently. The EMP table appears in Table 2-1.

Empid	Lastname	Firstname	Salary
39334	Smith	Gina	75,000
49539	Qian	Lee	90,000
60403	Harper	Rod	45,000
02039	Walla	Rajendra	60,000
49392	Spanky	Stacy	100,000

Table 1: The EMP Table

The avg( ) function takes the values for a single column on all rows returned by the query and calculates the average value for that column. Based on the data from the previous table, the avg( ) function on the SALARY column produces the following result:

SELECT AVG(salary)FROM EMP;

AVG(salary)
-----------
74000

The second grouping function illustrated is count( ). This function is bound to become the cornerstone of any Oracle professional’s repertoire. The count( ) function returns a row count for the table given certain column names and/or select criteria. Note that the fastest way to execute count( ) is to pass a value that resolves quickly in the SQL processing mechanism. Some values that resolve quickly are integers and the ROWID pseudocolumn.

SELECT COUNT(*), -- Slow
COUNT(1), -- Fast
COUNT(rowid) -- Fast
FROM EMP;

COUNT(*) COUNT(1) COUNT(rowid)
-------- -------- ------------
5 5 5

The asterisk (*) in the previous query is a wildcard variable that indicates all columns in the table. For better performance, this wildcard should not generally be used because the Oracle SQL processing mechanism must first resolve all column names in the table, a step that is unnecessary if one is simply trying to count rows. Notice that one of these examples uses the special pseudocolumn called ROWID. A ROWID is a special value that uniquely identifies each row. Each row in a table has one unique ROWID. The ROWID is not actually part of the table; rather, ROWID is a piece of information stored internally within Oracle. Thus, it is considered a "pseudocolumn."

TIP: Do not use count(*) to determine the number of rows in a table. Use count(1) or count(ROWID) instead. These options are faster because they bypass some unnecessary operations in Oracle’s SQL processing mechanism.

The next pair of grouping functions to be covered are the max( ) and min( ) functions. The max( ) function determines the largest value for the column passed, while min( ) determines the smallest value for the column passed, as shown here:

SELECT MAX(salary), MIN(salary) FROM EMP;

MAX(salary) MIN(salary)
----------- -----------
100000 45000

Another group function details the variance( ) of all values in a column. The variance of a set of numbers represents a measure of variability�the mean squared deviation from the expected value for the set, as shown below:

SELECT VARIANCE(salary) FROM EMP;

VARIANCE(salary)
----------------
492500000

Related to variance( ) is the next example in this section�the stddev( ) function. It produces the standard deviation for values in the column specified. Standard deviation is the square root of the variance for the set�a measure of variability. The number produced represents the margin of variance or error for a set of numbers.

SELECT STDDEV(salary) FROM EMP;

STDDEV(salary)
--------------
22192.341

The final group function is used commonly in simple accounting reports. The sum( ) function gives the total of all values in a column.

SELECT SUM(salary) FROM EMP;

SUM(salary)
-----------
370000

In general, the group functions will operate on columns of datatype NUMBER because many of the functions they represent in mathematics are numeric operations. For example, it makes little sense to take the standard deviation for a set of 12 words, unless the user wants to take the standard deviation of the length of those words by combining the use of the length( ) function with the stddev( ) function. There is one notable exception to this general rule, though—that exception is the count( ) function. The count( ) function will operate on a column of any datatype.

TIP: Group functions ignore NULL values by default. This is an essential piece of information you should know for OCP.

Exercises

How are group functions incorporated into select statements? How many rows of output can usually be expected from a query using a group function?

What is ROWID? Is ROWID stored in a table?

Using the GROUP BY Clause

Sometimes it gives more meaning to the output of a select statement to collect data into logical groupings. For example, to perform calculations on the populations of several cities in America, you may issue a query against all records in the CITIES table. The select statements, such as ones containing order by, may work well for specific queries against particular cities in this table, such as listing data in order based on an alphabetized list of cities and states, such as the SQL statement below:

SELECT state, city, population
FROM cities
ORDER BY state, city;

STATE CITY POPULATION
--------------- -------------- ----------
ALABAMA AARDVARK 12,560
ALABAMA BARNARD 176,000
...

Consider the following example, however. There arises a situation where you want to perform specific calculations on the cities in each state separately. For example, you want to find out the average city population for each of the states listed on the table. This select statement works fine for producing the raw data you need to calculate the average city population for each state; however, there is an easier way for you to determine this information based on usage of the group by clause in SQL statements.

SELECT state, AVG(population)
FROM CITIES
GROUP BY state;

STATE AVG(POPULA
----------------- ----------
ALABAMA 49494
ALASKA 14349
NEW YORK 85030
ARIZONA 35003
CALIFORNIA 65040
...

The group by clause in this example saves you from performing a great deal of work by hand. Instead, Oracle shoulders most of the work and shows only the results you need. The group by clause works well in many situations where you want to report calculations on data according to groups or categories. There are two possible error messages with group by operations. They are shown here with examples:

SELECT job,avg(sal),deptno
FROM emp;

ORA-00937: not a single-group group set function

SELECT job,avg(sal),deptno
FROM emp
GROUP BY deptno;

ORA-00979: not a GROUP BY expression

To illustrate the usage of group by in another example, assume that you now want to calculate the average salary for all employees in a corporation by department. The EMP table contains the following columns: EMPID, LASTNAME, FIRSTNAME, SALARY, and DEPT. However, in addition to obtaining the average employee salary by department, you want the information in order from highest average salary to lowest.

SELECT dept, AVG(salary)
FROM emp
GROUP BY dept
ORDER BY avg(salary) DESC;

DEPT AVG(SALARY)
---- -----------
201B 103020
594C 94030
493W 71039
201C 50403

In this example, the order by clause was combined with the group by clause to create a special order for the output. This order gives the data some additional meaning. You not limited to grouping data by only one selected column, either. If you want, more than one column can be used in the group by statement�provided that the same nonaggregate columns specified in the select clause of the query match the columns specified in the group by clause. The following example illustrates proper usage of group by with more than one column specified. It assumes the addition of a column, COUNTRY, which names the country containing the city.

SELECT country, state, AVG(population)
FROM cities
GROUP BY country, state;

Exercises

How is the group by clause of a select statement used?

Identify some situations where statements containing the group by clause return errors.

Excluding GROUP Data with HAVING

One initial problem encountered when using the group by statement is that once the data is grouped, you must then analyze the data returned by the group by statement in order to determine which groups are relevant and which are not. It is sometimes useful to weed out unwanted data. For example, in the final query from the previous section, suppose you only wanted to see which departments paid an average salary of $80,000 or more per year. In effect, you are attempting to put a where clause on the group by clause.

This effect can be gained with the use of a special clause in Oracle called having. This clause acts as a modified where clause that only applies to the resultant rows generated by the group by expression. Consider the application of the previous modifications to the query of employee salary by department. If you want to view only those departments whose employees make an average of $80,000 or more, you may want to issue the following query. The having clause in this case is used to eliminate the departments whose average salary is under $80,000. Notice that this selectivity cannot easily be accomplished with an ordinary where clause, because the where clause is selective to the precision of individual rows while you require selectivity for eliminating groups of rows.

SELECT dept, AVG(salary)
FROM emp
GROUP BY dept
HAVING avg(salary)>80000
ORDER BY avg(salary) DESC;

DEPT AVG(SALARY)
---- -----------
201B 103020
705B 94030

In this query, you successfully limit output on the group by rows by using the having clause. But the having clause need not be limited by some arbitrary number you key in manually. In addition to performing a comparison operation on a constant value, the having clause can perform a special operation to derive the required data with the use of a subquery. A subquery is another SQL query embedded within the overarching query being executed that derives some special value required by a part of the entire query. Subqueries are useful to incorporate into select statements when there is a need for valid value data that you don’t know the value of, but know the manner in which to obtain it.

SELECT dept, AVG(salary)
FROM emp
GROUP BY dept
HAVING AVG(salary) >
(SELECT salary
FROM emp
WHERE empid=49394)
ORDER BY avg(salary) DESC;

DEPT AVG(SALARY)
---- -----------
201B 103020
569A 96120

Exercises

What is the having clause, and what function does it serve?

How can the user specify values to fulfill having criteria without actually knowing what the values themselves are?

Using Subqueries

In this section, you will cover the following topics related to using subqueries:

Nested subqueries
Subqueries in other situations
Putting data in order with subqueries

There are several different ways to include subqueries in where statements. The most common method used is through the equality comparison operation, or with the in comparison, which is in itself similar to a case statement offered in many programming languages because the equality can be established with one element in the group. Another useful item for inclusion of a subquery in the where clause of a select statement is the exists clause. When you specify the exists operation in a where clause, you must include a subquery that satisfies the exists operation. If the subquery returns data, then the exists operation returns TRUE. If not, the exists operation returns FALSE. These and other uses for subqueries will be discussed shortly.

TIP: A subquery is a "query within a query," a select statement nested within a select statement designed to limit the selected output of the parent query by producing an intermediate result set of some sort.

Nested Subqueries

Subqueries can be used to obtain search criteria for select statements. The way subqueries work is as follows. The where clause in a select statement has one or more comparison operations. Each comparison operation can contain the name of a column on the left side and a given search method to obtain unknown data on the right side by means of a subquery.

SELECT empid, dept, salary
FROM emp
WHERE dept =
(select dept
from emp
where empid = 78483);

The portion of the SQL statement that is highlighted is the subquery portion of the statement. On one side is the DEPT column, on which a comparison will be based to determine the result dataset. On the other side is the unknown search criteria, defined by the subquery. At the time this select statement is submitted, Oracle will process the subquery first in order to resolve all unknown search criteria, then feed that resolved criteria to the outer query. The outer query then can resolve the dataset it is supposed to return.

The subquery itself can contain subqueries. This process is known as nested subqueries. Consider the following example. You are trying to determine the salary of employees in the same department, as an employee who has submitted an expensive invoice for payment on the company’s relocation expenditure system. The tables involved in this select statement are the EMP table, which has been described, and the INVOICE table, which consists of the following columns: INVOICE_NUMBER, EMPID, INVOICE_AMT, and PAY_DATE. The only information you have about the employee you are looking for is the invoice number the employee submitted for relocation expenses, which is 5640.

SELECT e.empid,
e.salary
FROM emp e
WHERE e.dept =
(SELECT dept
FROM emp
WHERE empid =
(SELECT empid
FROM invoice
WHERE invoice_number = 5640));

In this statement, there are two subqueries: the subquery to the main select statement highlighted in bold, and the nested subquery in italics. Each subquery produces unknown criteria to fulfill the main search occurring in the select statement, the first producing the department information and the second producing the employee ID for the person submitting the invoice. These two details are crucial for completing the select statement, yet the data is unknown at the time the select statement is issued. Oracle must first resolve the innermost nested subquery in italics to resolve the next level. After that, Oracle will resolve the subquery level in bold to resolve the outermost level of the select statement issued.

Subqueries can be nested to a surprisingly deep level. The rule of thumb used to be that you could nest 16 or more subqueries into a select statement. In reality, the number of nested subqueries can be far higher. However, if you need to nest more than five subqueries, you may want to consider writing the query in PL/SQL or in a programming language like PRO*C or PRO*COBOL, or some other programming language that allows embedded SQL statements and cursors. At the very least, you may want to consider rewriting a query that makes heavy use of subqueries into a query that performs extensive join operations as well. Database performance degrades substantially after about that level when processing nested subqueries on all but the most powerful database servers and mainframes.

Exercises

What is a subquery? When might a user want to incorporate a subquery into a database select statement?

What are some situations where a where clause may be sufficient in place of a subquery?

What performance issues might revolve around the use of subqueries?

Subqueries in Other Situations

The previous discussion covered many areas of using subqueries in data manipulation statements. However, that discussion barely scratches the surface on the power of subqueries. A subquery can be used for complicated step-by-step joins of data that use data from one subquery to feed into the processing of its immediate parent. However, subqueries also allow you to "jump" subquery levels to perform incredibly complex, almost counterintuitive processing that necessarily must involve some discussion of a programming concept known as variable scope. If you are not familiar with the term, variable scope refers to the availability or "viewability" of data in certain variables at certain times.

Sometimes a variable has a local scope. That is to say that the variable can only be seen when the current block of code is being executed. Consider the columns in comparison operations named in subqueries to be variables whose "scope" is local to the query. Additionally, there is another type of scope, called global scope. In addition to a variable having local scope within the subquery where it appears, the variable also has global scope in all subqueries to that query. In the previous select statement example, all variables or columns named in comparison operations in the outermost select operation are local to that operation and global to all nested subqueries, given in bold and italics. Additionally, all columns in the subquery detailed in bold are local to that query and global to the subquery listed in italics. Columns named in the query in italics are local to that query only; since there are no subqueries to it, the columns in that query cannot be global. The nested query example from the previous discussion is featured in Figure 2-2.

Figure 2: Nested query example

TIP: The scope of a variable defines which code blocks will have the variable and its defined value available to it. There are two different types of variable scope�local and global. If a variable has global scope, then it and its value are available everywhere in the code block. If a variable has local scope, then it and its value are available only in the current code block running in the memory stack.

In certain cases, it may be useful for a subquery to refer to a global column value rather than a local one to obtain result data. The subquery architecture of Oracle allows you to refer to global variables in subqueries as well as local ones to produce more powerful queries. An example of this type of global scope usage, sometimes also referred to as correlated subqueries, is as follows. Assume that there is a recruiter for a national consulting firm who wants to find people in Minneapolis who are proficient in Oracle SQL skills. Furthermore, the recruiter only wants to see the names and home cities for people who are certified Oracle professionals. The recruiter has at her disposal a nationwide resume search system with several tables. These tables include one called CANDIDATE, which contains the candidate ID, candidate name, salary requirement, and current employer. Another table in this example is called SKILLS, where the candidate ID is matched with the skill the candidate possesses. A third table, called COMPANIES, contains the names and home cities for companies that the consulting firm tries to draw their talent from. In order to find the names and locations of people who possess the abilities the recruiter requires, the recruiter may issue the following select statement against the national recruiting database:

SELECT candidate_id,
name,
employer
FROM candidate
WHERE candidate_id IN
(SELECT candidate_id
FROM skills
WHERE skill_type = 'ORACLE SQL'
AND certified = 'YES')
AND employer IN
(SELECT employer
FROM companies
WHERE city = 'MINNEAPOLIS');

CANDIDATE_ID NAME EMPLOYER
------------ -------- --------------
60549 DURNAM TransCom

This query produces the result set the recruiter is looking for. However, there is one anomaly about this query, which is highlighted in bold in the preceding code block. The first subquery, which selects candidate IDs based on a search for specific skills, performs its operation on the SKILLS table. Notice that there is a reference to the EMPLOYER column from the CANDIDATE table in this subquery. This column is not present in the SKILLS table, but nonetheless it can be referred to by the subquery because the columns from the CANDIDATE table are global to all subqueries, since the CANDIDATE table is in the outermost query of the select statement. Notice in the last subquery the use of the in keyword. Recall from Chapter 1 that the in operation allows you to identify a set of values, any of which the column named in the in clause can equal in order to be part of the result set. Thus, if the where clause of the select statement contains and NUMBER in (1,2,3), that means only the rows whose value in the NUMBER column are equal to 1, 2, or 3 will be part of the result set.

Another complicated possibility offered by subqueries is the use of the exists operation. Mentioned earlier, exists allows the user to specify the results of a select statement according to a special subquery operation. This exists operation returns TRUE or FALSE based on whether or not the subquery obtains data when it runs. An example of the usage for the exists subquery is related to a previous example of the relocation expenditure tracking system. The tables involved in this system are the EMP table, which has been described, and the INVOICE table, which consists of the following columns: INVOICE_NUMBER, EMPID, INVOICE_AMT, and PAY_DATE. Let’s assume that you want to identify all the departments that have employees who have incurred relocation expenses in the past year.

SELECT distinct e.dept
FROM emp e
WHERE EXISTS
(SELECT i.empid
FROM invoice i
WHERE i.empid = e.empid
AND i.pay_date > SYSDATE-365);

There are several new things that are worthy of note in this select statement. The first point to be made is that global scope variables are again incorporated in the subquery to produce meaningful results from that code. The second point to make is more about the general nature of exists statements. Oracle will go through every record in the EMP table to see if the EMPID matches that of a row in the INVOICE table. If there is a matching invoice, then the exists criteria is met and the department ID is added to the list of departments that will be returned. If not, then the exists criteria is not met and the record is not added to the list of departments that will be returned.

Notice that there is one other aspect of this query that has not been explained—the distinct keyword highlighted in bold in the select column clause of the outer portion of the query. This special keyword identifies a filter that Oracle will put on the data returned from the exists subquery. When distinct is used, Oracle will return only one row for a particular department, even if there are several employees in that department that have submitted relocation expenses within the past year. This distinct operation is useful for situations when you want a list of unique rows but anticipate that the query may return duplicate rows. The distinct operation removes duplicate rows from the result set before displaying that result to the user.

Exercises

Name a TRUE/FALSE operation that depends on the results of a subquery to determine its value.

What is variable scope? What is a local variable? What is a global variable?

What is the distinct keyword, and how is it used?

Putting Data in Order with Subqueries

As with other types of select statements, those statements that involve subqueries may also require some semblance of order in the data that is returned to the user. In the examples of subquery usage in the previous discussion, there may be a need to return the data in a particular order based on the columns selected by the outermost query. In this case, you may simply want to add in the usage of the order by clause. The previous example of selecting the departments containing relocated employees could be modified as follows to produce the required department data in a particular order required by the user:

SELECT distinct e.dept
FROM emp e
WHERE EXISTS
(SELECT i.empid
FROM invoice i
WHERE i.empid = e.empid
AND i.pay_date > SYSDATE-365)
ORDER BY dept;

By using the order by clause in the outermost statement, the data returned from the outermost statement can be sorted into ascending or descending order. You cannot, however, incorporate order into the data returned by the subquery.

SELECT distinct e.dept
FROM emp e
WHERE EXISTS
(SELECT i.empid FROM invoice i
WHERE i.empid = e.empid
AND i.pay_date > SYSDATE-365)
ORDER BY i.empid;

In another example, the points made before about global and local scope will be reinforced. A recruiter from the national consulting firm mentioned earlier tries to issue the following select statement, similar to the original one discussed with that example:

SELECT candidate_id,
name,
employer
FROM candidate
WHERE candidate_id IN
( SELECT candidate_id
FROM skills
WHERE employer IN
( SELECT employer
FROM companies
WHERE city = 'MINNEAPOLIS'))
ORDER BY skill_type;

When the recruiter attempts to issue the preceding select statement, Oracle will execute the statement without error because the column specified by the order by clause need not be part of the column list in the select column list of the outermost select statement. However, within a subquery it is possible for the user to create order based on the column "variables" from the outer query. Order, then, can only be made on the columns local or global to the query on which the order by clause is placed.

Exercises

Can you use the order by clause within select statements with subqueries? Why or why not?

What about within the subquery itself? Explain.

Using Runtime Variables

In this section, you will cover the following topics related to using runtime variables:

Entering variables at run time
Automatic definition of runtime variables
The accept command

SQL is an interpreted language. That is, there is no "executable code" other than the statement you enter into the command line. At the time that statement is entered, Oracle’s SQL processing mechanism works on obtaining the data and returning it to you. When Oracle is finished returning the data, it is ready for you to enter another statement. This interactive behavior is typical of interpreted programming languages.

SELECT name, salary, dept
FROM emp
WHERE empid = 40539;

NAME SALARY DEPT
-------- ------- ----
DURNAP 70560 450P

In the above statement, the highlighted comparison operation designates that the data returned from this statement must correspond to the EMPID value specified. If you run this statement again, the data returned would be exactly the same, provided that no portion of the record had been changed by the user issuing the query or anyone else on the database. However, Oracle’s interpretive SQL processing mechanism need not have everything defined for it at the time you enter a SQL statement. In fact, there are features within the SQL processing mechanism of Oracle that allow you to identify a specific value to be used for the execution of the query as a runtime variable. This feature permits some flexibility and reusability of SQL statements.

Entering Variables at Run Time

Consider, for example, the situation where you pull up data for several different employees manually for the purpose of reviewing some aspect of their data. Rather than rekeying the entire statement in with the EMPID value hard-coded into each statement, you can substitute a variable specification that forces Oracle to prompt you to enter a data value in order to let Oracle complete the statement. The earlier statement that returned data from the EMP table based on a hard-coded EMPID value can now be rewritten as the following query that allows you to reuse the same code again and again with different values set for EMPID:

SELECT name, salary, dept
FROM emp
WHERE empid = &empid;

Enter value for empid: 40539
Old 3: WHERE empid = &empid;
New 3: WHERE empid = 40539;

NAME SALARY DEPT
------ ------- ----
DURNAP 70560 450P

After completing execution, you now have the flexibility to rerun that same query, except now you can specify a different EMPID without having to reenter the entire statement. Notice that a special ampersand (&) character precedes the name of the variable that will be specified at run time. This combination of ampersand and identifier creates a substitution variable. If you don’t want to use the ampersand, the character can be changed with the set define command at the SQL prompt in SQL*Plus. You can reexecute the statement containing a runtime variable declaration by using the slash (/) command at the prompt in SQL*Plus.

Enter value for empid: 99706
Old 3: WHERE empid = &empid;
New 3: WHERE empid = 99706;

NAME SALARY DEPT
------- ------- ----
MCCALL 103560 795P

This time, you enter another value for the EMPID, and Oracle searches for data in the table based on the new value specified. This activity will go on as listed above until you enter a new SQL statement. Notice that Oracle provides additional information back to you after a value is entered for the runtime variable. The line as it appeared before is listed as the old value, while below the new value is presented as well. This presentation lets you know what data was changed by your input.

Exercises

What special character is used to specify a runtime variable?

How does Oracle prompt for runtime variable change?

What special character is used to reexecute a statement in SQL*Plus if the statement is stored in the current buffer? Can you recall the name of the file in which the SQL*Plus statement buffer is stored?

Automatic Definition of Runtime Variables

In some cases, however, it may not be useful to have you entering new values for a runtime variable every time the statement executes. For example, assume that there is some onerous reporting process that you must perform weekly on every person in a company. A great deal of value is added to the process by having a variable that can be specified at run time because you can then simply reexecute the same statement over and over again, with new EMPID values each time. However, even this improvement is not streamlining the process as much as you would like. Instead of running the statement over and over again with new values specified, you could create a script that contained the SQL statement, preceded by a special statement that defined the input value automatically and triggered the execution of the statement automatically as well. Some basic reporting conventions will be presented in this example, such as spool. This command is used to designate to the Oracle database that all output generated by the following SQL activity should be redirected to an output file named after the parameter following spool:

SPOOL emp_info.out;
DEFINE VAR_EMPID = 34030

SELECT name, salary, dept
FROM emp
WHERE empid = &var_empid;

UNDEFINE VAR_EMPID
DEFINE VAR_EMPID = 94059

SELECT name, salary, dept
FROM emp
WHERE empid = &var_empid;

When you execute the statements in this script, the time spent actually keying in values for the variables named in the SQL select statement is eliminated with the define statement. Notice, however, that in between each execution of the SQL statement there is a special statement using a command called undefine. In Oracle, the data that is defined with the define statement as corresponding to the variable will remain defined for the entire session unless the variable is undefined. By undefining a variable, the user allows another define statement to reuse the variable in another execution of the same or a different statement.

TIP: You can also use the define command if you want to reuse substitution variables over different SQL statements, allowing you to pass a value from one statement to another.

Exercises

How are variables defined within the SQL*Plus session to be used by select statements?

How can the user change a value set for a defined variable?

ACCEPT: Another Way to Define Variables

After executing a few example SQL statements that incorporate runtime variables, you may notice that Oracle’s method for identifying input, though not exactly cryptic, is fairly nonexpressive.

SELECT name, salary, dept
FROM emp
WHERE empid = &empid
AND dept = '&dept';

Enter value for &empid: 30403
Old 3: WHERE empid = &empid
New 3: WHERE empid = 30403

Enter value for &dept: 983X
Old 4: WHERE dept = '&dept';
New 4: WHERE dept = '983X';

NAME SALARY DEPT
-------- -------- ----
TIBBINS 56700 983X

You need not stick with Oracle’s default messaging to identify the need for input. Instead, you can incorporate into scripted SQL statements another method for the purpose of defining runtime variables. This other method allows the creator of the script to define a more expressive message that the user will see when Oracle prompts for input data. The name of the command that provides this functionality is the accept command. In order to use the accept command in a runtime SQL environment, you can create a script in the following way. Assume for the sake of example that you have created a script called emp_sal_dept.sql, into which the following SQL statements are placed:

ACCEPT var_empid PROMPT 'Enter the Employee ID Now:'
ACCEPT var_dept PROMPT 'Enter the Employee Department Now:'

SELECT name, salary, dept
FROM emp
WHERE empid = &var_empid
AND dept = '&var_dept';

At this point, the user can run the script at the prompt using the following command syntax:

SQL> @emp_sal_dept

SQL> start emp_sal_dept.sql

Oracle will then execute the contents of the script. When Oracle needs to obtain the runtime value for the variables that the user identified in the SQL statement and with the accept statement, Oracle will use the prompt the user defined with the prompt clause of the accept statement.

SQL>> @emp_sal_dept
Enter the Employee ID Now: 30403

SELECT name, salary, dept
FROM emp
WHERE empid = &var_empid
AND dept = '&var_dept';

Old 3: WHERE empid = '&var_empid'
New 3: WHERE empid = 30403

Enter the Employee Department Now: 983X

SELECT name, salary, dept
FROM emp
WHERE empid = 30403
AND dept = '&var_dept';

Old 4: WHERE dept = '&var_dept'
New 4: WHERE dept = '983X'

NAME SALARY DEPT
-------- ------- ----
TIBBINS 56700 983X

Using the accept command can be preferable to Oracle’s default output message in situations where you want to define a more accurate or specific prompt, or you want more output to display as the values are defined. In either case, the accept command can work well. Oracle offers a host of options for making powerful and complex SQL statements possible with runtime variables. These options covered can be used for both interactive SQL data selection and for SQL scripts.

TIP: By default, the datatype for a variable defined with the accept command is CHAR. You can also explicitly specify the datatype in the accept command.

Exercises

What is the accept command and how is it used?

What benefits does using the accept command offer?

Chapter Summary

This chapter continues the discussion presented last chapter of using the select statement to obtain data from the Oracle database. The select statements have many powerful features that allow the user to accomplish many tasks. Those features include joining data from multiple tables, grouping data output together and performing data operations on the groups of data, creating select statements that can use subqueries to obtain criteria that is unknown (but the method for obtaining it is known), and using variables that accept values at run time. Together, these areas comprise the advanced usage of SQL select statements. The material in this chapter comprises 22 percent of information questioned on OCP Exam 1.

The first area discussed in this chapter is how data from multiple tables can be joined together to create new meaning. Data in a table can be linked if there is a common or shared column between the two tables. This shared column is often referred to as a foreign key. Foreign keys establish a relationship between two tables that is referred to as a parent/child relationship. The parent table is typically the table in which the common column is defined as a primary key, or the column by which uniqueness is identified for rows in the table. The child table is typically the table in which the column is not the primary key, but refers to the primary key in the parent table.

There are two types of joins. One of those types is the "inner" join, also known as an equijoin. An "equijoin" is a data join based on equality comparisons between common columns of two or more tables. An "outer" join is a nonequality join operation that allows you to obtain output from a table even if there is no corresponding data for that record in the other table.

Joins are generated by using select statements in the following way. First, the columns desired in the result set are defined in the select clause of the statement. Those columns may or may not be preceded with a table definition, depending on whether or not the column appears in more than one table. If the common column is named differently in each table, then there is no need to identify the table name along with the column name, as Oracle will be able to distinguish which table the column belongs to automatically. However, if the column name is duplicated in two or more tables, then you must specify which column you would like to obtain data from, since Oracle must be able to resolve any ambiguities clearly at the time the query is parsed. The tables from which data is selected are named in the from clause, and may optionally be followed by a table alias. A table alias is similar in principle to a column alias, which was discussed in the last chapter. The where clause of a join statement specifies how the join is performed. An inner join is created by specifying the two shared columns in each table in an equality comparison. An outer join is created in the same way, with an additional special marker placed by the column specification of the "outer" table, or the table in which there need not be data corresponding to rows in the other table for that data in the other table to be returned. That special marker is indicated by a (+). Finally, a table may be joined to itself with the use of table aliases. This activity is often done to determine if there are records in a table with slightly different information from rows that are otherwise duplicate rows.

Another advanced technique for data selection in Oracle databases is the use of grouping functions. Data can be grouped together in order to provide additional meaning to the data. Columns in a table can also be treated as a group in order to perform certain operations on them. These grouping functions often perform math operations such as averaging values or obtaining standard deviation on the dataset. Other group functions available on groups of data are max( ), min( ), sum( ), and count( ).

One common grouping operation performed on data for reporting purposes is a special clause in select statements called group by. This clause allows the user to segment output data and perform grouping operations on it. There is another special operation associated with grouping that acts as a where clause for which to limit the output produced by the selection. This limiting operation is designated by the having keyword. The criteria for including or excluding data using the having clause can be identified in one of two ways. Either criterion can be a hard-coded value or it can be based on the results of a select statement embedded into the overarching select statement. This embedded selection is called a subquery.

Another advanced function offered by select statements is the use of subqueries in the where clause of the select statement. There is no theoretical limit to the number of queries that can be nested in select statements, but it is not generally advisable to put more than half a dozen or so based on performance. Subqueries allow the user to specify unknown search criteria for the comparisons in the where clause as opposed to using strictly hard-coded values. Subqueries also illustrate the principle of data scope in SQL statements by virtue of the fact that the user can specify columns that appear in the parent query, even when those columns do not appear in the table used in the subquery.

Another use of subqueries can be found in association with a special operation that can be used in the where clause of a select statement. The name of this special operation is exists. This operation produces a TRUE or FALSE value based on whether or not the related subquery produces data. The exists clause is a popular option for users to incorporate subqueries into their select statements.

Output from the query can be placed into an order specified by the user with the assistance of the order by clause. One final advanced technique covered in this chapter is the specification of variables at run time. This technique is especially valuable in order to provide reusability in a data selection statement. In order to denote a runtime variable in SQL, the user should place a variable name in the comparison operation the user wants to specify a runtime value for. The name of that variable in the select statement should be preceded with a special character to denote it as a variable. By default, this character is an ampersand (&). However, the default variable specification character can be changed with the use of the set define command at the prompt.

Runtime variables can be specified for SQL statements in other ways as well. The define command can be used to identify a runtime variable for a select statement automatically. After being defined and specified in the define command, a variable is specified for the entire session or until it is altered with the undefine command. In this way, the user can avoid the entire process of having to input values for the runtime variables. The final technique covered in this chapter on select statements is the usage of accept to redefine the text displayed for the input prompt. More cosmetic than anything else, accept allows the user to display a more direct message than the Oracle default message for data entry.

Two-Minute Drill

Select statements that obtain data from more than one table and merge the data together are called joins.

In order to join data from two tables, there must be a common column.

A common column between two tables can create a foreign key, or link, from one table to another. This condition is especially true if the data in one of the tables is part of the primary key, or the column that defines uniqueness for rows on a table.

A foreign key can create a parent/child relationship between two tables.

One type of join is the inner join, or equijoin. An equijoin operation is based on an equality operation linking the data in common columns of two tables.

Another type of join is the outer join. An outer join returns data in one table even when there is no data in the other table.

The "other" table in the outer join operation is called the outer table.

The common column that appears in the outer table of the join must have a special marker next to it in the comparison operation of the select statement that creates the table.

The outer join marker is as follows: (+).

Common columns in tables used in join operations must be preceded either with a table alias that denotes the table in which the column appears, or else the entire table name if the column name is the same in both tables.

The data from a table can be joined to itself. This technique is useful in determining if there are rows in the table that have slightly different values but are otherwise duplicate rows.

Table aliases must be used in self join select statements.

Data output from table select statements can be grouped together according to criteria set by the query.

A special clause exists to assist the user in grouping data together. That clause is called group by.

There are several grouping functions that allow you to perform operations on data in a column as though the data were logically one variable.

The grouping functions are max( ), min( ), sum( ), avg( ), stddev( ), variance( ), and count( ).

These grouping functions can be applied to the column values for a table as a whole or for subsets of column data for rows returned in group by statements.

Data in a group by statement can be excluded or included based on a special set of where criteria defined specifically for the group in a having clause.

The data used to determine the having clause can either be specified at run time by the query or by a special embedded query, called a subquery, which obtains unknown search criteria based on known search methods.

Subqueries can be used in other parts of the select statement to determine unknown search criteria as well. Including subqueries in this fashion typically appears in the where clause.

Subqueries can use columns in comparison operations that are either local to the table specified in the subquery or use columns that are specified in tables named in any parent query to the subquery. This usage is based on the principles of variable scope as presented in this chapter.

Data can be returned from statements containing subqueries with the order by clause.

Variables can be set in a select statement at run time with use of runtime variables. A runtime variable is designated with the ampersand character (&) preceding the variable name.

The special character that designates a runtime variable can be changed using the set define command.

A command called define can identify a runtime variable value to be picked up by the select statement automatically.

Once defined, the variable remains defined for the rest of the session or until undefined by the user or process with the undefine command.

A user can also modify the message that prompts the user to input a variable value. This activity is performed with the accept command.