Databases and SQL
Selecting Data
Learning Objectives
- Explain the difference between a table, a record, and a field.
- Explain the difference between a database and a database manager.
- Write a query to select all values for specific fields from a single table.
A relational database is a way to store and manipulate information. Databases are arranged as a collection of tables. Each table has columns (the fields) that describe the data, and rows (the records) which contain the data.
When using a database, we send commands (usually called queries) to a database manager. The database manager is a program that manipulates the database for us. It does whatever lookups and calculations the query specifies and returns the results in a tabular form that we can then use as a starting point for further queries.
Queries are written in a language called SQL, which stands for “Structured Query Language”. SQL provides hundreds of different ways to analyze and recombine data. We will only look at a handful of queries, but those commands account for most of what scientists do.
The dataset
The tables below show the database we will use in our examples:
Person: people who took readings.
ident personal family dyer William Dyer pb Frank Pabodie lake Anderson Lake roe Valentina Roerich danforth Frank Danforth
Site: locations where readings were taken.
name lat long DR-1 -49.85 -128.57 DR-3 -47.15 -126.72 MSK-4 -48.87 -123.4
Visited: when readings were taken at specific sites.
ident site dated 619 DR-1 1927-02-08 622 DR-1 1927-02-10 734 DR-3 1930-01-07 735 DR-3 1930-01-12 751 DR-3 1930-02-26 752 DR-3 -null- 837 MSK-4 1932-01-14 844 DR-1 1932-03-22
Survey: the actual readings.
taken person quant reading 619 dyer rad 9.82 619 dyer sal 0.13 622 dyer rad 7.8 622 dyer sal 0.09 734 pb rad 8.41 734 lake sal 0.05 734 pb temp -21.5 735 pb rad 7.22 735 -null- sal 0.06 735 -null- temp -26.0 751 pb rad 4.35 751 pb temp -18.5 751 lake sal 0.1 752 lake rad 2.19 752 lake sal 0.09 752 lake temp -16.0 752 roe sal 41.6 837 lake rad 1.46 837 lake sal 0.21 837 roe sal 22.5 844 roe rad 11.25
Notice that three entries — one in the Visited
table, and two in the Survey
table — don’t contain any actual data, but instead have a special -null-
entry: we’ll return to these missing values later.
Selecting data
Let’s write an SQL query that displays scientists’ names. We do this using the SQL command SELECT
, giving it the names of the columns we want and the table we want them from. Our query and its output look like this:
SELECT family, personal FROM Person;
family | personal |
---|---|
Dyer | William |
Pabodie | Frank |
Lake | Anderson |
Roerich | Valentina |
Danforth | Frank |
The semicolon at the end of the query tells the database manager that the query is complete and ready to run.
We have written our commands and column names in lower case, and the table name in Title Case, but we don’t have to: as the example below shows, SQL is case insensitive.
SeLeCt FaMiLy, PeRsOnAl FrOm PeRsOn;
family | personal |
---|---|
Dyer | William |
Pabodie | Frank |
Lake | Anderson |
Roerich | Valentina |
Danforth | Frank |
One convention is to write SQL keywords (such as SELECT
and FROM
) in capital letters and field and table names in lower case. This can make it easier to locate parts of an SQL statement. Whatever casing convention you choose, please be consistent: complex queries are hard enough to read without the extra cognitive load of random capitalization.
It’s important to understand that the rows and columns in a database table aren’t stored in any relevant order. They will always be displayed in some order which we can control that in various ways. For example, we could swap the columns in the output by writing our query as:
SELECT personal, family FROM Person;
personal | family |
---|---|
William | Dyer |
Frank | Pabodie |
Anderson | Lake |
Valentina | Roerich |
Frank | Danforth |
or even repeat columns:
SELECT ident, ident, ident FROM Person;
ident | ident | ident |
---|---|---|
dyer | dyer | dyer |
pb | pb | pb |
lake | lake | lake |
roe | roe | roe |
danforth | danforth | danforth |
As a shortcut, we can select all of the columns in a table using *
:
SELECT * FROM Person;
ident | personal | family |
---|---|---|
dyer | William | Dyer |
pb | Frank | Pabodie |
lake | Anderson | Lake |
roe | Valentina | Roerich |
danforth | Frank | Danforth |
Selecting Site Names
Write a query that displays only the site names from the Site
table.