Single-table data

Dr Peadar Grant

Data Architecture

1 Introduction

This week’s class covers the setup and usage of a single table. As a data analyst you will sometimes encounter situations like last week where you will run queries on databases that already exist. Other times you will be tasked with setting up a database to conduct one-off or ongoing analyses.

We will therefore look at how data needs to be modelled for a single table, focusing on data types and constraints to improve data integrity.

You must closely refer to the Data Definition and Data Manipulation chapters of the PostgreSQL manual through this week’s exercises. As this is a MSc-level course it is expected that you will be able to construct SQL statements by researching documentation for yourself.

2 Prerequisities

Before starting make sure that you:

For today’s lab login to the shared server and run psql. You should be connected to a database with the same name as your username. See Accessing a Database in the PostgreSQL manual.

3 Key demonstration concepts

Key concepts will be easily learned through repeated practice and reference to manual.

CREATE table
to define table name, columns (type, constraints)
ALTER table
for incremental changes.
DROP table
without warning!
CRUD operations
:
INSERT
rows populating specified columns with data given.
UPDATE
tale, optionally select rows with WHERE.
DELETE
rows from a table, optionally select rows with WHERE.

4 Null

Null means logically unknown (empty, incomplete, not applicable, undefined). NOT the same as a numeric zero or the empty string.

Null “pollutes” other expressions. More generally, if any variable in an expression (other than a test for null) is null, then the expression is null:

1 + null = null (1) 1 + null null = null (2)

Must consider 3-value logic (true/false/null) rather than 2-value logic (true/false).

5 Constraints

Constraints allow us to define acceptable values for columns to aid data integrity.

Refer to constraints section in the PostgreSQL manual for: check, not null, unique, primary key. We will meet foreign keys and exclusion constraints later on

5.1 NOT NULL

The NOT NULL constraint prevents a particular column value being null. Inserts will fail unless default column value specified.

-- During table creation:
CREATE TABLE tasks (
   description text not null,
   /* other columns */
);

5.2 UNIQUE

The UNIQUE constraint prohibits two rows from having an equal set of attribute values for the columns specified:

-- During table creation:
CREATE TABLE tasks (
   description text unique,
   /* other columns as necessary */
   project_code text,
   subproject_code text,
   unique(project,subproject)
);

-- Afterwards
ALTER TABLE tasks ADD UNIQUE (project, subproject);

6 Primary key

Primary key is a unique key that is not null. Unambiguously identifies each row.

Max 1 primary key per table. Every table should have a primary key.

-- specifying primary key
CREATE TABLE accounts (
    account_number bigint primary key
    /* other columns as necessary */
);

-- primary key covering two columns
CREATE TABLE accounts (
    branch bigint-- not null automatic
    account_number bigint,  -- not null automatic
    /* other columns */
    primary key (branch, account_number)
    /* other constraints etc. */
)

7 Default values

7.1 Auto-increment column

We can use the column types SERIAL or BIGSERIAL for auto-incrementing columns. Generally prefer BIGSERIAL. Shorthand for default value based on sequence generator (later on).

CREATE TABLE customers (
    id BIGSERIAL PRIMARY KEY,
    /* other columns & constraints */
);

8 Transaction control

PostgreSQL transaction control can be used in simplest form to give basic “undo” capability:

-- start a new transaction
BEGIN;

/* execute SQL statements then either: */

COMMIT/* save the changes */
-- or
ROLLBACK/* undo the changes */

8.1 Error handling

If we don’t BEGIN a transaction we implicitly BEGIN before the statement and COMMIT after it. Errors (e.g. syntax) will abort the transaction, requiring a ROLLBACK before any more statements will be accepted.

8.2 Savepoints

BEGIN;

/* sql statement block 1 */

-- define a savepoint
SAVEPOINT sp1;

/* sql statement block 2 */

-- rollback to the savepoint undoes block 2;
ROLLBACK TO sp1;

-- then either:
COMMIT/* or ROLLBACK */

9 UNIX tools

There are a few useful non-SQL tools we will need. Many of these tools work (or have equivalents) on Windows desktops too. The best way to learn how to use these is to use their inbuilt help or man pages.

Inbuilt help is usually accessed by a switch, often -h. If a command is run with no input it will often give the help command in the error message.

Most tools also have a manual page called a manpage:

# general pattern
man commandname

# example man page for curl (see next section)
man curl

Hit q to exit a manpage.

9.1 Using CURL to download data

When we are logged into a remote server we may wish to download data from the internet. There are actually text-mode browsers like lynx, links, links2 but these have become less usable since the web’s descent into excessive visual flourish and JavaScript-centricity.

Instead we will make use of a tool called curl which can download the contents of a particular URL to a file. CURL is a very useful tool, and you should consult its inbuilt help or manpages (man curl) to learn more about its features. Basic usage:

# general pattern
curl -o output_file_name URL

# download the list of London stations at:
# https://www.doogal.co.uk/london_stations.php
#
# https://www.doogal.co.uk/LondonStationsCSV.ashx
curl -o stations.csv  https://www.doogal.co.uk/LondonStationsCSV.ashx

9.2 Printing file contents to screen

# listing the contents of the current directory
ls

# printing out the stations.csv file to the screen
cat stations.csv

# printing out the file with pagination
cat stations.csv | more
# using the bar | operator pipes the output of one command to another

10 Exercise

  1. Setup a simple table to keep track of your tasks / expenses / other data. Your table should at least have a description and due date. Add in other columns (such as priority, e-mail to notify when done) as necessary.