This week’s class covers the setup and usage of a single table. As a data analyst you will sometimes encounter situations like last week where you will run queries on databases that already exist. Other times you will be tasked with setting up a database to conduct one-off or ongoing analyses.
We will therefore look at how data needs to be modelled for a single table, focusing on data types and constraints to improve data integrity.
You must closely refer to the Data Definition and Data Manipulation chapters of the PostgreSQL manual through this week’s exercises. As this is a MSc-level course it is expected that you will be able to construct SQL statements by researching documentation for yourself.
Before starting make sure that you:
For today’s lab login to the shared server and run psql. You should be connected to a database with the same name as your username. See Accessing a Database in the PostgreSQL manual.
Key concepts will be easily learned through repeated practice and reference to manual.
Null means logically unknown (empty, incomplete, not applicable, undefined). NOT the same as a numeric zero or the empty string.
Null “pollutes” other expressions. More generally, if any variable in an expression (other than a test for null) is null, then the expression is null:
Must consider 3-value logic (true/false/null) rather than 2-value logic (true/false).
Constraints allow us to define acceptable values for columns to aid data integrity.
Refer to constraints section in the PostgreSQL manual for: check, not null, unique, primary key. We will meet foreign keys and exclusion constraints later on
The NOT NULL constraint prevents a particular column value being null. Inserts will fail unless default column value specified.
The UNIQUE constraint prohibits two rows from having an equal set of attribute values for the columns specified:
Primary key is a unique key that is not null. Unambiguously identifies each row.
Max 1 primary key per table. Every table should have a primary key.
We can use the column types SERIAL or BIGSERIAL for auto-incrementing columns. Generally prefer BIGSERIAL. Shorthand for default value based on sequence generator (later on).
PostgreSQL transaction control can be used in simplest form to give basic “undo” capability:
If we don’t BEGIN a transaction we implicitly BEGIN before the statement and COMMIT after it. Errors (e.g. syntax) will abort the transaction, requiring a ROLLBACK before any more statements will be accepted.
There are a few useful non-SQL tools we will need. Many of these tools work (or have equivalents) on Windows desktops too. The best way to learn how to use these is to use their inbuilt help or man pages.
Inbuilt help is usually accessed by a switch, often -h. If a command is run with no input it will often give the help command in the error message.
Most tools also have a manual page called a manpage:
Hit q to exit a manpage.
When we are logged into a remote server we may wish to download data from the internet. There are actually text-mode browsers like lynx, links, links2 but these have become less usable since the web’s descent into excessive visual flourish and JavaScript-centricity.
Instead we will make use of a tool called curl which can download the contents of a particular URL to a file. CURL is a very useful tool, and you should consult its inbuilt help or manpages (man curl) to learn more about its features. Basic usage: