Ben Gorman

Ben Gorman

Life's a garden. Dig it.

What's Google BigQuery

BigQuery is a data warehouse designed for storing and querying large datasets. It's optimized for querying and analyzing data, but not for inserting/deleting/modifying data.

SQL

SQL (Structured Query Language) is a standard language for accessing and manipulating databases. SQL has numerous flavors (e.g. MySql, Microsoft SQL Server, PostgreSQL), all of which follow similar patterns and syntax, but each flavor has its own quirks and way of doing things.

BigQuery has its own flavor of SQL. Therefore, if you want to know how to create tables, aggregate data, calculate rolling averages, ... with BigQuery, you need to find the correct BigQuery SQL syntax. A Google search such as "How to select a random row from a table in SQL" will often lead you astray, as the results will not be specific to BigQuery.

As always, a good place to answer your "how do I do this thing?" questions is to refer to the official BigQuery docs.

Hello World

Here's a look at the BigQuery console in its simplest form.

bigquery console

Notice the project id in the header. Our project id is practice-probs. Yours will be different.

Projects > Datasets > Tables

In BigQuery, tables live within datasets and datasets live within projects.

- abc-grocery   <- project
  - primary     <- dataset
    - sales     <- table
    - customers
    - products
  - social
    - twitter_posts
    - instagram_posts
    - facebook_posts

Make a dataset

Before we make our first table, we'll need to make a dataset. The easiest way to make a dataset is to use the web interface like this.

create dataset

  1. Click the three dots next to your project-id in the explorer pane.
  2. Click Create dataset.
  3. Set Dataset ID equal to "mydataset" and then choose a Data location close to you.
  4. Click Create Dataset.

There are lots of other ways to make a dataset. For example, we could make the same dataset using a SQL query like this.

  CREATE SCHEMA project-id.mydataset

Make a table

Now we can make a table within mydataset by executing the following SQL statement in the editor.

CREATE TABLE mydataset.helloworld AS
 
SELECT 1 AS id, 'John'  AS name UNION ALL
SELECT 2 AS id, 'Ryan'  AS name UNION ALL
SELECT 3 AS id, 'Carly' AS name UNION ALL
SELECT 4 AS id, 'Kora'  AS name;

create table

We can inspect the table's schema and details, and preview its data.

...

Query a table

Now let's query the table we just created by executing the following SQL statement.

SELECT * FROM mydataset.helloworld WHERE name IN ('Carly', 'Ryan');

run query

Notice the estimate query "cost" reported as bytes processed in the top right of the details pane.

bigquery estimated cost

In this example, BigQuery expects the query to process 57 bytes. For reference, 1TB of data processed currently costs $5 (although the first TB per month is free.)

BigQuery Jobs

In BigQuery, each executed query translates into a BigQuery job with details like Who ran the query? When was the query executed? and How many bytes of data did the query process?

Jobs are conveniently stored under the Project History and Personal History tabs, so you can fetch previously run queries even if you didn't explicitly save them!

bigquery jobs

Query Notes
  • BigQuery SQL is case-insensitive (but most people use all caps by convention)
  • Tables need to be qualified with their parent dataset name (e.g. foo.bar as opposed to simply bar). Often, you'll see tables qualified by their parent project and dataset (e.g. myproject.foo.bar).

Tips & Tricks

  • You can view a list of keyboard shortcuts by clicking on the keyboard icon near the top right of the details pane.

    bigquery shortcuts

  • If you have multiple query statements in the editor, you can run one of them by highlighting it clicking Run (or use the shortcut cmd + return (mac), ctrl + enter (windows)).

Public Datasets

BigQuery hosts some really cool public datasets that we'll make use of throughout this problem set. To access the public datasets,

find public dataset

  1. Search for "public" in the explorer pane
  2. Click "Broaden search to all projects"
  3. Find the project titled "bigquery-public-data" and pin it
  4. Clear the text inside the search bar
  5. Expand the bigquery-public-data project to see dozens of datasets within it.

You can query any of these datasets, but be cautious as some of them are pretty large (and thus expensive 💰).

Example

-- Identify the top 10 baby names in the US in 2020
SELECT name, sum(number) as number
FROM `bigquery-public-data.usa_names.usa_1910_current`
group by name
ORDER BY number DESC
LIMIT 10

top baby names