All About Database Internals(Part-2): How Database Stores Table On Disk

Dhroov Gupta
5 min readMay 28, 2023

--

In the last part of this series, We talked about ACID properties and why it is necessary to understand them.

In this article, We'll discuss a very interesting topic about Database internals: HOW DATABASE STORES TABLES ON DISK.

Overview

We all know that relational databases stores data in the form of tables, and inside those tables some rows determine an entry of a single entity(Example: User, Employee).

But behind the scenes, the database engine stores these rows, and tables differently from what we see on the client side. Let's take a look.

Pages Representation in Heap

The rows in the table are stored in Pages, these pages are a block of storage sitting in the Heap that can contain multiple rows. Now here things get interesting, when we query the database to fetch the row the database engine gets the whole page that has the row, not a single row.

Let's dive deep into some of the terms we used in this section.

Page

Page Representation

A Page is a fixed-size block of data stored in the disk, they are stored contiguously which means they are next to each other in the storage disk so that the data retrieval process is efficient.

A single page can store multiple rows depending upon the size of the page and the size of a single row. By default, the size of the page in MySQL and Postgres are 16KB and 8KB respectively. We can configure it according to our needs.

IO

An IO means a call to disk to read/write data.

When we execute a query, the database engine will do an IO operation to fetch the pages from the disk. It can fetch more than one page in a single IO depending upon various factors like data size, and disk access patterns.

IO is expensive, less IO means more efficient

Heap

A heap is a storage structure, which is used to store the pages of the table on disk.

When the data is inserted into the table, the database engine typically appends the new rows to the last page of the heap on the disk. If the last page becomes full, a new page is allocated to accommodate the additional rows. This process continues as more data is inserted, resulting in multiple pages forming the heap.

How query executes?

In this section, we'll see how the database engine executes the query behind the scenes and its performance.

Setup Database:

CREATE TABLE employees (id SERIAL, name VARCHAR(255), salary INT);
INSERT INTO employees (name, salary) SELECT substr(md5(random()::text), 0, 10), floor(random() * (1000 - 1 + 1) + 1)::int from generate_series(1, 1000000);

We have created a table of employees with 1 million records of random values.

Now, let's execute a query to see the performance and how it is executed.

As we can see in the image, this select query took 183.5ms which is quite high in terms of database.

Let's understand why it took so much time —

  • EXPLAIN ANALYZE query in Postgres shows the execution plan of the query.
  • As we can see, Postgres did the Parallel seq scan which means instead of doing a full table scan it used multiple worker processes to scan the whole table in parallel.
  • Postgres assigns a range of pages in a heap to each worker process, and then it scans all the pages and returns the pages which have the data required.
  • There were 333337 rows removed after filtering as shown in the result which means all the pages were pulled from the heap and then in the memory, the Postgres engine filtered the rows based on the where clause.

How can we improve the performance?

There is one thing we can use — INDEX

Index

An Index is another data structure separate from the heap that has pointers to the heap data.

Once we find the index, we can directly go to that page in the heap and fetch the data required, we don't have to do a full table scan.

We can index one or more columns based on our needs.

Typically, data structures for the index are B-tree or B+ Tree.

I'll write about B-tree in depth in the later part of the series.

Now, let's see the index in action and how it will improve the performance of the same query.

How query executes with the index?

Let's create the index on the primary key id.

CREATE INDEX on employees (id);
EXPLAIN ANALYZE SELECT name, salary from employees where id=923456;

We ran the above statements and we can see it only took 0.2ms to get the results.

Let's break down this execution plan shown in the image —

  • Postgres did scan the index first to get the pointer of the page in the heap.
  • After getting the page from the heap, it filters the page in the memory and returned the requested data.

That's it, this is a high-level view of how the database stores data and how it retrieves it with and without an index.

--

--

Dhroov Gupta
Dhroov Gupta

Written by Dhroov Gupta

SDE II @ Hubilo | Backend Developer

Responses (1)