Data Alignment in PostgreSQL

一年前開始處理千萬級資料時，考慮到migration速度與資料儲存空間的使用，才知道資料庫也有alignment的機制。64位元作業系統是8 bytes的對齊，取自link的範例，對齊前:

CREATE TABLE t1 (
 , a char
 , b int2    -- 1 byte of padding after a
 , c char
 , d int4    -- 3 bytes of padding after c
 , e char
 , f int8    -- 7 bytes of padding after e
 );

對齊後，可節省11位元組空間，不管是儲存空間還是CPU使用效率都會有所提升。

CREATE TABLE t1 (
 , f int8
 , d int4
 , b int2
 , a char
 , c char
 , e char
 );

我的問題是，各種型態所需的數據對齊方式該怎查呢?

根據這篇實驗結果，欄位規則如下:

消除tuple PADDING, 字段顺序规则：
1、优先使用定长类型(例如numeric, decimal如果业务上不需要无限精度, 那么请使用定长的整型或浮点型代替)
2、定长字段（从大到小）
3、变长字段

因此我們必須知道哪些是屬於可變長的型態與型態所需要的對齊方式。此link提供了一個SQL，讓你可以列出所有type的typalign與typlen。其中typalign用於決定所需要的對齊方式；typlen為-1則代表為可變長的型態:

SELECT typname,typbyval,typlen,typalign FROM pg_type;

假如我要找某一種特定type，加入where限制條件即可:

SELECT typname,typbyval,typlen,typalign FROM pg_type WHERE typname='int2';

結果如下，可以發現typalign的字元有多種:

接下來就是去link中搜尋typalign，

typalign char

typalign is the alignment required when storing a value of this type. It applies to storage on disk as well as most representations of the value inside PostgreSQL. When multiple values are stored consecutively, such as in the representation of a complete row on disk, padding is inserted before a datum of this type so that it begins on the specified boundary. The alignment reference is the beginning of the first datum in the sequence. Possible values are:
c = char alignment, i.e., no alignment needed.
s = short alignment (2 bytes on most machines).
i = int alignment (4 bytes on most machines).
d = double alignment (8 bytes on many machines, but by no means all).

接著就可以按照前述規則去調整欄位了。型態別名可以參考此link。
也可以直接透過以下SQL直接查詢某table的狀態:

# 替換example_table為你的table即可。
SELECT a.attname, t.typname, t.typalign, t.typlen  
  FROM pg_class c  
  JOIN pg_attribute a ON (a.attrelid = c.oid)  
  JOIN pg_type t ON (t.oid = a.atttypid)  
 WHERE c.relname = 'example_table'  
   AND a.attnum >= 0  
 ORDER BY a.attnum;

透過上述SQL，可以快速確認調整結果: (需在建index之前測試，不然會不準喔!)

Data Alignment in PostgreSQL

Problem

How to?

Reference