Insight for DBAs Archives - Percona Database Performance Blog

Upcoming Webinar Friday 1/4: High-Performance PostgreSQL, Tuning and Optimization Guide


Please join Percona’s Senior Software Engineer, Ibrar Ahmed, as he presents his High-Performance PostgreSQL, Tuning and Optimization Guide on Friday, January 4th, at 8:00 AM PDT (UTC-7) / 11:00 AM EDT (UTC-4).

Register Now

PostgreSQL is one of the leading open-source databases. Out of the box, the default PostgreSQL configuration is not tuned for any particular workload; as a result, even a system with minimal resources can run it. The flip side is that PostgreSQL does not deliver optimum performance on high-performance machines, because it does not use all the available resources. PostgreSQL provides a system where you can tune your database according to your workload and your machine’s specifications. In addition to PostgreSQL itself, we can also tune our Linux box so that the database workload runs optimally.

In this webinar on High-Performance PostgreSQL, Tuning and Optimization, we will learn how to tune PostgreSQL and we’ll see the results of that tuning. We will also touch on tuning some Linux kernel parameters.
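
To give a flavor of the kind of tuning the webinar covers, here is a minimal sketch of a few commonly adjusted parameters. The values are illustrative assumptions for a dedicated server with around 32GB of RAM, not recommendations for your workload:

-- Illustrative starting points only; the right values depend on your hardware and workload.
ALTER SYSTEM SET shared_buffers = '8GB';          -- often around 25% of RAM
ALTER SYSTEM SET effective_cache_size = '24GB';   -- estimate of memory available for caching
ALTER SYSTEM SET work_mem = '64MB';               -- per sort/hash operation
ALTER SYSTEM SET maintenance_work_mem = '1GB';    -- used by VACUUM, CREATE INDEX
SELECT pg_reload_conf();                          -- note: shared_buffers still requires a restart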

 


Amazon RDS Aurora MySQL – Differences Among Editions


Amazon Aurora with MySQL Compatibility comes in three editions which, at the time of writing, have quite a few differences around the features that they support. Make sure you don’t assume the newer Aurora 2.x supports everything in Aurora 1.x. On the contrary, right now Aurora 1.x (MySQL 5.6 based) supports most Aurora features. The serverless option was launched for this version, and it is not based on the latest MySQL 5.7. However, the serverless option, too, has its own set of limitations.

I found a concise comparison of what is available in which Amazon Aurora edition hard to come by so I’ve created one.  The table was compiled based mostly on documentation research, so if you spot some mistakes please let me know and I’ll make a correction.

Please keep in mind, this is expected to change over time. For example, Amazon Aurora 2.x was initially released without Performance_Schema support, which was enabled in later versions.

There seems to be a lag in porting Aurora features from the MySQL 5.6-compatible edition to the MySQL 5.7-compatible one: the current 2.x release does not include features introduced in Aurora 1.16 or later, as per this document.

A comparison table

| Feature | MySQL 5.6 Based | MySQL 5.7 Based | Serverless (MySQL 5.6 Based) |
|---|---|---|---|
| Compatible with MySQL | MySQL 5.6.10a | MySQL 5.7.12 | MySQL 5.6.10a |
| Aurora Engine Version | 1.18.0 | 2.03.01 | 1.18.0 |
| Parallel Query | Yes | No | No |
| Backtrack | Yes | No | No |
| Aurora Global Database | Yes | No | No |
| Performance Insights | Yes | No | No |
| SELECT INTO OUTFILE S3 | Yes | Yes | Yes |
| Amazon Lambda – Native Function | Yes | No | No |
| Amazon Lambda – Stored Procedure | Yes | Yes | Yes |
| Hash Joins | Yes | No | Yes |
| Fast DDL | Yes | Yes | Yes |
| LOAD DATA FROM S3 | Yes | Yes | No |
| Spatial Indexing | Yes | Yes | Yes |
| Asynchronous Key Prefetch (AKP) | Yes | No | Yes |
| Scan Batching | Yes | No | Yes |
| S3 Backup-Based Migration | Yes | No | No |
| Advanced Auditing | Yes | Yes | No |
| Aurora Replicas | Yes | Yes | No |
| Database Cloning | Yes | Yes | No |
| IAM Database Authentication | Yes | Yes | No |
| Cross-Region Read Replicas | Yes | Yes | No |
| Restoring Snapshot from MySQL DB | Yes | Yes | No |
| Enhanced Monitoring | Yes | Yes | No |
| Log Export to CloudWatch | Yes | Yes | No |
| Minor Version Upgrade Control | Yes | Yes | Always On |
| Data Encryption Configuration | Yes | Yes | Always On |
| Maintenance Window Configuration | Yes | Yes | No |

I hope this helps you select the Amazon Aurora edition that is right for you when it comes to supported features.
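
If you are not sure which edition a given cluster is running, a quick check from any client connection can tell you (a small sketch; AURORA_VERSION() is an Aurora-specific function and the values shown in comments are examples):

SELECT VERSION();          -- MySQL compatibility version, e.g. 5.6.10a or 5.7.12
SELECT AURORA_VERSION();   -- Aurora engine version, e.g. 1.18.0 or 2.03.01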


Photo by Nathan Dumlao on Unsplash

Upcoming Webinar Wed 1/9: Walkthrough of Percona Server MySQL 8.0


Please join Percona’s MySQL Product Manager, Tyler Duzan, as he presents Walkthrough of Percona Server MySQL 8.0 on Wednesday, January 9th at 11:00 AM PDT (UTC-7) / 2:00 PM (UTC-4).

Register Now

Our Percona Server for MySQL 8.0 software is the company’s free, enhanced, drop-in replacement for MySQL Community Edition. The software includes all of the great features in MySQL Community Edition 8.0. Additionally, it includes enterprise-class features from Percona made available free and open source. Thousands of enterprises trust Percona Server for MySQL to deliver excellent performance and reliability for their databases and mission-critical applications. Furthermore, our open source software meets their need for a mature, proven and cost-effective MySQL solution.

In sum, register for this webinar for a walkthrough of Percona Server for MySQL 8.0.

Amazon Aurora Serverless – The Sleeping Beauty


One of the most exciting features Amazon Aurora Serverless brings to the table is its ability to go to sleep (pause) when idle. This is a fantastic feature for development and test environments. You get access to a powerful database to run tests quickly, but it goes easy on your wallet as you only pay for storage when the instance is paused.

You can configure Amazon RDS Aurora Serverless to go to sleep after a specified period of time. This can be set to anywhere between five minutes and 24 hours.

(Screenshot: configuring the Amazon RDS Aurora Serverless pause timeout)

For this feature to work, however, inactivity has to be complete. If you have so much as a single query or even maintain an idle open connection, Amazon Aurora Serverless will not be able to pause.

This means, for example, that pretty much any monitoring you may have enabled, including our own Percona Monitoring and Management (PMM), will prevent the instance from pausing. It would be great if Amazon RDS Aurora Serverless allowed us to specify user accounts to ignore, or additional service endpoints which should not prevent it from pausing, but currently you need to get by without such monitoring and diagnostic tools, or else enable them only for the duration of the test run.
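
Before expecting a pause to happen, it can be useful to check what is still connected to the cluster. A minimal sketch, run right before you disconnect your own session:

-- Any other session listed here (monitoring agents, idle application
-- connections, and so on) will keep the cluster from pausing.
SELECT id, user, host, command, time
FROM information_schema.processlist;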

If you’re using Amazon Aurora Serverless to back very low traffic applications, you might consider disabling the automatic pause function, since waking up currently takes quite a while. Otherwise, your users should be prepared for a 30+ second wait while Amazon Aurora Serverless activates.

Having such a long activation time means you need to be mindful of timeout configuration in your test/dev scripts so you do not have to deal with sporadic failures. Alternatively, you can use something like the mysqladmin ping command to activate the instance before your test run.

Some activation experiments

Let’s now take a closer look at Amazon RDS Aurora Serverless activation times. These times are measured for MySQL 5.6 based Aurora Serverless, the only edition currently available. I expect the numbers could be different in other editions.

(Chart: Amazon RDS Aurora Serverless activation times, min/max/avg by ACU size)

I measured the time it takes to run a trivial query (SELECT 1) after the instance goes to sleep. You’ll see I manually scaled the Amazon RDS Aurora Serverless instance to the desired capacity in ACU (Aurora Capacity Units), and then had the script wait for six minutes to allow the pause to happen before running the query. The test was performed 12 times, and the min/max/avg times of these test runs for different ACU settings are presented above.

You can see there is some variation between min and max times. I would expect even higher outliers, so plan for an activation time of more than a minute as a worst-case scenario.

Also note that there is an interesting difference in the activation time between instance sizes. While in my tests the smallest possible size (2 ACU) consistently took longer to activate compared to the medium size (8 ACU), the even bigger size (64 ACU) was the slowest of all.

So make no assumptions about how long it will take for an instance of a given size to wake up with your workload; rather, test it if it is an important consideration for you.

In some (rare) cases I also observed some internal timeouts during the resume process:

[root@ip-172-31-16-160 serverless]# mysqladmin ping -h serverless-test.cluster-XXXX.us-east-2.rds.amazonaws.com -u user -ppassword
mysqladmin: connect to server at 'serverless-test.cluster-XXXX.us-east-2.rds.amazonaws.com' failed
error: 'Database was unable to resume within timeout period.'

What about Autoscaling?

Finally, you may wonder how Amazon Aurora Serverless pausing plays with Amazon Aurora Serverless autoscaling?

In my tests, I observed that resume always restores the instance to the same ACU size it had before it was paused. However, this is where the pausing configuration matters a great deal. According to this document, Amazon Aurora Serverless will not scale down more frequently than once per 900 seconds. While the document does not clarify over what period of time the conditions initiating a scale down (CPU usage, connection usage, etc.) have to be met for it to be triggered, I can see that if the instance is idle for five minutes the scale down is not performed; it is just put to sleep.

At the same time, if you change this default five minute period to a longer time, the idle instance will be automatically scaled down a notch every 900 seconds before it finally goes to sleep. Consequently, when it is awakened it will not be at the last stage at which the load was applied, but instead at the stage it was at when it was scaled down. Also, scaling down is considered an event by itself, which resets the idle counter and delays the pause. For example: if the initial instance scale is 8 ACU and the pause timer is set to one hour, it takes one hour and 30 minutes for the pause to actually happen: 30 minutes to scale down twice, plus one hour at the minimum size for the pause to trigger.

Here is a graph to illustrate this:

(Graph: Amazon Aurora Serverless scale down timings)

The graph also shows that when the load is re-applied at about 13:47, the instance recovers to the last ACU count it had before the pause.

This means that a pause time of more than 15 minutes makes the pause behavior substantially different to the default.

Summary

  • Amazon Aurora Serverless automatic pause is a great fit for test/dev environments.
  • Resume time is relatively long and can reach as much as one minute.
  • Consider disabling automatic pausing for low traffic production applications, or at least let your users know they need to wait when they wake up the application.
  • Pause and Resume behavior is different in practice for a pause timeout of more than 15 minutes. Sticking to the default 5 minutes is recommended unless you really know what you’re doing.

PostgreSQL Updatable Views: Performing Schema Updates With Minimal Downtime


Recently, one of our customers asked us how to minimize downtime when upgrading the database structure with changes that are not backwards-compatible. It’s an interesting question and I would like to visit some alternatives here. I will use PostgreSQL for this series of posts and walk through updatable views, INSTEAD OF Triggers, and the Rule System. Later, we’ll discuss alternatives available for other databases like MySQL.

This first post will give an overview of the problem and also the first implementation of the solution in PostgreSQL using updatable Views.

The Motivation

Software is like a living organism and, as such, it evolves. It’s not surprising that database schemas also evolve, and this brings us a problem: how do we minimize downtime when performing upgrades? Or, going further, is it possible to upgrade them without activating maintenance mode and making the service unavailable for our customers?

Let’s say that we want to push out an update 2.0. It’s a major update, and it includes application code changes and changes to the database such as altered tables, dropped columns, new tables and so on. Checking the changelog, we notice that most of the database changes are backwards-compatible, but a few modified tables are not, so we can’t just push out the new database changes without breaking some functionality in the existing codebase. To avoid triggering errors while we upgrade the database, we need to shut down the application servers, update the database, update the codebase, and then get the servers up and running again. That means that we need an unwanted maintenance window!

As per our definition of the problem, we want to get to the point where we don’t have to use this maintenance window, a point where the old and new codebase could coexist for a period of time while we upgrade the system. One solution is to not make changes that the current codebase can’t handle, but, as you may have already assumed, it isn’t really an option when we are constantly trying to optimize and improve our databases. Another option, then, would be to use PostgreSQL updatable views.

Updatable Views

PostgreSQL introduced automatically updatable views in 9.3. The documentation[1] says that simple views are automatically updatable and the system will allow INSERT, UPDATE or DELETE statements to be used on the view in the same way as on a regular table. A view is automatically updatable if it satisfies all of the following conditions:

  • The view must have exactly one entry in its FROM list, which must be a table or another updatable view.
  • The view definition must not contain WITH, DISTINCT, GROUP BY, HAVING, LIMIT, or OFFSET clauses at the top level.
  • The view definition must not contain set operations (UNION, INTERSECT or EXCEPT) at the top level.
  • The view’s select list must not contain any aggregates, window functions, or set-returning functions.

Note that the idea is to provide a simple mechanism that helps when using views: if the view is automatically updatable, the system will convert any INSERT, UPDATE or DELETE statement on the view into the corresponding statement on the underlying base table. This can also be used to increase security granularity, since privileges can be granted on the view rather than on the underlying table. If the view has a WHERE clause, we can use the CHECK OPTION to prevent the user from being able to UPDATE or INSERT rows that are not in the scope of the view. For example, let’s say we have a view created to limit the user to viewing records from a specific country. If the user changes the country of any record, those records would disappear from the view. The CHECK OPTION can help to prevent this from happening. I recommend reading the documentation for more information about how views work in PostgreSQL.
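
To make the country example concrete, here is a minimal sketch using a hypothetical customers table:

CREATE VIEW canadian_customers AS
    SELECT id, name, country
    FROM customers
    WHERE country = 'CA'
    WITH CHECK OPTION;

-- The CHECK OPTION rejects this change because the updated row would
-- fall outside the view's WHERE clause and disappear from the view:
UPDATE canadian_customers SET country = 'US' WHERE id = 42;
-- ERROR:  new row violates check option for view "canadian_customers"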

Implementation

Using updatable views makes the implementation as simple as creating views. For our example I will use the table below:

test=# CREATE TABLE t (id INTEGER PRIMARY KEY, name VARCHAR(100) NOT NULL, password VARCHAR(300) NOT NULL, date_created TIMESTAMP NOT NULL DEFAULT now());
CREATE TABLE
test=# INSERT INTO t(id, name, password) VALUES (1, 'user_1', 'pwd_1'), (2, 'user_2','pwd_2'),(3,'user_3','pwd_3'),(4,'user_4','pwd_4'),(5,'user_5','pwd_5');
INSERT 0 5
test=# SELECT * FROM t;
id | name | password | date_created
----+--------+----------+----------------------------
1 | user_1 | pwd_1 | 2018-12-27 07:50:39.562455
2 | user_2 | pwd_2 | 2018-12-27 07:50:39.562455
3 | user_3 | pwd_3 | 2018-12-27 07:50:39.562455
4 | user_4 | pwd_4 | 2018-12-27 07:50:39.562455
5 | user_5 | pwd_5 | 2018-12-27 07:50:39.562455
(5 rows)

We then changed the schema, renaming the column password to pwd and date_created to dt_created, and added two more columns, pwd_salt and comment. The added columns are not a real problem because they can either be nullable or have a default value, but the column renames are a problem. The changes are:

test=# create schema v_10;
CREATE SCHEMA
test=# CREATE VIEW v_10.t AS SELECT id, name, password AS password, date_created AS date_created FROM public.t;
CREATE VIEW
test=# ALTER TABLE public.t RENAME COLUMN password TO pwd;
ALTER TABLE
test=# ALTER TABLE public.t RENAME COLUMN date_created TO dt_created;
ALTER TABLE
test=# ALTER TABLE public.t ADD COLUMN pwd_salt VARCHAR(100);
ALTER TABLE
test=# ALTER TABLE public.t ADD COLUMN comment VARCHAR(500);
ALTER TABLE

To make sure our application will work properly, we’ve defined that the tables will live in a specific main schema, which in this example is the public schema, while the views will live in the versioned schemas. In this case, if we have a change in one specific version that needs a view to guarantee backwards-compatibility, we just create the view inside the versioned schema and apply the changes to the table in the main schema. The application will always set the “search_path” to “versioned_schema,main_schema”, which is “v_10, public” in this example:

test=# SET search_path TO v_10, public;
SET
test=# SELECT * FROM t;
id | name | password | date_created
----+--------+----------+----------------------------
1 | user_1 | pwd_1 | 2018-12-27 07:50:39.562455
2 | user_2 | pwd_2 | 2018-12-27 07:50:39.562455
3 | user_3 | pwd_3 | 2018-12-27 07:50:39.562455
4 | user_4 | pwd_4 | 2018-12-27 07:50:39.562455
5 | user_5 | pwd_5 | 2018-12-27 07:50:39.562455
(5 rows)
test=# select * from public.t;
id | name | pwd | dt_created | pwd_salt | comment
----+--------+-------+----------------------------+----------+---------
1 | user_1 | pwd_1 | 2018-12-27 07:50:39.562455 | |
2 | user_2 | pwd_2 | 2018-12-27 07:50:39.562455 | |
3 | user_3 | pwd_3 | 2018-12-27 07:50:39.562455 | |
4 | user_4 | pwd_4 | 2018-12-27 07:50:39.562455 | |
5 | user_5 | pwd_5 | 2018-12-27 07:50:39.562455 | |
(5 rows)

As we can see, the application still sees the old schema, but does this work? What if someone updates the password of ID #3? Let’s check:

test=# UPDATE t SET password = 'new_pwd_3' WHERE id = 3;
UPDATE 1
test=# SELECT * FROM t;
id | name | password | date_created
----+--------+-----------+----------------------------
1 | user_1 | pwd_1 | 2018-12-27 07:50:39.562455
2 | user_2 | pwd_2 | 2018-12-27 07:50:39.562455
4 | user_4 | pwd_4 | 2018-12-27 07:50:39.562455
5 | user_5 | pwd_5 | 2018-12-27 07:50:39.562455
3 | user_3 | new_pwd_3 | 2018-12-27 07:50:39.562455
(5 rows)
test=# SELECT * FROM public.t;
id | name | pwd | dt_created | pwd_salt | comment
----+--------+-----------+----------------------------+----------+---------
1 | user_1 | pwd_1 | 2018-12-27 07:50:39.562455 | |
2 | user_2 | pwd_2 | 2018-12-27 07:50:39.562455 | |
4 | user_4 | pwd_4 | 2018-12-27 07:50:39.562455 | |
5 | user_5 | pwd_5 | 2018-12-27 07:50:39.562455 | |
3 | user_3 | new_pwd_3 | 2018-12-27 07:50:39.562455 | |
(5 rows)

As we can see, the updatable view worked like a charm! The new and old application codebases can coexist and work together while we roll out our upgrades. There are some restrictions, as explained in the documentation, like having only one table or view in the FROM list, but for its simplicity, updatable views do a great job. What about more complex cases where we need to split or join tables? We will discuss those in future articles and show how we can solve them with both TRIGGERS and the PostgreSQL Rule System.
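
As a closing note, once the old application version is fully retired, the compatibility layer can simply be dropped. A sketch for the example above:

-- The views in v_10 are removed; the data in public.t is untouched
DROP SCHEMA v_10 CASCADE;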

References

[1] https://www.postgresql.org/docs/current/sql-createview.html


Photo by Egor Kamelev from Pexels

Upcoming Webinar Thurs 1/17: How to Rock with MyRocks


Please join Percona’s Chief Technology Officer, Vadim Tkachenko, as he presents How to Rock with MyRocks on Thursday, January 17th at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

Register Now

MyRocks is a new storage engine from Facebook and is available in Percona Server for MySQL. In what cases will you want to use it? We will look at different workloads and see where MyRocks is most suitable for you. Also, as with any new engine, it’s important to set it up and tune it properly, so we will review the most important settings to pay attention to.

Register for this webinar to learn How to Rock with MyRocks.

Upcoming Webinar Thurs 1/24: Databases Gone Serverless?


Please join Percona’s Senior Technical Manager, Alkin Tezuysal, as he presents Databases Gone Serverless? on Thursday, January 24th, at 6:00 AM PDT (UTC-7) / 9:00 AM EDT (UTC-4).

Register Now

Serverless computing is becoming more popular with developers because it enables them to build and run applications without needing to operate and manage servers. This talk will provide a high-level overview of serverless applications in the database world, including the use cases, possible solutions, services and benefits provided through the cloud ecosystem. In particular, we will focus on the capabilities of the AWS serverless platform.

In order to learn more, register for this webinar on Databases Gone Serverless.

Upcoming Webinar Wed 1/30: Percona XtraDB Cluster: Failure Scenarios and their Recovery


Please join Percona’s Senior Technical Manager, Alkin Tezuysal, and Percona XtraDB Cluster Lead, Krunal Bauskar, as they present their talk, Percona XtraDB Cluster: Failure Scenarios and their Recovery, on Wednesday, January 30th, 2019, at 8:00 AM PST (UTC-8) / 11:00 AM EST (UTC-5).

Register Now

Percona XtraDB Cluster (a.k.a. PXC) is an open source, multi-master, high availability MySQL clustering solution. PXC works with your MySQL / Percona Server-created database. Given the multi-master aspect, there are multiple guards to protect the cluster from entering an inconsistent state. Most of these guards are configurable for the user’s environment; however, if they are not configured properly they could cause the cluster to stall, fail, or error out.

In this session, we’ll discuss failure scenarios, including a MySQL cluster entering a non-primary state due to network partitioning. We’ll also discuss a cluster stall due to flow control, data inconsistency causing the shutdown of a node, and common problems during the initial catch-up, a.k.a. State Snapshot Transfer (SST). Other issues include delays in the purging of a transaction, a blocking DDL causing the entire cluster to stall, and a misconfigured cluster.

We will also go over how to solve some of these problems and how to safely recover from these failures.

To learn more, register for Percona XtraDB Cluster: Failure Scenarios and their Recovery.


Upcoming Webinar Thurs 1/31: Percona Server for MongoDB 4.0 Feature Walkthrough


Please join Vinodh Krishnaswamy as he presents his talk, Percona Server for MongoDB 4.0 Feature Walkthrough, on January 31st, 2019, at 6:00 AM PST (UTC-8) / 9:00 AM EST (UTC-5).

Register Now

Percona Server for MongoDB is an enhanced, open source, and highly-scalable database. Moreover, it is a fully-compatible, drop-in replacement for MongoDB 4.0 Community Edition. It also supports MongoDB 4.0 protocols and drivers.

Percona Server for MongoDB extends the functionality of the MongoDB 4.0 Community Edition by including the Percona Memory Engine storage engine, encrypted WiredTiger storage engine, audit logging, SASL authentication, hot backups, and enhanced query profiling. Additionally, Percona Server for MongoDB requires no changes to MongoDB applications or code.

This release includes all features of MongoDB Community Edition 4.0. Most notable among these are:

– Multi-Document ACID transactions
– Type conversion through the new aggregation operators
– Enhancements to the Change Streams support

In order to learn more, register for the Percona Server for MongoDB 4.0 Feature Walkthrough.

Using pg_repack to Rebuild PostgreSQL Database Objects Online


In this blog post, we’ll look at how to use pg_repack to rebuild PostgreSQL database objects online.

We’ve seen a lot of questions regarding the options available in PostgreSQL for rebuilding a table online. We created this blog post to explain the pg_repack extension, available in PostgreSQL for this requirement. pg_repack is a well-known extension that was created, and is maintained, as an open source project by several authors.

There are three main reasons why you need to use pg_repack in a PostgreSQL server:
  1. Reclaim free space from a table after deleting a huge chunk of records
  2. Rebuild a table to re-order the records and shrink/pack them into a smaller number of pages. This may let a query fetch just one page (or fewer than n pages) instead of n pages from disk. In other words, less IO and more performance.
  3. Reclaim free space from a table that has grown in size with a lot of bloat due to improper autovacuum settings.

You might have already read our previous articles that explained what bloat is, and discussed the internals of autovacuum. After reading these articles, you can see there is an autovacuum background process that removes dead tuples from a table and allows the space to be re-used by future updates/inserts on that table. Over a period of time, tables that receive the most updates or deletes may accumulate a lot of bloated space due to poorly tuned autovacuum settings. This leads to slow-performing queries on these tables. Rebuilding the table is the best way to avoid this.

Why is just autovacuum not enough for tables with bloat?

We have discussed several parameters that change the behavior of the autovacuum process in this blog post. There cannot be more than autovacuum_max_workers autovacuum processes running in a database cluster at a time. At the same time, due to untuned autovacuum settings and no manual vacuuming of the database as a weekly or monthly job, many tables can be skipped by autovacuum. We have discussed in this post that the default autovacuum settings run autovacuum more often on a table with ten records than on a table with a million records. So, it is very important to tune your autovacuum settings, set table-level customized autovacuum parameters, and enable automated jobs to identify tables with huge bloat and run manual vacuum on them as scheduled jobs during off-peak times (after thorough testing).
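
As an example of the table-level customization mentioned above, here is a minimal sketch; the table name and values are illustrative, not recommendations:

ALTER TABLE scott.employee SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum when ~1% of the rows are dead
    autovacuum_vacuum_threshold    = 1000   -- plus a fixed floor of 1000 dead rows
);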

VACUUM FULL

VACUUM FULL is the default option available with a PostgreSQL installation that allows us to rebuild a table. This is similar to ALTER TABLE in MySQL. However, this command acquires an exclusive lock and blocks both reads and writes on the table.

VACUUM FULL tablename;

pg_repack

pg_repack is an extension available for PostgreSQL that helps us rebuild a table online. This is similar to pt-online-schema-change for online table rebuild/reorg in MySQL. However, pg_repack only works for tables with a primary key or a NOT NULL unique key.
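
If you want to check up front which tables would be rejected, a catalog query along these lines can help. This is only a sketch: it lists ordinary tables without a primary key or unique constraint and does not verify the NOT NULL requirement on unique keys:

SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (SELECT 1 FROM pg_constraint con
                  WHERE con.conrelid = c.oid AND con.contype IN ('p', 'u'));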

Installing pg_repack extension

In RedHat/CentOS/OEL from PGDG Repo

Obtain the latest PGDG repo from https://yum.postgresql.org/ and perform the following step:

# yum install pg_repack11    (for PostgreSQL 11)
# yum install pg_repack10    (for PostgreSQL 10)

In Debian/Ubuntu from PGDG repo

Add the certificates, add the repo, and install pg_repack:

The following certificate may change; please validate it before you perform these steps.
# sudo apt-get install wget ca-certificates
# wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
# sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
# sudo apt-get update
# apt-get install postgresql-server-dev-11
# apt-get install postgresql-11-repack

Loading and creating pg_repack extension

Step 1:

You need to add pg_repack to shared_preload_libraries. To do that, just set this parameter in the postgresql.conf or postgresql.auto.conf file.

shared_preload_libraries = 'pg_repack'

Setting this parameter requires a restart.

$ pg_ctl -D $PGDATA restart -mf

Step 2:

In order to start using pg_repack, you must create this extension in each database where you wish to run it:

$ psql
\c percona
CREATE EXTENSION pg_repack;

Using pg_repack to Rebuild Tables Online

Similar to pt-online-schema-change, you can use the --dry-run option to see if a table can be rebuilt using pg_repack. When you rebuild a table using pg_repack, all of its associated indexes are rebuilt automatically. You can also use -t instead of --table as an argument to rebuild a specific table.

This is the success message you see when a table satisfies the requirements for pg_repack:

$ pg_repack --dry-run -d percona --table scott.employee
INFO: Dry run enabled, not executing repack
INFO: repacking table "scott.employee"

This is the error message you see when a table does not satisfy the requirements for pg_repack:

$ pg_repack --dry-run -d percona --table scott.sales
INFO: Dry run enabled, not executing repack
WARNING: relation "scott.sales" must have a primary key or not-null unique keys

Now, to execute the rebuild of the table scott.employee online, you can use the following command. It is just the previous command without --dry-run.

$ pg_repack -d percona --table scott.employee
INFO: repacking table "scott.employee"

Rebuilding Multiple Tables using pg_repack

Use an additional --table argument for each table you wish to rebuild.

Dry Run

$ pg_repack --dry-run -d percona --table scott.employee --table scott.departments
INFO: Dry run enabled, not executing repack
INFO: repacking table "scott.departments"
INFO: repacking table "scott.employee"

Execute

$ pg_repack -d percona --table scott.employee --table scott.departments
INFO: repacking table "scott.departments"
INFO: repacking table "scott.employee"

Rebuilding an entire Database using pg_repack

You can rebuild an entire database online using -d. Any table that is not eligible for pg_repack is skipped automatically.

Dry Run

$ pg_repack --dry-run -d percona
INFO: Dry run enabled, not executing repack
INFO: repacking table "scott.departments"
INFO: repacking table "scott.employee"

Execute

$ pg_repack -d percona
INFO: repacking table "scott.departments"
INFO: repacking table "scott.employee"


Running pg_repack in parallel jobs

To perform a parallel rebuild of a table, you can use the -j option. Please ensure that you have sufficient free CPUs that can be allocated to run pg_repack in parallel.

$ pg_repack -d percona -t scott.employee -j 4
NOTICE: Setting up workers.conns
INFO: repacking table "scott.employee"

Running pg_repack remotely

You can always run pg_repack from a remote machine. This helps in scenarios where we have PostgreSQL databases deployed on Amazon RDS. To run pg_repack from a remote machine, you must have the same version of pg_repack installed on the remote machine as on the database server (say, AWS RDS).

Upcoming Webinar Wed 2/6: Percona Software News and Roadmap Update


Join Percona CEO Peter Zaitsev as he presents Percona Software News and Roadmap Update on Wednesday, February 6, 2019, at 11:00 AM PST (UTC-8) / 2:00 PM EST (UTC-5).

Register Now

Come and listen to Percona CEO Peter Zaitsev discuss what’s new in Percona open source software. Topics include Percona Server for MySQL and MongoDB, Percona XtraBackup, Percona Toolkit, Percona XtraDB Cluster and Percona Monitoring and Management.

During this webinar, Peter will talk about newly released features in Percona software. He will also show a few quick demos and share with you highlights from the Percona open source software roadmap.

Peter will also talk about new developments in Percona commercial services and finish with a Q&A.

Register today to join Peter for his Percona Software News and Roadmap Update.

Upcoming Webinar Thurs 2/7: Top Trends in Modern Data Architecture for 2019


Please join Percona’s PMM Product Manager, Michael Coburn, for a webinar on The Top Trends in Modern Data Architecture for 2019, hosted by DBTA, on Thursday, February 7th at 11:00 AM PST (UTC-8) / 2:00 PM EST (UTC-5).

Register Now

A strong data architecture strategy is critical to supporting your organization’s data-driven goals. AI and machine learning, data discovery and real-time analytics reflect that notion. Additionally, greater speed, flexibility, and scalability are common wish-list items. Smarter data governance and security capabilities are not that far behind. What’s more, many new technologies and approaches have come to the forefront of data architecture discussions. Data lakes, in-memory databases and engines like Spark and cloud services of all shapes and sizes are just a few examples.

In order to learn more about the top trends in modern data architecture for 2019, register for this webinar today.

Compression Options in MySQL (Part 2)


In one of my previous posts, I started a series on data compression options with MySQL. The first post focused on the more traditional compression options, like InnoDB Barracuda page compression and MyISAM packing. In this second part, I’ll discuss a newer compression option, InnoDB transparent page compression with punch holes, available since MySQL 5.7. First, I’ll describe the transparent page compression method and how it works. Then I’ll present similar results as in the first post.

InnoDB transparent page compression

Before we can discuss transparent page compression, we must understand how InnoDB accesses its data pages. To access an InnoDB page, you need to know the tablespace (the file) and the offset of the page within the tablespace file. The offset is the tough part with data compression. If you just compress pages and concatenate them one after the other, the offsets will no longer be at known intervals. InnoDB Barracuda page compression solves the problem by asking the DBA to guess the compression ratio of the pages with the compressed block size setting. For example, you have to tell InnoDB to use a disk block size of 8KB if you think the compression ratio will be around 2. Transparent page compression uses another approach, sparse files.
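
For reference, the feature is enabled with the COMPRESSION table option (MySQL 5.7+, on a filesystem and kernel with hole-punching support such as XFS or ext4). The table below is a hypothetical example:

CREATE TABLE metrics (
    id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    payload VARCHAR(255)
) COMPRESSION='lz4';

-- An existing table can be switched, but pages are only rewritten as they
-- are modified or when the table is rebuilt:
ALTER TABLE metrics COMPRESSION='zlib';
OPTIMIZE TABLE metrics;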

Sparse files 101

A sparse file is a file with holes in it. Even though a sparse file may be very large, if there are a lot of holes in it, it may end up using a small amount of storage. On almost every Linux system, the /var/log/lastlog file is sparse:

yves@ThinkPad-P51:/var/log$ ls -lah lastlog
-rw-rw-r-- 1 root utmp 18M jan 5 16:09 lastlog
yves@ThinkPad-P51:/var/log$ du -hs lastlog
56K lastlog

While the ls command reports an apparent size of 18MB, the du command tells us the file actually uses only 56KB. Most of the space in the file is actually unallocated. When you access a sparse file, the filesystem has to map the actual physical offsets in the file with the logical offsets seen by the application. A logical offset is no longer directly the number of bytes since the beginning of the file.

Now that we understand a bit about what sparse files are, let’s talk about the punch hole aspect. When you write something to disk, you can use the fallocate call to free up, or punch, part of it. The freed/punched portion is thus a hole in the file, and the filesystem can later reuse the hole to store something else. Let’s follow a simplified view of the steps required to write a transparently compressed InnoDB page.

InnoDB using sparse files

Figure 1: InnoDB Transparent page compression

In figure 1, an in-memory 16KB InnoDB page with 14KB of data is going to be written to disk. As part of the write process, the data is compressed to 6KB and the page is written to the disk. Once written, InnoDB uses the fallocate call to release the 10KB of unused space. Since only full blocks are released, only 8KB is really freed. The remaining unreleased space (2KB) is just zeroed. The freed space will be reused, either for the same file or by another one. For simplicity, let’s assume the space is reused by the same InnoDB file.

Figure 2: File system layout

If there is no immediate reuse, a portion of the InnoDB file will look like the top file layout of figure 2. The pages (numbers) are still sequentially laid out but there are holes in between. As the file system gets full, it will start to reuse the freed space so eventually the file layout will look like the bottom one. If you notice, in the bottom layout the pages are no longer in sequential order. There are consequences to that: the notion of disk sequential access is gone. The most stunning example is a simple file copy on a spinning device. While copying a 1GB regular file may take only 30 seconds, the copy of a 1GB sparse file can take much longer, up to 30 minutes in the worst cases. The impact on physical backup tools, like Percona XtraBackup, is thus important. Normally physical backups are much faster than logical ones (e.g. mysqldump), but with sparse files it may no longer be true.

MySQL impacts

There are also consequences of the use of sparse files on the design of a MySQL database server. The added random operations increase the importance of using SSD/flash based storage. Also, some settings must be considered from a different perspective (a quick sketch of these adjustments follows below):

  • innodb_flush_neighbors should be 0, since a value of 1 is an optimization geared toward sequential operations
  • innodb_read_ahead_threshold is normally set to 56, meaning that when 56 pages of an extent have been scanned, the next extent is read ahead of time sequentially. To be really useful, the next extent should be read before the remaining 8 pages of the current extent are read. Since sequential operations are slower, maybe this value should be lowered a little. The drawback is an increased possibility of a read ahead that is never used.
  • innodb_random_read_ahead is a wilder setting; it would be a good idea to experiment with this for your workload

There are likely to be other affected settings.
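
Here is the quick sketch of those adjustments, applied dynamically; the values are starting points to experiment with, not recommendations:

SET GLOBAL innodb_flush_neighbors = 0;        -- neighbor flushing only helps sequential IO
SET GLOBAL innodb_random_read_ahead = ON;     -- worth experimenting with for your workload
SET GLOBAL innodb_read_ahead_threshold = 48;  -- slightly lower than the default of 56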

Review of the test procedure

Just to refresh memories, I am using two datasets for the basic benchmarks. The first, Wikipedia, consists of about 1B rows of Wikipedia access logs. It is moderately compressible. The second dataset, o1543, is from the defunct Percona cloud tool project. It has only 77M rows but they are much wider, with 134 columns. The o1543 dataset is highly compressible.

On these two datasets, the following steps were executed:

  1. insert the rows: record time, final size and amount of data written
  2. large range select, record the time
  3. 20k updates, record the time and total bytes written

Results

Final sizes

Figure 3, Innodb transparent page compression final sizes

One of the most critical metrics with compression is the final dataset size, as shown in figure 3. The possibility to use larger InnoDB pages is a big thing with transparent page compression. Larger pages allow for more repetitive patterns to be present within a page, and that improves the compression ratio. Results using page sizes of 16KB, 32KB and 64KB are shown. The uncompressed results are used as references, transparent compression (TC) using Lz4 and Zlib are the actual compressed datasets. First, we see that larger page sizes barely affect the size of the uncompressed dataset (I16, I32 and I64). Since the datasets were inserted in primary key order, the only possible impact is the filling factor of the pages. When InnoDB fills a page in PK order, even when the innodb_fill_factor is set to 100, it always leaves 1KB free per 16KB. With an amount of free space that scales with the page size, the final size doesn’t change much.

The impacts of larger page sizes on the compression ratio are important. The most drastic example is with the o1543 dataset and Zlib compression. While with a 16KB page the compression ratio was already decent, at 3.65, it grows to an amazing 8.7 (I16/I64TCZlib) with pages of 64KB. Larger page sizes also have a positive impact on the compression ratio of the Wikipedia dataset. The original compression ratio with Zlib and 16KB pages is 2.4 and it grows to 3.4 with 64KB pages. Datasets compressed with Lz4 behave similarly to the Zlib ones, but the compression ratios are slightly lower.

Overall, the I64TCZlib result for the Wikipedia dataset is the most compressed form we have so far. For the o1543 dataset, the MyISAMPacked compressed size is still slightly smaller, but it is read-only.

Insertion time

Figure 4, InnoDB transparent page compression insert times

We normally expect compression to add overhead but here the insertion speed improves with larger page sizes (figure 4). The reason is likely that we are using spinning disks. Spinning disks have a high latency, so doing larger IO operations helps. The time overhead of compression with transparent page compression hovers between 10 and 17%. That’s much less than the 60% overhead we observed for the Barracuda table compression in the previous post for the Wikipedia dataset (InnoDBCmp8k/InnoDB). We can conclude that the insert rates, when inserts are in PK order, are not much affected by transparent page compression. If you are mostly inserting data, it is a nice win.

Data written by inserts

Figure 5, total amount of data written during the inserts

The amount of data written is not much affected by the transparent compression and the larger page sizes (figure 5). That’s reasonable, as many of the writes are not compressed; only the final write to the tablespace is. Neither the writes to the doublewrite buffer, nor to the InnoDB log files, nor for the tablespace pre-allocation, are compressed. The differences we see are essentially the same as the ones for the final sizes. Only the uncompressed results do not fit that view, but these are rather small deviations.

Range selects

Figure 6, time to complete a long range scan

The range select benchmarks are really a means of testing the decompression overhead. As you notice in figure 6, the time variations are not large. For the Wikipedia dataset, the fastest range select is I64TCLz4, and it completed in 788 seconds. That’s almost two minutes slower than the fastest results using InnoDB Barracuda compression (block_size=4KB). How can we explain such results? If the freed space is reused, transparent compression causes sequential operations to become random ones, and the time should increase. Without space reuse, the storage layer will merge many small reads into a sequential one, and then discard the holes. Effectively, the disk will read the same amount of data, compressed or not. The only difference will come from decompression. Lz4 is extremely fast while Zlib is slower.

Going back to the Wikipedia dataset, it took the exact same time, 830s, for I16, I16TCLz4 and I32TCLz4. That seems to indicate there was no space reuse. With the XFS xfs_bmap tool on a TC compressed file, I listed the blocks used. Here is the command I used and the first lines of the output (with blocks of 512 bytes):

root@LabPS57kvm_1:/tmp# xfs_bmap /var/lib/mysql/test/query_class_metrics.ibd | more
/var/lib/mysql/test/query_class_metrics.ibd:
0: [0..31]: 1013168..1013199
1: [32..39]: 1014544..1014551
2: [40..63]: hole
3: [64..71]: 1016976..1016983
4: [72..95]: hole
5: [96..103]: 1017008..1017015
6: [104..127]: hole
7: [128..135]: 1016880..1016887
8: [136..159]: hole
9: [160..167]: 1016912..1016919
10: [168..191]: hole
...

We have the list:

  • 0..31: 16 KB tablespace header, apparently not compressed
  • 32..39: 4KB TC compressed page, 8 sectors of compressed data
  • 40..63: 12KB hole (24 sectors)
  • …and so on

So the layout actually looks indeed like the filesystem-with-no-reuse case (top layout) of figure 2. When InnoDB extends the tablespace, it of course proceeds by entire pages. The filesystem will try, as much as possible, to allocate contiguous blocks. Initially, the tablespace grows one page at a time, but rapidly it grows by extents of 64 pages. Space reuse will start only when there are no more contiguous areas large enough to satisfy the allocation requests. Until then, the filesystem still performs mostly sequential operations. The performance characteristics will thus change once the freed blocks start to be reused. On a smaller server, I continued to insert data well after the filesystem would have been full without the holes. The insertion rate fell by about half but the read performance appeared unchanged.

The times of the range selects for the o1543 dataset are more predictable. In all cases, larger pages increase performance. That kind of makes sense: InnoDB needs fewer IOPS. With Lz4, InnoDB spends less time decompressing the pages than it would need to read the complete uncompressed pages. The opposite is true for Zlib. The Lz4 results are the fastest, Zlib the slowest, and in between we have the uncompressed results.

20k updates time

Figure 7, time needed to perform 20k updates

Intuitively, I was expecting the larger pages to slow down the updates. Similarly, I was also expecting Lz4 compressed pages to be slower than uncompressed pages, but faster than the ones compressed with Zlib. The above figure shows the times to perform approximately 20k single row updates for both datasets. We performed the updates to the Wikipedia dataset in small separate transactions, while we used a single large update statement for the o1543 dataset.

While the compression algorithm assumption appears to hold true, the one for the page sizes is plainly wrong. Of course, the storage consists of spinning disks, so the latency of random IO dominates. The important factor becomes the number of levels in the b-tree of the table. In the root node of the b-tree and all intermediate nodes, bigger pages mean more pointers to the next level. More pointers cause a bigger fan-out (the ratio of nodes between levels) and fewer levels. Bigger pages also result in fewer leaf-level pages, which in turn require fewer upper-level node pages.

Let’s dive a bit more into this topic. The Wikipedia dataset table has an int unsigned primary key. InnoDB always leaves 1KB free in a page and, along with the primary key, each entry in a node (non-leaf) page has an extra 9 bytes for the pointer to the next level page. Let’s do some math:

  • Total number of pages with 16KB pages = 112.6GB / (15KB) = 7871311 pages
  • Max number of rows in the non-leaf pages for 16KB pages and an int PK = (16 * 1024)/(4 (int PK) + 9 (ptr)) = 1260 rows/pages
  • Minimum number of pages in the first level above the leaf = 7871311 / 1260 = 6247 pages
  • Minimum number of pages at the next level = 6247 / 1260 = 5 pages
  • Root page = 1

Of course, our calculations are an approximation. With a 16KB page size, there are three levels above the leaves for a total of 6253 pages and a size of 98MB. It thus requires 6253 IOPS to warm up the buffer pool with all the node pages. A SATA 7200 rpm disk delivers at best 120 IOPS (one per rotation), so that’s about 52 seconds. Now, let’s redo the same calculations but with a page size of 32KB:

  • Total number of pages with 32KB pages = 110.7GB / (31KB) = 3744431 pages
  • Max number of rows in the non-leaf pages for 32KB pages and an int PK = (32 * 1024)/(4 (int PK) + 9 (ptr)) = 2520 rows/pages
  • Minimum number of pages in the first level above the leaf = 3744431 / 2520 = 1486 pages
  • Root page = 1

Using 32KB pages, we have one level less and only 1487 node pages for a combined size of 47MB. To warm up the buffer pool, we have to load at least the node pages, an operation requiring only a quarter of the IOPS compared to when 16KB pages were used. That’s where most of the performance gains come from. The reduced number of IOPS more than compensates for the longer time needed to read a large page. Again, in this setup, we used spinning disks.

Bytes written per update

Figure 8, average bytes written per update

Now, the last set of results concerns the number of bytes written per update statement (figure 8). There is a big price to pay when you want to use larger InnoDB pages: the write amplification is huge. The number of bytes written scales roughly with the page size. The worst case is the I64 result, about 192KB written for a single row update of an integer field (Wikipedia). If your database workload includes a large number of small single row updates, you should avoid expensive flash devices with 64KB InnoDB pages, as you’ll burn out your devices rapidly.

Operational considerations for larger InnoDB pages and TC

When is it a good idea to use transparent compression? When should you use a larger InnoDB page size? One valid use case is a database storing large quantities of operational metrics, like the o1543 dataset. The compression ratio will be fantastic and the performance penalty limited, at least until the filesystem starts reusing the holes.

If you collect data from a large number of devices and you are likely struggling with TBs of highly compressible data, transparent compression might be an interesting option. The only issue I see, but it is a major one, is how to backup large sparse files. InnoDB transparent page compression with punch holes is an interesting solution but, unless I am missing something, it has a somewhat limited scope. There are other compression options with similar compression ratios and less drawbacks.

In this post we explored a feature available since MySQL 5.7, InnoDB transparent compression with punch holes. Performance-wise, we have an interesting solution which offers excellent compression ratio, especially when larger page sizes are used. The transparent compression with punch holes technique suffers from its foundations, sparse files. Backing up very large sparse files is a slow and IO intensive process. Instead of performing large sequential IO operations, the backup process will require millions of small random IO operations.

So far we have discussed the traditional approaches to compression in MySQL (previous post) and Innodb transparent page compression. The next post of the series on data compression with MySQL will introduce the ZFS filesystem. ZFS externalizes the compression to the filesystem in a way that is pretty similar to InnoDB transparent page compression, but the ZFS b-tree file structure removes the inconvenience of sparse files.

Stay tuned, more results are coming.

plprofiler – Getting a Handy Tool for Profiling Your PL/pgSQL Code


PostgreSQL is emerging as the standard destination for database migrations from proprietary databases. As a consequence, there is an increase in demand for database-side code migration and the associated performance troubleshooting. One might be able to trace latency to a PL/pgSQL function, but explaining what happens within a function can be a difficult question. Things get messier when you know the function call is taking time, but within that function there are calls to other functions as part of its body. It is a very challenging task to identify which line, or block of code, inside a function is causing the slowness. In order to answer such questions, we need to know how much time an execution spends on each line or block of code. The plprofiler project provides great tooling and extensions to address such questions.

Demonstration of plprofiler using an example

The plprofiler source contains a sample for testing plprofiler. This sample serves two purposes: it can be used for testing the configuration of plprofiler, and it is a great place to see how to profile a nested function call. Files related to this are located inside the “examples” directory. Don’t worry, I’ll be running through the installation of plprofiler later in this article.

$ cd examples/

The example expects you to create a database with the name “pgbench_plprofiler”:

postgres=# CREATE DATABASE pgbench_plprofiler;
CREATE DATABASE

The project provides a shell script along with a source tree to test plprofiler functionality. So testing is just a matter of running the shell script.

$ ./prepdb.sh
dropping old tables...
....

Running session level profiling

This profiling uses session-level local data. By default, the plprofiler extension collects runtime data in per-backend hashtables (in-memory). This data is only accessible in the current session, and is lost when the session ends or the hash tables are explicitly reset. plprofiler’s run command will execute the PL/pgSQL code and capture the profile information.

This is illustrated by the example below:

$ plprofiler run --command "SELECT tpcb(1, 2, 3, -42)" -d pgbench_plprofiler --output tpcb-test1.html
SELECT tpcb(1, 2, 3, -42)
-- row1:
tpcb: -42
----
(1 rows)
SELECT 1 (0.073 seconds)

What happens during the above plprofiler run can be summarized in three steps:

  1. A function call with four parameters “SELECT tpcb(1, 2, 3, -42)” is presented to the plprofiler tool for execution.
  2. plprofiler establishes a connection to PostgreSQL and executes the function
  3. The tool collects the profile information captured in the local-data hash tables and generates an HTML report “tpcb-test1.html”

Global profiling

As mentioned previously, this method is useful if we want to profile function executions in other sessions, or on the entire database. During global profiling, data is captured into a shared-data hash table which is accessible to all sessions in the database. The plprofiler extension periodically copies the local data from the individual sessions into the shared hash tables, to make the statistics available to other sessions. See the plprofiler monitor command, below, for details. This data still relies on the local database system catalog to resolve Oid values into object definitions.

In this example, the plprofiler tool will be running in monitor mode for a duration of 60 seconds. Every 10 seconds, the tool copies data from local-data to shared-data.

$ plprofiler monitor --interval=10 --duration=60 -d pgbench_plprofiler
monitoring for 60 seconds ...
done.

For testing purposes you can start executing a few functions at the same time.

Once the data is captured into shared-data, we can generate a report. For example:

$ plprofiler report --from-shared --title=MultipgMax --output=MultipgMax.html -d pgbench_plprofiler

The data in shared-data will be retained until it is explicitly cleared using the plprofiler reset command:

$ plprofiler reset

If there is no profile data present in the shared hash tables, running the report will result in an error message:

$ plprofiler report --from-shared --title=MultipgMax --output=MultipgMax.html
Traceback (most recent call last):
File "/usr/bin/plprofiler", line 11, in <module>
load_entry_point('plprofiler==4.dev0', 'console_scripts', 'plprofiler')()
File "/usr/lib/python2.7/site-packages/plprofiler-4.dev0-py2.7.egg/plprofiler/plprofiler_tool.py", line 67, in main
return report_command(sys.argv[2:])
File "/usr/lib/python2.7/site-packages/plprofiler-4.dev0-py2.7.egg/plprofiler/plprofiler_tool.py", line 493, in report_command
report_data = plp.get_shared_report_data(opt_name, opt_top, args)
File "/usr/lib/python2.7/site-packages/plprofiler-4.dev0-py2.7.egg/plprofiler/plprofiler.py", line 555, in get_shared_report_data
raise Exception("No profiling data found")
Exception: No profiling data found

Report on profile information

The HTML report generated by plprofiler is a self-contained HTML document and it gives detailed information about the PL/pgSQL function execution. There will be a clickable FlameGraph at the top of the report with details about functions in the profile. The plprofiler FlameGraph is based on the actual Wall-Clock time spent in the PL/pgSQL functions. By default, plprofiler provides details on the top ten functions, based on their self_time (total_time – children_time).

This section of the report is followed by a tabular representation of the function calls. For example:

This gives a lot of detailed information, such as execution counts and the time spent on each line of code.

Binary Packages

Binary distributions of plprofiler are not common. However, the BigSQL project provides plprofiler packages as an easy to use bundle. Such ready-to-use packages are one of the reasons BigSQL remains one of the most developer-friendly PostgreSQL distributions. The first screen of the BigSQL package manager installation provided me with the information I was looking for:


It appears that there was a recent release of BigSQL packages, and plprofiler is an updated package within it.

Installation and configuration is made simple:

$ ./pgc install plprofiler-pg11
['plprofiler-pg11']
File is already downloaded.
Unpacking plprofiler-pg11-3.3-1-linux64.tar.bz2
install-plprofiler-pg11...
Updating postgresql.conf file:
old: #shared_preload_libraries = '' # (change requires restart)
new: shared_preload_libraries = 'plprofiler'

As we can see, even the PostgreSQL parameters are updated to have plprofiler as a shared_preload_libraries entry. If I need to use plprofiler for investigating code, these binary packages from the BigSQL project are my first preference because everything is ready to use. Definitely, this is developer-friendly.

Creation of extension and configuring the plprofiler tool

At the database level, we should create the plprofiler extension to profile function execution. This step needs to be performed in both cases: whether we want global profiling, where shared_preload_libraries is set, or session-level profiling, where that is not required.

postgres=# create extension plprofiler;
CREATE EXTENSION

plprofiler is not just an extension; it comes with tooling to invoke profiling and to generate reports. These scripts are primarily written in Python and use psycopg2 to connect to PostgreSQL. The Python code is located inside the “python-plprofiler” directory of the source tree. There are a few Python dependencies too, which will be resolved as part of the installation:

sudo yum install python-setuptools.noarch
sudo yum install python-psycopg2
cd python-plprofiler/
sudo python ./setup.py install

Building from source

If you already have a PostgreSQL instance running using binaries from the PGDG repository, or you want to get your hands dirty by building everything from source, then installation needs a different approach. I have PostgreSQL 11 already running on the system. The first step is to get the corresponding development packages, which have all the header files and libraries needed to support a build from source. Obviously, this is the more thorough way of getting plprofiler working.

$ sudo yum install postgresql11-devel

We need to have build tools, and since the core of plprofiler is C code, we have to install a C compiler and make utility.

$ sudo yum install gcc make

Preferably, we should build plprofiler using the same OS user that runs the PostgreSQL server, which is “postgres” in most environments. Please make sure that all PostgreSQL binaries are available in the path and that you are able to execute pg_config, which lists build-related information:

$ pg_config
BINDIR = /usr/pgsql-11/bin
..
INCLUDEDIR = /usr/pgsql-11/include
PKGINCLUDEDIR = /usr/pgsql-11/include
INCLUDEDIR-SERVER = /usr/pgsql-11/include/server
LIBDIR = /usr/pgsql-11/lib
PKGLIBDIR = /usr/pgsql-11/lib
LOCALEDIR = /usr/pgsql-11/share/locale
MANDIR = /usr/pgsql-11/share/man
SHAREDIR = /usr/pgsql-11/share
SYSCONFDIR = /etc/sysconfig/pgsql
PGXS = /usr/pgsql-11/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--enable-rpath' '--prefix=/usr/pgsql-11' '--includedir=/usr/pgsql-11/include' '--mandir=/usr/pgsql-11/share/man' '--datadir=/usr/pgsql-11/share' '--with-icu' 'CLANG=/opt/rh/llvm-toolset-7/root/usr/bin/clang' 'LLVM_CONFIG=/usr/lib64/llvm5.0/bin/llvm-config' '--with-llvm' '--with-perl' '--with-python' '--with-tcl' '--with-tclconfig=/usr/lib64' '--with-openssl' '--with-pam' '--with-gssapi' '--with-includes=/usr/include' '--with-libraries=/usr/lib64' '--enable-nls' '--enable-dtrace' '--with-uuid=e2fs' '--with-libxml' '--with-libxslt' '--with-ldap' '--with-selinux' '--with-systemd' '--with-system-tzdata=/usr/share/zoneinfo' '--sysconfdir=/etc/sysconfig/pgsql' '--docdir=/usr/pgsql-11/doc' '--htmldir=/usr/pgsql-11/doc/html' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'LDFLAGS=-Wl,--as-needed' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'
CC = gcc
...
VERSION = PostgreSQL 11.1

Now we’re ready to get the source code and build it. You should be able to checkout the git repository for plprofiler.

$ git clone https://github.com/pgcentral/plprofiler.git
Cloning into 'plprofiler'...
...

Building against PostgreSQL 11 binaries from PGDG can be a bit complicated because of the JIT feature: the --with-llvm configuration flag will be enabled, so we may need to have LLVM present on the system, as detailed in my previous blog about JIT in PostgreSQL 11.

Once we’re ready, we can move to the plprofiler directory and build it:

$ cd plprofiler
$ make USE_PGXS=1
--- Output ----
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fPIC -I. -I./ -I/usr/pgsql-11/include/server -I/usr/pgsql-11/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include -c -o plprofiler.o plprofiler.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fPIC -shared -o plprofiler.so plprofiler.o -L/usr/pgsql-11/lib -Wl,--as-needed -L/usr/lib64/llvm5.0/lib -L/usr/lib64 -Wl,--as-needed -Wl,-rpath,'/usr/pgsql-11/lib',--enable-new-dtags
/opt/rh/llvm-toolset-7/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/usr/pgsql-11/include/server -I/usr/pgsql-11/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include -flto=thin -emit-llvm -c -o plprofiler.bc plprofiler.c

Now we should be able to install this extension:

$ sudo make USE_PGXS=1 install
--- Output ----
/usr/bin/mkdir -p '/usr/pgsql-11/lib'
/usr/bin/mkdir -p '/usr/pgsql-11/share/extension'
/usr/bin/mkdir -p '/usr/pgsql-11/share/extension'
/usr/bin/install -c -m 755 plprofiler.so '/usr/pgsql-11/lib/plprofiler.so'
/usr/bin/install -c -m 644 .//plprofiler.control '/usr/pgsql-11/share/extension/'
/usr/bin/install -c -m 644 .//plprofiler--1.0--2.0.sql .//plprofiler--2.0--3.0.sql .//plprofiler--3.0.sql '/usr/pgsql-11/share/extension/'
/usr/bin/mkdir -p '/usr/pgsql-11/lib/bitcode/plprofiler'
/usr/bin/mkdir -p '/usr/pgsql-11/lib/bitcode'/plprofiler/
/usr/bin/install -c -m 644 plprofiler.bc '/usr/pgsql-11/lib/bitcode'/plprofiler/./

The above command expects all build tools to be in the proper path even with sudo.

Profiling external sessions

To profile a function executed by another session, or by all other sessions, we should load the libraries at the global level. In production environments, that will be the case. This can be done by adding the extension library to the shared_preload_libraries specification. You won’t need this if you only want to profile functions executed within your own session; session-level profiling is generally possible only in Dev/Test environments.
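
For session-level profiling in such environments, the plprofiler command line tool also offers a run mode; a minimal sketch, assuming current plprofiler releases and using a placeholder function name:

$ plprofiler run --command "SELECT my_plpgsql_function()" --output my_function_profile.html -d pgbench_plprofiler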

To enable global profiling, verify the current value of shared_preload_libraries and add plprofiler to the list:
postgres=# show shared_preload_libraries ;
shared_preload_libraries
--------------------------
(1 row)
postgres=# alter system set shared_preload_libraries = 'plprofiler';
ALTER SYSTEM
postgres=#

This change requires us to restart the PostgreSQL server

$ sudo systemctl restart postgresql-11

After the restart, it’s a good idea to verify the parameter change

postgres=# show shared_preload_libraries ;
shared_preload_libraries
--------------------------
plprofiler
(1 row)

From this point onwards, the steps are the same as those for the binary package setup discussed above.

Summary

plprofiler is a wonderful tool for developers. I keep seeing many users who are in real need of it. Hopefully, this blog post will help those who have never tried it.

ClickHouse Performance Uint32 vs Uint64 vs Float32 vs Float64

Q1 least compression

While implementing ClickHouse for query executions statistics storage in Percona Monitoring and Management (PMM),  we were faced with a question of choosing the data type for metrics we store. It came down to this question: what is the difference in performance and space usage between Uint32, Uint64, Float32, and Float64 column types?

To test this, I created a test table with an abbreviated and simplified version of the main table in our ClickHouse Schema.

The “number of queries” is stored four times in four different columns to be able to benchmark queries referencing different columns.  We can do this with ClickHouse because it is a column store and it works only with columns referenced by the query. This method would not be appropriate for testing on MySQL, for example.

CREATE TABLE test
(
    digest String,
    db_server String,
    db_schema String,
    db_username String,
    client_host String,
    period_start DateTime,
    nq_UInt32 UInt32,
    nq_UInt64 UInt64,
    nq_Float32 Float32,
    nq_Float64 Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(period_start)
ORDER BY (digest, db_server, db_username, db_schema, client_host, period_start)
SETTINGS index_granularity = 8192

When testing ClickHouse performance you need to consider compression. Highly compressible data (for example just a bunch of zeroes) will compress very well and may be processed a lot faster than incompressible data. To take this into account we will do a test with three different data sets:

  • Very Compressible when “number of queries” is mostly 1
  • Somewhat Compressible when we use a range from 1 to 1000 and
  • Poorly Compressible when we use a range from 1 to 1000000.

Since it’s unlikely that an application will use the full 32 bit range, we haven’t used it for this test.
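
A data set like the “somewhat compressible” one can be generated directly in ClickHouse; the sketch below is purely illustrative, with made-up dimension values, and is not necessarily how the benchmark data was actually loaded:

INSERT INTO test
SELECT
    toString(number % 10000) AS digest,
    'db1' AS db_server,
    'sbtest' AS db_schema,
    'app' AS db_username,
    '10.0.0.1' AS client_host,
    toDateTime('2019-02-08 14:50:32') + (number % 4000) AS period_start,
    toUInt32(1 + (number % 1000)) AS nq_UInt32,
    toUInt64(1 + (number % 1000)) AS nq_UInt64,
    toFloat32(1 + (number % 1000)) AS nq_Float32,
    toFloat64(1 + (number % 1000)) AS nq_Float64
FROM numbers(100000000)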

Another factor which can impact ClickHouse performance is the number of “parts” the table has. After loading the data we ran OPTIMIZE TABLE FINAL to ensure only one part is there on the disk. Note: ClickHouse will gradually delete old files after the optimize command has completed. To avoid these operations interfering with benchmarks, I waited for about 15 minutes to ensure all unused data was removed from the disk.
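
That optimize step looks like this (assuming the table lives in the pmm database, as the system table queries further down indicate):

OPTIMIZE TABLE pmm.test FINAL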

The amount of memory on the system was enough to cache whole columns in all tests, so this is an in-memory test.

Here is how the table with only one part looks on disk:

root@d01e692c291f:/var/lib/clickhouse/data/pmm/test_lc# ls -la
total 28
drwxr-xr-x 4 clickhouse clickhouse 12288 Feb 10 20:39 .
drwxr-xr-x 8 clickhouse clickhouse 4096 Feb 10 22:38 ..
drwxr-xr-x 2 clickhouse clickhouse 4096 Feb 10 20:30 201902_1_372_4
drwxr-xr-x 2 clickhouse clickhouse 4096 Feb 10 19:38 detached
-rw-r--r-- 1 clickhouse clickhouse 1 Feb 10 19:38 format_version.txt

When you have only one part, it is very easy to see the space that different columns take:

root@d01e692c291f:/var/lib/clickhouse/data/pmm/test_lc/201902_1_372_4# ls -la
total 7950468
drwxr-xr-x 2 clickhouse clickhouse 4096 Feb 10 20:30 .
drwxr-xr-x 4 clickhouse clickhouse 12288 Feb 10 20:39 ..
-rw-r--r-- 1 clickhouse clickhouse 971 Feb 10 20:30 checksums.txt
-rw-r--r-- 1 clickhouse clickhouse 663703499 Feb 10 20:30 client_host.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 client_host.mrk
-rw-r--r-- 1 clickhouse clickhouse 238 Feb 10 20:30 columns.txt
-rw-r--r-- 1 clickhouse clickhouse 9 Feb 10 20:30 count.txt
-rw-r--r-- 1 clickhouse clickhouse 228415690 Feb 10 20:30 db_schema.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 db_schema.mrk
-rw-r--r-- 1 clickhouse clickhouse 6985801 Feb 10 20:30 db_server.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 db_server.mrk
-rw-r--r-- 1 clickhouse clickhouse 19020651 Feb 10 20:30 db_username.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 db_username.mrk
-rw-r--r-- 1 clickhouse clickhouse 28227119 Feb 10 20:30 digest.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 digest.mrk
-rw-r--r-- 1 clickhouse clickhouse 8 Feb 10 20:30 minmax_period_start.idx
-rw-r--r-- 1 clickhouse clickhouse 1552547644 Feb 10 20:30 nq_Float32.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 nq_Float32.mrk
-rw-r--r-- 1 clickhouse clickhouse 1893758221 Feb 10 20:30 nq_Float64.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 nq_Float64.mrk
-rw-r--r-- 1 clickhouse clickhouse 1552524811 Feb 10 20:30 nq_UInt32.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 nq_UInt32.mrk
-rw-r--r-- 1 clickhouse clickhouse 1784991726 Feb 10 20:30 nq_UInt64.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 nq_UInt64.mrk
-rw-r--r-- 1 clickhouse clickhouse 4 Feb 10 20:30 partition.dat
-rw-r--r-- 1 clickhouse clickhouse 400961033 Feb 10 20:30 period_start.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 period_start.mrk
-rw-r--r-- 1 clickhouse clickhouse 2486243 Feb 10 20:30 primary.idx

We can see there are two files for every column (plus some extras), and so, for example, the Float32 based “number of queries” metric store takes around 1.5GB.

You can also use the SQL queries to get this data from the ClickHouse system tables instead:

SELECT *
FROM system.columns
WHERE (database = 'pmm') AND (table = 'test') AND (name = 'nq_UInt32')
Row 1:
──────
database: pmm
table: test
name: nq_UInt32
type: UInt32
default_kind:
default_expression:
data_compressed_bytes: 7250570
data_uncompressed_bytes: 1545913232
marks_bytes: 754848
comment:
is_in_partition_key: 0
is_in_sorting_key: 0
is_in_primary_key: 0
is_in_sampling_key: 0
compression_codec:
1 rows in set. Elapsed: 0.002 sec.
SELECT *
FROM system.parts
WHERE (database = 'pmm') AND (table = 'test')
Row 1:
──────
partition: 201902
name: 201902_1_372_4
active: 1
marks: 47178
rows: 386478308
bytes_on_disk: 1401028031
data_compressed_bytes: 1390993287
data_uncompressed_bytes: 29642900064
marks_bytes: 7548480
modification_time: 2019-02-10 23:26:20
remove_time: 0000-00-00 00:00:00
refcount: 1
min_date: 0000-00-00
max_date: 0000-00-00
min_time: 2019-02-08 14:50:32
max_time: 2019-02-08 15:58:30
partition_id: 201902
min_block_number: 1
max_block_number: 372
level: 4
data_version: 1
primary_key_bytes_in_memory: 4373363
primary_key_bytes_in_memory_allocated: 6291456
database: pmm
table: test
engine: MergeTree
path: /var/lib/clickhouse/data/pmm/test/201902_1_372_4/
1 rows in set. Elapsed: 0.003 sec.

Now let’s look at the queries

We tested with two queries.  One of them – we’ll call it Q1 – is a very trivial query, simply taking the sum across all column values. This query needs only to access one column to return results so it is likely to be the most impacted by a change of data type:

SELECT sum(nq_UInt32)
FROM test

The second query – which we’ll call Q2 – is a typical ranking query which computes the number of queries per period and then shows periods with the highest amount of queries in them:

SELECT
    sum(nq_UInt32) AS cnt,
    period_start
FROM test
GROUP BY period_start
ORDER BY cnt DESC
LIMIT 10

This query needs to access two columns and do more complicated processing so we expect it to be less impacted by the change of data type.

Before we get to results I think it is worth drawing attention to the raw performance we’re getting.  I did these tests on a DigitalOcean Droplet with just six virtual CPU cores, yet still I see numbers like these:

SELECT sum(nq_UInt32)
FROM test
┌─sum(nq_UInt32) ──┐
│     386638984    │
└──────────────────┘
1 rows in set. Elapsed: 0.205 sec. Processed 386.48 million rows, 1.55 GB (1.88 billion rows/s., 7.52 GB/s.)

Processing more than 300M rows/sec per core and more than 1GB/sec per core is very cool!

Query Performance

Results between different compression levels show similar differences between column types, so let’s focus on those with the least compression:

Q1 least compression

Q2 least compression

As you can see, the width of the data type (32 bit vs 64 bit) matters a lot more than the type (float vs integer). In some cases float may even perform faster than integer. This was the most unexpected result for me.

Another metric ClickHouse reports is the processing speed in GB/sec. We see a different picture here:

Q1 GB per second

64 bit data types have a higher processing speed than their 32 bit counterparts, but queries run slower as there is more raw data to process.

Compression

Let’s now take a closer look at compression.  For this test we use default LZ4 compression. ClickHouse has powerful support for Per Column Compression Codecs but testing them is outside of scope for this post.

So let’s look at size on disk for UInt32 Column:

On disk data size for UINT32

What you can see from these results is that when data is very compressible ClickHouse can compress it to almost nothing.  The compression ratio for our very compressible data set is about 200x (or 99.5% size reduction if you prefer this metric).

The compression rate for somewhat compressible data is 1.4x.  That’s not bad, but considering we are only storing the 1-1000 range in this column – which requires 10 bits out of 32 – I would hope for better compression. I guess LZ4 does not compress such data very well.
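
If you want to verify these ratios yourself, the same system.columns table used earlier reports both the compressed and uncompressed sizes; for example:

SELECT
    name,
    data_compressed_bytes,
    data_uncompressed_bytes,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE (database = 'pmm') AND (table = 'test') AND (name LIKE 'nq_%')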

Now let’s look at compression for a 64 bit integer column:

On disk data size for UINT64

We can see that while the size almost doubled for very compressible data, increases for our somewhat compressible data and poorly compressible data are not that large.  Somewhat compressible data now compresses 2.5x.

Now let’s take a look at Performance depending on data compressibility:

Q1 time for UINT32

Poorly compressible data, which takes more space on disk, is processed faster than somewhat compressible data? This did not make sense. I repeated the run a few times to make sure that the results were correct. When I looked at the compression ratio, though, it suddenly made sense to me.

Poorly compressible data for the UInt32 data type was not compressible by LZ4, so it seems the original data was stored, significantly speeding up the “decompression” process.   With somewhat compressible data, compression worked, and so real decompression needed to take place too. This makes things slower.

This is why we can only observe these results with UInt32 and Float32 data types.  UInt64 and Float64 show the more expected results:

Q1 time for UINT64

Summary

Here are my conclusions:

  • Even with “slower” data types, ClickHouse is very fast
  • Data type choice matters – but less than I expected
  • Width (32bit vs 64bit) impacts performance more than integer vs float data types
  • Storing a small range of values in a wider column type is likely to yield better compression, though with default compression it is not as good as theoretically possible
  • Compression is interesting. We get the best performance when data can be well compressed. Second best is when we do not have to spend a lot of time decompressing it, as long as it fits in memory.

Parallel queries in PostgreSQL

parallel queries in postgresql

Modern CPU models have a huge number of cores. For many years, applications have been sending queries in parallel to databases. Where there are reporting queries that deal with many table rows, the ability for a query to use multiple CPUs helps deliver faster execution. Parallel queries in PostgreSQL allow us to utilize many CPUs to finish report queries faster. The parallel queries feature was implemented in 9.6: starting from PostgreSQL 9.6, a report query is able to use many CPUs and finish faster.

The initial implementation of the parallel queries execution took three years. Parallel support requires code changes in many query execution stages. PostgreSQL 9.6 created an infrastructure for further code improvements. Later versions extended parallel execution support for other query types.

Limitations

  • Do not enable parallel execution if all CPU cores are already saturated. Parallel execution steals CPU time from other queries, and increases response time.
  • Most importantly, parallel processing significantly increases memory usage with high WORK_MEM values, as each hash join or sort operation takes a work_mem amount of memory.
  • Next, low latency OLTP queries can’t be made any faster with parallel execution. In particular, queries that return a single row can perform badly when parallel execution is enabled.
  • The Pierian spring for developers is a TPC-H benchmark. Check if you have similar queries for the best parallel execution.
  • Parallel execution supports only SELECT queries without lock predicates.
  • Proper indexing might be a better alternative to a parallel sequential table scan.
  • There is no support for cursors or suspended queries.
  • Windowed functions and ordered-set aggregate functions are non-parallel.
  • There is no benefit for an IO-bound workload.
  • There are no parallel sort algorithms. However, queries with sorts still can be parallel in some aspects.
  • Replace CTE (WITH …) with a sub-select to support parallel execution.
  • Foreign data wrappers do not currently support parallel execution (but they could!)
  • There is no support for FULL OUTER JOIN.
  • Clients setting max_rows disable parallel execution.
  • If a query uses a function that is not marked as PARALLEL SAFE, it will be single-threaded (a quick catalog check for this is shown just after this list).
  • SERIALIZABLE transaction isolation level disables parallel execution.
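
A quick catalog check shows how an existing function is marked: pg_proc.proparallel is 's' (safe), 'r' (restricted), or 'u' (unsafe). The function name here is only a placeholder:

SELECT proname, proparallel
FROM pg_proc
WHERE proname = 'my_function';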

Test environment

The PostgreSQL development team has tried to improve TPC-H benchmark queries’ response time. You can download the benchmark and adapt it to PostgreSQL by using these instructions. It’s not an official way to use the TPC-H benchmark, so you shouldn’t use it to compare different databases or hardware.

  1. Download TPC-H_Tools_v2.17.3.zip (or newer version) from official TPC site.
  2. Rename makefile.suite to Makefile and modify it as requested at https://github.com/tvondra/pg_tpch . Compile the code with the make command
  3. Generate data: ./dbgen -s 10 generates a 23GB database, which is enough to see the difference in performance for parallel and non-parallel queries.
  4. Convert tbl files to csv with for + sed
  5. Clone pg_tpch repository and copy csv files to pg_tpch/dss/data
  6. Generate queries with qgen command
  7. Load data to the database with ./tpch.sh command.

Parallel sequential scan

This might be faster not because of parallel reads, but due to the scattering of data across many CPU cores. Modern operating systems provide good caching for PostgreSQL data files. Read-ahead allows getting more blocks from storage than just the block requested by the PG daemon. As a result, query performance is not limited by disk IO. It consumes CPU cycles for:

  • reading rows one by one from table data pages
  • comparing row values and WHERE conditions

Let’s try to execute a simple select query:

tpch=# explain analyze select l_quantity as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Seq Scan on lineitem (cost=0.00..1964772.00 rows=58856235 width=5) (actual time=0.014..16951.669 rows=58839715 loops=1)
Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 1146337
Planning Time: 0.203 ms
Execution Time: 19035.100 ms

A sequential scan without aggregation produces too many rows, so the query is executed by a single CPU core.

After adding SUM(), it’s clear to see that two workers will help us to make the query faster:

explain analyze select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=1589702.14..1589702.15 rows=1 width=32) (actual time=8553.365..8553.365 rows=1 loops=1)
-> Gather (cost=1589701.91..1589702.12 rows=2 width=32) (actual time=8553.241..8555.067 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=1588701.91..1588701.92 rows=1 width=32) (actual time=8547.546..8547.546 rows=1 loops=3)
-> Parallel Seq Scan on lineitem (cost=0.00..1527393.33 rows=24523431 width=5) (actual time=0.038..5998.417 rows=19613238 loops=3)
Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 382112
Planning Time: 0.241 ms
Execution Time: 8555.131 ms

The more complex query is 2.2X faster compared to the plain, single-threaded select.

Parallel Aggregation

A “Parallel Seq Scan” node produces rows for partial aggregation. A “Partial Aggregate” node reduces these rows with SUM(). At the end, the SUM counter from each worker is collected by the “Gather” node.

The final result is calculated by the “Finalize Aggregate” node. If you have your own aggregation functions, do not forget to mark them as “parallel safe”.
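
Marking a function is a simple DDL change; the names and signatures below are placeholders only:

ALTER FUNCTION my_plpgsql_func(integer) PARALLEL SAFE;
-- for aggregates, the marking is given at creation time, e.g. CREATE AGGREGATE ... ( ..., PARALLEL = SAFE );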

Number of workers

We can increase the number of workers without server restart:

alter system set max_parallel_workers_per_gather=4;
select * from pg_reload_conf();

Now, there are 4 workers in the explain output:

tpch=# explain analyze select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=1440213.58..1440213.59 rows=1 width=32) (actual time=5152.072..5152.072 rows=1 loops=1)
-> Gather (cost=1440213.15..1440213.56 rows=4 width=32) (actual time=5151.807..5153.900 rows=5 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Partial Aggregate (cost=1439213.15..1439213.16 rows=1 width=32) (actual time=5147.238..5147.239 rows=1 loops=5)
-> Parallel Seq Scan on lineitem (cost=0.00..1402428.00 rows=14714059 width=5) (actual time=0.037..3601.882 rows=11767943 loops=5)
Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 229267
Planning Time: 0.218 ms
Execution Time: 5153.967 ms

What’s happening here? We have changed the number of workers from 2 to 4, but the query became only 1.6599 times faster. Actually, the scaling is amazing: we had two workers plus one leader, and after the configuration change it becomes 4+1.

The biggest improvement from parallel execution that we can achieve is: 5/3 = 1.66(6)X faster.

How does it work?

Processes

Query execution always starts in the “leader” process. A leader executes all non-parallel activity and its own contribution to parallel processing. Other processes executing the same queries are called “worker” processes. Parallel execution utilizes the Dynamic Background Workers infrastructure (added in 9.4). Since other parts of PostgreSQL use processes, not threads, a query creating three worker processes could be 4X faster than traditional execution.

Communication

Workers communicate with the leader using a message queue (based on shared memory). Each process has two queues: one for errors and the second one for tuples.

How many workers to use?

Firstly, the max_parallel_workers_per_gather parameter is the smallest limit on the number of workers. Secondly, the query executor takes workers from the pool limited by max_parallel_workers size. Finally, the top-level limit is max_worker_processes: the total number of background processes.

Failed worker allocation leads to single-process execution.

The query planner could consider decreasing the number of workers based on a table or index size. min_parallel_table_scan_size and min_parallel_index_scan_size control this behavior.

set min_parallel_table_scan_size='8MB'
8MB table => 1 worker
24MB table => 2 workers
72MB table => 3 workers
x => log(x / min_parallel_table_scan_size) / log(3) + 1 worker

Each time the table is 3X bigger than min_parallel_(index|table)_scan_size, postgres adds a worker. The number of workers is not cost-based! A circular dependency makes a complex implementation hard. Instead, the planner uses simple rules.

In practice, these rules are not always acceptable in production and you can override the number of workers for the specific table with ALTER TABLE … SET (parallel_workers = N).
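
For example, to allow up to eight workers for scans of the lineitem table used in these tests (the value here is purely illustrative):

ALTER TABLE lineitem SET (parallel_workers = 8);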

Why is parallel execution not used?

In addition to the long list of parallel execution limitations, PostgreSQL checks costs:

parallel_setup_cost allows avoiding parallel execution for short queries. It models the time spent on memory setup, process start, and initial communication.

parallel_tuple_cost: the communication between the leader and workers can take a long time. The time is proportional to the number of tuples sent by workers. This parameter models the communication cost.

Nested loop joins

PostgreSQL 9.6+ can execute a “Nested loop” in parallel due to the simplicity of the operation.

explain (costs off) select c_custkey, count(o_orderkey)
                from    customer left outer join orders on
                                c_custkey = o_custkey and o_comment not like '%special%deposits%'
                group by c_custkey;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize GroupAggregate
   Group Key: customer.c_custkey
   ->  Gather Merge
         Workers Planned: 4
         ->  Partial GroupAggregate
               Group Key: customer.c_custkey
               ->  Nested Loop Left Join
                     ->  Parallel Index Only Scan using customer_pkey on customer
                     ->  Index Scan using idx_orders_custkey on orders
                           Index Cond: (customer.c_custkey = o_custkey)
                           Filter: ((o_comment)::text !~~ '%special%deposits%'::text)

Gather happens in the last stage, so “Nested Loop Left Join” is a parallel operation. “Parallel Index Only Scan” is available from version 10. It acts in a similar way to a parallel sequential scan. The c_custkey = o_custkey condition reads a single order for each customer row, thus it’s not parallel.

Hash Join

Until PostgreSQL 11, each worker builds its own hash table. As a result, 4+ workers weren’t able to improve performance. The new implementation uses a shared hash table. Each worker can utilize WORK_MEM to build the hash table.

select
        l_shipmode,
        sum(case
                when o_orderpriority = '1-URGENT'
                        or o_orderpriority = '2-HIGH'
                        then 1
                else 0
        end) as high_line_count,
        sum(case
                when o_orderpriority <> '1-URGENT'
                        and o_orderpriority <> '2-HIGH'
                        then 1
                else 0
        end) as low_line_count
from
        orders,
        lineitem
where
        o_orderkey = l_orderkey
        and l_shipmode in ('MAIL', 'AIR')
        and l_commitdate < l_receiptdate
        and l_shipdate < l_commitdate
        and l_receiptdate >= date '1996-01-01'
        and l_receiptdate < date '1996-01-01' + interval '1' year
group by
        l_shipmode
order by
        l_shipmode
LIMIT 1;
                                                                                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1964755.66..1964961.44 rows=1 width=27) (actual time=7579.592..7922.997 rows=1 loops=1)
   ->  Finalize GroupAggregate  (cost=1964755.66..1966196.11 rows=7 width=27) (actual time=7579.590..7579.591 rows=1 loops=1)
         Group Key: lineitem.l_shipmode
         ->  Gather Merge  (cost=1964755.66..1966195.83 rows=28 width=27) (actual time=7559.593..7922.319 rows=6 loops=1)
               Workers Planned: 4
               Workers Launched: 4
               ->  Partial GroupAggregate  (cost=1963755.61..1965192.44 rows=7 width=27) (actual time=7548.103..7564.592 rows=2 loops=5)
                     Group Key: lineitem.l_shipmode
                     ->  Sort  (cost=1963755.61..1963935.20 rows=71838 width=27) (actual time=7530.280..7539.688 rows=62519 loops=5)
                           Sort Key: lineitem.l_shipmode
                           Sort Method: external merge  Disk: 2304kB
                           Worker 0:  Sort Method: external merge  Disk: 2064kB
                           Worker 1:  Sort Method: external merge  Disk: 2384kB
                           Worker 2:  Sort Method: external merge  Disk: 2264kB
                           Worker 3:  Sort Method: external merge  Disk: 2336kB
                           ->  Parallel Hash Join  (cost=382571.01..1957960.99 rows=71838 width=27) (actual time=7036.917..7499.692 rows=62519 loops=5)
                                 Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
                                 ->  Parallel Seq Scan on lineitem  (cost=0.00..1552386.40 rows=71838 width=19) (actual time=0.583..4901.063 rows=62519 loops=5)
                                       Filter: ((l_shipmode = ANY ('{MAIL,AIR}'::bpchar[])) AND (l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate) AND (l_receiptdate >= '1996-01-01'::date) AND (l_receiptdate < '1997-01-01 00:00:00'::timestamp without time zone))
                                       Rows Removed by Filter: 11934691
                                 ->  Parallel Hash  (cost=313722.45..313722.45 rows=3750045 width=20) (actual time=2011.518..2011.518 rows=3000000 loops=5)
                                       Buckets: 65536  Batches: 256  Memory Usage: 3840kB
                                       ->  Parallel Seq Scan on orders  (cost=0.00..313722.45 rows=3750045 width=20) (actual time=0.029..995.948 rows=3000000 loops=5)
 Planning Time: 0.977 ms
 Execution Time: 7923.770 ms

Query 12 from TPC-H is a good illustration for a parallel hash join. Each worker helps to build a shared hash table.

Merge Join

Due to the nature of a merge join, it’s not possible to make it parallel. Don’t worry if it’s the last stage of the query execution: you can still see parallel execution for queries with a merge join.

-- Query 2 from TPC-H
explain (costs off) select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from    part, supplier, partsupp, nation, region
where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 36
        and p_type like '%BRASS'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'AMERICA'
        and ps_supplycost = (
                select
                        min(ps_supplycost)
                from    partsupp, supplier, nation, region
                where
                        p_partkey = ps_partkey
                        and s_suppkey = ps_suppkey
                        and s_nationkey = n_nationkey
                        and n_regionkey = r_regionkey
                        and r_name = 'AMERICA'
        )
order by s_acctbal desc, n_name, s_name, p_partkey
LIMIT 100;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Limit
   ->  Sort
         Sort Key: supplier.s_acctbal DESC, nation.n_name, supplier.s_name, part.p_partkey
         ->  Merge Join
               Merge Cond: (part.p_partkey = partsupp.ps_partkey)
               Join Filter: (partsupp.ps_supplycost = (SubPlan 1))
               ->  Gather Merge
                     Workers Planned: 4
                     ->  Parallel Index Scan using part_pkey on part
                           Filter: (((p_type)::text ~~ '%BRASS'::text) AND (p_size = 36))
               ->  Materialize
                     ->  Sort
                           Sort Key: partsupp.ps_partkey
                           ->  Nested Loop
                                 ->  Nested Loop
                                       Join Filter: (nation.n_regionkey = region.r_regionkey)
                                       ->  Seq Scan on region
                                             Filter: (r_name = 'AMERICA'::bpchar)
                                       ->  Hash Join
                                             Hash Cond: (supplier.s_nationkey = nation.n_nationkey)
                                             ->  Seq Scan on supplier
                                             ->  Hash
                                                   ->  Seq Scan on nation
                                 ->  Index Scan using idx_partsupp_suppkey on partsupp
                                       Index Cond: (ps_suppkey = supplier.s_suppkey)
               SubPlan 1
                 ->  Aggregate
                       ->  Nested Loop
                             Join Filter: (nation_1.n_regionkey = region_1.r_regionkey)
                             ->  Seq Scan on region region_1
                                   Filter: (r_name = 'AMERICA'::bpchar)
                             ->  Nested Loop
                                   ->  Nested Loop
                                         ->  Index Scan using idx_partsupp_partkey on partsupp partsupp_1
                                               Index Cond: (part.p_partkey = ps_partkey)
                                         ->  Index Scan using supplier_pkey on supplier supplier_1
                                               Index Cond: (s_suppkey = partsupp_1.ps_suppkey)
                                   ->  Index Scan using nation_pkey on nation nation_1
                                         Index Cond: (n_nationkey = supplier_1.s_nationkey)

The “Merge Join” node is above “Gather Merge”. Thus merge is not using parallel execution. But the “Parallel Index Scan” node still helps with the part_pkey segment.

Partition-wise join

PostgreSQL 11 disables the partition-wise join feature by default. Partition-wise join has a high planning cost. Joins for similarly partitioned tables could be done partition-by-partition. This allows postgres to use smaller hash tables. Each per-partition join operation could be executed in parallel.

tpch=# set enable_partitionwise_join=t;
tpch=# explain (costs off) select * from prt1 t1, prt2 t2
where t1.a = t2.b and t1.b = 0 and t2.b between 0 and 10000;
                    QUERY PLAN
---------------------------------------------------
 Append
   ->  Hash Join
         Hash Cond: (t2.b = t1.a)
         ->  Seq Scan on prt2_p1 t2
               Filter: ((b >= 0) AND (b <= 10000))
         ->  Hash
               ->  Seq Scan on prt1_p1 t1
                     Filter: (b = 0)
   ->  Hash Join
         Hash Cond: (t2_1.b = t1_1.a)
         ->  Seq Scan on prt2_p2 t2_1
               Filter: ((b >= 0) AND (b <= 10000))
         ->  Hash
               ->  Seq Scan on prt1_p2 t1_1
                     Filter: (b = 0)
tpch=# set parallel_setup_cost = 1;
tpch=# set parallel_tuple_cost = 0.01;
tpch=# explain (costs off) select * from prt1 t1, prt2 t2
where t1.a = t2.b and t1.b = 0 and t2.b between 0 and 10000;
                        QUERY PLAN
-----------------------------------------------------------
 Gather
   Workers Planned: 4
   ->  Parallel Append
         ->  Parallel Hash Join
               Hash Cond: (t2_1.b = t1_1.a)
               ->  Parallel Seq Scan on prt2_p2 t2_1
                     Filter: ((b >= 0) AND (b <= 10000))
               ->  Parallel Hash
                     ->  Parallel Seq Scan on prt1_p2 t1_1
                           Filter: (b = 0)
         ->  Parallel Hash Join
               Hash Cond: (t2.b = t1.a)
               ->  Parallel Seq Scan on prt2_p1 t2
                     Filter: ((b >= 0) AND (b <= 10000))
               ->  Parallel Hash
                     ->  Parallel Seq Scan on prt1_p1 t1
                           Filter: (b = 0)

Above all, a partition-wise join can use parallel execution only if partitions are big enough.

Parallel Append

A Parallel Append node distributes whole subplans among the workers, instead of distributing different blocks of the same table across workers. Usually, you can see this with UNION ALL queries. The drawback is less parallelism, because every worker could ultimately end up working on a single subplan alone.

There are just two workers launched even with four workers enabled.

tpch=# explain (costs off) select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day union all select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '2000-12-01' - interval '105' day;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Gather
   Workers Planned: 2
   ->  Parallel Append
         ->  Aggregate
               ->  Seq Scan on lineitem
                     Filter: (l_shipdate <= '2000-08-18 00:00:00'::timestamp without time zone)
         ->  Aggregate
               ->  Seq Scan on lineitem lineitem_1
                     Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)

Most important variables

  • WORK_MEM limits the memory usage of each process! Not just for queries: work_mem * processes * joins => could lead to significant memory usage.
  • max_parallel_workers_per_gather  – how many workers an executor will use for the parallel execution of a planner node
  • max_worker_processes – adapt the total number of workers to the number of CPU cores installed on a server
  • max_parallel_workers – caps the number of worker processes that can be used for parallel queries across the whole server

Summary

Starting from 9.6, parallel query execution can significantly improve the performance of complex queries that scan many rows or index records. In PostgreSQL 10, parallel execution was enabled by default. Do not forget to disable parallel execution on servers with a heavy OLTP workload. Sequential scans or index scans still consume a significant amount of resources. If you are not running a report against the whole dataset, you may improve query performance just by adding missing indexes or by using proper partitioning.
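
As a practical note on disabling parallel execution for OLTP servers, it can be switched off per session or cluster-wide by zeroing the per-gather worker limit; a minimal sketch:

-- session level
SET max_parallel_workers_per_gather = 0;
-- or cluster-wide
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
SELECT pg_reload_conf();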

References


Image compiled from photos by Nathan Gonthier and Pavel Nekoranec on Unsplash

Upcoming Webinar Wed 3/6: High Availability and Disaster Recovery in Amazon RDS

High Availability and Disaster Recovery in Amazon RDS Webinar

Join Percona CEO Peter Zaitsev as he presents High Availability and Disaster Recovery in Amazon RDS on Wednesday, March 6th, 2019, at 11:00 AM PST (UTC-8) / 2:00 PM EST (UTC-5).

Register Now

In this hour-long webinar, Peter describes the differences between high availability (HA) and disaster recovery (DR). Afterward, Peter will go through scenarios detailing how each is handled manually and in Amazon RDS.

He will review the pros and cons of managing HA and DR in the traditional database environment as well in the cloud. Having full control of these areas is daunting. However, Amazon RDS makes meeting these needs easier and more efficient.

Regardless of which path you choose, monitoring your environment is vital. Peter’s talk will make that message clear. A discussion of metrics you should regularly review to keep your environment working correctly and performing optimally concludes the webinar.

In order to learn more register for Peter’s webinar on High Availability and Disaster Recovery in Amazon RDS.

Upcoming Webinar Thurs 3/7: Enhancing MySQL Security

Enhancing MySQL Security Webinar

Join Percona Support Engineer, Vinicius Grippa, as he presents his talk Enhancing MySQL Security on Thursday, March 7th, 2019 at 7:00 AM PST (UTC-8) / 10:00 AM EST (UTC-5).

Register Now

Security is always a challenge when it comes to data. What’s more, regulations like GDPR add a whole new layer on top of it, with rules that are more and more restrictive about how data can be accessed and manipulated. Join us in this presentation to check security best practices, as well as traditional and new features available for MySQL, including features coming with the new MySQL 8.

In this talk, DBAs and sysadmins will walk through the security features available in the OS and MySQL. For instance, these features include:

– OS security
– SSL
– ACL
– TDE
– Audit Plugin
– MySQL 8 features (undo, redo and binlog encryption)
– New caching_sha2_password
– Roles
– Password Management
– FIPS mode

In order to learn more register for this webinar on Enhancing MySQL Security.

Upcoming Webinar Thurs 3/14: Web Application Security – Why You Should Review Yours


Please join Percona’s Information Security Architect, David Busby, as he presents his talk Web Application Security – Why You Should Review Yours on March 14th, 2019 at 6:00 AM PDT (UTC-7) / 9:00 AM EDT (UTC-4).

Register Now

In this talk, we take a look at the whole stack and I don’t just mean LAMP.

We’ll cover what an attack surface is and some areas you may look to in order to ensure that you can reduce it.

For instance, what’s an attack surface?

Acronym Hell, what do they mean?

Vulnerability Naming, is this media naming stupidity or driving the message home?

Detection, Prevention and avoiding the boy who cried wolf are some further examples.

Additionally, we’ll cover emerging technologies to keep an eye on or even implement yourself to help improve your security posture.

There will also be a live compromise demo (or backup video if something fails) that covers compromising a PCI compliant network structure to reach the database system. Through this compromise you can ultimately exploit multiple failures to gain bash shell access over the MySQL protocol.

Upcoming Webinar Thurs 3/21: MySQL Performance Schema in 1 hour

MySQL Performance Schema in 1 hour

Please join Percona’s Principal Support Engineer, Sveta Smirnova, as she presents MySQL Performance Schema in 1 hour on Thursday, March 21st, 2019, at 10:00 am PDT (UTC-7) / 1:00 pm EDT (UTC-4).

Register Now

MySQL 8.0 Performance Schema is a mature tool, used by humans and monitoring products. It was born in 2010 as “a feature for monitoring server execution at a low level.” The tool has grown over the years with performance fixes and DBA-facing features. In this webinar, I will give an overview of Performance Schema, focusing on its tuning, performance, and usability.

Performance Schema helps to troubleshoot query performance, complicated locking issues and memory leaks. It can also troubleshoot resource usage, problematic behavior caused by inappropriate settings and much more. Additionally, it comes with hundreds of options which allow for greater precision tuning.

Performance Schema is a potent and very complicated tool. What’s more, it does not affect performance in most cases. However, it collects a lot of data and sometimes this data is hard to read.

In this webinar, I will guide you through the main Performance Schema features, design, and configuration. You will learn how to get the best of it. I will cover its companion sys schema and graphical monitoring tools.

In order to learn more, register for MySQL Performance Schema in 1 hour today.
