
PostgreSQL locking, Part 1: Row Locks


An understanding of PostgreSQL locking is important for building scalable applications and avoiding downtime. Modern computers and servers have many CPU cores, and it’s possible to execute multiple queries in parallel. Databases contain many structures that must stay consistent, and uncoordinated changes made by queries or background processes running in parallel could crash the database or even corrupt data. Thus we need the ability to prevent access from concurrent processes while changing shared memory structures or rows: one thread updates the structure while all others wait (exclusive lock), or multiple threads read the structure while all writes wait. The side effect of these waits is locking contention and wasted server resources. Thus it’s important to understand why waits happen and which locks are involved. In this article, I review PostgreSQL row-level locking.

In follow-up posts, I will investigate table-level locks and the latches protecting internal database structures.

Row locks – an overview

PostgreSQL has many locks at different abstraction levels. The most important locks for applications are related to the MVCC implementation: row-level locking. In second place are locks that appear during maintenance tasks (backups, database migrations, schema changes): table-level locking. It’s also possible, but rare, to see waits on low-level PostgreSQL locks. More often you see high CPU usage, with many concurrent queries running, but overall server performance reduced in comparison with a normal number of queries running in parallel.

Example environment

To follow along, you need a PostgreSQL server with a single-column table containing several rows:

postgres=# CREATE TABLE locktest (c INT);
CREATE TABLE
postgres=# INSERT INTO locktest VALUES (1), (2);
INSERT 0 2

Row locks

Scenario: two concurrent transactions are trying to select a row for update.

PostgreSQL uses row-level locking in this case. Row-level locking is tightly integrated with the MVCC implementation, and uses the hidden xmin and xmax fields.

xmin and xmax store the transaction id. All statements requiring row-level locks modify the xmax field (even SELECT FOR UPDATE). The modification happens after the query returns its results, so in order to see xmax change we need to run SELECT FOR UPDATE twice. Usually, the xmax field is used to mark a row as expired (either removed completely by some transaction or superseded by an updated row version), but it is also used for the row-level locking infrastructure.

If you need more details about the xmin and xmax hidden fields and MVCC implementation, please check our “Basic Understanding of Bloat and VACUUM in PostgreSQL” blog post.

postgres=# BEGIN;
postgres=# SELECT xmin,xmax, txid_current(), c FROM locktest WHERE c=1 FOR UPDATE;
BEGIN
 xmin | xmax | txid_current | c
------+------+--------------+---
  579 |  581 |          583 | 1
(1 row)
postgres=# SELECT xmin,xmax, txid_current(), c FROM locktest WHERE c=1 FOR UPDATE;
 xmin | xmax | txid_current | c
------+------+--------------+---
  579 |  583 |          583 | 1
(1 row)

If another statement tries to modify the same row, it checks the list of unfinished transactions. The statement has to wait until the transaction with id=xmax finishes before it can apply its modification.

There is no infrastructure for waiting on a specific row, but a transaction can wait on a transaction id.

-- second connection
SELECT xmin,xmax,txid_current() FROM locktest WHERE c=1 FOR UPDATE;

The SELECT FOR UPDATE query running in the second connection is unfinished, and waiting for the first transaction to complete.

pg_locks

Such waits and locks can be seen by querying pg_locks:

postgres=# SELECT locktype,transactionid,virtualtransaction,pid,mode,granted,fastpath
postgres-#  FROM pg_locks WHERE transactionid=583;
   locktype    | transactionid | virtualtransaction |  pid  |     mode      | granted | fastpath
---------------+---------------+--------------------+-------+---------------+---------+----------
 transactionid |           583 | 4/107              | 31369 | ShareLock     | f       | f
 transactionid |           583 | 3/11               | 21144 | ExclusiveLock | t       | f

You can see the writer’s transaction id in the locktype=transactionid rows (583). Let’s get the pid and backend id of the lock holder:

postgres=# SELECT id,pg_backend_pid() FROM pg_stat_get_backend_idset() AS t(id)
postgres-#  WHERE pg_stat_get_backend_pid(id) = pg_backend_pid();
 id | pg_backend_pid
----+----------------
  3 |          21144

This backend has its lock granted (t). Each backend has an OS process identifier (PID) and an internal PostgreSQL identifier (backend id). PostgreSQL can process many transactions, but locking can happen only between backends, and each backend executes a single transaction. Internal bookkeeping requires just a virtual transaction identifier: a pair consisting of the backend id and a sequence number inside the backend.

Regardless of the number of rows locked, PostgreSQL will have only a single related lock in the pg_locks table. Queries might modify billions of rows but PostgreSQL does not waste memory for redundant locking structures.

A writer sets an ExclusiveLock on its transactionid. All row-level lock waiters take a ShareLock on that transactionid. The lock manager wakes all waiting backends as soon as the writer releases the lock.

Lock release for transactionid occurs on commit or rollback.
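
For example (a sketch continuing the session above; the transaction ids are the ones shown earlier), committing the first transaction releases its transactionid lock and the blocked SELECT … FOR UPDATE in the second connection returns immediately:

-- first connection (holds the ExclusiveLock on transactionid 583)
COMMIT;
-- the second connection's SELECT ... FOR UPDATE now completes, and its own
-- transaction id is written into the row's xmax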

pg_stat_activity

Another great way to get locking-related details is to query the pg_stat_activity view:

postgres=# SELECT pid,backend_xid,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE pid IN (31369,21144);
-[ RECORD 1 ]---+---------------------------------------------------------------------------------------------------------------------------
pid             | 21144
backend_xid     | 583
wait_event_type | Client
wait_event      | ClientRead
state           | idle in transaction
query           | SELECT id,pg_backend_pid() FROM pg_stat_get_backend_idset() AS t(id) WHERE pg_stat_get_backend_pid(id) = pg_backend_pid();
-[ RECORD 2 ]---+---------------------------------------------------------------------------------------------------------------------------
pid             | 31369
backend_xid     | 585
wait_event_type | Lock
wait_event      | transactionid
state           | active
query           | SELECT xmin,xmax,txid_current() FROM locktest WHERE c=1 FOR UPDATE;

Source code-level investigation

Let’s check the stack trace for the waiter with gdb and the pt-pmp tool:

# pt-pmp -p 31369
Sat Jul 28 10:10:25 UTC 2018
30	../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
      1 epoll_wait,WaitEventSetWaitBlock,WaitEventSetWait,WaitLatchOrSocket,WaitLatch,ProcSleep,WaitOnLock,LockAcquireExtended,LockAcquire,XactLockTableWait,heap_lock_tuple,ExecLockRows,ExecProcNode,ExecutePlan,standard_ExecutorRun,PortalRunSelect,PortalRun,exec_simple_query,PostgresMain,BackendRun,BackendStartup,ServerLoop,PostmasterMain,main

The WaitOnLock function is causing the wait. The function is located in the lock.c file (POSTGRES primary lock mechanism).

A lock table is a shared memory hash table. The conflicting process sleeps for the lock in storage/lmgr/proc.c. For the most part, this code should be invoked via lmgr.c or another lock-management module, not directly.

Locks listed in pg_stat_activity with wait_event_type “Lock” are also called heavyweight locks, and are controlled by the lock manager. HWLocks are also used for many high-level actions.

By the way, a full description can be found here: https://www.postgresql.org/docs/current/static/explicit-locking.html

Summary

  • Avoid long-running transactions that modify frequently updated rows or too many rows.
  • Do not use hotspots (single rows, or a small set of rows, updated in parallel by many application client connections) with MVCC databases. This kind of workload is more suitable for in-memory databases and can usually be separated from the main business logic.

Upcoming Webinar Thurs 10/18: MongoDB 4.0 Features – Transactions & More


Please join Percona’s Principal Consultant, Alex Rubin, as he presents MongoDB 4.0 Features – Transactions & More on Thursday, October 18th at 11:00 AM PDT (UTC-7) / 2:00 PM EDT (UTC-4).

 

MongoDB 4.0 adds support for multi-document ACID transactions, combining the document model with ACID guarantees. Through snapshot isolation, transactions provide a consistent view of data and enforce all-or-nothing execution to maintain data integrity.

This webinar mainly focuses on MongoDB transactions (the major feature of the latest update) and any future transaction improvements. We will also cover other new MongoDB features, such as Non-Blocking Secondary Reads, Security improvements and more.

After attending the webinar you will learn more about the latest MongoDB features.

Register for this webinar to learn about MongoDB transactions and other features.

PostgreSQL 11! Our First Take On The New Release


You may be aware that the new major version of PostgreSQL has been released today. PostgreSQL 11 is going to be one of the most vibrant releases in recent times. It incorporates many features found in proprietary, industry-leading database systems, further qualifying PostgreSQL as a strong open source alternative.

Without further ado, let’s have a look at some killer features in this new release.

Just In Time (JIT) Compilation of SQL Statements

This is a cutting-edge feature in PostgreSQL: SQL statements can get compiled into native code for execution. It’s well known how much Google’s V8 JIT revolutionized the JavaScript language. JIT in PostgreSQL 11 accelerates two important parts of query execution (expression evaluation and tuple deforming) and helps CPU-bound queries perform faster. Hopefully this is a new era in the SQL world.
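
As a quick, hedged illustration (not from the original post; it requires a PostgreSQL 11 build with LLVM support), JIT is controlled by the jit parameter and its use is reported in EXPLAIN ANALYZE output:

-- PostgreSQL 11 ships with jit = off by default; enable it for this session
SET jit = on;
-- force JIT even for cheap queries, purely for demonstration
SET jit_above_cost = 0;
-- when JIT kicks in, the plan output ends with a "JIT:" section showing
-- the number of compiled functions and the compilation/optimization timings
EXPLAIN (ANALYZE) SELECT sum(relpages) FROM pg_class;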

Parallel B-tree Index build

This could be the most sought-after feature for DBAs, especially those migrating large databases from other database systems to PostgreSQL. Gone are the days when a lot of time was spent building indexes during data migration. Index builds and maintenance (rebuilds) for very large tables can now make effective use of multiple cores in the server by parallelizing the operation, taking considerably less time to complete.
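
A hedged sketch (big_table is a hypothetical large table; max_parallel_maintenance_workers is the new PostgreSQL 11 setting that caps the workers used by CREATE INDEX):

-- allow up to 4 parallel workers for maintenance commands in this session
SET max_parallel_maintenance_workers = 4;
-- the parallel degree actually used also depends on the table size and on
-- max_parallel_workers / max_worker_processes
CREATE INDEX big_table_c_idx ON big_table (c);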

Lightweight and super fast ALTER TABLE for NOT NULL column with DEFAULT values

In the process of continuous enhancement and adding new features, application development often involves schema changes to the database. Most such changes include adding new columns to a table. This can be a nightmare if a new column needs to be added to a large table with a default value and a NOT NULL constraint, because the ALTER statement can hold a write lock on the table for a long period. It can also involve excessive IO due to a table rewrite. PostgreSQL 11 addresses this issue by ensuring that adding a column with a default value and a NOT NULL constraint avoids a table rewrite.
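
For example (a sketch; big_table and the new column are hypothetical), in PostgreSQL 11 the following is a quick catalog-only change rather than a full table rewrite:

-- PostgreSQL 11: no table rewrite; the default is stored in the catalog and
-- returned on the fly for pre-existing rows (applies to non-volatile defaults)
ALTER TABLE big_table
    ADD COLUMN status TEXT NOT NULL DEFAULT 'active';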

Stored procedures with transaction control

PostgreSQL 11 includes stored procedures. What existed in PostgreSQL so far was functions. The lack of native stored procedures made migrating database code from other databases complex, often requiring extensive manual work from experts. Since stored procedures may include transaction blocks with BEGIN, COMMIT, and ROLLBACK, it was necessary to apply workarounds to meet this requirement in past PostgreSQL versions, but not anymore.
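
A minimal sketch of the new syntax (the procedure and table are made up for illustration):

CREATE TABLE batch_target (n INT);

CREATE PROCEDURE load_batches()
LANGUAGE plpgsql
AS $$
BEGIN
    FOR i IN 1..10 LOOP
        INSERT INTO batch_target VALUES (i);
        COMMIT;   -- transaction control inside a procedure, new in PostgreSQL 11
    END LOOP;
END;
$$;

CALL load_batches();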

Load cache upon crash or restart – pg_prewarm

Memory is becoming cheaper and faster, year over year. The latest generation of servers commonly ships with several hundred GBs of RAM, making it easy to employ large caches (shared_buffers) in PostgreSQL. Until now, you might have used pg_prewarm to warm up the cache manually (or automatically at server start). PostgreSQL 11 includes a background worker that takes care of that for you, recording the contents of shared_buffers (in fact, the “addresses” of those blocks) to the file autoprewarm.blocks. Upon crash recovery or a normal server restart, two such workers run in the background, reloading those blocks into the cache.
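
A hedged configuration sketch: the automatic variant is part of the pg_prewarm extension and its worker must be preloaded (a restart is required after changing shared_preload_libraries; my_table is a placeholder):

-- postgresql.conf (restart required):
--   shared_preload_libraries = 'pg_prewarm'
--   pg_prewarm.autoprewarm = true            # on by default once preloaded
--   pg_prewarm.autoprewarm_interval = 300s   # how often autoprewarm.blocks is dumped
-- the manual, on-demand variant is still available:
CREATE EXTENSION pg_prewarm;
SELECT pg_prewarm('my_table');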

Hash Partition

Until PostgreSQL 9.6 we used table inheritance for partitioning a table. PostgreSQL 10 came up with declarative partitioning, using two of the three most common partitioning methods: list and range. And now, PostgreSQL 11 has introduced the missing piece: hash partitioning.
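
A minimal sketch of the new syntax (table and column names are illustrative):

CREATE TABLE clients (
    client_id BIGINT NOT NULL,
    name      TEXT
) PARTITION BY HASH (client_id);

-- four hash partitions, each receiving roughly a quarter of the rows
CREATE TABLE clients_p0 PARTITION OF clients FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE clients_p1 PARTITION OF clients FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE clients_p2 PARTITION OF clients FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE clients_p3 PARTITION OF clients FOR VALUES WITH (MODULUS 4, REMAINDER 3);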

Advanced partitioning features that were always on demand

There were a lot of new features committed to the partitioning space in PostgreSQL 11. It now allows us to attach an index to a given partition even though it won’t behave as a global index.

Also, row updates now automatically move rows to new partitions (if necessary) based on the updated fields. During query processing, the optimizer may now simply skip “unwanted” partitions from the execution plan, which greatly simplifies the work to be done. Previously, it had to consider all the partitions, even if the target data was to be found in just a subset of them.

We will discuss these new features in detail in a future blog post.

Tables can have default partitions

Until PostgreSQL 10, a row being inserted was rejected if it did not satisfy any of the existing partition definitions. That changes with the introduction of default partitions in PostgreSQL 11.
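
A short sketch with a hypothetical list-partitioned table (default partitions are supported for list and range partitioning):

CREATE TABLE events (
    id     BIGINT NOT NULL,
    region TEXT NOT NULL
) PARTITION BY LIST (region);

CREATE TABLE events_emea PARTITION OF events FOR VALUES IN ('emea');
CREATE TABLE events_apac PARTITION OF events FOR VALUES IN ('apac');
-- new in PostgreSQL 11: rows that match no other partition land here
-- instead of being rejected with an error
CREATE TABLE events_other PARTITION OF events DEFAULT;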

Parallel hash join

Most SQL queries with equi-joins use hash joins under the hood. There is a great opportunity to speed up performance if we can leverage the power of the hardware by spinning up multiple parallel workers. PostgreSQL 11 now allows hash joins to be performed in parallel.
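
A hedged way to observe this (table names are illustrative and the exact plan depends on costs and data volume): with the new enable_parallel_hash planner switch on (the default) and enough workers available, EXPLAIN should show Parallel Hash Join and Parallel Hash nodes:

SET max_parallel_workers_per_gather = 4;
SHOW enable_parallel_hash;   -- new in PostgreSQL 11, on by default
-- for two sufficiently large tables joined on an equality condition,
-- look for "Parallel Hash Join" and "Parallel Hash" nodes in the plan
EXPLAIN SELECT count(*) FROM big_a a JOIN big_b b ON a.id = b.id;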

Write-Ahead Log (WAL) improvements

Historically, PostgreSQL had a default WAL segment size of 16MB and we had to recompile PostgreSQL in order to operate with WAL segments of a different size. Now it is possible to change the WAL segment size during the initialization of the data directory (initdb) or while resetting WALs using pg_resetwal, by means of the --wal-segsize=<wal_segment_size> parameter.

Add extensions to convert JSONB data to/from PL/Perl and PL/Python

Python as a programming language continues to gain popularity. It is always among the top 5 in the TIOBE Index. One of the greatest features of PostgreSQL is that you can write stored procedures and functions in most of the popular programming languages, including Python (with PL/Python). Now it is also possible to transform JSONB (Binary JSON) data type to/from PL/Python. This feature was later made available for PL/Perl too. It can be a great add-on for organizations using PostgreSQL as a document store.
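
A hedged sketch of the PL/Python side (assuming Python 3 and the jsonb_plpython3u transform extension; the function itself is illustrative):

CREATE EXTENSION plpython3u;
CREATE EXTENSION jsonb_plpython3u;

-- with the TRANSFORM clause, the jsonb argument arrives in the function
-- body as a native Python dict instead of a string
CREATE FUNCTION count_keys(doc jsonb) RETURNS int
TRANSFORM FOR TYPE jsonb
LANGUAGE plpython3u
AS $$
return len(doc)
$$;

SELECT count_keys('{"a": 1, "b": 2}'::jsonb);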

Command line improvements in psql: autocomplete and quit/exit

psql has always been friendly to first time PostgreSQL users through the various options like autocomplete and shortcuts. There’s an exception though: users may find it difficult to understand how to effectively quit from psql, and often attempt to use non-existing quit and exit commands. Eventually, they find \q or ctrl + D, but not without frustrating themselves first. Fortunately, that shouldn’t happen anymore: among many recent fixes and improvements to psql is the addition of the intuitive quit and exit commands to safely leave this popular command line client.

Improved statistics

PostgreSQL 10 introduced the new statement CREATE STATISTICS to collect more statistics information about columns from tables. This has been further improved in PostgreSQL 11. Previously, while collecting optimizer statistics, most-common values (MCVs) were chosen based on their significance compared to all column values. Now, MCVs are chosen based on their significance compared to the non-MCV values.
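
As a reminder of the related syntax (a sketch with a hypothetical table; the improved MCV selection in PostgreSQL 11 requires no syntax change, it simply yields better estimates after ANALYZE):

-- extended statistics on a pair of correlated columns (syntax introduced in PostgreSQL 10)
CREATE STATISTICS clients_region_city_stats (dependencies, ndistinct)
    ON region, city FROM clients_info;
ANALYZE clients_info;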

The new features of PostgreSQL 11 are not limited to the ones mentioned above. We will be looking further into most of them in future blog posts. We invite you to leave a comment below and let us know if there is any particular feature you would be interested in knowing more about.

Reclaiming space on your Docker PMM server deployment


Recently we had a customer who had issues with a filled disk on the server hosting their Docker pmm-server environment. They were not able to access the web UI, or even stop the pmm-server container, because they had filled the /var/ mount point.

Setting correct expectations

The best way to avoid these kinds of issues in the first place is to plan ahead, and to know exactly what you are dealing with in terms of disk space requirements. Michael Coburn has written a great blog post on this matter:

https://www.percona.com/blog/2017/05/04/how-much-disk-space-should-i-allocate-for-percona-monitoring-and-management/

We are now using Prometheus version 2 inside PMM server, so you should take the numbers in that post with a pinch of salt. On the other hand, it shows how you should plan ahead and think about the “steady state” disk usage, so it’s a good read.

That’s the first step to make sure you won’t get into trouble down the line. But what happens if you are already in trouble? We’ll look at two quick ways that may help you reclaim space.

Before anything else, you should stop any and all PMM clients running, so that you don’t have a race condition after recovering some space, in which metrics coming from the running clients will fill up whatever disk you had freed.

If pmm-admin stop --all won’t work, you can stop the services manually, or even kill the running processes as a last resort:
shell> systemctl list-unit-files | grep enabled | grep pmm | awk '{print $1}' | xargs -n 1 systemctl stop
shell> ps ax | egrep "exporter|qan-agent|pmm" | grep -v "ssh" | awk '{print $1}' | xargs kill

Removing unused containers

In order for the next steps to be as effective as possible, make sure there are no unused containers running, or stopped:

shell> docker ps -a

If you see any container that you know you don’t need anymore:

shell> docker stop <container_name>
shell> docker rm -v <container_name>

WARNING! Do not remove the pmm-data container!

Reclaiming space from unused Docker images

After you are done cleaning unused containers, we can move forward with removing unused images. Unless you are manually building your own Docker images, it’s really easy to get them again if needed, so you shouldn’t be afraid of deleting the ones that are not being used. In fact, you don’t even need to explicitly download the images: by simply running docker run … image_name, Docker will automatically pull the image if it’s not found locally.
shell> docker image prune -a
WARNING! This will remove all images without at least one container associated to them.
Are you sure you want to continue? [y/N] y
Deleted Images:
...
Total reclaimed space: 3.97GB

Not too bad, we just reclaimed 4GB of disk space. This alone should be enough to restart the Docker service and have the pmm-server container back up. But we want more, just because we can 🙂

Reclaiming space from orphaned Docker volumes

By default, when removing a container (with docker rm), Docker will not delete the associated volumes unless you use the -v switch, as we did above. This means that, unless you were aware of this fact, you probably have some other gigabytes worth of data occupying disk space. We can easily reclaim it with the volume prune command:
shell> docker volume prune
WARNING! This will remove all local volumes not used by at least one container.
Are you sure you want to continue? [y/N] y
Deleted Volumes:
...
Total reclaimed space: 115GB

Yeah… that’s a significant amount of disk space we just reclaimed! Again, make sure you don’t need any of the volumes from your past containers before doing this, since there is obviously no turning back.

For earlier versions of Docker where this command is not available, you can check this link.

Planning ahead

As mentioned before, you should now revisit Michael’s blogpost, and set the metrics retention and queries retention variables to whatever makes sense for your environment. Even if you plan ahead, you may not be counting on the additional variable overhead of images and orphaned volumes, so you may want to (warning: shameless plug for my own blogpost ahead) use different mount points for your PMM deployment, and avoid using the shared /var/lib/docker/ mount point for it.

PMM also includes a Disk Space usage dashboard that you can use to monitor this.

Don’t forget to start back up your PMM clients, and continue to monitor them 24×7!

Photo by Andrew Wulf on Unsplash

PostgreSQL locking, part 2: heavyweight locks


PostgreSQL locking visibility for application developers and DBAs is in most cases related to heavyweight locks. Complex database locking operations require full instrumentation using views from the system catalog: it should be clear which object is locked by a specific database “backend” process. An alternative name for any lock is “bottleneck”. In order to make database operations parallel, we should split a single bottleneck into multiple operation-specific tasks.

This is the second of three blog posts on PostgreSQL locking, and it covers table-level locks. The previous post was about row-level locks, and a subsequent post reviews the latches protecting internal database structures.

Example environment

A simple table with several rows:

CREATE TABLE locktest (c INT);
INSERT INTO locktest VALUES (1), (2);

Helper view

In order to check different types of these locks, let’s create a helper view as suggested by Bruce Momjian in his presentation:

CREATE VIEW lockview AS
SELECT pid, virtualtransaction AS vxid, locktype AS lock_type,
       mode AS lock_mode, granted,
       CASE
           WHEN virtualxid IS NOT NULL AND transactionid IS NOT NULL
               THEN virtualxid || ' ' || transactionid
           WHEN virtualxid::text IS NOT NULL
               THEN virtualxid
           ELSE transactionid::text
       END AS xid_lock, relname,
       page, tuple, classid, objid, objsubid
FROM pg_locks LEFT OUTER JOIN pg_class ON (pg_locks.relation = pg_class.oid)
WHERE -- do not show our view's locks
      pid != pg_backend_pid() AND
      -- no need to show self-vxid locks
      virtualtransaction IS DISTINCT FROM virtualxid
-- granted is ordered earlier
ORDER BY 1, 2, 5 DESC, 6, 3, 4, 7;

RowShareLock (ROW SHARE)

Many applications use the read-modify-write paradigm. For instance, the application fetches single object fields from a table, modifies the data, and saves the changes back to the database. In a multi-user environment, different users could modify the same rows in the course of this transaction, and a plain SELECT could give us inconsistent data. In response to user demands, almost all SQL databases offer SELECT … FOR SHARE locking. This feature prevents other transactions from modifying the selected rows until the locking transaction commits or rolls back.

For example:

  1. There is a user with multiple bank accounts stored in an accounts table, with total_amount stored in a bank_clients table.
  2. In order to update the total_amount field, we should prevent modification of all rows related to the specific bank client.
  3. It would be better to use a single UPDATE statement to calculate total_amount and select it from the accounts table. If the update requires external data, or some action from the user, then several statements are required.

START TRANSACTION;
SELECT * FROM accounts WHERE client_id = 55 FOR SHARE;
SELECT * FROM bank_clients WHERE client_id=55 FOR UPDATE;
UPDATE bank_clients SET total_amount=38984.33, client_status='gold' WHERE client_id=55;
COMMIT;

The SELECT FOR SHARE statement creates a “RowShareLock” lock on the relation locktest.

Here’s exactly the same lock created with an SQL statement:

BEGIN;
LOCK TABLE locktest IN ROW SHARE MODE;

A single heavyweight RowShareLock is required regardless of the number of rows locked by the query.

This is illustrated with an unfinished transaction in the following example. Start the unfinished transaction, and select from lockview in a second connection to the database:

BEGIN;
SELECT * FROM locktest FOR SHARE;
-- In second connection:
postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname FROM lockview;
  pid  | vxid |   lock_type   |   lock_mode   | granted | xid_lock | relname
-------+------+---------------+---------------+---------+----------+----------
 21144 | 3/13 | transactionid | ExclusiveLock | t       | 586      |
 21144 | 3/13 | relation      | RowShareLock  | t       |          | locktest

RowExclusiveLock (ROW EXCLUSIVE)

Real queries that modify rows also require heavyweight locks on tables, one per table.

The next example uses a DELETE query, but an UPDATE will have the same effect.

All commands that modify data in a table obtain a ROW EXCLUSIVE lock.

BEGIN;
DELETE FROM locktest;
-- second connection
postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname FROM lockview;
  pid  | vxid |   lock_type   |    lock_mode     | granted | xid_lock | relname
-------+------+---------------+------------------+---------+----------+----------
 10997 | 3/6  | transactionid | ExclusiveLock    | t       | 589      |
 10997 | 3/6  | relation      | RowExclusiveLock | t       |          | locktest

This new lock is incompatible with the FOR SHARE example above. SELECT * FROM locktest FOR SHARE waits for the delete transaction to finish or abort:

postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname,page,tuple FROM lockview;
  pid  | vxid |   lock_type   |    lock_mode     | granted | xid_lock | relname  | page | tuple
-------+------+---------------+------------------+---------+----------+----------+------+-------
 10997 | 3/6  | transactionid | ExclusiveLock    | t       | 589      |          |      |
 10997 | 3/6  | relation      | RowExclusiveLock | t       |          | locktest |      |
 11495 | 5/9  | relation      | RowShareLock     | t       |          | locktest |      |
 11495 | 5/9  | tuple         | RowShareLock     | t       |          | locktest |    0 |     1
 11495 | 5/9  | transactionid | ShareLock        | f       | 589      |          |      |

Queries modifying table content also lock all indexes, even if the index does not contain modified fields.

-- preparation
CREATE INDEX c_idx2 ON locktest (c);
ALTER TABLE locktest ADD COLUMN c2 INT;
CREATE INDEX c2_idx ON locktest(c2);
-- unfinished example transaction
BEGIN;
UPDATE locktest SET c=3 WHERE c=1;
-- second connection
postgres=# SELECT * FROM lockview;
 pid  |  vxid  | lock_type  |    lock_mode     | granted | xid_lock | relname  | page | tuple | classid | objid | objsubid
------+--------+------------+------------------+---------+----------+----------+------+-------+---------+-------+----------
 3998 | 3/7844 | virtualxid | ExclusiveLock    | t       | 3/7844   |          |      |       |         |       |
 3998 | 3/7844 | relation   | RowExclusiveLock | t       |          | c2_idx   |      |       |         |       |
 3998 | 3/7844 | relation   | RowExclusiveLock | t       |          | c_idx    |      |       |         |       |
 3998 | 3/7844 | relation   | RowExclusiveLock | t       |          | c_idx2   |      |       |         |       |
 3998 | 3/7844 | relation   | RowExclusiveLock | t       |          | locktest |      |       |         |       |

ShareLock (SHARE)

The non-concurrent version of CREATE INDEX takes a ShareLock, which blocks statements that modify the table (INSERT, UPDATE, DELETE) as well as DDL such as DROP TABLE.

BEGIN;
CREATE INDEX c_idx ON locktest (c);
-- second connection
postgres=# SELECT * FROM lockview;
 pid  |  vxid  |   lock_type   |      lock_mode      | granted | xid_lock | relname  | page | tuple | classid | objid | objsubid
------+--------+---------------+---------------------+---------+----------+----------+------+-------+---------+-------+----------
 3998 | 3/7835 | virtualxid    | ExclusiveLock       | t       | 3/7835   |          |      |       |         |       |
 3998 | 3/7835 | transactionid | ExclusiveLock       | t       | 564      |          |      |       |         |       |
 3998 | 3/7835 | relation      | AccessExclusiveLock | t       |          |          |      |       |         |       |
 3998 | 3/7835 | relation      | ShareLock           | t       |          | locktest |      |       |         |       |

You can execute multiple CREATE INDEX queries in parallel, unless the index name is exactly the same. In that case, the wait happens on a row lock (a ShareLock of “transactionid” type) in the pg_class table.

Note that there is also an AccessExclusiveLock with type “relation”, but it’s not a table-level one.

ShareUpdateExclusiveLock (SHARE UPDATE EXCLUSIVE)

These database maintenance operations need to take a ShareUpdateExclusiveLock:

  • ANALYZE table
  • VACUUM (without FULL)
  • CREATE INDEX CONCURRENTLY

The ANALYZE tablename; statement updates table statistics. The query planner/optimizer is able to provide the best plans for query execution only if the statistics are up to date.
BEGIN;
ANALYZE locktest;
-- in second connection
postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname FROM lockview;
  pid  | vxid |   lock_type   |        lock_mode         | granted | xid_lock | relname
-------+------+---------------+--------------------------+---------+----------+----------
 10997 | 3/7  | transactionid | ExclusiveLock            | t       | 591      |
 10997 | 3/7  | relation      | ShareUpdateExclusiveLock | t       |          | locktest

There is no conflict between RowExclusiveLock and ShareUpdateExclusiveLock. UPDATE/DELETE/INSERT could still modify rows during ANALYZE.

VACUUM and CREATE INDEX CONCURRENTLY can be executed only outside a transaction block. To see the effects of these statements in lockview, execute a conflicting transaction first, e.g. run ANALYZE in a transaction, or run VACUUM against a huge table.

CREATE INDEX CONCURRENTLY locking can be confusing. The SHARE UPDATE EXCLUSIVE lock does not conflict with the ROW EXCLUSIVE lock used by DELETE, INSERT, and UPDATE. Unfortunately, CREATE INDEX CONCURRENTLY waits twice for active transactions to finish, because it performs two full table scans:

“In a concurrent index build, the index is actually entered into the system catalogs in one transaction, then two table scans occur in two more transactions. Before each table scan, the index build must wait for existing transactions that have modified the table to terminate.” (PostgreSQL documentation)

AccessExclusiveLock (ACCESS EXCLUSIVE)

This lock conflicts with any other locks and is used by these statements:

  • CREATE RULE
  • DROP TABLE
  • DROP INDEX
  • TRUNCATE
  • VACUUM FULL
  • LOCK TABLE (default mode)
  • CLUSTER
  • REINDEX
  • REFRESH MATERIALIZED VIEW (without CONCURRENTLY)

BEGIN;
CREATE RULE r_locktest AS ON INSERT TO locktest DO INSTEAD NOTHING;
-- second connection
postgres=# select pid,vxid,lock_type,lock_mode,granted,xid_lock,relname from lockview;
  pid  | vxid |   lock_type   |      lock_mode      | granted | xid_lock | relname
-------+------+---------------+---------------------+---------+----------+----------
 10997 | 3/19 | transactionid | ExclusiveLock       | t       | 596      |
 10997 | 3/19 | relation      | AccessExclusiveLock | t       |          | locktest

More importantly, DROP INDEX requires an AccessExclusiveLock on both the table and the index:

BEGIN;
DROP INDEX c_idx;
-- second connection
postgres=# SELECT * FROM lockview;
 pid  |  vxid  |   lock_type   |      lock_mode      | granted | xid_lock | relname  | page | tuple | classid | objid | objsubid
------+--------+---------------+---------------------+---------+----------+----------+------+-------+---------+-------+----------
 3998 | 3/7839 | virtualxid    | ExclusiveLock       | t       | 3/7839   |          |      |       |         |       |
 3998 | 3/7839 | transactionid | ExclusiveLock       | t       | 569      |          |      |       |         |       |
 3998 | 3/7839 | relation      | AccessExclusiveLock | t       |          | c_idx    |      |       |         |       |
 3998 | 3/7839 | relation      | AccessExclusiveLock | t       |          | locktest |      |       |         |       |

Note: This is the most dangerous type of lock. Avoid running queries requiring access exclusive lock in production, or at least put the application in maintenance mode.

ExclusiveLock

SQL commands don’t normally use ExclusiveLock, except for the general LOCK TABLE statement. This lock blocks all requests except non-locking selects (i.e. without FOR SHARE/UPDATE).

BEGIN;
LOCK TABLE locktest IN EXCLUSIVE MODE;
-- second connection
postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname FROM lockview;
  pid  | vxid | lock_type |   lock_mode   | granted | xid_lock | relname
-------+------+-----------+---------------+---------+----------+----------
 10997 | 3/21 | relation  | ExclusiveLock | t       |          | locktest

Savepoints

A savepoint produces an additional ExclusiveLock of transactionid type with a new xid value.

BEGIN;
SELECT * FROM locktest FOR SHARE;
SAVEPOINT s1;
SELECT * FROM locktest FOR UPDATE;
-- second connection
postgres=# SELECT pid,vxid,lock_type,lock_mode,granted,xid_lock,relname FROM lockview;
  pid  | vxid |   lock_type   |    lock_mode    | granted | xid_lock | relname
-------+------+---------------+-----------------+---------+----------+----------
 10997 | 3/37 | transactionid | ExclusiveLock   | t       | 602      |
 10997 | 3/37 | transactionid | ExclusiveLock   | t       | 603      |
 10997 | 3/37 | relation      | AccessShareLock | t       |          | c_idx
 10997 | 3/37 | relation      | RowShareLock    | t       |          | locktest

pg_advisory_lock

Sometimes application developers require synchronization between processes. In such systems, the application creates and removes locks frequently, and implementing them with table rows tends to cause table bloat. Advisory locks avoid this.

There are many functions related to advisory locks:

  • per session or per transaction
  • wait if lock is not available or immediately return false
  • exclusive or shared
  • a 64-bit integer or two 32-bit integer resource identifiers

Imagine that we have several cron jobs and that the application should prevent simultaneous runs of the same script. Each script can then check whether a lock is available in PostgreSQL for a specific integer job identifier:

postgres=# SELECT pg_try_advisory_lock(10);
 pg_try_advisory_lock
----------------------
 t
-- second connection
postgres=# SELECT * FROM lockview;
 pid  | vxid | lock_type |   lock_mode   | granted | xid_lock | relname | page | tuple | classid | objid | objsubid
------+------+-----------+---------------+---------+----------+---------+------+-------+---------+-------+----------
 3998 | 3/0  | advisory  | ExclusiveLock | t       |          |         |      |       |       0 |    10 |        1
-- other connections
SELECT pg_try_advisory_lock(10);
 pg_try_advisory_lock
----------------------
 f

The query produces an ExclusiveLock of type advisory.

Deadlocks

Any system with multiple locks tends to run into deadlock situations, in which queries will never finish. The only way to resolve such an issue is to kill one of the blocked statements. Deadlock detection is an expensive procedure in PostgreSQL: a check for deadlock happens only after a transaction has been waiting for deadlock_timeout milliseconds, one second by default.
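
To check or tune the detection delay (a small sketch; deadlock_timeout can also be set per session or in postgresql.conf):

SHOW deadlock_timeout;            -- 1s by default
-- a lower value detects deadlocks sooner, at the cost of more frequent checks
SET deadlock_timeout = '200ms';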

Here is an illustration of a deadlock situation for two different connections A and B:

Any deadlock starts from a lock wait.

A: BEGIN; SELECT c FROM locktest WHERE c=1 FOR UPDATE;
B: BEGIN; SELECT c FROM locktest WHERE c=2 FOR UPDATE; SELECT c FROM locktest WHERE c=1 FOR UPDATE;

You are not on your own when identifying deadlocks: the pg_stat_activity system view helps you find the statements and transactions causing lock waits:

postgres=# SELECT pg_stat_activity.pid AS pid,
query, wait_event, vxid, lock_type,
lock_mode, granted, xid_lock
FROM lockview JOIN pg_stat_activity ON (lockview.pid = pg_stat_activity.pid);
  pid  |          query             |  wait_event   | vxid |   lock_type   |      lock_mode      | granted | xid_lock
-------+----------------------------+---------------+------+---------------+---------------------+---------+----------
 10997 | SELECT ... c=1 FOR UPDATE; | ClientRead    | 3/43 | transactionid | ExclusiveLock       | t       | 605
 10997 | SELECT ... c=1 FOR UPDATE; | ClientRead    | 3/43 | advisory      | ExclusiveLock       | t       |
 10997 | SELECT ... c=1 FOR UPDATE; | ClientRead    | 3/43 | relation      | AccessShareLock     | t       |
 10997 | SELECT ... c=1 FOR UPDATE; | ClientRead    | 3/43 | relation      | RowShareLock        | t       |
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | transactionid | ExclusiveLock       | t       | 606
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | advisory      | ExclusiveLock       | t       |
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | relation      | AccessShareLock     | t       |
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | relation      | RowShareLock        | t       |
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | tuple         | AccessExclusiveLock | t       |
 11495 | SELECT ... c=1 FOR UPDATE; | transactionid | 5/29 | transactionid | ShareLock           | f       | 605

The SELECT FOR UPDATE on the c=2 row, run in connection A, then causes a deadlock:

SELECT c FROM locktest WHERE c=2 FOR UPDATE;

Afterwards, PostgreSQL reports in server log:

2018-08-02 08:46:07.793 UTC [10997] ERROR:  deadlock detected
2018-08-02 08:46:07.793 UTC [10997] DETAIL:  Process 10997 waits for ShareLock on transaction 606; blocked by process 11495.
Process 11495 waits for ShareLock on transaction 605; blocked by process 10997.
Process 10997: select c from locktest where c=2 for update;
Process 11495: select c from locktest where c=1 for update;
2018-08-02 08:46:07.793 UTC [10997] HINT:  See server log for query details.
2018-08-02 08:46:07.793 UTC [10997] CONTEXT:  while locking tuple (0,3) in relation "locktest"
2018-08-02 08:46:07.793 UTC [10997] STATEMENT:  SELECT c FROM locktest WHERE c=2 FOR UPDATE;
ERROR:  deadlock detected
DETAIL:  Process 10997 waits for ShareLock on transaction 606; blocked by process 11495.
Process 11495 waits for ShareLock on transaction 605; blocked by process 10997.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (0,3) in relation "locktest"

As you can see, the database server aborts one blocked transaction automatically.

Multi-way deadlocks

Normally there are just two transactions involved in a deadlock. However, in complex cases, an application could cause deadlocks with multiple transactions forming a dependency circle:

Step 1: A locks row1, B locks row2, C locks row3

Step 2: A tries to get row3, B tries to get row1, C tries to get row2

Summary

  • Do not put DDL statements in long transactions.
  • Avoid running DDL on frequently updated tables during periods of high load.
  • The CLUSTER command requires exclusive access to the table and all its indexes.
  • Monitor the PostgreSQL log for deadlock-related messages.

Photo by shy sol from Pexels

PostgreSQL locking, part 3: lightweight locks


PostgreSQL lightweight locks, or LWLocks, control access to shared memory. PostgreSQL uses a multi-process architecture and must allow only consistent reads and writes to shared memory structures. LWLocks have two levels of locking: shared and exclusive. It’s also possible to release all acquired LWLocks at once to simplify clean up. Other databases often call primitives similar to LWLocks “latches”. Because LWLocks are an implementation detail, application developers shouldn’t pay much attention to this kind of locking.

This is the third and final part of a series on PostgreSQL locking, covering the latches that protect internal database structures. Here are the previous parts: row-level locks and table-level locks.

Instrumentation

Starting from PostgreSQL 9.6, LWLock activity can be investigated with the pg_stat_activity system view. This can be useful under high CPU utilization. There are also system settings that help reduce contention on specific lightweight locks.
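
For example (a sketch using the standard wait_event columns available since 9.6), a quick aggregation shows which lightweight locks backends are currently waiting on:

SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
 WHERE wait_event_type = 'LWLock'
 GROUP BY wait_event_type, wait_event
 ORDER BY count(*) DESC;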

Before PostgreSQL 9.5, the LWLock implementation used spin-locks and was a bottleneck. This was fixed in 9.5 with an atomic state variable.

Potential heavy contention places

  • WALInsertLock: protects WAL buffers. You can increase the number of WAL buffers to get a slight improvement (see the settings sketch after this list). Incidentally, synchronous_commit=off increases pressure on the lock even more, but that’s not a bad thing. full_page_writes=off reduces contention, but it’s generally not recommended.
  • WALWriteLock: acquired by PostgreSQL processes while WAL records are flushed to disk or during a WAL segment switch. synchronous_commit=off removes the wait for the disk flush, full_page_writes=off reduces the amount of data to flush.
  • LockMgrLock: appears in top waits during a read-only workload. It latches relations regardless of their size. It’s not a single lock, but at least 16 partitions. Thus it’s important to use multiple tables during benchmarks and avoid the single-table anti-pattern in production.
  • ProcArrayLock: protects the ProcArray structure. Before PostgreSQL 9.0, every transaction acquired this lock exclusively before commit.
  • CLogControlLock: protects the CLogControl structure. If it shows at the top of pg_stat_activity, you should check the location of $PGDATA/pg_clog: it should be on a buffered file system.
  • SInvalidReadLock: protects the sinval array. Readers use a shared lock; SICleanupQueue, and other array-wide updates, require an exclusive lock. It shows at the top of pg_stat_activity when the shared buffer pool is under stress. Using a higher number of shared_buffers helps to reduce contention.
  • BufMappingLocks: protect regions of buffers. There are 128 regions (16 before 9.5) covering the whole buffer cache.
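
A hedged sketch of inspecting and adjusting the settings mentioned in the list above (the values are illustrative, and both parameters require a server restart to take effect):

SHOW wal_buffers;
SHOW shared_buffers;
-- written to postgresql.auto.conf; a server restart is required for these two
ALTER SYSTEM SET wal_buffers = '64MB';
ALTER SYSTEM SET shared_buffers = '16GB';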

Spinlocks

The lowest level of locking is spinlocks, which are implemented with CPU-specific instructions. PostgreSQL tries to change an atomic variable value in a loop; if the value changes from zero to one, the process has obtained the spinlock. If it’s not possible to get the spinlock immediately, the process increases its wait delay exponentially. There is no monitoring of spinlocks and it’s not possible to release all acquired spinlocks at once. Due to the single state change, a spinlock is always an exclusive lock. To simplify porting PostgreSQL to exotic CPU and OS variants, PostgreSQL can fall back to OS semaphores for its spinlock implementation; of course, this is significantly slower than the native CPU instruction version.

Summary

  • Use pg_stat_activity to find which queries or LWLocks are causing lock waits
  • Use recent versions of PostgreSQL, as developers are continually working on performance improvements and reducing locking contention on hot mutexes.


Locks Photo by Warren Sammut on Unsplash

Percona Server for MongoDB 4.0 Is an Ideal Solution for Even More Production Use Cases


Percona announces the planned release of Percona Server for MongoDB 4.0, the latest version of the company’s free, enhanced, drop-in replacement for MongoDB Community Edition. Percona Server for MongoDB 4.0 includes all the features of MongoDB Community Edition 4.0, along with enterprise-class features from Percona that make it ideal for enterprise production environments.

Percona Server for MongoDB 4.0 will be generally available later this year.

Organizations are increasingly leveraging multiple open source databases to meet their diverse application requirements and improve the customer experience. Percona supports these organizations with a range of enhanced, production-ready open source databases, enterprise-grade support, and consulting services.

With more than 385,000 downloads, Percona Server for MongoDB provides all the cost and agility benefits of free, proven open source software, along with greater security, reliability and flexibility. With Percona Server for MongoDB, an increasing number of organizations can confidently run the document-based NoSQL MongoDB database to support their product catalogs, online shopping carts, Internet of Things (IoT) applications, mobile/social apps and more. Percona also backs Percona Server for MongoDB 4.0 and MongoDB Community Edition with the support and consulting services enterprises need to achieve optimal performance and maximize the value they obtain from the software – all with lower cost and complexity.

Percona Kubernetes Operator for MongoDB enables easier deployment of Percona Server for MongoDB environments – including standalone, replica set, or sharded cluster – inside Kubernetes and OpenShift platforms, without the need to move to an Enterprise Server. Management and backup are provided through the Operator working with Percona Monitoring and Management and hot backup.

Percona Server for MongoDB 4.0 contains all of the features in MongoDB Community Edition 4.0, including important capabilities that greatly expand its use cases:

  • Support for multi-document ACID transactions – ensuring accurate updates to all of the documents involved in a transaction, and moving the complexity of achieving these updates from the application to the database.
  • Support for SCRAM-SHA-256 authentication – making the production environment more secure and less vulnerable to external attack.
  • New type conversion and string operators – providing greater flexibility when performing aggregations.

Percona Server for MongoDB 4.0 also offers essential enterprise features for free, including:

  • Encrypted WiredTiger storage engine (“data at rest encryption”) with local key management. Integration with key management systems will be available in future releases of Percona Server for MongoDB.
  • SASL Authentication plugin for enabling authentication through OpenLDAP or Active Directory.
  • Open-source auditing for visibility into user and process actions in the database, with the ability to redact sensitive information (such as usernames and IP addresses) from log files.
  • Hot backups to protect against data loss in the event of a crash or disaster – backup activity does not impact performance.
  • Percona Memory Engine, a 100 percent open source in-memory storage engine designed for Percona Server for MongoDB, which is ideal for in-memory computing and other applications demanding very low-latency workloads.
  • Integration with Percona Toolkit and Percona Monitoring and Management for query performance analytics and troubleshooting.
  • Enhanced query profiling.

Percona Server for MySQL 8.0 Delivers Increased Reliability, Performance and Security


Percona released a Release Candidate (RC) version of Percona Server for MySQL 8.0, the company’s free, enhanced, drop-in replacement for MySQL Community Edition. Percona Server for MySQL 8.0 includes all the features of MySQL Community Edition 8.0, along with enterprise-class features from Percona that make it ideal for enterprise production environments. The latest release offers increased reliability, performance and security.

Percona Server for MySQL 8.0 General Availability (GA) will be available later this year. You can learn how to install the release candidate software here. Please note that this release candidate is not intended for, nor recommended for, production environments.

MySQL databases remain a pillar of enterprise data centers. But as the amount of data collected continues to soar, and as the types of databases deployed for specific applications continue to expand, organizations require a powerful and cost-effective MySQL solution for their production environments. Percona meets this need with its mature, proven open source alternative to MySQL Community Edition. Percona also backs MySQL 8.0 with the support and consulting services enterprises need to achieve optimal performance and maximize the value they obtain from the software – all with lower cost and complexity.

With more than 4,550,000 downloads, Percona Server for MySQL offers self-tuning algorithms and support for extremely high-performance hardware, delivering excellent performance and reliability for production environments. Percona Server for MySQL is trusted by thousands of enterprises to provide better performance. The following capabilities are unique to Percona Server for MySQL 8.0:

  • Greater scalability and availability, enhanced backups, and increased visibility to improve performance, reliability and usability
  • Parallel doublewrite functionality for greatly improved write performance, resulting in increased speed
  • Additional write-optimized storage engine, MyRocks, which takes advantage of modern hardware and database technology to reduce storage requirements, maintenance costs, and increase ROI in both on-premises and cloud-based applications, delivered with a MySQL-compatible interface.
  • Enhanced encryption functionality – including integration with Hashicorp Vault to simplify the management of encryption keys – for increased security
  • Advanced PAM-based authentication, audit logging, and threadpool scalability – enterprise-grade features available in Percona Server for MySQL without a commercial license

Percona Server for MySQL 8.0 also contains all the new features introduced in MySQL Community Edition 8.0, including:

  • Greater Reliability – A new transactional data dictionary makes recovery from failure easier, providing users with a higher level of comfort that data will not be lost. The transactional data dictionary is now crash-safe and centralized. In addition, support for atomic Data Definition Language (DDL) statements ensures that all operations associated with a DDL transaction are committed or rejected as a unit.
  • Enhanced Performance – New functions and expressions (along with Percona’s parallel doublewrite buffer) improve overall performance, allowing users to see their data more quickly. Enhanced JSON functionality improves document storage and query capabilities, and the addition of window functions provides greater flexibility for aggregation queries. Common Table Expressions (CTE) enable improved query syntax for complex queries (see the short example after this list).
  • Increased Security – The ability to collect a typical series of permission grant statements for a user into a defined role and then apply that role to a user in MySQL makes the database environment more secure from outside attack by allowing administrators to better manage access permissions for authorized users. SQL Roles also enable administrators to more easily and efficiently determine a set of privileges for multiple associated users, saving time and reducing errors.
  • Expanded Queries – Percona Server for MySQL 8.0 provides support for spatial data types and indexes, as well as for Spatial Reference System (SRS), the industry-standard method for geospatial lookup.
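
A small illustration of the two query features named above (a sketch against a hypothetical orders table):

-- a common table expression feeding a window function (MySQL 8.0 syntax)
WITH monthly AS (
    SELECT customer_id,
           DATE_FORMAT(order_date, '%Y-%m') AS ym,
           SUM(amount) AS total
      FROM orders
     GROUP BY customer_id, ym
)
SELECT customer_id, ym, total,
       RANK() OVER (PARTITION BY ym ORDER BY total DESC) AS rank_in_month
  FROM monthly;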

Learn more about the Percona Server for MySQL 8.0 RC release here.


How To Best Use Percona Server Column Compression With Dictionary


Very often, database performance is affected by the inability to cache all the required data in memory. Disk IO, even when using the fastest devices, takes much more time than a memory access. With MySQL/InnoDB, the main memory cache is the InnoDB buffer pool. There are many strategies we can try to fit as much data as possible in the buffer pool, and one of them is data compression.

With regular MySQL, to compress InnoDB data you can either use “Barracuda page compression” or “transparent page compression with punch holes”. The use of the ZFS filesystem is another possibility, but it is external to MySQL and doesn’t help with caching. All these solutions are transparent, but often they also have performance and management implications. If you are using Percona Server for MySQL, you have yet another option, “column compression with dictionary”. This feature is certainly not receiving the attention it merits. I think it is really cool: let me show you why.

We all know what compression means, who has not zipped a file before attaching it to an email? Compression removes redundancy from a file. What about the dictionary? A compression dictionary is a way to seed the compressor with expected patterns, in order to improve the compression ratio. Because you can specify a dictionary, the scope of usefulness of column compression with the Percona Server for MySQL feature is greatly increased. In the following sections, we’ll review the impacts of a good dictionary, and devise a way to create a good one without any guessing.

A simple use case

A compression algorithm needs a minimal amount of data in order to achieve a reasonable compression ratio. Typically, if the object is below a few hundred bytes, there is rarely enough data to have repetitive patterns and when the compression header is added, the compressed data can end up larger than the original.

mysql> select length('Hi!'), length(compress('Hi!'));
+---------------+-------------------------+
| length('Hi!') | length(compress('Hi!')) |
+---------------+-------------------------+
|             3 |                      15 |
+---------------+-------------------------+
1 row in set (0.02 sec)

Compressing a string of three bytes results in a binary object of 15 bytes. That’s counterproductive.

In order to illustrate the potential of the dictionary, I used this dataset:

http://skeeto.s3.amazonaws.com/share/JEOPARDY_QUESTIONS1.json.gz

It is a set of 100k Jeopardy questions written in JSON. To load the data in MySQL, I created the following table:

mysql> show create table TestColCompression\G
*************************** 1. row ***************************
Table: TestColCompression
Create Table: CREATE TABLE `TestColCompression` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`question` text NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=79977 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Then, I did some formatting to create insert statements:

zcat JEOPARDY_QUESTIONS1.json.gz | perl -p -e 's/\[\{/\{/g' | perl -p -e 's/\}, \{/\}\n\{/g' | perl -p -e "s/'/''/g" | \
  (while read line; do echo "insert into TestColCompression (question) values ('$line');"; done )

And I executed the inserts. About 20% of the rows had some formatting issues but nevertheless, I ended up with close to 80k rows:

mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78110
Avg_row_length: 316
Data_length: 24690688
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 15:16:41
Update_time: 2018-10-26 15:40:34
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

The average row length is 316 bytes for a total data size of 23.55MB. The question JSON objects are large enough to matter, but barely large enough for compression. Here are the first five rows:

mysql> select question from TestColCompression limit 5\G
*************************** 1. row ***************************
question: {"category": "HISTORY", "air_date": "2004-12-31", "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", "value": "$200", "answer": "Copernicus", "round": "Jeopardy!", "show_number": "4680"}
*************************** 2. row ***************************
question: {"category": "ESPN's TOP 10 ALL-TIME ATHLETES", "air_date": "2004-12-31", "question": "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'", "value": "$200", "answer": "Jim Thorpe", "round": "Jeopardy!", "show_number": "4680"}
*************************** 3. row ***************************
question: {"category": "EVERYBODY TALKS ABOUT IT...", "air_date": "2004-12-31", "question": "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'", "value": "$200", "answer": "Arizona", "round": "Jeopardy!", "show_number": "4680"}
*************************** 4. row ***************************
question: {"category": "OLD FOLKS IN THEIR 30s", "air_date": "2009-05-08", "question": "'The district of conservative rep. Patrick McHenry in this state includes Mooresville, a home of NASCAR'", "value": "$800", "answer": "North Carolina", "round": "Jeopardy!", "show_number": "5690"}
*************************** 5. row ***************************
question: {"category": "MOVIES & TV", "air_date": "2009-05-08", "question": "'Tim Robbins played a public TV newsman in "Anchorman: The Legend of" him'", "value": "$800", "answer": "Ron Burgundy", "round": "Jeopardy!", "show_number": "5690"}

Let’s begin with straight column compression, without specifying a dictionary:

mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED;
Query OK, 79976 rows affected (4.25 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status | OK |
+----------------------------+---------+----------+----------+
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78995
Avg_row_length: 259
Data_length: 20496384
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 15:47:56
Update_time: 2018-10-26 15:47:56
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

As expected, the data didn’t compress much. The compression ratio is 0.82 or, expressed as a space saving, 18%. Since the JSON headers are always the same, and are present in all questions, we should at least use them for the dictionary. Trying a minimal dictionary made of the headers gives:

mysql> SET @dictionary_data = 'category' 'air_date' 'question' 'value' 'answer' 'round' 'show_number' ;
Query OK, 0 rows affected (0.01 sec)
mysql> CREATE COMPRESSION_DICTIONARY simple_dictionary (@dictionary_data);
Query OK, 0 rows affected (0.00 sec)
mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY simple_dictionary;
Query OK, 79976 rows affected (4.72 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status | OK |
+----------------------------+---------+----------+----------+
1 row in set (0.00 sec)
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78786
Avg_row_length: 246
Data_length: 19447808
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 17:58:17
Update_time: 2018-10-26 17:58:17
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

There is a little progress: we now have a compression ratio of 0.79. Obviously we could do more but, without a tool, we would have to guess. A compressor like zlib builds a dictionary as part of its compression effort; could we use that? Yes, but only if we can generate it correctly and access the result. That’s not readily available with the common compressors I know. Fortunately, someone else had the same issue and wrote a compressor able to save its dictionary. Please let me introduce femtozip.

Femtozip to the rescue

The tool, by itself, has no magic algorithm; from what I can understand from the code, it is based on zlib. Anyway, we won’t compress anything with it, we’ll only use it to generate a good dictionary. In order to create a dictionary, the tool looks at a set of files and tries to find patterns between them. Using a single big file defeats the purpose. So, I generated one file per question with:

mkdir questions
cd questions
l=1; mysql -u blog -pblog colcomp -e 'select question from TestColCompression' | (while read line; do echo $line > ${l}; let l=l+1; done)

Then, I used the following command to generate a 1024-byte dictionary using all the files starting with "1":

../femtozip/cpp/fzip/src/fzip --model ../questions_1s.mod --build --dictonly --maxdict 1024 1*
Building dictionary...

In about 10s the job was done. I tried with all the 80k files and… I had to kill the process after thirty minutes. Anyway, there are 11111 files starting with “1”, a very decent sample. Our generated dictionary looks like:

cat ../questions_1s.mod
", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "'e", "round": "Jeopardy!", "show_number": "r", "round": "{"cate gory": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "'Jeopardy!", "show_number": "2'", "value": "$1000", "answer": "7", "question": "'The ", "question": "'A'", "value": "$600", "answer": "9", "questi on": "'In ", "question": "'This 3", "question": "'2", "question": "'e'", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"'", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s'", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy !", "show_number": "3", "round": "Jeopardy!", "show_number": "5'", "value": "$200", "answer": "'", "value": "$800", "answer": "'", "value": "$400", "answer": "

With some formatting, I was able to create a dictionary with the above data:

mysql> SET @dictionary_data = '", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "''e", "round": "Jeopardy!", "show_number": "r", "round": "{"category": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "''Jeopardy!", "show_number": "2''", "value": "$1000", "answer": "7", "question": "''The ", "question": "''A''", "value": "$600", "answer": "9", "question": "''In ", "question": "''This 3", "question": "''2", "question": "''e''", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"''", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s''", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "5''", "value": "$200", "answer": "''", "value": "$800", "answer": "''", "value": "$400", "answer": "' ;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE COMPRESSION_DICTIONARY femtozip_dictionary (@dictionary_data);
Query OK, 0 rows affected (0.00 sec)
And then, I altered the table to use the new dictionary:

mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY femtozip_dictionary;
Query OK, 79976 rows affected (4.05 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status | OK |
+----------------------------+---------+----------+----------+
1 row in set (0.00 sec)
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 79861
Avg_row_length: 190
Data_length: 15220736
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 17:56:09
Update_time: 2018-10-26 17:56:09
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

That’s interesting: we are now achieving a ratio of 0.61, a significant improvement. I pushed my luck and tried with a 2048-byte dictionary. It further reduced the ratio to 0.57 but that was about the best I got; larger dictionaries didn’t lower the ratio below 0.57. Zlib supports up to 32KB for the dictionary.

So, to recap:

  • column compression without dictionary, ratio of 0.82
  • column compression with simple dictionary, ratio of 0.79
  • column compression with a 1k dictionary from femtozip, ratio of 0.61
  • column compression with a 2k dictionary from femtozip, ratio of 0.57

The above example stores a JSON document in a text column. MySQL 5.7 includes a JSON datatype which behaves a bit differently regarding the dictionary: delimiting characters like ‘{}’ are removed from the on-disk representation of a JSON column. If you have TBs of data in similar tables, you should really consider column compression and a systematic way of determining the dictionary with femtozip. In addition to improving the compression, it is likely to be the least performance-impacting solution. Would it be interesting to generate a dictionary from existing data with a command like this one?

CREATE COMPRESSION_DICTIONARY_FROM_DATA A_good_dictionary (2048, select questions from TestColCompression limit 10000);

where the dictionary creation process would implicitly include steps similar to the ones I performed with femtozip.

Newly-Released PostgreSQL Minor Versions: Time to Update!


In this blog post we’ll look at what the newly-released PostgreSQL minor versions contain. You probably want to update your current versions and use these updates.

You might already have seen that the community released updates for all supported PostgreSQL versions on November 8, 2018. PostgreSQL releases minor versions with several bug fixes and feature enhancements each quarter. An important point to note is that PostgreSQL 9.3 got its final minor version release (9.3.25) this quarter, and is no longer supported.

We always recommend that you keep your PostgreSQL databases updated to the latest minor versions. Applying a minor release requires a restart after installing the new binaries. The following is the sequence of steps you should follow to upgrade to the latest minor version (a minimal shell sketch follows the list):

  1. Shutdown the PostgreSQL database server
  2. Install the updated binaries
  3. Restart your PostgreSQL database server
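
For example, on a yum-based system using the PGDG packages for PostgreSQL 11, the sequence could look roughly like this (package and service names are assumptions; adjust them to your distribution and major version):

sudo systemctl stop postgresql-11
sudo yum update postgresql11 postgresql11-server postgresql11-libs
sudo systemctl start postgresql-11
psql -c "SELECT version();"    # confirm the new minor version is running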

Most times, you can choose to update the minor versions in a rolling fashion in a master-slave (replication) setup: perform the update on one server after another, but not all at once. Rolling updates avoid taking down reads and writes simultaneously. However, we recommend that you shut down, update, and restart them all at once if your environment allows for the downtime.

One of the most important fixes is a security fix: CVE-2018-16850. The bug allowed an attacker with CREATE privileges on some non-temporary schema or TRIGGER privileges on some table to create a malicious trigger that, when dumped and restored using pg_dump/pg_restore, would result in additional SQL statements being executed. This applies to PostgreSQL 10 and 11 versions.

Before proceeding further, let’s look at the list of minor versions released this quarter.

  • PostgreSQL 11.1
  • PostgreSQL 10.6
  • PostgreSQL 9.6.11
  • PostgreSQL 9.5.15
  • PostgreSQL 9.4.20
  • PostgreSQL 9.3.25

Now, let us look into the benefits you should see by updating your Postgres versions with the latest minor versions.

PostgreSQL 11.1

PostgreSQL 11.0 was released on October 18, 2018. You might want to look at our blog post on our first take on PostgreSQL 11. With the new minor release, PostgreSQL 11.1, we get some interesting functionalities and fixes just 21 days after the previous release, as seen here. The following is a small list of fixes that you might find interesting:

  • Ability to create child indexes of partition tables in another tablespace.
  • NULL handling in parallel hashed multi-batch left joins.
  • Fix to the strictness logic that incorrectly ignored rows whose ORDER BY values were NULL.
  • The recheck_on_update option is now forced off for all indexes, as the feature is not ready yet and requires more work.
  • Prevent creation of a partition in a trigger attached to its parent table.
  • Disallow the pg_read_all_stats role from executing pg_stat_statements_reset() to reset the stats, as that role is only meant for monitoring. You must run ALTER EXTENSION pg_stat_statements UPDATE … for this to take effect (see the sketch after this list).
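
The last item deserves a concrete example. On each database where the extension is installed, a superuser would run something like this (a minimal sketch):

postgres=# ALTER EXTENSION pg_stat_statements UPDATE;
ALTER EXTENSION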

PostgreSQL 10.6

There are some common fixes that were applied to PostgreSQL 11.1 and PostgreSQL 10.6. You can find PostgreSQL 10.6 release details here. Some of the fixes applied to PostgreSQL 10.6 are in common with other supported PostgreSQL versions, as highlighted below:

  • Disallow the pg_read_all_stats role from executing pg_stat_statements_reset() to reset the stats, as that role is only meant for monitoring. (PostgreSQL 11.1 and 10.6)
  • Avoid pushing sub-SELECTs containing window functions, LIMIT, or OFFSET to parallel workers to avoid the behavior of different workers getting different answers due to row-ordering variations. (PostgreSQL 9.6 and 10.6)
  • Fixed the WAL file recycling logic that might make a standby fail to remove the WAL files that need to be removed. (PostgreSQL 9.5.15, 9.6.11 and 10.6)
  • Handling of commit-timestamp tracking during recovery has been fixed to avoid recovery failures while trying to fetch the commit timestamp for a transaction that did not record it. (PostgreSQL 9.5.15, 9.6.11 and 10.6)
  • Applied the fix that ensures that the background workers are stopped properly when the postmaster receives a fast-shutdown request before completing the database startup. (PostgreSQL 9.5.15, 9.6.11 and 10.6)
  • Fixed unnecessary failures or slow connections when multiple target host names are used in libpq, so that DNS lookups happen one at a time rather than all at once. (PostgreSQL 10.6)

Following is a list of some common fixes applied to PostgreSQL 9.6.11, PostgreSQL 9.5.15, PostgreSQL 9.4.20 and PostgreSQL 9.3.25:

  • Avoid O(N^2) slowdown in regular expression match/split functions on long strings.
  • Fix mis-execution of SubPlans when the outer query is being scanned backward.
  • Fix failure of UPDATE/DELETE … WHERE CURRENT OF … after rewinding the referenced cursor.
  • Fix EvalPlanQual to avoid crashes or wrong answers in concurrent updates, when the code contains an uncorrelated sub-SELECT inside a CASE construct.
  • Memory leak in repeated SP-GiST index scans has been fixed.
  • Ensure that hot standby processes use the correct WAL consistency point to prevent possible misbehavior after reaching a consistent database state during WAL replay.
  • Fix possible inconsistency in pg_dump’s sorting of dissimilar object names.
  • Ensure that pg_restore will schema-qualify the table name when emitting DISABLE/ENABLE TRIGGER commands when a restore is run using a restrictive search_path.
  • Fix pg_upgrade to handle event triggers in extensions correctly.
  • Fix pg_upgrade’s cluster state check to work correctly on a standby server.
  • Fix build problems on macOS 10.14

Now that you understand the added fixes to existing PostgreSQL versions, we recommend that you test and update your PostgreSQL databases with the new minor versions (if you haven’t already).

If you are currently running your databases on PostgreSQL 9.3.x or earlier, we recommend that you prepare a plan to upgrade your PostgreSQL databases to a supported version ASAP. Please subscribe to our blog posts so that you know about the various options for upgrading your PostgreSQL databases to a supported major version.

Installing and Configuring JIT in PostgreSQL 11


Just-in-time (JIT) compilation of SQL statements is one of the highlighted features in PostgreSQL 11. There is great excitement in the community because of the many claims of up to a 30% jump in performance. Not all queries and workloads get the benefit of JIT compilation, so you may want to test your workload against this new feature.

However, it is important to have a general understanding of what it does and where we can expect performance gains. Installing PostgreSQL 11 with the new JIT compilation feature requires a few extra steps and packages. The time and effort needed to figure this out shouldn’t be a reason to shy away from trying this cutting-edge feature and testing a workload against it. This blog post is for those who want to try it.

What is JIT and What it does in PostgreSQL

Normal SQL execution in any DBMS software is similar to what an interpreted language does with source code: no machine code gets generated from your SQL statement. But we all know how dramatic the performance gains can be from JIT compilation and execution of the machine code it generates. We saw the magic the Google V8 engine did for the JavaScript language. The quest to do something similar with SQL statements has been around for quite some time, but it is a challenging task.

It is challenging because we don’t have the source code (the SQL statement) ready within the PostgreSQL server. The source code that needs to undergo JIT has to come from client connections, and there could be expressions/functions with different numbers of arguments, dealing with tables with different numbers and types of columns.

Generally, a computer program won’t be modified at this level while it is running, so branch predictions are possible. The unpredictable and dynamic nature of SQL statements coming from client connections from time to time leaves no scope for prediction or compilation in advance. That means the JIT compiler has to kick in every time the database gets an SQL statement. For this reason, PostgreSQL needs compiler infrastructure like LLVM continuously available behind the scenes. Even though there were a couple of other options, the main developer of this feature (Andres Freund) had strong reasons why LLVM was the right choice.

In PostgreSQL 11, the JIT feature currently covers:

  1. Accelerating expression evaluation: expressions in WHERE clauses, target lists, aggregates, and projections
  2. Tuple deforming: converting the on-disk image of a tuple to its in-memory representation
  3. Inlining: the bodies of small custom functions, operators, and user-defined data types are inlined into the expressions using them
  4. Using compiler optimizations provided by LLVM to prepare optimized machine code

In this blog, we are going to see how to install PostgreSQL with JIT. Just like regular PostgreSQL installations, we have two options:

  1. Get PostgreSQL from the packages in the PGDG repository
  2. Build PostgreSQL from source

Option 1. Install from PGDG repository.

Compiling from source requires us to install all the compilers and tools. We might want to avoid this for various reasons. Installing packages from a PGDG repository is straightforward. On production systems or in a container, you might want to install only the bare minimum required packages: additional packages you don’t really use are always a security concern. Distributions like Ubuntu provide more recent versions of libraries and tool-sets in their default repos. However, distributions like CentOS/RHEL are quite conservative: their priority is stability and proven software rather than cutting-edge features. So this section of the post is mostly relevant for CentOS 7 / RHEL 7.

Here are the steps for a bare-minimum installation of PostgreSQL with the JIT feature on CentOS 7.

Step 1. Install the PGDG repo and the PostgreSQL server package.

This is usually the bare minimum installation if we don’t need the JIT feature.

sudo yum install https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-7-x86_64/pgdg-centos11-11-2.noarch.rpm
sudo yum install postgresql11-server

At this stage, we can initialize the data directory and start the service if we don’t need JIT:

sudo /usr/pgsql-11/bin/postgresql*-setup initdb
sudo systemctl start postgresql-11

Step 2. Install EPEL repository

sudo yum install epel-release

Step 3. Install package for PostgreSQL with llvmjit

sudo yum install postgresql11-llvmjit

Since we have already added the EPEL repository, the dependency can now be resolved by YUM, which pulls and installs the necessary packages from EPEL. The installation message lists them:

...
Installing:
postgresql11-llvmjit      x86_64     11.1-1PGDG.rhel7     pgdg11    9.0 M
Installing for dependencies:
llvm5.0                   x86_64     5.0.1-7.el7          epel      2.6 M
llvm5.0-libs              x86_64     5.0.1-7.el7          epel      13 M
...

As we can see, two additional packages, llvm5.0 and llvm5.0-libs, get installed.

Note for Ubuntu users:

As already mentioned, the repositories of recent Ubuntu versions contain recent versions of the LLVM libraries. For example, the Ubuntu 16.04 LTS repo contains libllvm6.0 by default. Moreover, the PostgreSQL server package is not split into a separate package for the JIT-related files, so a default installation of PostgreSQL 11 gets you the JIT feature as well:

wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
sudo apt install postgresql-11

Option 2. Building from Source

The primary means of distributing PostgreSQL is the source code. Building a minimal PostgreSQL instance requires just a C compiler, but building with the JIT option requires a few more things. One of the challenges you can run into is build errors caused by older versions of LLVM and Clang present on the system.

Step 1. Download PostgreSQL source tarball and unpack

Tarballs are available in the repository. We can grab and unpack the latest:

curl -LO https://ftp.postgresql.org/pub/source/v11.0/postgresql-11.0.tar.bz2
tar -xvf postgresql-11.0.tar.bz2

Step 2.  Get SCL Repository and Install toolset

The latest versions of LLVM, Clang, and GCC are available in SCL. We can get everything in one go:

sudo yum install centos-release-scl
sudo yum install llvm-toolset-7 llvm-toolset-7-llvm-devel.x86_64

Now you can either set or edit your PATH so that all the new tools are found. I prefer to put that into my profile file:

PATH=/opt/rh/devtoolset-7/root/usr/bin/:/opt/rh/llvm-toolset-7/root/usr/bin/:$PATH

Alternatively, we can open a shell with SCL enabled:

scl enable devtoolset-7 llvm-toolset-7 bash

We should attempt to compile the source from a shell with all these paths set.

Step 3. Install Additional libraries/tools

Based on the configuration options you want, this list may change. Consider this as a sample for demonstration purposes:

sudo yum install  readline-devel zlib-devel libxml2-devel openssl-devel

Step 4. Configure with the --with-llvm option and make

Now we should be able to configure and make with our preferred options. The JIT feature will be available if the --with-llvm option is specified. For this demonstration, I am using an installation directory under my home directory (/home/postgres/pg11):
./configure --prefix=/home/postgres/pg11 --with-openssl --with-libxml --with-zlib --with-llvm
make
make install

Enabling JIT

You may observe that there is a new directory named bitcode under PostgreSQL’s lib folder. It contains lots of files with a .bc extension: these are pre-generated bytecodes for LLVM, used to facilitate features like inlining.
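
A quick way to check that these bitcode files are present is to list the directory; this assumes the pg_config from the new build is in your PATH:

ls $(pg_config --pkglibdir)/bitcode | head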

By default, the JIT feature is disabled in PostgreSQL 11. If you want to test it, you may have to enable the jit parameter:
postgres=# ALTER SYSTEM SET jit=on;
ALTER SYSTEM
postgres=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
postgres=# show jit;
 jit
-----
 on
(1 row)

By default, most simple queries won’t use JIT because of cost: the cost threshold above which JIT kicks in is high. If we just want to test whether JIT is properly configured, we can lie to PostgreSQL and tell it that the cost is very low by adjusting the parameter value. However, we should keep in mind that we are then accepting negative performance gains. Let me show a quick example:

postgres=# SET jit_above_cost=5;
SET
postgres=# create table t1 (id int);
CREATE TABLE
postgres=# insert into t1 (SELECT (random()*100)::int FROM generate_series(1,800000) as g);
INSERT 0 800000
postgres=# analyze t1;
ANALYZE
postgres=# explain select sum(id) from t1;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=8706.88..8706.89 rows=1 width=8)
   ->  Gather  (cost=8706.67..8706.88 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=7706.67..7706.68 rows=1 width=8)
               ->  Parallel Seq Scan on t1  (cost=0.00..6873.33 rows=333333 width=4)
 JIT:
   Functions: 6
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(8 rows)

As we can see in the above example, a separate JIT section comes up in the explain plan.

We expect JIT compilation to make a difference in complex analytical queries, because the overhead of JIT compilation is only paid back if the generated code runs for long enough. Here is a simple aggregate query for demonstration (I know this is not a complex query, and not the perfect example for demonstrating the JIT feature):

postgres=# EXPLAIN ANALYZE SELECT COMPANY_ID,
      SUM(SHARES) TOT_SHARES,
      SUM(SHARES* RATE) TOT_INVEST,
      MIN(SHARES* RATE) MIN_TRADE,
      MAX(SHARES* RATE) MAX_TRADE,
      SUM(SHARES* RATE * 0.002) BROKERAGE
FROM TRADING
GROUP BY COMPANY_ID;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=757298.72..758741.91 rows=5005 width=138) (actual time=16992.290..17011.395 rows=5000 loops=1)
   Group Key: company_id
   ->  Gather Merge  (cost=757298.72..758466.64 rows=10010 width=138) (actual time=16992.270..16996.919 rows=15000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=756298.70..756311.21 rows=5005 width=138) (actual time=16983.900..16984.356 rows=5000 loops=3)
               Sort Key: company_id
               Sort Method: quicksort  Memory: 1521kB
               Worker 0:  Sort Method: quicksort  Memory: 1521kB
               Worker 1:  Sort Method: quicksort  Memory: 1521kB
               ->  Partial HashAggregate  (cost=755916.09..755991.16 rows=5005 width=138) (actual time=16975.997..16981.354 rows=5000 loops=3)
                     Group Key: company_id
                     ->  Parallel Seq Scan on trading  (cost=0.00..287163.65 rows=12500065 width=12) (actual time=0.032..1075.833 rows=10000000 loops=3)
 Planning Time: 0.073 ms
 Execution Time: 17013.116 ms
(15 rows)

We can switch on the JIT parameter at the session level and retry the same query:

postgres=# SET JIT=ON;
SET
postgres=# EXPLAIN ANALYZE SELECT COMPANY_ID,
      SUM(SHARES) TOT_SHARES,
      SUM(SHARES* RATE) TOT_INVEST,
      MIN(SHARES* RATE) MIN_TRADE,
      MAX(SHARES* RATE) MAX_TRADE,
      SUM(SHARES* RATE * 0.002) BROKERAGE
FROM TRADING
GROUP BY COMPANY_ID;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=757298.72..758741.91 rows=5005 width=138) (actual time=15672.809..15690.901 rows=5000 loops=1)
   Group Key: company_id
   ->  Gather Merge  (cost=757298.72..758466.64 rows=10010 width=138) (actual time=15672.781..15678.736 rows=15000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=756298.70..756311.21 rows=5005 width=138) (actual time=15661.144..15661.638 rows=5000 loops=3)
               Sort Key: company_id
               Sort Method: quicksort  Memory: 1521kB
               Worker 0:  Sort Method: quicksort  Memory: 1521kB
               Worker 1:  Sort Method: quicksort  Memory: 1521kB
               ->  Partial HashAggregate  (cost=755916.09..755991.16 rows=5005 width=138) (actual time=15653.390..15658.581 rows=5000 loops=3)
                     Group Key: company_id
                     ->  Parallel Seq Scan on trading  (cost=0.00..287163.65 rows=12500065 width=12) (actual time=0.039..1084.820 rows=10000000 loops=3)
 Planning Time: 0.072 ms
 JIT:
   Functions: 28
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 5.844 ms, Inlining 137.226 ms, Optimization 201.152 ms, Emission 125.022 ms, Total 469.244 ms
 Execution Time: 15696.092 ms
(19 rows)

Here we see a 7.7% improvement in performance. I executed this several times and found that the performance gain is consistently 7-8% for this simple query (which takes 15 seconds to execute). The gains are higher for queries with more calculations/expressions.

Summary

It is fairly simple to install and configure JIT with PostgreSQL, as demonstrated above. One point we would like to highlight is that installing JIT packages and enabling the JIT feature can be done online, while the database is up and running, because all JIT-related parameters are dynamic in nature. Parameter changes can be loaded with a SIGHUP signal or by the superuser running SELECT pg_reload_conf(). If it is not helping our workload, we can turn it off at any time. Nothing stops you from trying it in a non-production environment. We might not see a gain in small and simple queries that take little time to execute, because the overhead of JIT compilation can exceed the cost of executing the SQL statement itself. But we should expect a good gain in OLAP workloads with complex queries that run for a longer duration.
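
If JIT turns out not to help a particular workload, it is just as easy to turn it off again, either for a single session or globally; a minimal sketch:

postgres=# SET jit = off;               -- current session only
SET
postgres=# ALTER SYSTEM SET jit = off;  -- written to postgresql.auto.conf
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)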

Identifying Unused Indexes in MongoDB


As with MySQL, having too many indexes on a MongoDB collection not only affects overall write performance, but disk and memory resources as well. While MongoDB scales both reads and writes predictably well, maintaining a healthy schema design should always remain a core characteristic of a good application stack.

Aside from knowing when to add an index to improve query performance, and how to modify indexes to satisfy changing query complexities, we also need to know how to identify unused indexes and cut their unnecessary overhead.

First of all, you can already identify access operation counters for each collection using the $indexStats aggregation command (the indexStats command before 3.0). This provides two important pieces of information: the ops counter value, and since, which is when the ops counter first iterated to one. It is reset when the mongod instance is restarted.
m34:PRIMARY> db.downloads.aggregate( [ { $indexStats: { } } ] ).pretty()
{
	"name" : "_id_",
	"key" : {
		"_id" : 1
	},
	"host" : "mongodb:27018",
	"accesses" : {
		"ops" : NumberLong(0),
		"since" : ISODate("2018-11-10T15:53:31.429Z")
	}
}
{
	"name" : "h_id_1",
	"key" : {
		"h_id" : 1
	},
	"host" : "mongodb:27018",
	"accesses" : {
		"ops" : NumberLong(0),
		"since" : ISODate("2018-11-10T15:54:57.634Z")
	}
}

From this information, if the ops counter is zero for an index, we can assume it has not been used either since the index was added or since the server was restarted, with a few exceptions. A unique index can appear unused even though it still enforces a constraint (a uniqueness check on INSERT does not increment the ops counter). The documentation also indicates that the index stats counters do not get updated by TTL index expiration or by chunk split and migration operations.
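
With the caveats above in mind, a quick way to list only the candidate indexes is to add a $match stage on a zero ops counter; this is a sketch in the mongo shell against the same example collection:

m34:PRIMARY> db.downloads.aggregate([ { $indexStats: {} }, { $match: { "accesses.ops": 0 } }, { $project: { name: 1, "accesses.since": 1 } } ])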

Be aware of occasional index use

One golden rule, however, is that this type of observation is subjective: before you decide to drop an index, make sure that the counter has been collecting for a considerable amount of time. Dropping an index that is only used once a month for some heavy reporting can be problematic.

The same information from $indexStats can also be made available to PMM. By default, the mongo_exporter does not include this information, but it can be enabled as an additional collection parameter:
sudo pmm-admin add mongodb:metrics --uri 127.0.0.1:27018 -- -collect.indexusage

Once enabled, we can create a custom graph for this information from any PMM dashboard, as shown below. As mentioned above, any index that has zero values will not have been used during the time range shown in the graph. One minor issue with the collector is that each metric does not come with the database and collection information. Consequently, we cannot filter to the collection level yet; we have an improvement request open for that.

MongoDB index usage dashboard report from percona monitoring and management

An alternative view of this information in Grafana PMM is available from the Time Series to Aggregation table panel, shown below. One advantage of having these metrics in PMM is that the data survives an instance restart. Of course, to be useful for identifying unused indexes, the retention period has to match or exceed your complete application “cycle” period.
MongoDB index usage stats from PMM

In a MongoDB replica set, you can delegate data-bearing member nodes to different roles, perhaps with tags and priorities, and you can also have nodes with different sets of indexes. Being able to identify the set of indexes needed at the node level allows you to optimize replication, queries, and resource usage.

More Resources

We have an introductory series of posts on MongoDB indexes available on this blog. Read Part 1 here.

You can download Percona Server for MongoDB – all Percona software is open source and free.

Compression options in MySQL (part 1)

Over the last year, I have been pursuing a part-time hobby project exploring ways to squeeze as much data as possible into MySQL. As you will see, there are quite a few different ways. Of course, the compression ratio matters a lot, but other aspects, like the performance of inserts, selects, and updates, along with the total number of bytes written, are also important. When you start combining all the possibilities, you end up with a large set of compression options and, of course, I am surely missing a ton. This project has been a great learning opportunity and I hope you’ll enjoy reading about my results. Given the volume of results, I’ll have to write a series of posts; this post is the first of the series. I also have to mention that some of my work overlaps work done by one of my colleagues, Yura Sorokin, in a presentation he did in Dublin.

The compression options

  • InnoDB page size in {16k, 32k, 64k} (as references)
  • InnoDB barracuda page compression, block_size in {8k, 4k}
  • InnoDB Transparent page compression with punch holes, page size in {16k, 32k, 64k} * compression algo in {LZ4, Zlib}
  • MyISAM, MyISAM Packed, MyISAM on ZFS with recordsize in {16k, 32k}
  • InnoDB on ZFS, ZFS compression algo in {LZ4, Zlib}, ZFS record size in {16k, 32k, 64k, 128k}, InnoDB page size in {16k, 32k, 64k}
  • TokuDB, TokuDB compression algo in {ZLIB, LZMA, QUICKLZ, SNAPPY}
  • TokuDB on ZFS, TokuDB compression algo set to None, ZFS compression Zlib, ZFS record size in {16k, 32k, 64k, 128k}
  • MyRocks, compression algo in {None, ZSTD}
  • MyRocks on ZFS, MyRocks compression algo set to None, ZFS compression Zlib, ZFS record size in {16k, 32k, 64k, 128k}

In many interesting cases, the ZFS experiments have been conducted with and without a SLOG.

The test datasets

In order to cover these solutions efficiently, I used a lot of automation and I restricted myself to two datasets. My first dataset consists of a set of nearly 1 billion rows from the Wikipedia access stats. The table schema is:

CREATE TABLE `wiki_pagecounts` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `day` date NOT NULL,
  `hour` tinyint(4) NOT NULL,
  `project` varchar(30) NOT NULL,
  `title` text NOT NULL,
  `request` int(10) unsigned NOT NULL,
  `size` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_time` (`day`,`hour`)
);

and here’s a typical row:

mysql> select * from wiki_pagecounts where id = 16\G
*************************** 1. row ***************************
     id: 16
    day: 2016-01-01
   hour: 0
project: aa
  title: 'File:Wiktionary-logo-en.png'
request: 1
   size: 10752
1 row in set (0.00 sec)

The average length of the title column is above 70 characters and it often contains HTML escape sequences for UTF-8 characters. The actual column content is not really important, but it is not random data. Loading this dataset in plain InnoDB results in a data file of about 113GB.

The second dataset is from the defunct Percona cloud tool project and is named “o1543”. Instead of a large number of rows, it has only 77M rows, but this time the table has 134 columns, mostly float or bigint. The table definition is:

CREATE TABLE `query_class_metrics` (
   `day` date NOT NULL,
   `query_class_id` int(10) unsigned NOT NULL,
   `instance_id` int(10) unsigned NOT NULL,
   `start_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
   `end_ts` timestamp NOT NULL DEFAULT '1970-01-01 00:00:01',
   `query_count` bigint(20) unsigned NOT NULL,
   `lrq_count` bigint(20) unsigned NOT NULL DEFAULT '0',
   ...
   `Sort_scan_sum` bigint(20) unsigned DEFAULT NULL,
   `No_index_used_sum` bigint(20) unsigned DEFAULT NULL,
   `No_good_index_used_sum` bigint(20) unsigned DEFAULT NULL,
   PRIMARY KEY (`start_ts`,`instance_id`,`query_class_id`),
   KEY `start_ts` (`instance_id`)
 );

When loaded in plain InnoDB, the resulting data file size is slightly above 87GB.

The test queries

The first test query is, of course, the inserts used to load the datasets. These are multi-insert statements in primary key order, generated by the mysqldump utility.
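
For reference, such a dump can be produced with standard mysqldump options; this is only a sketch, not the exact command I used:

mysqldump --extended-insert --order-by-primary wikipedia_pagecounts wiki_pagecounts > wiki_pagecounts.sql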

The second test query is a large range select. For the Wikipedia dataset I used:

select `day`, `hour`, max(request), sum(request), sum(size)
from wikipedia_pagecounts.wiki_pagecounts
where day = '2016-01-05'
group by `day`,`hour`;

while for the o1543 dataset, I used:

select query_class_id, sum(Query_time_sum) as Totat_time,sum(query_count), sum(Rows_examined_sum),
  sum(InnoDB_pages_distinct_sum)
from o1543.query_class_metrics
where start_ts between '2014-10-01 00:00:00' and '2015-06-30 00:00:00'
group by query_class_id
order by Totat_time desc limit 10;

In both cases, a significant amount of data needs to be processed. Finally, I tested random access with the updates. I generated 20k distinct single row updates in random primary key order for the Wikipedia dataset like:

update wikipedia_pagecounts.wiki_pagecounts set request = request + 1 where id = 377748793;

For the o1543 dataset, I used the following update statement:

update o1543.query_class_metrics set Errors_sum = Errors_sum + 1
where query_class_id = 472 and start_ts between '2014-10-01 00:00:00' and '2014-12-31 23:59:59';

which ends up updating very close to 20k rows, well spaced in terms of primary key values.

The metrics recorded

In order to compare the compression options, I recorded key metrics.

  • Time: that’s simply the execution time of the queries. It is very difficult to minimally tune a server for all the different engines. Some also rely on the OS file cache. Take the execution time as a rough performance indicator which could be modified substantially through targeted tuning.
  • Amount of data read and written by the MySQL process, as reported by /proc/$(pidof mysqld)/io.
  • Amount of data written to the actual physical device where the data is stored, from /sys/block/$datadevice/stat. That matters a lot for flash devices that have a finite endurance. The amount of data written to the storage is the main motivation of Facebook with MyRocks.
  • The actual size of the final datasets

Even with these simple metrics, you’ll see there is quite a lot to discuss and learn.

The procedure

For the benchmarks, I used an LXC virtual machine. My main goal was to simulate a dataset much larger than the available memory. I tried to limit the MySQL buffers to 128MB but, in some cases, like with MyRocks, that was pretty unfair and it impacted the performance results. Basically, the procedure was the following (a condensed shell sketch of the capture commands follows the list):

  1. Start mysqld (no buffer pool load)
  2. Sync + drop cache
  3. Capture: du -hs the datadir
  4. Capture: cat /proc/$(pidof mysqld)/io
  5. Capture: cat /sys/block/vdb/stat
  6. Capture: show global variables and show global status;
  7. Run the test query
  8. Wait for 30 minutes for any flushing or maintenance to complete
  9. Capture: du -hs the datadir
  10. Capture: cat /proc/$(pidof mysqld)/io
  11. Capture: cat /sys/block/vdb/stat
  12. Capture: show global variables and show global status;
  13. Stop mysqld
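
Condensed, the capture steps look roughly like this in a shell; the data directory path and the vdb device name match my test setup and will differ on yours:

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
du -hs /var/lib/mysql
cat /proc/$(pidof mysqld)/io
cat /sys/block/vdb/stat
mysql -e 'SHOW GLOBAL VARIABLES; SHOW GLOBAL STATUS;' > status_snapshot.txt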

As much as possible, I automated the whole procedure. On many occasions, I ran multiple runs of the same benchmark to validate unexpected behaviors.

First results: Traditional storage options

Inserting the data

In this first post of the series, I’ll report on the traditional “built-in” options, which are InnoDB with Barracuda compression (IBC) and MyISAM with packing. I debated a bit about including MyISAM in this post since tables become read-only once packed, but I have personally implemented solutions using packed MyISAM a few times.

On the first figure (above), we have the final table sizes in GB for the different options we are considering in this first round of results. Over all the post series, we’ll use the plain InnoDB results as a reference. The Wikipedia dataset has a final size in InnoDB of 112.6GB while the o1543 dataset is slightly smaller, at 87.4GB.

The MyISAM sizes are smaller by 10 to 15% which is expected since InnoDB is page based, with page and row headers, and it doesn’t fully pack its pages. The InnoDB dataset size could have been up to twice as large if the data was inserted in random order of the primary key.

Adding Barracuda page compression with an 8KB block size (InnoDBCmp8k), both datasets shrink by close to 50%. Pushing to a block size of 4KB (InnoDBCmp4k), the Wikipedia dataset clearly doesn’t compress that much but, the o1543 dataset is very compressible, by more than 75%. Looking at the MyISAM Packed results, the o1543 dataset compressed to only 8.8GB, a mere 11% of the original MyISAM size. That means the o1543 could have done well with IBC using block size of 2KB. Such a compression ratio is really exceptional. I’ve rarely encountered a ratio that favorable in a production database.

With IBC, we know the block size is too small for the compressibility of the dataset when the final size no longer follows the ratio of compressed block size over original block size. For example, the Wikipedia dataset started at 112.6GB and a fourth (4KB/16KB) of this number is much smaller than the 46.9GB obtained with InnoDB compression and a 4k block size. You’ll also see a large number of compression failures in the innodb_cmp table of the information schema.
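
The failures are visible as the gap between compress_ops and compress_ops_ok; a quick way to check is:

mysql> select page_size, compress_ops, compress_ops_ok, compress_ops - compress_ops_ok as failures from information_schema.innodb_cmp;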

When the compression fails with IBC, InnoDB splits the page in two and recompresses each half. That process adds an overhead which can be observed in the insertion time figure. While the insertion time of the Wikipedia dataset for InnoDBCmp4k explodes to 2.6 times the uncompressed insertion time, the o1543 dataset takes only 30% more time. I suggest you do not take the times here too literally; the test environment I used was not fully isolated. View these times as trends.

The amount of data written during the inserts, shown on the above figure, has bugged me for a while. Let’s consider, for example, the amount of writes needed to insert all the rows of the Wikipedia dataset in plain InnoDB. The total is 503GB for a dataset of 113GB. From the MySQL status variables, I have 114GB written to the data files (innodb_pages_written * 16kb), 114GB written to the double write buffer (Inndb_dblwr_pages_written * 16kb) and 160GB written to the InnoDB log files (innodb_os_log_written). If you sum all these values, you have about 388GB, a value short by about… 114GB, too close to the size of the dataset to be an accident. What else is written?

After some research, I finally found it! When a datafile is extended, InnoDB first writes zeros to allocate the space on disk. All the tablespaces have to go through that initial allocation phase so here are the missing 114GB.

Back to the data written during the inserts figure, look at the number of writes required for the Wikipedia dataset when using InnoDBCmp4k. Any idea why it is higher? What if I tell you the double write buffer only writes 16KB pages (actually, it depends on innodb_page_size)? So, when the pages are compressed, they are padded with zeros when written to the double write buffer. We remember that, when compressing to 4KB, we had many compression misses so we ended up with many more pages to write. Actually, Domas Mituzas filed a bug back in 2013 about this. Also, by default (innodb_log_compressed_pages), the compressed and uncompressed versions of the pages are written to the InnoDB log files. Actually, only the writes to the tablespace are reduced by IBC.

MyISAM, since it is not a transactional engine, cheats here. The only overhead is the writes to the b-trees of the index. So, at the expense of durability, MyISAM writes much less data in this simple benchmark.

Range selects

So great, we have now inserted a lot of rows in our tables. How fast can we access these rows? The following figure presents the times to perform large range scans on the datasets. In InnoDB, the times are “fairly” stable. For the Wikipedia dataset, InnoDB compression improves the select performance: the time to decompress a page is apparently shorter than the time to read a full page. MyISAM without compression is the fastest solution but, once again, the benchmark offers the most favorable conditions to MyISAM. MyISAMPacked doesn’t fare as well for the large range select; likely too much data must be decompressed.

20k updates

Going to the updates, the time required to perform 20k updates by a single thread is shown on the above figure. For both datasets, InnoDB and InnoDBCmp8k show similar times. There is a divergence with the Wikipedia dataset stored on InnoDBCmp4k: the execution time is 57% larger, essentially caused by a large increase in the number of pages read. MyISAM is extremely efficient dealing with the updates of the o1543 dataset since the record size is fixed and the update process is single-threaded.


Finally, let’s examine the number of bytes written per update as shown on the figure above. I was naively expecting about two pages written per update statement, one for the double write buffer and one for the tablespace file. The Wikipedia dataset shows more like three pages written per update while the o1543 dataset fits rather well with what I was expecting. I had to look at the file_summary_by_instance table of the Performance Schema and the innodb_metrics table to understand. Actually, my updates to the Wikipedia dataset are single-row updates executed in autocommit mode, while the updates to the o1543 dataset are from a single statement updating 20k rows. When you do multiple small transactions, you end up writing much more to the undo log and to the system tablespace. The worst case is when the updates are in separate transactions and a long time is allowed for MySQL to flush the dirty pages.

Here are the writes associated with 30 updates, in autocommit mode, 20s apart:

mysql> select NAME, COUNT_RESET from innodb_metrics where name like '%writt%' and count_reset > 0 ;
+---------------------------------+-------------+
| NAME                            | COUNT_RESET |
+---------------------------------+-------------+
| buffer_pages_written            |         120 |
| buffer_data_written             |     2075136 |
| buffer_page_written_index_leaf  |          30 |
| buffer_page_written_undo_log    |          30 |
| buffer_page_written_system_page |          30 |
| buffer_page_written_trx_system  |          30 |
| os_log_bytes_written            |       77312 |
| innodb_dblwr_pages_written      |         120 |
+---------------------------------+-------------+
8 rows in set (0.01 sec)

The index leaf writes are where the rows are stored in the tablespace; 30 matches the number of rows updated, so each update dirtied one leaf page, as expected. The undo log is used to store the previous version of the row for rollback, in the ibdata1 file. I wrongly assumed these undo entries would not actually be written to disk, and would only live in the buffer pool and be purged before they needed to be flushed to disk. I don’t understand clearly enough what is written to the system page and trx system to attempt an explanation for those. The sum of pages to write is 120, four per update, but you need to multiply by two because of the double write buffer. So, in this worst-case scenario, a simple single-row update may cause up to eight pages to be written to disk.

Grouping the updates in a single transaction basically removes the pages written to the system_page and trx_system as these are per transaction.

Here is the result for the same 30 updates, sent at a 20s interval, but this time in a single transaction:

mysql> select NAME, COUNT_RESET from innodb_metrics where name like '%writt%' and count_reset > 0 ;
+---------------------------------+-------------+
| NAME                            | COUNT_RESET |
+---------------------------------+-------------+
| buffer_pages_written            |          63 |
| buffer_data_written             |     1124352 |
| buffer_page_written_index_leaf  |          30 |
| buffer_page_written_undo_log    |          31 |
| buffer_page_written_system_page |           1 |
| buffer_page_written_trx_system  |           1 |
| os_log_bytes_written            |       60928 |
| innodb_dblwr_pages_written      |          63 |
+---------------------------------+-------------+

The write load, in terms of the number of pages written, is cut by half, to four per update. The most favorable case will be a single transaction with no sleep in between.

For 30 updates in a single transaction with no sleep, the results are:

mysql> select NAME, COUNT_RESET from innodb_metrics where name like '%writt%' and count_reset > 0 ;
+---------------------------------+-------------+
| NAME                            | COUNT_RESET |
+---------------------------------+-------------+
| buffer_pages_written            |          33 |
| buffer_data_written             |      546304 |
| buffer_page_written_index_leaf  |          30 |
| buffer_page_written_undo_log    |           1 |
| buffer_page_written_system_page |           1 |
| buffer_page_written_trx_system  |           1 |
| os_log_bytes_written            |        4608 |
| innodb_dblwr_pages_written      |          33 |
+---------------------------------+-------------+
8 rows in set (0.00 sec)

Now, the undo log is flushed only once and we are down to approximately two page writes per update. This is what I was originally expecting. The other results fall well into place if you keep in mind that only the index_leaf writes are compressed. The InnoDBCmp4k results for the Wikipedia dataset are higher, essentially because it took much more time and thus more page flushing occurred.

What we learned?

Everything can be a pretext to explore and learn. Just to summarize, what have we learned in this post?

  • The InnoDB log files contain both the compressed and uncompressed versions of the pages by default (see innodb_log_compressed_pages)
  • The double write buffer only writes full pages, compressed pages are zero padded
  • With InnoDB, the total amount of data written to disk during the inserts is more than 5 times the final size. Compression worsens the ratio.
  • A single row update causes from two up to eight pages to be written to disk

Not bad in terms of collateral learning…

Next?

In this post, we reviewed the traditional data compression solutions available with MySQL. In future posts, we’ll start looking at the alternatives. In the next one, I will evaluate InnoDB Transparent page compression with punch hole, a feature available since MySQL 5.7.

Upcoming Webinar Thurs 11/29: Improve MongoDB Performance with Proper Queries


Please join Percona’s Sr. Technical Operations Architect, Tim Vaillancourt, as he presents Improve MongoDB Performance with Proper Queries on Thursday, November 29th, 2018, at 12:30 PM PST (UTC-8) / 3:30 PM EST (UTC-5).

Register Now

There are many different ways you can use queries in MongoDB to find the data you need. However, knowing which queries are slowing down your performance can be a challenge. In this webinar we’ll discuss the following:

  • Performing ad hoc queries on the database using the find or findOne functions and a query document.
  • How to query for ranges, set inclusion, inequalities, and more using $-conditionals.
  • How to use and sort queries that return a database cursor, which lazily returns batches of documents as you need them.
  • What pitfalls you can encounter when performing the many available meta operations on a cursor, including skipping a certain number of results, and limiting the number of results returned.

By the end of this webinar you will have a better understanding of which queries impact performance. Moreover, you’ll understand how to leverage open source tools to monitor queries.

Register for this webinar to learn how to improve MongoDB performance with the proper queries.

What Happens If You Set innodb_open_files Higher Than open_files_limit?


The settings of MySQL configuration variables have a fundamental impact on the performance of your database system. Sometimes it can be a little tricky to predict how changing one variable can affect others, especially when dealing with cases like the one I’ll describe in this post, where the outcome is not very intuitive. So here, we’ll look at what happens when you set innodb_open_files higher than the open_files_limit.

We can set the maximum number of open files in our MySQL configuration file using:

open_files_limit=10000

If this isn’t set, then the default – which is 5,000 in MySQL 5.7 – should be used.

See Sveta’s excellent blog post for an explanation of how to change the open file limit. As the example below shows, if systemd’s LimitNOFILE is set, MySQL will take that value instead, unless LimitNOFILE is set to infinity (in which case CentOS 7 will then use 65536, though much higher values are possible if specified manually):

[root@centos7-pxc57-3 ~]# grep open_files_limit /etc/my.cnf
open_files_limit=10000
[root@centos7-pxc57-3 ~]# grep LimitNOFILE /lib/systemd/system/mysqld.service.d/limit_nofile.conf
LimitNOFILE=infinity
[root@centos7-pxc57-3 ~]# mysql -e "SELECT @@open_files_limit"
+--------------------+
| @@open_files_limit |
+--------------------+
| 65536              |
+--------------------+
[root@centos7-pxc57-3 ~]# perl -pi -e 's/LimitNOFILE=infinity/LimitNOFILE=20000/' /lib/systemd/system/mysqld.service.d/limit_nofile.conf && systemctl daemon-reload && systemctl restart mysqld
[root@centos7-pxc57-3 ~]# mysql -e "SELECT @@open_files_limit"
+--------------------+
| @@open_files_limit |
+--------------------+
| 20000              |
+--------------------+
[root@centos7-pxc57-3 ~]# perl -pi -e 's/LimitNOFILE=20000/LimitNOFILE=5000/' /lib/systemd/system/mysqld.service.d/limit_nofile.conf && systemctl daemon-reload && systemctl restart mysqld
[root@centos7-pxc57-3 ~]# mysql -e "SELECT @@open_files_limit"
+--------------------+
| @@open_files_limit |
+--------------------+
| 5000               |
+--------------------+

As you can see above, MySQL cannot set the value of open_files_limit higher than the system is configured to allow, and open_files_limit will default back to the maximum if it’s set too high.

That seems pretty straightforward, but what isn’t quite as obvious is how that affects innodb_open_files. The innodb_open_files value configures how many .ibd files MySQL can keep open at any one time.

As this obviously requires files to be open, it should be no higher than the open_files_limit (and should be lower). If we try to set it higher as per this example, MySQL will print a warning in the log file:

[root@centos7-pxc57-3 ~]# grep innodb_open_files /var/log/mysqld.log
2018-09-21T08:31:06.002120Z 0 [Warning] InnoDB: innodb_open_files should not be greater than the open_files_limit.

What the warning doesn’t state is that the value is being lowered. Not to the maximum value allowed though:

[root@centos7-pxc57-3 ~]# mysql -e "SELECT @@innodb_open_files"
+---------------------+
| @@innodb_open_files |
+---------------------+
| 2000                |
+---------------------+

2000? Why 2000?

It’s because if we set innodb_open_files too high, it reverts back to the default value, which per the documentation is:

300 if innodb_file_per_table is not enabled, and the higher of 300 and table_open_cache otherwise. Before 5.6.6, the default value is 300.

And table_open_cache? Well that defaults to 400 for versions of MySQL up to 5.6.7, and 2000 for 5.6.8 onwards.

Note that table_open_cache is another setting completely. innodb_open_files controls the number of InnoDB files (.ibd) the server can keep open at once; whilst table_open_cache controls the number of table definition (.frm) files the server can have open at once.
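
If you want to see how these settings relate on a running server, you can check all three at once; a quick sketch, assuming shell access to the mysql client:

mysql -e "SELECT @@open_files_limit, @@innodb_open_files, @@table_open_cache"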

 

Photo by Logan Kirschner from Pexels


MySQL High Availability: Stale Reads and How to Fix Them

Continuing on the series of blog posts about MySQL High Availability, today we will talk about stale reads and how to overcome this issue.

The Problem

A stale read is a read operation that fetches an incorrect value from a source that has not synchronized an update operation to the value (source: Wiktionary).

A practical scenario is when your application applies INSERT or UPDATE data to your master/writer node, and has to read it immediately after. If this particular read is served from another server in the replication/cluster topology, the data is either not there yet (in case of an INSERT) or it still provides the old value (in case of an UPDATE).

If your application or part of your application is sensitive to stale reads, then this is something to consider when implementing HA/load balancing.

How NOT to fix stale reads

While working with customers, we have seen a few incorrect attempts to fix the issue:

SELECT SLEEP(X)

The most common incorrect approach that we see in Percona support is when customers add a sleep between the write and the read. This may work in some cases, but it’s not 100% reliable for all scenarios, and it can add latency when there is no need.

Let's review an example where, by the time you query your slave, the data has already been applied, yet you have configured your transaction to start with a SELECT SLEEP(1). In this case, you have just added 1000ms of latency for no reason.

Another example is when the slave is lagging behind by more than whatever you configured as the sleep parameter. In this case, you would have to write a loop that keeps sleeping and re-checking until the slave has applied the data: potentially this could take several seconds.
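
To make the anti-pattern concrete, here is a sketch of what we typically see (the orders table and its values are purely illustrative):

-- On the master/writer connection (hypothetical table):
INSERT INTO orders (id, status) VALUES (42, 'new');
-- On the reader connection, hoping the slave has caught up:
SELECT SLEEP(1);
SELECT status FROM orders WHERE id = 42;  -- may still return nothing if the slave lags by more than 1 second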

Reference: SELECT SLEEP.

Semisync replication

By default, MySQL replication is asynchronous, and this is exactly what causes the stale read. However, MySQL distributes a plugin that can make the replication semi-synchronous. We have seen customers enabling it hoping the stale reads problem will go away. In fact, that is not the case. The semi-synchronous plugin only ensures that at least one slave has received the transaction (the IO thread has streamed the binlog event to its relay log), but the action of applying the event is still done asynchronously. In other words, stale reads are still a problem with semi-sync replication.

Reference: Semisync replication.

How to PROPERLY fix stale reads

There are several ways to fix/overcome this situation, and each one has its pros and cons:

1) MASTER_POS_WAIT

Consists of executing a SHOW MASTER STATUS right after your write, getting the binlog file and position, connecting to a slave, and executing the SELECT MASTER_POS_WAIT function, passing the binlog file and position as parameters. The function blocks until the slave has applied up to that position. You can optionally pass a timeout to exit the function in case the wait exceeds it.
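
A minimal sketch of the flow (the binlog file, position, and the orders table are illustrative):

-- On the master, immediately after the write:
SHOW MASTER STATUS;  -- e.g. File: mysql-bin.000003, Position: 1447
-- On the slave, before the read; wait at most 2 seconds:
SELECT MASTER_POS_WAIT('mysql-bin.000003', 1447, 2);
-- Returns >= 0 once the slave has applied that position,
-- -1 if the timeout was exceeded, or NULL if the SQL thread is not running.
SELECT status FROM orders WHERE id = 42;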

Pros:

  • Works on all MySQL versions
  • No prerequisites

Cons:

  • Requires an application code rewrite.
  • It’s a blocking operation, and can add significant latency to queries in cases where a slave/node is too far behind.

Reference: MASTER_POS_WAIT.

2) WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS

Requires GTID: this is similar to the previous approach, but in this case, we need to track the executed GTID from the master (also available on SHOW MASTER STATUS).
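
A minimal sketch (the GTID set shown is an illustrative placeholder):

-- On the master, immediately after the write:
SHOW MASTER STATUS;  -- note the Executed_Gtid_Set column
-- On the slave, block until that GTID set has been applied, waiting at most 2 seconds:
SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS('aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1-42', 2);
SELECT status FROM orders WHERE id = 42;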

Pros:

  • Only requires the executed GTID set – no need to track a binlog file and position.

Cons:

  • Requires an application code rewrite.
  • It's a blocking operation, and can add significant latency to queries in cases where a slave/node is too far behind.
  • As it requires GTID, it only works on versions from 5.6 onwards.

Reference: WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS

3) Querying slave_relay_log_info

Consists of enabling relay_log_info_repository=TABLE and sync_relay_log_info=1 on the slave, and using a similar approach to option 1. After the write, execute SHOW MASTER STATUS, connect to the slave, and query mysql.slave_relay_log_info, passing the binlog name and position to verify whether the slave has already applied a position at or after the one you got from SHOW MASTER STATUS.
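
A minimal sketch of the check (the comparison logic lives in your application code):

-- On the master, immediately after the write:
SHOW MASTER STATUS;  -- note File and Position
-- On a slave (requires relay_log_info_repository=TABLE):
SELECT Master_log_name, Master_log_pos FROM mysql.slave_relay_log_info;
-- If the slave's position is at or past the master's, read from this slave;
-- otherwise try the next slave, or fall back to the master.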

Pros:

  • This is not a blocking operation.
  • In cases where the slave is missing the position you require, you can try to connect to another slave and repeat the process. There is even the option of falling back to the master if none of the slaves have the required position.

Cons:

  • Requires an application code rewrite.
  • In cases of checking multiple slaves, this can add significant latency.

Reference: slave_relay_log_info.

4) wsrep-sync-wait

Requires Galera/Percona XtraDB Cluster: consists of setting a global/session variable to enforce consistency. This will block execution of subsequent queries until the node has applied all write-sets from its applier queue. It can be configured to trigger on multiple commands, such as SELECT, INSERT, and so on.
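
A minimal sketch on Percona XtraDB Cluster / Galera (the orders table is illustrative; the value 1 enables the causality check for READ statements):

-- Enforce consistency for reads in this session only:
SET SESSION wsrep_sync_wait = 1;
SELECT status FROM orders WHERE id = 42;  -- blocks until the node has applied its pending write-sets
SET SESSION wsrep_sync_wait = 0;          -- back to the default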

Pros:

  • Easy to implement. Built-in as a SESSION variable.

Cons:

  • Requires an application code rewrite in the event that you want to implement the solution on a per-session basis.
  • It’s a blocking operation, and can add significant latency to queries if a slave/node is too far behind.

Reference: wsrep-sync-wait

5) ProxySQL 2.0 GTID consistent reads

Requires MySQL 5.7 and GTID: MySQL 5.7 returns the GTID generated by a commit as part of the OK packet. ProxySQL, with the help of binlog readers installed on the MySQL servers, can keep track of which GTIDs each slave has already applied. With this information, plus the GTID received from the OK packet at the moment of the write, ProxySQL will decide whether to route a subsequent read to one of the slaves/read nodes or have the master/write node serve the read.

Pros:

  • Transparent to the application – no code changes are required.
  • Adds minimal latency.

Cons:

  • This is still a new feature of ProxySQL 2.0, which is not yet GA.

Reference: GTID consistent reads.

Conclusions

Undesirable issues can arise from adding HA and distributing the load across multiple servers. Stale reads can impact applications that are sensitive to them. We have demonstrated various approaches you can use to overcome them.


Photo by Tim Evans on Unsplash

PostgreSQL Streaming Physical Replication With Slots

PostgreSQL streaming physical replication with slots simplifies setup and maintenance procedures. Usually, you should estimate disk usage for the Write Ahead Log (WAL), set an appropriate limit on the number of segments, and set up a WAL archiving procedure. In this article, you will see how to use replication with slots and understand what problems it can solve.

Introduction

PostgreSQL physical replication is based on WAL. The Write Ahead Log contains all database changes, saved in 16MB segment files. Normally, postgres only keeps the segments needed since the last checkpoint, so with default settings just 1GB of WAL segment files is available.

Replication requires all WAL files created after the backup and up until the current time. Previously, it was necessary to keep a huge archive directory (usually mounted over NFS on all slave servers). The slots feature, introduced in 9.4, allows Postgres to track the latest segment downloaded by a slave server. Now, PostgreSQL can keep all segments on disk, even without archiving, if a slave is seriously behind its master due to downtime or networking issues. The drawback: disk space could be consumed without limit in the case of a configuration error. Before continuing, if you need a better understanding of physical replication and streaming replication, I recommend you read “Streaming Replication with PostgreSQL“.

Create a sandbox with two PostgreSQL servers

To set up replication, you need at least two PostgreSQL servers. I'm using pgcli (pgc) to set up both servers on the same host. It's easy to install on Linux, Windows, and OS X, and provides the ability to download and run any version of PostgreSQL on your staging server or even on your laptop.

python -c "$(curl -fsSL https://s3.amazonaws.com/pgcentral/install.py)"
mv bigsql master
cp -r master slave
$ cd master
master$ ./pgc install pg10
master$ ./pgc start pg10
$ cd ../slave
slave$ ./pgc install pg10
slave$ ./pgc start pg10

First of all you should allow the replication user to connect:

master$ echo "host replication replicator 127.0.0.1/32 md5" >> ./data/pg10/pg_hba.conf

If you are running master and slave on different servers, please replace 127.0.0.1 with the slave’s address.

Next pgc creates a shell environment file with PATH and all the other variables required for PostgreSQL:

master$ source ./pg10/pg10.env

Allow connections from the remote host, and create a replication user and slot on master:

master$ psql
postgres=# CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'replicator';
CREATE ROLE
postgres=# ALTER SYSTEM SET listen_addresses TO '*';
ALTER SYSTEM
postgres=# SELECT pg_create_physical_replication_slot('slot1');
pg_create_physical_replication_slot
-------------------------------------
(slot1,)

To apply system variables changes and hba.conf, restart the Postgres server:

master$ ./pgc stop ; ./pgc start
pg10 stopping
pg10 starting on port 5432

Test table

Create a table with lots of padding on the master:

master$ psql
psql (10.6)
Type "help" for help.
postgres=# CREATE TABLE t(id INT, pad CHAR(200));
postgres=# CREATE INDEX t_id ON t (id);
postgres=# INSERT INTO t SELECT generate_series(1,1000000) AS id, md5((random()*1000000)::text) AS pad;

Filling WAL with random data

To see the benefits of slots, we should fill the WAL with some data by running transactions. Repeat the update statement below to generate a huge amount of WAL data:

UPDATE t SET pad = md5((random()*1000000)::text);

Checking the current WAL size

You can check total size for all WAL segments from the shell or from psql:

master$ du -sh data/pg10/pg_wal
17M data/pg10/pg_wal
master$ source ./pg10/pg10.env
master$ psql
postgres=# \! du -sh data/pg10/pg_wal
17M data/pg10/pg_wal

Check maximum WAL size without slots activated

Before replication configuration, we can fill the WAL with random data and find that after 1.1G, the data/pg10/pg_wal directory size does not increase regardless of the number of update queries.

postgres=# UPDATE t SET pad = md5((random()*1000000)::text); -- repeat 4 times
postgres=# \! du -sh data/pg10/pg_wal
1.1G data/pg10/pg_wal
postgres=# UPDATE t SET pad = md5((random()*1000000)::text);
postgres=# \! du -sh data/pg10/pg_wal
1.1G data/pg10/pg_wal

Backup master from the slave server

Next, let’s make a backup for our slot1:

slave$ source ./pg10/pg10.env
slave$ ./pgc stop pg10
slave$ rm -rf data/pg10/*
# If you are running master and slave on different servers, replace 127.0.0.1 with master's IP address.
slave$ PGPASSWORD=replicator pg_basebackup -S slot1 -h 127.0.0.1 -U replicator -p 5432 -D $PGDATA -Fp -P -Xs -Rv

Unfortunately, pg_basebackup hangs with: initiating base backup, waiting for checkpoint to complete.
We can wait for the next checkpoint, or force a checkpoint on the master. A checkpoint happens every checkpoint_timeout seconds, which is set to five minutes by default.

Forcing checkpoint on master:

master$ psql
postgres=# CHECKPOINT;

The backup continues on the slave side:

pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/92000148 on timeline 1
pg_basebackup: starting background WAL receiver
1073986/1073986 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/927FDDE8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed

The backup copies settings from the master, including its TCP port value. I’m running both master and slave on the same host, so I should change the port in the slave .conf file:

slave$ vim data/pg10/postgresql.conf
# old value port = 5432
port = 5433

Now we can return to the master and run some queries:

slave$ cd ../master
master$ source pg10/pg10.env
master$ psql
postgres=# UPDATE t SET pad = md5((random()*1000000)::text);
UPDATE t SET pad = md5((random()*1000000)::text);

After running these queries, the WAL size is now 1.4GB – bigger than the 1.1GB we saw before! Repeat this update query three times and the WAL grows to 2.8GB:

master$ du -sh data/pg10/pg_wal
2.8G data/pg10/pg_wal

Certainly, the WAL could grow indefinitely until the whole disk space is consumed.
How do we find out the reason for this?

postgres=# SELECT redo_lsn, slot_name,restart_lsn,
round((redo_lsn-restart_lsn) / 1024 / 1024 / 1024, 2) AS GB_behind
FROM pg_control_checkpoint(), pg_replication_slots;
redo_lsn    | slot_name | restart_lsn | gb_behind
------------+-----------+-------------+-----------
1/2A400630  | slot1     |  0/92000000 | 2.38

We have one slot that is 2.38GB behind the master.

Let’s repeat the update and check again. The gap has increased:

postgres=# SELECT redo_lsn, slot_name,restart_lsn,
round((redo_lsn-restart_lsn) / 1024 / 1024 / 1024, 2) AS GB_behind
FROM pg_control_checkpoint(), pg_replication_slots;
redo_lsn    | slot_name | restart_lsn | gb_behind
------------+-----------+-------------+-----------
1/8D400238  |     slot1 | 0/92000000  | 3.93

Wait, though: we have already used slot1 for backup! Let’s start the slave:

master$ cd ../slave
slave$ ./pgc start pg10

Replication started without any additional change to recovery.conf:

slave$ cat data/pg10/recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=replicator password=replicator passfile=''/home/pguser/.pgpass'' host=127.0.0.1 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
primary_slot_name = 'slot1'

The pg_basebackup -R option instructs the backup to write a recovery.conf file with all the required options, including primary_slot_name.

WAL size, all slots connected

The gap reduced several seconds after the slave started:

postgres=# SELECT redo_lsn, slot_name,restart_lsn,
round((redo_lsn-restart_lsn) / 1024 / 1024 / 1024, 2) AS GB_behind
FROM pg_control_checkpoint(), pg_replication_slots;
redo_lsn    | slot_name | restart_lsn | gb_behind
------------+-----------+-------------+-----------
 1/8D400238 |     slot1 |  0/9A000000 | 3.80

And a few minutes later:

postgres=# SELECT redo_lsn, slot_name,restart_lsn,
round((redo_lsn-restart_lsn) / 1024 / 1024 / 1024, 2) AS GB_behind
FROM pg_control_checkpoint(), pg_replication_slots;
redo_lsn    | slot_name | restart_lsn | gb_behind
------------+-----------+-------------+-----------
 1/9E5DECE0 |     slot1 |  1/9EB17250 | -0.01
postgres=# \!du -sh data/pg10/pg_wal
1.3G data/pg10/pg_wal

Slave server maintenance

Let’s simulate slave server maintenance with ./pgc stop pg10 executed on the slave. We’ll push some data onto the master again (execute the UPDATE query 4 times).

Now, “slot1” is again 2.36GB behind.

Removing unused slots

By now, you might realize that the problematic slot is not in use. In such cases, you can drop it so that the retained segments can be recycled:

master$ psql
postgres=# SELECT pg_drop_replication_slot('slot1');

Finally the disk space is released:

master$ du -sh data/pg10/pg_wal
1.1G data/pg10/pg_wal

Important system variables

  • archive_mode – not required for streaming replication with slots.
  • wal_level – replica by default.
  • max_wal_senders – set to 10 by default; a minimum of three for one slave, plus two for each additional slave.
  • wal_keep_segments – 32 by default; not important, because PostgreSQL will keep all segments required by a slot.
  • archive_command – not important for streaming replication with slots.
  • listen_addresses – the only option that it's necessary to change, to allow remote slaves to connect.
  • hot_standby – on by default; important to enable reads on the slave.
  • max_replication_slots – 10 by default (see https://www.postgresql.org/docs/10/static/runtime-config-replication.html); a minimal configuration sketch follows this list.
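
Putting the list above together, a minimal sketch of the master-side settings for streaming replication with slots (the values shown are the PostgreSQL 10 defaults, except listen_addresses):

# postgresql.conf on the master
listen_addresses = '*'        # allow remote slaves to connect
wal_level = replica           # default in PostgreSQL 10
max_wal_senders = 10          # default in PostgreSQL 10
max_replication_slots = 10    # default in PostgreSQL 10
# hot_standby = on            # relevant on the slave; also the default in PostgreSQL 10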

Summary

  • Physical replication setup is really easy with slots. By default in pg10, all settings are already prepared for replication setup.
  • Be careful with orphaned slots. PostgreSQL will not remove WAL segments for inactive slots with initialized restart_lsn.
  • Check pg_replication_slots restart_lsn value and compare it with current redo_lsn.
  • Avoid long downtime for slave servers with slots configured.
  • Please use meaningful names for slots, as that will simplify debugging.

MongoDB 4.0: Using ACID Multi-Document Transactions

MongoDB 4.0 is around, and there are a lot of new features and improvements. In this article we're going to focus on the major feature, which is, undoubtedly, the support for multi-document ACID transactions. This novelty for a NoSQL database could be seen as a way to get closer to the relational world. Well, it's not that – or maybe not just that. It's a way to add to the document-based model a new, important, and often requested feature that addresses a wider range of use cases. The document model and its flexibility should remain the best way to start building an application on MongoDB. At this stage, transactions should be used in specific cases, when you absolutely need them: for example, because your application requires cross-document consistency and atomicity. Transactions incur a greater performance cost than single-document writes, so the denormalized data model will continue to be optimal in many cases, and this helps to minimize the need for transactions.

Single writes are atomic by design: as long as you are able to embed documents in your collections you absolutely don’t need to use a transaction. Even so, transaction support is a very good and interesting feature that you can rely on in MongoDB from now on.

MongoDB 4.0 provides fully ACID transactions support but remember:

  • multi-document transactions are available for replica set deployments only
    • you can use transactions even on a standalone server but you need to configure it as a replica set (with just one node)
  • multi-document transactions are not available for sharded clusters
    • hopefully, transactions will be available from version 4.2
  • multi-document transactions are available for the WiredTiger storage engine only

ACID transactions in MongoDB 4.0

ACID properties are well known in the world of relational databases, but let’s recap what the acronym means.

  • Atomicity: a group of commands inside the transaction must follow the “all or nothing” paradigm. If only one of the commands fails for any reason, the complete transaction fails as well.
  • Consistency: if a transaction successfully executes, it will take the database from one state that is consistent to another state that is also consistent.
  • Isolation: multiple transactions can run at the same time in the system. Isolation guarantees that each transaction is not able to view partial results of the others. Executing multiple transactions in parallel must have the same results as running them sequentially
  • Durability: it guarantees that a transaction that has committed will remain persistent, even in the case of a system failure

Limitations of transactions

The support for transactions introduced some limitations:

  • a collection MUST exist in order to use transactions
  • a collection cannot be created or dropped inside a transaction
  • an index cannot be created or dropped inside a transaction
  • non-CRUD operations are not permitted inside a transaction (for example, administrative commands like createUser are not permitted)
  • a transaction cannot read or write in config, admin, and local databases
  • a transaction cannot write to system.* collections
  • the size of a transaction is limited to 16MB
    • a single oplog entry is generated during the commit: the writes inside the transaction don’t have single oplog entries as in regular queries
    • the limitation is a consequence of the 16MB maximum size of any BSON document in the oplog
    • in case of larger transactions, you should consider splitting these into smaller transactions
  • by default, a transaction that executes for longer than 60 seconds will automatically expire
    • you can change this using the configuration parameter transactionLifetimeLimitSeconds (see the sketch after this list)
    • transactions rely on WiredTiger snapshot capability, and having a long running transaction can result in high pressure on WiredTiger’s cache to maintain snapshots, and lead to the retention of a lot of unflushed operations in memory
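
A minimal sketch of raising that limit from the mongo shell (120 is just an example value, and this assumes you have the privileges to run setParameter):

// raise the transaction lifetime limit to 120 seconds (example value)
db.adminCommand( { setParameter: 1, transactionLifetimeLimitSeconds: 120 } )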

Sessions

Sessions were introduced in version 3.6, originally in order to support retryable writes (for example), but they are very important for transactions, too. In fact, any transaction is associated with an open session. Prior to starting a transaction, a session must be created. A transaction cannot be run outside a session.

At any given time you may have multiple running sessions in the system, but each session may run only a single transaction at a time. You can run transactions in parallel according to how many open sessions you have.

Three new commands were introduced for creating, committing, and aborting transactions:

  • session.startTransaction()
    • starts a new transaction in the current session
  • session.commitTransaction()
    • saves consistently and durably the changes made by the operations in the transaction
  • session.abortTransaction()
    • the transaction ends without saving any of the changes made by the operations in the transaction

Note: in the following examples, we use two different connections to create two sessions. We do this for the sake of simplicity, but remember that you can create multiple sessions even inside a single connection, assigning each session to a different variable.

Our first transaction

To test our first transaction, if you don't have a replica set already configured, let's start a standalone server like this:

#> mongod --dbpath /data/db --logpath /data/mongo.log --fork --replSet foo
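
If this is a brand new single-node replica set, you may also need to initiate it once from the mongo shell before it can accept writes and transactions; a minimal sketch:

// run once from the mongo shell connected to the newly started node
rs.initiate()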

Create a new collection, and insert some data.

foo:PRIMARY> use percona
switched to db percona
foo:PRIMARY> db.createCollection('people')
{
   "ok" : 1,
   "operationTime" : Timestamp(1538483120, 1),
   "$clusterTime" : {
      "clusterTime" : Timestamp(1538483120, 1),
      "signature" : {
         "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
         "keyId" : NumberLong(0)
       }
    }
}
foo:PRIMARY> db.people.insert([{_id:1, name:"Corrado"},{_id:2, name:"Peter"},{_id:3,name:"Heidi"}])

Create a session

foo:PRIMARY> session = db.getMongo().startSession()
session { "id" : UUID("dcfa7de5-527d-4b1c-a890-53c9a355920d") }

Start a transaction and insert some new documents

foo:PRIMARY> session.startTransaction()
foo:PRIMARY> session.getDatabase("percona").people.insert([{_id: 4 , name : "George"},{_id: 5, name: "Tom"}])
WriteResult({ "nInserted" : 2 })

Now read the collection from inside and outside the session and see what happens

foo:PRIMARY> session.getDatabase("percona").people.find()
{ "_id" : 1, "name" : "Corrado" }
{ "_id" : 2, "name" : "Peter" }
{ "_id" : 3, "name" : "Heidi" }
{ "_id" : 4, "name" : "George" }
{ "_id" : 5, "name" : "Tom" }
foo:PRIMARY> db.people.find()
{ "_id" : 1, "name" : "Corrado" }
{ "_id" : 2, "name" : "Peter" }
{ "_id" : 3, "name" : "Heidi" }

As you might notice, since the transaction is not yet committed, you can see the modifications only from inside the session. You cannot see any of the modifications outside of the session, even in the same connection. If you try to open a new connection to the database, then you will not be able to see any of the modifications either.

Now, commit the transaction and see that you can now read the same data both inside and outside the session, as well as from any other connection.

foo:PRIMARY> session.commitTransaction()
foo:PRIMARY> session.getDatabase("percona").people.find()
{ "_id" : 1, "name" : "Corrado" }
{ "_id" : 2, "name" : "Peter" }
{ "_id" : 3, "name" : "Heidi" }
{ "_id" : 4, "name" : "George" }
{ "_id" : 5, "name" : "Tom" }
foo:PRIMARY> db.people.find()
{ "_id" : 1, "name" : "Corrado" }
{ "_id" : 2, "name" : "Peter" }
{ "_id" : 3, "name" : "Heidi" }
{ "_id" : 4, "name" : "George" }
{ "_id" : 5, "name" : "Tom" }

When the transaction is committed, all the data is written consistently and durably to the database, just like any typical write. So, writing to the journal file and to the oplog takes place in the same way as it does for any single write that's not inside a transaction. As long as the transaction is open, any modification is stored in memory.

Isolation test

Let's now test the isolation between two concurrent transactions.

Open the first connection, create a session and start a transaction:

//Connection #1
foo:PRIMARY> var session1 = db.getMongo().startSession()
foo:PRIMARY> session1.startTransaction()

do the same on the second connection:

//Connection #2
foo:PRIMARY> var session2 = db.getMongo().startSession()
foo:PRIMARY> session2.startTransaction()

On connection #1, update Heidi's document, adding a gender field to it:

//Connection #1
foo:PRIMARY> session1.getDatabase("percona").people.update({_id:3},{$set:{ gender: "F" }})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
foo:PRIMARY> session1.getDatabase("percona").people.find()
{ "_id" : 1, "name" : "Corrado" }
{ "_id" : 2, "name" : "Peter" }
{ "_id" : 3, "name" : "Heidi", "gender" : "F" }
{ "_id" : 4, "name" : "George" }
{ "_id" : 5, "name" : "Tom" }

Update the same collection on connection #2 to add the same gender field to all the males:

//Connection #2
foo:PRIMARY> session2.getDatabase("percona").people.update({_id:{$in:[1,2,4,5]}},{$set:{ gender: "M" }},{multi:"true"})
WriteResult({ "nMatched" : 4, "nUpserted" : 0, "nModified" : 4 })
foo:PRIMARY> session2.getDatabase("percona").people.find()
{ "_id" : 1, "name" : "Corrado", "gender" : "M" }
{ "_id" : 2, "name" : "Peter", "gender" : "M" }
{ "_id" : 3, "name" : "Heidi" }
{ "_id" : 4, "name" : "George", "gender" : "M" }
{ "_id" : 5, "name" : "Tom", "gender" : "M" }

The two transactions are isolated; each one can see only the ongoing modifications that it has made itself.

Commit the transaction in connection #1:

//Connection #1
foo:PRIMARY> session1.commitTransaction()
foo:PRIMARY> session1.getDatabase("percona").people.find()
{ "_id" : 1, "name" : "Corrado" }
{ "_id" : 2, "name" : "Peter" }
{ "_id" : 3, "name" : "Heidi", "gender" : "F" }
{ "_id" : 4, "name" : "George" }
{ "_id" : 5, "name" : "Tom" }

In connection #2, read the collection:

//Connection #2
foo:PRIMARY> session2.getDatabase("percona").people.find()
{ "_id" : 1, "name" : "Corrado", "gender" : "M" }
{ "_id" : 2, "name" : "Peter", "gender" : "M"  }
{ "_id" : 3, "name" : "Heidi" }
{ "_id" : 4, "name" : "George", "gender" : "M"  }
{ "_id" : 5, "name" : "Tom", "gender" : "M"  }

As you can see the second transaction still sees its own modifications, and cannot see the already committed updates of the other transaction. This kind of isolation works the same as the “REPEATABLE READ” level of MySQL and other relational databases.

Now commit the transaction in connection #2 and see the new values of the collection:

//Connection #2
foo:PRIMARY> session2.commitTransaction()
foo:PRIMARY> session2.getDatabase("percona").people.find()
{ "_id" : 1, "name" : "Corrado", "gender" : "M" }
{ "_id" : 2, "name" : "Peter", "gender" : "M" }
{ "_id" : 3, "name" : "Heidi", "gender" : "F" }
{ "_id" : 4, "name" : "George", "gender" : "M" }
{ "_id" : 5, "name" : "Tom", "gender" : "M" }

Conflicts

When two (or more) concurrent transactions modify the same documents, we may have a conflict. MongoDB can detect a conflict immediately, even while the transactions are not yet committed. The first transaction to acquire the lock on a document will continue; the second one will receive a conflict error message and fail. The failed transaction can then be retried later.

Let’s see an example.

Create a new transaction in connection #1 to update Heidi’s document. We want to change the name to Luise.

//Connection #1
foo:PRIMARY> session.startTransaction()
foo:PRIMARY> session.getDatabase("percona").people.update({name:"Heidi"},{$set:{name:"Luise"}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

Let’s try to modify the same document in a concurrent transaction in connection #2. Modify the name from Heidi to Marie in this case.

//Connection #2
foo:PRIMARY> session.startTransaction()
foo:PRIMARY> session.getDatabase("percona").people.update({name:"Heidi"},{$set:{name:"Marie"}})
WriteCommandError({
    "errorLabels" : [
       "TransientTransactionError"
    ],
    "operationTime" : Timestamp(1538495683, 1),
    "ok" : 0,
    "errmsg" : "WriteConflict",
    "code" : 112,
    "codeName" : "WriteConflict",
    "$clusterTime" : {
       "clusterTime" : Timestamp(1538495683, 1),
       "signature" : {
            "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
            "keyId" : NumberLong(0)
       }
     }
})

We received an error and the transaction failed. We can retry it later.

Other details

  • the individual writes inside the transaction are not retry-able even if retryWrites is set to true
  • each commit operation is a retry-able write operation regardless of whether retryWrites is set to true. The drivers retry the commit a single time in case of an error.
  • Read Concern supports snapshot, local and majority values
  • Write Concern can be set at the transaction level. The individual operations inside the transaction ignore the write concern. Write concern is evaluated during the commit
  • Read Preference supports only primary value

Conclusions

Transaction support in MongoDB 4.0 is a very interesting new feature, but it isn't fully mature yet; there are strong limitations at this stage: a transaction cannot be larger than 16MB, you cannot use it on sharded clusters, and so on. If you absolutely need a transaction in your application, use it. But don't use transactions only because they are cool, since in some cases a proper data model based on embedding documents in collections and denormalizing your data could be the best solution. MongoDB isn't by its nature a relational database; as long as you are able to model your data keeping in mind that it's a NoSQL database, you should avoid using transactions. In specific cases, or if you already have a database with strong "informal relations" between collections that you cannot change, then you could choose to rely on transactions.

Image modified from original photo: by Annie Spratt on Unsplash

Nondeterministic Functions in MySQL (i.e. rand) Can Surprise You

Working on a test case with sysbench, I encountered this:

mysql> select * from sbtest1 where id = round(rand()*10000, 0);
+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| id   | k      | c                                                                                                                       | pad                                                         |
+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
|  179 | 499871 | 09833083632-34593445843-98203182724-77632394229-31240034691-22855093589-98577647071-95962909368-34814236148-76937610370 | 62233363025-41327474153-95482195752-11204169522-13131828192 |
| 1606 | 502031 | 81212399253-12831141664-41940957498-63947990218-16408477860-15124776228-42269003436-07293216458-45216889819-75452278174 | 25423822623-32136209218-60113604068-17409951653-00581045257 |
+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
2 rows in set (0.30 sec)

I was really surprised. First, and most importantly, id is a primary key and the rand() function should produce just one value. How come the query returns two rows? Second, why is the response time 0.30 sec? That seems really high for a primary key access.

Looking further:

CREATE TABLE `sbtest1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `k` int(11) NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=1000001 DEFAULT CHARSET=latin1
mysql> explain select * from sbtest1 where id = round(rand()*10000, 0);
+----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table   | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | sbtest1 | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 986400 |    10.00 | Using where |
+----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+

So it is a primary key, but MySQL does not use an index, and it returns two rows. Is this a bug?

Deterministic vs nondeterministic functions

It turned out it is not a bug at all. It is pretty logical behavior from MySQL, but it is not what we would expect. First, why a full table scan? Well, rand() is a nondeterministic function. That means we do not know what it will return ahead of time – and actually, that is exactly the purpose of rand(): to return a random value. In this case, it is only logical to evaluate the function for each row, each time, and compare the results. That is, in our case:

  1. Read row 1, get the value of id, evaluate the value of RAND(), compare
  2. Proceed using the same algorithm with the remaining rows.

In other words, as the value of rand() is not known (not evaluated) beforehand, we can't use an index.

And in this case – rand() function – we have another interesting consequence. For larger tables with an auto_increment primary key, the probability of matching the rand() value and the auto_increment value is higher, so we can get multiple rows back. In fact, if we read the whole table from the beginning and keep comparing the auto_inc sequence with “the roll of the dice”, we can get many rows back.

That behavior is totally counter-intuitive. Nevertheless, to me, it’s also the only correct behavior.

We expect to have the rand() function evaluated before running the query.  This can actually be achieved by assigning rand() to a variable:

mysql> set @id=round(rand()*10000, 0); select @id; select * from sbtest1 where id = @id;
Query OK, 0 rows affected (0.00 sec)
+------+
| @id  |
+------+
| 6068 |
+------+
1 row in set (0.00 sec)
+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| id   | k      | c                                                                                                                       | pad                                                         |
+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| 6068 | 502782 | 84971025350-12845068791-61736600622-38249796467-85706778555-74134284808-24438972515-17848828748-86869270666-01547789681 | 17507194006-70651503059-23792945260-94159543806-65683812344 |
+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> explain select * from sbtest1 where id = @id;
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table   | partitions | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | sbtest1 | NULL       | const | PRIMARY       | PRIMARY | 4       | const |    1 |   100.00 | NULL  |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.01 sec)

This would meet our expectations.

There are (at least) two bug reports filed, with very interesting discussion:

  1. rand() used in scalar functions returns multiple rows
  2. SELECT on PK with ROUND(RAND()) give wrong errors

Other databases

I wanted to see how it works in other SQL databases. In PostgreSQL, the behavior is exactly the same as MySQL:

postgres=# select * from t2 where id = cast(random()*10000 as int);
  id  |    c
------+---------
 4093 | asdasda
 9378 | asdasda
(2 rows)
postgres=# select * from t2 where id = cast(random()*10000 as int);
  id  |    c
------+---------
 5988 | asdasda
 6674 | asdasda
(2 rows)
postgres=# explain select * from t2 where id = cast(random()*10000 as int);
                             QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on t2  (cost=0.00..159837.60 rows=1 width=12)
   Filter: (id = ((random() * '10000'::double precision))::integer)
(2 rows)

And SQLite seems different, evaluating the random() function beforehand:

sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);
16239|asdsadasdsa
sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);
32910|asdsadasdsa
sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);
58658|asdsadasdsa
sqlite> explain select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     12    0                    00  Start at 12
1     OpenRead       0     30182  0     2              00  root=30182 iDb=0; t2
2     Function0      0     0     3     random(0)      00  r[3]=func(r[0])
3     Cast           3     69    0                    00  affinity(r[3])
4     Function0      0     3     2     abs(1)         01  r[2]=func(r[3])
5     Divide         4     2     1                    00  r[1]=r[2]/r[4]
6     Cast           1     68    0                    00  affinity(r[1])
7     SeekRowid      0     11    1                    00  intkey=r[1]; pk
8     Copy           1     5     0                    00  r[5]=r[1]
9     Column         0     1     6                    00  r[6]=t2.c
10    ResultRow      5     2     0                    00  output=r[5..6]
11    Halt           0     0     0                    00
12    Transaction    0     0     2     0              01  usesStmtJournal=0
13    Int64          0     4     0     92233720368547  00 r[4]=92233720368547
14    Goto           0     1     0                    00

Conclusion

Be careful when using MySQL nondeterministic functions in a WHERE condition – rand() is the most interesting example – as their behavior may surprise you. Many people believe this to be a bug that should be fixed. Let me know in the comments: do you think it is a bug or not (and why)? I would also be interested to know how it works in other, non-open-source databases (Microsoft SQL Server, Oracle, etc.).

PS: Finally, I’ve got a “clever” idea – what if I “trick” MySQL by using the deterministic keyword…

MySQL stored functions: deterministic vs not deterministic

So, I wanted to see how it works with MySQL stored functions when they are declared with the DETERMINISTIC and NOT DETERMINISTIC keywords. First, I wanted to "trick" MySQL by declaring the stored function DETERMINISTIC while using rand() inside. Okay, this is not what you really want to do!

DELIMITER $$
CREATE FUNCTION myrand() RETURNS INT
    DETERMINISTIC
BEGIN
 RETURN round(rand()*10000, 0);
END$$
DELIMITER ;

From MySQL manual about MySQL stored routines we can read:

Assessment of the nature of a routine is based on the “honesty” of the creator: MySQL does not check that a routine declared DETERMINISTIC is free of statements that produce nondeterministic results. However, misdeclaring a routine might affect results or affect performance. Declaring a nondeterministic routine as DETERMINISTIC might lead to unexpected results by causing the optimizer to make incorrect execution plan choices. Declaring a deterministic routine as NONDETERMINISTIC might diminish performance by causing available optimizations not to be used.

The result is interesting:

mysql> select myrand();
+----------+
| myrand() |
+----------+
|     4202 |
+----------+
1 row in set (0.00 sec)
mysql> select myrand();
+----------+
| myrand() |
+----------+
|     7548 |
+----------+
1 row in set (0.00 sec)
mysql> explain select * from t2 where id = myrand()\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
   partitions: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
     filtered: NULL
        Extra: Impossible WHERE noticed after reading const tables
1 row in set, 1 warning (0.00 sec)
mysql> show warnings;
+-------+------+--------------------------------------------------------------------------------+
| Level | Code | Message                                                                        |
+-------+------+--------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select '2745' AS `id`,'asasdas' AS `c` from `test`.`t2` where 0 |
+-------+------+--------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select * from t2 where id = 4202;
+------+---------+
| id   | c       |
+------+---------+
| 4202 | asasdas |
+------+---------+
1 row in set (0.00 sec)
mysql> select * from t2 where id = 2745;
+------+---------+
| id   | c       |
+------+---------+
| 2745 | asasdas |
+------+---------+
1 row in set (0.00 sec)

So the MySQL optimizer detected the problem (somehow).

If I use the NOT DETERMINISTIC keyword, then MySQL works the same as when using the rand() function:

DELIMITER $$
CREATE FUNCTION myrand2() RETURNS INT
   NOT DETERMINISTIC
BEGIN
 RETURN round(rand()*10000, 0);
END$$
DELIMITER ;
mysql> explain select * from t2 where id = myrand2()\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 262208
     filtered: 10.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

 


Photo by dylan nolte on Unsplash

MySQL 8 and The FRM Drop… How To Recover Table DDL

… or what I should keep in mind in case of disaster

Retrieving and maintaining the definition of all tables in a database in SQL format is a best practice that we should all adopt. Keeping those definitions under version control is another best practice to keep in mind.

While doing that may seem redundant, it can become a life saver in several situations: from the need to review what has historically changed in a table, to knowing who changed what and why… to when you need to recover your data and your beloved MySQL instance won't start…

But let’s be honest, only a few do the right thing, and even fewer keep that information up to date. Given that’s the case, what can we do when we have the need to discover/recover the table structure?

From the beginning, MySQL has used some external files to describe its internal structure.

For instance, if I have a schema named windmills and a table named wmillAUTOINC1, on the file system I will see this:

-rw-r-----. 1 mysql mysql     8838 Mar 14 2018 wmillAUTOINC1.frm
-rw-r-----. 1 mysql mysql   131072 Mar 14 2018 wmillAUTOINC1.ibd

The ibd file contains the data, while the frm file contains the structure information.

Putting aside ANY discussion about whether this is safe, whether it's transactional, and more… when we've experienced a major crash and data corruption, this approach has been helpful. Being able to read the frm file was the easiest way to get the information we needed.
Simple tools like DBSake made the task quite trivial, and allowed us to script table definitions when we needed to run long, complex, tedious data recoveries:

[root@master1 windmills]# /opt/tools/dbsake frmdump wmillAUTOINC1.frm
--
-- Table structure for table `wmillAUTOINC1`
-- Created with MySQL Version 5.7.20
--
CREATE TABLE `wmillAUTOINC1` (
  `id` bigint(11) NOT NULL AUTO_INCREMENT,
  `uuid` char(36) COLLATE utf8_bin NOT NULL,
  `millid` smallint(6) NOT NULL,
  `kwatts_s` int(11) NOT NULL,
  `date` date NOT NULL,
  `location` varchar(50) COLLATE utf8_bin NOT NULL,
  `active` tinyint(2) NOT NULL DEFAULT '1',
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `strrecordtype` char(3) COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_millid` (`millid`,`active`),
  KEY `IDX_active` (`id`,`active`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin ROW_FORMAT=DYNAMIC;

Of course, if the frm file was also corrupt, then we could try to get the information from the ibdata dictionary. If that was corrupted too (trust me, I've seen all of these situations)… well, a last resort was hoping the customer had a recent table definition stored somewhere – but, as mentioned before, we are not so diligent, are we?

Now, though, in MySQL 8 we do not have FRM files: they were dropped. Even more interesting is that we do not have the same dictionary; most of the things that we knew have changed, including the dictionary location. So what can be done?

Well, Oracle has moved the FRM information—and more—to what is called Serialized Dictionary Information (SDI). The SDI is written INSIDE the ibd file and represents a redundant copy of the information contained in the data dictionary.

The SDI is updated/modified by DDL operations on tables that reside in that tablespace. This means: if you normally have one file per table, then that file will contain ONLY the SDI for that table; but if you have multiple tables in a tablespace, the SDI information will refer to ALL of those tables.

To extract this information from the IBD files, Oracle provides a utility called ibd2sdi. This application parses the SDI information and outputs JSON that can be easily manipulated to extract and build the table definition.

One exception is represented by Partitioned tables. The SDI information is contained ONLY in the first partition, and if you drop it, it is moved to the next one. I will show that later.

But let’s see how it works. In the next examples I will look for the table’s name, attributes, and datatype starting from the dictionary tables.

To obtain the info I will do this:

/opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/mysql.ibd |jq  '.[]?|.[]?|.dd_object?|("------------------------------------"?,"TABLE NAME = ",.name?,"****",(.columns?|.[]?|(.name?,.column_type_utf8?)))'

The result will be something like:

"------------------------------------"
"TABLE NAME = "
"tables"
"****"
"id"
"bigint(20) unsigned"
"schema_id"
"bigint(20) unsigned"
"name"
"varchar(64)"
"type"
"enum('BASE TABLE','VIEW','SYSTEM VIEW')"
"engine"
"varchar(64)"
"mysql_version_id"
"int(10) unsigned"
"row_format"
"enum('Fixed','Dynamic','Compressed','Redundant','Compact','Paged')"
"collation_id"
"bigint(20) unsigned"
"comment"
"varchar(2048)"
<snip>
"------------------------------------"
"TABLE NAME = "
"tablespaces"
"****"
"id"
"bigint(20) unsigned"
"name"
"varchar(259)"
"options"
"mediumtext"
"se_private_data"
"mediumtext"
"comment"
"varchar(2048)"
"engine"
"varchar(64)"
"DB_TRX_ID"
""
"DB_ROLL_PTR"
""

I cut the output for brevity, but if you run the above command yourself you’ll be able to see that this retrieves the information for ALL the tables residing in the IBD.

The other thing I hope you noticed is that I am NOT parsing ibdata, but mysql.ibd. Why? Because the dictionary was moved out from ibdata and is now in mysql.ibd.

Look what happens if I try to parse ibdata:

[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/ibdata1 |jq '.'
[INFO] ibd2sdi: SDI is empty.

Be very careful here to not mess up your mysql.ibd file.

Now what can I do to get information about my wmillAUTOINC1 table in MySQL8?

That is quite simple:

/opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINC.ibd |jq '.'
[
  "ibd2sdi",
  {
    "type": 1,
    "id": 1068,
    "object": {
      "mysqld_version_id": 80013,
      "dd_version": 80013,
      "sdi_version": 1,
      "dd_object_type": "Table",
      "dd_object": {
        "name": "wmillAUTOINC",
        "mysql_version_id": 80011,
        "created": 20180925095853,
        "last_altered": 20180925095853,
        "hidden": 1,
        "options": "avg_row_length=0;key_block_size=0;keys_disabled=0;pack_record=1;row_type=2;stats_auto_recalc=0;stats_sample_pages=0;",
        "columns": [
          {
            "name": "id",
            "type": 9,
            "is_nullable": false,
            "is_zerofill": false,
            "is_unsigned": false,
            "is_auto_increment": true,
            "is_virtual": false,
            "hidden": 1,
            "ordinal_position": 1,
            "char_length": 11,
            "numeric_precision": 19,
            "numeric_scale": 0,
            "numeric_scale_null": false,
            "datetime_precision": 0,
            "datetime_precision_null": 1,
            "has_no_default": false,
            "default_value_null": false,
            "srs_id_null": true,
            "srs_id": 0,
            "default_value": "AAAAAAAAAAA=",
            "default_value_utf8_null": true,
            "default_value_utf8": "",
            "default_option": "",
            "update_option": "",
            "comment": "",
            "generation_expression": "",
            "generation_expression_utf8": "",
            "options": "interval_count=0;",
            "se_private_data": "table_id=1838;",
            "column_key": 2,
            "column_type_utf8": "bigint(11)",
            "elements": [],
            "collation_id": 83,
            "is_explicit_collation": false
          },
<SNIP>
        "indexes": [
          {
            "name": "PRIMARY",
            "hidden": false,
            "is_generated": false,
            "ordinal_position": 1,
            "comment": "",
            "options": "flags=0;",
            "se_private_data": "id=2261;root=4;space_id=775;table_id=1838;trx_id=6585972;",
            "type": 1,
            "algorithm": 2,
            "is_algorithm_explicit": false,
            "is_visible": true,
            "engine": "InnoDB",
<Snip>
        ],
        "foreign_keys": [],
        "partitions": [],
        "collation_id": 83
      }
    }
  },
  {
    "type": 2,
    "id": 780,
    "object": {
      "mysqld_version_id": 80011,
      "dd_version": 80011,
      "sdi_version": 1,
      "dd_object_type": "Tablespace",
      "dd_object": {
        "name": "windmills/wmillAUTOINC",
        "comment": "",
        "options": "",
        "se_private_data": "flags=16417;id=775;server_version=80011;space_version=1;",
        "engine": "InnoDB",
        "files": [
          {
            "ordinal_position": 1,
            "filename": "./windmills/wmillAUTOINC.ibd",
            "se_private_data": "id=775;"
          }
        ]
      }
    }
  }
]

The JSON will contain:

  • A section describing the DB object at high level
  • Array of columns and related information
  • Array of indexes
  • Partition information (not here but in the next example)
  • Table space information

That is a lot more detail compared to what we had in the FRM, and it is quite relevant and interesting information as well.

Once you have extracted the SDI, any JSON parsing tool or script can generate the information needed for the SQL DDL.
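
For example, here is a rough sketch, following the same jq pattern used above, that prints just the column names and types; treat it as a starting point for building the CREATE TABLE statement rather than a complete DDL generator:

/opt/mysql_templates/mysql-8P/bin/./ibd2sdi /opt/mysql_instances/master8/data/windmills/wmillAUTOINC.ibd |jq -r '.[]?|.[]?|.dd_object?|.columns?|.[]?|"\(.name) \(.column_type_utf8)"'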

I mention partitions, so let’s look at this a bit more, given they can be tricky.

As mentioned, the SDI information is present ONLY in the first partition. All other partitions hold ONLY the tablespace information. Given that, the first thing to do is to identify which partition is the first one… OR simply try to access all partitions and, when you are able to get the details, extract them.

The process is the same:

[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170301.ibd |jq '.'
[
  "ibd2sdi",
  {
    "type": 1,
    "id": 1460,
    "object": {
      "mysqld_version_id": 80013,
      "dd_version": 80013,
      "sdi_version": 1,
      "dd_object_type": "Table",
      "dd_object": {
        "name": "wmillAUTOINCPART",
        "mysql_version_id": 80013,
        "created": 20181125110300,
        "last_altered": 20181125110300,
        "hidden": 1,
        "options": "avg_row_length=0;key_block_size=0;keys_disabled=0;pack_record=1;row_type=2;stats_auto_recalc=0;stats_sample_pages=0;",
        "columns": [<snip>
    	  "schema_ref": "windmills",
        "se_private_id": 18446744073709552000,
        "engine": "InnoDB",
        "last_checked_for_upgrade_version_id": 80013,
        "comment": "",
        "se_private_data": "autoinc=31080;version=2;",
        "row_format": 2,
        "partition_type": 7,
        "partition_expression": "to_days(`date`)",
        "partition_expression_utf8": "to_days(`date`)",
        "default_partitioning": 1,
        "subpartition_type": 0,
        "subpartition_expression": "",
        "subpartition_expression_utf8": "",
        "default_subpartitioning": 0,
       ],
<snip>
        "foreign_keys": [],
        "partitions": [
          {
            "name": "PT20170301",
            "parent_partition_id": 18446744073709552000,
            "number": 0,
            "se_private_id": 1847,
            "description_utf8": "736754",
            "engine": "InnoDB",
            "comment": "",
            "options": "",
            "se_private_data": "autoinc=0;version=0;",
            "values": [
              {
                "max_value": false,
                "null_value": false,
                "list_num": 0,
                "column_num": 0,
                "value_utf8": "736754"
              }
            ],

The difference, as you can see, is that the section related to partitions and sub partitions will be filled with all the details you might need to recreate the partitions.

We will have:

  • Partition type
  • Partition expression
  • Partition values
  • …more

Same for sub partitions.

Now again see what happens if I parse the second partition:

[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170401.ibd |jq '.'
[
  "ibd2sdi",
  {
    "type": 2,
    "id": 790,
    "object": {
      "mysqld_version_id": 80011,
      "dd_version": 80011,
      "sdi_version": 1,
      "dd_object_type": "Tablespace",
      "dd_object": {
        "name": "windmills/wmillAUTOINCPART#P#PT20170401",
        "comment": "",
        "options": "",
        "se_private_data": "flags=16417;id=785;server_version=80011;space_version=1;",
        "engine": "InnoDB",
        "files": [
          {
            "ordinal_position": 1,
            "filename": "./windmills/wmillAUTOINCPART#P#PT20170401.ibd",
            "se_private_data": "id=785;"
          }
        ]
      }
    }
  }
]

I will get only the information about the tablespace, not the table.
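If you do not know which partition file holds the table SDI, one option, as anticipated above, is a small shell loop that tries them all and reports the file where a Table object shows up. This is only a sketch, assuming ibd2sdi is in the PATH and the datadir layout matches the one used in this post:

# Sketch: find which partition .ibd carries the Table SDI
for f in /opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#*.ibd; do
  if ibd2sdi "$f" | jq -e '.[] | objects | select(.object.dd_object_type == "Table")' > /dev/null 2>&1; then
    echo "Table SDI found in: $f"
  fi
done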

As promised, let me now show you what happens if I drop the first partition, so that the second partition becomes the first:

(root@localhost) [windmills]>alter table wmillAUTOINCPART drop partition PT20170301;
Query OK, 0 rows affected (1.84 sec)
Records: 0  Duplicates: 0  Warnings: 0
[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170401.ibd |jq '.'|more
[
  "ibd2sdi",
  {
    "type": 1,
    "id": 1461,
    "object": {
      "mysqld_version_id": 80013,
      "dd_version": 80013,
      "sdi_version": 1,
      "dd_object_type": "Table",
      "dd_object": {
        "name": "wmillAUTOINCPART",
        "mysql_version_id": 80013,
        "created": 20181129130834,
        "last_altered": 20181129130834,
        "hidden": 1,
        "options": "avg_row_length=0;key_block_size=0;keys_disabled=0;pack_record=1;row_type=2;stats_auto_recalc=0;stats_sample_pages=0;",
        "columns": [
          {
            "name": "id",
            "type": 9,
            "is_nullable": false,
            "is_zerofill": false,
            "is_unsigned": false,
            "is_auto_increment": true,
            "is_virtual": false,
            "hidden": 1,
            "ordinal_position": 1,

As I mentioned before, each DDL operation updates the SDI, and here we go: I now get the full table information from what is NOW the FIRST partition. Note the value of the "created" attribute, comparing the partition I queried the first time with the one I am querying now:

/opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170301.ibd
       "created": 20181125110300,
/opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170401.ibd
       "created": 20181129130834,

To be clear, the second created value belongs to PT20170401, which is NOW the first partition, and it reflects the moment I dropped the other partition (PT20170301).
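If you want to compare these timestamps without scrolling through the full output, a one-line filter over the same pipeline does the job (again a sketch, with the path from this environment):

# Sketch: print only the "created" attribute of the table SDI
ibd2sdi /opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170401.ibd | jq '
  .[] | objects | select(.object.dd_object_type == "Table") | .object.dd_object.created'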

Conclusions

In the end, this solution is definitely more powerful than the FRM files. It allows us to parse the file and recover the table definition more easily, and it provides much more detail.

Problems will arise if and when the IBD file itself becomes corrupted.

As per the manual: "For InnoDB, an SDI record requires a single index page, which is 16KB in size by default. However, SDI data is compressed to reduce the storage footprint."

Which means that, if one SDI record corresponds to one table, each table's definition fits in a single page. So in case of IBD corruption I should (most likely) still be able to read those pages, unless I have really bad luck.

I still wonder how the size of an IBD file affects SDI retrieval, but since I have not tried it yet, I will have to let you know.

As an aside, I am working on a script to facilitate the generation of the SQL. It is not yet ready, but you can find it here

One last note, but keep this in mind! It is stated in the manual, albeit in a hidden place and in small letters:
"DDL operations take longer due to writing to storage, undo logs, and redo logs instead of .frm files."

References

https://stedolan.github.io/jq/

https://dev.mysql.com/doc/refman/8.0/en/ibd2sdi.html

https://dev.mysql.com/doc/refman/8.0/en/serialized-dictionary-information.html

https://dev.mysql.com/doc/refman/8.0/en/data-dictionary-limitations.html


Photo by chuttersnap on Unsplash
