Channel: Insight for DBAs Archives - Percona Database Performance Blog

Webinar Wed 7/18: MariaDB 10.3 vs. MySQL 8.0

Please join Percona’s Chief Evangelist, Colin Charles, as he presents MariaDB 10.3 vs. MySQL 8.0 on Wednesday, July 18th, 2018, at 9:00 AM PDT (UTC-7) / 12:00 PM EDT (UTC-4).

 

Technical considerations

Are they syntactically similar? Where do these two databases differ? Why would I use one over the other?

MariaDB 10.3 is on the path of gradually diverging from MySQL 8.0. One obvious example is the internal data dictionary introduced in MySQL 8.0. This is a major change to the way metadata is stored and used within the server, and MariaDB doesn’t have an equivalent feature. This change could mark the end of datafile-level compatibility between MySQL and MariaDB.

Non-technical considerations

There are also non-technical differences between MySQL 8.0 and MariaDB 10.3, including:

Licensing: MySQL offers their code as open-source under the GPL, and provides the option of non-GPL commercial distribution in the form of MySQL Enterprise. MariaDB can only use the GPL, because their work is derived from the MySQL source code under the terms of that license.

Support services: Oracle provides technical support, training, certification and consulting for MySQL, while MariaDB has their own support services. Some people will prefer working with smaller companies, as traditionally it affords them more leverage as a customer.

Community contributions: MariaDB touts the fact that they accept more community contributions than Oracle. Part of the reason for this disparity is that developers like to contribute features, bug fixes and other code without a lot of paperwork overhead (and they complain about the Oracle Contributor Agreement). However, MariaDB has its own MariaDB Contributor Agreement — which more or less serves the same purpose.

Colin will take a look at some of the differences between MariaDB 10.3 and MySQL 8.0 and help answer some of the common questions our Database Performance Experts get about the two databases.

Register Now

The post Webinar Wed 7/18: MariaDB 10.3 vs. MySQL 8.0 appeared first on Percona Database Performance Blog.


Webinar Wed 7/19: MongoDB Sharding

Please join Percona’s Senior Support Engineer, Adamo Tonete, as he presents MongoDB Sharding 101 on July 19th, 2018, at 12:30 PM PDT (UTC-7) / 3:30 PM EDT (UTC-4).

 

This tutorial is a continuation of advanced topics for the DBA. In it, we will share best practices and tips on how to perform the most common activities.

In this tutorial, we are going to cover MongoDB sharding.

Register Now

The post Webinar Wed 7/19: MongoDB Sharding appeared first on Percona Database Performance Blog.

InnoDB Cluster in a Nutshell Part 3: MySQL Shell

Welcome to the third part of this series. I’m glad you’re still reading, as hopefully this means you find this subject interesting. Previously we presented the first two components of MySQL InnoDB Cluster, Group Replication and MySQL Router, and now we will discuss the last component, MySQL Shell.

MySQL Shell

This is the last component in the cluster and I love it. Oracle have created this tool to centralize cluster management, providing a friendly, command-line based user interface.

The tool can be defined as an advanced MySQL shell, which is much more powerful than the well known MySQL client. With the capacity to work with both relational and document (JSON) data, the tool provides an extended capability to interact with the database from a single place.

MySQL Shell is also able to understand different languages:

  • JavaScript (default), which includes several built-in functions to administer the cluster (create, destroy, restart, etc.) in a very easy way.
  • Python, which provides an easy way to write Python code to interact with the database. This is particularly useful for developers who don’t need to have SQL skills or run applications to test code.
  • SQL, to work in classic mode and query the database as we used to do with the old MySQL client.

A very interesting feature provided with MySQL Shell is the ability to establish different connections to different servers/clusters from within the same shell. There is no need to exit to connect to a different server; just issuing the command \connect will make this happen. As a DBA, I find this pretty useful when handling multiple clusters/servers.

Some of the features present in this tool:

  • Capacity to use both Classic and X protocols.
  • Online mode switching to change languages (JavaScript, Python and SQL).
  • Auto-completion of commands using tab, a long-expected feature in the MySQL client.
  • Colored output formatting that also supports different formats such as table, tab-separated and JSON.
  • Batch mode that processes batches of commands, with an interactive option that prints output as each line is processed.

Some sample commands

Some samples of the tool’s commands and execution modes:

// switch modes (ending in JavaScript mode, which the commands below use)
\sql
\py
\js
// connect to an instance
\connect user@host[:port]
// create a cluster (better to handle through variables)
var cluster=dba.createCluster('percona')
// add instances to cluster
cluster.addInstance('root@192.168.70.2:3306')
// check cluster status
cluster.status()
// using another variable
var cluster2=dba.getCluster('percona')
cluster2.status()
// get cluster structure
cluster.describe()
// rejoin instance to cluster - run locally on the instance; the address shown reuses the addInstance example
cluster.rejoinInstance('root@192.168.70.2:3306')
// recover from lost quorum
cluster.forceQuorumUsingPartitionOf('root@localhost:3306')
// recover from a complete outage (this is a method of dba, not of the cluster object)
var cluster=dba.rebootClusterFromCompleteOutage('percona')
// destroy cluster
cluster.dissolve({force:true});

Personally, I think this tool is a very good replacement for the classic MySQL client. Sadly, mysql-server installations do not include MySQL Shell by default, but it is worth installing and getting used to. I recommend you try it.

Conclusion

We finally reached the end of this series. I hope you have enjoyed this short introduction to what seems to be Oracle’s bid to have a built-in High Availability solution based on InnoDB. It may become a good competitor to Galera-based solutions. Still, there is a long way to go, as the tool was only just released as GA (April 2018). There are a bunch of things that need to be addressed before it becomes consistent enough to be production-ready. In my personal opinion, it is not—yet. Nevertheless, I think it is a great tool that will eventually be a serious player in the HA field as it’s an excellent, flexible and easy to deploy solution.

The post InnoDB Cluster in a Nutshell Part 3: MySQL Shell appeared first on Percona Database Performance Blog.

Webinar Weds 7/25: XA Transactions

Please join Percona Senior MySQL DBA for Managed Services, Dov Endress, as he presents XA Transactions on Wednesday, July 25th, 2018, at 12:00 PM PDT (UTC-7) / 3:00 PM EDT (UTC-4).

Distributed transactions (XA) are becoming more and more vital as applications evolve. In this webinar, we will learn what distributed transactions are and how MySQL implements the XA specification. We will learn the investigatory and debugging techniques necessary to ensure high availability and data consistency across disparate environments.

This webinar is not intended to be an in-depth look at transaction managers, but focuses on resource managers only. It is primarily intended for database administrators and site reliability engineers.

Register Now

Dov Endress, Senior MySQL DBA in Managed Services

Dov joined Percona in the fall of 2015 as a senior support engineer. He learned BASIC in elementary school on an Apple IIE, and his first computer was a Commodore 64. Dov started working in the LAMP stack in 1999 and has been doing so ever since. He lives in northern Nevada with his wife, step-daughter, grandson, a clowder of cats and Macy – the best dog a person could meet. In his free time, he can be found somewhere outdoors or making things in the garage.

The post Webinar Weds 7/25: XA Transactions appeared first on Percona Database Performance Blog.

Tuning InnoDB Primary Keys

The choice of good InnoDB primary keys is a critical performance tuning decision. This post will guide you through the steps of choosing the best primary key depending on your workload.

As a principal architect at Percona, one of my main duties is to tune customer databases. There are many aspects related to performance tuning which make the job complex and very interesting. In this post, I want to discuss one of the most important ones: the choice of good InnoDB primary keys. You would be surprised how many times I have had to explain the importance of primary keys, and how many debates I have had around the topic, as people often have preconceived ideas that translate into doing things a certain way without further thinking.

The choice of a good primary key for an InnoDB table is extremely important and can have huge performance impacts. When you start working with a customer using an overloaded x1.16xlarge RDS instance, with close to 1TB of RAM, and after putting a new primary key in place they end up doing very well with an r4.4xlarge instance, that is a huge impact. Of course, it is not a silver bullet: you need to have a workload like the ones I’ll highlight in the following sections. Keep in mind that tuning comes with trade-offs, especially with the primary key. What you gain somewhere, you have to pay for, performance-wise, elsewhere. You need to calculate what is best for your workload.

What is special about InnoDB primary keys?

InnoDB is called an index-organized storage engine. An index-organized storage engine uses the B-Tree of the primary key to store the data, the table rows. That means a primary key is mandatory with InnoDB. If there is no primary key for a table, and no unique index on NOT NULL columns that can be used in its place, InnoDB adds a hidden auto-incremented 6-byte counter to the table and uses that hidden counter as the primary key. There are some issues with the InnoDB hidden primary key. You should always define explicit primary keys on your tables. In summary, you access all InnoDB rows by the primary key values.

An InnoDB secondary index is also a B-Tree. The search key is made of the index columns and the values stored are the primary keys of matching rows. A search by a secondary index very often results in an implicit search by primary key. You can find more information about InnoDB file format in the documentation. Jeremy Cole’s InnoDB Ruby tools are also a great way to learn about InnoDB internals.

What is a B-Tree?

A B-Tree is a data structure optimized for operations on block devices. Block devices, or disks, have a rather significant data access latency, especially spinning disks. Retrieving a single byte at a random position doesn’t take much less time than retrieving a bigger piece of data like an 8KB or 16KB object. That’s the fundamental argument for B-Trees. InnoDB uses pieces of data, called pages, of 16KB.

A simple three level B-Tree

Let’s attempt a simplified description of a B-Tree. A B-Tree is a data structure organized around a key. The key is used to search the data inside the B-Tree. A B-Tree normally has multiple levels. The data is stored only in the bottom-most level, the leaves. The pages of the other levels, the nodes, only contain keys and pointers to pages in the next lower level.

When you want to access a piece of data for a given value of the key, you start from the top node, the root node, compare the keys it contains with the search value and find the page to access at the next level. The process is repeated until you reach the last level, the leaves. In theory, you need one disk read operation per level of the B-Tree. In practice there is always a memory cache and the nodes, since they are less numerous and accessed often, are easy to cache.
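To make the root-to-leaf walk concrete, here is a minimal sketch in Python. It is a toy, read-only model and not how InnoDB lays out its pages: the names build_levels and lookup, the fanout of 10 and the key range are all assumptions picked for illustration, and the key list is assumed to be sorted and non-empty. Each level is a list of pages, a node page keeps the first key of each child page as a separator, and a lookup counts one page read per level.

```python
import bisect


def build_levels(keys, fanout):
    """Return the B-Tree levels, root first; each level is a list of pages."""
    # Leaves hold the actual keys, `fanout` of them per page.
    leaves = [keys[i:i + fanout] for i in range(0, len(keys), fanout)]
    levels = [leaves]
    while len(levels[0]) > 1:
        below = levels[0]
        # A node page stores the first key of each child page as a separator.
        separators = [page[0] for page in below]
        nodes = [separators[i:i + fanout]
                 for i in range(0, len(separators), fanout)]
        levels.insert(0, nodes)
    return levels


def lookup(levels, fanout, key):
    """Walk from the root to a leaf; return (found, pages_read)."""
    page_no = 0
    pages_read = 0
    for depth, level in enumerate(levels):
        page = level[page_no]
        pages_read += 1
        if depth == len(levels) - 1:
            # Leaf level: look for the key itself.
            i = bisect.bisect_left(page, key)
            return (i < len(page) and page[i] == key), pages_read
        # Node level: follow the last separator that is <= key.
        slot = max(bisect.bisect_right(page, key) - 1, 0)
        page_no = page_no * fanout + slot


FANOUT = 10                        # hypothetical, tiny compared to InnoDB
keys = list(range(0, 2000, 2))     # 1000 sorted even keys
levels = build_levels(keys, FANOUT)

print(len(levels))                     # number of levels in the toy tree
print(lookup(levels, FANOUT, 1998))    # an existing key
print(lookup(levels, FANOUT, 1999))    # a missing key
```

With 1000 keys and a fanout of 10, the toy tree has 100 leaves, 10 node pages and one root, so every lookup touches three pages whether or not the key exists.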

An ordered insert example

Let’s consider the following sysbench table:

mysql> show create table sbtest1\G
*************************** 1. row ***************************
       Table: sbtest1
Create Table: CREATE TABLE `sbtest1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `k` int(11) NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=3000001 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
mysql> show table status like 'sbtest1'\G
*************************** 1. row ***************************
           Name: sbtest1
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 2882954
 Avg_row_length: 234
    Data_length: 675282944
Max_data_length: 0
   Index_length: 47775744
      Data_free: 3145728
 Auto_increment: 3000001
    Create_time: 2018-07-13 18:27:09
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

The primary key B-Tree size is Data_length. There is one secondary key B-Tree, the k_1 index, and its size is given by Index_length. The sysbench table was inserted in order of the primary key since the id column is auto-incremented. When you insert in order of the primary key, InnoDB fills its pages with up to 15KB of data (out of 16KB), even when innodb_fill_factor is set to 100. That allows for some row expansion by updates after the initial insert before a page needs to be split. There are also some headers and footers in the pages. If a page is too full and cannot accommodate an update adding more data, the page is split into two. Similarly, if two neighbor pages are less than 50% full, InnoDB will merge them. Here is, for example, a sysbench table inserted in id order:

mysql> select count(*), TABLE_NAME,INDEX_NAME, avg(NUMBER_RECORDS), avg(DATA_SIZE) from information_schema.INNODB_BUFFER_PAGE
    -> WHERE TABLE_NAME='`sbtest`.`sbtest1`' group by TABLE_NAME,INDEX_NAME order by count(*) desc;
+----------+--------------------+------------+---------------------+----------------+
| count(*) | TABLE_NAME         | INDEX_NAME | avg(NUMBER_RECORDS) | avg(DATA_SIZE) |
+----------+--------------------+------------+---------------------+----------------+
|    13643 | `sbtest`.`sbtest1` | PRIMARY    |             75.0709 |     15035.8929 |
|       44 | `sbtest`.`sbtest1` | k_1        |           1150.3864 |     15182.0227 |
+----------+--------------------+------------+---------------------+----------------+
2 rows in set (0.09 sec)
mysql> select PAGE_NUMBER,NUMBER_RECORDS,DATA_SIZE,INDEX_NAME,TABLE_NAME from information_schema.INNODB_BUFFER_PAGE
    -> WHERE TABLE_NAME='`sbtest`.`sbtest1`' order by PAGE_NUMBER limit 1;
+-------------+----------------+-----------+------------+--------------------+
| PAGE_NUMBER | NUMBER_RECORDS | DATA_SIZE | INDEX_NAME | TABLE_NAME         |
+-------------+----------------+-----------+------------+--------------------+
|           3 |             35 |       455 | PRIMARY    | `sbtest`.`sbtest1` |
+-------------+----------------+-----------+------------+--------------------+
1 row in set (0.04 sec)
mysql> select PAGE_NUMBER,NUMBER_RECORDS,DATA_SIZE,INDEX_NAME,TABLE_NAME from information_schema.INNODB_BUFFER_PAGE
    -> WHERE TABLE_NAME='`sbtest`.`sbtest1`' order by NUMBER_RECORDS desc limit 3;
+-------------+----------------+-----------+------------+--------------------+
| PAGE_NUMBER | NUMBER_RECORDS | DATA_SIZE | INDEX_NAME | TABLE_NAME         |
+-------------+----------------+-----------+------------+--------------------+
|          39 |           1203 |     15639 | PRIMARY    | `sbtest`.`sbtest1` |
|          61 |           1203 |     15639 | PRIMARY    | `sbtest`.`sbtest1` |
|          37 |           1203 |     15639 | PRIMARY    | `sbtest`.`sbtest1` |
+-------------+----------------+-----------+------------+--------------------+
3 rows in set (0.03 sec)

The table doesn’t fit in the buffer pool, but the queries give us good insights. The pages of the primary key B-Tree have on average 75 records and store a bit less than 15KB of data. The index k_1 is inserted in random order by sysbench. Why is the filling factor so good? It’s simply because sysbench creates the index after the rows have been inserted and InnoDB uses a sort file to create it.

You can easily estimate the number of levels in an InnoDB B-Tree. The above table needs about 40k leaf pages (3M/75). Each node page holds about 1200 pointers when the primary key is a four-byte integer. The level above the leaves thus has approximately 35 pages and then, on top of the B-Tree, is the root node (PAGE_NUMBER = 3). We have a total of three levels.
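The same estimate can be written as a few lines of Python. This is only a back-of-the-envelope sketch: the function name btree_levels is mine, and the inputs (3M rows, about 75 rows per leaf page, about 1200 pointers per node page) are the rounded figures from the paragraph above.

```python
import math


def btree_levels(rows, rows_per_leaf, pointers_per_node):
    """Estimate the page count of each B-Tree level, leaves first."""
    pages = math.ceil(rows / rows_per_leaf)
    per_level = [pages]
    # Keep adding levels until a single page (the root) covers everything.
    while pages > 1:
        pages = math.ceil(pages / pointers_per_node)
        per_level.append(pages)
    return per_level


per_level = btree_levels(3_000_000, 75, 1200)
print(per_level)        # [40000, 34, 1]
print(len(per_level))   # 3 levels
```

The rounding differs slightly from the prose (34 pages instead of about 35) but the conclusion is the same: three levels. With these figures, a fourth level would only appear once the level above the leaves outgrows a single root page, that is beyond roughly 1200 * 1200 leaf pages.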

A randomly inserted example

If you are a keen observer, you realized a direct consequence of inserting in random order of the primary key. The pages are often split, and on average the filling factor is only around 65-75%. You can easily see the filling factor from the information schema. I modified sysbench to insert in random order of id and created a table, also with 3M rows. The resulting table is much larger:

mysql> show table status like 'sbtest1'\G
*************************** 1. row ***************************
           Name: sbtest1
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 3137367
 Avg_row_length: 346
    Data_length: 1088405504
Max_data_length: 0
   Index_length: 47775744
      Data_free: 15728640
 Auto_increment: NULL
    Create_time: 2018-07-19 19:10:36
    Update_time: 2018-07-19 19:09:01
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

While the size of the primary key B-Tree inserted in order of id is 644MB, the size when inserted in random order is about 1GB, 60% larger. Obviously, we have a lower page filling factor:

mysql> select count(*), TABLE_NAME,INDEX_NAME, avg(NUMBER_RECORDS), avg(DATA_SIZE) from information_schema.INNODB_BUFFER_PAGE
    -> WHERE TABLE_NAME='`sbtestrandom`.`sbtest1`' group by TABLE_NAME,INDEX_NAME order by count(*) desc;
+----------+--------------------------+------------+---------------------+----------------+
| count(*) | TABLE_NAME               | INDEX_NAME | avg(NUMBER_RECORDS) | avg(DATA_SIZE) |
+----------+--------------------------+------------+---------------------+----------------+
|     4022 | `sbtestrandom`.`sbtest1` | PRIMARY    |             66.4441 |     10901.5962 |
|     2499 | `sbtestrandom`.`sbtest1` | k_1        |           1201.5702 |     15624.4146 |
+----------+--------------------------+------------+---------------------+----------------+
2 rows in set (0.06 sec)

The primary key pages are now filled with only about 10KB of data (~66%). It is a normal and expected consequence of inserting rows in random order. We’ll see that for some workloads, it is bad. For some others, it is a small price to pay.
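As a quick check of these figures, the filling factors and the size increase can be recomputed from the numbers reported in the outputs above. This is a small sketch in Python; the variable names are mine, and the filling factor is simply taken as avg(DATA_SIZE) divided by the 16KB page size, ignoring page headers and footers.

```python
PAGE_SIZE = 16 * 1024  # InnoDB page size in bytes

# avg(DATA_SIZE) of the PRIMARY index, from INNODB_BUFFER_PAGE
ordered_fill = 15035.8929 / PAGE_SIZE
random_fill = 10901.5962 / PAGE_SIZE

# Data_length from "show table status", random order vs. id order
growth = 1088405504 / 675282944

print(f"ordered inserts: {ordered_fill:.1%} full")   # 91.8% full
print(f"random inserts:  {random_fill:.1%} full")    # 66.5% full
print(f"size ratio:      {growth:.2f}x")             # 1.61x
```

The buffer pool only holds a sample of the pages, so these are estimates, but they agree with the roughly 60% larger primary key B-Tree.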

A practical analogy

It is always good to have a concrete model or analogy in your mind to better understand what is going on. Let’s assume you have been tasked to write the names and arrival time, on paper, of all the attendees arriving at a large event like Percona Live. So, you sit at a table close to the entry with a good pen and a pile of sheets of paper. As people arrive, you write their names and arrival time, one after the other. When a sheet is full, after about 40 names, you move it aside and start writing to a new one. That’s fast and effective. You handle a sheet only once, and when it is full, you don’t touch it anymore. The analogy is easy: a sheet of paper represents an InnoDB page.

The above use case represents an ordered insert. It is very efficient for the writes. Your only issue is with the organizer of the event: she keeps coming to you asking if “Mr. X” or “Mrs. Y” has arrived. You have to scan through your sheets to find the name. That’s the drawback of ordered inserts, reads can be more expensive. Not all reads are expensive, some can be very cheap. For example: “Who were the first ten people to get in?” is super easy. You’ll want an ordered insert strategy when the critical aspects of the application are the rate and the latency of the inserts. That usually means the reads are not user-facing. They are coming from report batch jobs, and as long as these jobs complete in a reasonable time, you don’t really care.

Now, let’s consider a random insertion analogy. For the next day of the event, tired of the organizer questions, you decide on a new strategy: you’ll write the names grouped by the first letter of the last name. Your goal is to ease the searches by name. So you take 26 sheets, and on top of each one, you write a different letter. As the first visitors arrive, you quickly realize you are now spending a lot more time looking for the right sheet in the stack and putting it back at the right place once you added a name to it.

At the end of the morning, you have worked much more. You also have more sheets than the previous day since for some letters there are few names while for others you needed more than a sheet. Finding names is much easier though. The main drawback of random insertion order is the overhead to manage the database pages when adding entries. The database will read and write from/to disk much more and the dataset size is larger.
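The sheets-of-paper analogy can be simulated in a few lines of Python. This is a deliberately simplified model and not InnoDB's actual page split algorithm: a page is a sorted list holding up to 75 keys, a full page is split in half, and a key appended at the very end of the index starts a fresh page instead (a stand-in for how sequential inserts leave full pages behind). The function names, the capacity and the row count are assumptions for the illustration.

```python
import bisect
import random

PAGE_CAPACITY = 75  # rows per page, close to what the sysbench table showed


def load(keys, capacity=PAGE_CAPACITY):
    """Insert keys one at a time into sorted leaf pages; return the pages."""
    pages = [[]]                    # each page is a sorted list of keys
    first_keys = [float("-inf")]    # lowest key routed to each page
    for key in keys:
        idx = bisect.bisect_right(first_keys, key) - 1
        page = pages[idx]
        if len(page) < capacity:
            bisect.insort(page, key)
        elif idx == len(pages) - 1 and key > page[-1]:
            # Appending at the end of the index: start a new page.
            pages.append([key])
            first_keys.append(key)
        else:
            # Page is full: split it in half, then insert the key.
            mid = len(page) // 2
            right = page[mid:]
            del page[mid:]
            pages.insert(idx + 1, right)
            first_keys.insert(idx + 1, right[0])
            bisect.insort(right if key >= right[0] else page, key)
    return pages


def fill_factor(pages, capacity=PAGE_CAPACITY):
    """Average fraction of each page that is in use."""
    return sum(len(p) for p in pages) / (len(pages) * capacity)


ordered = list(range(30_000))
shuffled = ordered[:]
random.Random(42).shuffle(shuffled)

ordered_pages = load(ordered)
random_pages = load(shuffled)

print(len(ordered_pages), round(fill_factor(ordered_pages), 2))
print(len(random_pages), round(fill_factor(random_pages), 2))
```

In this model the ordered load leaves every page full, while the random load ends up with noticeably more pages that are only partly filled, in the same direction as the 65-75% filling factor mentioned earlier. The model ignores page headers, the space InnoDB reserves for updates, and page merges, so the exact numbers should not be read as InnoDB measurements.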

Determine your workload type

The first step is to determine what kind of workload you have. When you have an insert-intensive workload, very likely, the top queries are inserts on some large tables and the database heavily writes to disk. If you repeatedly execute “show processlist;” in the MySQL client, you see these inserts very often. That’s typical of applications logging a lot of data. There are many data collectors and they all wait to insert data. If they wait for too long, some data may be lost. If you have a strict SLA on the insert time and relaxed ones on the read time, you clearly have an insert-oriented workload and you should insert rows in order of the primary key.

You may also have a decent insert rate on large tables but these inserts are queued and executed by batch processes. Nobody is really waiting for these inserts to complete and the server can easily keep up with the number of inserts. What matters for your application is the large number of read queries going to the large tables, not the inserts. You already went through query tuning and even though you have good indexes, the database is reading from disk at a very high rate.

When you look at the MySQL processlist, you see the same select query forms on the large tables many times. The only option seems to be adding more memory to lower the disk reads, but the tables are growing fast and you can’t add memory forever. We’ll discuss the read-intensive workload in detail in the next section.

If you can’t figure out whether you have an insert-heavy or read-heavy workload, maybe you just don’t have a big workload. In such a case, the default would be to use ordered inserts, and the best way to achieve this with MySQL is through an auto-increment integer primary key. That’s the default behavior of many ORMs.

A read-intensive workload

I have seen quite a few read-intensive workloads over my consulting years, mostly with online games and social networking applications. On top of that, some games have social networking features like watching the scores of your friends as they progress through the game. Before we go further, we first need to confirm the reads are inefficient. When reads are inefficient, the top select query forms will be accessing a number of distinct InnoDB pages close to the number of rows examined. The Percona Server for MySQL slow log, when the verbosity level includes “InnoDB”, exposes both quantities, and the pt-query-digest tool includes stats on them. Here’s an example output (I’ve removed some lines):

# Query 1: 2.62 QPS, 0.00x concurrency, ID 0x019AC6AF303E539E758259537C5258A2 at byte 19976
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.00
# Time range: 2018-07-19T20:28:02 to 2018-07-19T20:28:23
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         48      55
# Exec time     76    93ms   637us     3ms     2ms     2ms   458us     2ms
# Lock time    100    10ms    72us   297us   182us   247us    47us   176us
# Rows sent    100   1.34k      16      36   25.04   31.70    4.22   24.84
# Rows examine 100   1.34k      16      36   25.04   31.70    4.22   24.84
# Rows affecte   0       0       0       0       0       0       0       0
# InnoDB:
# IO r bytes     0       0       0       0       0       0       0       0
# IO r ops       0       0       0       0       0       0       0       0
# IO r wait      0       0       0       0       0       0       0       0
# pages distin 100   1.36k      18      35   25.31   31.70    3.70   24.84
# EXPLAIN /*!50100 PARTITIONS*/
select * from friends where user_id = 1234\G

The friends table definition is:

CREATE TABLE `friends` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` int(10) unsigned NOT NULL,
  `friend_user_id` int(10) unsigned NOT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `active` tinyint(4) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_user_id_friend` (`user_id`,`friend_user_id`),
  KEY `idx_friend` (`friend_user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=144002 DEFAULT CHARSET=latin1

I built this simple example on my test server. The table easily fits in memory, so there are no disk reads. What matters here is the relation between “pages distin” and “Rows examine”. As you can see, the ratio is close to 1. It means that InnoDB rarely gets more than one row per page it accesses. For a given user_id value, the matching rows are scattered all over the primary key B-Tree. We can confirm this by looking at the output of the sample query:

mysql> select * from friends where user_id = 1234 order by id limit 10;
+-------+---------+----------------+---------------------+--------+
| id    | user_id | friend_user_id | created             | active |
+-------+---------+----------------+---------------------+--------+
|   257 |    1234 |             43 | 2018-07-19 20:14:47 |      1 |
|  7400 |    1234 |           1503 | 2018-07-19 20:14:49 |      1 |
| 13361 |    1234 |            814 | 2018-07-19 20:15:46 |      1 |
| 13793 |    1234 |            668 | 2018-07-19 20:15:47 |      1 |
| 14486 |    1234 |           1588 | 2018-07-19 20:15:47 |      1 |
| 30752 |    1234 |           1938 | 2018-07-19 20:16:27 |      1 |
| 31502 |    1234 |            733 | 2018-07-19 20:16:28 |      1 |
| 32987 |    1234 |           1907 | 2018-07-19 20:16:29 |      1 |
| 35867 |    1234 |           1068 | 2018-07-19 20:16:30 |      1 |
| 41471 |    1234 |            751 | 2018-07-19 20:16:32 |      1 |
+-------+---------+----------------+---------------------+--------+
10 rows in set (0.00 sec)

The rows are often thousands of id values apart. Although the rows are small, about 30 bytes, an InnoDB page doesn’t contain more than 500 rows, so two rows of the same user rarely share a page. As the application becomes popular, there are more and more users and the table size grows like the square of the number of users. As soon as the table outgrows the InnoDB buffer pool, MySQL starts to read from disk. Worst case, with nothing cached, we need one read IOP per friend. If the rate of these selects is 300/s and on average, every user has 100 friends, MySQL needs to access up to 30000 pages per second. Clearly, this doesn’t scale for long.

We need to determine all the ways the table is accessed. For that, I use pt-query-digest and I raise the limit on the number of query forms returned. Let’s assume I found:

  • 93% of the time by user_id
  • 5% of the time by friend_user_id
  • 2% of the time by id

The above proportions are quite common. When there is a dominant access pattern, we can do something. The friends table is a typical example of a many-to-many table. With InnoDB, we should define such tables as:

CREATE TABLE `friends` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` int(10) unsigned NOT NULL,
  `friend_user_id` int(10) unsigned NOT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `active` tinyint(4) NOT NULL DEFAULT '1',
  PRIMARY KEY (`user_id`,`friend_user_id`),
  KEY `idx_friend` (`friend_user_id`),
  KEY `idx_id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=144002 DEFAULT CHARSET=latin1

Now, the rows are ordered, grouped, by user_id inside the primary key B-Tree but the inserts are in random order. In other words, we slowed down the inserts to the benefit of the select statements on the table. To insert a row, InnoDB potentially needs one disk read to get the page where the new row is going and one disk write to save it back to the disk. Remember in the previous analogy, we needed to take one sheet from the stack, add a name and put it back in place. We also made the table bigger: the InnoDB pages are not as full and the secondary indexes are bigger since the primary key is larger. We also added a secondary index. Now we have less data in the InnoDB buffer pool.
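To see why grouping the rows by user_id helps the selects, here is a small simulation in Python. It is only a sketch under stated assumptions: rows are (id, user_id, friend_user_id) tuples generated at random for 2000 hypothetical users, a page is modeled as 500 consecutive rows in primary key order, and the distinct_pages helper is mine. It counts how many distinct pages hold the rows of one user under each primary key.

```python
import random

ROWS_PER_PAGE = 500   # upper bound mentioned earlier for ~30-byte rows
USERS = 2000          # hypothetical user count
TOTAL_ROWS = 144_000  # roughly the size of the test table

# Generate unique (user_id, friend_user_id) pairs in random arrival order;
# the auto-increment id is simply the arrival rank.
rng = random.Random(1)
seen = set()
rows = []
while len(rows) < TOTAL_ROWS:
    pair = (rng.randrange(1, USERS + 1), rng.randrange(1, USERS + 1))
    if pair[0] != pair[1] and pair not in seen:
        seen.add(pair)
        rows.append((len(rows) + 1, pair[0], pair[1]))


def distinct_pages(rows, primary_key, user_id, rows_per_page=ROWS_PER_PAGE):
    """Count the pages holding rows of user_id when rows are stored in primary key order."""
    stored = sorted(rows, key=primary_key)
    return len({pos // rows_per_page
                for pos, row in enumerate(stored) if row[1] == user_id})


matching = sum(1 for row in rows if row[1] == 1234)
by_id = distinct_pages(rows, lambda r: r[0], 1234)
by_user = distinct_pages(rows, lambda r: (r[1], r[2]), 1234)

print(f"rows for user 1234: {matching}")
print(f"pages touched with PRIMARY KEY (id): {by_id}")
print(f"pages touched with PRIMARY KEY (user_id, friend_user_id): {by_user}")
```

With the id primary key the rows of one user land on many different pages, while with the composite primary key they sit next to each other on one page, or two when they straddle a page boundary.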

Shall we panic because there is less data in the buffer pool? No, because now when InnoDB reads a page from disk, instead of getting only a single matching row, it gets up to hundreds of matching rows. The number of read IOPS is no longer proportional to the number of friends times the rate of select statements. It is now only a function of the incoming rate of select statements. The impact of not having enough memory to cache the whole table is much reduced. As long as the storage can perform more read IOPS than the rate of select statements, all is fine. With the modified table, the relevant lines of the pt-query-digest output are now:

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Rows examine 100   1.23k      16      34   23.72   30.19    4.19   22.53
# pages distin 100     111       2       5    2.09    1.96    0.44    1.96

With the new primary key, instead of 30k read IOPS, MySQL needs to perform only about 588 read IOPS (~300*1.96). It is a workload much easier to handle. The inserts are more expensive but if their rate is 100/s, it just means 100 read IOPS and 100 write IOPS in the worst case.
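The arithmetic behind these numbers is short enough to write down. The sketch below, in Python, uses only figures quoted in this post (300 selects/s, 100 friends per user, 1.96 distinct pages per select, 100 inserts/s) plus the averages from the two pt-query-digest outputs; the helper name read_iops is mine, and it models the worst case where nothing is cached.

```python
def read_iops(selects_per_second, pages_per_select):
    """Worst-case read IOPS when every distinct page comes from disk."""
    return selects_per_second * pages_per_select


SELECT_RATE = 300        # selects per second
FRIENDS_PER_USER = 100   # rows returned per select, on average
INSERT_RATE = 100        # inserts per second

# Old primary key (id): about one page per matching row.
before = read_iops(SELECT_RATE, FRIENDS_PER_USER)
# New primary key (user_id, friend_user_id): 1.96 distinct pages per select.
after = read_iops(SELECT_RATE, 1.96)

# Worst case for inserts: one page read and one page write per row.
insert_reads = INSERT_RATE * 1
insert_writes = INSERT_RATE * 1

# Distinct pages per row examined, from the avg columns of both outputs.
ratio_before = 25.31 / 25.04
ratio_after = 2.09 / 23.72

print(before)                   # 30000
print(round(after))             # 588
print(insert_reads, insert_writes)
print(round(ratio_before, 2), round(ratio_after, 2))   # 1.01 0.09
```

Even after adding the 100 read IOPS and 100 write IOPS of the inserts, the total stays far below the original 30000 read IOPS.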

The above strategy works well when there is a clear access pattern. Off the top of my head, here are a few other examples where there are usually dominant access patterns:

  • Game leaderboards (by user)
  • User preferences (by user)
  • Messaging application (by from or to)
  • User object store (by user)
  • Likes on items (by item)
  • Comments on items (by item)

What can you do when you don’t have a dominant access pattern? One option is the use of a covering index. The covering index needs to cover all the required columns. The order of the columns is also important, as the first must be the grouping value. Another option is to use partitions to create an easy to cache hot spot in the dataset. I’ll discuss these strategies in future posts, as this one is long enough!

We have seen in this post a common strategy used to tune read-intensive workloads. This strategy doesn’t work all the time: you must access the data through a common pattern. But when it works, and you choose good InnoDB primary keys, you are the hero of the day!

The post Tuning InnoDB Primary Keys appeared first on Percona Database Performance Blog.

Upcoming Webinar Tuesday, 7/31: Using MySQL for Distributed Database Architectures


Please join Percona’s CEO, Peter Zaitsev, as he presents Using MySQL for Distributed Database Architectures on Tuesday, July 31st, 2018, at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4).

 

In modern data architectures, we’re increasingly moving from single-node design systems to distributed architectures using multiple nodes – often spread across multiple databases and multiple continents. Such architectures bring many benefits (such as scalability and resiliency), but can also bring a lot of pain if incorrectly designed and executed.

In this presentation, we will look at how we can use MySQL to engineer distributed multi-node systems.

Register for the webinar.

Peter Zaitsev, CEO and Co-Founder

Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With over 140 professionals in 30 plus countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of internet giants, large enterprises and many exciting startups. Inc. 5000 named Percona to their list in 2013, 2014, 2015 and 2016. Peter was an early employee at MySQL AB, eventually leading the company’s High-Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High-Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Database Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his ebook Practical MySQL Performance Optimization is one of Percona’s most popular downloads.

 

The post Upcoming Webinar Tuesday, 7/31: Using MySQL for Distributed Database Architectures appeared first on Percona Database Performance Blog.

Saving With MyRocks in The Cloud


The main focus of a previous blog post was the performance of MyRocks when using fast SSD devices. However, I figured that MyRocks would be beneficial for use in cloud workloads, where storage is either slow or expensive.

In that earlier post, we demonstrated the benefits of MyRocks, especially for heavy IO workloads. Meanwhile, Mark wrote in his blog that the CPU overhead in MyRocks might be significant for CPU-bound workloads, but this should not be the issue for IO-bound workloads.

In the cloud the cost of resources is a major consideration. Let’s review the annual cost for the processing and storage resources.

 Resource               | Resource cost $/year | IO cost $/year | Total $/year
------------------------+----------------------+----------------+--------------
 c5.9xlarge             |                 7881 |                |         7881
 1TB io1 5000 IOPS      |                 1500 |           3900 |         5400
 1TB io1 10000 IOPS     |                 1500 |           7800 |         9300
 1TB io1 15000 IOPS     |                 1500 |          11700 |        13200
 1TB io1 20000 IOPS     |                 1500 |          15600 |        17100
 1TB io1 30000 IOPS     |                 1500 |          23400 |        24900
 3.4TB GP2 (10000 IOPS) |                 4800 |                |         4800

 

The scenario

The server version is Percona Server 5.7.22.

For instances, I used c5.9xlarge instances. The reason for c5 was that it provides high performance Nitro virtualization: Brendan Gregg describes this in his blog post. The rationale for 9xlarge instances was to be able to utilize io1 volumes with a 30000 IOPS throughput – smaller instances will cap io1 throughput at a lower level.

I also used huge gp2 volumes: 3400GB, as this volume provides guaranteed 10000 IOPS even if we do not use io1 volumes. This is a cheaper alternative to io1 volumes to achieve 10000 IOPS.

For the workload I used sysbench-tpcc 5000W (50 tables * 100W), which for InnoDB gave about 471GB in storage used space.

For the cache I used buffer sizes of 27GB and 54GB, so the workload is IO-heavy.

I wanted to compare how InnoDB and RocksDB performed under this scenario.

If you are curious I prepared my terraform+ansible deployment files here: https://github.com/vadimtk/terraform-ansible-percona

Before jumping to the results, I should note that for MyRocks I used LZ4 compression for all levels, which gives a final size of 91GB. That is about five times smaller than the InnoDB size. This alone provides operational benefits: for example, copying the InnoDB files (471GB) from a backup volume takes longer than 1 hour, while it is much faster (about five times) for MyRocks.

The benchmark results

So let’s review the results.

InnoDB versus MyRocks throughput in the cloud

Or presenting average throughput in a tabular form:

 cache size (GB) | IOPS (io1 unless gp2) | engine  | avg TPS
-----------------+-----------------------+---------+---------
              27 | 5000                  | innodb  |  132.66
              27 | 5000                  | rocksdb |  481.03
              27 | 10000                 | innodb  |  285.93
              27 | 10000                 | rocksdb | 1224.14
              27 | 10000gp2              | innodb  |  227.19
              27 | 10000gp2              | rocksdb | 1268.89
              27 | 15000                 | innodb  |  436.04
              27 | 15000                 | rocksdb | 1839.66
              27 | 20000                 | innodb  |  584.69
              27 | 20000                 | rocksdb | 2336.94
              27 | 30000                 | innodb  |  753.86
              27 | 30000                 | rocksdb | 2508.97
              54 | 5000                  | innodb  |  197.51
              54 | 5000                  | rocksdb |  667.63
              54 | 10000                 | innodb  |  433.99
              54 | 10000                 | rocksdb | 1600.01
              54 | 10000gp2              | innodb  |  326.12
              54 | 10000gp2              | rocksdb | 1559.98
              54 | 15000                 | innodb  |  661.34
              54 | 15000                 | rocksdb | 2176.83
              54 | 20000                 | innodb  |  888.74
              54 | 20000                 | rocksdb | 2506.22
              54 | 30000                 | innodb  | 1097.31
              54 | 30000                 | rocksdb | 2690.91

 

We can see that MyRocks outperformed InnoDB in every single combination, but it is also important to note the following:

MyRocks on io1 5000 IOPS showed the performance that InnoDB showed in io1 15000 IOPS.

That means that InnoDB requires three times the provisioned storage IOPS to match MyRocks. In terms of storage cost, the IO portion is three times more expensive ($11700 versus $3900 per year in the cost table above). Given that MyRocks also requires less storage space, it is possible to save even more on storage capacity.

On the most economical storage (3400GB gp2, which provides 10000 IOPS), MyRocks showed about 4.8 times better throughput with the 54GB cache, and about 5.6 times with the 27GB cache.

For the 30000 IOPS storage, MyRocks was still better by 2.45 times with the 54GB cache (3.3 times with the 27GB cache).
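The ratios quoted above can be rechecked directly from the throughput and cost tables. The sketch below copies a subset of the published numbers; the storage labels such as "io1-5000" and the helper name speedup are just illustrative conventions for this example.

```python
# Average TPS copied from the throughput table, keyed by (cache size in GB, storage).
AVG_TPS = {
    (27, "io1-5000"): {"innodb": 132.66, "rocksdb": 481.03},
    (27, "io1-15000"): {"innodb": 436.04, "rocksdb": 1839.66},
    (27, "gp2-10000"): {"innodb": 227.19, "rocksdb": 1268.89},
    (27, "io1-30000"): {"innodb": 753.86, "rocksdb": 2508.97},
    (54, "io1-5000"): {"innodb": 197.51, "rocksdb": 667.63},
    (54, "io1-15000"): {"innodb": 661.34, "rocksdb": 2176.83},
    (54, "gp2-10000"): {"innodb": 326.12, "rocksdb": 1559.98},
    (54, "io1-30000"): {"innodb": 1097.31, "rocksdb": 2690.91},
}

# Annual io1 IO cost in dollars, copied from the cost table.
IO_COST_PER_YEAR = {5000: 3900, 15000: 11700}


def speedup(cache_gb: int, storage: str) -> float:
    """MyRocks average TPS divided by InnoDB average TPS for one configuration."""
    row = AVG_TPS[(cache_gb, storage)]
    return row["rocksdb"] / row["innodb"]


for cache_gb in (27, 54):
    for storage in ("gp2-10000", "io1-30000"):
        ratio = speedup(cache_gb, storage)
        print(f"{cache_gb}GB cache, {storage}: MyRocks/InnoDB = {ratio:.2f}x")

cost_ratio = IO_COST_PER_YEAR[15000] / IO_COST_PER_YEAR[5000]
print(f"io1 15000 vs 5000 IOPS, annual IO cost ratio: {cost_ratio:.1f}x")
```

The same dictionary makes it easy to compare other pairs, for example MyRocks on io1 5000 IOPS against InnoDB on io1 15000 IOPS.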

However, it is worth noting that MyRocks showed a greater variance in throughput during the runs. Let’s review the charts with 1 sec resolution for GP2 and io1 30000 IOPS storage:

Throughput at 1 sec resolution for GP2 and io1 30000 IOPS storage, MyRocks versus InnoDB

Such variance might be problematic for workloads that require stable throughput and where periodic slowdowns are unacceptable.

Conclusion

MyRocks is suitable and beneficial not only for fast SSD, but also for cloud deployments. By requiring fewer IOPS, MyRocks can provide better performance and save on storage costs.

However, before evaluating MyRocks, make sure that your workload is IO-bound, i.e. the working set is much bigger than available memory. For CPU-intensive workloads (where the working set fits into memory), MyRocks will be less beneficial or even perform worse than InnoDB (as described in the blog post A Look at MyRocks Performance).

 

 

 

The post Saving With MyRocks in The Cloud appeared first on Percona Database Performance Blog.

Basic Understanding of Bloat and VACUUM in PostgreSQL


The implementation of MVCC (Multi-Version Concurrency Control) in PostgreSQL differs from that of other RDBMSs. MVCC in PostgreSQL controls which tuples are visible to transactions via versioning.

What is versioning in PostgreSQL?

Let’s consider the case of an Oracle or a MySQL database. What happens when you perform a DELETE or an UPDATE of a row? An UNDO record is maintained in a global UNDO segment. This UNDO segment contains the past image of the row, to help the database achieve consistency (the “C” in A.C.I.D). For example, if there is an old transaction that depends on the row that got deleted, the row may still be visible to it because the past image is still maintained in the UNDO. If you are an Oracle DBA reading this blog post, you may quickly recollect the error ORA-01555 snapshot too old. This error means that undo_retention may be too small, or that the UNDO segment may not be large enough to retain all the past images (versions) needed by the existing or old transactions.

You may not have to worry about that with PostgreSQL.

Then how does PostgreSQL manage UNDO?

In simple terms, PostgreSQL maintains both the past image and the latest image of a row in the table itself. In other words, UNDO is maintained within each table, and this is done through versioning. This hints that every row of a PostgreSQL table carries a version number, and that is correct. To understand how these versions are maintained within each table, you should understand the hidden columns of a table (especially xmin) in PostgreSQL.

Understanding the Hidden Columns of a Table

When you describe a table, you only see the columns you have added, as in the following log.

percona=# \d scott.employee
                                          Table "scott.employee"
  Column  |          Type          | Collation | Nullable |                    Default
----------+------------------------+-----------+----------+------------------------------------------------
 emp_id   | integer                |           | not null | nextval('scott.employee_emp_id_seq'::regclass)
 emp_name | character varying(100) |           |          |
 dept_id  | integer                |           |          |

However, if you look at all the columns of the table in pg_attribute, you should see several hidden columns as you see in the following log.

percona=# SELECT attname, format_type (atttypid, atttypmod)
FROM pg_attribute
WHERE attrelid::regclass::text='scott.employee'
ORDER BY attnum;
 attname  |      format_type
----------+------------------------
 tableoid | oid
 cmax     | cid
 xmax     | xid
 cmin     | cid
 xmin     | xid
 ctid     | tid
 emp_id   | integer
 emp_name | character varying(100)
 dept_id  | integer
(9 rows)

Let’s understand a few of these hidden columns in detail.

tableoid : Contains the OID of the table that contains this row. Used by queries that select from inheritance hierarchies.
More details on table inheritance can be found here : https://www.postgresql.org/docs/10/static/ddl-inherit.html

xmin : The transaction ID (xid) of the inserting transaction for this row version. Upon update, a new row version is inserted. Let’s see the following log to understand xmin better.

percona=# select txid_current();
 txid_current
--------------
          646
(1 row)
percona=# INSERT into scott.employee VALUES (9,'avi',9);
INSERT 0 1
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 9;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
------+------+------+------+--------+----------+---------
  647 |    0 |    0 |    0 |      9 | avi      |       9
(1 row)

As you see in the above log, the transaction ID was 646 for the command select txid_current(). Thus, the INSERT statement that immediately followed got transaction ID 647, and the record was assigned an xmin of 647. This means that no transaction that started before transaction 647 can see this row. In other words, already running transactions with a txid less than 647 cannot see the row inserted by txid 647.

With the above example, you should now understand that every tuple has an xmin, which is set to the txid of the transaction that inserted it.

Note: this behavior may change depending on the isolation level you choose; that will be discussed in a later blog post.

xmax : This value is 0 if the row version has not been deleted. Before the DELETE is committed, the xmax of the row version changes to the ID of the transaction that issued the DELETE. Let’s observe the following log to understand that better.

On Terminal A : We open a transaction and delete a row without committing it.

percona=# BEGIN;
BEGIN
percona=# select txid_current();
 txid_current
--------------
          655
(1 row)
percona=# DELETE from scott.employee where emp_id = 10;
DELETE 1

On Terminal B : Observe the xmax values before and after the delete (that has not been committed).

Before the Delete
------------------
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 10;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
------+------+------+------+--------+----------+---------
  649 |    0 |    0 |    0 |     10 | avi      |      10
After the Delete
------------------
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 10;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
------+------+------+------+--------+----------+---------
  649 |  655 |    0 |    0 |     10 | avi      |      10
(1 row)

As you see in the above logs, the xmax value changed to the ID of the transaction that issued the delete. If you issue a ROLLBACK, or if the transaction is aborted, xmax remains at the ID of the transaction that tried to DELETE the row (655 in this case).

Now that we understand the hidden columns xmin and xmax, let’s observe what happens after a DELETE or an UPDATE in PostgreSQL. As discussed earlier, these hidden columns show that multiple versions of a row are maintained within each table. Let’s see the following example to understand this better.

We’ll insert 10 records into the table scott.employee:

percona=# INSERT into scott.employee VALUES (generate_series(1,10),'avi',1);
INSERT 0 10

Now, let’s DELETE 5 records from the table.

percona=# DELETE from scott.employee where emp_id > 5;
DELETE 5
percona=# select count(*) from scott.employee;
 count
-------
     5
(1 row)

Now, when you check the count after the DELETE, you do not see the records that have been deleted. To see row versions that exist in the table but are not visible, we have an extension called pageinspect. The pageinspect module provides functions that allow you to inspect the contents of database pages at a low level, which is useful for debugging purposes. Let’s create this extension to see the older row versions that have been deleted.

percona=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
percona=# SELECT t_xmin, t_xmax, tuple_data_split('scott.employee'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('scott.employee', 0));
 t_xmin | t_xmax |              tuple_data_split
--------+--------+---------------------------------------------
    668 |      0 | {"\\x01000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x02000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x03000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x04000000","\\x09617669","\\x01000000"}
    668 |      0 | {"\\x05000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x06000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x07000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x08000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x09000000","\\x09617669","\\x01000000"}
    668 |    669 | {"\\x0a000000","\\x09617669","\\x01000000"}
(10 rows)

We can still see 10 records in the table even after deleting 5 of them. You can also observe that t_xmax is set to the ID of the transaction that deleted them. These deleted records are retained in the same table to serve any older transactions that are still accessing them.
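The byte strings returned by tuple_data_split can be decoded by hand. The sketch below is an illustration only: it assumes the server runs on a little-endian machine (such as x86-64) and that the varchar value is short enough to use PostgreSQL's 1-byte length header. The helper names are made up for this example.

```python
def decode_int4(hex_str: str) -> int:
    """Decode a 4-byte integer column as stored on a little-endian machine."""
    return int.from_bytes(bytes.fromhex(hex_str), byteorder="little", signed=True)


def decode_short_varchar(hex_str: str) -> str:
    """Decode a short varchar: a 1-byte length header followed by the text bytes."""
    raw = bytes.fromhex(hex_str)
    header = raw[0]
    # Assumption: little-endian build, where a 1-byte header has its low bit
    # set and the remaining bits hold the total length including the header.
    if not header & 0x01:
        raise ValueError("not a 1-byte length header")
    total_len = header >> 1
    return raw[1:total_len].decode("utf-8")


# Last row of the output above, without the \\x prefix on each field.
row = ["0a000000", "09617669", "01000000"]
emp_id = decode_int4(row[0])
emp_name = decode_short_varchar(row[1])
dept_id = decode_int4(row[2])
print(emp_id, emp_name, dept_id)
```

Under these assumptions, the three fields map back to the emp_id, emp_name and dept_id values that were inserted.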

We’ll take a look at what an UPDATE does in the following log.

percona=# DROP TABLE scott.employee ;
DROP TABLE
percona=# CREATE TABLE scott.employee (emp_id INT, emp_name VARCHAR(100), dept_id INT);
CREATE TABLE
percona=# INSERT into scott.employee VALUES (generate_series(1,10),'avi',1);
INSERT 0 10
percona=# UPDATE scott.employee SET emp_name = 'avii';
UPDATE 10
percona=# SELECT t_xmin, t_xmax, tuple_data_split('scott.employee'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('scott.employee', 0));
 t_xmin | t_xmax |               tuple_data_split
--------+--------+-----------------------------------------------
    672 |    673 | {"\\x01000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x02000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x03000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x04000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x05000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x06000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x07000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x08000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x09000000","\\x09617669","\\x01000000"}
    672 |    673 | {"\\x0a000000","\\x09617669","\\x01000000"}
    673 |      0 | {"\\x01000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x02000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x03000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x04000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x05000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x06000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x07000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x08000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x09000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x0a000000","\\x0b61766969","\\x01000000"}
(20 rows)

An UPDATE in PostgreSQL performs an insert and a delete. Hence, all the records being updated have been deleted and inserted back with the new value. Deleted records have a non-zero t_xmax value.

Records for which you see a non-zero value for t_xmax may be required by the previous transactions to ensure consistency based on appropriate isolation levels.
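To make the "UPDATE is a delete plus an insert" idea concrete, here is a toy model of a table that only ever appends row versions. It is a deliberately simplified sketch, not PostgreSQL's real page format or visibility logic: the class and method names are invented for illustration, and every non-zero xmax is treated as a committed delete.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Row = Tuple[int, str, int]


@dataclass
class TupleVersion:
    xmin: int  # ID of the transaction that inserted this version
    xmax: int  # ID of the transaction that deleted it, 0 if none
    data: Row


class ToyHeap:
    """Append-only toy table (illustrative only)."""

    def __init__(self) -> None:
        self.versions: List[TupleVersion] = []

    def insert(self, txid: int, data: Row) -> None:
        self.versions.append(TupleVersion(xmin=txid, xmax=0, data=data))

    def update(self, txid: int, match: Callable[[Row], bool],
               change: Callable[[Row], Row]) -> None:
        # Collect the targets first so that new versions are not updated again.
        targets = [v for v in self.versions if v.xmax == 0 and match(v.data)]
        for old in targets:
            old.xmax = txid                      # the "delete" half
            self.insert(txid, change(old.data))  # the "insert" half

    def live(self) -> List[TupleVersion]:
        return [v for v in self.versions if v.xmax == 0]

    def dead(self) -> List[TupleVersion]:
        return [v for v in self.versions if v.xmax != 0]


heap = ToyHeap()
for emp_id in range(1, 11):
    heap.insert(672, (emp_id, "avi", 1))
heap.update(673, lambda row: True, lambda row: (row[0], "avii", row[2]))

for v in heap.versions:
    print(v.xmin, v.xmax, v.data)
```

Using the same transaction IDs as the log above (672 for the inserts, 673 for the update), the model ends up with the old versions carrying an xmax of 673 and the new versions carrying an xmin of 673.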

We discussed xmin and xmax. What about the hidden columns cmin and cmax?

cmax : The command identifier within the deleting transaction, or zero (as per the documentation). However, cmin and cmax are stored in the same field of the tuple header in the PostgreSQL source code, so they always show the same value.

cmin : The command identifier within the inserting transaction. You can see the cmin of the 8 insert statements starting at 0 in the following log.

See the following log to understand how the cmin and cmax values change through inserts and deletes in a transaction.

On Terminal A
---------------
percona=# BEGIN;
BEGIN
percona=# INSERT into scott.employee VALUES (1,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (2,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (3,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (4,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (5,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (6,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (7,'avi',2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (8,'avi',2);
INSERT 0 1
percona=# COMMIT;
COMMIT
percona=# select xmin,xmax,cmin,cmax,* from scott.employee;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
------+------+------+------+--------+----------+---------
  644 |    0 |    0 |    0 |      1 | avi      |       2
  644 |    0 |    1 |    1 |      2 | avi      |       2
  644 |    0 |    2 |    2 |      3 | avi      |       2
  644 |    0 |    3 |    3 |      4 | avi      |       2
  644 |    0 |    4 |    4 |      5 | avi      |       2
  644 |    0 |    5 |    5 |      6 | avi      |       2
  644 |    0 |    6 |    6 |      7 | avi      |       2
  644 |    0 |    7 |    7 |      8 | avi      |       2
(8 rows)

If you observe the above output log, you see cmin and cmax values incrementing for each insert.

Now let’s delete 3 records from Terminal A and observe how the values appear in Terminal B before COMMIT.

On Terminal A
---------------
percona=# BEGIN;
BEGIN
percona=# DELETE from scott.employee where emp_id = 4;
DELETE 1
percona=# DELETE from scott.employee where emp_id = 5;
DELETE 1
percona=# DELETE from scott.employee where emp_id = 6;
DELETE 1
On Terminal B, before issuing COMMIT on Terminal A
----------------------------------------------------
percona=# select xmin,xmax,cmin,cmax,* from scott.employee;
 xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
------+------+------+------+--------+----------+---------
  644 |    0 |    0 |    0 |      1 | avi      |       2
  644 |    0 |    1 |    1 |      2 | avi      |       2
  644 |    0 |    2 |    2 |      3 | avi      |       2
  644 |  645 |    0 |    0 |      4 | avi      |       2
  644 |  645 |    1 |    1 |      5 | avi      |       2
  644 |  645 |    2 |    2 |      6 | avi      |       2
  644 |    0 |    6 |    6 |      7 | avi      |       2
  644 |    0 |    7 |    7 |      8 | avi      |       2
(8 rows)

Now, in the above log, you see that the cmax and cmin values have started again from 0 and increment for the records being deleted. Their values were different before the delete, as we have seen earlier. Even if you ROLLBACK, the values remain the same.

After understanding the hidden columns and how PostgreSQL maintains UNDO as multiple versions of rows, the next question would be—what would clean up this UNDO from a table? Doesn’t this increase the size of a table continuously? In order to understand that better, we need to know about VACUUM in PostgreSQL.

VACUUM in PostgreSQL

As seen in the above examples, every record that has been deleted but still takes up space is called a dead tuple. Once no running transaction depends on those dead tuples any longer, they are no longer needed, and PostgreSQL runs VACUUM on such tables. VACUUM reclaims the storage occupied by these dead tuples. The space occupied by these dead tuples may be referred to as bloat. VACUUM scans the pages for dead tuples and records the space they free in the free space map (FSM). Each relation apart from hash indexes has an FSM stored in a separate file called <relation_oid>_fsm.

Here, relation_oid is the OID of the relation as shown in pg_class. Strictly speaking, the file is named after the relation’s relfilenode, which usually matches the OID until the table is rewritten (for example by VACUUM FULL or TRUNCATE).

percona=# select oid from pg_class where relname = 'employee';
  oid
-------
 24613
(1 row)

Upon VACUUM, this space is not returned to the filesystem, but it can be re-used by future inserts on this table. VACUUM stores the free space available on each heap (or index) page in the FSM file.

Running a plain VACUUM is a non-blocking operation. It takes a SHARE UPDATE EXCLUSIVE lock, which does not block reads or writes on the table. This means VACUUM can run on a busy transactional table in production while there are several transactions writing to it.

As we discussed earlier, an UPDATE of 10 records has generated 10 dead tuples. Let us see the following log to understand what happens to those dead tuples after a VACUUM.

percona=# VACUUM scott.employee ;
VACUUM
percona=# SELECT t_xmin, t_xmax, tuple_data_split('scott.employee'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('scott.employee', 0));
 t_xmin | t_xmax |               tuple_data_split
--------+--------+-----------------------------------------------
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
        |        |
    673 |      0 | {"\\x01000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x02000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x03000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x04000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x05000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x06000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x07000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x08000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x09000000","\\x0b61766969","\\x01000000"}
    673 |      0 | {"\\x0a000000","\\x0b61766969","\\x01000000"}
(20 rows)

In the above log, you might notice that the dead tuples are removed and the space is available for re-use. However, this space is not returned to the filesystem after VACUUM; only future inserts (and the new row versions created by updates) on this table can use it.

VACUUM does an additional task. Rows that were inserted and successfully committed far enough in the past are marked as frozen, which indicates that they are visible to all current and future transactions. We will discuss this in detail in a future blog post, “Transaction ID Wraparound in PostgreSQL”.

VACUUM does not usually return space to the filesystem unless the dead tuples are beyond the high water mark, that is, in pages at the end of the table that contain no live tuples.

Let’s consider the following example to see when a VACUUM could release the space to filesystem.

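Create a table and insert some sample records. Because generate_series inserts the rows in ascending emp_id order into an empty table, they are physically stored on disk in that order. (PostgreSQL heap tables are not kept ordered by the primary key as such.)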

percona=# CREATE TABLE scott.employee (emp_id int PRIMARY KEY, name varchar(20), dept_id int);
CREATE TABLE
percona=# INSERT INTO scott.employee VALUES (generate_series(1,1000), 'avi', 1);
INSERT 0 1000

Now, run ANALYZE on the table to update its statistics and see how many pages are allocated to the table after the above insert.

percona=# ANALYZE scott.employee ;
ANALYZE
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
 relpages | total_bytes | relsize
----------+-------------+---------
        6 |       49152 |   49152
(1 row)

Let’s now see how VACUUM behaves when you delete the rows with emp_id > 500

percona=# DELETE from scott.employee where emp_id > 500;
DELETE 500
percona=# VACUUM ANALYZE scott.employee ;
VACUUM
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
 relpages | total_bytes | relsize
----------+-------------+---------
        3 |       24576 |   24576
(1 row)

In the above log, you see that VACUUM has returned half the space to the filesystem. Earlier, the table occupied 6 pages (8KB each, or as set by the block_size parameter). After VACUUM, it has released 3 pages to the filesystem.

Now, let’s repeat the same exercise by deleting the rows with emp_id < 500

percona=# DELETE from scott.employee ;
DELETE 500
percona=# INSERT INTO scott.employee VALUES (generate_series(1,1000), 'avi', 1);
INSERT 0 1000
percona=# DELETE from scott.employee where emp_id < 500;
DELETE 499
percona=# VACUUM ANALYZE scott.employee ;
VACUUM
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
 relpages | total_bytes | relsize
----------+-------------+---------
        6 |       49152 |   49152
(1 row)

In the above example, you see that the number of pages remains the same after deleting half the records from the table. This means VACUUM has not released the space to the filesystem this time.

As explained earlier, VACUUM can only release pages at the end of the table that no longer contain any live tuples. In the first case, there were no live tuples after the 3rd page, so the 4th, 5th and 6th pages were truncated and returned to the filesystem. In the second case, the remaining live tuples (emp_id >= 500) sit in the last pages, so nothing could be truncated.
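The difference between the two cases can be sketched with a simplified model of a freshly loaded table. The model only captures the rule that trailing pages without live tuples can be cut off. The value of 185 rows per page is an assumption chosen so that 1000 rows need 6 pages, as in the logs above, and the function names are invented for this example.

```python
from typing import Iterable, List

ROWS_PER_PAGE = 185  # assumption: makes 1000 rows occupy 6 pages, as in the log


def live_tuples_per_page(live_ids: Iterable[int], total_rows: int,
                         rows_per_page: int) -> List[int]:
    """Count live rows per page for rows stored in insertion order 1..total_rows."""
    n_pages = -(-total_rows // rows_per_page)  # ceiling division
    counts = [0] * n_pages
    for emp_id in live_ids:
        counts[(emp_id - 1) // rows_per_page] += 1
    return counts


def pages_after_vacuum(counts: List[int]) -> int:
    """Pages left once trailing pages with no live tuples are truncated."""
    remaining = len(counts)
    while remaining > 0 and counts[remaining - 1] == 0:
        remaining -= 1
    return remaining


# Case 1: rows with emp_id > 500 were deleted, so the live rows are 1..500.
first_case = live_tuples_per_page(range(1, 501), 1000, ROWS_PER_PAGE)
# Case 2: rows with emp_id < 500 were deleted, so the live rows are 500..1000.
second_case = live_tuples_per_page(range(500, 1001), 1000, ROWS_PER_PAGE)

print("case 1:", first_case, "->", pages_after_vacuum(first_case), "pages")
print("case 2:", second_case, "->", pages_after_vacuum(second_case), "pages")
```

In the first case the empty pages are at the end of the table and can be dropped; in the second case the empty pages are at the beginning, so the page count stays the same.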

However, if you need to return the space to the filesystem in the scenario where we deleted all the records with emp_id < 500, you may run VACUUM FULL. VACUUM FULL rebuilds the entire table and returns the freed space to the filesystem.

percona=# VACUUM FULL scott.employee ;
VACUUM
percona=# VACUUM ANALYZE scott.employee ;
VACUUM
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size('scott.employee') as relsize
FROM pg_class
WHERE relname = 'employee';
 relpages | total_bytes | relsize
----------+-------------+---------
        3 |       24576 |   24576
(1 row)

Please note that VACUUM FULL is not an online operation. It is a blocking operation: you cannot read from or write to the table while VACUUM FULL is in progress. We will discuss ways to rebuild a table online without blocking in a future blog post.

The post Basic Understanding of Bloat and VACUUM in PostgreSQL appeared first on Percona Database Performance Blog.


Replicating from MySQL 8.0 to MySQL 5.7


In this blog post, we’ll discuss how to set up replication from MySQL 8.0 to MySQL 5.7. There are some situations where this configuration might help. For example, during a MySQL upgrade, it can be useful to replicate from a master running the newer version to a slave running the older version as a rollback plan. Another example is upgrading a master-master replication topology.

Officially, replication is only supported between consecutive major MySQL versions, and only from a lower version master to a higher version slave. Here is an example of a supported scenario:

5.7 master –> 8.0 slave

while the opposite is not supported:

8.0 master –> 5.7 slave

In this blog post, I’ll walk through how to overcome the initial problems and get replication working in this scenario. I’ll also show some errors that can halt replication if a new MySQL 8 feature is used.

Here is the initial set up that will be used to build the topology:

slave > select @@version;
+------------+
| @@version  |
+------------+
| 5.7.17-log |
+------------+
1 row in set (0.00 sec)
master > select @@version;
+-----------+
| @@version |
+-----------+
| 8.0.12    |
+-----------+
1 row in set (0.00 sec)

First, before executing the CHANGE MASTER command, you need to modify the collation on the master server. Otherwise the replication will run into this error:

slave > show slave status\G
                   Last_Errno: 22
                   Last_Error: Error 'Character set '#255' is not a compiled character set and is not specified in the '/opt/percona_server/5.7.17/share/charsets/Index.xml' file' on query. Default database: 'mysql8_1'. Query: 'create database mysql8_1'

This is because the default character set and collation have changed in MySQL 8. According to the documentation:

The default value of the character_set_server and character_set_database system variables has changed from latin1 to utf8mb4.

The default value of the collation_server and collation_database system variables has changed from latin1_swedish_ci to utf8mb4_0900_ai_ci.

Let’s change the collation and the character set to utf8 on MySQL 8 (it is possible to use any option that exists in both versions):

# master my.cnf
[client]
default-character-set=utf8
[mysqld]
character-set-server=utf8
collation-server=utf8_unicode_ci

You need to restart MySQL 8 to apply the changes. Next, after the restart, you have to create a replication user using mysql_native_password. This is because MySQL 8 changed the default authentication plugin to caching_sha2_password, which is not supported by MySQL 5.7. If you try to execute the CHANGE MASTER command with a user that uses the caching_sha2_password plugin, you will receive the error message below:

Last_IO_Errno: 2059
Last_IO_Error: error connecting to master 'root@127.0.0.1:19025' - retry-time: 60 retries: 1

To create a user using mysql_native_password :

master> CREATE USER 'replica_user'@'%' IDENTIFIED WITH mysql_native_password BY 'repli$cat';
master> GRANT REPLICATION SLAVE ON *.* TO 'replica_user'@'%';

Finally, we can proceed as usual to build the replication:

master > show master status\G
*************************** 1. row ***************************
File: mysql-bin.000007
Position: 155
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set:
1 row in set (0.00 sec)
slave > CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_USER='replica_user', MASTER_PASSWORD='repli$cat',MASTER_PORT=19025, MASTER_LOG_FILE='mysql-bin.000007', MASTER_LOG_POS=155; start slave;
Query OK, 0 rows affected, 2 warnings (0.01 sec)
Query OK, 0 rows affected (0.00 sec)
# This procedure works with GTIDs too
slave > CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_USER='replica_user', MASTER_PASSWORD='repli$cat',MASTER_PORT=19025,MASTER_AUTO_POSITION = 1 ; start slave;

Checking the replication status:

slave > show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: replica_user
Master_Port: 19025
Connect_Retry: 60
Master_Log_File: mysql-bin.000007
Read_Master_Log_Pos: 155
Relay_Log_File: mysql-relay.000002
Relay_Log_Pos: 321
Relay_Master_Log_File: mysql-bin.000007
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 155
Relay_Log_Space: 524
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 100
Master_UUID: 00019025-1111-1111-1111-111111111111
Master_Info_File: /home/vinicius.grippa/sandboxes/rsandbox_5_7_17/master/data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.01 sec)

Executing a quick test to check if the replication is working:

master > create database vinnie;
Query OK, 1 row affected (0.06 sec)

slave > show databases like 'vinnie';
+-------------------+
| Database (vinnie) |
+-------------------+
| vinnie |
+-------------------+
1 row in set (0.00 sec)

Caveats

Any attempt to use a new MySQL 8 feature, such as roles, invisible indexes or caching_sha2_password, will make the replication stop with an error:

master > alter user replica_user identified with caching_sha2_password by 'sekret';
Query OK, 0 rows affected (0.01 sec)

slave > show slave status\G
               Last_SQL_Errno: 1396
               Last_SQL_Error: Error 'Operation ALTER USER failed for 'replica_user'@'%'' on query. Default database: ''. Query: 'ALTER USER 'replica_user'@'%' IDENTIFIED WITH 'caching_sha2_password' AS '$A$005$H	MEDi\"gQ
                        wR{/I/VjlgBIUB08h1jIk4fBzV8kU1J2RTqeqMq8Q2aox0''

Summary

Replicating from MySQL 8 to MySQL 5.7 is possible. In some scenarios (especially upgrades) this might be helpful, but a heterogeneous topology is not advisable because it is prone to errors and incompatibilities in some cases.

The post Replicating from MySQL 8.0 to MySQL 5.7 appeared first on Percona Database Performance Blog.

Resource Usage Improvements in Percona Monitoring and Management 1.13

In Percona Monitoring and Management (PMM) 1.13 we have adopted Prometheus 2, and with this comes a dramatic improvement in resource usage, along with performance improvements!

What does it mean for you? This means you can have a significantly larger number of servers and database instances monitored by the same PMM installation. Or you can reduce the instance size you use to monitor your environment and save some money.

Let’s look at some stats!

CPU Usage

PMM 1.13 reduction in CPU usage by 5x

Percona Monitoring and Management 1.13 reduction in CPU usage after adopting Prometheus 2 by 8x

We can see an approximate 5x and 8x reduction of CPU usage on these two PMM Servers. Depending on the workload, we see CPU usage reductions ranging between 3x and 10x.

Disk Writes

There is also less disk write bandwidth required:

PMM 1.13 reduction in disk write bandwidth

On this instance, the bandwidth reduction is “just” 1.5x. Note this is disk IO for the entire PMM system, which includes more than only the Prometheus component. Prometheus 2 itself promises much more significant IO bandwidth reduction according to official benchmarks.

According to the same benchmark, you should expect disk space usage reduction by 33-50% for Prometheus 2 vs Prometheus 1.8. The numbers will be less for Percona Monitoring and Management, as it also stores Query Statistics outside of Prometheus.

Resource usage on the monitored hosts

Also, resource usage on the monitored hosts is significantly reduced:

Percona Monitoring and Management 1.13 reduction of resource usage by Prometheus 2

Why does CPU usage go down on a monitored host with a Prometheus 2 upgrade? This is because PMM uses TLS for the Prometheus to monitored host communication. Before Prometheus 2, a full handshake was performed for every scrape, taking a lot of CPU time. This was optimized with Prometheus 2, resulting in a dramatic CPU usage decrease.

Query performance is also a lot better with Prometheus 2, meaning dashboards visually load a lot faster, though we did not do any specific benchmarks here to share the hard numbers. Note though this improvement only applies when you’re querying the data which is stored in Prometheus 2.

If you’re querying data that was originally stored in Prometheus 1.8, it will be queried through the “Remote Read” interface, which is quite a bit slower and uses a lot more CPU and memory resources.

If you love better efficiency and performance, consider upgrading to PMM 1.13!

The post Resource Usage Improvements in Percona Monitoring and Management 1.13 appeared first on Percona Database Performance Blog.

Tuning Autovacuum in PostgreSQL and Autovacuum Internals

The performance of a PostgreSQL database can be compromised by dead tuples, since they continue to occupy space and can lead to bloat. We provided an introduction to VACUUM and bloat in an earlier blog post. Now, though, it’s time to look at autovacuum in PostgreSQL, and the internals you need to know to maintain the high performance PostgreSQL database that demanding applications need.

What is autovacuum ?

Autovacuum is one of the background utility processes that starts automatically when you start PostgreSQL. As you see in the following log, the postmaster (parent PostgreSQL process) with pid 2862 has started the autovacuum launcher process with pid 2868. To start autovacuum, you must have the parameter autovacuum set to ON. In fact, you should not set it to OFF in a production system unless you are 100% sure about what you are doing and its implications.

avi@percona:~$ ps -eaf | egrep "/post|autovacuum"
postgres  2862     1  0 Jun17 pts/0    00:00:11 /usr/pgsql-10/bin/postgres -D /var/lib/pgsql/10/data
postgres  2868  2862  0 Jun17 ?        00:00:10 postgres: autovacuum launcher process
postgres 15427  4398  0 18:35 pts/1    00:00:00 grep -E --color=auto /post|autovacuum

Why is autovacuum needed ? 

We need VACUUM to remove dead tuples, so that the space occupied by dead tuples can be re-used by the table for future inserts/updates. To know more about dead tuples and bloat, please read our previous blog post. We also need ANALYZE on the table that updates the table statistics, so that the optimizer can choose optimal execution plans for an SQL statement. It is the autovacuum in postgres that is responsible for performing both vacuum and analyze on tables.

There exists another background process in postgres called Stats Collector that tracks the usage and activity information. The information collected by this process is used by autovacuum launcher to identify the list of candidate tables for autovacuum. PostgreSQL identifies the tables needing vacuum or analyze automatically, but only when autovacuum is enabled. This ensures that postgres heals itself and stops the database from developing more bloat/fragmentation.

Parameters needed to enable autovacuum in PostgreSQL are :

autovacuum = on  # ( ON by default )
track_counts = on # ( ON by default )

track_counts is used by the stats collector. Without it in place, autovacuum cannot identify the candidate tables.

Logging autovacuum

You may eventually want to log the tables on which autovacuum spends the most time. In that case, set the parameter log_autovacuum_min_duration to a value (the unit defaults to milliseconds), so that any autovacuum that runs for longer than this value is logged to the PostgreSQL log file. This may help you tune your table level autovacuum settings appropriately.

# Setting this parameter to 0 logs every autovacuum to the log file.
log_autovacuum_min_duration = '250ms' # Or 1s, 1min, 1h, 1d

Here is an example log of autovacuum vacuum and analyze

< 2018-08-06 07:22:35.040 EDT > LOG: automatic vacuum of table "vactest.scott.employee": index scans: 0
pages: 0 removed, 1190 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 110008 removed, 110008 remain, 0 are dead but not yet removable
buffer usage: 2402 hits, 2 misses, 0 dirtied
avg read rate: 0.057 MB/s, avg write rate: 0.000 MB/s
system usage: CPU 0.00s/0.02u sec elapsed 0.27 sec
< 2018-08-06 07:22:35.199 EDT > LOG: automatic analyze of table "vactest.scott.employee" system usage: CPU 0.00s/0.02u sec elapsed 0.15 sec
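Once these entries accumulate, it helps to pull out the table name and elapsed time from each one. The following is a minimal Python sketch that assumes the log format matches the excerpt above exactly; the function name and the regular expression are illustrative, not part of PostgreSQL, and other versions or log_line_prefix settings may need a different pattern.

```python
import re

# Sample entries copied from the log excerpt above.
SAMPLE_LOG = '''< 2018-08-06 07:22:35.040 EDT > LOG: automatic vacuum of table "vactest.scott.employee": index scans: 0
pages: 0 removed, 1190 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 110008 removed, 110008 remain, 0 are dead but not yet removable
buffer usage: 2402 hits, 2 misses, 0 dirtied
avg read rate: 0.057 MB/s, avg write rate: 0.000 MB/s
system usage: CPU 0.00s/0.02u sec elapsed 0.27 sec
< 2018-08-06 07:22:35.199 EDT > LOG: automatic analyze of table "vactest.scott.employee" system usage: CPU 0.00s/0.02u sec elapsed 0.15 sec'''

# Matches the start of an autovacuum entry and, lazily, the first
# "elapsed ... sec" value that follows it (entries can span several lines).
ENTRY_RE = re.compile(
    r'automatic (vacuum|analyze) of table "([^"]+)".*?elapsed ([\d.]+) sec',
    re.DOTALL,
)


def parse_autovacuum_entries(log_text):
    """Return a list of (operation, table, elapsed_seconds) tuples."""
    return [
        (operation, table, float(elapsed))
        for operation, table, elapsed in ENTRY_RE.findall(log_text)
    ]


for operation, table, elapsed in parse_autovacuum_entries(SAMPLE_LOG):
    print(f"{operation:8s} {table} took {elapsed} sec")
```

Sorting the parsed tuples by elapsed time is one simple way to shortlist tables whose autovacuum settings deserve a closer look.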

When does PostgreSQL run autovacuum on a table ? 

As discussed earlier, autovacuum in postgres refers to both automatic VACUUM and ANALYZE and not just VACUUM. An automatic vacuum or analyze runs on a table depending on the following formulas.

The formula for calculating the effective table level autovacuum threshold is :

Autovacuum VACUUM threshold for a table = autovacuum_vacuum_scale_factor * number of tuples + autovacuum_vacuum_threshold

With the equation above, it is clear that if the actual number of dead tuples in a table exceeds this effective threshold, due to updates and deletes, that table becomes a candidate for autovacuum vacuum.

Autovacuum ANALYZE threshold for a table = autovacuum_analyze_scale_factor * number of tuples + autovacuum_analyze_threshold

The above equation says that any table with a total number of inserts/deletes/updates exceeding this threshold—since last analyze—is eligible for an autovacuum analyze.

Let’s understand these parameters in detail.

  • autovacuum_vacuum_scale_factor or autovacuum_analyze_scale_factor : Fraction of the table records that will be added to the formula. For example, a value of 0.2 equals 20% of the table records.
  • autovacuum_vacuum_threshold or autovacuum_analyze_threshold : Minimum number of obsolete records or DMLs needed to trigger an autovacuum.

Let’s consider a table: percona.employee with 1000 records and the following autovacuum parameters.

autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 50
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 50

Using the above mentioned mathematical formulae as reference,

Table : percona.employee becomes a candidate for autovacuum Vacuum when,
Total number of Obsolete records = (0.2 * 1000) + 50 = 250

Table : percona.employee becomes a candidate for autovacuum ANALYZE when,
Total number of Inserts/Deletes/Updates = (0.1 * 1000) + 50 = 150
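The same arithmetic can be reproduced with a minimal Python sketch. The function names are illustrative only (they are not PostgreSQL APIs), and the inputs are the example values used in this section.

```python
def vacuum_trigger(n_tuples, scale_factor, threshold):
    """Dead tuples needed before a table is a candidate for autovacuum VACUUM."""
    return scale_factor * n_tuples + threshold


def analyze_trigger(n_tuples, scale_factor, threshold):
    """Inserts/updates/deletes needed before a table is a candidate for autovacuum ANALYZE."""
    return scale_factor * n_tuples + threshold


# percona.employee with 1000 records and the settings listed above
print(vacuum_trigger(1000, scale_factor=0.2, threshold=50))
print(analyze_trigger(1000, scale_factor=0.1, threshold=50))
```

Plugging in the row count of one of your own tables gives a quick estimate of how much change it has to accumulate before autovacuum considers it.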

Tuning Autovacuum in PostgreSQL

We need to understand that these are global settings, applicable to all the databases in the instance. This means that, regardless of the table size, a table becomes eligible for autovacuum vacuum or analyze as soon as the threshold from the above formula is reached.

Is this a problem ?

Consider a table with ten records versus a table with a million records. Even though the table with a million records may be involved in transactions far more often, the frequency at which a vacuum or an analyze runs automatically could be greater for the table with just ten records.
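To put numbers on this comparison, here is a small Python sketch that applies the vacuum formula from earlier to both table sizes. It assumes the example settings used in this post (scale factor 0.2, threshold 50); the helper name is hypothetical.

```python
def vacuum_trigger(n_tuples, scale_factor=0.2, threshold=50):
    """Dead tuples needed before autovacuum VACUUM considers the table."""
    return scale_factor * n_tuples + threshold


for rows in (10, 1_000_000):
    needed = vacuum_trigger(rows)
    # Show the absolute number of dead tuples and its share of the table size.
    print(f"{rows} rows: {needed:.0f} dead tuples needed "
          f"({needed / rows:.0%} of the table)")
```

The small table needs only a few dozen dead tuples, while the large one has to accumulate roughly a fifth of its rows as dead tuples before it qualifies.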

Consequently, PostgreSQL allows you to configure individual table level autovacuum settings that bypass global settings.

ALTER TABLE scott.employee SET (autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 100);

Output Log
----------
avi@percona:~$ psql -d percona
psql (10.4)
Type "help" for help.
percona=# ALTER TABLE scott.employee SET (autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 100);
ALTER TABLE

With the above setting, autovacuum vacuum runs on the table scott.employee only once there are more than 100 obsolete records.

How do we identify the tables that need their autovacuum settings tuned ? 

In order to tune autovacuum for tables individually, you must know the number of inserts/deletes/updates on a table for an interval. You can query the postgres catalog view pg_stat_user_tables to get that information.

percona=# SELECT n_tup_ins as "inserts",n_tup_upd as "updates",n_tup_del as "deletes", n_live_tup as "live_tuples", n_dead_tup as "dead_tuples"
FROM pg_stat_user_tables
WHERE schemaname = 'scott' and relname = 'employee';
 inserts | updates | deletes | live_tuples | dead_tuples
---------+---------+---------+-------------+-------------
      30 |      40 |       9 |          21 |          39
(1 row)

As observed in the above log, taking a snapshot of this data for a certain interval should help you understand the frequency of DMLs on each table. In turn, this should help you with tuning your autovacuum settings for individual tables.
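As a rough illustration of the snapshot approach, the Python sketch below compares two readings of the counters and converts the difference into a DML rate. The later snapshot uses the values from the query output above; the earlier snapshot and the 30 minute interval are made-up values for illustration only.

```python
def dml_rate_per_hour(earlier, later, interval_seconds):
    """DML statements per hour between two pg_stat_user_tables snapshots."""
    changed = sum(
        later[counter] - earlier[counter]
        for counter in ("inserts", "updates", "deletes")
    )
    return changed * 3600 / interval_seconds


# Hypothetical earlier reading, taken 30 minutes before the one shown above.
snapshot_earlier = {"inserts": 10, "updates": 20, "deletes": 4}
# Values from the query output above.
snapshot_later = {"inserts": 30, "updates": 40, "deletes": 9}

print(dml_rate_per_hour(snapshot_earlier, snapshot_later, interval_seconds=1800))
```

Comparing this rate with the table's effective threshold gives an idea of how often autovacuum would be triggered for it.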

How many autovacuum processes can run at a time ? 

There cannot be more than autovacuum_max_workers autovacuum processes running at a time across the instance/cluster, which may contain more than one database. The autovacuum launcher background process starts a worker process for a table that needs a vacuum or an analyze. If there are four databases with autovacuum_max_workers set to 3, then the 4th database has to wait until one of the existing worker processes becomes free.

Before starting the next autovacuum, the launcher waits for autovacuum_naptime; the default is 1 min on most versions. If you have three databases, the next autovacuum waits for 60/3 seconds. So, the wait time before starting the next autovacuum is always (autovacuum_naptime/N), where N is the total number of databases in the instance.
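The wait time rule is easy to check with a tiny Python sketch. The function name is hypothetical and the calculation simply follows the (autovacuum_naptime/N) rule stated above.

```python
def wait_before_next_autovacuum(autovacuum_naptime_s=60, n_databases=1):
    """Seconds the launcher waits before starting the next autovacuum."""
    if n_databases < 1:
        raise ValueError("n_databases must be at least 1")
    return autovacuum_naptime_s / n_databases


# Default naptime of 1 min with three databases in the instance
print(wait_before_next_autovacuum(60, 3))
```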

Does increasing autovacuum_max_workers alone increase the number of autovacuum processes that can run in parallel ?
No. The next few lines explain why.

Is VACUUM IO intensive? 

Autovacuum can be considered as a cleanup. As discussed earlier, we have 1 worker process per table. Autovacuum reads 8KB (default block_size) pages of a table from disk and modifies/writes to the pages containing dead tuples. This involves both read and write IO. Thus, this could be an IO intensive operation, when there is an autovacuum running on a huge table with many dead tuples, during a peak transaction time. To avoid this issue, we have a few parameters that are set to minimize the impact on IO due to vacuum.

The following are the parameters used to tune autovacuum IO

  • autovacuum_vacuum_cost_limit : total cost limit autovacuum can reach (combined across all autovacuum jobs).
  • autovacuum_vacuum_cost_delay : autovacuum sleeps for this many milliseconds each time a cleanup reaches the autovacuum_vacuum_cost_limit cost.
  • vacuum_cost_page_hit : Cost of reading a page that is already in shared buffers and doesn’t need a disk read.
  • vacuum_cost_page_miss : Cost of fetching a page that is not in shared buffers.
  • vacuum_cost_page_dirty : Cost of writing to each page when dead tuples are found in it.

Default Values for the parameters discussed above.
------------------------------------------------------
autovacuum_vacuum_cost_limit = -1 (So, it defaults to vacuum_cost_limit) = 200
autovacuum_vacuum_cost_delay = 20ms
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20

Consider autovacuum VACUUM running on the table percona.employee.

Let’s imagine what can happen in 1 second. (1 second = 1000 milliseconds)

In a best case scenario where read latency is 0 milliseconds, autovacuum can wake up and go to sleep 50 times (1000 milliseconds / 20 ms), because the delay between wake-ups needs to be 20 milliseconds.

1 second = 1000 milliseconds = 50 * autovacuum_vacuum_cost_delay

Since the cost of reading a page from shared_buffers is 1, 200 pages can be read in every wake-up, and 50*200 pages in 50 wake-ups.

If all the pages with dead tuples are found in shared buffers, with an autovacuum_vacuum_cost_delay of 20ms, then it can read ((200 / vacuum_cost_page_hit) * 8) KB in each round, after which it needs to wait for the autovacuum_vacuum_cost_delay amount of time.

Thus, at the most, an autovacuum can read : 50 * 200 * 8 KB = 78.13 MB per second (if blocks are already found in shared_buffers), considering the block_size as 8192 bytes.

If the blocks are not in shared buffers and need to be fetched from disk, an autovacuum can read : 50 * ((200 / vacuum_cost_page_miss) * 8) KB = 7.81 MB per second.

All the information we have seen above is for read IO.

Now, in order to delete dead tuples from a page/block, the cost of a write operation is : vacuum_cost_page_dirty, set to 20 by default.

At the most, an autovacuum can write/dirty : 50 * ((200 / vacuum_cost_page_dirty) * 8) KB = 3.9 MB per second.
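The three throughput figures above follow from one formula, sketched here in Python. This is a best case estimate under the same assumptions as the text (zero IO latency, default cost settings, 8 KB blocks); the function name is illustrative.

```python
BLOCK_SIZE_KB = 8  # default block_size of 8192 bytes


def max_mb_per_second(page_cost, cost_limit=200, cost_delay_ms=20):
    """Best case autovacuum throughput for pages of a given cost."""
    wakeups_per_second = 1000 / cost_delay_ms      # 50 with the default delay
    pages_per_wakeup = cost_limit / page_cost      # pages processed before sleeping
    return wakeups_per_second * pages_per_wakeup * BLOCK_SIZE_KB / 1024


print("hit  :", max_mb_per_second(page_cost=1))   # vacuum_cost_page_hit
print("miss :", max_mb_per_second(page_cost=10))  # vacuum_cost_page_miss
print("dirty:", max_mb_per_second(page_cost=20))  # vacuum_cost_page_dirty
```

Changing cost_limit or cost_delay_ms in the call shows how strongly these two settings bound the IO that autovacuum can generate.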

Generally, this cost is divided equally among all the autovacuum_max_workers autovacuum processes running in the instance. So, increasing autovacuum_max_workers may delay the autovacuum execution for the currently running autovacuum workers, and increasing autovacuum_vacuum_cost_limit may cause IO bottlenecks. An important point to note is that this behaviour can be overridden by setting the storage parameters of individual tables, which then ignore the global settings.

postgres=# alter table percona.employee set (autovacuum_vacuum_cost_limit = 500);
ALTER TABLE
postgres=# alter table percona.employee set (autovacuum_vacuum_cost_delay = 10);
ALTER TABLE
postgres=#
postgres=# \d+ percona.employee
Table "percona.employee"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
id | integer | | | | plain | |
Options: autovacuum_vacuum_threshold=10000, autovacuum_vacuum_cost_limit=500, autovacuum_vacuum_cost_delay=10
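To see what sharing the cost limit means in practice, the Python sketch below estimates the best case read rate per worker. It follows the simplification used in this post (the limit is split equally among the running workers, pages are found in shared buffers) and is not a model of PostgreSQL's actual balancing code; the function name is hypothetical.

```python
def per_worker_read_mb_s(cost_limit, cost_delay_ms, active_workers=1,
                         page_cost=1, block_kb=8):
    """Best case read rate of one worker when the cost limit is shared equally."""
    share = cost_limit / active_workers          # cost budget per worker per round
    wakeups_per_second = 1000 / cost_delay_ms
    return wakeups_per_second * (share / page_cost) * block_kb / 1024


# Global defaults (limit 200, delay 20ms) shared by three running workers
print(round(per_worker_read_mb_s(200, 20, active_workers=3), 2))

# Table level override from the ALTER TABLE statements above (limit 500, delay 10ms)
print(per_worker_read_mb_s(500, 10))
```

The gap between the two results shows why a per-table override on a large, busy table should be chosen with the available IO capacity in mind.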

Thus, on a busy OLTP database, always have a strategy to run manual VACUUM during an off-peak window on tables that are frequently hit with DMLs. You may run as many parallel vacuum jobs as possible when you run it manually after setting relevant autovacuum_* settings. For this reason, a scheduled manual VACUUM job is always recommended alongside finely tuned autovacuum settings.

The post Tuning Autovacuum in PostgreSQL and Autovacuum Internals appeared first on Percona Database Performance Blog.

Open Source Database Community Blog: The Story So Far

Recently, we initiated a new project, the Open Source Database Community Blog. One way to think of this is as an online, year round version of the Percona Live conferences. If you have a story to tell, an experience to share, or a lesson learned, send it along, as long as it’s related to open source database software, its management and application. That’s right. Not just Percona software. Any open source database software of all formats.

Unlike Percona Live, though, we are not limited by time or space. All submissions are welcome as long as they follow some simple guidelines.

We have already had some excellent posts, and in case this is news to you, here’s a recap:

You can also read Jean-François’s personal blog (unsolicited, but greatly appreciated) on how the process of getting his post up and running went.

About the posts … and what’s in it for you

All of the writers are giving freely of their time and knowledge. So .. if you would just like to read some alternative independent viewpoints, try the blog. If you want to support the writers with constructive exchanges and comments, that would be great.

If you would like to go a little further and provide some feedback about what you’d like to see via our blog poll, that would be awesome. As a community blog, we want to make sure it hits community interests.

You can also drop me a line if there are things that I’ve missed from the poll.

Also, you should know I have the English covered but I am not a technical expert. We don’t want Percona techs to edit or review these blog posts, nor would I get them to. That’s not the point of the blog!

So, would you consider being a technical editor maybe? Not for all posts, since many of the writers will want to ‘own’ their content. But there could be some new writers who’d appreciate some back up from a senior tech before going ‘live’ with their posts. Might that tech buddy be you?

There are some more ideas, and I have written more about our vision for the blog here.

If you are tempted to write for this, please get in touch, I would love to hear from you. You do not have to be an expert! Content suitable for all levels of experience is welcome.

The post Open Source Database Community Blog: The Story So Far appeared first on Percona Database Performance Blog.

Webinar Thurs 16/8: Developing an App on MongoDB: Tips and Tricks

Please join Percona’s Sr. Technical Operations Architect Tim Vaillancourt as he presents Developing an App on MongoDB: Tips and Tricks on Thursday, August 16th, 2018, at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

A lot of developers prefer using MongoDB to other open source databases when developing applications. But why? How do you work with MongoDB to create a well-functioning application?

This webinar will help developers understand what MongoDB does and how it processes requests from applications.

In this webinar, we will cover:

  • Data, Queries and Indexes
  • Using indices efficiently
  • Reducing index and storage size with correct data types
  • The aggregation framework
  • Using the Explain and Operation Profiling features
  • MongoDB features to avoid
  • Using Read and Write Concerns for Integrity
  • Performance
  • Scaling read queries using Read Preference
  • What is MongoDB Sharding?
  • Using Percona Monitoring and Management (PMM) to visualize database usage
  • MongoDB users and built-in roles for application security
  • Using SRV DNS record support

By the end of the lesson, you will know how to avoid common problems with MongoDB in the application stage, instead of fixing them in production.

The post Webinar Thurs 16/8: Developing an App on MongoDB: Tips and Tricks appeared first on Percona Database Performance Blog.

Using AWS EC2 instance store vs EBS for MySQL: how to increase performance and decrease cost

If you are using large EBS GP2 volumes for MySQL (e.g. 10TB+) on AWS EC2, you can increase performance and save a significant amount of money by moving to local SSD (NVMe) instance storage. Interested? Then read on for a more detailed examination of how to achieve cost savings and better performance from this implementation.

EBS vs Local instance store

We have heard from customers that large EBS GP2 volumes can be affected by short term outages: IO “stalls” where no IO is going in or out for a couple of minutes. This can happen especially in the largest AWS region, us-east-1. Statistically, with so many disks in the disk arrays that back EBS volumes, we can expect frequent disk failures. If we allocate a very large EBS GP2 volume, e.g. 10TB+, hitting such failure events can be common.

In the case of MySQL/InnoDB, such an IO “stall” will be obvious, particularly on a highly loaded system where MySQL needs to do physical IO. During the stall, you will see all write queries waiting, or “hanging”. Some of the writes may error out with “Error 1030” (MySQL error code 1030 (ER_GET_ERRNO): Got error %d from storage engine). There is nothing MySQL can do here – if the IO subsystem is not available, it will need to wait for it.

The good news is: many of the newer EC2 instances (e.g. i3, m5d) have local SSD disks attached (NVMe). Those disks are local to the physical server and should not suffer from the EBS issues described above. Using local disks can be a very good solution:

  1. They are faster, as they are local to the server, and do not suffer from the EBS issues
  2. They are much cheaper compared to large EBS volumes.

Please note, however, that local storage does not guarantee persistence. More about this below.

Another potential option is to use IO1 volumes with provisioned IOPS. However, this will be significantly more expensive for large volumes and high traffic.

A look at costs

To estimate the costs, I’ve used the AWS simple monthly calculator. Estimated costs are based on 1 year reserved instances. Let’s imagine we need a 14TB volume (to store ~10TB of MySQL data including binary logs). The pricing estimates will look like this:

r4.4xlarge, 122GB RAM, 16 vCPUs + EBS, 14TB volume (this is what we are presumably using now)

Amazon EC2 Service (US East (N. Virginia)) $ 1890.56 / month
Compute: $ 490.56
EBS Volumes: $1400.00

Local storage price estimate:
i3.4xlarge, 122GB RAM, 16 vCPUs, 3800 GiB disk (2 x 1900 NVMe SSD)

Amazon EC2 Service (US East (N. Virginia)) $ 627.21 / month
Compute: $ 625.61

i3.8xlarge, 244GB RAM, 32 vCPUs, 7600 GiB disk (4 x 1900 NVMe SSD)

Amazon EC2 Service (US East (N. Virginia)) $1252.82 / month
Compute: $ 1251.22

As we can see, even if we switch to i3.8xlarge and get 2x more RAM, 2x more virtual CPUs, faster storage and a 10 gigabit network, we still pay 1.5x less per box than what we are presumably paying now. With replication, that means paying 1.5x less for each of the replication servers.
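The savings factors can be recomputed from the monthly estimates quoted above with a few lines of Python. The dictionary keys and the function name are just labels chosen for this sketch; the prices are the figures from this post and will differ from current AWS pricing.

```python
# Monthly estimates in USD, as quoted above (1 year reserved, US East).
MONTHLY_COST_USD = {
    "r4.4xlarge + 14TB EBS": 1890.56,
    "i3.4xlarge": 627.21,
    "i3.8xlarge": 1252.82,
}


def savings_factor(current, candidate):
    """How many times cheaper the candidate setup is than the current one."""
    return MONTHLY_COST_USD[current] / MONTHLY_COST_USD[candidate]


for candidate in ("i3.4xlarge", "i3.8xlarge"):
    factor = savings_factor("r4.4xlarge + 14TB EBS", candidate)
    print(f"{candidate}: {factor:.2f}x cheaper per month")
```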

But wait … there is a catch.

How to migrate to local storage from EBS

Well, we have some challenges here to migrate from EBS to local instance NVMe storage.

  1. Wait, we are storing ~10TB and i3.8xlarge has a 7600 GiB disk. The answer is simple: compression (see below)
  2. Wait, but the local storage is ephemeral; if we lose the box we will lose our data, and that is unacceptable. The answer is also simple: replication (see below)
  3. Wait, but we use EBS snapshots for backups. That answer is simple too: we can still use EBS (and use snapshots) on one of the replication slaves (see below)

Compression

To fit i3.8xlarge we only need 2x compression. This can be done with InnoDB row compression (row_format=compressed) or InnoDB page compression, which requires sparse file and hole punching support. However, InnoDB compression may be slower and will only compress ibd files—it does not compress binary logs, frm files, etc.

ZFS

Another option: use the ZFS filesystem. ZFS will compress all files, including binary logs and frm. That can be very helpful if we use a “schema per customer” or “table per customer” approach and need to store 100K – 200K tables in a single MySQL instance. If the data is compressible, or new tables were provisioned without much data in those, ZFS can give significant disk savings.

I’ve used ZFS (following Yves’s blog post, Hands-On Look at ZFS with MySQL). Here are the results of data compression with ZFS (this is real data, not generated data):

# du -sh --apparent-size /mysqldata/mysql/data
8.6T	/mysqldata/mysql/data
# du -sh /mysqldata/mysql/data
3.2T	/mysqldata/mysql/data

Compression ratio:

# zfs get all | grep -i compress
...
mysqldata/mysql/data  compressratio         2.42x                  -
mysqldata/mysql/data  compression           gzip                   inherited from mysqldata/mysql
mysqldata/mysql/data  refcompressratio      2.42x                  -
mysqldata/mysql/log   compressratio         3.75x                  -
mysqldata/mysql/log   compression           gzip                   inherited from mysqldata/mysql
mysqldata/mysql/log   refcompressratio      3.75x                  -

As we can see, the original 8.6TB of data was compressed to 3.2TB; the compression ratio is 2.42x for MySQL tables and 3.75x for binary logs. That will definitely fit i3.8xlarge.
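A quick way to sanity check the fit is to compare the compressed size with the local disk capacity. The Python sketch below assumes that the "T" reported by du -sh means TiB (1024 GiB) and uses the 7600 GiB figure for i3.8xlarge quoted earlier; the function name is hypothetical.

```python
GIB_PER_TIB = 1024


def fits_local_storage(compressed_tib, local_gib):
    """Return (fits, headroom_gib) for data of compressed_tib TiB on local_gib GiB."""
    used_gib = compressed_tib * GIB_PER_TIB
    return used_gib <= local_gib, local_gib - used_gib


# 3.2T on disk after ZFS compression vs. 7600 GiB on i3.8xlarge
fits, headroom_gib = fits_local_storage(3.2, 7600)
print(fits, round(headroom_gib, 1))
```

The headroom matters as much as the fit itself, since binary logs, temporary files and data growth all have to live on the same local disks.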

(For another test, I’ve generated 40 million tables spread across multiple schemas (databases). I’ve added some data only to one schema, leaving others blank. For that test I achieved ~10x compression ratio.)

Conclusion: ZFS can provide you with very good compression ratio, will allow you to use different EC2 instances on AWS, and save you a substantial amount of money. Although compression is not free performance-wise, and ZFS can be slower for some workloads, using local NVMe storage can compensate.

You can find some performance testing for ZFS on linux in this blog post: About ZFS Performance. Some benchmarks comparing EBS and local NVMe SSD storage (i3 instances) can be found in this blog post: Percona XtraDB Cluster on Amazon GP2 Volumes

MyRocks

Another option for compression would be using the MyRocks storage engine in Percona Server for MySQL, which provides compression.

Replication and using local volumes

As the local instance storage is ephemeral we need redundancy: we can use MySQL replication or Percona XtraDB cluster (PXC). In addition, we can use one replication slave—or we can attach a replication slave to PXC—and have it use EBS storage.

Local storage is not durable. If you stop the instance and then start it again, the local storage will probably disappear. (Reboot is an exception: you can reboot the instance and the local storage will be fine.) In addition, if the local storage disappears we will have to recreate the MySQL local storage partition (e.g. zpool create for ZFS, or mkfs for EXT4/XFS).

For example, using MySQL replication:

master - local storage (AZ 1, i.e. 1a)
-> slave1 - local storage (AZ 2, i.e. 1b)
-> slave2 - ebs storage (AZ 3, i.e. 1c)
   (other replication slaves if needed with local storage - optional)

MySQL Master AZ 1a, Local storage

Then we can use slave2 for EBS snapshots (if needed). This slave will be more expensive (as it is using EBS), but it can also be used either to serve production traffic (e.g. a smaller share of the traffic) or for other purposes (for example, analytical queries).

For Percona XtraDB cluster (PXC) we can just use 3 nodes, 1 in each AZ. PXC uses auto-provisioning with SST if the new node comes back blank. For MySQL replication we need some additional things:

  1. Failover from master to a slave if the master goes down. This can be done with MHA or Orchestrator
  2. Ability to clone a slave. This can be done with Xtrabackup or ZFS snapshots (if using ZFS)
  3. Ability to set up a new MySQL local storage partition (e.g. zpool create for ZFS, or mkfs for EXT4/XFS)

Other options

Here are some totally different options we could consider:

  1. Use IO1 volumes (as discussed). That can be way more expensive.
  2. Use local storage and MyRocks storage engine. However, switching to another storage engine is another, bigger project and requires lots of testing
  3. Switch to AWS Aurora. That can be even more expensive for this particular case, and switching to Aurora can be another big project by itself.

Conclusions

  1. Using EC2 i3 instances with local NVMe storage can increase performance and save money. There are some limitations: local storage is ephemeral and will disappear if the node is stopped. Reboot is fine.
  2. ZFS filesystem with compression enabled can decrease the storage requirements so that a MySQL instance will fit into local storage. Another option for compression could be to use InnoDB compression (row_format=compressed).

That may not work for everyone, as it requires additional changes to the existing server provisioning: failover from master to a slave, the ability to clone replication slaves (or use PXC), the ability to set up a new MySQL local storage partition, and compression.

The post Using AWS EC2 instance store vs EBS for MySQL: how to increase performance and decrease cost appeared first on Percona Database Performance Blog.

This Week in Data with Colin Charles 50: Percona Live Europe Sessions, PostgreSQL in Google Cloud

Colin Charles


Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Grading is underway for talks at Percona Live Europe 2018. I understand that by next week you will see the tutorial schedule released. As part of the program committee, I have enjoyed reviewing tutorials, and I reckon there is great competition for the schedule. I suggest you register now, and don’t forget to book your accommodation (need a discount?).

A video worth watching: How we Live Migrated Millions of BBM Users & its Infrastructure Across the Pacific (Cloud Next ’18). I was a big BlackBerry Messenger (BBM) user in its heyday. They moved from their Canadian data center to Google Cloud. BlackBerry used MySQL, and migrated from Oracle to PostgreSQL, as Oracle isn’t sanctioned to run on Google Cloud Platform. They also moved from MySQL to Google CloudSQL, since the databases weren’t critical to running the application. In some cases they write to BigQuery and have Tableau query the data. They make use of Cassandra native replication. With PostgreSQL, there is master-slave replication: they got users onto the new master and then promoted the master. The master promotion caused minimal downtime (5-10 minutes). This didn’t impact their key services: messages continued to work, and this was seen as acceptable.

Releases

  • MariaDB 10.3.9 – InnoDB from 5.7.23, a new variable innodb_log_optimize_ddl for avoiding delay due to page flushing and allowing concurrent backup, and many fixes and improvements around ALTER TABLE too.
  • Percona Server for MySQL 5.6.41-84.1 – full-text search index with InnoDB table bug fixed, as well as ensuring queries on a table with CHARSET=euckr COLLATE=euckr_bin always return the same results.
  • Percona Server for MySQL 5.5.61-38.13 – bug fixes.

Link List

Upcoming Appearances

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

The post This Week in Data with Colin Charles 50: Percona Live Europe Sessions, PostgreSQL in Google Cloud appeared first on Percona Database Performance Blog.


Webinar Tuesday, 8/28: Forking or Branching – Lessons from the MySQL Community

forking or branching

Please join Percona’s CEO, Peter Zaitsev, as he presents Forking or Branching – Lessons from the MySQL Community on Tuesday, August 28th, 2018, at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4).

 

The MySQL Community offers a great example of various forks and branches, with MariaDB being the most well-known fork, and companies like Percona, Facebook and Alibaba maintaining their own branches.

In this presentation we will look at the history of MySQL, the causes of MySQL forking and branching, and discuss the benefits and drawbacks of both approaches, using specific examples from the MySQL ecosystem.

Register for the webinar.

Peter Zaitsev, CEO and Co-Founder

Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With over 140 professionals in 30 plus countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of internet giants, large enterprises and many exciting startups. Inc. 5000 named Percona to their list in 2013, 2014, 2015 and 2016. Peter was an early employee at MySQL AB, eventually leading the company’s High-Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High-Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Database Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his ebook Practical MySQL Performance Optimization is one of Percona’s most popular downloads.

 

The post Webinar Tuesday, 8/28: Forking or Branching – Lessons from the MySQL Community appeared first on Percona Database Performance Blog.

Webinar Wed 8/29: Databases in the Hosted Cloud

databases-in-the-cloud

Please join Percona’s Chief Evangelist, Colin Charles, on Wednesday, August 29th, 2018, as he presents Databases in the Hosted Cloud at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4).

 

Nearly everyone today uses some form of database in the hosted cloud. You can use hosted MySQL, MariaDB, Percona Server, and PostgreSQL in several cloud providers as a database as a service (DBaaS).

In this webinar, Colin Charles explores how to efficiently deploy a cloud database configured for optimal performance, with a particular focus on MySQL.

You’ll learn the differences between the various public cloud offerings, including Amazon RDS (with Aurora), Google Cloud SQL, Rackspace OpenStack DBaaS, Microsoft Azure, and Alibaba Cloud, as well as the access methods and the level of control you have. Hosting in the cloud can be a challenge, but we’ll make sure you walk away from today’s webinar with a better understanding of how you can leverage the cloud for your business needs.

Topics include:

  • Backup strategies
  • Planning multiple data centers for availability
  • Where to host your application
  • How to get the most performance out of the solution
  • Cost
  • Monitoring
  • Moving from one DBaaS to another
  • Moving from a DBaaS to your own hosted platform

Register Now.

The post Webinar Wed 8/29: Databases in the Hosted Cloud appeared first on Percona Database Performance Blog.

Extend Metrics for Percona Monitoring and Management Without Modifying Code

PMM Extended Metrics

Percona Monitoring and Management (PMM) provides an excellent solution for system monitoring. Sometimes, though, you’ll have the need for a metric that’s not present in the list of node_exporter metrics out of the box. In this post, we introduce a simple method and show how to extend the list of available metrics without modifying the node_exporter code. It’s based on the textfile collector.

Enable the textfile collector in pmm-client

This collector is not enabled by default in the latest version of pmm-client. So, first let’s enable the textfile collector.

# pmm-admin rm linux:metrics
OK, removed system pmm-client-hostname from monitoring.
# pmm-admin add linux:metrics -- --collectors.enabled=diskstats,filefd,filesystem,loadavg,meminfo,netdev,netstat,stat,time,uname,vmstat,textfile --collector.textfile.directory="/tmp"
OK, now monitoring this system.
# pmm-admin ls
pmm-admin 1.13.0
PMM Server      | 10.178.1.252  
Client Name     | pmm-client-hostname
Client Address  | 10.178.1.252  
Service Manager | linux-upstart
-------------- -------------------- ----------- -------- ------------ --------
SERVICE TYPE   NAME                 LOCAL PORT  RUNNING  DATA SOURCE  OPTIONS  
-------------- -------------------- ----------- -------- ------------ --------
linux:metrics  pmm-client-hostname  42000       YES      -

Notice that the whole list of default collectors has to be re-enabled. Also, don’t forget to specify the directory for reading files with the metrics (--collector.textfile.directory="/tmp"). The exporter reads files with the extension .prom.
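Because the exporter may read the directory at any moment, it can help to write each metric to a temporary file and rename it into place. Here is a minimal sketch of that idea; the function name and the example metric name are made up for illustration and are not part of PMM.

```shell
#!/bin/bash
# Write one metric in the Prometheus text format to <dir>/<name>.prom.
# The value goes to a temporary file first and is then renamed, so the
# exporter should never see a half-written .prom file.
write_prom_metric() {
    local dir="$1" name="$2" value="$3"
    local tmp
    # The temporary name does not end in .prom, so the exporter ignores it
    tmp=$(mktemp "$dir/$name.XXXXXX") || return 1
    printf '%s %s\n' "$name" "$value" > "$tmp"
    # mktemp creates the file as 0600; make it readable for the exporter
    chmod 644 "$tmp"
    mv "$tmp" "$dir/$name.prom"
}

# Example with a hypothetical metric name; the directory is the one passed
# to --collector.textfile.directory:
# write_prom_metric /tmp node_logged_in_users_total "$(who | wc -l)"
```

The rename is only atomic when the temporary file and the target are on the same filesystem, which is why the temporary file is created inside the target directory.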

Add a crontab task

The second step is to add a crontab task to collect metrics and place them into a file.

Here are the cron commands for collecting the total number of Docker containers and the number of running ones.

*/1 * * * *     root   echo -n "" > /tmp/docker_all.prom; /usr/bin/docker ps -a | sed -n '1!p'| /usr/bin/wc -l | sed -ne 's/^/node_docker_containers_total /p' >> /tmp/docker_all.prom;
*/1 * * * *     root   echo -n "" > /tmp/docker_running.prom; /usr/bin/docker ps | sed -n '1!p'| /usr/bin/wc -l | sed -ne 's/^/node_docker_containers_running_total /p' >> /tmp/docker_running.prom;

The results of the commands are placed into the files /tmp/docker_all.prom and /tmp/docker_running.prom and read by the exporter.

Look - we got a new metric!

Adding the crontab tasks by using a script

Also, we have a few bash scripts that make it much easier to add crontab tasks.

The first one allows you to collect the number of logged-in users and the size of InnoDB data files.

Modifying the cron job - a script

You may use the suggested names of files and metrics or set new ones.

The second script is more universal. It allows us to get the size of any directories or files. This script can be placed directly into a crontab task. You just need to specify the list of monitored objects (e.g. /var/log /var/cache/apt /var/lib/mysql/ibdata1).

echo  "*/5 * * * * root bash  /root/object_sizes.sh /var/log /var/cache/apt /var/lib/mysql/ibdata1"  > /etc/cron.d/object_size
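The script itself is not reproduced in this post, so the following is only a sketch of what a script of this kind might look like, not the actual object_sizes.sh. The metric name node_object_size_bytes and the path label are assumptions, and du -sb relies on GNU coreutils.

```shell
#!/bin/bash
# Print one size metric per path given as an argument.
# The metric name node_object_size_bytes is a made-up example.
print_object_sizes() {
    local path size
    for path in "$@"; do
        # du -sb reports the apparent size in bytes (GNU coreutils)
        size=$(du -sb "$path" 2>/dev/null | cut -f1)
        # Skip paths that do not exist or cannot be read
        [ -n "$size" ] || continue
        printf 'node_object_size_bytes{path="%s"} %s\n' "$path" "$size"
    done
}

# Example:
# print_object_sizes /var/log /var/lib/mysql/ibdata1 > /tmp/object_sizes.prom
```

Using a label for the path keeps everything under a single metric name, which tends to be easier to graph than one metric per object.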

So, I hope this has provided useful insight into how to set up the collection of new PMM metrics without the need to write code. Please feel free to use the scripts or configure commands similar to the ones provided above.

More resources you might enjoy

If you are new to PMM, there is a great demo site of the latest version, showing you those out of the box metrics. Or how about our free webinar on monitoring Amazon RDS with PMM?

The post Extend Metrics for Percona Monitoring and Management Without Modifying Code appeared first on Percona Database Performance Blog.

Tune Linux Kernel Parameters For PostgreSQL Optimization

Linux parameters for PostgreSQL performance tuning

For optimum performance, a PostgreSQL database depends on the operating system parameters being defined correctly. Poorly configured OS kernel parameters can cause degradation in database server performance. Therefore, it is imperative that these parameters are configured according to the database server and its workload. In this post, we will discuss some important kernel parameters that can affect database server performance and how these should be tuned.

SHMMAX / SHMALL

SHMMAX is a kernel parameter that defines the maximum size of a single shared memory segment a Linux process can allocate. Up to and including version 9.2, PostgreSQL used System V (SysV) shared memory, which requires the SHMMAX setting. Starting with 9.3, PostgreSQL switched to POSIX shared memory, so it now requires far fewer bytes of System V shared memory.

Prior to version 9.3, SHMMAX was the most important kernel parameter. The value of SHMMAX is in bytes.

Similarly, SHMALL is another kernel parameter, used to define the system-wide total amount of shared memory in pages. To view the current values for SHMMAX, SHMALL or SHMMIN, use the ipcs command.

$ ipcs -lm
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 1073741824
max total shared memory (kbytes) = 17179869184
min seg size (bytes) = 1

$ ipcs -M
IPC status from <running system> as of Thu Aug 16 22:20:35 PKT 2018
shminfo:
	shmmax: 16777216	(max shared memory segment size)
	shmmin:       1	(min shared memory segment size)
	shmmni:      32	(max number of shared memory identifiers)
	shmseg:       8	(max shared memory segments per process)
	shmall:    1024	(max amount of shared memory in pages)

PostgreSQL uses System V IPC to allocate the shared memory, which makes SHMMAX one of the most important kernel parameters. If you get the following error messages, it means that you are running an older version of PostgreSQL with a very low SHMMAX value. You are expected to increase the value according to the amount of shared memory you are going to use.

Possible misconfiguration errors

If SHMMAX is misconfigured, you can get an error when trying to initialize a PostgreSQL cluster using the initdb command.

DETAIL: Failed system call was shmget(key=1, size=2072576, 03600).
HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter.
You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 2072576 bytes),
reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.
If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter,
in which case raising the request size or reconfiguring SHMMIN is called for.
The PostgreSQL documentation contains more information about shared memory configuration.
child process exited with exit code 1

Similarly, you can get an error when starting the PostgreSQL server using the pg_ctl command.

DETAIL: Failed system call was shmget(key=5432001, size=14385152, 03600).
HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter.
You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 14385152 bytes),
reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.
If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter,
in which case raising the request size or reconfiguring SHMMIN is called for.
The PostgreSQL documentation contains more information about shared memory configuration.

Be aware of differing definitions

The SHMMAX/SHMALL parameters are named slightly differently on Linux and MacOS X. These are the names:

  • Linux: kernel.shmmax, kernel.shmall
  • MacOS X: kern.sysv.shmmax, kern.sysv.shmall

The sysctl command can be used to change the value temporarily. To permanently set the value, add an entry into /etc/sysctl.conf. The details are given below.

# MacOS X: get the value of SHMMAX
sudo sysctl kern.sysv.shmmax
kern.sysv.shmmax: 4096
# Get the value of SHMALL
sudo sysctl kern.sysv.shmall
kern.sysv.shmall: 4096
# Set the value of SHMMAX
sudo sysctl -w kern.sysv.shmmax=16777216
kern.sysv.shmmax: 4096 -> 16777216
# Set the value of SHMALL
sudo sysctl -w kern.sysv.shmall=16777216
kern.sysv.shmall: 4096 -> 16777216

# Linux: get the value of SHMMAX
sudo sysctl kernel.shmmax
kernel.shmmax: 4096
# Get the value of SHMALL
sudo sysctl kernel.shmall
kernel.shmall: 4096
# Set the value of SHMMAX
sudo sysctl -w kernel.shmmax=16777216
kernel.shmmax: 4096 -> 16777216
# Set the value of SHMALL
sudo sysctl -w kernel.shmall=16777216
kernel.shmall: 4096 -> 16777216

Remember: to make the change permanent, add these values to /etc/sysctl.conf.
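Since SHMMAX is given in bytes and SHMALL in pages, it can be handy to derive one from the other when preparing the sysctl.conf entries. The helper below is a sketch for Linux. Sizing SHMALL to cover exactly one SHMMAX-sized segment is an assumption made for the example, not a recommendation from this post.

```shell
#!/bin/bash
# Print sysctl.conf entries for a desired SHMMAX (in bytes).
# SHMALL is expressed in pages, so it is derived by dividing by the page
# size and rounding up.
shm_sysctl_lines() {
    local shmmax_bytes="$1"
    # Use the second argument as the page size, else ask the system
    local page_size="${2:-$(getconf PAGE_SIZE)}"
    local shmall_pages=$(( (shmmax_bytes + page_size - 1) / page_size ))
    echo "kernel.shmmax = $shmmax_bytes"
    echo "kernel.shmall = $shmall_pages"
}

# Example: a 16 MB segment with 4 kB pages
shm_sysctl_lines 16777216 4096
```

For 16777216 bytes and 4096-byte pages this prints a SHMALL of 4096 pages. After adding the lines to /etc/sysctl.conf, sysctl -p loads them without a reboot.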
 

Huge Pages

Linux uses 4K memory pages by default, BSD has Super Pages, and Windows has Large Pages. A page is a chunk of RAM that is allocated to a process. A process may own more than one page, depending on its memory requirements. The more memory a process needs, the more pages are allocated to it. The OS maintains a table of page allocations to processes. The smaller the page size, the bigger the table, and the more time is required to look up a page in that page table. Huge pages therefore make it possible to use a large amount of memory with reduced overhead: fewer page lookups, fewer page faults, and faster read/write operations through larger buffers. This results in improved performance.

PostgreSQL supports bigger pages on Linux only. By default, Linux uses 4K memory pages, so in cases where there are too many memory operations, there is a need to set bigger pages. Performance gains have been observed by using huge pages with sizes from 2 MB up to 1 GB. The size of a huge page can be set at boot time. You can easily check the huge page settings and utilization on your Linux box using the cat /proc/meminfo | grep -i huge command.

Note: This is only for Linux; for other OSes this operation is ignored.

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

In this example, although the huge page size is set to 2,048 kB (2 MB), the total number of huge pages is 0, which signifies that huge pages are disabled.

Script to quantify Huge Pages

This is a simple script which returns the number of huge pages required. Execute the script on your Linux box while PostgreSQL is running. Ensure that the $PGDATA environment variable is set to PostgreSQL’s data directory.

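#!/bin/bash
# Estimate the number of huge pages needed by a running PostgreSQL server.
# The first line of postmaster.pid holds the postmaster's PID.
pid=$(head -1 "$PGDATA/postmaster.pid")
echo "Pid:            $pid"
# Peak virtual memory size of the postmaster, in kB
peak=$(grep ^VmPeak "/proc/$pid/status" | awk '{ print $2 }')
echo "VmPeak:         $peak kB"
# Huge page size configured on this system, in kB
hps=$(grep ^Hugepagesize /proc/meminfo | awk '{ print $2 }')
echo "Hugepagesize:   $hps kB"
# Integer division: the result is rounded down
hp=$((peak/hps))
echo "Set Huge Pages: $hp"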

The output of the script looks like this:

Pid:            12737
VmPeak:         180932 kB
Hugepagesize:   2048 kB
Set Huge Pages: 88

The script recommends 88 huge pages, so set the value to 88.

sysctl -w vm.nr_hugepages=88
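Note that the script uses integer division, so 180932 / 2048 is truncated to 88, which covers slightly less than VmPeak. If you would rather round up, a sketch like the one below does the same calculation with a ceiling. Rounding up is our assumption here, not what the script above does.

```shell
#!/bin/bash
# Number of huge pages needed to cover peak_kb, rounded up.
hugepages_needed() {
    local peak_kb="$1" hugepage_kb="$2"
    echo $(( (peak_kb + hugepage_kb - 1) / hugepage_kb ))
}

# With the VmPeak and Hugepagesize values from the sample output
hugepages_needed 180932 2048
```

For the sample values this prints 89 rather than 88. Either way, adding the vm.nr_hugepages entry to /etc/sysctl.conf keeps the setting across reboots.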

Check the huge pages now: you will see that no huge pages are in use (HugePages_Free = HugePages_Total).

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:      88
HugePages_Free:       88
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Now set the parameter huge_pages to “on” in $PGDATA/postgresql.conf and restart the server.

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:      88
HugePages_Free:       81
HugePages_Rsvd:       64
HugePages_Surp:        0
Hugepagesize:       2048 kB

Now you can see that only a few of the huge pages are in use. Let’s now try to add some data to the database.

postgres=# CREATE TABLE foo(a INTEGER);
CREATE TABLE
postgres=# INSERT INTO foo VALUES(generate_Series(1,10000000));
INSERT 0 10000000

Let’s see if we are now using more huge pages than before.

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:      88
HugePages_Free:       18
HugePages_Rsvd:        1
HugePages_Surp:        0
Hugepagesize:       2048 kB

Now you can see that most of the huge pages are in use.
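If you want to track this over time rather than eyeball it, the number of allocated huge pages is simply HugePages_Total minus HugePages_Free. A minimal sketch that reads /proc/meminfo-style text on standard input (the function name is ours):

```shell
#!/bin/bash
# Read /proc/meminfo-style text on stdin and print the number of huge pages
# currently allocated (HugePages_Total - HugePages_Free).
hugepages_in_use() {
    awk '/^HugePages_Total:/ { total = $2 }
         /^HugePages_Free:/  { free = $2 }
         END { print total - free }'
}

# Example on a Linux host:
# hugepages_in_use < /proc/meminfo
```

With the figures shown above, that is 88 - 81 = 7 pages right after the restart and 88 - 18 = 70 pages after the insert.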

Note: The sample value for HugePages used here is very low, which is not a normal value for a big production machine. Please assess the required number of pages for your system and set it according to your system’s workload and resources.

vm.swappiness

vm.swappiness is another kernel parameter that can affect the performance of the database. It controls the swappiness behaviour (swapping pages between RAM and swap space) of a Linux system. The value ranges from 0 to 100 and controls how aggressively memory is swapped or paged out: zero means swap is disabled, and 100 means aggressive swapping.

You may get good performance by setting lower values.

Setting a value of 0 in newer kernels may cause the OOM Killer (the out-of-memory killer process in Linux) to kill the process. Therefore, you can be on the safe side and set the value to 1 if you want to minimize swapping. The default value on a Linux system is 60. A higher value makes the kernel swap more aggressively, whereas a lower value preserves more data/code in memory.

A smaller value is a good bet to improve performance in PostgreSQL.
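As a sketch, the helper below checks a candidate value against the 0 to 100 range described above and prints the matching sysctl entry. The file name under /etc/sysctl.d in the comment is a hypothetical example.

```shell
#!/bin/bash
# Print a sysctl entry for vm.swappiness after checking the 0-100 range.
swappiness_line() {
    local value="$1"
    # Reject empty input and anything that is not a plain non-negative integer
    case "$value" in
        ''|*[!0-9]*) echo "not a non-negative integer: $value" >&2; return 1 ;;
    esac
    if [ "$value" -gt 100 ]; then
        echo "out of range (0-100): $value" >&2
        return 1
    fi
    echo "vm.swappiness = $value"
}

# Check the current value, then persist a new one (hypothetical file name):
# cat /proc/sys/vm/swappiness
# swappiness_line 1 | sudo tee /etc/sysctl.d/99-swappiness.conf
swappiness_line 1
```

The value 1 follows the suggestion above for minimizing swapping without going all the way to 0.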

vm.overcommit_memory / vm.overcommit_ratio

Applications acquire memory and free it when it is no longer needed. But in some cases an application acquires too much memory and does not release it. This can invoke the OOM killer. Here are the possible values for the vm.overcommit_memory parameter, with a description of each:

    0. Heuristic overcommit (default): the kernel decides intelligently, based on its own heuristics.
    1. Always allow overcommit.
    2. Don’t overcommit beyond the overcommit ratio.

Reference: https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

vm.overcommit_ratio is the percentage of RAM that is available for overcommitment. The commit limit is the swap space plus this percentage of RAM, so a value of 50% on a system with 2 GB of RAM and 2 GB of swap may commit up to 3 GB.
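According to the kernel's overcommit accounting documentation referenced above, the limit in mode 2 is the swap space plus the configured percentage of physical RAM. The sketch below reproduces that arithmetic; it ignores huge page reservations, and the function name is ours.

```shell
#!/bin/bash
# Commit limit in kB when vm.overcommit_memory = 2:
#   swap + RAM * overcommit_ratio / 100
# Huge page reservations, which the kernel leaves out of the RAM figure,
# are ignored in this sketch.
commit_limit_kb() {
    local ram_kb="$1" swap_kb="$2" ratio="$3"
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}

# 2 GB of RAM, 2 GB of swap, ratio 50
commit_limit_kb 2097152 2097152 50
```

This prints 3145728 kB, i.e. 3 GB. With no swap at all, the same machine would be limited to 1 GB, which is why systems without swap need extra care in this mode. On a live system, the CommitLimit and Committed_AS lines in /proc/meminfo show the limit and the current usage.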

A value of 2 for vm.overcommit_memory yields better performance for PostgreSQL. This value maximises RAM utilization by the server process without any significant risk of it being killed by the OOM killer. An application can still overcommit, but only within the overcommit ratio, which is why a value of 2 gives better performance than the default value of 0. It also improves reliability, because memory beyond the allowable range is never overcommitted, which avoids the risk of the process being killed by the OOM killer.

On systems without swap, you may experience problems when vm.overcommit_memory is 2.

https://www.postgresql.org/docs/current/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

vm.dirty_background_ratio / vm.dirty_background_bytes

vm.dirty_background_ratio is the percentage of memory filled with dirty pages at which they need to be flushed to disk. Flushing is done in the background. The value of this parameter ranges from 0 to 100; however, a value lower than 5 may not be effective, and some kernels do not internally support it. The default value is 10 on most Linux systems. You can gain performance for write-intensive operations with a lower ratio, which means that Linux starts flushing dirty pages in the background sooner.

You need to set a value of vm.dirty_background_bytes depending on your disk speed.

There are no “good” values for these two parameters since both depend on the hardware. However, setting vm.dirty_background_ratio to 5 and vm.dirty_background_bytes to 25% of your disk speed improves performance by up to ~25% in most cases.

vm.dirty_ratio / dirty_bytes

This is the same as vm.dirty_background_ratio / dirty_background_bytes, except that the flushing is done in the foreground, blocking the application. So vm.dirty_ratio should be higher than vm.dirty_background_ratio. This ensures that background flushing kicks in before foreground flushing, to avoid blocking the application as much as possible. You can tune the difference between the two ratios depending on your disk IO load.
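The ordering rule above is easy to get wrong when editing the two values separately, so here is a small sketch that refuses a combination where background flushing would not start first. The foreground value of 20 in the example is an assumption for illustration only.

```shell
#!/bin/bash
# Print sysctl entries for the two dirty-page thresholds, refusing
# combinations where background flushing would not start first.
dirty_ratio_lines() {
    local background="$1" foreground="$2"
    if [ "$background" -ge "$foreground" ]; then
        echo "vm.dirty_background_ratio ($background) must be lower than vm.dirty_ratio ($foreground)" >&2
        return 1
    fi
    echo "vm.dirty_background_ratio = $background"
    echo "vm.dirty_ratio = $foreground"
}

# 5 is the background ratio mentioned earlier; 20 is an assumed foreground ratio
dirty_ratio_lines 5 20
```

Keep in mind that the kernel treats the _ratio and _bytes variants of each setting as alternatives: writing one resets the other to 0.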

Summing up

You can tune other parameters for performance, but the gains are likely to be minimal. Keep in mind that not all parameters are relevant for all application types: some applications perform better after tuning certain parameters and some don’t. You need to find a good balance between these parameter settings for the expected application workload and type, and OS behaviour must also be kept in mind when making adjustments. Tuning kernel parameters is not as easy as tuning database parameters: it’s harder to be prescriptive.

In my next post, I’ll take a look at tuning PostgreSQL’s database parameters. You might also enjoy this post:

Tuning PostgreSQL for sysbench-tpcc

 

The post Tune Linux Kernel Parameters For PostgreSQL Optimization appeared first on Percona Database Performance Blog.

Is It a Read Intensive or a Write Intensive Workload?

innodb row operations featured

One of the common ways to classify a database workload is by whether it is “read intensive” or “write intensive”. In other words, whether the workload is dominated by reads or writes.

Why should you care? Because recognizing if the workload is read intensive or write intensive will impact your hardware choices, database configuration as well as what techniques you can apply for performance optimization and scalability.

This question looks trivial on the surface, but as you go deeper—complexity emerges. There are different “levels” of reads and writes for you to consider. You can also choose to look at event counts or at the time it takes to do operations. These can provide very different responses, especially as the cost difference between a single read and a single write can be an order of magnitude.

Let’s examine the TPC-C Benchmark from this point of view, or more specifically its implementation in Sysbench. The illustrations below are taken from Percona Monitoring and Management (PMM) while running this benchmark.

Analyzing read/write workload by counts

analyzing read write workload by counts
At the highest level, you can think about the queries that are sent to the database. In this case we can see about 30K SELECT queries versus 20K UPDATE+INSERT queries, making this benchmark slightly more read intensive by this measure.

innodb row operations
Another way to look at the load is through actual operations at the row level – a single query may touch just one row or may touch millions. In this benchmark, looking at the workload from a SQL command standpoint and from a row operation standpoint yields the same result, but that is not always going to be the case.

io activity
Let’s now look at the operating system level. We can see the amount of data written to the disk is 2x more than the amount of data being read from the disk. This workload is write intensive by this measure.

top tables by row read

top tables by rows changed

Yet another way to look at your workload is from the perspective of tables. This view shows us which tables are accessed the most for reads and for writes. This in turn allows us to see whether a given table is getting more reads or writes. This is helpful, for example, if you are considering moving some of the tables to a different server and want to clearly understand how your workload will be impacted.

Analyzing Read/Write Workload by Response Time

As I mentioned already, the counts often do not reflect the time to respond, which is typically more representative of the real work being done. To look at timing information from the query point of view, we want to look at query analytics.

query analytics providing time analysis
The “Load” column here is a measure of such a combined response time, versus count which is reflective of query counts. Looking at this list we can see that three out of top five queries are SELECT queries. Looking at the numbers overall, we can see we have a read intensive application from this perspective.

In terms of row level operations, there is currently no easy way to see if reads or writes are dominating overall, but you can get an idea from the table operations dashboard:

table operations dashboard
This shows the load on a per table basis. It labels reads “Fetch” and breaks down writes in more detail—“Update”, “Delete”, “Inserts”—which is helpful. Not all writes are equal either.

disk io load

If we want a response-time-based view of reads vs writes at the operating system level, we can check out this disk IO Load graph. You can see that in this case it happens to match the IO activity graph, with storage taking more time to serve write requests than read requests.

Summary

As you can see, the question about whether a workload is read intensive or write intensive, while simple on the surface, can have many different answers. You might ask me “OK, so what should I use?” Well… it really depends.

Looking at query counts is a great way to understand the application’s demands on the database—you can’t really do anything to change the database size.  However by changing the database configuration and schema you may drastically alter the impact of these queries, both from the standpoint of the number of rows they crunch and in terms of the disk IO they require.

The response time based statistics, gathered from the impact your queries cause on the system or disk IO, provide a better representation of the load these queries currently generate.

Another thing to keep in mind—reads and writes are not created equal. My rule of thumb for InnoDB is that a single row write is about 10x more expensive than a single row read.
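To see how much this rule of thumb can shift the picture, here is a rough sketch that weights each write as a configurable number of reads. Applying it to the query counts from the benchmark is a simplification on our part, since the rule is stated for row operations, and the function name is ours.

```shell
#!/bin/bash
# Percentage of the load attributable to writes when each write is
# weighted as write_cost reads (integer arithmetic, rounded down).
write_share_pct() {
    local reads="$1" writes="$2" write_cost="${3:-10}"
    echo $(( writes * write_cost * 100 / (reads + writes * write_cost) ))
}

write_share_pct 30000 20000 1    # by count only
write_share_pct 30000 20000 10   # with the 10x rule of thumb
```

By count alone, writes are 40% of the operations. With the 10x weighting they account for about 86% of the estimated cost, so the same workload flips from read intensive to write intensive.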

More resources that you might enjoy

If you found this post useful, you might also like to see some of Percona’s other resources.

For an introduction to PMM, our free and open source management and monitoring software, you might find value in my recorded webinar, MySQL Troubleshooting and Performance Optimization with PMM.

Our white paper Performance at Scale could also provide useful insight if you are at the planning or review stage.

The post Is It a Read Intensive or a Write Intensive Workload? appeared first on Percona Database Performance Blog.
