
Upcoming Webinar Tues 9/11: Migrating to AWS Aurora: A Checklist for Success


Please join Percona’s Senior Consultant, Jervin Real, as he presents Migrating to AWS Aurora: A Checklist for Success. The event will take place on Tuesday, September 11th, 2018, at 11:00 AM PDT (UTC-7) / 2:00 PM EDT (UTC-4).

 

In the last few weeks, we have shown you how to successfully migrate from on-premises MySQL installations to AWS Aurora. What comes next is ensuring that your Aurora cluster performs and operates as you expect it to.

While Aurora’s hands-off operational approach helps agile practices remain agile, there are also trade-offs and subsequent growing pains.

This webinar will discuss how to remain flexible and in full control of your data while using AWS Aurora.

Register for this webinar on how to make your Aurora migration a success.


How To Deploy PMM on Linode With StackScripts


In my previous blog post, I showed how to deploy Percona Monitoring and Management (PMM) on Linode manually. It is pretty simple, but with a little coding it can be done even more easily using StackScripts.

Here’s how:

1. Click on “Add a Linode” and pick the Linode type you want to deploy.

2. Click on the deployed Linode and then click on the “Rebuild” link.

Rebuild the Linode

3. Click on “Deploy Using StackScripts”.

Deploy using StackScripts

4. On the resulting page, search for “PMM” and pick PMMServer from PerconaLab.

Choose PMMServer from PerconaLab

5. Provide the hostname for the new Linode, set the root password, and click on “Rebuild”.

6. Boot the server.

boot the server

7. You’re done. Wait about five minutes for the installation to complete; then you can view the PMM interface by going to the Linode’s IP address.

view PMM on Linode IP

If you think that deploying with StackScripts through the web interface is not much less hassle than doing it manually, you’re right. The real benefit comes with using the Linode API for deployment.

There are multiple ways to access this API, though for basic scripting I prefer the linode-cli tool for using the Linode API from the command line.

With linode-cli you can deploy your PMM Server on Linode using this one-liner:

linode-cli linodes create --label pmm-test  --root_pass MyRootPassword123 --stackscript_id 338458  --stackscript_data '{"hostname": "pmm-test"}'
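
And because it is just a command line call, you can script it. Here is a hypothetical sketch that deploys several PMM test instances in a loop; it reuses the StackScript ID and password from the example above, and depending on your linode-cli version you may also need flags such as --region, --type, and --image:

#!/bin/bash
# Hypothetical sketch: deploy several PMM servers in one go.
# Assumes linode-cli is already configured with your API token.
for i in 1 2 3; do
  host="pmm-test-${i}"
  linode-cli linodes create \
    --label "${host}" \
    --root_pass "MyRootPassword123" \
    --stackscript_id 338458 \
    --stackscript_data "{\"hostname\": \"${host}\"}"
done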

Summary

As you can see, with Linode StackScripts you can get going with Percona Monitoring and Management on Linode in no time, especially if you choose to use the Linode API.

You might also like:

Here’s an overview from the Percona Monitoring and Management manual on deploying PMM. If you are new to PMM and would like to know more, you will find lots of resources on this site including my webinar MySQL Troubleshooting and Performance Optimization with PMM.

PostgreSQL Webinar Wed Oct 10th – Enterprise-Grade PostgreSQL: Built on Open Source Tools


Please join Percona’s PostgreSQL Support Technical Lead, Avinash Vallarapu; Senior Support Engineer, Fernando Laudares; and Senior Support Engineer, Jobin Augustine, on Wednesday, October 10th, 2018 at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4), as they demonstrate an enterprise-grade PostgreSQL environment built using a combination of open source tools and extensions.

“We built our application on top of PostgreSQL. It works great but only now that we took it to the market and it became a hit we realize how much it relies on the database. How can we “harden” PostgreSQL? How can we make the solution we built around PostgreSQL enterprise-grade?”

“I migrated from a proprietary database software to PostgreSQL. I am curious to know whether I can get the same features I used to have in the proprietary database software.”

You’ll find the answers to these questions, and more, in a series of blog posts we will be publishing on this topic, followed by the live demo we have planned for our webinar on October 10th, 2018.

The market coined the term “enterprise grade” or “enterprise ready” to differentiate products and service offerings for licensed database software. For example: there may be a standard database software or an entry-level package that delivers the core functionality and basic features. Likewise, there may be an enterprise version, a more advanced package which goes beyond the essentials to include features and tools indispensable for running critical solutions in production. With such a differentiation found in commercial software, we may wonder whether a solution built on top of an open source database like PostgreSQL can satisfy all the enterprise requirements.

It starts with building a secured PostgreSQL environment, tuning the database for the production workload, building a high availability strategy that avoids single points of failure, scaling PostgreSQL using connection poolers to avoid excessive usage of server resources, and finally load balancing reads between the master and all the available standby servers (aka replicas) to use the computing power of all the database servers effectively.

The operational aspect of maintaining an enterprise-grade PostgreSQL database also includes methods to configure a backup strategy that achieves point-in-time recovery as needed; detailed logging and monitoring of PostgreSQL, along with real-time analysis of database performance; and maintaining database health with optimal performance, such as making sure vacuuming is working as it should and at the right times.

“Can we build an enterprise-grade solution that satisfies all the above requirements around PostgreSQL with open source software only?”

Yes, we can. During the 20+ years PostgreSQL has been around, the open source community has created all sorts of complementary extensions and tools that can be used to build an enterprise-grade solution with Postgres.

We’ll be following this post with a series of posts covering each piece of such a solution, culminating with a webinar on October 10th, during which we’ll showcase the full project. Here’s the list of topics we are going to consider while building our enterprise-grade PostgreSQL server:

  1. Securing your PostgreSQL database cluster
  2. High availability
  3. Preparing a backup strategy and the tools available to achieve it
  4. Scaling PostgreSQL using connection poolers and load balancers
  5. Tools/extensions available for your daily DBA life and detailed logging in PostgreSQL
  6. Monitoring your PostgreSQL and real-time analysis

Join us to see it in action!

Securing PostgreSQL as an Enterprise-Grade Environment


In this post, we review how you can build an enhanced and secure PostgreSQL database environment using community software. We look at the features that are available in PostgreSQL that, when implemented, provide improved security.

As discussed in the introductory blog post of this series, in our webinar of October 10, 2018 we highlight important aspects an enterprise should consider for their PostgreSQL environments. This series of blogs addressing particular aspects of the enterprise-grade postgres environment complements the webinar. This post addresses security.

Authentication Layer

Client connections to the PostgreSQL server using host-based authentication

PostgreSQL uses a host-based authentication file (pg_hba.conf) to authorize incoming connections. Each entry in this file combines five fields: type, database, user, address, and method. A client is allowed to connect to a database only when the combination of username, database, and client hostname matches an entry in the pg_hba.conf file.

Consider the following entry in the pg_hba.conf file:

# TYPE DATABASE USER ADDRESS METHOD
host percona pguser 192.168.0.14/32 md5

This entry says that connections from host 192.168.0.14 are allowed only for user pguser and only to the database percona. The md5 method forces password authentication.

The order of the entries in the pg_hba.conf file matters: the first entry that matches a connection is the one applied. So if you have an entry that rejects connections from a given server followed by another that allows them, the connection is rejected.

This is the first layer of protection in authentication. If these criteria are not satisfied in this Access Control List (ACL), PostgreSQL will discard the request without even considering server authentication.

Server Authentication

Historically, PostgreSQL uses an MD5 digest as its default password hash. The problem with pure MD5 hashing is that this function will always return the same hash for a given password, which renders an MD5 digest more susceptible to password cracking. Newer versions of PostgreSQL implement SCRAM authentication (Salted Challenge Response Authentication Mechanism), which stores passwords in salted and iterated hash formats to strengthen PostgreSQL against offline attacks. SCRAM-SHA-256 support was introduced in PostgreSQL 10. What matters most in terms of “enterprise-grade” security is that PostgreSQL supports industry-standard authentication methods out of the box, such as SSL certificates, PAM/LDAP, and Kerberos.
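
As a minimal sketch, assuming PostgreSQL 10 or later, opting in to SCRAM for new password hashes and requiring it in pg_hba.conf might look like this (the address range is illustrative, and existing passwords must be reset so they are stored in the new format):

ALTER SYSTEM SET password_encryption = 'scram-sha-256';
SELECT pg_reload_conf();

# TYPE DATABASE USER ADDRESS METHOD
host all all 192.168.0.0/24 scram-sha-256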

Authorization Layer

User management through roles and privileges

It is always recommended to implement segregation of users through roles and privileges. There may be several user accounts in your PostgreSQL server: only a few of them may be application accounts, while the rest belong to developers or admins. In such cases, PostgreSQL allows you to create multiple roles, each of which can be assigned a set of privileges. Thus, instead of managing user privileges individually, standard roles can be maintained and the appropriate role from the list assigned to a user. Through roles, database access can be standardized, which helps with user management and avoids granting too much or too little privilege to a given user.

For example, we might have six roles:

app_read_write
app_read_only
dev_read_write
dev_read_only
admin_read_write
admin_read_only

Now, if you need to create a new dev user who should only have read access, grant the appropriate role, in this case dev_read_only:

GRANT dev_read_only TO avi_im_developer;
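
For reference, here is a minimal sketch of how a role such as dev_read_only might be defined, assuming the application tables live in a schema named app (the schema name is illustrative):

CREATE ROLE dev_read_only NOLOGIN;
GRANT USAGE ON SCHEMA app TO dev_read_only;
GRANT SELECT ON ALL TABLES IN SCHEMA app TO dev_read_only;
-- Cover tables created in the future as well:
ALTER DEFAULT PRIVILEGES IN SCHEMA app GRANT SELECT ON TABLES TO dev_read_only;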

Row Level Security

Starting with version 9.5, PostgreSQL implements row level security, which can limit access to only a subset of records/rows in a table. Usually a user is granted a mix of SELECT, INSERT, DELETE and UPDATE privileges on a given table, which allows access to all records in the table. Through row level security, however, such privileges can be restricted to a subset of the records by means of a policy, which in turn can be  assigned to a role.

In the next example, we create an employee table and two manager accounts. We then enable row level security on the table and create a policy that allows the managers to only view/modify their own subordinates’ records:

CREATE SCHEMA scott; -- create the schema used throughout this example
CREATE TABLE scott.employee (id INT, first_name VARCHAR(20), last_name VARCHAR(20), manager VARCHAR(20));
INSERT INTO scott.employee VALUES (1,'avinash','vallarapu','carina');
INSERT INTO scott.employee VALUES (2,'jobin','augustine','stuart');
INSERT INTO scott.employee VALUES (3,'fernando','laudares','carina');
CREATE USER carina WITH ENCRYPTED PASSWORD 'carina';
CREATE USER stuart WITH ENCRYPTED PASSWORD 'stuart';
CREATE ROLE managers;
GRANT managers TO carina, stuart;
GRANT SELECT, INSERT, UPDATE, DELETE ON scott.employee TO managers;
GRANT USAGE ON SCHEMA scott TO managers;
ALTER TABLE scott.employee ENABLE ROW LEVEL SECURITY;
CREATE POLICY employee_managers ON scott.employee TO managers USING (manager = current_user);

In the log we can see that only certain records are visible to each manager:

$ psql -d percona -U carina
psql (10.5)
Type "help" for help.
percona=> select * from scott.employee ;
id | first_name | last_name | manager
----+------------+-----------+---------
 1 | avinash    | vallarapu | carina
 3 | fernando   | laudares  | carina
(2 rows)
$ psql -d percona -U stuart
psql (10.5)
Type "help" for help.
percona=> select * from scott.employee ;
id | first_name | last_name | manager
----+------------+-----------+---------
 2 | jobin      | augustine | stuart
(1 row)

You can read more about row level security in the manual page.

Data Security

1. Encryption of data over the wire using SSL

PostgreSQL allows you to use SSL to encrypt data in motion. In addition, you may enable certificate-based authentication to ensure that the communication is happening between trusted parties. SSL is implemented via OpenSSL, so it requires the OpenSSL package to be installed on your PostgreSQL server and PostgreSQL to be built with --with-openssl support.

The following entry in a pg_hba.conf file says that connections to any database and from any user are allowed from server 192.168.0.13, as long as the communication is encrypted over SSL:

# TYPE DATABASE USER ADDRESS METHOD
hostssl all all 192.168.0.13/32 md5

Optionally, you may also require a valid client certificate before the connection is established, using client certificate authentication with the following method:

# TYPE DATABASE USER ADDRESS METHOD
hostssl all all 192.168.0.13/32 cert clientcert=1

2. Encryption at Rest – pgcrypto

The pgcrypto module provides cryptographic functions for PostgreSQL, allowing certain fields to be stored encrypted. pgcrypto implements PGP encryption, which is part of the OpenPGP (RFC 4880) standard. It supports both symmetric-key and public-key encryption. Besides the advanced features offered by PGP for encryption, pgcrypto also offers functions for running simple encryption based on ciphers. These functions only run a cipher over data.
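
Here is a minimal sketch of symmetric encryption with pgcrypto; the table name and passphrase are purely illustrative (in practice the passphrase should come from the application, not be hardcoded in SQL):

CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE TABLE vault (id serial PRIMARY KEY, secret bytea);
-- Encrypt on write...
INSERT INTO vault (secret) VALUES (pgp_sym_encrypt('my secret data', 'passphrase'));
-- ...and decrypt on read:
SELECT pgp_sym_decrypt(secret, 'passphrase') FROM vault;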

Accounting and Auditing

Logging in PostgreSQL

PostgreSQL allows you to log all statements, or only some, based on parameter settings. When logging_collector is enabled, you can log all DDL or DML statements, or any statement running for more than a certain duration, to the log file. To avoid write overload on the data directory, you may also move your log_directory to a different location. Here are a few important parameters you should review when logging activities in your PostgreSQL server (see the example after this list):

log_connections
log_disconnections
log_lock_waits
log_statement
log_min_duration_statement
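
As an illustration only, here is one way a few of these could be set with ALTER SYSTEM (the values are examples, not recommendations; equivalent entries can go directly into postgresql.conf):

ALTER SYSTEM SET log_connections = 'on';
ALTER SYSTEM SET log_lock_waits = 'on';
ALTER SYSTEM SET log_statement = 'ddl';
ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();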

Please note that detailed logging takes additional disk space and, depending on the activity on your PostgreSQL server, may impose significant write IO overhead. You should be careful when enabling logging, and only do so after understanding the overhead and performance degradation it may cause to your workload.

Auditing – pgaudit and set_user

Some essential auditing features in PostgreSQL are implemented as extensions, which can be enabled at will on highly secured environments with regulatory requirements.

pgaudit helps to audit the activities happening in the database. Even if an unauthorized user intentionally obfuscates the DDL or DML, both the statement the user submitted and the sub-statement that was actually executed in the database will be logged to the PostgreSQL log file.

set_user provides a method of privilege escalation. If properly implemented, it provides the highest level of auditing, which allows the monitoring of even SUPERUSER actions.

You can read more about pgaudit here.

Security Bug Fixes

The PostgreSQL Global Development Group (PGDG) takes security bugs seriously. Any security vulnerability can be reported directly to security@postgresql.org. The list of security issues fixed for all the supported PostgreSQL versions can be found here. Security fixes to PostgreSQL are made available through minor version upgrades, which is the main reason why it is advisable to always keep PostgreSQL servers on the latest minor version.

If you liked this post…

Please join Percona’s PostgreSQL Support Technical Lead,  Avinash Vallarapu; Senior Support Engineer, Fernando Laudares; and Senior Support Engineer, Jobin Augustine, on Wednesday, October 10, 2018 at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4), as they demonstrate an enterprise-grade PostgreSQL® environment built using a combination of open source tools and extensions.

Register Now

Upcoming Webinar Thurs 9/27: Best Practices Using Indexes in MongoDB


Please join Percona’s Principal Architect Alex Rubin as he presents Best Practices Using Indexes in MongoDB on Thursday, September 27th at 11:00 AM PDT (UTC-7) / 2:00 PM EDT (UTC-4).

 

Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.

In this webinar we’ll discuss the following:

  • How MongoDB uses indexes
  • What types of indexes exist in MongoDB
  • How to find slow queries
  • Which index to create to speed up your queries and how to verify that the index will be used
  • How to monitor and evaluate the performance of your indexes

Register for this webinar on Best Practices Using Indexes in MongoDB.

Scaling Percona Monitoring and Management (PMM)


Starting with PMM 1.13, PMM uses Prometheus 2 for metrics storage, which tends to be the heaviest consumer of CPU and RAM. Thanks to the Prometheus 2 performance improvements, PMM can scale to more than 1000 monitored nodes per instance in its default configuration. In this blog post we will look into PMM scaling and capacity planning: how to estimate the resources required, and what drives resource consumption.

PMM tested with 1000 nodes

We have now tested PMM with up to 1000 nodes, using a virtualized system with 128GB of memory, 24 virtual cores, and SSD storage. We found PMM scales pretty linearly with the available memory and CPU cores, and we believe that a higher number of nodes could be supported with more powerful hardware.

What drives resource usage in PMM?

Depending on your system configuration and workload, a single node can generate very different loads on the PMM server. The main factors that impact the performance of PMM are:

  1. Number of samples (data points) injected into PMM per second
  2. Number of distinct time series they belong to (cardinality)
  3. Number of distinct query patterns your application uses
  4. Number of queries you run against PMM, through the user interface or the API, and their complexity

These specifically can be impacted by:

  • Software version – modern database software versions expose more metrics
  • Software configuration – some metrics are only exposed in certain configurations
  • Workload – a large number of database objects and high concurrency will increase both the number of samples ingested and their cardinality
  • Exporter configuration – disabling collectors can reduce the amount of data collected
  • Scrape frequency – controlled by METRICS_RESOLUTION

All these factors together may impact resource requirements by a factor of ten or more, so do your own testing to be sure. However, the numbers in this article should serve as good general guidance as a start point for your research.

On the system supporting 1000 instances we observed the following performance:

Performance PMM 1000 nodes load

As you can see, we have more than 2,000 scrapes/sec performed, providing almost two million samples/sec, and more than eight million active time series. These are the main numbers that define the load placed on Prometheus.

Capacity planning to scale PMM

Both CPU and memory are very important resources for PMM capacity planning. Memory is the more important of the two, as Prometheus 2 does not have good options for limiting its memory consumption: if there is not enough memory to handle your workload, it will run out of memory and crash.

We recommend at least 2GB of memory for a production PMM installation. A test installation with 1GB of memory is possible, but it may not be able to monitor more than one or two nodes without running out of memory. With 2GB of memory you should be able to monitor at least five nodes without problems.

With more powerful systems (8GB or more) you can plan for approximately eight monitored systems per 1GB of memory, or about 15,000 samples ingested/sec per 1GB of memory.

To estimate the CPU resources required, allow for about 50 monitored systems per core (or 100K metrics/sec per CPU core).
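
As a rough worked example using these rules: a deployment monitoring 400 nodes would need about eight CPU cores (400/50) and roughly 50GB of memory (400/8), plus headroom for the operating system and dashboard queries.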

One problem you’re likely to encounter if you’re running PMM with 100+ instances is the Home Dashboard, which becomes far too heavy with such a large number of servers. We plan to fix this issue in future releases of PMM, but for now you can work around it in two simple ways:

You can select a single host, for example “pmm-server”, in your Home Dashboard and save it before adding a large number of hosts to the system.

set home dashboard for PMM

Or you can make some other dashboard of your choice and set it as the home dashboard.

Summary

  • More than 1,000 monitored systems is possible per single PMM server
  • Your specific workload and configuration may significantly change the resources required
  • If deploying with 8GB or more, plan 50 systems per core, and eight systems per 1GB of RAM

High Availability for Enterprise-Grade PostgreSQL Environments


High availability (HA) and database replication is a major topic of discussion for database technologists. There are a number of informed choices to be made to optimize PostgreSQL replication so that you achieve HA. In this post we introduce an overview of the topic, and cover some options available to achieve high availability in PostgreSQL. We then focus on just one way to implement HA for Postgres: using Patroni.

In our previous blog posts, we have discussed the features available to build a secured PostgreSQL environment and the tools available to help you set up a reliable backup strategy. This series of articles is designed to give a flavor of how you might go about building an enterprise-grade PostgreSQL environment using open source tools. If you’d like to see this implemented, then please do check out our webinar presentation of October 10 – we think you might find it both useful and intriguing!

Replication in PostgreSQL

The first step towards achieving high availability is making sure you don’t rely on a single database server: your data should be replicated to at least one standby replica/slave. Database replication can be done using the two options available with PostgreSQL community software:

  1. Streaming replication
  2. Logical replication & logical decoding

When we set up streaming replication, a standby replica connects to the master (primary) and streams WAL records from it. Streaming replication is considered one of the safest and fastest methods for replication in PostgreSQL. A standby server becomes an exact replica of the primary, with potentially minimal lag between primary and standby even on very busy transactional servers. PostgreSQL allows you to build both synchronous and asynchronous streaming replication. Synchronous replication ensures that a client is given a success message only when a change is not only committed on the master but also successfully replicated to the standby server. Since standby servers can accept read requests from clients, we can make more efficient use of our PostgreSQL setup by sparing the master from serving read requests and redirecting these to the replicas instead. You can read more about streaming replication in this blog post.
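
As a minimal sketch, synchronous replication could be enabled on the primary for a standby that connects with the application_name standby1 (the name is illustrative):

ALTER SYSTEM SET synchronous_standby_names = 'standby1';
SELECT pg_reload_conf();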

Logical replication in PostgreSQL allows users to perform selective replication of a subset of the tables found on the master. While streaming replication is implemented in PostgreSQL at the block level—where every database on the master gets replicated to the replica, which remains read-only—logical replication suits those unique scenarios where you need to replicate only a selection of tables in a database and (optionally) allow direct writes to your slave. A slave configured with logical replication can also be configured to replicate from multiple masters. One situation where this is helpful is when you need to replicate data from several PostgreSQL databases to a single PostgreSQL server for reporting and data warehousing tasks.
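
Assuming PostgreSQL 10 or later, a minimal logical replication sketch might look like this; the table, publication, subscription, and connection string names are all illustrative:

-- On the master (publisher):
CREATE PUBLICATION sales_pub FOR TABLE public.orders;
-- On the reporting server (subscriber):
CREATE SUBSCRIPTION sales_sub
  CONNECTION 'host=master1 dbname=percona user=replicator'
  PUBLICATION sales_pub;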

While it’s technically possible to employ standby servers configured with logical replication in an HA environment, this doesn’t fare well as a best practice. For such usage, a standby server should be able to take the place of another server “transparently”—the more it resembles the master, the better. Logical replication opens the door for different data to be replicated to different servers, and this may break things.

Here is a list of built-in features available in PostgreSQL that are designed to help achieve high availability:

  • Streaming replication
  • Cascaded replication
  • Asynchronous standby
  • Synchronous standby
  • Warm standby
  • Hot standby
  • pg_rewind and pg_basebackup 

Automatic failover and an always-on strategy

There are many more open source solutions that can help us achieve high availability with PostgreSQL, especially during critical moments when a master (primary server) becomes unavailable. The following is a list of a few such open source solutions built for PostgreSQL:

  1. Patroni
  2. Stolon
  3. repmgr
  4. PostgreSQL Automatic Failover (PAF)
  5. pglookout
  6. pgPool-II

However, the HA solutions available for PostgreSQL are not just limited to the list above. We would be interested to hear what you have implemented as an HA solution in the comments section below.

In our upcoming webinar we are going to show you a PostgreSQL replication cluster built using Patroni and how it provides a seamless failover that is transparent to the application. 

Patroni

Patroni is a PostgreSQL cluster management template/framework that talks to a distributed consensus key-value store and decides on the state of the cluster. It started as a fork of the Governor project. A Patroni PostgreSQL cluster is composed of many individual PostgreSQL instances running on bare metal, in containers, or on virtual machines. In our setup we’ve used etcd for consensus management, which handles leader elections and decides the leader among a cluster of servers that may become partitioned by the network. This distributed consensus management can also be achieved with other technologies, such as ZooKeeper and Consul. In the event of a failover, Patroni promotes the slave that has been elected leader through the etcd-like consensus key-value store. Note that when you set up asynchronous replication, you have the option to specify maximum_lag_on_failover to prevent the promotion of a slave that is lagging by more than this value.
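
For illustration, here is a hedged excerpt of what the relevant part of a Patroni configuration (patroni.yml) might look like; the values shown are examples only:

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # in bytes; a node lagging by more is not promoted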

Here’s an architecture diagram of Patroni:

Key Benefits of Patroni:

  1. Continuous monitoring and automatic failover
  2. Manual/scheduled switchover with a single command
  3. Built-in automation for bringing a failed node back into the cluster
  4. REST APIs for entire cluster configuration and further tooling
  5. Infrastructure for transparent application failover
  6. Distributed consensus for every action and configuration
  7. Integration with the Linux watchdog to avoid split-brain syndrome

If you found this post useful…

You are sure to enjoy our webinar of October 10, where we demonstrate live how to build an enterprise-grade PostgreSQL environment with open source tools. If you make it to the live presentation, you will also have the chance to ask questions of the team.

In the next blog post of this series we’ll be covering the scalability of our solution and how to accommodate an increase in traffic while maintaining the quality of the service. We’re moving ever closer to an enterprise-grade environment with open source tools!

ClickHouse: Two Years!


Following my post from a year ago, https://www.percona.com/blog/2017/07/06/clickhouse-one-year/, I wanted to review what has happened with ClickHouse over this past year. There is indeed some interesting news to share.

1. ClickHouse in DB-Engines Ranking. It has not quite made it into the top 100, but the climb from position 174 to 106 is still impressive. Its DB-Engines Ranking score tripled, from 0.54 last September to 1.57 this September.

And indeed, in my conversations with customers and partners, the narrative has changed from “ClickHouse, what is it?” to “We are using or considering ClickHouse for our analytics needs”.

2. ClickHouse changed its versioning schema. Unfortunately, it changed from the unconventional …; 1.1.54390; 1.1.54394 naming structure to the still unconventional 18.6.0; 18.10.3; 18.12.13 naming structure, where “18” is the year of the release.

Now to the more interesting technical improvements.

3. Support for the more traditional JOIN syntax. Now if you join two tables you can use SELECT ... FROM tab1 ANY LEFT JOIN tab2 ON tab1_col=tab2_col.

So now, if we take a query from the workload described in https://www.percona.com/blog/2017/06/22/clickhouse-general-analytical-workload-based-star-schema-benchmark/, we can write this:

SELECT C_REGION, sum(LO_EXTENDEDPRICE * LO_DISCOUNT)
FROM lineorder ANY INNER JOIN customer
ON LO_CUSTKEY=C_CUSTKEY
WHERE (toYear(LO_ORDERDATE) = 1993) AND ((LO_DISCOUNT >= 1) AND (LO_DISCOUNT <= 3)) AND (LO_QUANTITY < 25)
GROUP BY C_REGION;

instead of the monstrous:

SELECT
    C_REGION,
    sum(LO_EXTENDEDPRICE * LO_DISCOUNT)
FROM lineorder
ANY INNER JOIN
(
    SELECT
        C_REGION,
        C_CUSTKEY AS LO_CUSTKEY
    FROM customer
) USING (LO_CUSTKEY)
WHERE (toYear(LO_ORDERDATE) = 1993) AND ((LO_DISCOUNT >= 1) AND (LO_DISCOUNT <= 3)) AND (LO_QUANTITY < 25)
GROUP BY C_REGION;

4. Support for DELETE and UPDATE operations. This has probably been the most requested feature since the first ClickHouse release.
ClickHouse uses an LSM-tree-like structure—MergeTree—which is not friendly to single-row operations. ClickHouse deliberately uses the ALTER TABLE UPDATE / ALTER TABLE DELETE syntax to highlight that these are executed as bulk operations, so please treat them as such. Updating or deleting rows in ClickHouse should be an exceptional operation, rather than a part of your day-to-day workload.

We can update a column like this: ALTER TABLE lineorder UPDATE LO_DISCOUNT = 5 WHERE LO_CUSTKEY = 199568
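
Similarly, we can delete rows in bulk, mirroring the update example above: ALTER TABLE lineorder DELETE WHERE LO_CUSTKEY = 199568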

5. ClickHouse added a feature which I call Dictionary Compression, although ClickHouse uses the name “LowCardinality”. It is still experimental, but I hope it will be production ready soon. Basically, it allows ClickHouse to internally replace long strings with a short list of enumerated values.

For example, consider the lineorder table from our example, which contains 600,037,902 rows but only five different values for the column LO_ORDERPRIORITY:

SELECT DISTINCT LO_ORDERPRIORITY
FROM lineorder
┌─LO_ORDERPRIORITY─┐
│ 1-URGENT         │
│ 5-LOW            │
│ 4-NOT SPECIFIED  │
│ 2-HIGH           │
│ 3-MEDIUM         │
└──────────────────┘

So we can define our table as:

CREATE TABLE lineorder_dict (
        LO_ORDERKEY             UInt32,
        LO_LINENUMBER           UInt8,
        LO_CUSTKEY              UInt32,
        LO_PARTKEY              UInt32,
        LO_SUPPKEY              UInt32,
        LO_ORDERDATE            Date,
        LO_ORDERPRIORITY        LowCardinality(String),
        LO_SHIPPRIORITY         UInt8,
        LO_QUANTITY             UInt8,
        LO_EXTENDEDPRICE        UInt32,
        LO_ORDTOTALPRICE        UInt32,
        LO_DISCOUNT             UInt8,
        LO_REVENUE              UInt32,
        LO_SUPPLYCOST           UInt32,
        LO_TAX                  UInt8,
        LO_COMMITDATE           Date,
        LO_SHIPMODE             LowCardinality(String)
)Engine=MergeTree(LO_ORDERDATE,(LO_ORDERKEY,LO_LINENUMBER),8192);

How does this help? Firstly, it offers space savings: the table will take less space in storage, as it uses integer values instead of strings. Secondly, performance: the filtering operation is executed faster.

For example, here’s a query against the table with LO_ORDERPRIORITY stored as String:

SELECT count(*)
FROM lineorder
WHERE LO_ORDERPRIORITY = '2-HIGH'
┌───count()─┐
│ 119995822 │
└───────────┘
1 rows in set. Elapsed: 0.859 sec. Processed 600.04 million rows, 10.44 GB (698.62 million rows/s., 12.16 GB/s.)

And now the same query against the table with LO_ORDERPRIORITY as LowCardinality(String):

SELECT count(*)
FROM lineorder_dict
WHERE LO_ORDERPRIORITY = '2-HIGH'
┌───count()─┐
│ 119995822 │
└───────────┘
1 rows in set. Elapsed: 0.350 sec. Processed 600.04 million rows, 600.95 MB (1.71 billion rows/s., 1.72 GB/s.)

This is 0.859 sec vs 0.350 sec (for the LowCardinality case).

Unfortunately this feature is not yet optimized for all use cases; in aggregation it actually performs slower.

An aggregation query against the table with LO_ORDERPRIORITY as String:

SELECT DISTINCT LO_ORDERPRIORITY
FROM lineorder
┌─LO_ORDERPRIORITY─┐
│ 4-NOT SPECIFIED  │
│ 1-URGENT         │
│ 2-HIGH           │
│ 3-MEDIUM         │
│ 5-LOW            │
└──────────────────┘
5 rows in set. Elapsed: 1.200 sec. Processed 600.04 million rows, 10.44 GB (500.22 million rows/s., 8.70 GB/s.)

Versus an aggregation query against the table with LO_ORDERPRIORITY as LowCardinality(String):

SELECT DISTINCT LO_ORDERPRIORITY
FROM lineorder_dict
┌─LO_ORDERPRIORITY─┐
│ 4-NOT SPECIFIED  │
│ 1-URGENT         │
│ 2-HIGH           │
│ 3-MEDIUM         │
│ 5-LOW            │
└──────────────────┘
5 rows in set. Elapsed: 2.334 sec. Processed 600.04 million rows, 600.95 MB (257.05 million rows/s., 257.45 MB/s.)

This is 1.200 sec vs 2.334 sec (for the LowCardinality case).

6. And the last feature I want to mention is better support for Tableau Software, which required ODBC drivers. It may not seem significant, but Tableau is the number one software for data analysts, and by supporting it, ClickHouse will reach a much wider audience.

Summing up: ClickHouse has definitely become much more user friendly over the past year!


Scaling PostgreSQL Using Connection Poolers and Load Balancers for an Enterprise-Grade Environment


In the previous blog posts in this series we’ve covered some of the essential aspects of an enterprise-grade solution: security, high availability, and backups. Another important aspect is the scalability of the solution: as our application grows, how do we accommodate an increase in traffic while maintaining the quality of the service (response time)? The answer to this question depends on the nature of the workload at play, but it is often shaped around:

(a) improving its efficiency and
(b) increasing the resources available.

Why connection pooling?

When it comes to improving the efficiency of a database workload, one of the first places we look is the list of slow queries; if the most popular ones can be optimized to run faster then we can easily gain some overall performance back. Arguably, we may look next at the number and frequency of client connections: is the workload composed of a high number of very frequent but short-lived connections? Or are client connections of a more moderate number, tending to stick around for longer?

Consider the first scenario further—a high number of short-lived connections—and remember that each connection spawns a new OS process, which is itself an expensive operation in terms of both CPU and memory (PostgreSQL is process-based, as opposed to thread-based). Given the hardware available and the workload being processed, the server may hit a practical limit as to the number of transactions—or connections—it can manage per second.

A possible remedy for this is the use of a connection pooler, acting as a mediator between the application and the database. The connection pooler keeps a number of connections permanently opened with the database, and receives and manages all incoming requests from clients itself, allowing them to temporarily use one of the connections it has already established with PostgreSQL. This removes the burden of creating a new process each time a client establishes a connection with PostgreSQL, and allows the server to employ the resources it would otherwise use for this to serve more requests (or to complete them faster).

Rule of thumb?

A general rule of thumb that we often hear is that you may need a connection pooler once you reach around 350 concurrent connections. However, the actual threshold is highly dependent on your database traffic and server configuration: as we found out recently, you may need one much sooner.

You may implement connection pooling using your native application connection pooler (if one is available) or through an external connection pooler such as PgBouncer or pgPool-II. For the solution we have built, which we demonstrate in our webinar of October 10, we have used PgBouncer as our connection pooler.

PgBouncer

PgBouncer is a lightweight (thread-based) connection pooler that has been widely used in PostgreSQL based environments. It “understands” the PostgreSQL connection protocol and has been a stable project for over a decade.

PgBouncer allows you to configure the pool of connections to operate in three distinct modes: session, statement, and transaction. Unless you have a good reason to reserve a connection in the pool to a single user for the duration of their session, or are operating with single statements exclusively, transaction mode is the one you should investigate.
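
For illustration, the relevant pgbouncer.ini settings for transaction mode might look like this; the sizes below are examples only, not recommendations:

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20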

A feature that is central to our enterprise-grade solution is that you can add multiple connection strings using unique alias names (referred to as database names). This allows greater flexibility when mediating connections with multiple database servers. We can then have an alias named “master_db” that will route connections to a master/primary server and another alias named “slave_db” that will route connections to a slave/standby server.

Scaling up

Once efficiency is taken care of, we can then start working on increasing the resources, or computing power, available to process database requests. Scaling vertically means, in short, upgrading the server: more and faster cores, memory, storage. It’s a simple approach, but one that reaches a practical limitation rather quickly. It is not inline with other requirements of an enterprise grade solution, such as high availability. The alternative is scaling horizontally. As briefly introduced above, a common way for implementing horizontal scalability is to redirect reads to standby servers (replicas) with the help of a proxy, which can also act as a load balancer, such as HAProxy. We’ll be discussing these ideas further here, and showcase their integration in our webinar.

HAProxy 

HAProxy is a popular open source TCP/HTTP load balancer that can distribute the workload across multiple servers. It can be leveraged in a PostgreSQL replication cluster that has been built using streaming replication. When you build replication using streaming replication method standby replicas are open for reads. With the help of HAProxy you can efficiently utilize the computing power of all database servers, distributing read requests among the available replicas using algorithms such as Least Connection and Round Robin.

A combination of connection pooler and load balancer to scale PostgreSQL

The following diagram represents a simplified part of the architecture that composes the enterprise grade solution we’ve designed, where we employ PgBouncer and HAProxy to scale our PostgreSQL cluster:

Our PgBouncer configuration contains two database (alias) names: one for redirecting writes to the master and another for balancing reads across standby replicas, as discussed above. Here is how the [databases] section looks in pgbouncer.ini, PgBouncer’s main configuration file:

[databases]
master = host=haproxy port=5002 dbname=postgres
slave = host=haproxy port=5003 dbname=postgres

Notice that both database entries redirect their connections to the HAProxy server, but each to a different port. HAProxy, in turn, is configured to route connections according to the port on which they arrive. Taking the above pgbouncer.ini file as a reference, writes (master connections) are redirected to port 5002 and reads (slave connections) to port 5003. Here is how the HAProxy config file looks:

# Connections to port 5002
listen Master
    bind *:5002
    option tcp-check
    tcp-check send GET\ / HTTP/1.0\r\n
    tcp-check send HOST\r\n
    tcp-check send \r\n
    tcp-check expect string "role":\ "master"
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg0 pg0:5432 maxconn 100 check port 8008
    server pg1 pg1:5432 maxconn 100 check port 8008
    server pg2 pg2:5432 maxconn 100 check port 8008
# Connections to port 5003
listen Slaves
    bind *:5003
    option tcp-check
    tcp-check send GET\ / HTTP/1.0\r\n
    tcp-check send HOST\r\n
    tcp-check send \r\n
    tcp-check expect string "role":\ "replica"
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg0 pg0:5432 maxconn 100 check port 8008
    server pg1 pg1:5432 maxconn 100 check port 8008
    server pg2 pg2:5432 maxconn 100 check port 8008

As seen above:

  • connections to port 5002 are sent to the server with role: “master”
  • connections to port 5003 are sent to a server with role: “replica”

In a previous post, we discussed using Patroni in our high availability setup, where it provides cluster management and automatic failover. HAProxy relies on Patroni to determine the role of each PostgreSQL server: by querying Patroni’s REST API (on port 8008 in this scenario) we can obtain the role of a given PostgreSQL server. The example below shows this in practice, with the IP addresses denoting the PostgreSQL servers in this setup:

$ curl -s 'http://192.168.50.10:8008' | python -c "import sys, json; print json.load(sys.stdin)['role']"
replica
$ curl -s 'http://192.168.50.20:8008' | python -c "import sys, json; print json.load(sys.stdin)['role']"
master
$ curl -s 'http://192.168.50.30:8008' | python -c "import sys, json; print json.load(sys.stdin)['role']"
replica

HAProxy can thus rely on Patroni’s REST API to redirect connections from the master alias in PgBouncer to the server with the master role. Similarly, HAProxy uses the role information to redirect connections from the slave alias to one of the servers with the replica role, using the appropriate load balancing algorithm.

This way, we ensure that the application takes advantage of a connection pooler to manage its connections to the database, and of the load balancer, which distributes the read load across multiple database servers as configured.

There are many other open source connection poolers and load balancers available to build a similar setup. You can choose the one that best suits your environment—just make sure to test your custom solution appropriately before bringing it to production.

Finding Table Differences on Nullable Columns Using MySQL Generated Columns


Some time ago, a customer had a performance issue with an internal process. He was comparing, finding, and reporting the rows that were different between two tables. This is simple if you use a LEFT JOIN and an IS NULL comparison over the second table in the WHERE clause, but what if the column could be NULL? That is why he used UNION, GROUP BY, and HAVING clauses, which resulted in poor performance.

The challenge was to be able to compare each row using a LEFT JOIN over NULL values.

The challenge in more detail

I’m not going to use the customer’s real table. Instead, I will be comparing two sysbench tables with the same structure:

CREATE TABLE `sbtest1` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned DEFAULT NULL,
  `c` char(120) DEFAULT NULL,
  `pad` char(60) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `k` (`k`,`c`,`pad`)
) ENGINE=InnoDB

It is slightly different from the original sysbench schema, as this version can hold NULL values. Both tables have the same number of rows. We are going to set one row to NULL on each table:

update sbtest1 set k=null where id=4462;
update sbtest2 set k=null where id=4462;

If we execute the comparison query, we get this result:

mysql> select "sbtest1",a.* from
    -> sbtest1 a left join
    -> sbtest2 b using (k,c,pad)
    -> where b.id is null union
    -> select "sbtest2",a.* from
    -> sbtest2 a left join
    -> sbtest1 b using (k,c,pad)
    -> where b.id is null;
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| sbtest1 | id   | k    | c                                                                                                                       | pad                                                         |
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| sbtest1 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |
| sbtest2 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
2 rows in set (3.00 sec)

As you can see, column k is NULL in both rows. The join failed in both cases and reported those rows as different. This is not new in MySQL, but it would be nice to have a way to sort this issue out.

Solution

The solution is based on GENERATED COLUMNS with a hash function (md5) and stored in a binary(16) column:

ALTER TABLE sbtest1
ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL,
ADD INDEX (real_id);
ALTER TABLE sbtest2
ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL,
ADD INDEX (real_id);

Adding the index is also part of the solution. Now, let’s execute the query using the new column to join the tables:

mysql> select "sbtest1",a.k,a.c,a.pad from
    -> sbtest1 a left join
    -> sbtest2 b using (real_id)
    -> where b.id is null union
    -> select "sbtest2",a.k,a.c,a.pad from
    -> sbtest2 a left join
    -> sbtest1 b using (real_id)
    -> where b.id is null;
Empty set (2.31 sec)

We can see an improvement in query performance—it now takes 2.31 sec, whereas before it was 3.00 sec—and that the result is as expected. We could say that that’s all, and no further improvement is possible. However, that is not true. Even though the query is running faster, it is possible to optimize it further:

mysql> select "sbtest1",a.k,a.c,a.pad
    -> from sbtest1 a
    -> where a.id in (select a.id
    ->   from sbtest1 a left join
    ->   sbtest2 b using (real_id)
    ->   where b.id is null) union
    -> select "sbtest2",a.k,a.c,a.pad
    -> from sbtest2 a
    -> where a.id in (select a.id
    ->   from sbtest2 a left join
    ->   sbtest1 b using (real_id)
    ->   where b.id is null);
Empty set (1.60 sec)

Why is this faster? The first query is performing two subqueries. Each subquery is very similar. Let’s check the explain plan:

mysql> explain select "sbtest1",a.k,a.c,a.pad from
    -> sbtest1 a left join
    -> sbtest2 b using (real_id)
    -> where b.id is null;
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref              | rows   | filtered | Extra                                |
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+
|  1 | SIMPLE      | a     | NULL       | ALL  | NULL          | NULL    | NULL    | NULL             | 315369 |   100.00 | NULL                                 |
|  1 | SIMPLE      | b     | NULL       | ref  | real_id       | real_id | 17      | sbtest.a.real_id |     27 |    10.00 | Using where; Not exists; Using index |
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+

As you can see, it performs a full table scan over the first table and uses real_id to join the second table. real_id is a generated column, so the function has to be executed to get the value used to join the second table, and that takes time.

If we analyze the subquery of the second query:

mysql> explain select "sbtest1",a.k,a.c,a.pad
    -> from sbtest1 a
    -> where a.id in (select a.id
    ->   from sbtest1 a left join
    ->   sbtest2 b using (real_id)
    ->   where b.id is null);
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+
| id | select_type  | table       | partitions | type   | possible_keys | key        | key_len | ref              | rows   | filtered | Extra                                |
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+
|  1 | SIMPLE       | a           | NULL       | index  | PRIMARY       | k          | 187     | NULL             | 315369 |   100.00 | Using where; Using index             |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_key>    | <auto_key> | 4       | sbtest.a.id      |      1 |   100.00 | NULL                                 |
|  2 | MATERIALIZED | a           | NULL       | index  | PRIMARY       | real_id    | 17      | NULL             | 315369 |   100.00 | Using index                          |
|  2 | MATERIALIZED | b           | NULL       | ref    | real_id       | real_id    | 17      | sbtest.a.real_id |     27 |    10.00 | Using where; Not exists; Using index |
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+

We can see that it performs a full index scan over the first table, and that the generated column expression is never evaluated. That is how we go from an inconsistent result taking three seconds, to a consistent result taking 2.31 seconds, and finally to a performant query taking 1.60 seconds.

Conclusions

This is not the first blog post that I’ve written about generated columns. I think it is a useful feature for several scenarios where you need to improve performance. In this particular case, it also presents a workaround to the expected inconsistencies of LEFT JOINs over NULL values. It is also important to mention that this improved a process in a real world scenario.

The Importance of mysqlbinlog --version


When deciding on your backup strategy, one of the key components for Point In Time Recovery (PITR) will be the binary logs. Thankfully, the mysqlbinlog command allows you to easily take binary log backups, including those that would otherwise be encrypted on disk using encrypt_binlog=ON.

When mysqlbinlog is used with --raw --read-from-remote-server --stop-never --verify-binlog-checksum, it will retrieve binary logs from whichever master it is pointed at, and store them locally on disk in the same format as they were written on the master. Here is an example with the extra arguments that would normally be used:
/usr/bin/mysqlbinlog --raw --read-from-remote-server \
 --stop-never --connection-server-id=1234 \
 --verify-binlog-checksum \
 --host=localhost --port=3306 mysql-bin.000024

This retrieves the binary logs from localhost (starting from mysql-bin.000024), reporting as server_id 1234, verifies the checksums, and then writes each log to disk.

Changes to the mysqlbinlog source code are relatively infrequent, except for when developing for a new major version, so you may be fooled into thinking that the specific version that you use is not so important—a little like the client version. This is something that is more likely to vary when you are taking remote backups.

Here is the result from the 5.7 branch of mysql-server to show the history of commits by year:

$ git blame --line-porcelain client/mysqlbinlog.cc | egrep "^(committer-time|committer-tz)" | cut -f2 -d' ' | while read ct; do read ctz; date --date "Jan 1, 1970 00:00:00 ${ctz} + ${ct} seconds" --utc +%Y; done | sort -n | uniq -c
   105 2000
    52 2001
    52 2002
   103 2003
   390 2004
   113 2005
    58 2006
   129 2007
   595 2008
    53 2009
   349 2010
   151 2011
   382 2012
   191 2013
   308 2014
   404 2015
    27 2016
    42 2017
    15 2018

Since the first GA release of 5.7 (October 2015), there haven’t been too many bugs and so if you aren’t using new features then you may think that it is OK to keep using the same version as before:

$ git log --regexp-ignore-case --grep bug --since="2015-10-19" --oneline client/mysqlbinlog.cc
1ffd7965a5e Bug#27558169 BACKPORT TO 5.7 BUG #26826272: REMOVE GCC 8 WARNINGS [noclose]
17c92835bb3 Bug #24674276 MYSQLBINLOG -R --HEXDUMP CRASHES FOR INTVAR,                   USER_VAR, OR RAND EVENTS
052dbd7b079 BUG#26878022 MYSQLBINLOG: ASSERTION `(OLD_MH->M_KEY == KEY) ||              (OLD_MH->M_KEY == 0)' FAILED
543129a577c BUG#26878022 MYSQLBINLOG: ASSERTION `(OLD_MH->M_KEY == KEY) || (OLD_MH->M_KEY == 0)' FAILED
ba1a99c5cd7 Bug#26825211 BACKPORT FIX FOR #25643811 TO 5.7
1f0f4476b28 Bug#26117735: MYSQLBINLOG READ-FROM-REMOTE-SERVER NOT HONORING REWRITE_DB FILTERING
12508f21b63 Bug #24323288: MAIN.MYSQLBINLOG_DEBUG FAILS WITH A LEAKSANITIZER ERROR
e8e5ddbb707 Bug#24609402 MYSQLBINLOG --RAW DOES NOT GET ALL LATEST EVENTS
22eec68941f Bug#23540182:MYSQLBINLOG DOES NOT FREE THE EXISTING CONNECTION BEFORE OPENING NEW REMOTE ONE
567bb732bc0 Bug#22932576 MYSQL5.6 DOES NOT BUILD ON SOLARIS12
efc42d99469 Bug#22932576 MYSQL5.6 DOES NOT BUILD ON SOLARIS12
6772eb52d66 Bug#21697461 MEMORY LEAK IN MYSQLBINLOG

However, this is not always the case and some issues are more obvious than others! To help show this, here are a couple of the issues that you might happen to notice.

Warning: option ‘stop-never-slave-server-id’: unsigned value <xxxxxxxx> adjusted to <yyyyy>

The server_id used by a server in a replication topology should always be unique within the topology. One easy way to ensure this is to convert the external IPv4 address to an integer, such as with INET_ATON, which provides you with an unsigned integer.
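
For example (using a documentation-range address purely for illustration):

mysql> SELECT INET_ATON('192.0.2.14');
+-------------------------+
| INET_ATON('192.0.2.14') |
+-------------------------+
|              3221225998 |
+-------------------------+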

The introduction of --connection-server-id (which deprecates --stop-never-slave-server-id) changes the behaviour here, for the better. Prior to this, you may have experienced warnings where your server_id was cast to the equivalent of an UNSIGNED SMALLINT. This didn’t seem to be a reported bug; it was just fixed as a by-product of the change.

ERROR: Could not find server version: Master reported unrecognized MySQL version ‘xxx’

When running mysqlbinlog, the version of MySQL is checked so that the event format is set correctly. Here is the code from MySQL 5.7:

switch (*version) {
 case '3':
   glob_description_event= new Format_description_log_event(1);
   break;
 case '4':
   glob_description_event= new Format_description_log_event(3);
   break;
 case '5':
   /*
     The server is soon going to send us its Format_description log
     event, unless it is a 5.0 server with 3.23 or 4.0 binlogs.
     So we first assume that this is 4.0 (which is enough to read the
     Format_desc event if one comes).
   */
   glob_description_event= new Format_description_log_event(3);
   break;
 default:
   glob_description_event= NULL;
   error("Could not find server version: "
         "Master reported unrecognized MySQL version '%s'.", version);
   goto err;
 }

This section of the code last changed in 2008, but of course there is another vendor that no longer uses a 5-prefixed-version number: MariaDB. With MariaDB, it is impossible to take a backup without using a MariaDB version of the program, as you are told that the version is unrecognised. The MariaDB source code contains a change to this section to resolve the issue when the version was bumped to 10:

83c02f32375b client/mysqlbinlog.cc (Michael Widenius    2012-05-31 22:39:11 +0300 1900) case 5:
83c02f32375b client/mysqlbinlog.cc (Michael Widenius    2012-05-31 22:39:11 +0300 1901) case 10:

Interestingly, MySQL 8.0 gets a little closer to not producing an error (although it still does), but finally sees off those rather old ancestral relatives:

 switch (*version) {
   case '5':
   case '8':
   case '9':
     /*
       The server is soon going to send us its Format_description log
       event.
     */
     glob_description_event = new Format_description_log_event;
     break;
   default:
     glob_description_event = NULL;
     error(
         "Could not find server version: "
         "Master reported unrecognized MySQL version '%s'.",
         version);
     goto err;
 }

These are somewhat trivial examples. In fact, you are more likely to suffer from more serious differences, perhaps ones that do not become immediately apparent, if you are not matching the mysqlbinlog version to the one provided by the version for the server producing the binary logs.

Sadly, it is not so easy to check the versions, as the reported version was seemingly left unloved for quite a while (Ver 3.4), so you should check the binary package versions (e.g. using Percona-Server-client-57-5.7.23 with Percona-Server-server-57-5.7.23). The good news is that MySQL 8.0 fixes it!
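On an RPM-based system, a quick way to confirm that the client and server packages match is the following (package names and output are illustrative):

$ rpm -qa | grep -E 'Percona-Server-(client|server)-57'
Percona-Server-client-57-5.7.23-23.1.el7.x86_64
Percona-Server-server-57-5.7.23-23.1.el7.x86_64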

So reduce the risk and match your package versions!

MongoDB-Disable Chained Replication


In this blog post, we will learn what MongoDB chained replication is, why you might choose to disable it, and the steps you need to take to do so.

What is chain replication?

Chain Replication in MongoDB, as the name suggests, means that a secondary member is able to replicate from another secondary member instead of a primary.

Default settings

By default, chained replication is enabled in MongoDB. It helps to reduce the load on the primary, but it may lead to replication lag. When enabled, a secondary node selects its sync target based on ping time, choosing the closest node.

Reasons to disable chained replication

The main reason to disable chained replication is replication lag: that is, the delay between MongoDB writing an operation on the primary and replicating that same operation to a secondary.
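A quick way to gauge the current lag from the mongo shell is rs.printSlaveReplicationInfo(), which reports, for each secondary, how many seconds it is behind the primary:

PRIMARY> rs.printSlaveReplicationInfo()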

In either case—chained replication enabled or disabled—replication works in the same way when the primary node fails: a secondary will be promoted to primary. Therefore, the application’s reading and writing of data is not affected.

Steps to disable chained replication

1) Check the current status of chained replication in the replica set configuration, under “settings”:

PRIMARY> rs.config().settings
{
"chainingAllowed" : true,
}

2) To disable chained replication, set “chainingAllowed” to false and then run rs.reconfig() to apply the change.

PRIMARY> cg = rs.config()
PRIMARY> cg.settings.chainingAllowed = false
false
PRIMARY> rs.reconfig(cg)

3) Check the status of chained replication again to confirm that it is now disabled.

PRIMARY> rs.config().settings
{
	"chainingAllowed" : false,
}

Can I override the sync source even after disabling chaining?

Yes, even after you have disabled chained replication, you can still override the sync source, though only temporarily. The override remains in effect until:

  • the mongod instance restarts;
  • the connection between the secondary and its sync source closes;
  • additionally, if chaining is enabled and the sync source falls more than 30 seconds behind another member, then the SyncSourceResolver will choose another member with more recent oplogs to sync from.

Override sync source

The “replSetSyncFrom” parameter can be used for this. For example, suppose the secondary node is syncing from host 192.168.103.100:27001 and we would like it to sync from 192.168.103.100:27003 instead.

1) Check for the current host it is syncing from:

PRIMARY> rs.status()
{
			"_id" : 1,
			"name" : "192.168.103.100:27002",
			"syncingTo" : "192.168.103.100:27001",
			"syncSourceHost" : "192.168.103.100:27001",
		},

2) Login to that mongod, and execute:

SECONDARY> db.adminCommand( { replSetSyncFrom: "192.168.103.100:27003" })

3) Check replica set status again

SECONDARY> rs.status()
{
			"_id" : 1,
			"name" : "192.168.103.100:27002",
			"syncingTo" : "192.168.103.100:27003",
			"syncSourceHost" : "192.168.103.100:27003",
		},

This is how we can override the sync source for testing or maintenance, or when the replica is not syncing from the required host.

I hope this blog helps you to understand how to disable chained replication or override the sync source for a specific purpose. The preferred setting for the chainingAllowed parameter is true, as it reduces the load on the primary node, and it is also the default.

PostgreSQL Extensions for an Enterprise-Grade System


In this series of blog posts we have been discussing various relevant aspects of building an enterprise-grade PostgreSQL setup, such as security, backup strategy, high availability, and different methods to scale PostgreSQL. In this blog post, we’ll review some of the most popular open source extensions for PostgreSQL, used to expand its capabilities and address specific needs. We’ll cover some of them during a demo in our upcoming webinar on October 10.

Expanding with PostgreSQL Extensions

PostgreSQL is one of the world’s most feature-rich and advanced open source RDBMSs. Its features are not limited to those released by the community through major/minor releases. There are hundreds of additional features developed using the extension capabilities in PostgreSQL, which can cater to the needs of specific users. Some of these extensions are very popular and useful for building an enterprise-grade PostgreSQL environment. We previously blogged about a couple of FDW extensions (mysql_fdw and postgres_fdw) which allow PostgreSQL databases to talk to remote homogeneous or heterogeneous databases such as PostgreSQL, MySQL, and MongoDB. We will now cover a few other additional extensions that can expand your PostgreSQL server capabilities.

pg_stat_statements

The pg_stat_statements module provides a means for tracking execution statistics of all SQL statements executed by a server. The statistics gathered by the module are made available via a view named pg_stat_statements. This extension must be installed in each of the databases you want to track, and like many of the extensions in this list, it is available in the contrib package from the PostgreSQL PGDG repository.
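As a minimal sketch of enabling and querying it (the module must be preloaded, which requires a restart; the total_time column is as named in PostgreSQL 10, and the query shown is illustrative):

shared_preload_libraries = 'pg_stat_statements'    # in postgresql.conf, restart required

postgres=# CREATE EXTENSION pg_stat_statements;
CREATE EXTENSION
postgres=# SELECT query, calls, total_time, rows
           FROM pg_stat_statements
           ORDER BY total_time DESC LIMIT 5;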

pg_repack

Tables in PostgreSQL may end up with fragmentation and bloat due to the specific MVCC implementation in PostgreSQL, or simply due to a high number of rows being naturally removed. This could lead to not only unused space being held inside the table but also to sub-optimal execution of SQL statements. pg_repack is the most popular way to address this problem by reorganizing and repacking the table. It can reorganize the table’s content without placing an exclusive lock on it during the process. DMLs and queries can continue while repacking is happening. Version 1.2 of pg_repack introduces new features such as parallel index builds and the ability to rebuild just the indexes. Please refer to the official documentation for more details.
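For example, to repack a single table online with parallel index builds (the database and table names here are hypothetical):

$ pg_repack --table=public.orders --jobs=2 mydb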

pgaudit

PostgreSQL has a basic statement logging feature. It can be implemented using the standard logging facility with log_statement = all. But this is not sufficient for many audit requirements. One of the essential features for enterprise deployments is the capability for fine-grained auditing of the user interactions/statements issued to the database. This is a major compliance requirement for many security standards. The pgaudit extension caters to these requirements.

The PostgreSQL Audit Extension (pgaudit) provides detailed session and/or object audit logging via the standard PostgreSQL logging facility. Please refer to the settings section of its official documentation for more details.
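A minimal configuration sketch for session audit logging might look like this (pgaudit must be preloaded, which requires a restart; the classes chosen are just an example):

shared_preload_libraries = 'pgaudit'   # in postgresql.conf, restart required
pgaudit.log = 'write, ddl'             # audit data-modifying statements and DDL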

pldebugger

This is a must-have extension for developers who work on stored functions written in PL/pgSQL. This extension is well integrated with GUI tools like pgAdmin, which allows developers to step through their code and debug it. Packages for pldebugger are also available in the PGDG repository and installation is straightforward. Once it is set up, we can step through and debug the code remotely.

The official git repo is available here

plprofiler

This is a wonderful extension for finding out where the code is slowing down. This is very helpful, particularly during complex migrations from proprietary databases, like from Oracle to PostgreSQL, which affect application performance. This extension can prepare a report on the overall execution time and tables representation, including flamegraphs, with clear information about each line of code. This extension is not, however, available from the PGDG repo: you will need to build it from source. Details on building and installing plprofiler will be covered in a future blog post. Meanwhile, the official repository and documentation is available here

PostGIS

PostGIS is arguably the most versatile implementation of the specifications of the Open Geospatial Consortium. We can see a large list of features in PostGIS that are rarely available in any other RDBMSs.

There are many users who have primarily opted to use PostgreSQL because of the features supported by PostGIS. In fact, these features are not implemented as a single extension, but are instead delivered by a collection of extensions. This makes PostGIS one of the most complex extensions to build from source. Luckily, everything is available from the PGDG repository:

$ sudo yum install postgis24_10.x86_64

Once the postgis package is installed, we are able to create the extensions on our target database:

postgres=# CREATE EXTENSION postgis;
CREATE EXTENSION
postgres=# CREATE EXTENSION postgis_topology;
CREATE EXTENSION
postgres=# CREATE EXTENSION postgis_sfcgal;
CREATE EXTENSION
postgres=# CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION
postgres=# CREATE EXTENSION postgis_tiger_geocoder;
CREATE EXTENSION
postgres=# CREATE EXTENSION address_standardizer;
CREATE EXTENSION

Language Extensions: PL/Python, PL/Perl, PL/V8, PL/R, etc.

Another powerful feature of PostgreSQL is its programming languages support. You can code database functions/procedures in pretty much every popular language.

Thanks to the enormous number of libraries available, which include machine learning ones, and its vibrant community, Python has claimed the third spot amongst the most popular languages of choice according to the TIOBE Programming index. Your team’s skills and libraries remain valid for PostgreSQL server coding too! Teams that regularly code in JavaScript for Node.js or Angular can easily write PostgreSQL server code in PL/V8. All of the packages required are readily available from the PGDG repository.

cstore_fdw

cstore_fdw is an open source columnar store extension for PostgreSQL. Columnar stores provide notable benefits for analytics use cases where data is loaded in batches. Cstore_fdw’s columnar nature delivers performance by only reading relevant data from disk. It may compress data by 6 to 10 times to reduce space requirements for data archival. The official repository and documentation is available here
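A short usage sketch, following the foreign-table pattern from its documentation (the table definition below is hypothetical):

postgres=# CREATE EXTENSION cstore_fdw;
postgres=# CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
postgres=# CREATE FOREIGN TABLE events (
               event_time  timestamp,
               customer_id int,
               amount      numeric
           ) SERVER cstore_server OPTIONS (compression 'pglz');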

HypoPG

HypoPG is an extension for adding support for hypothetical indexes – that is, without actually adding the index. This helps us to answer questions such as “how will the execution plan change if there is an index on column X?”. Installation and setup instructions are part of its official documentation
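A brief sketch of the workflow (table and column names are hypothetical): create a hypothetical index, then check whether EXPLAIN—without ANALYZE—picks it up:

postgres=# CREATE EXTENSION hypopg;
postgres=# SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');
postgres=# EXPLAIN SELECT * FROM orders WHERE customer_id = 42;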

mongo_fdw

Mongo_fdw presents collections from MongoDB as tables in PostgreSQL. This is a case where the NoSQL world meets the SQL world and features combine. We will be covering this extension in a future blog post. The official repository is available here

tds_fdw

Another important FDW (foreign data wrapper) extension in the PostgreSQL world is tds_fdw. Both Microsoft SQL Server and Sybase use the TDS (Tabular Data Stream) format. This FDW allows PostgreSQL to use tables stored in a remote SQL Server or Sybase database as local tables. It makes use of the FreeTDS libraries.

orafce

As previously mentioned, there are a lot of migrations underway from Oracle to PostgreSQL. Incompatible functions in PostgreSQL are often painful for those who are migrating server code. The “orafce” project implements some of the functions from the Oracle database. The functionality was verified on Oracle 10g, and the module is useful for production work. Please refer to the list in its official documentation about the Oracle functions implemented in PostgreSQL

TimescaleDB

In this new world of IoT and connected devices, there is a growing need to handle time-series data. Timescale can convert PostgreSQL into a scalable time-series data store. The official site is available here with all relevant links.
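A minimal sketch of turning a regular table into a hypertable (timescaledb must be preloaded, and the table shown is hypothetical):

postgres=# CREATE EXTENSION timescaledb;
postgres=# CREATE TABLE conditions (
               time        timestamptz NOT NULL,
               device_id   int,
               temperature double precision
           );
postgres=# SELECT create_hypertable('conditions', 'time');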

pg_bulkload

Is loading a large volume of data into a database efficiently and quickly a challenge for you? If so, pg_bulkload may help you solve that problem. Official documentation is available here

pg_partman

PostgreSQL 10 introduced declarative partitions. But creating new partitions and maintaining existing ones, including purging unwanted partitions, requires a good dose of manual effort. If you are looking to automate part of this maintenance you should have a look at what pg_partman offers. The repository with documentation is available here.

wal2json

PostgreSQL has logical replication features built in. Extra information is recorded in the WAL, which facilitates logical decoding. wal2json is a popular output plugin for logical decoding. This can be utilized for different purposes, including change data capture. In addition to wal2json, there are other output plugins: a concise list is available in the PostgreSQL wiki.
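A quick sketch of streaming changes as JSON with pg_recvlogical (this assumes wal_level = logical; the database and slot names are hypothetical):

$ pg_recvlogical -d mydb --slot=test_slot --create-slot --plugin=wal2json
$ pg_recvlogical -d mydb --slot=test_slot --start -f -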

There are many more extensions that help us build an enterprise-grade PostgreSQL set up using open source solutions. Please feel free to comment and ask us if we know about one that satisfies your particular needs. Or, if there’s still time, sign up for our October webinar and ask us in person!

Detailed Logging for Enterprise-Grade PostgreSQL


In this penultimate post from our series on building an enterprise-grade PostgreSQL environment we cover the parameters we have enabled to configure detailed logging in the demo setup we will showcase in our upcoming webinar.

Detailed logging in PostgreSQL and log analyzer

Like other RDBMSs, PostgreSQL allows you to maintain a log of activities and error messages. Until PostgreSQL 9.6, PostgreSQL log files were generated in the pg_log directory (inside the data directory) by default. Since PostgreSQL 10, pg_log has been renamed to simply log. However, this directory can be changed to a different location by modifying the parameter log_directory.

Unlike MySQL, PostgreSQL writes the error and activity log to the same log file, so it may grow to several GBs when detailed logging is enabled. In these cases logging becomes IO-intensive, so it is recommended to store log files on separate storage from the one hosting the data directory.

Parameters to enable detailed logging

Here’s a list of parameters used to customize logging in PostgreSQL. All of them need to be modified in the postgresql.conf or postgresql.auto.conf files.

logging_collector: in order to log any activity in PostgreSQL this parameter must be enabled. The backend process responsible for logging database activity is called the logger; it gets started when logging_collector is set to ON. Changing this parameter requires a PostgreSQL restart.

log_min_duration_statement: this parameter is used primarily to set a time threshold: queries running longer than it will be logged (as “slow queries”). Setting it to -1 disables logging of statements. Setting it to 0 enables the logging of every statement running in the database, regardless of its duration. The time unit should follow the actual value, for example: 250ms, 250s, 250min, 1h. Changing this parameter does not require a PostgreSQL restart – a simple reload of the configuration is enough. For example:

log_min_duration_statement = 5s

The above logs every statement running for 5 seconds or longer.

log_line_prefix: helps you customize every log line being printed in the PostgreSQL log file. You can log the process id, application name, database name and other details for every statement as required. The following log_line_prefix may be helpful in most scenarios:

log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h'

The above setting records the following for every statement being logged:

%t : Time stamp without milliseconds
%p : Process id
%l-1 : Number of the log line for each session or process, starting at 1
%u : User name
%d : Database name
%a : Application name
%h : Remote host name or IP address

With the above settings employed for log_line_prefix, the log appears as follows:

2018-06-08 12:36:26 EDT [21315]: [1-1] user=postgres,db=percona,app=psql,client=192.168.0.12 LOG: duration: 2.611 ms statement: select * from departments where id = 26;
2018-06-08 12:36:49 EDT [21323]: [1-1] user=postgres,db=percona,app=psql,client=192.168.0.12 LOG: duration: 2.251 ms statement: select count(*) from departments;

For more on log_line_prefix, you can refer to the PostgreSQL documentation.

log_duration: enabling this parameter records the duration of every completed statement in the PostgreSQL log, irrespective of log_min_duration_statement. Keep in mind that, as with log_min_duration_statement, enabling log_duration may increase log file usage and adversely affect the server’s general performance. For this reason, if you already have log_min_duration_statement enabled, it is often suggested to disable log_duration unless there’s a specific need to keep track of both.

log_lock_waits: when log_lock_waits is enabled a log message is recorded when a session waits longer than deadlock_timeout to acquire a lock.

log_checkpoints: logs all checkpoints and restart points to the PostgreSQL log file.

log_rotation_size: defines the size limit for each log file; once it reaches this threshold the log file is rotated.
Example:

log_rotation_size = '500MB'

Every log file is limited to a size of 500 MB.

log_rotation_age: determines the maximum life span of a log file, forcing its rotation once this threshold is reached. This parameter is usually set in terms of hours or days, though the minimum granularity is a minute. However, if log_rotation_size is reached first, the log is rotated anyway, irrespective of this setting.
Example: 

log_rotation_age = 1d

log_statement: controls what types of SQL statements are logged. The recommended setting is DDL, which logs all DDLs that are executed. Tracking those allows you to later audit when a given DDL was executed, and by whom. By monitoring and understanding the amount of information it may write to the log file you may consider modifying this setting. Other possible values are none, mod (includes DDLs plus DMLs) and all.

log_temp_files: logs information about any temporary file whose size is greater than this value (in KB).

log_directory: defines the directory in which log files are created. Once more, please note that if you have enabled detailed logging it is recommended to have a separate disk—different from the data directory disk—allocated for log_directory.
Example: 

log_directory = /this_is_a_new_disk/pg_log
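Putting several of these parameters together, a sample postgresql.conf fragment could look like the following (the thresholds are illustrative, not recommendations):

logging_collector = on
log_min_duration_statement = 5s
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h'
log_lock_waits = on
log_checkpoints = on
log_statement = 'ddl'
log_temp_files = 10240          # log temporary files larger than 10 MB
log_rotation_size = '500MB'
log_rotation_age = 1d
log_directory = '/this_is_a_new_disk/pg_log'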

Log Analyzer – pgBadger

You cannot have a separate error log and a slow query log. Everything is written to one log file, which may be periodically rotated based on time and size. Over a period of time this log file may grow to several MBs, or even GBs, depending on the amount of logging that has been enabled. It can get difficult for a DBA/developer to parse the log files and get a clear view of what is running slowly and how many times a query has run. To help with this task you may use pgBadger, a log analyzer for PostgreSQL, to parse log files and generate a rich HTML-based report that you can access from a browser.
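Generating a report is then typically a one-liner (the log path is illustrative; the value passed to --prefix should match your log_line_prefix):

$ pgbadger --prefix '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h' \
    /this_is_a_new_disk/pg_log/postgresql-*.log -o report.html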

We’ll be showing detailed logging and pgBadger in action in a couple of days so, if there’s still time, sign up for our October webinar !

Persistence of autoinc fixed in MySQL 8.0


The release of MySQL 8.0 has brought a lot of bold implementations that touched on things that have been avoided before, such as added support for common table expressions and window functions. Another example is the change in how AUTO_INCREMENT (autoinc) sequences are persisted, and thus replicated.

This new implementation carries the fix for bug #73563 (Replace result in auto_increment value less or equal than max value in row-based), which we’ve only found out about recently. The surprising part is that the use case we were analyzing is a somewhat common one; this must be affecting a good number of people out there.

Understanding the bug

The business logic of the use case is such that the UNIQUE column found in a table whose id is managed by an AUTO_INCREMENT sequence needs to be updated, and this is done with a REPLACE operation:

“REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted.”

So, what happens in practice in this particular case is a DELETE followed by an INSERT of the target row.

We will explore this scenario here in the context of an oversimplified currency converter application that uses USD as base reference:

CREATE TABLE exchange_rate (
id INT PRIMARY KEY AUTO_INCREMENT,
currency VARCHAR(3) UNIQUE,
rate FLOAT(5,3)
) ENGINE=InnoDB;

Let’s add a trio of rows to this new table:

INSERT INTO exchange_rate (currency,rate) VALUES ('EUR',0.854), ('GBP',0.767), ('BRL',4.107);

which gives us the following initial set:

master (test) > select * from exchange_rate;
+----+----------+-------+
| id | currency | rate  |
+----+----------+-------+
|  1 | EUR      | 0.854 |
|  2 | GBP      | 0.767 |
|  3 | BRL      | 4.107 |
+----+----------+-------+
3 rows in set (0.00 sec)

Now we update the rate for Brazilian Reais using a REPLACE operation:

REPLACE INTO exchange_rate SET currency='BRL', rate=4.500;

With currency being a UNIQUE field the row is fully replaced:

master (test) > select * from exchange_rate;
+----+----------+-------+
| id | currency | rate  |
+----+----------+-------+
|  1 | EUR      | 0.854 |
|  2 | GBP      | 0.767 |
|  4 | BRL      | 4.500 |
+----+----------+-------+
3 rows in set (0.00 sec)

and thus the autoinc sequence is updated:

master (test) > SHOW CREATE TABLE exchange_rate\G
*************************** 1. row ***************************
     Table: exchange_rate
Create Table: CREATE TABLE `exchange_rate` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`currency` varchar(3) DEFAULT NULL,
`rate` float(5,3) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `currency` (`currency`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

The problem is that the autoinc sequence is not updated in the replica as well:

slave1 (test) > select * from exchange_rate;show create table exchange_rate\G
+----+----------+-------+
| id | currency | rate  |
+----+----------+-------+
|  1 | EUR      | 0.854 |
|  2 | GBP      | 0.767 |
|  4 | BRL      | 4.500 |
+----+----------+-------+
3 rows in set (0.00 sec)
*************************** 1. row ***************************
     Table: exchange_rate
Create Table: CREATE TABLE `exchange_rate` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`currency` varchar(3) DEFAULT NULL,
`rate` float(5,3) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `currency` (`currency`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Now, the moment we promote that replica as master and start writing to this table we’ll hit a duplicate key error:

slave1 (test) > REPLACE INTO exchange_rate SET currency='BRL', rate=4.600;
ERROR 1062 (23000): Duplicate entry '4' for key 'PRIMARY'

Note that:

a) the transaction fails and the row is not replaced, however the autoinc sequence is incremented:

slave1 (test) > SELECT AUTO_INCREMENT FROM information_schema.TABLES WHERE table_schema='test' AND table_name='exchange_rate';
+----------------+
| AUTO_INCREMENT |
+----------------+
|              5 |
+----------------+
1 row in set (0.00 sec)

b) this problem only happens with row-based replication (binlog_format=ROW), where REPLACE in this case is logged as a row UPDATE:

# at 6129
#180829 18:29:55 server id 100  end_log_pos 5978 CRC32 0x88da50ba Update_rows: table id 117 flags: STMT_END_F
### UPDATE `test`.`exchange_rate`
### WHERE
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=1 is_null=0 */
###   @3=4.107                /* FLOAT meta=4 nullable=1 is_null=0 */
### SET
###   @1=4 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=1 is_null=0 */
###   @3=4.5                  /* FLOAT meta=4 nullable=1 is_null=0 */

With statement-based replication—or even mixed format—the REPLACE statement is replicated as is: it will trigger a DELETE+INSERT in the background on the replica and thus update the autoinc sequence in the same way it did on the master.

This example (tested with Percona Server versions 5.5.61, 5.6.36 and 5.7.22) helps illustrate the issue with autoinc sequences not being persisted as they should be with row-based replication. However, MySQL’s Worklog #6204 includes a couple of scarier scenarios involving the master itself, such as when the server crashes while a transaction is writing to a table similar to the one used in the example above. MySQL 8.0 remedies this bug.

Workarounds

There are a few possible workarounds to consider if this problem is impacting you and if neither upgrading to the 8 series nor resorting to statement-based or mixed replication format are viable options.

We’ll be discussing three of them here: one that revolves around running checks before a failover (to detect and fix autoinc inconsistencies in replicas); another that requires reviewing all REPLACE statements like the one from our example and adapting them to include the id field, thus avoiding the bug; and finally one that requires changing the schema of affected tables in such a way that the target field is made the Primary Key of the table while id (autoinc) is converted into a UNIQUE key.

a) Detect and fix

The less intrusive of the workarounds we conceived for the problem at hand in terms of query and schema changes is to run a check for each of the tables that might be facing this issue in a replica before we promote it as master in a failover scenario:

slave1 (test) > SELECT ((SELECT MAX(id) FROM exchange_rate)>=(SELECT AUTO_INCREMENT FROM information_schema.TABLES WHERE table_schema='test' AND table_name='exchange_rate')) as `check`;
+-------+
| check |
+-------+
|     1 |
+-------+
1 row in set (0.00 sec)

If the table does not pass the test, like ours didn’t at first (just before we attempted a REPLACE after we failed over to the replica), then update autoinc accordingly. The full routine (check + update of autoinc) could be made into a single stored procedure:

DELIMITER //
CREATE PROCEDURE CheckAndFixAutoinc()
BEGIN
 DECLARE done TINYINT UNSIGNED DEFAULT 0;
 DECLARE tableschema VARCHAR(64);
 DECLARE tablename VARCHAR(64);
 DECLARE columnname VARCHAR(64);  
 DECLARE cursor1 CURSOR FOR SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME FROM information_schema.COLUMNS WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys') AND EXTRA LIKE '%auto_increment%';
 DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;
 OPEN cursor1;  
 start_loop: LOOP
  IF done THEN
    LEAVE start_loop;
  END IF;
  FETCH cursor1 INTO tableschema, tablename, columnname;
  SET @get_autoinc = CONCAT('SELECT @check1 := ((SELECT MAX(', columnname, ') FROM ', tableschema, '.', tablename, ')>=(SELECT AUTO_INCREMENT FROM information_schema.TABLES WHERE TABLE_SCHEMA=\'', tableschema, '\' AND TABLE_NAME=\'', tablename, '\')) as `check`');
  PREPARE stm FROM @get_autoinc;
  EXECUTE stm;
  DEALLOCATE PREPARE stm;
  IF @check1>0 THEN
    BEGIN
      SET @select_max_id = CONCAT('SELECT @max_id := MAX(', columnname, ')+1 FROM ', tableschema, '.', tablename);
      PREPARE select_max_id FROM @select_max_id;
      EXECUTE select_max_id;
      DEALLOCATE PREPARE select_max_id;
      SET @update_autoinc = CONCAT('ALTER TABLE ', tableschema, '.', tablename, ' AUTO_INCREMENT=', @max_id);
      PREPARE update_autoinc FROM @update_autoinc;
      EXECUTE update_autoinc;
      DEALLOCATE PREPARE update_autoinc;
    END;
  END IF;
 END LOOP start_loop;  
 CLOSE cursor1;
END//
DELIMITER ;

It doesn’t allow for as clean a failover as we would like but it can be helpful if you’re stuck with MySQL<8.0 and binlog_format=ROW and cannot make changes to your queries or schema.

b) Include Primary Key in REPLACE statements

If we had explicitly included the id (Primary Key) in the REPLACE operation from our example it would have also been replicated as a DELETE+INSERT even when binlog_format=ROW:

master (test) > REPLACE INTO exchange_rate SET currency='BRL', rate=4.500, id=3;
# at 16151
#180905 13:32:17 server id 100  end_log_pos 15986 CRC32 0x1d819ae9  Write_rows: table id 117 flags: STMT_END_F
### DELETE FROM `test`.`exchange_rate`
### WHERE
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=1 is_null=0 */
###   @3=4.107                /* FLOAT meta=4 nullable=1 is_null=0 */
### INSERT INTO `test`.`exchange_rate`
### SET
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=1 is_null=0 */
###   @3=4.5                  /* FLOAT meta=4 nullable=1 is_null=0 */
# at 16199
#180905 13:32:17 server id 100  end_log_pos 16017 CRC32 0xf11fed56  Xid = 184
COMMIT/*!*/;

We could point out that we are doing it wrong by not having the id included in the REPLACE statement in the first place; the reason for not doing so would be mostly related to avoiding an extra lookup for each replace (to obtain the id for the currency we want to update). On the other hand, what if your business logic does expect the id to change at each REPLACE? You should keep such a requirement in mind when considering this workaround, as it is effectively a functional change from what we had initially.

c) Make the target field the Primary Key and keep autoinc as a UNIQUE key

If we make currency the Primary Key of our table and id a UNIQUE key instead:

CREATE TABLE exchange_rate (
id INT UNIQUE AUTO_INCREMENT,
currency VARCHAR(3) PRIMARY KEY,
rate FLOAT(5,3)
) ENGINE=InnoDB;

the same REPLACE operation will be replicated as a DELETE+INSERT too:

# at 19390
#180905 14:03:56 server id 100  end_log_pos 19225 CRC32 0x7042dcd5  Write_rows: table id 131 flags: STMT_END_F
### DELETE FROM `test`.`exchange_rate`
### WHERE
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=0 is_null=0 */
###   @3=4.107                /* FLOAT meta=4 nullable=1 is_null=0 */
### INSERT INTO `test`.`exchange_rate`
### SET
###   @1=4 /* INT meta=0 nullable=0 is_null=0 */
###   @2='BRL' /* VARSTRING(3) meta=3 nullable=0 is_null=0 */
###   @3=4.5                  /* FLOAT meta=4 nullable=1 is_null=0 */
# at 19438
#180905 14:03:56 server id 100  end_log_pos 19256 CRC32 0x79efc619  Xid = 218
COMMIT/*!*/;

Of course, the same would be true if we had just removed id entirely from the table and kept currency as the Primary Key. This would work in our particular test example but that won’t always be the case. Please note though that if you do keep id on the table you must make it a UNIQUE key: this workaround is based on the fact that this key becomes a second unique constraint, which triggers a different code path to log a replace operation. Had we made it a simple, non-unique key instead that wouldn’t be the case.

If you have any comments or suggestions about the issue addressed in this post, the workarounds we propose, or even a different view of the problem you would like to share please leave a comment in the section below.

Co-Author: Trey Raymond

Trey Raymond is a Sr. Database Engineer for Oath Inc. (née Yahoo!), specializing in MySQL. Since 2010, he has worked to build the company’s database platform and supporting team into industry leaders.

While a performance guru at heart, his experience and responsibilities range from hardware and capacity planning all through the stack to database tool and utility development.

He has a reputation for breaking things to learn something new.

Co-Author: Fernando Laudares

Fernando is a Senior Support Engineer with Percona. Fernando’s work experience includes the architecture, deployment and maintenance of IT infrastructures based on Linux, open source software and a layer of server virtualization. He’s now focusing on the universe of MySQL, MongoDB and PostgreSQL with a particular interest in understanding the intricacies of database systems, and contributes regularly to this blog. You can read his other articles here.


MongoDB Replica set Scenarios and Internals


The MongoDB® replica set is a group of nodes with one set as the primary node, and all other nodes set as secondary nodes. Only the primary node accepts “write” operations, while other nodes can only serve “read” operations according to the read preferences defined. In this blog post, we’ll focus on some MongoDB replica set scenarios, and take a look at the internals.

Example configuration

We will refer to a three node replica set that includes one primary node and two secondary nodes running as:

"members" : [
{
"_id" : 0,
"name" : "192.168.103.100:25001",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 3533,
"optime" : {
"ts" : Timestamp(1537800584, 1),
"t" : NumberLong(1)
},
"optimeDate" : ISODate("2018-09-24T14:49:44Z"),
"electionTime" : Timestamp(1537797392, 2),
"electionDate" : ISODate("2018-09-24T13:56:32Z"),
"configVersion" : 3,
"self" : true
},
{
"_id" : 1,
"name" : "192.168.103.100:25002",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 3063,
"optime" : {
"ts" : Timestamp(1537800584, 1),
"t" : NumberLong(1)
},
"optimeDurable" : {
"ts" : Timestamp(1537800584, 1),
"t" : NumberLong(1)
},
"optimeDate" : ISODate("2018-09-24T14:49:44Z"),
"optimeDurableDate" : ISODate("2018-09-24T14:49:44Z"),
"lastHeartbeat" : ISODate("2018-09-24T14:49:45.539Z"),
"lastHeartbeatRecv" : ISODate("2018-09-24T14:49:44.664Z"),
"pingMs" : NumberLong(0),
"syncingTo" : "192.168.103.100:25001",
"configVersion" : 3
},
{
"_id" : 2,
"name" : "192.168.103.100:25003",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 2979,
"optime" : {
"ts" : Timestamp(1537800584, 1),
"t" : NumberLong(1)
},
"optimeDurable" : {
"ts" : Timestamp(1537800584, 1),
"t" : NumberLong(1)
},
"optimeDate" : ISODate("2018-09-24T14:49:44Z"),
"optimeDurableDate" : ISODate("2018-09-24T14:49:44Z"),
"lastHeartbeat" : ISODate("2018-09-24T14:49:45.539Z"),
"lastHeartbeatRecv" : ISODate("2018-09-24T14:49:44.989Z"),
"pingMs" : NumberLong(0),
"syncingTo" : "192.168.103.100:25002",
"configVersion" : 3
}

Here, the primary is running on port 25001, and the two secondaries are running on ports 25002 and 25003 on the same host.

Can secondary nodes only sync from the primary?

No, it’s not mandatory. Each secondary can replicate data from the primary or from any other secondary. This is also known as chaining, and by default it is enabled.

In the above replica set, you can see that secondary node "_id" : 2 is syncing from another secondary node "_id" : 1, as shown by "syncingTo" : "192.168.103.100:25002".

This can also be found in the logs, where the parameter chainingAllowed: true appears as the default setting:

settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('5ba8ed10d4fddccfedeb7492') } }

Chaining?

That means that a secondary member node is able to replicate from another secondary member node instead of from the primary node. This helps to reduce the load on the primary. If the replication lag is not tolerable, then chaining could be disabled.

For more details about chaining and the steps to disable it please refer to my earlier blog post here.

Ok, then how does the secondary node select the source to sync from?

If Chaining is False

When chaining is explicitly set to false, the secondary node will sync from the primary node only, though this can be overridden temporarily.

If Chaining is True

  • Before choosing any sync node, TopologyCoordinator performs validations like:
    • Whether chaining is set to true or false.
    • If that particular node is part of the current replica set configurations.
    • Identify the node that is ahead in the oplog and has the lowest ping time.
    • The source code that includes validation is here.
  • Once the validation is done, SyncSourceSelector relies on SyncSourceResolver which contains the result and details for the new sync source
  • To get the details and response, SyncSourceResolver coordinates with ReplicationCoordinator
  • This ReplicationCoordinator is responsible for the replication, and co-ordinates with TopologyCoordinator
  • The TopologyCoordinator is responsible for topology of the cluster. It finds the primary oplog time and checks for the maxSyncSourceLagSecs
  • It will reject a sync source if it is more than maxSyncSourceLagSecs behind the newest oplog entry. The code for this can be found here
  • If the criteria for source selection are not fulfilled, then the BackgroundSync thread waits and restarts the whole process again to get a sync source.

Example: unable to find a member to sync from, then finding a candidate on the next attempt

This can be seen in the log. On receiving the message could not find member to sync from from the rsBackgroundSync thread, the whole internal process restarts and finds a member to sync from, i.e. sync source candidate: 192.168.103.100:25001, which means it is now syncing from node 192.168.103.100 running on port 25001.
2018-09-24T13:58:43.197+0000 I REPL     [rsSync] transition to RECOVERING
2018-09-24T13:58:43.198+0000 I REPL     [rsBackgroundSync] could not find member to sync from
2018-09-24T13:58:43.201+0000 I REPL     [rsSync] transition to SECONDARY
2018-09-24T13:58:59.208+0000 I REPL     [rsBackgroundSync] sync source candidate: 192.168.103.100:25001

  • Once the sync source node is selected, SyncSourceResolver probes the sync source to confirm that it is able to fetch the oplogs.
  • The RollbackID (rbid) is also fetched after the first batch is returned by the OplogFetcher.
  • If all eligible sync sources are too fresh, such as during initial sync, then the syncSourceStatus Oplog start is missing and earliestOpTimeSeen will set a new minValid.
  • This minValid is also set in the case of rollback and abrupt shutdown.
  • If the node has a minValid entry then this is checked for the eligible sync source node.

Example showing the selection of a new sync source when the existing source is found to be invalid

Here, as the logs show, during sync the node chooses a new sync source. This is because it found that the original sync source is not ahead, and so does not contain recent oplogs from which to sync.

2018-09-25T15:20:55.424+0000 I REPL     [replication-1] Choosing new sync source because our current sync source, 192.168.103.100:25001, has an OpTime ({ ts: Timestamp 1537879296000|1, t: 4 }) which is not ahead of ours ({ ts: Timestamp 1537879296000|1, t: 4 }), it does not have a sync source, and it's not the primary (sync source does not know the primary)

2018-09-25T15:20:55.425+0000 W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source 192.168.103.100:25001 (config version: 3; last applied optime: { ts: Timestamp 1537879296000|1, t: 4 }; sync source index: -1; primary index: -1) is no longer valid

  • If the secondary node is too far behind the eligible sync source node, then the node will enter maintenance mode, and a resync needs to be called manually.
  • Once the sync source is chosen, BackgroundSync starts oplogFetcher.

Example for oplogFetcher

Here is an example of fetching oplog from the “oplog.rs” collection, and checking for the greater than required timestamp.

2018-09-26T10:35:07.372+0000 I COMMAND  [conn113] command local.oplog.rs command: getMore { getMore: 20830044306, collection: "oplog.rs", maxTimeMS: 5000, term: 7, lastKnownCommittedOpTime: { ts: Timestamp 1537955038000|1, t: 7 } } originatingCommand: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1537903865000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 7, readConcern: { afterOpTime: { ts: Timestamp 1537903865000|1, t: 6 } } } planSummary: COLLSCAN cursorid:20830044306 keysExamined:0 docsExamined:0 numYields:1 nreturned:0 reslen:451 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_command 3063398ms

When, and with what details, do replica set nodes communicate with each other?

At regular intervals, all the nodes communicate with each other to check the status of the primary node, check the status of the sync source, get the oplogs, and so on.

ReplicationCoordinator has a ReplicaSetConfig that has a list of all the replica set nodes, and each node has a copy of it. This makes nodes aware of the other nodes under the same replica set.

This is how nodes communicate in more detail:

Heartbeats: This checks the status of other nodes, i.e. alive or dead

heartbeatInterval: Every node, at an interval of two seconds, sends the other nodes a heartbeat to make them aware that “yes I am alive!”

heartbeatTimeoutSecs: This is a timeout which means that if a heartbeat is not returned within 10 seconds, the node is marked as inaccessible or simply dead.

Every heartbeat is identified by these replica set details:

  • replica set config version
  • replica set name
  • Sender host address
  • id from the replicasetconfig

The source code could be referred to from here.

When the remote node receives the heartbeat, it processes this data and validates whether the details are correct. It then prepares a ReplSetHeartbeatResponse that includes:

  • Name of the replica set, config version, and optime details
  • Details about primary node as per the receiving node.
  • Sync source details and state of receiving node

This heartbeat data is processed, and if primary details are found then the election gets postponed.

TopologyCoordinator checks the heartbeat data and confirms whether the node is OK or not. If the node is OK then no action is taken. Otherwise it needs to be reconfigured, or a priority takeover is initiated based on the config.

Response from oplog fetcher

To get the oplogs from the sync source, nodes communicate with each other. The oplog fetcher fetches oplogs through “find” and “getMore”. This only affects the downstream node, which gets metadata from its sync source to update its view of the replica set.

OplogQueryMetadata only comes with OplogFetcher responses

OplogQueryMetadata comes only with OplogFetcher responses, while ReplSetMetadata comes with all the replica set details, including the config version, and accompanies replication commands as well.

Communicating position updates:

This is to report replication progress. ReplicationCoordinatorExternalState creates SyncSourceFeedback, which sends replSetUpdatePosition commands.

These include oplog details, the replica set config version, and replica set metadata.

If a new node is added to the existing replica set, how will that node get the data?

If a new node is added to the existing replica set then the “initial sync” process takes place. This initial sync can be done in two ways:

  1. Just add the new node to the replicaset and let initial sync threads restore the data. Then it syncs from the oplogs until it reaches the secondary state.
  2. Copy the data from the recent data directory to the node, and restart this new node. Then it will also sync from the oplogs until it reaches the secondary state.

This is how it works internally

When “initial sync” or “resync” is called by the ReplicationCoordinator, the node goes into the “STARTUP2” state, and the initial sync is done by the DataReplicator

  • A sync source is selected to get the data from, then it drops all the databases except the local database, and oplogs are recreated.
  • DatabasesCloner asks syncsource for a list of the databases, and for each database it creates DatabaseCloner.
  • For each DatabaseCloner it creates CollectionCloner to clone the collections
  • This CollectionCloner calls ListIndexes on the syncsource and creates a CollectionBulkLoader for parallel index creation while data cloning
  • The node also checks the sync source rollback id. If a rollback occurred, it restarts the initial sync. Otherwise, the DataReplicator is done with its work, and the ReplicationCoordinator assumes responsibility for ongoing replication.

Example of the “initial sync”:

Here the node enters “STARTUP2” (“transition to STARTUP2”). Then a sync source gets selected, and the node drops all the databases except the local database. Next, the replication oplog is created and CollectionCloner is called.

Why is the local database not dropped? Every node has its own “local” database, holding information about itself and about the other nodes from its own point of view, so this database is not replicated to other nodes.

2018-09-26T17:57:09.571+0000 I REPL     [ReplicationExecutor] transition to STARTUP2
2018-09-26T17:57:14.589+0000 I REPL     [replication-1] sync source candidate: 192.168.103.100:25003
2018-09-26T17:57:14.590+0000 I STORAGE  [replication-1] dropAllDatabasesExceptLocal 1
2018-09-26T17:57:14.592+0000 I REPL     [replication-1] creating replication oplog of size: 990MB... 2018-09-26T17:57:14.633+0000 I REPL     [replication-0] CollectionCloner::start called, on ns:admin.system.version

Finished fetching all the oplogs, and finishing up initial sync.

2018-09-26T17:57:15.685+0000 I REPL     [replication-0] Finished fetching oplog during initial sync: CallbackCanceled: Callback canceled. Last fetched optime and hash: { ts: Timestamp 1537984626000|1, t: 9 }[-1139925876765058240]
2018-09-26T17:57:15.685+0000 I REPL     [replication-0] Initial sync attempt finishing up.

What are oplogs and where do these reside?

“Oplog” stands for “operation log”. We have used this term many times in this blog post, as these are the mandatory logs for the replica set. These operations are stored in the capped collection called “oplog.rs” that resides in the “local” database.

Below is how oplogs are stored in the collection “oplog.rs”, including details such as the timestamp, operation, namespace, and output.

rplint:PRIMARY> use local
rplint:PRIMARY> show collections
oplog.rs
rplint:PRIMARY> db.oplog.rs.findOne()
{
 "ts" : Timestamp(1537797392, 1),
 "h" : NumberLong("-169301588285533642"),
 "v" : 2,
 "op" : "n",
 "ns" : "",
 "o" : {
 "msg" : "initiating set"
 }
}

The oplog consists of the rolling stream of operations applied to the database. These oplogs then replicate to the secondary node(s) to maintain high availability of the data in case of failover.

When a replica MongoDB instance starts, it creates an oplog of the default size. For WiredTiger, the default size is 5% of the free disk space, with a lower bound of 990MB. So here in the example it creates a 990MB oplog. If you’d like to learn more about oplog size then please refer here

2018-09-26T17:57:14.592+0000 I REPL     [replication-1] creating replication oplog of size: 990MB...
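You can confirm the configured oplog size from the mongo shell with db.printReplicationInfo(); only the relevant line of its output is shown here:

rplint:PRIMARY> db.printReplicationInfo()
configured oplog size:   990MB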

What if the same oplog is applied multiple times, will that not lead to inconsistent data?

Fortunately, oplog entries are idempotent, which means they produce the same result even when applied multiple times.

Let’s check an example:

Take the $inc operator, which increments the value of the field “item” by 1. If this operation were logged as an increment and applied multiple times, it could produce an inconsistent record. Instead, rather than increasing the item value multiple times, the increment is effectively applied only once.

rplint:PRIMARY> use db1
//inserting one document
rplint:PRIMARY> db.col1.insert({item:1, name:"abc"})
//updating document by incrementing item value with 1
rplint:PRIMARY> db.col1.update({name:"abc"},{$inc:{item:1}})
//updated value is now item:2
rplint:PRIMARY> db.col1.find()
{ "_id" : ObjectId("5babd57cce2ef78096ac8e16"), "item" : 2, "name" : "abc" }

This is how these operations are stored in the oplog; here the $inc operation is stored as a $set:

rplint:PRIMARY> db.oplog.rs.find({ns:"db1.col1"})
//insert operation
{ "ts" : Timestamp(1537987964, 2), "t" : NumberLong(9), "h" : NumberLong("8083740413874479202"), "v" : 2, "op" : "i", "ns" : "db1.col1", "o" : { "_id" : ObjectId("5babd57cce2ef78096ac8e16"), "item" : 1, "name" : "abc" } }
//$inc operation is changed as ""$set" : { "item" : 2"
{ "ts" : Timestamp(1537988022, 1), "t" : NumberLong(9), "h" : NumberLong("-1432987813358665721"), "v" : 2, "op" : "u", "ns" : "db1.col1", "o2" : { "_id" : ObjectId("5babd57cce2ef78096ac8e16") }, "o" : { "$set" : { "item" : 2 } } }

That means that however many times it is applied, it will generate the same result, so no inconsistent data!

I hope this blog post helps you to understand multiple scenarios for MongoDB replica sets, and how data replicates to the nodes.

Instrumenting Read Only Transactions in InnoDB


Probably not well known but quite an important optimization was introduced in MySQL 5.6 – reduced overhead for “read only transactions”. While usually by a “transaction” we mean a query or a group of queries that change data, with transaction engines like InnoDB, every data read or write operation is a transaction.

Now, as a non-locking read operation obviously has less impact on the data, it does not need all the instrumenting overhead a write transaction has. The main thing that can be avoided, as described by the documentation, is the transaction ID. So, since MySQL 5.6, a read only transaction does not have a transaction ID. Moreover, such a transaction is not visible in the SHOW ENGINE INNODB STATUS output, though I will not go deeper into what that really means under the hood in this article. The fact is that this optimization allows for better scaling of workloads with many RO threads. An example RO benchmark, where the 5.5 vs 5.6/5.7 difference is clearly seen, may be found here: https://www.percona.com/blog/2016/04/07/mysql-5-7-sysbench-oltp-read-results-really-faster/

To benefit from this optimization in MySQL 5.6, either a transaction has to start with the explicit START TRANSACTION READ ONLY clause or it must be an autocommit, non-locking SELECT statement. In version 5.7 and newer, it goes further, as a new transaction is treated as read-only until a locking read or write is executed, at which point it gets “upgraded” to a read-write one.

Information Schema Instrumentation

Let’s see how it looks (on MySQL 8.0.12) by looking at the information_schema.innodb_trx and information_schema.innodb_metrics tables. The second of these, by default, has transaction counters disabled, so before the test we have to enable it with:

SET GLOBAL innodb_monitor_enable = 'trx%comm%';

or by adding a parameter to the [mysqld] section of the configuration file and restarting the instance:

innodb_monitor_enable = "trx_%"

Now, let’s start a transaction which should be read only according to the rules:

mysql [localhost] {msandbox} (db1) > START TRANSACTION; SELECT count(*) FROM db1.t1;
Query OK, 0 rows affected (0.00 sec)
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT trx_id,trx_weight,trx_rows_locked,trx_rows_modified,trx_is_read_only,trx_autocommit_non_locking
FROM information_schema.innodb_trx\G
*************************** 1. row ***************************
                    trx_id: 421988493944672
                trx_weight: 0
           trx_rows_locked: 0
         trx_rows_modified: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)

The transaction started as above did not appear in SHOW ENGINE INNODB STATUS, and its trx_id looks strangely high. And first surprise—for some reason, trx_is_read_only is 0. Now, what if we commit such a transaction—how do the counters change? (I reset them before the test):

mysql [localhost] {msandbox} (db1) > commit;
Query OK, 0 rows affected (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     0 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     1 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     0 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.01 sec)

OK, so clearly it was a read-only transaction overall, just the trx_is_read_only property wasn’t set as expected. I had to report this problem here: https://bugs.mysql.com/bug.php?id=92558

What about an explicit RO transaction:

mysql [localhost] {msandbox} (db1) > START TRANSACTION READ ONLY; SELECT count(*) FROM db1.t1;
Query OK, 0 rows affected (0.00 sec)
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT trx_id,trx_weight,trx_rows_locked,trx_rows_modified,trx_is_read_only,trx_autocommit_non_locking
FROM information_schema.innodb_trx\G
*************************** 1. row ***************************
                    trx_id: 421988493944672
                trx_weight: 0
           trx_rows_locked: 0
         trx_rows_modified: 0
          trx_is_read_only: 1
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > commit;
Query OK, 0 rows affected (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     0 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     2 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     0 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.01 sec)

OK, both transactions are counted as the same type. Moreover, the two transactions shared the same strange trx_id, which appears to be a fake one. For a simple read executed in autocommit mode, the counters increase as expected too:

mysql [localhost] {msandbox} (db1) > select @@autocommit; SELECT count(*) FROM db1.t1;
+--------------+
| @@autocommit |
+--------------+
|            1 |
+--------------+
1 row in set (0.00 sec)
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     0 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     2 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     1 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.00 sec)

Now, let’s test how a transaction looks when we upgrade it to RW later:

mysql [localhost] {msandbox} (db1) > START TRANSACTION; SELECT count(*) FROM db1.t1;
Query OK, 0 rows affected (0.00 sec)
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT trx_id,trx_weight,trx_rows_locked,trx_rows_modified,trx_is_read_only,trx_autocommit_non_locking
FROM information_schema.innodb_trx\G
*************************** 1. row ***************************
                    trx_id: 421988493944672
                trx_weight: 0
           trx_rows_locked: 0
         trx_rows_modified: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT count(*) FROM db1.t1 FOR UPDATE;
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT trx_id,trx_weight,trx_rows_locked,trx_rows_modified,trx_is_read_only,trx_autocommit_non_locking
FROM information_schema.innodb_trx\G
*************************** 1. row ***************************
                    trx_id: 4106
                trx_weight: 2
           trx_rows_locked: 4
         trx_rows_modified: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > commit;
Query OK, 0 rows affected (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     1 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     2 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     1 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.00 sec)

OK, as seen above, once a locking read was executed the transaction was transformed: it got a real, unique trx_id assigned, and when it committed, the RW counter increased.

Performance Schema Problem

Nowadays it may feel natural to use performance_schema for monitoring everything. And, indeed, we can monitor types of transactions with it as well. Let’s enable the needed consumers and instruments:

mysql [localhost] {msandbox} (db1) > UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE '%transactions%';
Query OK, 0 rows affected (0.00 sec)
Rows matched: 3  Changed: 0  Warnings: 0
mysql [localhost] {msandbox} (db1) > UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME = 'transaction';
Query OK, 0 rows affected (0.01 sec)
Rows matched: 1  Changed: 0  Warnings: 0
mysql [localhost] {msandbox} (db1) > SELECT * FROM performance_schema.setup_instruments WHERE NAME = 'transaction';
+-------------+---------+-------+------------+------------+---------------+
| NAME        | ENABLED | TIMED | PROPERTIES | VOLATILITY | DOCUMENTATION |
+-------------+---------+-------+------------+------------+---------------+
| transaction | YES     | YES   |            |          0 | NULL          |
+-------------+---------+-------+------------+------------+---------------+
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT * FROM performance_schema.setup_consumers WHERE NAME LIKE '%transactions%';
+----------------------------------+---------+
| NAME                             | ENABLED |
+----------------------------------+---------+
| events_transactions_current      | YES     |
| events_transactions_history      | YES     |
| events_transactions_history_long | YES     |
+----------------------------------+---------+
3 rows in set (0.01 sec)
mysql [localhost] {msandbox} (db1) > SELECT COUNT_STAR,COUNT_READ_WRITE,COUNT_READ_ONLY
FROM performance_schema.events_transactions_summary_global_by_event_name\G
*************************** 1. row ***************************
      COUNT_STAR: 0
COUNT_READ_WRITE: 0
 COUNT_READ_ONLY: 0
1 row in set (0.00 sec)

And let’s do some simple tests:

mysql [localhost] {msandbox} (db1) > START TRANSACTION; COMMIT;
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT COUNT_STAR,COUNT_READ_WRITE,COUNT_READ_ONLY
FROM performance_schema.events_transactions_summary_global_by_event_name\G
*************************** 1. row ***************************
      COUNT_STAR: 1
COUNT_READ_WRITE: 1
 COUNT_READ_ONLY: 0
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     0 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     0 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     0 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.00 sec)

An empty transaction increased the RW counter in the Performance Schema view! Moreover, a simple autocommit SELECT increases it too:

mysql [localhost] {msandbox} (db1) > SELECT count(*) FROM db1.t1;
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.01 sec)
mysql [localhost] {msandbox} (db1) > SELECT COUNT_STAR,COUNT_READ_WRITE,COUNT_READ_ONLY
FROM performance_schema.events_transactions_summary_global_by_event_name\G
*************************** 1. row ***************************
      COUNT_STAR: 2
COUNT_READ_WRITE: 2
 COUNT_READ_ONLY: 0
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > START TRANSACTION READ ONLY; COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT COUNT_STAR,COUNT_READ_WRITE,COUNT_READ_ONLY
FROM performance_schema.events_transactions_summary_global_by_event_name\G
*************************** 1. row ***************************
      COUNT_STAR: 3
COUNT_READ_WRITE: 2
 COUNT_READ_ONLY: 1
1 row in set (0.00 sec)
mysql [localhost] {msandbox} (db1) > SELECT name, comment, status, count
FROM information_schema.innodb_metrics   WHERE name like 'trx%comm%';
+---------------------------+--------------------------------------------------------------------+---------+-------+
| name                      | comment                                                            | status  | count |
+---------------------------+--------------------------------------------------------------------+---------+-------+
| trx_rw_commits            | Number of read-write transactions  committed                       | enabled |     0 |
| trx_ro_commits            | Number of read-only transactions committed                         | enabled |     0 |
| trx_nl_ro_commits         | Number of non-locking auto-commit read-only transactions committed | enabled |     1 |
| trx_commits_insert_update | Number of transactions committed with inserts and updates          | enabled |     0 |
+---------------------------+--------------------------------------------------------------------+---------+-------+
4 rows in set (0.01 sec)

As seen above, monitoring transaction types via Performance Schema appears completely broken: empty transactions increase the counters, and the only way to increase the RO counter is to start a read-only transaction explicitly, which again should not count when no real read was done from a table. For this reason I filed another bug report: https://bugs.mysql.com/bug.php?id=92364

PMM Dashboard

We implemented a transactions information view in PMM, based on information_schema.innodb_metrics which, as presented above, is reliable and shows the correct counters. Therefore, I encourage everyone to enable these counters with the innodb_monitor_enable setting and let PMM graph them.
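A minimal sketch of enabling them, assuming the trx_* counters belong to the module_trx metrics module (add the same setting to my.cnf if you want it to persist across restarts):

-- enable the InnoDB transaction counters at runtime
SET GLOBAL innodb_monitor_enable = 'module_trx';

-- verify the counters are now enabled
SELECT name, status FROM information_schema.innodb_metrics WHERE name LIKE 'trx%comm%';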

How to Fix ProxySQL Configuration When it Won’t Start


With the exception of the three configuration variables described here, ProxySQL will only parse the configuration files the first time it is started, or if the proxysql.db file is missing for some other reason.

If we want to change any of this data, we need to do so via ProxySQL's admin interface and then save it to disk. That's fine if ProxySQL is running, but what if it won't start because of these values?
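When ProxySQL is up, that workflow is straightforward. A minimal sketch, assuming the default admin credentials and admin port 6032:

mysql -u admin -padmin -h 127.0.0.1 -P 6032

UPDATE global_variables SET variable_value='127.0.0.1:6033' WHERE variable_name='mysql-interfaces';
SAVE MYSQL VARIABLES TO DISK;
-- mysql-interfaces is only read at startup, so a restart is still required
PROXYSQL RESTART;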

For example, perhaps we accidentally configured ProxySQL to run on port 3306 and restarted it, but there’s already a production MySQL instance running on this port. ProxySQL won’t start, so we can’t edit the value that way:

2018-10-02 09:18:33 network.cpp:53:listen_on_port(): [ERROR] bind(): Address already in use
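A quick way to confirm what is already bound to the port, assuming lsof is available on the host:

[root@centos7-pxc57-4 ~]# lsof -i :3306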

We could delete proxysql.db and have it reload the configuration files, but that would mean any changes we didn’t mirror into the configuration files will be lost.
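For the record, starting ProxySQL with the --initial flag achieves the same reset from the configuration file, with the same caveat that any unmirrored changes are lost:

[root@centos7-pxc57-4 ~]# proxysql --initial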

Another option is to edit ProxySQL’s database file using sqlite3:

[root@centos7-pxc57-4 ~]# cd /var/lib/proxysql/
[root@centos7-pxc57-4 proxysql]# sqlite3 proxysql.db
sqlite> SELECT * FROM global_variables WHERE variable_name='mysql-interfaces';
mysql-interfaces|127.0.0.1:3306
sqlite> UPDATE global_variables SET variable_value='127.0.0.1:6033' WHERE variable_name='mysql-interfaces';
sqlite> SELECT * FROM global_variables WHERE variable_name='mysql-interfaces';
mysql-interfaces|127.0.0.1:6033

Or if we have a few edits to make we may prefer to do so with a text editor:

[root@centos7-pxc57-4 ~]# cd /var/lib/proxysql/
[root@centos7-pxc57-4 proxysql]# sqlite3 proxysql.db
sqlite> .output /tmp/global_variables
sqlite> .dump global_variables
sqlite> .exit

The above commands will dump the global_variables table into a file in SQL format, which we can then edit:

[root@centos7-pxc57-4 proxysql]# grep mysql-interfaces /tmp/global_variables
INSERT INTO "global_variables" VALUES('mysql-interfaces','127.0.0.1:3306');
[root@centos7-pxc57-4 proxysql]# vim /tmp/global_variables
[root@centos7-pxc57-4 proxysql]# grep mysql-interfaces /tmp/global_variables
INSERT INTO "global_variables" VALUES('mysql-interfaces','127.0.0.1:6033');

Now we need to restore this data. We'll use the .restore command to empty the table (as we're restoring from a backup file that doesn't exist), and then read the edited dump back in:

[root@centos7-pxc57-4 proxysql]# sqlite3 proxysql.db
sqlite> .restore global_variables
sqlite> .read /tmp/global_variables
sqlite> .exit

Once we’ve made the change, we should be able to start ProxySQL again:

[root@centos7-pxc57-4 proxysql]# /etc/init.d/proxysql start
Starting ProxySQL: DONE!
[root@centos7-pxc57-4 proxysql]# lsof -i | grep proxysql
proxysql 15171 proxysql 19u IPv4 265881 0t0 TCP localhost:6033 (LISTEN)
proxysql 15171 proxysql 20u IPv4 265882 0t0 TCP localhost:6033 (LISTEN)
proxysql 15171 proxysql 21u IPv4 265883 0t0 TCP localhost:6033 (LISTEN)
proxysql 15171 proxysql 22u IPv4 265884 0t0 TCP localhost:6033 (LISTEN)
proxysql 15171 proxysql 23u IPv4 266635 0t0 TCP *:6032 (LISTEN)

While you are here

You might enjoy my recent post Using ProxySQL to connect to IPV6-only databases over IPV4

You can download ProxySQL from Percona repositories, and you might also want to check out our recorded webinars that feature ProxySQL too.

Identifying High Load Spots in MySQL Using Slow Query Log and pt-query-digest


pt-query-digest is one of the most commonly used tools for query auditing in MySQL®. By default, pt-query-digest reports the top ten queries that consumed the most time inside MySQL. A query that takes longer than the set threshold to complete is considered slow, but it's not always true that tuning such queries makes them faster. Sometimes, when resources on the server are busy, every other operation on the server is impacted, including queries. In such cases you will see the proportion of slow queries go up, and that can include queries which work fine in general.

This article explains a small trick to identify such spots using pt-query-digest and the slow query log. pt-query-digest is a component of Percona Toolkit, open source software that is free to download and use.
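If you haven't used it before, a typical invocation looks like this (the log path is illustrative); the --since/--until filters become handy once an interesting time window has been identified, as we do later in this post:

# summarize a whole slow query log into a report
pt-query-digest /var/lib/mysql/mysql-slow.log > /tmp/slow-digest.txt
# or restrict the analysis to a suspect time window
pt-query-digest --since '2018-10-03 05:00:00' --until '2018-10-03 06:00:00' /var/lib/mysql/mysql-slow.log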

Some sample data

Let's have a look at sample data in Percona Server 5.7. The slow query log is configured to capture queries running longer than ten seconds, with no limit on the rate of logging; rate limiting is generally used to throttle the I/O incurred by writing slow queries to the log file.

mysql> show variables like 'log_slow_rate%' ;
+---------------------+---------+
| Variable_name       | Value   |
+---------------------+---------+
| log_slow_rate_limit | 1       |  --> Log all queries
| log_slow_rate_type  | session |
+---------------------+---------+
2 rows in set (0.00 sec)
mysql> show variables like 'long_query_time' ;
+-----------------+-----------+
| Variable_name   | Value     |
+-----------------+-----------+
| long_query_time | 10.000000 |  --> 10 seconds
+-----------------+-----------+
1 row in set (0.01 sec)

When I run pt-query-digest, I see in the summary report that 80% of the queries have come from just three query patterns.

# Profile
# Rank Query ID                      Response time    Calls R/Call   V/M
# ==== ============================= ================ ===== ======== =====
#    1 0x7B92A64478A4499516F46891... 13446.3083 56.1%   102 131.8266  3.83 SELECT performance_schema.events_statements_history
#    2 0x752E6264A9E73B741D3DC04F...  4185.0857 17.5%    30 139.5029  0.00 SELECT table1
#    3 0xAFB5110D2C576F3700EE3F7B...  1688.7549  7.0%    13 129.9042  8.20 SELECT table2
#    4 0x6CE1C4E763245AF56911E983...  1401.7309  5.8%    12 116.8109 13.45 SELECT table4
#    5 0x85325FDF75CD6F1C91DFBB85...   989.5446  4.1%    15  65.9696 55.42 SELECT tbl1 tbl2 tbl3 tbl4
#    6 0xB30E9CB844F2F14648B182D0...   420.2127  1.8%     4 105.0532 12.91 SELECT tbl5
#    7 0x7F7C6EE1D23493B5D6234382...   382.1407  1.6%    12  31.8451 70.36 INSERT UPDATE tbl6
#    8 0xBC1EE70ABAE1D17CD8F177D7...   320.5010  1.3%     6  53.4168 67.01 REPLACE tbl7
#   10 0xA2A385D3A76D492144DD219B...   183.9891  0.8%    18  10.2216  0.00 UPDATE tbl8
#      MISC 0xMISC                     948.6902  4.0%    14  67.7636   0.0 <10 ITEMS>

Query #1 is generated by the qan-agent from PMM and runs approximately once a minute; its results are handed over to PMM Server. Queries #2 and #3 are pretty simple: they scan just one row and return either zero or one row. They also use indexes, which makes me think the problem is not with the queries themselves. I wanted to know if I could find a common aspect of all these occurrences.

Let's take a closer look at the queries recorded in the slow query log.

# grep -B3 DIGEST mysql-slow_Oct2nd_4th.log
....
....
# User@Host: ztrend[ztrend] @ localhost []  Id: 6431601021
# Query_time: 139.279651  Lock_time: 64.502959 Rows_sent: 0  Rows_examined: 0
SET timestamp=1538524947;
SELECT DIGEST, CURRENT_SCHEMA, SQL_TEXT FROM performance_schema.events_statements_history;
# User@Host: ztrend[ztrend] @ localhost []  Id: 6431601029
# Query_time: 139.282594  Lock_time: 83.140413 Rows_sent: 0  Rows_examined: 0
SET timestamp=1538524947;
SELECT DIGEST, CURRENT_SCHEMA, SQL_TEXT FROM performance_schema.events_statements_history;
# User@Host: ztrend[ztrend] @ localhost []  Id: 6431601031
# Query_time: 139.314228  Lock_time: 96.679563 Rows_sent: 0  Rows_examined: 0
SET timestamp=1538524947;
SELECT DIGEST, CURRENT_SCHEMA, SQL_TEXT FROM performance_schema.events_statements_history;
....
....

Now you can see two things.

  • All of them have the same Unix timestamp
  • All of them were spending more than 70% of their execution time waiting for some lock.

Analyzing the data from pt-query-digest

Now I want to check whether I can group the count of queries by their time of execution. If multiple queries are captured into the slow query log at a given time, the time is printed for the first query but not for all of them. Fortunately, in this case I can rely on the Unix timestamp, which gets captured for every query, to compute the counts. Without a long struggle, a combination of the grep and awk utilities displayed what I wanted:

# grep -A1 Query_time mysql-slow_Oct2nd_4th.log | grep SET | awk -F "=" '{ print $2 }' | uniq -c
2   1538450797;
1   1538524822;
3   1538524846;
7   1538524857;
167 1538524947;   ---> 72% of queries have happened at this timestamp.
1   1538551813;
3   1538551815;
6   1538602215;
1   1538617599;
33  1538631015;
1   1538631016;
1   1538631017;

You can use the command below to convert a given timestamp into regular date and time format. So, Oct 3, 05:32 is when something went wrong on the server:

# date -d @1538524947
Wed Oct 3 05:32:27 IST 2018
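If you'd rather decode every timestamp in one go, the earlier pipeline can be extended a little (a convenience variant of the same commands):

# grep -A1 Query_time mysql-slow_Oct2nd_4th.log | grep SET | awk -F "=" '{ print $2 }' | tr -d ';' | uniq -c \
  | while read cnt ts; do echo "$cnt  $(date -d @$ts)"; done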

Query tuning can be carried out alongside this, but identifying such spots helps avoid spending time on query tuning when badly written queries are not the problem. Having said that, from this point further troubleshooting may take different sub-paths, such as checking log files from that particular time, looking at CPU reports, reviewing past pt-stalk reports if set up to run in the background, checking dmesg, and so on. This approach is useful for identifying at what time (or time range) MySQL was more stressed using just the slow query log, when no robust monitoring tools such as Percona Monitoring and Management (PMM) are deployed.

Using PMM to monitor queries

If you have PMM, you can review Query Analytics to see the topmost slow queries, along with details like execution counts, load etc. Below is a sample screen copy for your reference:

Slow query log from PMM dashboard

NOTE: If you use Percona Server for MySQL, the slow query log can report time in microseconds. It also supports extended logging of other statistics about query execution, which gives extra insight into query processing. You can find more information about these options here.
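As a sketch, the relevant my.cnf section for such a setup might look like this (values are illustrative; log_slow_verbosity is the Percona Server option that turns on the extended per-query statistics):

[mysqld]
slow_query_log      = ON
slow_query_log_file = /var/lib/mysql/mysql-slow.log
long_query_time     = 10
# Percona Server extensions: extended statistics, log every qualifying query
log_slow_verbosity  = full
log_slow_rate_limit = 1
log_slow_rate_type  = session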

MongoDB Backup: How and When To Use PSMDB hotbackup and mongodb_consistent_backup


We have many ways to back up a MongoDB database, using native mongodump or external tools. In this article, however, we'll take a look at the backup tools offered by Percona, keeping in mind the restoration scenarios for MongoDB replicaSet and Sharded Cluster environments. We'll explore how and when to use the tool mongodb-consistent-backup from Percona lab to back up the database consistently in Sharded Cluster/replicaSet environments. We'll also take a look at hotbackup, a feature that's available in Percona Server for MongoDB (PSMDB) packages.

Backup is done – What about Restore?

Those who are responsible for data almost always think about how to back up the database and store the backups securely, but they often fail to foresee the scenario where a backup actually needs to be used to restore data. For example, I have seen many companies schedule the backups of config servers and shard servers separately, starting and completing them at different times depending on data volume. But can we use such a backup when we need to restore and start the cluster with it? The answer is no. Well, maybe yes if you can tweak the metadata, but data inconsistency may occur. With this kind of schedule, the backup is not consistent for the whole cluster: there is no single point to which we can restore the data of all shards and config DBs so that the cluster can be started from it. Consequently, we face a difficult situation exactly when we really need to use that backup!

Let’s explore the two tools/features available to backup MongoDB from Percona, and look at which method to choose based on your restoration plan. 

Hot backup for both replicaset and Sharded cluster

Note: PerconaLabs and Percona-QA are open source GitHub repositories for unofficial scripts and tools created by Percona staff. These handy utilities can help you save time and effort.

Percona software builds located in the PerconaLabs and Percona-QA repositories are not officially released software, and also aren’t covered by Percona support or services agreements.

The main problem with backups is maintaining consistency, as the application keeps writing to the DB while the backup is going on. So to maintain consistency throughout the backup, and get a reliable full backup of all the data needed to restore the database, the backup tool needs to track changes via the oplog as well. Using the mongodump utility along with an oplog backup achieves this easily in a replicaSet environment, since you only need consistency for that one replicaSet.
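For instance, a minimal sketch for a single replicaSet (host, replicaSet name, and paths are illustrative):

# dump all databases plus the slice of the oplog covering the dump window;
# mongorestore --oplogReplay can then restore to a single consistent point
mongodump --host rs0/127.0.0.1:27017 --oplog --out /backup/rs0-$(date +%Y%m%d)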

But when we need a consistent backup of a Sharded cluster, it is very difficult to achieve total cluster consistency, as it involves backing up all shards and config servers together up to a particular point, so that the backup can be reused in failover cases. Even if you run mongodump manually on each shard and config replicaSet separately, trying to take a consistent backup of the whole cluster while writes are coming in is a very tedious job: the backup of each shard ends at a different point depending on load, data volume, and so on.

To remedy this, we can take a consistent hot backup of the Sharded cluster by using our utility mongodb-consistent-backup: in other words, a point-in-time backup for the sharded cluster environment. This utility internally uses mongodump and captures the oplog changes from each node until the backups from all data nodes and config servers are complete. This ensures that the backup of the whole Sharded Cluster is consistent. Make sure you are using a replicaSet for your config servers too. In fact, this tool also helps you take a consistent backup in a plain replicaSet environment.

This utility is available in our Percona lab but please note that it is not yet supported officially. To install this package, please make sure you install all the dependency packages, and follow the steps mentioned in this link to complete the installation process.

If you have enabled authentication in your environment, then create a user like below:

db.createUser({
	user: "backup_usr",
	pwd: "backup_pass",
	roles: [
	{ role: "clusterMonitor", db: "admin" }
	]
})

The backup can then be taken as follows, connecting to one of the mongos nodes in the Sharded Cluster. Here mongos is running on port 27051, and the cluster has one config replicaSet cfg and two shards, s1 and s2.

[root@app mongodb_consistent_backup-master]# ./bin/mongodb-consistent-backup -H localhost \
> -P 27051 \
> -u backup_usr \
> -p backup_pass \
> -a admin \
> -n clusterFullBackup \
> -l backup/mongodb
[2018-12-05 18:57:38,863] [INFO] [MainProcess] [Main:init:144] Starting mongodb-consistent-backup version 1.4.0 
(git commit: unknown)
[2018-12-05 18:57:38,864] [INFO] [MainProcess] [Main:init:145] Loaded config: {"archive": {"method": "tar", "tar": 
{"binary": "tar", "compression": "gzip"}, "zbackup": {"binary": "/usr/bin/zbackup", "cache_mb": 128, "compression": "lzma"}}, 
"authdb": "admin", "backup": {"location": "backup/mongodb", "method": "mongodump", "mongodump": {"binary": "/usr/bin/mongodump", 
"compression": "auto"}, "name": "clusterFullBackup"}, "environment": "production", "host": "localhost", "lock_file": 
"/tmp/mongodb-consistent-backup.lock", "notify": {"method": "none"}, "oplog": {"compression": "none", "flush": {"max_docs": 100, 
"max_secs": 1}, "tailer": {"enabled": "true", "status_interval": 30}}, "password": "******", "port": 27051, "replication": 
{"max_lag_secs": 10, "max_priority": 1000}, "sharding": {"balancer": {"ping_secs": 3, "wait_secs": 300}}, "upload": {"method": 
"none", "retries": 5, "rsync": {"path": "/", "port": 22}, "s3": {"chunk_size_mb": 50, "region": "us-east-1", "secure": true}, 
"threads": 4}, "username": "backup_usr"}
...
...
[2018-12-05 18:57:40,715] [INFO] [MongodumpThread-5] [MongodumpThread:run:204] Starting mongodump backup of s2/127.0.0.1:27043
[2018-12-05 18:57:40,722] [INFO] [MongodumpThread-7] [MongodumpThread:run:204] Starting mongodump backup of cfg/127.0.0.1:27022
[2018-12-05 18:57:40,724] [INFO] [MongodumpThread-6] [MongodumpThread:run:204] Starting mongodump backup of s1/127.0.0.1:27032
[2018-12-05 18:57:40,800] [INFO] [MongodumpThread-5] [MongodumpThread:wait:130] s2/127.0.0.1:27043:	Enter password:
[2018-12-05 18:57:40,804] [INFO] [MongodumpThread-6] [MongodumpThread:wait:130] s1/127.0.0.1:27032:	Enter password:
[2018-12-05 18:57:40,820] [INFO] [MongodumpThread-7] [MongodumpThread:wait:130] cfg/127.0.0.1:27022:	Enter password:
...
...
[2018-12-05 18:57:54,880] [INFO] [MainProcess] [Mongodump:wait:105] All mongodump backups completed successfully
[2018-12-05 18:57:54,892] [INFO] [MainProcess] [Stage:run:95] Completed running stage mongodb_consistent_backup.Backup with task 
Mongodump in 14.21 seconds
[2018-12-05 18:57:54,913] [INFO] [MainProcess] [Tailer:stop:86] Stopping all oplog tailers
[2018-12-05 18:57:55,955] [INFO] [MainProcess] [Tailer:stop:118] Waiting for tailer s2/127.0.0.1:27043 to stop
[2018-12-05 18:57:56,889] [INFO] [TailThread-2] [TailThread:run:177] Done tailing oplog on s2/127.0.0.1:27043, 2 oplog changes, 
end ts: Timestamp(1544036268, 1)
[2018-12-05 18:57:59,967] [INFO] [MainProcess] [Tailer:stop:118] Waiting for tailer s1/127.0.0.1:27032 to stop
[2018-12-05 18:58:00,801] [INFO] [TailThread-3] [TailThread:run:177] Done tailing oplog on s1/127.0.0.1:27032, 3 oplog changes, 
end ts: Timestamp(1544036271, 1)
[2018-12-05 18:58:03,985] [INFO] [MainProcess] [Tailer:stop:118] Waiting for tailer cfg/127.0.0.1:27022 to stop
[2018-12-05 18:58:04,803] [INFO] [TailThread-4] [TailThread:run:177] Done tailing oplog on cfg/127.0.0.1:27022, 8 oplog changes, 
end ts: Timestamp(1544036279, 1)
[2018-12-05 18:58:06,989] [INFO] [MainProcess] [Tailer:stop:125] Oplog tailing completed in 27.85 seconds
...
...
[2018-12-05 18:58:09,478] [INFO] [MainProcess] [Rotate:symlink:83] Updating clusterFullBackup latest symlink to current backup 
path: backup/mongodb/clusterFullBackup/20181205_1857
[2018-12-05 18:58:09,480] [INFO] [MainProcess] [Main:run:461] Completed mongodb-consistent-backup in 30.49 sec

where:
-n – backup name (directory to be created)
-l – backup location (base directory)
-H – hostname
-P – port
-p – password
-u – user
-a – authentication database

The log above shows the backup progressing: it captures the state of the oplog and keeps applying the changes. The same command can be used to connect to a replicaSet, given the proper hostname. The tool also identifies by itself whether it is connecting to a replicaSet or a Sharded cluster before proceeding with the backup. This can be seen in the log output written by the tool when running the backup:

For sharding cluster:

[2018-12-05 19:05:02,453] [INFO] [MainProcess] [Main:run:299] Running backup in sharding mode using seed node(s): localhost:27051

For replicaSet:

[2018-12-05 19:23:05,070] [INFO] [MainProcess] [Main:run:257] Running backup in replset mode using seed node(s): localhost:27041

You can check out a couple of our blogs here and here for more details about the utility.

Hot but Cold backup

You may be wondering about the title Hot but Cold backup. Yes, Percona Server for MongoDB (PSMDB) packages include a feature to take a binary hot backup: hotbackup. Those who know the MySQL world will already know Percona XtraBackup, our free, open source binary hot backup utility for MySQL. PSMDB hotbackup works in a similar way: when you back up with it, you get a binary backup from which an instance can be started directly, using the backup directory as its data directory. You don't need to worry about restoring from scratch and recreating indexes. However, this solution works for replicaSet/standalone MongoDB instances only.
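For reference, a hotbackup is triggered from the mongo shell with the createBackup admin command; the backup path below is illustrative and must be writable by the mongod user:

> use admin
switched to db admin
> db.runCommand({ createBackup: 1, backupDir: "/data/hotbackup" })
{ "ok" : 1 }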

If you plan well, you could feasibly use this feature to back up a Sharded cluster: bring down one secondary from each shard and config replicaSet at the same time (ideally when there is little or no write activity), then start each of them on a different port and without the replicaSet option, so that those instances won't rejoin their replicaSets. Now run hotbackup on each of these instances, roughly as sketched below. Once the backups are finished, revert the changes in the config files and allow the nodes to rejoin their replicaSets.
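Per node, that dance might look roughly like this (paths, port, and startup method are illustrative):

# clean shutdown of the secondary (never kill -9, see the notes below)
mongod --shutdown --dbpath /var/lib/mongodb
# restart it standalone: different port, no --replSet option
mongod --port 28017 --dbpath /var/lib/mongodb --logpath /tmp/standalone.log --fork
# run the hotbackup (see above), shut down cleanly again, restore the original
# options, and let the node rejoin its replicaSet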

Cautionary notes: make sure you use low-priority or hidden nodes for this purpose, so that no election is triggered when they leave and rejoin their replicaSets, and don't use SIGKILL (kill -9) to stop the db, as that shuts the database down abruptly. Also, plan to have at least as much free disk space as your shard uses: a hotbackup takes approximately the same amount of space as the node's data.

My colleague Tim Vaillancourt has written a great blog post on this: see here.

Conclusion

So, with these two methods available, you can now choose the backup approach that best fits your RTO and RPO, as explained here. I hope this helps you! Please share your comments and feedback below, and tell me what you think!

REFERENCES:

https://www.percona.com/doc/percona-server-for-mongodb/LATEST/hot-backup.html
https://www.percona.com/forums/questions-discussions/percona-server-for-mongodb/53006-percona-mongodb-difference-between-hot-backup-and-backup-using-mongo-dump
https://www.percona.com/blog/2016/07/25/mongodb-consistent-backups/
https://www.percona.com/blog/2018/04/06/free-fast-mongodb-hot-backup-with-percona-server-for-mongodb/
https://www.bluelock.com/blog/rpo-rto-pto-and-raas-disaster-recovery-explained/
https://en.wikipedia.org/wiki/Disaster_recovery
https://www.druva.com/blog/understanding-rpo-and-rto/
https://www.percona.com/live/e17/sites/default/files/slides/Running%20MongoDB%20in%20Production%20-%20FileId%20-%20115299.pdf
https://major.io/2010/03/18/sigterm-vs-sigkill/
https://docs.mongodb.com/manual/core/sharded-cluster-config-servers


Photo by Designecologist from Pexels

 
