mydumper Archives - Percona Database Performance Blog

MyDumper 0.11.3 is Now Available


The new MyDumper 0.11.3 version, which includes many new features and bug fixes, is now available. You can download the code from here.

We are very proud to announce that we achieved the two main objectives for this milestone: ZSTD and Stream support! We added four packages with ZSTD support because not all distributions support v1.4 or higher. The libzstd package is required to use ZSTD compression. The ZSTD Bullseye package is only available with libraries for Percona Server for MySQL 8.0. There are two main use cases for the Stream functionality:

  • Importing while you are exporting
  • Remote backups

The drawback is that it relies on network throughput, as we use a single thread to send the files that have been closed. We will explain how this functionality works in another blog post!
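As a rough illustration of those two use cases (a sketch only; hosts, schema, and port are placeholders, and both ends need a MyDumper/myloader build with --stream support):

# Importing while you are exporting: pipe mydumper straight into myloader
mydumper -B <SCHEMA_NAME> -h <SOURCE> --stream | myloader --stream -o -h <DESTINATION>

# Remote backup: ship the stream over the network with nc
nc -l <BACKUP_SERVER> <ANY_PORT> > backup.sql                                      # on the backup server
mydumper -B <SCHEMA_NAME> -h <SOURCE> --stream | nc <BACKUP_SERVER> <ANY_PORT>     # on the source side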

Enhancement:

Bug/Fixes:

  • Escape double and float values because of -0 #326 #30
  • Fixing const issues after merge zstd #444 #134
  • WITH_ZSTD needs to be set before being used #442
  • Adding better error handling #441 #440
  • Revert #426 #433
  • Database schema creation control added #432 #431
  • Adding anonymized function [Phase I] #427 #428
  • Fixing comment error log in restore_data_in_gstring_from_file #426
  • Adding LC_ALL to allow multi-byte parameters #415 #199
  • Needs to notify main thread to go ahead when “--tables-list” is used without “-B”! #396 #428

Refactoring:

  • Fixing headers #425
  • sync_before_add_index is not needed anymore #416
  • Using generic functions to build filenames #414

Documentation:

  • Modify the log of error #430
  • Fixing readme #420
  • Warning for inconsistencies using multisource #417 #144
  • docs: add brew build dependencies instruction #412
  • [Document] add example #408 #407

Questions Addressed:

  • [BUG] Can’t connect to MySQL server using host and port. #434
  • Could not execute query: Unknown error #335

Download MyDumper 0.11.3 Today!


MyDumper GitHub Repository Is Now an Organization


For a long time, MyDumper lived in Max Bubenick’s personal GitHub repository. Now, we have decided to move to a new MyDumper organization, as requested earlier this year by a user from the community.

There were also two other reasons why we decided to move it. The first is related to how the project is evolving, and the second is that it will allow us to implement integrations with other projects.

We can see the evolution of the project in the increase in commits over the last year.

We tried to keep a release cycle of every two months, focusing on closing as many bugs as possible and implementing the new features requested. It was not an easy task, as many changes had to be made to the mydumper and myloader engines to allow the new features to be developed.

Seeing the progress that has been made, we are encouraged to expect a version 1.0.1 release next year. This will be a huge step for the project, as it will mean that MyDumper is mature enough to handle any export/import task with ease.

On the other hand, moving MyDumper to an organization will allow us to create an official Docker image, and it will also allow us to create a pipeline with CircleCI. Both are on the community wish list, and they will also be useful for current and future development members, as one of the most important and difficult tasks is testing, because of the number of use cases that we have. We expect to see higher quality in releases, not just because of the quality of the code, but also because the tests will cover most of the uses of mydumper and myloader.

Now MyDumper has its own merch, too. By shopping for the merch you contribute to the project. Learn more here.

MyDumper 0.12.3-1 is Now Available


The new MyDumper 0.12.3-1 version, which includes many new features and bug fixes, is now available. You can download the code from here. MyDumper is open source and maintained by the community; it is not a Percona, MariaDB, or MySQL product.

In this new version we focused on:

  • Refactoring tasks: Splitting mydumper.c and myloader.c into multiple files will allow us to find bugs and implement features more easily.
  • Adding compression capabilities on stream: One of the more interesting features added to MyDumper was the ability to stream to another server; now the stream can also be compressed, making transmission much faster.
  • Support for the Percona Monitoring and Management (PMM) tool: Being able to monitor scheduled processes is always desirable, even more so if you know what the process is doing internally, which is why we used the textfile extended metrics to show how many jobs have been completed and the queue status (see the sketch after this list).
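A minimal sketch of how this might be wired up, assuming the PMM-related flags introduced in this release are --pmm-path and --pmm-resolution and that PMM's textfile collector runs on the same host (verify the exact flag names and defaults with mydumper --help):

mydumper -B test -o data \
     --pmm-path=<PMM_TEXTFILE_COLLECTOR_DIR> \   # directory the PMM textfile collector scrapes (assumed)
     --pmm-resolution=high                       # resolution value is an assumption; check your version's help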

Enhancement:

  • Adding PMM support for MyDumper #660 #587
  • Adding compression capabilities on stream #605
  • Dockerfile: use multi-stage builds #603
  • restore defaults-file override behavior #601

Fix:

  • Fixing when DEFINER is absent and stream/intermediate sync when finish #662 #659
  • Adding better message when schema-create files are not provided #655 #617
  • Re-adding --csv option #653
  • Fixing trigger absent when –no-schema is used #647 #541
  • Allow compilation on Clang #641 #563
  • Fixing version number #639 #637
  • Adding critical message when DDL fails #636 #634
  • Fix print time #633 #632
  • print sigint confirmation to stdout instead of to a log file #616 #610
  • Dockerfile: add missing libglib2.0 package #608
  • [BUG] mydumper zstd doesn’t run on CentOS due to missing libzstd package dependency #602

Refactoring:

Help Wanted:

  • [BUG] myloader always uses socket to connect to db #650
  • [BUG] real_db_name isn’t found #613

Documentation:

  • Update README.md #614
  • Use 0.11.5-2 in installation examples #606
  • Question: What exactly is inconsistent about inconsistent backups? #644

Download MyDumper 0.12.3 Today!

MyDumper has its own merch now, too. By shopping for the merch you contribute to the project. Check it out! 

MyDumper’s Stream Implementation


As you might know, mysqldump is single-threaded and STDOUT is its default output. As MyDumper is multithreaded, it has to write to different files. Since version 0.11.3 was released in November 2021, MyDumper has had the ability to stream backups. We thought about it for several months until we decided on the simplest way to implement it, and we also had to add support for compression. So, after fixing several bugs, and now that we consider it stable enough, we can explain how it works.

How Can You Stream if MyDumper is Multithreaded?

Receiving a stream is not a problem for myloader: it receives one file at a time and sends it to a thread to process. However, each worker thread in mydumper is connected to the database, and as soon as it reads data, it would have to send it to the stream, which might cause collisions with other worker threads that are reading data from the database. To avoid this issue, we ended up with the simplest solution: mydumper takes the backup and stores it in the local file system that you configured, and the filename is enqueued to be processed by the Stream Thread, which pops one file at a time and pipes it to stdout. We studied the alternative of sending chunks of the file while it is being dumped, but the way we implemented it is simpler and improves the overall performance.

Implementation Details

Here is a high-level overview of how we implemented it.



When a mydumper Worker Thread processes a job, it connects to the database and stores the output in a file. That didn’t change, but with stream, we also push the filename into the mydumper stream_queue.

The mydumper Stream Thread pops filenames from the mydumper stream_queue, sends the header of the file to stdout, and then opens the file and sends its content.

Then, the myloader Stream Thread receives and detects the header; it creates a new file with the filename from the header and stores the content in it.

After closing the file, it enqueues the filename in the myloader stream_queue. A myloader Worker Thread takes that file and processes it according to the kind of file it is.

By default, the files are deleted, but if you want to keep them, you can use the --no-delete option.

The header simply adds -- before the filename, so you can use either myloader or the mysql client to import your database. Here is an example:

-- sbtest-schema-create.sql
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `sbtest` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci */ /*!80016 DEFAULT ENCRYPTION='N' */;

-- sbtest.sbtest1-schema.sql
/*!40101 SET NAMES binary*/;
/*!40014 SET FOREIGN_KEY_CHECKS=0*/;

/*!40103 SET TIME_ZONE='+00:00' */;
CREATE TABLE `sbtest1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  `pad2` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=100010 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- sbtest.sbtest1.00000.sql
/*!40101 SET NAMES binary*/;
/*!40014 SET FOREIGN_KEY_CHECKS=0*/;
/*!40103 SET TIME_ZONE='+00:00' */;
INSERT INTO `sbtest1` VALUES(1,49929,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","67847967377-48000963322-62604785301-91415491898-96926520291","")
…

Simple Use Cases

Each thread writes to a single file to avoid collisions, which improves performance. However, having thousands of files for a backup of a couple of tables is not manageable. So, the simplest use case is to send everything to a single file:

mydumper -B <SCHEMA_NAME> -h <FROM> --stream > filename.sql

Then you can simply import it using:

myloader --stream -o -h <TO_SERVER> < filename.sql

Now that you can pipe from a mydumper process to myloader, this execution is possible:

mydumper -B <SCHEMA_NAME> -h <FROM> --stream | myloader --stream -o -h <TO>


Or you can send the stream through the network using nc: 

mydumper -B <SCHEMA_NAME> -h <FROM_SERVER> --stream | nc <MYDUMPER_SERVER> <ANY_PORT>
nc -l <MYDUMPER_SERVER> <ANY_PORT> | myloader --stream -o -h <TO_SERVER>

 


This implementation uses the backup directory on mydumper and myloader as a buffer. You must take this into account, as by default it is going to create the directory in the location where you run the command.

Another thing you need to take into account is that mydumper and myloader will both be writing to disk: the whole backup will be written on both file systems while it is being processed, so use file systems with enough disk space.
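For example, both buffer directories can be pointed explicitly at file systems with enough free space (a sketch; paths and hosts are placeholders):

mydumper -B <SCHEMA_NAME> -h <FROM_SERVER> -o /mnt/big_fs/dump --stream \
  | myloader --stream -o -h <TO_SERVER> -d /mnt/big_fs/dump_tmp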

Finally, you can keep myloader running and send several mydumper backups. First, you need to run:

nc -k -l <MYDUMPER_SERVER> <ANY_PORT> | myloader --stream -o -h <TO_SERVER>

And then execute:

mydumper -B <SCHEMA_NAME_1> -h <FROM_SERVER> --stream | nc <MYDUMPER_SERVER> <ANY_PORT>
mydumper -B <SCHEMA_NAME_2> -h <FROM_SERVER> --stream | nc <MYDUMPER_SERVER> <ANY_PORT>
mydumper -B <SCHEMA_NAME_3> -h <FROM_SERVER> --stream | nc <MYDUMPER_SERVER> <ANY_PORT>
mydumper -B <SCHEMA_NAME_4> -h <FROM_SERVER> --stream | nc -N <MYDUMPER_SERVER> <ANY_PORT>

Some versions of nc have these two options:

     -k      When a connection is completed, listen for another one. Requires -l.

     -N      shutdown(2) the network socket after EOF on the input. Some servers require this to finish their work.

This is very useful if you are refreshing a testing environment and you only need a couple of tables from different databases, or if you are using a WHERE clause that only applies to some tables.
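For instance, a refresh of just a couple of tables filtered by a WHERE clause could look like this (a sketch; it assumes the -T/--tables-list and --where options of recent MyDumper versions, with placeholder hosts and values):

mydumper -T db1.customers,db2.orders --where="created_at > '2022-01-01'" \
  -h <FROM_SERVER> --stream | nc <MYDUMPER_SERVER> <ANY_PORT>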

Considerations

Usually, when you send data to STDOUT, you will not have trouble with disk space usage on the dumper server. That is NOT true if you are using MyDumper: files will be stored on the mydumper server until they are transferred to the receiving server. For instance, if you have a 10TB database with very low network bandwidth compared to the disk bandwidth, you might end up filling up the disk where you keep the files temporarily.

Conclusion

We focused the implementation on speeding up the export and import processes. Unlike other software or implementations, we use the file system as a buffer, which causes higher disk utilization.

FTWRL on MyDumper Removed


The title is not entirely true, but ‘FTWRL on MyDumper is not needed anymore for consistent backups’ was a long title. One more time, I wanted to share a new feature in MyDumper. This one is related to an important piece: the locking mechanism that mydumper uses to sync all the threads.

MyDumper was born because, at the time, we didn’t have a tool that could take a consistent logical backup using multiple threads. Syncing all the threads was one of the problems, which was solved by holding FLUSH TABLES WITH READ LOCK (FTWRL) until all the threads execute START TRANSACTION WITH CONSISTENT SNAPSHOT (STWCS); then we release the FTWRL and all the threads are in sync. We all know that FTWRL is very expensive and difficult to acquire on some database workloads.

I started to think about alternatives to avoid using FTWRL, and my first thought was: why don’t we use SHOW ENGINE INNODB STATUS to check whether all the threads are at the same point in time? That is doable, but I didn’t like it, as the threads already know at what point in time they are!

I asked internally at Percona and I got my answer: https://jira.percona.com/browse/PS-4464. Since Percona Server for MySQL versions 8.0.19-10 and 5.7.30-33, we have a status variable that shows the last GTID executed in our STWCS, which was all I needed.

I worked on the solution, which required some testing with high write traffic and forcing scenarios that are infrequent in real life.

How does it work?

In the next MyDumper release, when you use --no-locks, you can still get a consistent backup:

# ./mydumper -B myd_test -o data -v 4 --no-locks
...
** Message: 12:29:42.333: Thread 2: binlog_snapshot_gtid_executed_status_local succeeded with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2180547'.
** Message: 12:29:42.333: All threads in the same position. This will be a consistent backup.
** Message: 12:29:42.333: Thread 3: binlog_snapshot_gtid_executed_status_local succeeded with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2180547'.
** Message: 12:29:42.333: All threads in the same position. This will be a consistent backup.
** Message: 12:29:42.333: Thread 1: binlog_snapshot_gtid_executed_status_local succeeded with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2180547'.
** Message: 12:29:42.333: All threads in the same position. This will be a consistent backup.
** Message: 12:29:42.333: Thread 4: binlog_snapshot_gtid_executed_status_local succeeded with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2180547'.
** Message: 12:29:42.333: All threads in the same position. This will be a consistent backup.
...

In the log, you will see a line per thread showing the thread’s GTID position and confirming that all the threads are at the same point in time. We get the GTID from executing SHOW STATUS LIKE 'binlog_snapshot_gtid_executed' in each thread; this value is compared with a globally shared variable, and if they are not the same, the check fails and mydumper will retry the sync up to five times, informing you of the result:

...
** Message: 12:32:48.968: Thread 2: All threads in same pos check
** Message: 12:32:48.971: Thread 3: All threads in same pos check
** Message: 12:32:48.972: Thread 4: All threads in same pos check
** Message: 12:32:48.973: Thread 1: binlog_snapshot_gtid_executed_status_local failed with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2196157'.
** Message: 12:32:48.973: Thread 2: binlog_snapshot_gtid_executed_status_local failed with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2196164'.
** Message: 12:32:48.973: Thread 3: binlog_snapshot_gtid_executed_status_local failed with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2196165'.
** Message: 12:32:48.973: Thread 4: binlog_snapshot_gtid_executed_status_local failed with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2196165'.
** Message: 12:32:48.975: Thread 1: All threads in same pos check
** Message: 12:32:48.975: Thread 2: All threads in same pos check
** Message: 12:32:48.977: Thread 4: All threads in same pos check
** Message: 12:32:48.980: Thread 3: All threads in same pos check
** Message: 12:32:48.980: Thread 1: binlog_snapshot_gtid_executed_status_local succeeded with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2196167'.
** Message: 12:32:48.980: All threads in the same position. This will be a consistent backup.
** Message: 12:32:48.980: Thread 2: binlog_snapshot_gtid_executed_status_local succeeded with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2196167'.
** Message: 12:32:48.980: All threads in the same position. This will be a consistent backup.
** Message: 12:32:48.980: Thread 4: binlog_snapshot_gtid_executed_status_local succeeded with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2196167'.
** Message: 12:32:48.980: All threads in the same position. This will be a consistent backup.
** Message: 12:32:48.980: Thread 3: binlog_snapshot_gtid_executed_status_local succeeded with gtid: '498aa664-fd29-11ec-a793-0800275ff74d:1-2196167'.
** Message: 12:32:48.980: All threads in the same position. This will be a consistent backup.
...

Remember that you will need to enable GTIDs and binary logs to get the value of binlog_snapshot_gtid_executed, which is the status variable that we use.
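If you want to verify the prerequisites and see the value yourself, something like this should work on Percona Server for MySQL 8.0.19-10 / 5.7.30-33 or newer (a sketch using the plain mysql client):

mysql -e "SHOW VARIABLES LIKE 'gtid_mode'; SHOW VARIABLES LIKE 'log_bin'"
# binlog_snapshot_gtid_executed is populated inside a consistent-snapshot transaction
mysql -e "START TRANSACTION WITH CONSISTENT SNAPSHOT; SHOW STATUS LIKE 'binlog_snapshot_gtid_executed'; ROLLBACK"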

Conclusion

With this new feature, you will be able to reduce the contention caused by mydumper’s use of FTWRL. So, starting with the next release, you will be able to use --no-locks if you don’t need any DDL locking mechanism.

Upload Ongoing MyDumper Backups to S3


If you are using MyDumper as your logical backup solution and you store your backups on S3, you need to take a local backup and then upload it to S3. But what if there is not enough space to hold the backup on the server where you are taking it? And even if there is enough disk space, you need to wait until the backup ends before starting to upload the files, making the whole process longer.

MyDumper implemented streaming backups in v0.11.3, and we have been polishing the code since then. We also implemented two ways of executing external commands:

--exec-per-thread: The worker that is getting the data from the database writes and redirects its output to the STDIN of the external command. It is similar to executing cat FILE | command for every file that is written and closed.

--exec: In this case, the worker writes to local storage, and when the file is closed, the filename is enqueued. The exec threads pop from the queue and execute the command on the filename. FILENAME is a reserved word that is replaced in the command; for instance, --exec='/usr/bin/ls -l FILENAME' will execute ls -l on every single file. The command must be an absolute path.

Both implementations have different use cases, pros, and cons. We are going to use --exec, as the current --exec-per-thread implementation doesn’t allow us to change the command dynamically with the filename, which changes on each iteration.

Execution

For this example, I created a table named myd_test.mydumper2S3 with millions of rows. You need to configure a valid AWS account, install the AWS CLI, and have a bucket.
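The one-time setup is roughly the following (a sketch, assuming the AWS CLI is already installed):

aws configure                                    # access key, secret key, and default region
aws s3 mb s3://davidducos --region us-east-1     # create the bucket used in this post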

As I stated before, there are two ways of uploading the files; the main difference is how many executions of the AWS command, or how many threads, you want to use. The stream approach uses only one process, while --exec can control the number of threads or executions with --exec-threads.

With stream

This might be the simplest way if you are familiar with piping commands. In the example, you will find the table name, the split-by-rows value, the path where the temporary files will reside, and finally the --stream option:

mydumper -T myd_test.mydumper2S3 -r 20000 \
  -o data --stream | aws s3 cp - s3://davidducos/mydumper_backup.sql --region us-east-1

In the AWS CLI command, we specify the S3 service and the cp command; the - means that it will read from STDIN, and then comes the location of the single file (s3://davidducos/mydumper_backup.sql) that is going to be uploaded.

In the log, you will see entries like this:

…
2022-11-13 21:18:09 [INFO] - Releasing FTWR lock
2022-11-13 21:18:09 [INFO] - Releasing binlog lock
2022-11-13 21:18:09 [INFO] - File data/myd_test-schema-create.sql transferred | Global: 0 MB/s
2022-11-13 21:18:09 [INFO] - File data/myd_test.mydumper2S3-schema.sql transferred | Global: 0 MB/s
2022-11-13 21:18:09 [INFO] - Thread 1 dumping data for `myd_test`.`mydumper2S3`  WHERE `id` IS NULL OR `id` = 1 OR( 1 < `id` AND `id` <= 2001)       into data/myd_test.mydumper2S3.00000.sql| Remaining jobs: -3
………
2022-11-13 21:18:10 [INFO] - Thread 4 dumping data for `myd_test`.`mydumper2S3`  WHERE ( 1740198 < `id` AND `id` <= 1760198)       into data/myd_test.mydumper2S3.00009.sql| Remaining jobs: 0
2022-11-13 21:18:10 [INFO] - File data/myd_test.mydumper2S3.00002.sql transferred | Global: 27 MB/s
2022-11-13 21:18:10 [INFO] - Thread 1 dumping data for `myd_test`.`mydumper2S3`  WHERE ( 2283598 < `id` AND `id` <= 2303598)       into data/myd_test.mydumper2S3.00003.sql| Remaining jobs: 0
………
2022-11-13 21:18:10 [INFO] - Thread 3 dumping data for `myd_test`.`mydumper2S3`  WHERE ( 2424197 < `id` AND `id` <= 2424797)       into data/myd_test.mydumper2S3.00007.sql| Remaining jobs: 1
2022-11-13 21:18:10 [INFO] - Thread 3: Table mydumper2S3 completed
2022-11-13 21:18:10 [INFO] - Thread 3 shutting down
2022-11-13 21:18:10 [INFO] - Releasing DDL lock
2022-11-13 21:18:10 [INFO] - Queue count: 0 0 0 0 0
2022-11-13 21:18:10 [INFO] - Main connection closed
2022-11-13 21:18:10 [INFO] - Finished dump at: 2022-11-13 21:18:10
2022-11-13 21:18:32 [INFO] - File data/myd_test.mydumper2S3.00009.sql transferred in 22 seconds at 0 MB/s | Global: 2 MB/s
2022-11-13 21:18:36 [INFO] - File data/myd_test.mydumper2S3.00003.sql transferred in 4 seconds at 4 MB/s | Global: 2 MB/s
2022-11-13 21:18:39 [INFO] - File data/myd_test.mydumper2S3.00001.sql transferred in 2 seconds at 9 MB/s | Global: 2 MB/s
2022-11-13 21:18:41 [INFO] - File data/myd_test.mydumper2S3.00007.sql transferred in 1 seconds at 4 MB/s | Global: 3 MB/s
2022-11-13 21:18:41 [INFO] - File data/myd_test.mydumper2S3-metadata transferred | Global: 3 MB/s
2022-11-13 21:18:41 [INFO] - File data/metadata transferred | Global: 3 MB/s
2022-11-13 21:18:41 [INFO] - All data transferred was 104055843 at a rate of 3 MB/s

As you can see from the log, the files are streamed as soon as they are closed. However, it took more than 30 seconds after the dump finished for all the files to be streamed. Finally, the command returned a couple of seconds after the “All data transferred…” entry, as the buffer needs to flush the data and upload it to S3.

With --exec

If you need to upload every single file individually, this is the option that you should use. For instance, you can use --load-data or the --csv option directly to allow another process to consume the files.

Let’s see the example:

mydumper -T myd_test.mydumper2S3 -o data -v 3 \
  --exec="/usr/bin/aws s3 cp FILENAME s3://davidducos/mydumper_backup/ --region us-east-1" --exec-threads=8

In this case, the AWS CLI will send the status of the files being uploaded to STDERR:

upload: data/myd_test-schema-create.sql to s3://davidducos/mydumper_backup/myd_test-schema-create.sql
upload: data/myd_test.mydumper2S3-schema.sql to s3://davidducos/mydumper_backup/myd_test.mydumper2S3-schema.sql
upload: data/myd_test.mydumper2S3.00042.sql to s3://davidducos/mydumper_backup/myd_test.mydumper2S3.00042.sql
upload: data/myd_test.mydumper2S3.00010.sql to s3://davidducos/mydumper_backup/myd_test.mydumper2S3.00010.sql
upload: data/myd_test.mydumper2S3.00026.sql to s3://davidducos/mydumper_backup/myd_test.mydumper2S3.00026.sql
upload: data/myd_test.mydumper2S3-metadata to s3://davidducos/mydumper_backup/myd_test.mydumper2S3-metadata
upload: data/metadata to s3://davidducos/mydumper_backup/metadata
upload: data/myd_test.mydumper2S3.00006.sql to s3://davidducos/mydumper_backup/myd_test.mydumper2S3.00006.sql
upload: data/myd_test.mydumper2S3.00000.sql to s3://davidducos/mydumper_backup/myd_test.mydumper2S3.00000.sql
upload: data/myd_test.mydumper2S3.00004.sql to s3://davidducos/mydumper_backup/myd_test.mydumper2S3.00004.sql
upload: data/myd_test.mydumper2S3.00005.sql to s3://davidducos/mydumper_backup/myd_test.mydumper2S3.00005.sql
upload: data/myd_test.mydumper2S3.00001.sql to s3://davidducos/mydumper_backup/myd_test.mydumper2S3.00001.sql
upload: data/myd_test.mydumper2S3.00002.sql to s3://davidducos/mydumper_backup/myd_test.mydumper2S3.00002.sql
upload: data/myd_test.mydumper2S3.00003.sql to s3://davidducos/mydumper_backup/myd_test.mydumper2S3.00003.sql

And the log will be the traditional mydumper log.

Conclusion

This is an example with S3, but it is also possible to use the same approach with other vendors. If you need encryption, just pipe to your encryption command and pipe again to AWS or any other command. I didn’t use ZSTD compression, which is another option that you should explore.
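For example, adding compression and symmetric encryption to the stream variant could look like this (a sketch; zstd and gpg are stand-ins for whatever compression and encryption tooling you prefer, and the passphrase file path is a placeholder):

mydumper -T myd_test.mydumper2S3 -o data --stream \
  | zstd -c \
  | gpg --symmetric --batch --passphrase-file /path/to/passphrase \
  | aws s3 cp - s3://davidducos/mydumper_backup.sql.zst.gpg --region us-east-1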

Masquerade Your Backups To Build QA/Testing Environments With MyDumper


For a long time, MyDumper has been the fastest tool for taking logical backups. We have been adding several features to expand the use cases. Masquerade was one of these features, but it only covered integer and UUID values. In this blog post, I’m going to present a new functionality that is available in MyDumper and will be included in the next release: the possibility to build random data based on a format that the user defines.

How does it work?

During export, mydumper sends SELECT statements to the database. Each row is written, one by one, as an INSERT statement. Something important that you might not know is that each column of a row can be transformed by a function. When you execute a backup, the default is the identity function, as nothing needs to be changed. The function, which can be configured inside the defaults file, changes the content of the column before writing the row to disk.

How can we select the column to masquerade?

I think the most valuable element of this feature is the simplicity of defining which column will be modified and how you want to mask it. The format is:

[`schema_name`.`table_name`]
`column1`=random_int
`column2`=random_string

In the section name, you add the schema and table name, each surrounded by backticks and separated by a dot. Then, in each key-value entry, the key is the column name surrounded by backticks, and the value is the masking function definition.

New random format function

Having string, integer, and UUID masking is nice, but what about building dynamic data with a specific format? As we want more realistic data, we want to dynamically build worldwide addresses, phone numbers, emails, etc. The new function has this syntax:

random_format { <{file|string n|number n}> | DELIMITER | 'CONSTANT' }*

These are some examples:

`phone`=random_format '+1 ('<number 3>') '<number 3>'-'<number 4>
`emails`=random_format <file names.txt>'.'<file surnames.txt>'@'<file domains.txt>
`addresses`=random_format <number 3>' '<file streets.txt>', '<file cities.txt>', '<file states_and_zip.txt>', USA'
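Putting the pieces together, a defaults file for the sysbench table used in the benchmarks below could be built like this (a sketch that just combines the snippets above; connection options would typically go in the usual [mydumper] group of the same file):

cat > mydumper.cnf <<'EOF'
[`test`.`sbtest1`]
`k`=random_int
`pad`=random_format <number 9>-<number 9>-<number 9>-<number 9>-<number 9>
EOF

./mydumper -o data -B test --defaults-file=mydumper.cnf -r 100000 -c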

Performance considerations

You should expect performance degradation when comparing masqueraded backups with regular backups. It is impossible to measure the impact in general, as it depends on the amount of data that needs to be masked. However, I tried to give you an idea through an example over a 10M-row sysbench table.

Baseline backup

We are going to split by rows and compress with ZSTD:

# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper.cnf -r 100000 -c
real 0m19.964s
user 0m48.396s
sys 0m7.885s

It took nearly 19.9 seconds to complete, and here is an example of the output:

# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","67847967377-48000963322-62604785301-91415491898-96926520291")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","23183251411-36241541236-31706421314-92007079971-60663066966")

One integer column

We are going to use random_int over the k column, which in the configuration will be:

[`test`.`sbtest1`]
`k`=random_int

The backup took 20.7 seconds, an increase of 4%:

# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-k.cnf -r 100000 -c
real 0m20.709s
user 0m46.056s
sys 0m11.247s

And as you can see, the data in the second column has changed:

# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,1527173,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","67847967377-48000963322-62604785301-91415491898-96926520291")
,(2,3875126,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","23183251411-36241541236-31706421314-92007079971-60663066966")

random_format with <number 11>

Now, we are going to use the last column (pad) and the number tag with 11 digits to simulate the values:

`pad`=random_format <number 11>-<number 11>-<number 11>-<number 11>-<number 11>

We can see that it took 36.6 seconds to complete, and the values in the last column have changed:

# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-pad-long.cnf -r 100000 -c
real 0m36.667s
user 1m3.785s
sys 0m32.757s
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","32720009027-12540600353-41008809903-18811191622-46944507919")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","14761241271-79422723442-42242331639-12424460062-25625932261")

Take into consideration that 11 digits forced us to execute g_random_int twice; this means that if we have:

`pad`=random_format <number 9>-<number 9>-<number 9>-<number 9>-<number 9>

It will take 29 seconds.

random_format with <file> using a 100-line file

In this case, the configuration will be:

`pad`=random_format <file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>

And it will take 34 seconds:

# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-simple-pad.cnf -r 100000 -c
real 0m34.224s
user 0m56.702s
sys 0m29.474s
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","aam-abacot-abalienated-abandonedly-ab")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","aardwolves-abaised-abandoners-aaronitic-abacterial")

Warning

This is not a fully tested feature in MyDumper; you should consider it Beta. However, I found it relevant to show the potential it might have for the community.

Conclusion

It has never been as easy to build a new masqueraded environment as it is now with MyDumper.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open-source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!

Backup and Restore with MyDumper on Docker


At the end of 2021, I pushed the first Docker image to hub.docker.com. This was the first official image, and since then, we have been improving our testing and packaging procedures based on Docker, CircleCI, and GitHub Actions. However, when I’m coding, I’m not testing in Docker. But a couple of weeks ago, while reviewing an issue, I realized there were some interesting Docker use cases that I want to share.

Common use case

First, we are going to review how to take a simple backup with MyDumper to warm you up:

docker run --name mydumper \
     --rm \
     -v ${backups}:/backups \
     mydumper/mydumper:v0.14.4-7 \
     sh -c "rm -rf /backups/data; \
          mydumper -h 172.17.0.5 \
               -o /backups/data \
               -B test \
               -v 3 \
               -r 1000 \
               -L /backups/mydumper.log"

You will find the backup files and the log in ${backups}. Then you can restore the backup using:

docker run --name mydumper \
     --rm \
     -v ${backups}:/backups \
     mydumper/mydumper:v0.14.4-7 \
     sh -c "myloader -h 172.17.0.4 \
               -d /backups/data \
               -B test \
               -v 3 \
               -o \
               -L /backups/myloader.log"

And if you want to do it faster, you can do it all at once:

docker run --name mydumper \
     --rm \
     -v ${backups}:/backups \
     mydumper/mydumper:v0.14.4-7 \
     sh -c "rm -rf /backups/data; \
          mydumper -h 172.17.0.5 \
               -o /backups/data \
               -B test \
               -v 3 \
               -r 1000 \
               -L /backups/mydumper.log ; \
          myloader -h 172.17.0.4 \
               -d /backups/data \
               -B test \
               -v 3 \
               -o \
               -L /backups/myloader.log"

We can remove the option to mount a volume (-v ${backups}:/backups), as the data will reside inside the container.

Advanced use case

Since version 0.14.4-7, I have created the Docker image with ZSTD instead of GZIP because it is faster. Other options that are always useful are --rows/-r and --chunk-filesize/-F. In the latest releases, you can pass ‘100:1000:0’ to -r, which means:

  • 100 as the minimal chunk size
  • 1000 will be the starting point
  • 0 means that there won’t be a maximum limit

And in this case, where we want small files to be sent to myloader as soon as possible, and because we don’t care about the number of files either, -F will be set to 1.

In the next use case, we are going to stream the backup from mydumper to myloader through stdout, streaming the content without sharing the backup directory:

docker run --name mydumper \
     --rm \
     -v ${backups}:/backups \
     mydumper/mydumper:v0.14.4-7 \
     sh -c "rm -rf /backups/data; \
          mydumper -h 172.17.0.5 \
               -o /backups/data \
               -B test \
               -v 3 \
               -r 100:1000:0 \
               -L /backups/mydumper.log \
               -F 1 \
               --stream \
               -c \
        | myloader -h 172.17.0.4 \
               -d /backups/data_tmp \
               -B test \
               -v 3 \
               -o \
               -L /backups/myloader.log \
               --stream"

In this case, backup files will be created in /backups/data, sent through the pipe, and stored in /backups/data_tmp until myloader imports each backup file; then it will remove it.

To optimize this procedure, we can now share the backup directory and set --stream to NO_STREAM_AND_NO_DELETE, which is not going to stream the content of the file but only the filename, and it will not delete the file, as we want it to be shared with myloader:

docker run --name mydumper \
     --rm \
     -v ${backups}:/backups \
     mydumper/mydumper:v0.14.4-7 \
     sh -c "rm -rf /backups/data; \
          mydumper -h 172.17.0.5 \
               -o /backups/data \
               -B test \
               -v 3 \
               -r 100:1000:0 \
               -L /backups/mydumper.log \
               -F 1 \
               --stream=NO_STREAM_AND_NO_DELETE \
               -c \
        | myloader -h 172.17.0.4 \
               -d /backups/data \
               -B test \
               -v 3 \
               -o \
               -L /backups/myloader.log \
               --stream"

As you can see, the directory is the same. Myloader will delete the files after importing them, but if you want to keep the backup files, you should use --stream=NO_DELETE.
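For example, keeping the backup files after import only changes the myloader end of the previous pipeline (a sketch reusing the hosts and paths from the example above, assuming the option is given to myloader since it is myloader that deletes the files):

myloader -h 172.17.0.4 -d /backups/data -B test -v 3 -o \
     -L /backups/myloader.log --stream=NO_DELETE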

The performance gain will vary depending on the database size and number of tables. This can also be combined with another MyDumper feature, masquerading your backups, which allows you to build safer QA/testing environments.

Conclusion

MyDumper, which has already proven to be the fastest logical backup solution, now offers a simple and powerful way to migrate data in a dockerized environment.



Did MyDumper LIKE Triggers?

Yes, but now it likes them more, and here is why. Using the LIKE clause to filter triggers or views from a specific table is common. However, it can play a trick on you, especially if you don’t get to see the output (i.e., in a non-interactive session). Let’s take a look at a simple example […]

Understanding trx-consistency-only on MyDumper Before Removal

I have been working on MyDumper for over three years now, and I usually don’t use the trx-consistency-only feature during backups because it wasn’t an option I quite understood. So, when reviewing another issue, I stepped into a curious scenario, and I finally got it and decided to share with you what I learned and when it should […]