
Planet MySQL
Planet MySQL - https://planet.mysql.com

  • LSM math - size of search space for LSM tree configuration
    I have written before and will write again about using 3-tuples to explain the shape of an LSM tree. This makes it easier to explain the configurations supported today and the configurations we might want to support tomorrow, in addition to traditional tiered and leveled compaction.

    The summary is that an LSM tree has n levels labeled from L1 to Ln, and Lmax is another name for Ln. There is one 3-tuple per level and the components of the 3-tuple are (type, fanout, runs) for Lk (level k), where:
      • type is Tiered or Leveled and explains compaction into that level
      • fanout is the size of a sorted run in Lk relative to a sorted run from Lk-1, a real and >= 1
      • runs is the number of sorted runs in that level, an integer and >= 1

    Given the above, how many valid configurations exist for an LSM tree? There are additional constraints that can be imposed on the 3-tuple, but I will ignore most of them except for limiting fanout and runs to be <= 20. The answer is easy - there are an infinite number of configurations because fanout is a real. The question is more interesting when fanout is limited to an integer and the number of levels is limited to between 1 and 10. I am doing this to explain the size of the search space, but I don't think that fanout should be limited to an integer.

    There are approximately 2^11 configurations when only compaction type is considered, because type has 2 values, there are 2^N configurations of compaction types for a tree with N levels, and the sum 2^1 + 2^2 + ... + 2^9 + 2^10 = 2^11 - 2. But when type, fanout and runs are all considered there are 2 x 20 x 20 = 800 choices per level and 800^N combinations for an LSM tree with N levels. Considering LSM trees with 1 to 10 levels, the number of valid configurations is the sum 800^1 + 800^2 + ... + 800^9 + 800^10. That is a large number of configurations to cover if exhaustive search were used to find the best configuration. Note that I don't think exhaustive search should be used.
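    To give a sense of the scale of that search space (my arithmetic, not from the original post), the largest term dominates the sum:

        800^10 = 8^10 x 100^10 ~= 1.07 x 10^29
        800^1 + 800^2 + ... + 800^10 ~= 800^10 x (800/799) ~= 1.08 x 10^29

    So even with these coarse limits there are roughly 10^29 candidate shapes, which is why exhaustive search is off the table.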

  • Replicating data into Clickhouse
    Clickhouse is a relatively new analytics and data warehouse engine that provides very quick insertion and analysis of data. Like most analytics platforms it's built on column-oriented storage, and unlike many alternatives it is completely open source. It's also exceedingly fast, even on relatively modest platforms.

    Clickhouse does have some differences from other environments. For example, data that has been inserted cannot easily be updated, and it supports a number of different storage and table engine formats that are used to store and index the information. So how do we get data into it from our MySQL transactional store? Well, you can do dumps and loads, or you could use Tungsten Replicator to do that for you. The techniques I'm going to describe here are not in an active release, but they use the same principles as other parts of our data loading.

    We're going to use the CSV-based batch loading system that is employed by our Hadoop, Vertica and Amazon Redshift appliers to get the data in. Ordinarily we would run a materialization step that merges and updates the data from the staging tables, which import the raw change information, and turns it into 'base' or carbon-copy tables. We can't do that with Clickhouse, as the data cannot be modified once imported, but we can still use the information that gets imported.

    If you are familiar with the way we load data in this method, you will know that we import information using a CSV file, and each row of the file is either an INSERT or a DELETE, with an UPDATE operation being simulated by a DELETE followed by an INSERT. All rows are also tagged with date, time and transaction ID information, so we can always identify the latest update.

    Finally, one other thing to note about the Clickhouse environment is that the data types are defined slightly differently. In most databases we are familiar with INT, or LONG, or VARCHAR. Within Clickhouse, the datatypes you use for table fields more closely match the types in C, so Int32 or Int64. That means creating a simple table uses a definition like this:

        CREATE TABLE sales.stage_xxx_msg (
            tungsten_opcode String,
            tungsten_seqno Int32,
            tungsten_row_id Int32,
            tungsten_commit_timestamp String,
            id Int32,
            msg String
        ) ENGINE = Log;

    You can also see that we don't have a timestamp datatype, or CHAR/VARCHAR, just String. With all that in mind, let's try loading some data into Clickhouse using Tungsten Replicator! First, a basic MySQL extraction recipe:

        tools/tpm configure alpha \
            --disable-relay-logs=true \
            --enable-heterogeneous-service=true \
            --install-directory=/opt/continuent \
            --master=ubuntuheterosrc \
            --mysql-allow-intensive-checks=true \
            --replication-password=Tamsin \
            --replication-user=root \
            --skip-validation-check=MySQLMyISAMCheck

    We're going to use a fairly standard replicator install, extracting from a basic MySQL 5.7 server and inserting the change data into Clickhouse. For the Clickhouse side, we'll use the batch applier with a different, custom, template:

        tools/tpm configure alpha \
            --batch-enabled=true \
            --batch-load-template=clickhouse \
            --datasource-mysql-conf=/dev/null \
            --datasource-type=file \
            --install-directory=/opt/continuent \
            --master=ubuntuheterosrc \
            --members=clickhouse2 \
            --property=replicator.datasource.global.csvType=vertica \
            --replication-password=password \
            --replication-port=8123 \
            --replication-user=tungsten \
            --skip-validation-check=InstallerMasterSlaveCheck \
            --start-and-report=true

    That's it!
    We make one other change from other installations: because we cannot update information in Clickhouse, rather than using Clickhouse to store the Replicator status information we use the File datasource type, which stores that information in a file on the local filesystem.

    To exercise this I'll generate about 18,000 transactions of data, a mixture of INSERT, DELETE and UPDATE operations, and we'll load this into MySQL in tandem across 20 threads. Let's run the load and check Clickhouse:

        clickhouse2 :) select * from stage_xxx_msg limit 10;

        SELECT *
        FROM stage_xxx_msg
        LIMIT 10

        ┌─tungsten_opcode─┬─tungsten_seqno─┬─tungsten_row_id─┬─tungsten_commit_timestamp─┬─id─┬─msg──────────────────┐
        │ I               │             15 │               1 │ 2018-12-12 09:48:17.000   │  9 │ 4qwciTQiKdSrZKCwflf1 │
        │ I               │             16 │               2 │ 2018-12-12 09:48:17.000   │ 10 │ Qorw8T10xLwt7R0h7PsD │
        │ I               │             17 │               3 │ 2018-12-12 09:48:17.000   │ 11 │ hx2QIasJGShory3Xv907 │
        │ I               │             19 │               1 │ 2018-12-12 09:48:17.000   │ 12 │ oMxnT7RhLWpvQSGYtE6V │
        │ I               │             20 │               2 │ 2018-12-12 09:48:17.000   │ 13 │ fEuDvFWyanb1bV9Hq8iM │
        │ I               │             23 │               1 │ 2018-12-12 09:48:17.000   │ 14 │ oLVGsNjMPfWcxnRMkpKI │
        │ I               │             25 │               2 │ 2018-12-12 09:48:17.000   │ 15 │ w3rYUrzxXjb3o9iTHtnS │
        │ I               │             27 │               3 │ 2018-12-12 09:48:17.000   │ 16 │ aDFjRpTOK6ruj3JaX2Na │
        │ I               │             30 │               4 │ 2018-12-12 09:48:17.000   │ 17 │ SXDxPemQ5YI33iT1MVoZ │
        │ I               │             32 │               5 │ 2018-12-12 09:48:17.000   │ 18 │ 8Ta8C0fjIMRYEfVZBZjE │
        └─────────────────┴────────────────┴─────────────────┴───────────────────────────┴────┴──────────────────────┘

        10 rows in set. Elapsed: 0.005 sec.

    Analysing the overall times, I processed 358,980 transactions through MySQL and into Clickhouse using relatively modest virtual machines on my laptop, and it took 538 seconds. That's about 670 transactions a second. Bear in mind we're committing every 100 rows here; larger commit intervals would probably be quicker overall. This is using the default settings, and I know from past testing and imports that I can go much faster. I'd count that as a success!

    Bear in mind we're also writing to separate databases and tables here, but with the adddbname filter and the modified applier we can insert all of that data into a single table, so if you are concentrating data into a single database/table combination you can do that in one step with Tungsten Replicator. As I said before, Clickhouse is not currently a supported target for the Replicator, but if you are interested please get in touch!
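    Although the staging rows can't be updated in place, the opcode, sequence number and row ID columns are enough to reconstruct the current state at query time. The following is only a sketch of that idea, assuming the stage_xxx_msg table shown above: Clickhouse's argMax aggregate picks the value associated with the highest tungsten_seqno per id, and ids whose latest change was a DELETE are filtered out. Depending on your workload you may also need tungsten_row_id to order the DELETE/INSERT pair that makes up an UPDATE within a single transaction.

        SELECT
            id,
            argMax(msg, tungsten_seqno) AS msg
        FROM sales.stage_xxx_msg
        GROUP BY id
        HAVING argMax(tungsten_opcode, tungsten_seqno) = 'I';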

  • Some Notes on MariaDB system-versioned Tables
    As mentioned in a previous post, I gave a talk at Percona Live Europe 2018 about system-versioned tables. This is a new MariaDB 10.3 feature, which consists of preserving old versions of a table's rows. Each version has two timestamps that indicate the start (INSERT, UPDATE) of the validity of that version, and its end (DELETE, UPDATE). As a result, the user is able to query these tables as they appeared at a point in the past, or to see how data evolved in a certain time range. An alternative name for this feature is temporal table, and I will use it in the rest of this text.

    In this post, I want to talk a bit about temporal table best practices. Some of the information that I will provide is not present in the documentation; while it is based on my experience and tests, there could be errors. My suggestions for good practices are also based on my experience and opinions, and I don't consider them universal truths. If you have different opinions, I hope that you will share them in the comments or as a separate blog post.

    Create temporal columns

    It is possible - but optional - to create the columns that contain the timestamps of rows. Since there is no special term for them, I call them temporal columns. MariaDB allows us to give them any name we like, so I like to use the names valid_from and valid_to, which seem to be some sort of de facto standard in data warehousing. Whichever names you decide to use, I advise you to use them for all your temporal columns and for nothing else, so that the meaning will be clear.

    Temporal columns are generated columns, meaning that their values are generated by MariaDB and cannot be modified by the user. They are also invisible columns, which means that they can only be read by mentioning them explicitly. In other words, the following query will not return those columns:

        SELECT * FROM temporal_table;

    Also, that query will only show current versions of the rows. In this way, if we make a table temporal, existing applications and queries will continue to work as before. But we can still read old versions and obtain timestamps with a query like this:

        SELECT *, valid_from, valid_to
            FROM temporal_table FOR SYSTEM_TIME ALL
            WHERE valid_from < NOW() - INTERVAL 1 MONTH;

    If we don't create these columns, we will not be able to read the timestamps of current and old row versions. We will still be able to read data from a point in time or from a time range by using some special syntax. However, I believe that using the consolidated WHERE syntax is easier and more expressive than syntactic sugar.
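    To make this concrete, here is a minimal sketch of a table created with explicitly named temporal columns, using the documented MariaDB 10.3 syntax; the table name and the msg column are purely illustrative:

        CREATE TABLE temporal_table (
            id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
          , msg VARCHAR(32)
          , valid_from TIMESTAMP(6) GENERATED ALWAYS AS ROW START
          , valid_to   TIMESTAMP(6) GENERATED ALWAYS AS ROW END
          , PERIOD FOR SYSTEM_TIME(valid_from, valid_to)
        ) WITH SYSTEM VERSIONING;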
    Primary keys

    For performance reasons, InnoDB tables should always have a primary key, and normally it shouldn't be updated. Temporal tables provide another reason to follow this golden rule - even on storage engines that are not organised by primary key, like MyISAM. The reason is easy to demonstrate with an example:

        SELECT id, valid_from, valid_to FROM t FOR SYSTEM_TIME ALL WHERE id IN (500, 501);
        +-----+----------------------------+----------------------------+
        | id  | valid_from                 | valid_to                   |
        +-----+----------------------------+----------------------------+
        | 500 | 2018-12-09 12:22:45.000001 | 2018-12-09 12:23:03.000001 |
        | 501 | 2018-12-09 12:23:03.000001 | 2038-01-19 03:14:07.999999 |
        +-----+----------------------------+----------------------------+

    What do these results mean? Maybe row 500 has been deleted and row 501 has been added. Or maybe row 500 has been modified, and its id became 501. The timestamps suggest that the latter hypothesis is more likely, but there is no way to know for sure. That is why, in my opinion, we need to be able to assume that UPDATEs never touch primary key values.

    Indexes

    Currently, the documentation says nothing about how temporal columns are indexed. However, my conclusion is that the valid_to column is appended to UNIQUE indexes and to the primary key. My opinion is based on the results of some EXPLAIN commands, like the following:

        EXPLAIN SELECT email, valid_to FROM customer ORDER BY email \G
        *************************** 1. row ***************************
                   id: 1
          select_type: SIMPLE
                table: customer
                 type: index
        possible_keys: NULL
                  key: unq_email
              key_len: 59
                  ref: NULL
                 rows: 4
                Extra: Using where; Using index

    This means that the query only reads from a UNIQUE index, and not from table data - therefore, the index contains the email column. It is also able to use the index for sorting, which confirms that email is the first column (as expected). In this way, UNIQUE indexes don't prevent the same value from appearing multiple times, but the duplicates will always be valid at different points in time. It can be a good idea to include valid_to or valid_from in some regular indexes, to optimize queries that use such columns for filtering results.
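    For example, a possible index for time-based filtering on the customer table used above (the index name is illustrative, and whether it pays off depends on your queries):

        ALTER TABLE customer ADD INDEX idx_valid_from (valid_from);

        -- rows whose version started in the last month
        SELECT * FROM customer FOR SYSTEM_TIME ALL
        WHERE valid_from >= NOW() - INTERVAL 1 MONTH;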
    Transaction-safe temporal tables

    Temporal columns contain timestamps that indicate when a row was INSERTed, UPDATEd, or DELETEd. So, when autocommit is not enabled, temporal columns don't match the COMMIT time. For most use cases this behaviour is desirable, or at least acceptable. But there are cases when we want to see only committed data, to avoid data inconsistencies that were never seen by applications. To do so, we can create a transaction-precise temporal table. This only works with InnoDB - not with RocksDB or TokuDB, even though they support transactions.

    A transaction-precise temporal table doesn't contain timestamps; instead, it contains the IDs of the transactions that created and deleted each row version. If you know PostgreSQL, you are probably familiar with the xmin and xmax columns - it's basically the same idea, except that in Postgres at some point autovacuum will make old row versions disappear. Because of the similarity, for transaction-precise temporal tables I like to call the temporal columns xmin and xmax.

    From this short description, the astute reader may already see a couple of problems with this approach:
      • Temporal tables are based on transaction IDs or on timestamps, not both. There is no way to run a transaction-precise query to extract data that were present one hour ago. But think about it: even if it were possible, it would be at least problematic, because transactions are meant to be concurrent.
      • Transaction IDs are written in the binary log, but that information is typically only accessible to DBAs. An analyst (someone who's typically interested in temporal tables) has no access to transaction IDs.

    A partial workaround would be to query tables with columns like created_at and modified_at. We can run queries like this:

        SELECT created_at, xmin FROM some_table WHERE created_at >= '2018-05-05 16:00:00' ORDER BY created_at LIMIT 1;

    This will return the timestamp of the first row created since '2018-05-05 16:00:00', as well as the ID of the transaction which inserted it. While this approach could give us the information we need with reasonable extra work, it's possible that we don't have such columns, or that rows are not inserted often enough in the tables that have them.

    In this case, we can occasionally write the current timestamp and the current transaction ID into a dedicated table. This should allow us to associate a transaction with the timestamp we are interested in. We cannot write all transaction IDs for performance reasons, so we can use two different approaches:
      • Write the transaction ID and the timestamp periodically, for example every minute. This will not create performance problems; on the other hand, we are arbitrarily deciding the granularity of our "log", which could be acceptable or not.
      • Write this information when certain events happen, for example when a product is purchased, or when a user changes their password. This will give us a very precise way to see the data as they appeared during critical events, but will not allow us to investigate other types of events with the same precision.
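    A sketch of the periodic approach, assuming a transaction-precise temporal table as described above (all names are illustrative, and this is only one way to implement the idea): if the marker table is itself transaction-precise versioned, MariaDB records the ID of the inserting transaction for us, so each marker row pairs a wall-clock timestamp with a transaction ID.

        -- hypothetical marker table; xmin/xmax follow the naming used above
        CREATE TABLE txn_marker (
            id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
          , ts   TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6)
          , xmin BIGINT UNSIGNED GENERATED ALWAYS AS ROW START
          , xmax BIGINT UNSIGNED GENERATED ALWAYS AS ROW END
          , PERIOD FOR SYSTEM_TIME(xmin, xmax)
        ) WITH SYSTEM VERSIONING;

        -- run periodically from a cron job, an event, or an application hook
        INSERT INTO txn_marker () VALUES ();

        -- later: find the transaction ID closest to the time we care about
        SELECT ts, xmin FROM txn_marker
            WHERE ts >= '2018-05-05 16:00:00' ORDER BY ts LIMIT 1;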
    Partitioning

    If we look at older implementations of temporal tables, in the world of proprietary databases (Db2, SQL Server, Oracle), they generally store historical data in a separate physical table or partition, sometimes called a history table. In MariaDB this doesn't happen automatically or by default, leaving the choice to the user. However, it seems to me a good idea in the general case to create one or more partitions to store historical rows. The main reason is that a query rarely has to read both historical and current data, and reading only one partition is an interesting optimization.

    Excluding columns from versioning

    MariaDB allows us to exclude some columns from versioning. This means that if we update the values of those columns, we update the current row version in place rather than creating a new one. This is probably useful if a column is frequently updated and we don't care about those changes. However, if we update more columns with one statement, and only a subset of them is excluded from versioning, a new row version is still created. All in all, the partial exclusion of some columns could be more confusing than useful in several cases.

    Replication

    10.3 is a stable version, but it is still recent. Some of us adopt a new major version only after some years, and we can even have reasons to stick with an old version. Furthermore, of course, many of us use MySQL, and MariaDB is not a drop-in replacement. But we can still enjoy temporal tables by adding a MariaDB 10.3 slave. I attached such a slave to older MariaDB versions, and to MySQL 5.6. In all tests, the feature behaved as expected.

    Initially, I was worried about replication lag. I assumed that, if replication lags, the slave applies the changes with a delay, and the timestamps in the tables are delayed accordingly. I am glad to say that I was wrong: the timestamps in temporal tables seem to match the ones in the binary log, so replication lag doesn't affect their correctness. This is true both with row-based replication and with statement-based replication.

    A small caveat about temporal tables is that the version timestamps are only precise to the second. The fractional part should be ignored. You may have noticed this in the example at the beginning of this post.

    Backups

    For backups you will need to use mariabackup instead of xtrabackup. mysqldump can be used, not necessarily from a MariaDB distribution. However, it treats temporal tables as regular tables: it does not back up historical data. This is a consequence of a design choice - we cannot insert rows with timestamps in the past - which makes temporal tables much more reliable. Also, temporal tables are likely to be (or become) quite big, so a dump is probably not the best way to back them up.

    The post Some Notes on MariaDB system-versioned Tables appeared first on Percona Community Blog.

  • Shinguz: To NULL, or not to NULL, that is the question!
    As we already stated in earlier articles in this blog [1 and 2], it is a good idea to use NULL values properly in MariaDB and MySQL. One of my mantras in MariaDB performance tuning is: smaller tables lead to faster queries! One consequence of this is to store NULL values instead of dummy values in columns whose value is not known (NULL: undefined/unknown). To show how this affects the space used by a table, we created a little example:

        CREATE TABLE big_null1 (
            id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
          , c01 VARCHAR(32) NOT NULL
          , c02 VARCHAR(32) NOT NULL
          , c03 VARCHAR(32) NOT NULL
          , c04 VARCHAR(32) NOT NULL
          , c05 VARCHAR(32) NOT NULL
          , c06 VARCHAR(32) NOT NULL
          , c07 VARCHAR(32) NOT NULL
          , c08 VARCHAR(32) NOT NULL
          , c09 VARCHAR(32) NOT NULL
          , c10 VARCHAR(32) NOT NULL
          , c11 VARCHAR(32) NOT NULL
          , c12 VARCHAR(32) NOT NULL
          , INDEX (c03)
          , INDEX (c06)
          , INDEX (c09)
        );

        CREATE TABLE big_null2 (
            id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
          , c01 VARCHAR(32) NOT NULL
          , c02 VARCHAR(32) NOT NULL
          , c03 VARCHAR(32) NOT NULL
          , c04 VARCHAR(32) NOT NULL
          , c05 VARCHAR(32) NOT NULL
          , c06 VARCHAR(32) NOT NULL
          , c07 VARCHAR(32) NOT NULL
          , c08 VARCHAR(32) NOT NULL
          , c09 VARCHAR(32) NOT NULL
          , c10 VARCHAR(32) NOT NULL
          , c11 VARCHAR(32) NOT NULL
          , c12 VARCHAR(32) NOT NULL
          , INDEX (c03)
          , INDEX (c06)
          , INDEX (c09)
        );

    Now we fill the tables with default values (an empty string or a dummy value) because we do not yet know the contents:

        INSERT INTO big_null1 VALUES (NULL, '', '', '', '', '', '', '', '', '', '', '', '');
        INSERT INTO big_null1 SELECT NULL, '', '', '', '', '', '', '', '', '', '', '', '' FROM big_null1;
        ... up to 1 Mio rows

        INSERT INTO big_null2 VALUES (NULL, 'Some dummy value.', 'Some dummy value.', 'Some dummy value.', 'Some dummy value.'
          , 'Some dummy value.', 'Some dummy value.', 'Some dummy value.', 'Some dummy value.', 'Some dummy value.'
          , 'Some dummy value.', 'Some dummy value.', 'Some dummy value.');
        INSERT INTO big_null2 SELECT NULL, 'Some dummy value.', 'Some dummy value.', 'Some dummy value.', 'Some dummy value.'
          , 'Some dummy value.', 'Some dummy value.', 'Some dummy value.', 'Some dummy value.', 'Some dummy value.'
          , 'Some dummy value.', 'Some dummy value.', 'Some dummy value.' FROM big_null2;
        ...
        up to 1 Mio rows

        ANALYZE TABLE big_null1;
        ANALYZE TABLE big_null2;

        SELECT table_name, table_rows, avg_row_length, data_length, index_length, data_free
          FROM information_schema.tables
         WHERE table_name IN ('big_null1', 'big_null2')
         ORDER BY table_name;
        +------------+------------+----------------+-------------+--------------+-----------+
        | table_name | table_rows | avg_row_length | data_length | index_length | data_free |
        +------------+------------+----------------+-------------+--------------+-----------+
        | big_null1  |    1046760 |             37 |    39387136 |     36225024 |   4194304 |
        | big_null2  |    1031990 |            264 |   273416192 |     89899008 |   6291456 |
        +------------+------------+----------------+-------------+--------------+-----------+

    The opposite example is a table which allows NULL values for unknown fields:

        CREATE TABLE big_null3 (
            id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
          , c01 VARCHAR(32) NULL
          , c02 VARCHAR(32) NULL
          , c03 VARCHAR(32) NULL
          , c04 VARCHAR(32) NULL
          , c05 VARCHAR(32) NULL
          , c06 VARCHAR(32) NULL
          , c07 VARCHAR(32) NULL
          , c08 VARCHAR(32) NULL
          , c09 VARCHAR(32) NULL
          , c10 VARCHAR(32) NULL
          , c11 VARCHAR(32) NULL
          , c12 VARCHAR(32) NULL
          , INDEX (c03)
          , INDEX (c06)
          , INDEX (c09)
        );

    This table is also filled with unknown values, but this time with NULL instead of an empty string:

        INSERT INTO big_null3 (id) VALUES (NULL);
        INSERT INTO big_null3 (id) SELECT NULL FROM big_null3;
        ... up to 1 Mio rows

        ANALYZE TABLE big_null3;

        SELECT table_name, table_rows, avg_row_length, data_length, index_length, data_free
          FROM information_schema.tables
         WHERE table_name IN ('big_null1', 'big_null2', 'big_null3')
         ORDER BY table_name;
        +------------+------------+----------------+-------------+--------------+-----------+
        | table_name | table_rows | avg_row_length | data_length | index_length | data_free |
        +------------+------------+----------------+-------------+--------------+-----------+
        | big_null1  |    1046760 |             37 |    39387136 |     36225024 |   4194304 |
        | big_null2  |    1031990 |            264 |   273416192 |     89899008 |   6291456 |
        | big_null3  |    1047800 |             26 |    27852800 |     36225024 |   7340032 |
        +------------+------------+----------------+-------------+--------------+-----------+

    We see that this table already uses much less space when we make correct use of NULL values... So let us do some simple query run time tests:

                                                     big_null1   big_null2   big_null3
        SELECT * FROM big_nullx                      1.1 s       1.3 s       0.9 s
        SELECT * FROM big_nullx AS t1
          JOIN big_nullx AS t2 ON t2.id = t1.id
          JOIN big_nullx AS t3 ON t1.id = t3.id      5.0 s       5.7 s       4.2 s

    One of my pieces of advice is to fill the columns with NULL values if possible. So let us try this advice as well:

        CREATE TABLE big_null4 (
            id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
          , c01 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , c02 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , c03 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , c04 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , c05 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , c06 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , c07 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , c08 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , c09 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , c10 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , c11 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , c12 VARCHAR(32) NULL DEFAULT 'Some dummy value here...!'
          , INDEX (c03)
          , INDEX (c06)
          , INDEX (c09)
        );

        INSERT INTO big_null4 (id) VALUES (NULL);
        INSERT INTO big_null4 (id) SELECT NULL FROM big_null4;
        ...
        up to 1 Mio rows

        ANALYZE TABLE big_null4;

        SELECT table_name, table_rows, avg_row_length, data_length, index_length, data_free
          FROM information_schema.tables
         WHERE table_name IN ('big_null1', 'big_null2', 'big_null3', 'big_null4')
         ORDER BY table_name;
        +------------+------------+----------------+-------------+--------------+-----------+
        | table_name | table_rows | avg_row_length | data_length | index_length | data_free |
        +------------+------------+----------------+-------------+--------------+-----------+
        | big_null1  |    1046760 |             37 |    39387136 |     36225024 |   4194304 |
        | big_null2  |    1031990 |            264 |   273416192 |     89899008 |   6291456 |
        | big_null3  |    1047800 |             26 |    27852800 |     36225024 |   7340032 |
        | big_null4  |     998533 |            383 |   382599168 |    118358016 |   6291456 |
        +------------+------------+----------------+-------------+--------------+-----------+

    So, following my advice, we fill the columns with NULL values:

        UPDATE big_null4
           SET c01 = NULL, c02 = NULL, c03 = NULL, c04 = NULL, c05 = NULL, c06 = NULL
             , c07 = NULL, c08 = NULL, c09 = NULL, c10 = NULL, c11 = NULL, c12 = NULL;

        ANALYZE TABLE big_null4;

        SELECT table_name, table_rows, avg_row_length, data_length, index_length, data_free
          FROM information_schema.tables
         WHERE table_name IN ('big_null1', 'big_null2', 'big_null3', 'big_null4')
         ORDER BY table_name;
        +------------+------------+----------------+-------------+--------------+-----------+
        | table_name | table_rows | avg_row_length | data_length | index_length | data_free |
        +------------+------------+----------------+-------------+--------------+-----------+
        | big_null1  |    1046760 |             37 |    39387136 |     36225024 |   4194304 |
        | big_null2  |    1031990 |            264 |   273416192 |     89899008 |   6291456 |
        | big_null3  |    1047800 |             26 |    27852800 |     36225024 |   7340032 |
        | big_null4  |    1047285 |            364 |   381779968 |    126222336 |  33554432 |
        +------------+------------+----------------+-------------+--------------+-----------+

    It seems like we do not see the effect yet. So let us optimize the table to reclaim the space:

        OPTIMIZE TABLE big_null4;

        SELECT table_name, table_rows, avg_row_length, data_length, index_length, data_free
          FROM information_schema.tables
         WHERE table_name IN ('big_null1', 'big_null2', 'big_null3', 'big_null4')
         ORDER BY table_name;
        +------------+------------+----------------+-------------+--------------+-----------+
        | table_name | table_rows | avg_row_length | data_length | index_length | data_free |
        +------------+------------+----------------+-------------+--------------+-----------+
        | big_null1  |    1046760 |             37 |    39387136 |     36225024 |   4194304 |
        | big_null2  |    1031990 |            264 |   273416192 |     89899008 |   6291456 |
        | big_null3  |    1047800 |             26 |    27852800 |     36225024 |   7340032 |
        | big_null4  |    1047180 |             30 |    32030720 |     39370752 |   4194304 |
        +------------+------------+----------------+-------------+--------------+-----------+

    And there you see that we get much of the space back... NULL is a good thing!
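    If you want to apply this to an existing table, a possible approach (sketched here with the example table and dummy string from above; substitute your own placeholder values) is to first check how many rows actually carry a real value, then allow NULL, convert the placeholders, and reclaim the space:

        -- how many rows carry a real value in c01, versus a placeholder?
        SELECT COUNT(*) AS total_rows
             , SUM(c01 NOT IN ('', 'Some dummy value.')) AS real_values
          FROM big_null2;

        -- if (almost) everything is a placeholder: allow NULL, convert, reclaim
        ALTER TABLE big_null2 MODIFY c01 VARCHAR(32) NULL;
        UPDATE big_null2 SET c01 = NULL WHERE c01 IN ('', 'Some dummy value.');
        OPTIMIZE TABLE big_null2;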

  • LSM math - how many levels minimizes write amplification?
    How do you configure an LSM tree with leveled compaction to minimize write amplification? For a given number of levels write-amp is minimal when the same fanout (growth factor) is used between all levels, but that does not explain the number of levels to use. In this post I answer that question.
      1. The number of levels that minimizes write-amp is one of ceil(ln(T)) or floor(ln(T)), where T is the total fanout -- sizeof(database) / sizeof(memtable).
      2. When #1 is done, the per-level fanout is e when the number of levels is ln(T), and a value close to e when the number of levels is an integer.

    Introduction

    I don't recall reading this result elsewhere, but I am happy to update this post with a link to such a result. I was encouraged to answer this after a discussion with the RocksDB team and thank Siying Dong for stating #2 above while leaving the math to me. I assume the original LSM paper didn't address this problem because that system used a fixed number of levels.

    One result from the original LSM paper, updated by me, is that write-amp is minimized when the per-level growth factor is constant. Sometimes I use fanout or per-level fanout rather than per-level growth factor. In RocksDB the option name is max_bytes_for_level_multiplier. Yes, this can be confusing. The default fanout in RocksDB is 10.

    Math

    I solve this for pure-leveled compaction, which differs from what RocksDB calls leveled. In pure-leveled, all levels use leveled compaction. In RocksDB leveled, the first level, L0, uses tiered and the other levels use leveled. I started to explain this here, where I claim that RocksDB leveled is really tiered+leveled. But I am not asking for them to change the name.

    Assumptions:
      • the LSM tree uses pure-leveled compaction, and compaction from memtable flushes into the first level of the LSM tree uses leveled compaction
      • total fanout is T and is size(Lmax) / size(memtable), where Lmax is the max level of the LSM tree
      • the workload is update-only, so the number of keys in the database is fixed
      • the workload has no write skew and all keys are equally likely to be updated
      • per-level write-amp == per-level growth factor; in practice and in theory the per-level write-amp tends to be less than the per-level growth factor
      • total write-amp is the sum of the per-level write-amp; I ignore write-amp from the WAL

    Specify the function for write-amp and determine critical points

        # wa is the total write-amp
        # n is the number of levels
        # per-level fanout is the nth root of the total fanout
        # per-level fanout = per-level write-amp
        # therefore wa = number of levels * per-level fanout
        wa = n * t^(1/n)

        # given the function for write-amp as wa = a * b
        # ... then below is a' * b + a * b'
        a = n, b = t^(1/n)
        wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2)

        # which simplifies to
        wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)

        # the critical point occurs when wa' = 0
        t^(1/n) - (1/n) * ln(t) * t^(1/n) = 0
        t^(1/n) = (1/n) * ln(t) * t^(1/n)
        1 = (1/n) * ln(t)
        n = ln(t)

    When t = 1024 then n = ln(1024) ~= 6.93. In this case write-amp is minimized when 7 levels are used, although 6 isn't a bad choice. Assuming the cost function is convex (see below), the critical point is the minimum for write-amp. However, n must be an integer, so the number of levels that minimizes write-amp is one of ceil(ln(t)) or floor(ln(t)). The graph for wa when t=1024 can be viewed thanks to Desmos.
    The function looks convex and I show below that it is.

    Determine whether the critical point is a min or max

    The critical point found above is a minimum for wa if wa is convex, so we must show that the second derivative is positive.

        wa  = n * t^(1/n)
        wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)
        wa' = t^(1/n) * (1 - (1/n) * ln(t))

        # assuming wa' is a * b then wa'' is a' * b + a * b'
        a  = t^(1/n)
        a' = ln(t) * t^(1/n) * -1 * (1/n^2)
        a' = - ln(t) * t^(1/n) * (1/n^2)
        b  = 1 - (1/n) * ln(t)
        b' = (1/n^2) * ln(t)

        # a' * b
        - ln(t) * t^(1/n) * (1/n^2)          --> called x below
        + ln(t) * ln(t) * (1/n^3) * t^(1/n)  --> called y below

        # b' * a
        t^(1/n) * (1/n^2) * ln(t)            --> called z below

        # therefore wa'' = x + y + z
        # note that x, y and z all contain: t^(1/n), 1/n and ln(t)
        wa'' = t^(1/n) * (1/n) * ln(t) * (-(1/n) + (ln(t) * 1/n^2) + (1/n))
        wa'' = t^(1/n) * (1/n) * ln(t) * (ln(t) * 1/n^2)
        wa'' = t^(1/n) * 1/n^3 * ln(t)^2

    Therefore wa'' is positive, wa is convex and the critical point is a minimum value for wa.

    Solve for the per-level fanout

    The next step is to determine the value of the per-level fanout when write-amp is minimized. If the number of levels doesn't have to be an integer then this occurs when ln(t) levels are used, and below I show that the per-level fanout is e in that case. When the number of levels is limited to an integer then the per-level fanout that minimizes write-amp is a value close to e.

        # total write-amp is number of levels * per-level fanout
        wa = n * t^(1/n)
        # the per-level fanout is t^(1/n) and wa is minimized when n = ln(t)
        # therefore we show that t^(1/n) = e when n = ln(t)
        assume t^(1 / ln(t)) = e
        ln(t^(1 / ln(t))) = ln e
        (1 / ln(t)) * ln(t) = 1
        1 = 1

    When t = 1024 then ln(t) ~= 6.93. With 7 levels the per-level fanout is t^(1/7) ~= 2.69 while e ~= 2.72.
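    As a quick sanity check of that result (my arithmetic, using the t = 1024 example above), evaluating wa = n * t^(1/n) for a few integer level counts shows the minimum at n = 7:

        n = 5:  wa = 5 * 1024^(1/5)  = 5 * 4.00  = 20.0
        n = 6:  wa = 6 * 1024^(1/6) ~= 6 * 3.17 ~= 19.0
        n = 7:  wa = 7 * 1024^(1/7) ~= 7 * 2.69 ~= 18.8
        n = 8:  wa = 8 * 1024^(1/8) ~= 8 * 2.38 ~= 19.0
        n = 9:  wa = 9 * 1024^(1/9) ~= 9 * 2.16 ~= 19.4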
