We regularly have clients ask us how they can better handle large amounts of audience data. Pretty much all modern databases can handle millions or even tens of millions of records without issue, but you can make everything faster and more dependable by tweaking your process a little. Here are some of the best ways to optimize your database's performance when handling large datasets.
If you've noticed your queries getting sluggish, indexing should be the first thing on your list to consider. When you create an index, a copy of the indexed column's values, along with the corresponding primary keys, is written to disk. This means that if the index contains all of the data needed to answer a query, the table itself never gets hit.
Good index design can be tough, though. You don't just want to index all of your columns and call it a day; that slows writes without helping reads, and can leave you no better off than having no indexes at all. Instead, choose the columns that actually appear in the WHERE clauses of your queries. By focusing on these columns, you can significantly improve your database's performance.
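As a quick sketch, assuming a hypothetical `customers` table (the table and column names here are illustrative, not from the article) that is frequently filtered by email address:

```sql
-- Hypothetical table; names are illustrative only.
CREATE TABLE customers (
    id    BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    name  VARCHAR(100) NOT NULL
);

-- Index the column that appears in WHERE clauses...
CREATE INDEX idx_customers_email ON customers (email);

-- ...so a query like this can be answered from the index
-- (plus the primary key) without touching the table rows:
SELECT id FROM customers WHERE email = 'user@example.com';
```

Running the query under `EXPLAIN` will show whether MySQL actually uses the index.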
Indexes are not a magical solution for your database woes, however. Each index you add comes at a cost: the more indexes on a table, the slower writes to it will be. If you have a write-heavy database, be very selective about the number of indexes you create on your tables.
If you've added too many indexes and your writes have taken a performance hit, identify and remove any duplicate indexes. Duplicate indexes slow down writes unnecessarily without providing any extra boost to reads.
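A minimal way to audit this, again assuming the hypothetical `customers` table and an illustrative index name:

```sql
-- List every index on a table so duplicates stand out
SHOW INDEX FROM customers;

-- Drop a redundant duplicate (hypothetical index name)
DROP INDEX idx_customers_email_dup ON customers;
```

Two indexes covering the same leading columns in the same order are candidates for removal.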
Using the right datatype for each column is key to keeping your database humming along. Inappropriate datatypes can make your queries execute more slowly, lead to errors, and consume more resources than necessary.
Generally speaking, when creating a database schema, smaller is better. Try to choose the smallest datatype that can hold a column's values. Smaller datatypes use less space on disk, which means fewer resources consumed, fewer CPU cycles needed, and faster query execution.
Keeping your datatypes simple can also help. For instance, to store a flag on a column you could use a VARCHAR, but comparing strings is much more expensive than comparing integers, so if you are only toggling between 0 and 1, a TINYINT column would be more appropriate.
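For example, a hypothetical subscription flag (column name is illustrative) might look like:

```sql
-- Hypothetical flag column toggling between 0 and 1
ALTER TABLE customers
    ADD COLUMN is_subscribed TINYINT NOT NULL DEFAULT 0;

-- An integer comparison is much cheaper than a string comparison
SELECT id FROM customers WHERE is_subscribed = 1;
```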
If your data contains whole numbers, you should use one of MySQL's integer types, from TINYINT up to BIGINT. Primary keys should be either INT or BIGINT with the UNSIGNED attribute, which disallows negative numbers and roughly doubles the maximum value.
For real numbers, which contain a decimal or fractional part, using DOUBLE can lead to inaccuracies, as it is an approximate datatype that only supports approximate calculations. Instead, use DECIMAL, which supports exact math. DECIMAL columns are also good for storing whole numbers that are too large for BIGINT.
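A common case for exact math is money; a hypothetical payments table might use:

```sql
-- DOUBLE is approximate; DECIMAL stores exact values
CREATE TABLE payments (
    id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    amount DECIMAL(10,2) NOT NULL  -- up to 99,999,999.99, exact to the cent
);
```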
When storing strings, you will likely use VARCHAR for most columns. However, if you know the exact length of your strings, the CHAR type may be a better option. Unlike VARCHAR, whose storage fluctuates with the number of characters stored, CHAR causes MySQL to allocate space for exactly the declared number of characters. VARCHAR is suitable for most strings and can save resources when values are shorter than the allocated size, but CHAR can offer a performance boost for strings that are always the same length. For instance, a fixed-length hashed value, such as a salted and md5-hashed password, would be a great candidate for a CHAR column, since every value is exactly the same length.
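For example (table and column names are hypothetical), an md5 hex digest is always 32 characters, so CHAR fits it exactly:

```sql
-- A fixed-length hash is a natural fit for CHAR
CREATE TABLE user_auth (
    user_id       BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    password_hash CHAR(32) NOT NULL  -- md5 hex output is always 32 chars
);
```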
If you have a list of distinct string values that doesn't change, you can use the ENUM datatype. MySQL stores these values as very small internal representations, making them much more efficient than a VARCHAR column serving the same purpose. The downside of ENUM is that the values are limited to those defined on the column; you cannot add different values without first altering the table definition.
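A sketch of both the definition and the alteration it takes to add a value (names are illustrative):

```sql
-- A fixed set of values, stored internally as small integers
CREATE TABLE subscriptions (
    id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    status ENUM('active', 'paused', 'cancelled') NOT NULL
);

-- Adding a new value later requires altering the table definition
ALTER TABLE subscriptions
    MODIFY status ENUM('active', 'paused', 'cancelled', 'expired') NOT NULL;
```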
Additionally, you'll generally want to avoid LONGTEXT columns, as these can't be effectively indexed. If you have large text metadata associated with a record that needs to be accessible, moving the value to an auxiliary table and joining to it is usually the better approach.
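One way to sketch that pattern, reusing the hypothetical `customers` table from earlier:

```sql
-- Keep bulky text out of the frequently queried table...
CREATE TABLE customer_notes (
    customer_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    notes       LONGTEXT
);

-- ...and join to it only when the metadata is actually needed
SELECT c.id, c.email, n.notes
FROM customers AS c
INNER JOIN customer_notes AS n ON n.customer_id = c.id
WHERE c.id = 42;
```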
Normalizing your data entails scaling the queryable data down to just what's necessary to supply results. This could mean splitting data across multiple tables or translating longer, repeated string values into defined integer counterparts.
The goal of normalization is to reduce the amount of effort required to execute a query, touching as little data as possible to achieve a result. This can greatly reduce the errors and inconsistencies that come with table joins and temporary tables, and it can improve performance dramatically.
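A classic normalization sketch is replacing a long string repeated on millions of rows with an integer key into a lookup table (all names here are hypothetical):

```sql
-- Store each distinct value once...
CREATE TABLE cities (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE
);

-- ...and reference it by a 4-byte integer instead of
-- repeating up to 100 characters on every row
CREATE TABLE audience_members (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    city_id INT UNSIGNED NOT NULL
);
```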
No matter how solid your database schema is, the way you construct your queries will ultimately have one of the greatest impacts on performance and reliability. Here are some things to watch out for when crafting queries to access your data:
Use INNER JOIN instead of OUTER JOIN when you only need rows that match in both tables
Avoid ORDER BY unless you absolutely need a sorted result
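Putting those tips together, a lean query against the hypothetical tables from earlier might look like this; prefixing it with `EXPLAIN` shows how MySQL plans to execute it:

```sql
-- Select only the columns you need, join on indexed keys,
-- and skip ORDER BY when the result order doesn't matter
EXPLAIN
SELECT c.email
FROM customers AS c
INNER JOIN orders AS o ON o.customer_id = c.id
WHERE o.id = 1001;
```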
Writing efficient queries not only helps you retrieve accurate results quickly; it can also significantly lower infrastructure costs, improve application performance, and confirm that your schema is structured optimally.
Even if you're doing all of the above, you'd be missing a huge opportunity to improve performance if you didn't also take a look at your database server. There are a few simple things you can do to supercharge MySQL and make it work better for you.
Ensure System Resources are Available
The first things to check are your RAM, disk, CPU usage, and network. If any of these is lacking, you risk bottlenecking the entire system. A tool like iotop will help you identify whether disk I/O is the bottleneck, while a standard disk-usage check will tell you if you're running low on space. For your CPU, the Linux top command displays all processes currently running on the system and the amount of CPU each is consuming. If mysqld is consuming too much, you may need to scale up the processing power on your server.
For your RAM, use the command-line tool free -m to display the memory currently in use on the system. Make sure swap usage is at zero; otherwise, more RAM may be in your future. Finally, check your network. While networking isn't normally a big issue for databases, if you don't have the bandwidth and connectivity to handle the amount of data being sent back and forth, you can end up with dropped packets and timeouts.
Use InnoDB, Not MyISAM
MySQL allows you to choose the storage engine for each table. The main thing to remember is that MyISAM is a much older engine without the optimizations present in InnoDB, such as row-level locking and transaction support, so making sure you're using InnoDB can deliver a real performance and reliability boost over the legacy MyISAM engine.
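You can check which engine each table uses and convert stragglers (the schema name here is a placeholder):

```sql
-- See which engine each table in a schema uses
SELECT table_name, engine
FROM information_schema.tables
WHERE table_schema = 'my_database';

-- Convert a legacy MyISAM table to InnoDB
ALTER TABLE customers ENGINE = InnoDB;
```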
Use the Latest Version
While it should go without saying, using the latest version of your database offers improved performance and security, and keeps you ahead of upcoming breaking changes in the database engine. Staying up to date is an important part of your database optimization strategy.
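If you're not sure what you're running, MySQL will tell you:

```sql
-- Check the running server version
SELECT VERSION();
```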
Managing large datasets for your audiences can be challenging, but it doesn't have to be painful. Taking the proper precautions when constructing your database schema can save you headaches down the road and provide a solid foundation when creating segments for your marketing campaigns.