What do you do when your application needs to store something? Are you the type of developer or company that habitually jumps to create a database everytime you see that an application needs to store something permanently? Worse, are you the type of programmer who stores binary things like images, music, videos and ‘what not’ inside databases too?
On the other hand, are you so petrified of and averse to databases that you store everything in Excel, comma seperated or JSON files?
In this post, I put forth some rules I use to decide what goes into databases and for what type of data you should consider a different data storage.
DO use a database when…
Your data can be represented on paper in a table with rows and columns: If your data is represented by a table with rows and columns even on a piece of paper, then chances are that it can be digitised perfectly into a database. Records of daily sales of electronics is a good example of tabular data. You have a column for date, one for time, one showing the SKU of the product sold (e.g. a certain brand of TV or washing machine, etc), the price for which it was sold and a column containing the phone number of the customer who bought it. In another table, a table of phone numbers and names of customers can be recorded. Anything that can be seamlessly put into a tabular form in rows and columns without too much thought or effort is a good candidate for database storage.
You need to find ONE record from millions or billions based on a single identifier: Among millions of insurance policies purchased from your insurance company, what if you need to find exactly ONE? Based on the policy number? Along with being a table of records, a list of million policies also has a column called policy number which identifies each policy uniquely. A database is the best way to store such data, since it can find one record from millions or even billions in a matter of milliseconds. This is made possible by features such as indexing, i.e. similar to the index in a book, a database automatically makes an index of the records stored in a table, so that searching for them becomes significantly faster.
You need to find a set of records based on conditions: Imagine you are a supermarket like Walmart and you maintain a list of customers that you want to reach out to when certain products become available. This means that you want to shortlist a set of customers based on conditions. Conditions are either single values which represent a category or a range of values which fit into a category. E.g. you may want to only find ‘male’ customers between the ages 21 – 27. The first condition is a single value representing a category, i.e. gender, whereas the second condition is a range that broadly represents unmarried men at the start of their careers.
Such search conditions are made faster by using a database solution, since they offer features such as indexing that makes the search operator faster. Finding the required records even among billions of records takes only milliseconds to at most a couple of seconds.
The data will be constantly read and written to, even simultaneously: Imagine you are building a system that measures the weather around you. A set of sensors is continuously pumping records every second, measuring the temperature and humidity. Meanwhile the interface of the application is plotting a continuous graph on the screen. Data is being written to and read from at a continuous rate. By providing features such as locking which makes the writer or the reader wait until the other is done, databases provide a safe environment where we do not have to worry about clashes and outdated data.
DO NOT use a database when…
The data is not seamlessly a linear series of records: Which type of calendar would you like to own? One in which the dates are written linearly one after the other, each day of the year on one line, occupying 365 rows in a table? Or one of those cool table calendars where each page represents a month and where on each page the dates are neatly and compactly laid out in a grid, each row representing a week and columns representing the day of the week?
Throughout the 1990s and 2000s, software developers forcibly tried make all sorts of data linear, so that they fit inside the precious database solution purchased by the company. Even if a linear series of records is not the best way to store that data. Habits die hard and this practice is continuing into the 2010s, where companies forcibly store non-linear data in linear fashion in their Oracle, MySQL and Salesforce databases. Despite other types of storage solutions now available.
When data is not naturally linear, e.g. a tree, grid or a circle, it is a real fight to make it so. And it may not even be the best solution. You should look for ways of storage that fit that data more naturally and seamlessly. The next time a shiny, high-performance database solution tempts you to fit your data in a linear manner, just look at your table calendar and remind yourself that there are other ways to store your application’s data more efficiently.
The data does not run into millions of records: When there are millions of records, a database can give you a significant edge in comparison to storing the data in regular files. But that edge is lean when you have records running into just thousands. In that case, there is an overhead to using a database, because your application would need to establish a connection and authenticate before accessing the database.
Data will be read from, but not written to: A database is excellent when records are being read and written at the same time in large volumes. But a database is an overkill if your data if read-only. Using a database solution for just reads is like always keeping a pencil in hand when reading a book. The pencil is using up your hand’s grip. And it doesn’t even need to be there. Similarly, a database solution uses up resources since it needs to be ready to read and write, even if you aren’t intending to write. Read-only data can be best stored in other types of solutions such as a regular file with a good index at the start of the file describing the data’s whereabouts. This will be like a book with a detailed table of contents. Ideally the file system should revoke write access for all users from that file, so that no changes can be done even by mistake.
Within a database, do not store…
Images, sounds and videos: When these data types are stored inside a database, usually you store them in a format called a ‘blob’ or a binary large object. A blob is a random sequence of 1s and 0s. So it cannot be indexed. Indexing is one of the biggest advantages of a database, but it doesn’t work on a blob. So there is no advantage to storing images or videos in a database. Secondly, since images and videos are stored in database in pure binary, the database does not know what they are. So no preview or thumbnails can be offered. Finally, a lot of applications depend on mechanisms like compression, creating thumbnails and caching to make images load faster. All such advantages are lost when images are stored in databases. Images are better off stored in the file system where they can take advantage of fast loading methods. The records in the database can simply store paths to the corresponding images, sounds or videos.
Complex data structures with more than one value: Inefficient applications often use databases as a dumping ground for more complex data. Columns inside databases are home to comma seperated values, XML data, JSON data or entire blog posts. Ideally, the values stored in database columns should be short values such as numbers or short text pieces taking up one line. Large and complex data should reside outside the database, while the record in the database should contain a reference to that data, such as a path or a link. It is not possible to efficiently index large / complex data, thus losing the advantage of storing them in databases. So, they should be kept outside.
With so many famous, tested and proven database solutions around us, it is extremely easy to be tempted by the lure of converting and storing everything in a linear series of records in rows and columns. However, not all data in the universe is made equal or even conducive to the system of linear records. As a software engineer / company, it is our obligation to test the validity of a database as a solution and even boldly take the call to not use one when the solution is not best addressed by it.