Say NO to SQL: Part 3

In parts 1 and 2, we saw how relational databases require us to break data up and store it in little pieces. In part 1 specifically, we split the data across different tables. In this part, we will see how to bring those different pieces of data back together from the different tables, and what kind of problems this merging causes.

Recall

In part 1 of this series, we ended up with two tables, which look like the following.

Contact serial number | Own name     | Surname
1                     | Sanjaya      | Chowdhury
2                     | Purushottama | Kulkarni

Contact serial number | Phone type | Phone number
1                     | Personal   | 9422515373
1                     | Work       | 9758085846
2                     | Personal   | 9169362975
2                     | Backup     | 9707818584

The rule that relational databases may store only data which cannot be broken up further forced us to create the above two tables. The first table contains a list of contact names and assigns each one an ID. The second table contains the list of phone numbers corresponding to each contact. This ensures that each table serves exactly one purpose, as per relational database rules.

Bringing the data together

It is possible to create a relational table which brings the data from the two tables together. To find corresponding records, the procedure matches data from columns which are common to both tables. In our case, the ‘contact serial number’ acts as the glue that holds the records in the two tables together. The serial numbers are matched and an entire row of information is built. This procedure is called ‘JOIN’ing tables. Here is how the joined table looks.

Contact serial number | Own name     | Surname   | Phone type | Phone number
1                     | Sanjaya      | Chowdhury | Personal   | 9422515373
1                     | Sanjaya      | Chowdhury | Work       | 9758085846
2                     | Purushottama | Kulkarni  | Personal   | 9169362975
2                     | Purushottama | Kulkarni  | Backup     | 9707818584
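
If you would like to see this in code, here is a minimal sketch of the same JOIN using Python's built-in sqlite3 module; the table and column names are just illustrative choices for this example, not anything prescribed earlier in the series.

```python
import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (serial INTEGER, own_name TEXT, surname TEXT)")
conn.execute("CREATE TABLE phones (serial INTEGER, phone_type TEXT, phone_number TEXT)")

conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)", [
    (1, "Sanjaya", "Chowdhury"),
    (2, "Purushottama", "Kulkarni"),
])
conn.executemany("INSERT INTO phones VALUES (?, ?, ?)", [
    (1, "Personal", "9422515373"),
    (1, "Work", "9758085846"),
    (2, "Personal", "9169362975"),
    (2, "Backup", "9707818584"),
])

# The JOIN matches rows on the common 'serial' column and builds wider rows.
rows = conn.execute("""
    SELECT c.serial, c.own_name, c.surname, p.phone_type, p.phone_number
    FROM contacts c
    JOIN phones p ON p.serial = c.serial
""").fetchall()

for row in rows:
    print(row)
# Each contact's name comes back once per phone number -- the repetition
# discussed below.
```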

What just happened!

You can see the problem with ‘join’ing straight away. The contact serial number was used as the common ground between the two tables, and it appears more than once in the second table. As a result, the name data repeats in the joined table, and the number of repetitions is dictated by the table in which the same value of the common column appears multiple times. But there’s worse!

The problem of combinatorics

Let’s say that we are joining two tables, where table 1 has 3 rows with the same value in the common field and table 2 has 2 rows with that same value. The number of resulting rows is the product: 3 x 2. We will see 6 rows of nearly identical data, one for every possible combination of the matching rows. Let’s see an example (with a small code sketch after the joined table below).

Table 1: Contains a list of customer orders

Customer ID | Order date         | Item                | Quantity
C00001      | 1st December 2016  | Farm Fresh Pizza    | 2
C00001      | 7th December 2016  | Garlic bread sticks | 5
C00001      | 15th December 2016 | Choco Lava cake     | 2

Table 2: Contains a list of all possible customer delivery addresses

Customer ID | Delivery address
C00001      | Somewhere in Ghatkopar, Mumbai
C00001      | Somewhere in Vashi, Mumbai

Joined:

Customer ID | Order date         | Item                | Quantity | Delivery address
C00001      | 1st December 2016  | Farm Fresh Pizza    | 2        | Somewhere in Ghatkopar, Mumbai
C00001      | 1st December 2016  | Farm Fresh Pizza    | 2        | Somewhere in Vashi, Mumbai
C00001      | 7th December 2016  | Garlic bread sticks | 5        | Somewhere in Ghatkopar, Mumbai
C00001      | 7th December 2016  | Garlic bread sticks | 5        | Somewhere in Vashi, Mumbai
C00001      | 15th December 2016 | Choco Lava cake     | 2        | Somewhere in Ghatkopar, Mumbai
C00001      | 15th December 2016 | Choco Lava cake     | 2        | Somewhere in Vashi, Mumbai
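
Conceptually, the database is doing something like the following sketch: every matching order row is paired with every matching address row, so 3 orders x 2 addresses gives 6 combined rows. This is plain Python with made-up variable names, purely to illustrate the blow-up.

```python
# Three order rows and two address rows for the same customer ID.
orders = [
    ("C00001", "1st December 2016", "Farm Fresh Pizza", 2),
    ("C00001", "7th December 2016", "Garlic bread sticks", 5),
    ("C00001", "15th December 2016", "Choco Lava cake", 2),
]
addresses = [
    ("C00001", "Somewhere in Ghatkopar, Mumbai"),
    ("C00001", "Somewhere in Vashi, Mumbai"),
]

# A join on customer ID pairs every matching order with every matching address.
joined = [
    order + (address[1],)
    for order in orders
    for address in addresses
    if order[0] == address[0]
]

print(len(joined))  # 6 rows: 3 orders x 2 addresses
for row in joined:
    print(row)
```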

Well, all that an innocent data analyst wanted was to generate a report page listing all of a customer's orders along with all the possible delivery addresses where the customer wants orders delivered. However, the joined table, which reads like a lesson in mathematical combinations, makes no sense!

Also note that this confused set of records was for ONLY ONE customer, C00001. Imagine what would happen with data spanning 1,000 customers!

Problem of computation and memory

Joining is an intensive process: it matches rows from multiple tables and builds new, bigger rows, which consumes a lot of processor time and memory. Many reports that depend on joining tables are set up to be generated overnight or over the weekend. Real-time reporting is nearly impossible unless you have a supercomputer.

Problem of fragmented solutions

We saw how the joining process gets confused when both tables have the common data appearing multiple times. Because of this, programmers often avoid joining tables altogether and instead query the database multiple times in succession, each time extracting a different piece of data from a different table. Whenever we query a database, each query is held in a queue until the database is ready to get to it. Querying the database multiple times for fragments of data is tedious and can leave programs parked in the queue for too long, causing delays. We are better off extracting larger chunks of data with fewer queries, whenever possible, to avoid standing in the queue all day.
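
As a rough, illustrative sketch of that query-per-fragment style, here is the same contacts example again using Python's sqlite3 with the made-up table names from the earlier sketch: one query for the contacts, then one extra round trip per contact for its phone numbers.

```python
import sqlite3

# Same illustrative tables as in the earlier sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (serial INTEGER, own_name TEXT, surname TEXT)")
conn.execute("CREATE TABLE phones (serial INTEGER, phone_type TEXT, phone_number TEXT)")
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)",
                 [(1, "Sanjaya", "Chowdhury"), (2, "Purushottama", "Kulkarni")])
conn.executemany("INSERT INTO phones VALUES (?, ?, ?)",
                 [(1, "Personal", "9422515373"), (1, "Work", "9758085846"),
                  (2, "Personal", "9169362975"), (2, "Backup", "9707818584")])

# Query-per-fragment style: one query for contacts, then one query per contact
# for its phone numbers. Each round trip waits its turn in the database's queue.
phone_book = {}
for serial, own_name, surname in conn.execute(
        "SELECT serial, own_name, surname FROM contacts"):
    phones = conn.execute(
        "SELECT phone_type, phone_number FROM phones WHERE serial = ?",
        (serial,)).fetchall()
    phone_book[f"{own_name} {surname}"] = phones

print(phone_book)
```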

The elegance of NoSQL

In part 1, we already saw how tight and neat the non-relational layout is. Here is the non-relational version of the contacts database.

Serial number | Name                  | Phones
1             | Sanjaya Chowdhury     | Personal: 9422515373, Work: 9758085846
2             | Purushottama Kulkarni | Personal: 9169362975, Backup: 9707818584
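
As a sketch, here is roughly what those records look like as nested documents, written as plain Python dictionaries with illustrative field names; a document store such as MongoDB would hold essentially the same structure as JSON.

```python
# The contacts collection as nested documents (plain Python dicts here;
# a document store would hold essentially the same structure as JSON).
contacts = [
    {
        "serial_number": 1,
        "name": "Sanjaya Chowdhury",
        "phones": {"Personal": "9422515373", "Work": "9758085846"},
    },
    {
        "serial_number": 2,
        "name": "Purushottama Kulkarni",
        "phones": {"Personal": "9169362975", "Backup": "9707818584"},
    },
]

# Everything about a contact comes back in one piece -- no join needed.
print(contacts[0]["phones"]["Work"])  # 9758085846
```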

Now let us look at how we solve the problem of customer orders and delivery addresses.

Customer ID: C00001

Orders:
(
  Date: 1st December 2016: (Item: Farm Fresh Pizza, Quantity: 2),
  Date: 7th December 2016: (Item: Garlic Bread sticks, Quantity: 5),
  Date: 15th December 2016: (Item: Choco Lava cake, Quantity: 2)
)

Addresses:
(
  Somewhere in Ghatkopar, Mumbai,
  Somewhere in Vashi, Mumbai
)

Neither is data repeated, nor do we need to query separately for separate fragments of information.
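
Here is the same idea for the customer orders example, again as an illustrative Python dictionary: one lookup returns the whole customer record, with orders and addresses nested inside it.

```python
# One document per customer, with orders and addresses nested inside.
customers = {
    "C00001": {
        "orders": [
            {"date": "1st December 2016", "item": "Farm Fresh Pizza", "quantity": 2},
            {"date": "7th December 2016", "item": "Garlic Bread sticks", "quantity": 5},
            {"date": "15th December 2016", "item": "Choco Lava cake", "quantity": 2},
        ],
        "addresses": [
            "Somewhere in Ghatkopar, Mumbai",
            "Somewhere in Vashi, Mumbai",
        ],
    },
}

# One lookup brings back everything the report needs: no joins, no repetition,
# no extra queries for separate fragments.
record = customers["C00001"]
for order in record["orders"]:
    print(order["date"], order["item"], order["quantity"])
print(record["addresses"])
```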

Conclusion

We saw how joins are useful in bringing fragments of data together in relational tables, but also how they cause data repetition and confusing results. We also saw how the flexibility of non-relational databases allows us to arrange information in new ways by nesting data within data, and how this avoids confusing results and repetitive querying.

 
