Practical Tips to Reduce SQL Server Database Table Size

Boosting productivity while working with database tables

In this article, I'd like to elaborate on boosting productivity when working with database tables. Chances are you already know much of this from various resources on database development.

However, the topic becomes a front-burner issue when data grows continuously: tables become so large that performance suffers.

This usually happens because the database schema was not designed to handle large volumes of data in the first place.

To avoid performance loss under continuous data growth, you should stick to certain rules when designing a database schema.

Rule # 1 - Minimum Redundancy of Data Types

The fundamental unit of SQL Server data storage is the page. The disk space intended for a data file in a database is logically divided into pages numbered contiguously from 0 to n. In SQL Server, the page size is 8 KB. This means SQL Server databases have 128 pages per megabyte.

Disk I/O operations are performed at the page level; that is, SQL Server reads or writes whole data pages. The more compact the data types used, the fewer pages are required to store the data, and as a result, the fewer I/O operations are needed.

The buffer pool significantly improves I/O throughput. Its primary purpose is to reduce database file I/O and to improve the response time of data retrieval.

Thus, when compact data types are used, the buffer pool holds a larger amount of data in the same number of pages. As a result, you do not waste memory, and you reduce the number of logical operations.
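You can observe this directly by checking how many buffer pool pages each database currently occupies. The following query is a sketch of my own, not part of the original example, and requires the VIEW SERVER STATE permission:

SELECT
database_name = DB_NAME(database_id)
, buffered_pages = COUNT(*)
, buffered_mb = CAST(COUNT(*) / 128. AS DECIMAL(10,2)) -- 128 pages per MB
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY buffered_pages DESC

A database whose tables use compact data types should show proportionally fewer buffered pages for the same logical data.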

Consider the following example - a table that stores working days of employees.

CREATE TABLE dbo.WorkOut1 (
DateOut DATETIME
, EmployeeID BIGINT
, WorkShiftCD NVARCHAR(10)
, WorkHours DECIMAL(24,2)
, CONSTRAINT PK_WorkOut1 PRIMARY KEY (DateOut, EmployeeID)
)

Are the selected data types appropriate? Most probably not. It is unlikely that an enterprise has 2^63-1 employees, so BIGINT is an unsuitable data type for EmployeeID.

We can remove this redundancy and estimate the query execution time.

CREATE TABLE dbo.WorkOut2 (
DateOut SMALLDATETIME
, EmployeeID INT
, WorkShiftCD VARCHAR(10)
, WorkHours DECIMAL(8,2)
, CONSTRAINT PK_WorkOut2 PRIMARY KEY (DateOut, EmployeeID)
)
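To see where the savings come from, we can compare the per-column storage sizes directly. The snippet below is illustrative (the sample values are arbitrary); the byte counts in the comments follow from the fixed storage sizes of these types:

SELECT
datetime_bytes = DATALENGTH(CAST(GETDATE() AS DATETIME)) -- 8 bytes
, smalldatetime_bytes = DATALENGTH(CAST(GETDATE() AS SMALLDATETIME)) -- 4 bytes
, bigint_bytes = DATALENGTH(CAST(1 AS BIGINT)) -- 8 bytes
, int_bytes = DATALENGTH(CAST(1 AS INT)) -- 4 bytes
, nvarchar_bytes = DATALENGTH(CAST(N'DAY1' AS NVARCHAR(10))) -- 2 bytes per character
, varchar_bytes = DATALENGTH(CAST('DAY1' AS VARCHAR(10))) -- 1 byte per character
, decimal_24_2_bytes = DATALENGTH(CAST(8.5 AS DECIMAL(24,2))) -- 13 bytes
, decimal_8_2_bytes = DATALENGTH(CAST(8.5 AS DECIMAL(8,2))) -- 5 bytes

The fixed-length columns alone shrink from 29 bytes (8 + 8 + 13) to 13 bytes (4 + 4 + 5) per row, before even counting the halved character data.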

The following execution plan demonstrates the cost difference, which depends on the row size and the expected number of rows.

The less data you need to retrieve, the faster the query runs.

(3492294 row(s) affected)

SQL Server Execution Times:
CPU time = 1919 ms, elapsed time = 33606 ms.

(3492294 row(s) affected)

SQL Server Execution Times:
CPU time = 1420 ms, elapsed time = 29694 ms.

As you can see, using non-redundant data types is a keystone of query performance, and it also reduces the size of problem tables. By the way, you can measure a table's size with the following query:

SELECT
table_name = SCHEMA_NAME(o.[schema_id]) + '.' + o.name
, data_size_mb = CAST(do.pages * 8. / 1024 AS DECIMAL(8,4))
FROM sys.objects o
JOIN (
SELECT
p.[object_id]
, total_rows = SUM(p.[rows])
, total_pages = SUM(a.total_pages)
, used_pages = SUM(a.used_pages)
, pages = SUM(
CASE
WHEN it.internal_type IN (202, 204, 207, 211, 212, 213, 214, 215, 216, 221, 222) THEN 0
WHEN a.[type] != 1 AND p.index_id < 2 THEN a.used_pages
WHEN p.index_id < 2 THEN a.data_pages ELSE 0
END
)
FROM sys.partitions p
JOIN sys.allocation_units a ON p.[partition_id] = a.container_id
LEFT JOIN sys.internal_tables it ON p.[object_id] = it.[object_id]
GROUP BY p.[object_id]
) do ON o.[object_id] = do.[object_id]
WHERE o.[type] = 'U'

For the tables considered above, the query returns the following results:

table_name           data_size_mb
-------------------- ------------
dbo.WorkOut1         167.2578
dbo.WorkOut2         97.1250

Rule # 2 - Use Database Normalization and Avoid Data Duplication

Recently, I analyzed the database of a free web service that formats T-SQL code. The server part there is quite simple and consists of a single table:

CREATE TABLE dbo.format_history (
session_id BIGINT
, format_date DATETIME
, format_options XML
)

Every time SQL code was formatted, the following parameters were saved to the database: the current session ID, the server time, and the settings applied to the user's SQL code.

This data was subsequently used to determine the most popular formatting styles, which were planned for inclusion in the default formatting styles of SQL Complete.

However, the rise in the service's popularity led to a significant increase in table rows, and processing the profiles became slow. The settings had the following XML structure:

<FormatProfile>
  <FormatOptions>
    <PropertyValue Name="Select_SelectList_IndentColumnList">true</PropertyValue>
    <PropertyValue Name="Select_SelectList_SingleLineColumns">false</PropertyValue>
    <PropertyValue Name="Select_SelectList_StackColumns">true</PropertyValue>
    <PropertyValue Name="Select_SelectList_StackColumnsMode">1</PropertyValue>
    <PropertyValue Name="Select_Into_LineBreakBeforeInto">true</PropertyValue>
    ...
    <PropertyValue Name="UnionExceptIntersect_LineBreakBeforeUnion">true</PropertyValue>
    <PropertyValue Name="UnionExceptIntersect_LineBreakAfterUnion">true</PropertyValue>
    <PropertyValue Name="UnionExceptIntersect_IndentKeyword">true</PropertyValue>
    <PropertyValue Name="UnionExceptIntersect_IndentSubquery">false</PropertyValue>
    ...
  </FormatOptions>
</FormatProfile>

There are 450 formatting options in total, and each row takes about 33 KB in the table. The daily data growth exceeded 100 MB. As an obvious outcome, the database was expanding day by day, making data analysis ever more complicated.

The solution turned out to be quite simple: all unique profiles were placed in a separate table, with a hash computed for every set of options. Starting with SQL Server 2008, you can use the sys.fn_repl_hash_binary function for this.

So the DB schema has been normalized:

CREATE TABLE dbo.format_profile (
format_hash BINARY(16) PRIMARY KEY
, format_profile XML NOT NULL
)
CREATE TABLE dbo.format_history (
session_id BIGINT
, format_date SMALLDATETIME
, format_hash BINARY(16) NOT NULL
, CONSTRAINT PK_format_history PRIMARY KEY CLUSTERED (session_id, format_date)
)
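For illustration, here is a minimal sketch of how a profile could be saved under the new schema. The variable names and the IF NOT EXISTS deduplication logic are my assumptions, not code from the service itself:

DECLARE @session_id BIGINT = 1
DECLARE @profile XML = N'<FormatProfile><FormatOptions /></FormatProfile>'
-- sys.fn_repl_hash_binary hashes the binary representation of the profile
DECLARE @hash BINARY(16) = sys.fn_repl_hash_binary(CAST(@profile AS VARBINARY(MAX)))

-- the profile body is stored only once per unique set of options
IF NOT EXISTS (SELECT 1 FROM dbo.format_profile WHERE format_hash = @hash)
INSERT INTO dbo.format_profile (format_hash, format_profile)
VALUES (@hash, @profile)

-- the history row references the profile by its compact 16-byte hash
INSERT INTO dbo.format_history (session_id, format_date, format_hash)
VALUES (@session_id, GETDATE(), @hash)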

Whereas the previous query looked like this:

SELECT fh.session_id, fh.format_date, fh.format_options
FROM SQLF.dbo.format_history fh

The new schema requires a JOIN to retrieve the same data:

SELECT fh.session_id, fh.format_date, fp.format_profile
FROM SQLF_v2.dbo.format_history fh
JOIN SQLF_v2.dbo.format_profile fp ON fh.format_hash = fp.format_hash

If we compare the execution times of the two queries, we can hardly see any advantage from the schema change.

(3090 row(s) affected)

SQL Server Execution Times:
CPU time = 203 ms, elapsed time = 4698 ms.

(3090 row(s) affected)

SQL Server Execution Times:
CPU time = 125 ms, elapsed time = 4479 ms.

But in this case, the goal was to reduce the analysis time. Before, we had to write an intricate query to get the list of the most popular formatting profiles:

;WITH cte AS (
SELECT
fh.format_options
, hsh = sys.fn_repl_hash_binary(CAST(fh.format_options AS VARBINARY(MAX)))
, rn = ROW_NUMBER() OVER (ORDER BY 1/0)
FROM SQLF.dbo.format_history fh
)
SELECT c2.format_options, c1.cnt
FROM (
SELECT TOP (10) hsh, rn = MIN(rn), cnt = COUNT(1)
FROM cte
GROUP BY hsh
ORDER BY cnt DESC
) c1
JOIN cte c2 ON c1.rn = c2.rn
ORDER BY c1.cnt DESC

Thanks to data normalization, we managed to simplify the query:

SELECT
fp.format_profile
, t.cnt
FROM (
SELECT TOP (10)
fh.format_hash
, cnt = COUNT(1)
FROM SQLF_v2.dbo.format_history fh
GROUP BY fh.format_hash
ORDER BY cnt DESC
) t
JOIN SQLF_v2.dbo.format_profile fp ON t.format_hash = fp.format_hash

The execution time decreased as well:

(10 row(s) affected)

SQL Server Execution Times:
CPU time = 2684 ms, elapsed time = 2774 ms.

(10 row(s) affected)

SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 379 ms.

In addition, the database size has decreased:

database_name    row_size_mb
---------------- -----------
SQLF             123.50
SQLF_v2          7.88

To retrieve the database file sizes, the following query can be used:

SELECT
database_name = DB_NAME(database_id)
, row_size_mb = CAST(SUM(CASE WHEN type_desc = 'ROWS' THEN size END) * 8. / 1024 AS DECIMAL(8,2))
FROM sys.master_files
WHERE database_id IN (DB_ID('SQLF'), DB_ID('SQLF_v2'))
GROUP BY database_id

I hope I have managed to demonstrate the importance of data normalization.

Rule # 3 - Be Careful When Selecting Indexed Columns

An index is an on-disk structure associated with a table or view that speeds up the retrieval of rows. Indexes are stored on pages, so the fewer pages required to store an index, the faster the search. It is extremely important to choose clustered index columns carefully, because all clustered index columns are included in every non-clustered index. Because of this, a database's size can increase dramatically.
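A hypothetical example (not from the original text) makes the effect visible:

-- The 16-byte UNIQUEIDENTIFIER below is a deliberately wide clustered key:
CREATE TABLE dbo.Orders (
OrderUID UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID()
, OrderDate DATETIME NOT NULL
, CustomerID INT NOT NULL
, CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderUID)
)
-- Every row of this non-clustered index silently carries the clustered key,
-- so it stores OrderDate plus 16 bytes of OrderUID, not OrderDate alone:
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate ON dbo.Orders (OrderDate)

With a 4-byte INT as the clustered key instead, every non-clustered index on the table would be 12 bytes per row smaller.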

Rule # 4 - Use Consolidated Tables

You do not need to execute a complex query against a large table when you can execute a simple query against a small one.

For instance, suppose we have the following consolidation query:

SELECT
WorkOutID
, CE = SUM(CASE WHEN WorkKeyCD = 'CE' THEN Value END)
, DE = SUM(CASE WHEN WorkKeyCD = 'DE' THEN Value END)
, RE = SUM(CASE WHEN WorkKeyCD = 'RE' THEN Value END)
, FD = SUM(CASE WHEN WorkKeyCD = 'FD' THEN Value END)
, TR = SUM(CASE WHEN WorkKeyCD = 'TR' THEN Value END)
, FF = SUM(CASE WHEN WorkKeyCD = 'FF' THEN Value END)
, PF = SUM(CASE WHEN WorkKeyCD = 'PF' THEN Value END)
, QW = SUM(CASE WHEN WorkKeyCD = 'QW' THEN Value END)
, FH = SUM(CASE WHEN WorkKeyCD = 'FH' THEN Value END)
, UH = SUM(CASE WHEN WorkKeyCD = 'UH' THEN Value END)
, NU = SUM(CASE WHEN WorkKeyCD = 'NU' THEN Value END)
, CS = SUM(CASE WHEN WorkKeyCD = 'CS' THEN Value END)
FROM dbo.WorkOutFactor
WHERE Value > 0
GROUP BY WorkOutID

If there is no need to change the table data often, we can materialize the aggregation into a separate consolidated table, as sketched below.
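A minimal sketch of how such a table might be built; the name dbo.WorkOutFactorCache matches the query below, but the SELECT ... INTO refresh approach is my assumption:

SELECT
WorkOutID
, CE = SUM(CASE WHEN WorkKeyCD = 'CE' THEN Value END)
-- ... the remaining pivoted columns exactly as in the query above ...
, CS = SUM(CASE WHEN WorkKeyCD = 'CS' THEN Value END)
INTO dbo.WorkOutFactorCache
FROM dbo.WorkOutFactor
WHERE Value > 0
GROUP BY WorkOutID

The simple query then becomes: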

SELECT *
FROM dbo.WorkOutFactorCache

Data retrieval from such a consolidated table is much faster:

(185916 row(s) affected)

SQL Server Execution Times:

CPU time = 3448 ms, elapsed time = 3116 ms.

(185916 row(s) affected)

SQL Server Execution Times:

CPU time = 1410 ms, elapsed time = 1202 ms.

Rule # 5 - Every Rule Has an Exception

I've shown a couple of examples demonstrating how eliminating redundant data types shortens query execution time. But that is not always the outcome.

For instance, the BIT data type has a peculiarity: SQL Server optimizes the on-disk storage of groups of such columns. If a table contains 8 or fewer BIT columns, they are stored in a page as 1 byte; 9 to 16 BIT columns take 2 bytes, and so on. The good news is that the table takes up little space, which reduces disk I/O.
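For example, in the following illustrative table (not from the original text), the eight BIT columns consume a single byte per row, while a ninth BIT column would start a second byte:

CREATE TABLE dbo.t_flags (
flag1 BIT, flag2 BIT, flag3 BIT, flag4 BIT
, flag5 BIT, flag6 BIT, flag7 BIT, flag8 BIT -- all eight packed into 1 byte per row
)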

The bad news is that implicit decoding takes place when the data is retrieved, and this process is very demanding in terms of CPU resources.

Here is an example. Assume we have 3 identical tables containing information about employees' work schedules (31 columns plus a 2-column primary key). The only difference between the tables is the data type of the consolidated values (1 - presence, 0 - absence):

SELECT * FROM dbo.E31_INT
SELECT * FROM dbo.E32_TINYINT
SELECT * FROM dbo.E33_BIT

When less redundant data types are used, the table size decreases considerably (especially for the last table):

table_name           data_size_mb
-------------------- ------------
dbo.E31_INT          150.2578
dbo.E32_TINYINT      50.4141
dbo.E33_BIT          24.1953

However, there is no significant speed gain from using the BIT type:

(1000000 row(s) affected)

Table 'E31_INT'. Scan count 1, logical reads 19296, physical reads 1, read-ahead reads 19260, ...

SQL Server Execution Times:

CPU time = 1607 ms,  elapsed time = 19962 ms.

(1000000 row(s) affected)

Table 'E32_TINYINT'. Scan count 1, logical reads 6471, physical reads 1, read-ahead reads 6477, ...

SQL Server Execution Times:

CPU time = 1029 ms,  elapsed time = 16533 ms.

(1000000 row(s) affected)

Table 'E33_BIT'. Scan count 1, logical reads 3109, physical reads 1, read-ahead reads 3096, ...

SQL Server Execution Times:

CPU time = 1820 ms,  elapsed time = 17121 ms.

The execution plan, however, suggests the opposite, since its cost estimates largely follow the reduced number of pages read.

So the negative effect of the decoding will not appear if a table contains fewer than 8 BIT columns. It is worth noting that the BIT data type is hardly ever used in SQL Server metadata itself; the BINARY data type is used more often, although it requires manual manipulation to obtain specific values.
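As a sketch of the kind of manual manipulation involved (the 4-byte mask and the bit positions are arbitrary), specific flags can be extracted from a BINARY value with bitwise operators:

DECLARE @mask BINARY(4) = 0x00000005
SELECT
bit0 = CASE WHEN CAST(@mask AS INT) & 1 = 1 THEN 1 ELSE 0 END -- lowest bit
, bit2 = CASE WHEN CAST(@mask AS INT) & 4 = 4 THEN 1 ELSE 0 END -- third bit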

Rule # 6 - Delete Data That Is No Longer Required

SQL Server supports a performance optimization mechanism called read-ahead. This mechanism anticipates the data and index pages needed to fulfill a query execution plan and brings those pages into the buffer cache before they are actually used by the query.

So if a table contains a lot of needless data, it can lead to unnecessary disk I/O. Moreover, getting rid of needless data reduces the number of logical operations when reading from the buffer pool.
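How exactly to purge stale rows is beyond the scope of this article, but as a minimal sketch (assuming, say, that history older than one year is no longer needed), deleting in moderate batches keeps transaction log growth and lock escalation in check:

WHILE 1 = 1
BEGIN
DELETE TOP (10000)
FROM dbo.format_history
WHERE format_date < DATEADD(YEAR, -1, GETDATE())

IF @@ROWCOUNT = 0
BREAK
END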

In conclusion, my advice is to be extremely careful when selecting data types for columns, and to try to predict future data loads.

More Stories By Jordan Sanders

Jordan Sanders is a Software Marketing Manager at Devart. He helps DBAs and software developers (C#, .NET, Delphi) from all around the globe increase their productivity through new tools, practices, and approaches to database development and management. He has consulting experience with MySQL, SQL Server, and Oracle databases, as well as a background in Delphi development, and he always tries to share his knowledge and ideas with the community.
