SYS-CON MEDIA Authors: Liz McMillan, Elizabeth White, Esmeralda Swartz, Andy Jonak, Carmen Gonzalez

Data Efficiency at Scale

Overcoming limitations in data efficiency features

The initial wave of data efficiency features for primary storage focused on silos of information organized as individual file systems. The deduplication and compression features provided by some vendors are limited by the scalability of those underlying file systems; in effect, the file systems have become silos of optimized data. For example, NetApp deduplication can't scale beyond 100 TB, because that is the size limit of its WAFL file system. But ask anyone who has used NetApp deduplication whether they've run it on a 100 TB file system, and you're likely to hear "are you crazy?" It's one thing to claim that data efficiency features can scale; it's quite another to actually use them at scale with acceptable performance.

Challenges around scalability generally center on two areas: scalability of random IO and memory overhead. Older solutions, like the one from NetApp, face the first challenge while newer flash-based storage systems are struggling with the second. I'll review both here:

The IO Challenge
Primary storage devices handle both streaming and random IO and are therefore sensitive to latency. Data efficiency features for primary storage must use fast hashing techniques to limit the latency they add. Fast hashes are non-cryptographic in nature, so using them for deduplication requires a data comparison. It works like this:

  1. When a new chunk of data is read in, it is first given a name using the hash algorithm.
  2. The system then checks a deduplication index to see if a chunk with that name has been seen before (note that this can consume disk IO and tremendous amounts of memory if done wrong).
  3. If the name has been seen, extra steps are needed. Because fast hashes are non-cryptographic, it is possible to have a name match while the data content differs. This is known in computer science as a hash collision. To account for this, the existing copy of the chunk must be read in and compared bit-by-bit to the new one. If they match, only a reference to the existing chunk is created. If not, the new chunk must be written.
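
The three steps above can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation: the class name `DedupStore`, the in-memory list standing in for disk, and the choice of CRC-32 as the fast hash are all my assumptions.

```python
import zlib


class DedupStore:
    """Toy chunk store: fast hash for naming, byte compare for safety."""

    def __init__(self, hash_fn=zlib.crc32):
        self.hash_fn = hash_fn   # fast, non-cryptographic hash (assumed: CRC-32)
        self.chunks = []         # stands in for chunk storage on disk
        self.index = {}          # chunk name -> ids of stored chunks with that name
        self.read_compares = 0   # reads spent verifying name matches

    def write(self, data: bytes) -> int:
        # Step 1: name the incoming chunk with the fast hash.
        name = self.hash_fn(data)
        # Step 2: look the name up in the deduplication index.
        for chunk_id in self.index.get(name, []):
            # Step 3: a name match is not proof, so read the stored
            # copy back and compare it with the new data.
            self.read_compares += 1
            if self.chunks[chunk_id] == data:
                return chunk_id  # true duplicate: keep only a reference
        # New name, or a hash collision: the chunk must be written.
        self.chunks.append(data)
        chunk_id = len(self.chunks) - 1
        self.index.setdefault(name, []).append(chunk_id)
        return chunk_id
```

Every duplicate write costs one read of the stored copy, which is the trade discussed next. Passing a deliberately bad `hash_fn` (for example, one that always returns 0) forces collisions and shows that differing chunks are still stored separately.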

Essentially, this form of deduplication means trading a write of a duplicate chunk for a read. Depending on the design of the underlying block virtualization layer, duplicate chunks may be widely dispersed throughout the system. In that case, the bigger the system gets, the more expensive reads get, so processing of duplicate data becomes slower and slower as the storage system fills. This is why you won't find many 100 TB NetApp file systems with deduplication turned on, and certainly not for primary storage applications: the system would be flooded with random read requests, and NetApp's deduplication process could take months or years, or never complete at all.

A number of techniques have been used to reduce the impact of IO in other products. For example, the Hitachi NAS (HNAS) and Hitachi Unified Storage (HUS) solutions from HDS use hardware acceleration to generate cryptographically secure hashes that do not require a data compare at all. This allows for linear scaling of deduplication performance on volumes up to 256 TB in size. Data is also written out before it is deduplicated, so the hash computation itself adds no latency to the write path.
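
To make the contrast with the read-compare approach concrete, here is a minimal Python sketch of deduplication that writes first and deduplicates afterwards, using a cryptographic hash as the chunk name. It is not the HDS implementation: the class and method names are hypothetical, and SHA-256 is my assumption, since the hash actually used is not stated here.

```python
import hashlib


class PostProcessDedup:
    """Toy store: land writes first, deduplicate later, no data compare."""

    def __init__(self):
        self.pending = []     # (address, data) written but not yet deduplicated
        self.blocks = {}      # block address -> digest of its content
        self.by_digest = {}   # digest -> the single stored copy of that content

    def write(self, address: int, data: bytes) -> None:
        # No hashing on the write path, so the hash adds no write latency.
        self.pending.append((address, data))

    def deduplicate(self) -> None:
        # Background pass: the digest alone decides whether a chunk is a
        # duplicate; the stored copy is never read back for comparison.
        for address, data in self.pending:
            digest = hashlib.sha256(data).digest()  # assumed hash choice
            self.by_digest.setdefault(digest, data)
            self.blocks[address] = digest
        self.pending.clear()

    def read(self, address: int) -> bytes:
        # Serve the most recent not-yet-deduplicated write first.
        for pending_address, data in reversed(self.pending):
            if pending_address == address:
                return data
        return self.by_digest[self.blocks[address]]
```

Because the digest is trusted as the chunk's identity, the cost of handling a duplicate no longer depends on where the original copy lives on disk.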

Permabit's own Albireo Virtual Data Optimizer (VDO) product, a plug-in module for Linux-based storage solutions, takes a different approach but with a similar result. VDO works inline to provide immediate data reduction. When data is written out, the VDO process intelligently lays it out in a sequential pattern, so that subsequent read compares of duplicates are more likely to be sequential as well. Both solutions do a fine job of solving the problem in real-world scenarios; they just take different approaches.

The Memory Challenge
Many of today's flash array vendors provide deduplication using fast hashing techniques similar to those I outlined above. With flash, the cost of doing random reads for read compares is a non-issue (random seeks on flash are much less expensive than on hard drives), so the use of the fast hash alone is enough to minimize latency. These systems (such as EMC's recently launched XtremIO product) are focused on delivering performance, and the big challenge to performance at scale is available memory (DRAM). As above, after chunks are read in, they are named using a fast hashing algorithm. After that, the flash system must determine whether or not a chunk has been seen before. To get at this information as quickly as possible, flash-based storage systems have tended to use huge amounts of DRAM to cache chunk names in memory. It's not uncommon to see flash storage systems that allocate 16 GB of working cache per TB of storage. To support a 256 TB storage volume, such a system would require 4 TB of DRAM. The hard cost of more expensive (denser) DIMMs, together with the cost of a server board that can support that many DIMMs, makes this an extremely costly and unpopular proposition. Combine this with the fact that DRAM prices are not falling at the same rate as flash prices, and you can see why no vendor today makes a 256 TB flash storage array with global deduplication capabilities.

The solution to the memory challenge is coming, in the form of a next generation of flash storage products that utilize Albireo indexing and Albireo VDO. Unlike the flash arrays described above, flash-optimized arrays with VDO take advantage of advanced caching techniques to operate with 128 MB of working cache per TB of storage and still deliver excellent performance. With VDO, a 256 TB system can be delivered with as little as 32 GB of RAM while delivering 1M IOPS. The net result is a cost-effective and easily deployed data efficiency solution for flash arrays.
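
The memory figures in this section follow from simple multiplication. The short Python check below uses only the per-TB cache sizes and the 256 TB capacity quoted above; the function name `index_dram_gb` is just an illustrative helper, and binary units (1 GB = 1024 MB) are assumed.

```python
def index_dram_gb(capacity_tb: float, cache_mb_per_tb: float) -> float:
    """DRAM needed for the deduplication working cache, in GB."""
    return capacity_tb * cache_mb_per_tb / 1024


capacity_tb = 256

# 16 GB of working cache per TB, as quoted for typical flash arrays.
conventional_gb = index_dram_gb(capacity_tb, 16 * 1024)  # 4096.0 GB, i.e. 4 TB

# 128 MB of working cache per TB, as quoted for VDO.
vdo_gb = index_dram_gb(capacity_tb, 128)                 # 32.0 GB

# How many times less DRAM the smaller cache needs.
ratio = conventional_gb / vdo_gb                         # 128.0
```

At the same capacity, the difference between the two designs is a factor of 128 in DRAM.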

Conclusion

Deduplication Scalability by Vendor

As you can see in the table above, forward-thinking vendors like HDS have done a good job of overcoming limitations in their data efficiency features and have products on the market today that can scale to meet the requirements of the large enterprise. Many other vendors are lagging behind because they have not addressed IO and/or memory requirements. That is a serious shortcoming, since data efficiency is central to differentiating storage solutions, a critical end-user requirement, and a "must have" component for 2014. Permabit's VDO product overcomes both of these limitations through the use of advanced memory-efficient caching techniques.

More Stories By Louis Imershein

As Senior Director of Product Strategy at Permabit Technology Corporation, Louis Imershein is responsible for product evolution and strategic planning for the Albireo family of products. He has 22 years of technical leadership experience in product management, software development and support. Prior to joining Permabit, Imershein was a Senior Product Marketing Manager for the Sun Microsystems Data Management Group. He has a Bachelor's degree in Biological Science from the University of California, Santa Cruz.
