Back in 2010 the then-CEO of Google, Eric Schmidt, was quoted as saying that every two days we create as much information as we did from the dawn of civilization up until 2003, which he figured was approximately five exabytes of data. To put that into perspective, one exabyte (EB) of data is equivalent to one million terabytes (TB).
Take that and compare it to how much data a genomics research lab creates. For example, a small- to mid-sized research lab typically runs between three and five DNA sequencers, instruments that convert a physical DNA sample into data. Each sequencer generates 500-600 TB of data per year, which is equivalent to 0.0005-0.0006 EB, or 0.0015-0.0018 EB for three sequencers and up to 0.0025-0.0030 EB for five. Worldwide there are now more than 2,000 next-generation sequencers, and that number will continue to increase.
Because of the volume of data involved, many of the challenges surrounding genomics and data management derive from the technologies needed to store and analyze the raw sequencing data. On top of that, the availability of deep and large genomic datasets raises concerns over who has access to what data, data security, as well as subject/patient privacy. These challenges can be broken down into five key areas that need to be addressed in order for the field to continue its rapid advances in medical and scientific research.
The following challenges are based on conversations with real-life customers, in which use-verification models were applied to mirror the data management process a genomics research lab would typically adopt. All of the customers identified these challenges and said a solution to automate the process would deliver time and cost savings.
Moving the data
The first challenge customers experience is moving the genomics data. Not only is the size of the datasets an issue; so is moving data between different storage layers with manual scripts, which often involves a team of IT administrators. First, the data needs to be moved off the genomics sequencer into a high-performance computing (HPC) environment. From there, it needs to go into production storage and later into secondary storage for longer-term retention. Writing the manual scripts required to move the data from one stage to the next is time-consuming and expensive.
However, if this process were automated, the lab could not only save time but also reduce the staffing required from a team of administrators to a single IT admin. Automation also improves data accuracy and avoids the failures that manual scripts are prone to, since human error is inevitable. There are no firm figures for how much time automation saves over manual scripting, because every lab handles a different volume of data, but the proportion is high and the benefits far exceed those of moving the data manually.
Automation can be achieved with software that offers storage lifecycle management, making it easy to move data in and out of secondary storage for the times it is needed in production, for example to analyze or search the data.
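In essence, a lifecycle policy automates the tiering decision that admins would otherwise script by hand. The sketch below is purely illustrative, not a real product's API: it assumes two local directories standing in for production and secondary storage, and a simple age-based retention rule.

```python
import shutil
import time
from pathlib import Path

# Hypothetical tier locations and retention policy. Real storage
# lifecycle management software exposes these as configurable policies
# over networked tiers, not local folders.
PRODUCTION = Path("production")
SECONDARY = Path("secondary")
RETENTION_DAYS = 30  # runs untouched this long move to secondary

def tier_down(now=None):
    """Move files not modified within the retention window from
    production storage down to secondary storage."""
    now = time.time() if now is None else now
    cutoff = now - RETENTION_DAYS * 86400
    moved = []
    for f in sorted(PRODUCTION.glob("*")):
        if f.is_file() and f.stat().st_mtime < cutoff:
            shutil.move(str(f), str(SECONDARY / f.name))
            moved.append(f.name)
    return moved
```

Run on a schedule, a policy like this replaces the hand-maintained move scripts described above, and the same rule can be inverted to recall data into production when analysis resumes.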
Understanding the data
Once the data is where it needs to be, the second challenge is indexing the data in a repository so that it can be searched and analyzed. This process also requires manual scripts and can be improved with automation. Storage lifecycle management software offers indexing capabilities that make the data searchable in a virtual repository, which means the lab doesn't have to invest the time and money associated with moving it from a physical location. Instead, a virtual repository can be created and connected to indexing and reporting structures as well as analytics application programming interfaces (APIs).
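The core of such an index is straightforward: metadata about each sequencing run is tokenized into a searchable structure, so queries run against the index rather than the physical data. The following is a minimal in-memory sketch under that assumption; the run IDs and metadata fields are invented for illustration.

```python
from collections import defaultdict

class RunIndex:
    """Toy inverted index over sequencing-run metadata: each term in a
    run's metadata maps back to the IDs of the runs containing it."""

    def __init__(self):
        self._index = defaultdict(set)  # term -> set of run IDs
        self._runs = {}                 # run ID -> metadata dict

    def add(self, run_id, metadata):
        self._runs[run_id] = metadata
        for value in metadata.values():
            for term in str(value).lower().split():
                self._index[term].add(run_id)

    def search(self, query):
        """Return the IDs of runs matching every term in the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        hits = set(self._index.get(terms[0], set()))
        for term in terms[1:]:
            hits &= self._index.get(term, set())
        return hits
```

A production system would persist the index and point each hit at the data's current storage tier, but the search path itself never touches the raw sequencing files.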
Sharing and securing the data
The next two challenges are particularly important given the sensitive and private nature of the data. How the data is shared and with whom, without losing any of it or failing to meet compliance regulations, is a significant challenge. An endpoint and edge solution can be added to allow data sharing with high levels of security and data protection, along with auditing and reporting that show IT administrators what they are sharing, who they are sharing it with, and what level of access each person has to that data.
As explained above, the virtual data repository can be connected to reporting structures, which gives IT administrators policy-driven information about who has access to what data and at what level. This keeps the data secure and protected even if it cannot be backed up for some reason. Because the datasets are so large, protecting them without automation involves a significant investment of time, money and resources. By automating this process and linking it to the lab's policies, data security can be streamlined, saving time and resources while keeping the data out of the wrong hands.
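Such policy-driven reporting amounts to inverting the access policy: rather than asking which datasets a user can reach, the audit asks who can reach each dataset. A hypothetical sketch, with invented users, datasets and access levels standing in for a real product's policy model:

```python
# Illustrative access policy: user -> {dataset: access level}.
# The names and levels here are assumptions, not a real data model.
POLICY = {
    "alice": {"run_001": "read-write", "run_002": "read"},
    "bob":   {"run_002": "read"},
}

def audit_report(policy):
    """Invert the policy to list, per dataset, every user granted
    access and at what level -- the view an auditor actually wants."""
    report = {}
    for user, grants in policy.items():
        for dataset, level in grants.items():
            report.setdefault(dataset, []).append((user, level))
    for entries in report.values():
        entries.sort()
    return report
```

Generating this view automatically, on every policy change, is what lets an admin answer a compliance question in seconds rather than by re-reading scripts.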
Pipelining the data
The final challenge is data pipelining, in which the data is run through successive levels of analysis, step by step. As outlined above, the data first moves from the genomics sequencer to the high-performance compute environment, and then on to the production and secondary layers of storage, depending on what the lab is doing with the data at that point in time. With the data processed, the lab can receive real-time notifications when certain steps in the pipeline have been performed on that data. Once level one (as outlined in the first challenge) is complete, the IT admin knows it is time to proceed to the next step.
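The stage-and-notify flow described above can be sketched in a few lines. This is a minimal illustration assuming four stages matching the article's flow; the `notify` callback stands in for whatever alerting channel (email, dashboard, chat) a lab actually uses.

```python
# Assumed pipeline stages, mirroring the flow described in the text:
# sequencer -> HPC -> production storage -> secondary storage.
STAGES = ["sequencer", "hpc", "production", "secondary"]

def run_pipeline(dataset, process, notify):
    """Run `process(dataset, stage)` for each stage in order, firing a
    notification as each stage completes so the admin knows it is safe
    to proceed to the next one."""
    for stage in STAGES:
        process(dataset, stage)
        notify(f"{dataset}: {stage} complete")
```

The point is not the loop itself but the contract: every stage emits a completion event, so no one has to poll storage or tail script logs to know where a dataset is in the pipeline.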
With one data management platform rolled out across the lab, IT administrators and researchers can address all of these data management challenges. Depending on what they already have in place, they can adopt the entire solution, including hardware and software, or just the parts they need. Overall, the platform saves them time managing, developing and administering manual scripts. On top of that, they gain an added protection layer that will save them money on traditional backup.
The latter benefit is particularly pertinent for these labs: research institutions of this kind often depend entirely on public and private research funding, so they are looking to achieve their business and scientific goals at the lowest possible cost.
The time saved also helps the bottom line: labs that have automated these data processes can be more productive, because a single admin can look after the entire process from a single screen.
Where is the PACS for genomics?
While these challenges can be addressed by a data management platform, there is still no equivalent in the genomics world to the picture archiving and communication system (PACS) of the imaging world. Data management does bring some sanity to the madness of enormous genomics research datasets and the complexity that comes with them. Beyond these challenges, however, there is still the question of cloud computing and what role it has to play in genomics research.
To date, most genomics research data has been stored on premises due to its size and sensitivity. But that doesn't rule out the cloud completely. One of the main challenges is the size of the datasets: with hundreds of terabytes or petabytes of data, there is an inherent bandwidth issue with upload and download times when moving data in and out of the cloud. It is also a complex and costly exercise compared with moving data around on premises. On top of that, cloud storage isn't cheap.
However, if storage consumption can be brought under control by a data management platform, then the workloads can be automated and managed from a single dashboard. This makes it easier to move the data into production when needed for analysis and then back into secondary storage when it isn’t needed. In turn, this helps to optimize cost performance.
When it comes to sharing the data with other labs, one of the biggest challenges, aside from those already mentioned, is that unlike imaging data, which is used for diagnostic purposes, genomics data is used for research and varies from one institution to another. The process under which the data is collected, stored and analyzed is also more ad hoc than a unified system like a PACS in the imaging world. One of the key reasons is that the genetic sequencers are siloed. Until this changes, there will always be components of the data gathering, management, storage and analysis that require an outside platform to connect them. The ways the data can be interpreted also vary greatly compared with imaging data: genomics data is used in a multitude of ways by different research institutes and organizations, so a one-size-fits-all platform won't work in all cases and needs to be configured case by case.
The future of genomics
Although automation is tackling many of these challenges head on, there’s still a long way to go before genomics research labs can really reap the rewards that data management platforms promise. Many vendors are still looking at how they can take data management out to the market and apply it in a real world scenario.
The study of genomics in Canada is growing, and the country is one of the leaders in the field. All of the main universities in Canada have labs involved in genomics research. Canada is also recognized worldwide for Genome Canada, a not-for-profit organization funded by the Government of Canada that works to harness the power of genomics for the benefit of Canadians in fields including health, agriculture, forestry, fisheries and aquaculture, the environment, energy and mining.
As technology improves, so does our ability to gain better insights from the study of genomics and enhance the lives of people across Canada and around the world. But it’s important to remember that genomics data is only as strong as our capacity to manage it. Labs must find solutions that solve the five data management challenges in order to keep up with the rapid advances researchers are making in genomic science.
Bassam Hemdan is Vice President, Canada for Commvault, a global leader in backup, recovery, and data management across any hybrid environment.