Understanding the value of ‘long’ data (Part 2)
This is a three-part series about long data. Part I focuses on its definition, quality, and value. Part II explores the promise of blockchain to deliver data integrity for long data. Finally, Part III outlines the potential obstacles to guaranteeing data integrity for long data, even when supported by technologies such as blockchain.
We define ‘long data’ as longitudinal data, or “data which tracks the same sample at different points in time”. An example of ‘long data’ (sometimes referred to as panel data) in the health context would be data that tracks the same cohort of clinical trial subjects over time in relation to the same health variable (for example, blood pressure).
The Long Data Integrity Challenge
In the first blog of this three-part series on ‘long data’, we saw how the ‘shelf life’ of data, which we called its ‘longevity’, varies heavily depending on a range of contextual factors. In this second part, our focus is on understanding why long data can be of great value, particularly in the context of health data. As we saw in Part I of this series, the value of any data exists only “in use”. In other words, we can argue that data has no intrinsic value; rather, it has value when we consider its “potential use” in the future.
How, then, can we understand the potential value of ‘long data’? To illustrate: long data may track several clinical trial subjects over time, yet the actual value of that longitudinal data to a hospital, a pharmaceutical company, science, or society may vary significantly depending on many contextual factors. For example, long data may include observations from thousands of patients over five years, yet if it represents a single run of a clinical trial then, without more, it might have little or no value at all, at least until the required number of trials has been completed to enable meaningful evaluation of the data. Equally, a long dataset might contain dozens of observations over several years and thousands or even millions of data points, but be collected from a mere handful of subjects (consider, for example, early fMRI trials). In that case, unless more data are collected from more subjects, the value of the longitudinal dataset is unlikely to improve, owing to the small initial sample (with some exceptions).
All these contextual issues contribute to what we refer to as the “Long Data Integrity Challenge”: although a longitudinal dataset can be extended and improved over time, thereby enhancing its value, it is vitally important that all contextual aspects of data collection, data handling, and data storage be duly recorded and systematized. Such a systematized approach is not only necessary to alleviate longitudinal data issues such as noise, omissions, and errors, but it also simplifies the detection of false, “fake”, or fraudulent results.
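To make the idea of systematized recording concrete, here is a minimal sketch (in Python, with hypothetical field names) of how a single longitudinal observation might be bundled with the contextual metadata about its collection and handling, plus a content checksum that makes later omissions or silent edits detectable:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

# Hypothetical record format: one longitudinal observation together with
# the contextual metadata (collection, handling) discussed above.
@dataclass(frozen=True)
class Observation:
    subject_id: str    # stable identifier for the same subject over time
    variable: str      # e.g. "systolic_bp"
    value: float
    collected_at: str  # ISO 8601 timestamp of collection
    instrument: str    # how the value was measured
    handler: str       # who recorded the data point

    def checksum(self) -> str:
        """Deterministic content hash of the observation and its metadata."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

obs = Observation("subj-001", "systolic_bp", 128.0,
                  "2024-05-01T09:30:00Z", "cuff-A7", "nurse-12")
print(obs.checksum())  # any change to the value or its metadata changes the hash
```

Because the checksum covers the metadata as well as the measurement, a dataset curated this way carries its own evidence of how each point was collected, which is exactly the kind of systematization the Long Data Integrity Challenge calls for.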
Blockchain as a Possible Solution to the Long Data Integrity Challenge
Recent advances in blockchain technologies offer a potential solution to the Long Data Integrity Challenge: they might be harnessed to increase the life-span of data, ensure its integrity, facilitate its accessibility, and thereby increase its value. A blockchain is, in essence, an append-only distributed database; at its core, the technology is concerned with the management of data, and in particular with ensuring data security in blockchain-based databases (Gaetani et al., 2017), by providing a technological means for uniform, consistent, and secure data collection, handling, and storage. More precisely, a blockchain is a cryptographic protocol that allows a network of computers to collectively maintain a shared ledger of information without the need for verification by a trusted third party. This information is stored in a chronological, cryptographically secured ‘chain’ of data that serves as an immutable and irreversible record of transactions.
Because the value of longitudinal data depends, in large measure, on its integrity (especially in the health context, where human life may be at stake), blockchain technology has the potential to ensure that long health data retains its value by automatically and cryptographically verifying and cross-verifying data at every step. This, in turn, may help to reduce the risk of noise, errors, and fraud.
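The chain-of-blocks idea described above can be illustrated with a deliberately minimal sketch (not a production blockchain, and not any particular platform's API): each appended block stores the hash of its predecessor, so any later alteration of earlier trial data breaks verification for everything downstream.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Hash a block's full contents deterministically."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain: list, record: dict) -> None:
    """Append a record, linking it to the hash of the previous block."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev, "record": record})

def verify_chain(chain: list) -> bool:
    """Cross-verify every link: each block must match its predecessor's hash."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain: list = []
append_block(chain, {"subject": "s1", "systolic_bp": 128})
append_block(chain, {"subject": "s1", "systolic_bp": 131})
assert verify_chain(chain)

# Tampering with an earlier record is detected immediately.
chain[0]["record"]["systolic_bp"] = 90
assert not verify_chain(chain)
```

In a real deployment the ledger would also be replicated across many independent nodes, so a would-be fraudster would have to rewrite every subsequent block on a majority of machines, which is what makes the record effectively immutable.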
Consider, for example, the Measles, Mumps, and Rubella (MMR) vaccination controversy. In 1998, The Lancet published a paper by Andrew Wakefield and his co-authors (Wakefield et al., 1998) which argued that there was a causal link between the MMR vaccination and autism in children. Subsequent failed attempts to replicate the initial trials, as well as the revelation that Wakefield had a significant conflict of interest in conducting the trial (apparently, he received payments from a law firm interested in implicating health services by establishing a link between the vaccine and autism), led The Lancet to retract the article once the study was recognized as fraudulent (see, e.g., Godlee et al., 2011 for details). Yet it took over 12 years (the retraction took place in 2010) and much public debate for the story to reach this conclusion. To this day, the original piece by Wakefield and co-authors fuels the anti-vaccination movement in the UK and beyond, affecting thousands, if not millions, of human lives. Now imagine that blockchain technology had been available in 1998 and had provided a secure repository for the MMR trials’ data. In that case, (1) the initial trial report would likely have been taken far less seriously, because it would have contradicted previous results, and (2) even if it had been taken seriously, it would not have taken so long to retract the reported results, because the data would have been available to the scientific community and the original authors held accountable for its quality.
All this suggests that blockchain technology may help alleviate, or even eliminate, the possibility that the integrity of a long-lived dataset could be jeopardized, thereby ensuring that data lives longer and is of more use to organizations and society.
References
Maletic, J.I. and Marcus, A., 2000, October. Data Cleansing: Beyond Integrity Analysis. In IQ (pp. 200-209).
Demchenko, Y., Grosso, P., De Laat, C. and Membrey, P., 2013, May. Addressing big data issues in scientific data infrastructure. In Collaboration Technologies and Systems (CTS), 2013 International Conference on (pp. 48-55). IEEE.
Gaetani, E., Aniello, L., Baldoni, R., Lombardi, F., Margheri, A. and Sassone, V., 2017. Blockchain-based database to ensure data integrity in cloud computing environments.
Godlee, F., Smith, J. and Marcovitch, H., 2011. Wakefield's article linking MMR vaccine and autism was fraudulent. BMJ, 342, c7452.
Katal, A., Wazid, M. and Goudar, R.H., 2013, August. Big data: issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
Lee, J., Bagheri, B. and Kao, H.A., 2015. A cyber-physical systems architecture for industry 4.0-based manufacturing systems. Manufacturing Letters, 3, pp.18-23.
Levitin, A.V. and Redman, T.C., 1993. A model of the data (life) cycles with application to quality. Information and Software Technology, 35(4), pp.217-223.
Panzer-Steindel, B., 2007, April. Data integrity. CERN Technical Report Draft 1.3, CERN/IT.
Monino, J.L., 2016. Data value, big data analytics, and decision-making. Journal of the Knowledge Economy, pp.1-12.
Wakefield, A., Murch, S.A., Linnell, J., Casson, D. and Malik, M., 1998. Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, 351, pp.637-641.
Woerner, S.L. and Wixom, B.H., 2015. Big data: extending the business strategy toolbox. Journal of Information Technology, 30(1), pp.60-62.
See https://www.nlsinfo.org/content/getting-started/what-are-longitudinal-data for a complete definition.