Recently, the seventh Information Systems International Conference (ISICO 2023) convened in Bali, Indonesia, as well as online. Among the presenters were Fakhitah Ridzuan and Wan Mohd Namzee Wan Zainon, a research team from Malaysia who did potentially valuable work for #datafam, especially those in industry. What did they do? Ridzuan and Zainon conducted a review of data quality dimensions, essentially asking the question, “How do we assess data quality nowadays?” They concluded that the classical quaternity of accuracy, consistency, completeness, and timeliness might not be enough anymore, especially as “big data” becomes even bigger.
Here, I want to summarize their review and offer some constructive discussion. In short, I believe that the new data quality taxonomy from Ridzuan and Zainon can be beneficial for practitioners across sectors. However, completeness checks and additional validation are needed.
The classical quaternity
First, let’s revisit the “big four” dimensions of data quality, which Ridzuan and Zainon (2024) explain to set up their review.
Accuracy. Accurate data are free from typos, capture real-world context, and do not give rise to uncertainty. Accuracy is easiest to assess when it comes down to “black and white” responses, such as “did you smoke in the past 30 days, yes or no?” Accuracy assessments are more challenging when the data are less categorical or numeric, as when an analyst is parsing patient narratives about their efforts to quit smoking, successful or not.
Consistency. Consistent data have the same values across all instances, copies, systems, and databases. Since inconsistencies can occur whenever data change hands, the “four-eyes” principle from IBM can be useful: “any activity by an individual within the organization must be controlled (reviewed and double checked) by a second individual that is independent and competent.”
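To make the idea concrete, here is a minimal sketch of a cross-system consistency check in Python: flag every record whose value differs between two copies of the “same” data. The system names, fields, and records are invented for illustration.

```python
# Hypothetical consistency check between two copies of the same data.
# All names and values below are made-up examples.

def find_inconsistencies(system_a: dict, system_b: dict) -> list:
    """Return (key, value_a, value_b) triples where the two systems disagree."""
    issues = []
    for key in system_a.keys() & system_b.keys():  # records present in both
        if system_a[key] != system_b[key]:
            issues.append((key, system_a[key], system_b[key]))
    return sorted(issues)

# Imagined CRM and billing snapshots that should agree but do not.
crm = {"ACME": "active", "Globex": "inactive", "Initech": "active"}
billing = {"ACME": "active", "Globex": "active", "Initech": "active"}

print(find_inconsistencies(crm, billing))  # → [('Globex', 'inactive', 'active')]
```

In four-eyes terms, the output is the worklist a second, independent reviewer would resolve before the data move downstream.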
Completeness. Completeness indicates the degree to which the data captures “a significant state of the represented real world” (Ridzuan & Zainon, 2024, p. 343). Completeness is usually assessed through quantitative standards, such as the percentage of raw data volume relative to required data volume. On a current project, my team is reporting response rates, showing how many companies we successfully contacted (the numerator) over how many companies we attempted to contact based on our project parameters (the denominator).
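The ratio described above is simple enough to compute directly. Here is a minimal sketch, with the counts invented for illustration; the general form (observed volume over required volume) is the metric the review describes.

```python
# Completeness as observed volume over required volume, framed here
# as a survey response rate. The counts are invented examples.

def completeness(observed: int, required: int) -> float:
    """Percentage of required records actually captured."""
    if required <= 0:
        raise ValueError("required volume must be positive")
    return 100 * observed / required

contacted = 412   # companies successfully contacted (numerator)
attempted = 500   # companies in scope per project parameters (denominator)

print(f"Response rate: {completeness(contacted, attempted):.1f}%")  # → 82.4%
```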
Timeliness. Timeliness, or “freshness,” can be divided into currency and volatility. Currency indicates how valid the data are to the real world; that is, how up-to-date the data are. Volatility measures how likely the data are to change. For example, comments submitted to a closed FDA docket in 2015 likely have low currency (because much may have changed in the past decade) and also low volatility (because the docket is closed and no new responses are coming in). The operations room of a smart city, such as the Songdo International Business District, likely has the capacity to display data on all traffic patterns in the city at 3:04 p.m. Those data would be high currency (because the data accurately reflect a particular moment in time) but also high volatility (because once the traffic lights change, more employees clock out for the day, or a collision occurs, the patterns will be different).
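One simple way to operationalize the currency half of timeliness is as the age of the data: time elapsed since the last update, where lower is more current. The sketch below assumes this definition, and the timestamps are invented stand-ins for the two examples above (measuring volatility would additionally require a history of changes, which is omitted here).

```python
# Currency operationalized as data age in days; lower = more current.
# Timestamps are invented stand-ins for the examples in the text.
from datetime import datetime, timezone

def currency_days(last_updated: datetime, now: datetime) -> float:
    """Age of the data in days at the moment of assessment."""
    return (now - last_updated).total_seconds() / 86400

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
fda_docket = datetime(2015, 3, 10, tzinfo=timezone.utc)   # long-closed docket
traffic_feed = datetime(2025, 6, 1, tzinfo=timezone.utc)  # live traffic feed

print(round(currency_days(fda_docket, now)))    # thousands of days: low currency
print(round(currency_days(traffic_feed, now)))  # → 0 days: high currency
```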
These are the big four dimensions of data quality for good reason: they work well across a variety of applications. But as Ridzuan and Zainon (2024) argued, the big four do have limitations. In particular, these dimensions may not adequately capture the five V’s of big data: “a large amount of data (volume) coming from various resources (variety) and accommodate the dynamic aspects of data (velocity). Besides, the data should be assessable and accurate (veracity), so that it can be used for decision-making and monetary aspect[s] (value)” (p. 343).
That last V, value, is something that Andrew and I have repeatedly emphasized since the earliest days of the Insights x Design podcast. Many organizations do a great job of collecting data. They do a much poorer job of translating the data into usable information. Then there are the challenges, as Ridzuan and Zainon (2024) stated, of adapting data quality dimensions to contexts of use. The big four can go far, but they are not a panacea. Neither should they be used uncritically, and not every researcher agrees on their usefulness. Hence the motivation for the literature review.
An expanded taxonomy
While recognizing the continued usefulness of the big four, Ridzuan and Zainon (2024) identified additional data quality dimensions based on 16 papers. They arranged the results into a helpful taxonomy that consists of 20 dimensions that fit into four categories: accessibility, contextual, intrinsic, and representational. Accessibility describes how easily users can get at the data. Contextual considers context in a general sense, as well as the specific task and preferences of the users. Intrinsic “expresses the natural quality of the data” (p. 345), and representational underlines the role of the system.
Below is the authors’ own representation of the taxonomy, which is nice and concise.
Ridzuan and Zainon (2024) emphasized that “it is important for organizations to consider the specific use cases of their data and choose appropriate data quality dimensions to ensure high-quality data” (p. 346). As with the big four, not all of the 20 dimensions on the expanded taxonomy are needed, or relevant, to every data project. Thus, data analysts will need to “find the flag” and “pick the right club.” In practical terms, this can involve four steps:
Decide which dimensions fit the data under consideration.
Determine whether a dimension helps you attain a higher goal.
Rank the dimensions, showing which are most important.
Decide which indicators to measure for each dimension.
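The four steps above can be sketched as a small planning structure. Everything here is a hypothetical example: the chosen dimensions, the goals, the weights used for ranking, and the indicators are all invented to show the shape of the exercise, not prescribed by the review.

```python
# Hypothetical walk-through of the four selection steps: pick relevant
# dimensions, tie each to a goal, weight them, and attach indicators.
# All dimension choices, weights, and indicators are invented examples.

plan = [
    # (dimension, goal it supports, weight, indicator to measure)
    ("completeness", "reportable response rates", 0.40, "contacted / attempted"),
    ("timeliness",   "decisions on fresh data",   0.35, "days since last update"),
    ("consistency",  "one value across systems",  0.25, "% of matching records"),
]

# Step 3: rank by weight to show which dimensions matter most here.
for dim, goal, weight, indicator in sorted(plan, key=lambda row: -row[2]):
    print(f"{dim:<12} weight={weight:.2f} goal={goal!r} indicator={indicator!r}")
```

The point of the structure is the pairing: no dimension enters the plan without a higher goal it serves (step 2) and a concrete indicator to measure (step 4).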
You can learn more about these steps from DAMA NL.
Discussion
Overall, I thought this review was useful and adaptable, and Ridzuan and Zainon (2024) can potentially advance the conversations on data quality in meaningful ways. Yet, I am left to wonder about the data quality of their data quality review. Perhaps due to space limitations, the authors provided sparse information on their methods, so it is difficult to judge how complete their data collection was. For example, did they review studies of data quality in a single sector, or did they consider multiple sectors? Did they exclude certain publication years? Questions like these are important because, as with many things in data, the principle of “garbage in, garbage out” applies.
It is also difficult to judge the analysis. Terminology can vary widely between researchers, and only a close look can determine whether “timeliness” and “freshness,” for example, mean the same thing. Knowing more about how the authors extracted and arranged the 20 dimensions would allow us to better understand the strength of the results — and those in industry might make use of those results. Certainly, we could list even more dimensions; DAMA NL recognizes nearly 130. However, these are not specific to big data, the focus of the review by Ridzuan and Zainon (2024).
Of course, this review is a conference paper, and conferences often showcase research in preliminary stages. Ridzuan and Zainon (2024) will likely be doing more work on the data quality of big data, and they may further refine the categories that they outlined here. The “intrinsic” category, in particular, could use further elaboration, and future research may expand the number of categories. Still, as it stands, the review seems to effectively outline the directions and relative density of current studies, offering data practitioners plenty of new metrics to try out.
My intent is not to hold up this paper for unwarranted criticism, of course. Rather, inspired in part by Josh Starmer (Bam!), who bridges scholarly and professional worlds, my hope is that a quick review of their review can help disseminate the results from Ridzuan and Zainon (2024) and spark additional discussion.
We at Insights x Design will continue to faithfully monitor the research journals and bring you the latest, most intriguing developments.
APA citation of the review:
Ridzuan, F., & Zainon, W. M. N. W. (2024). A review on data quality dimensions for big data. Procedia Computer Science, 234, 341-348.