Key Dimensions that Characterize Acceptable Data
Organizing data quality rules into dimensions improves the specification and measurement of data quality and provides the framework under which quality can be measured and reported. This in turn enables better governance of data quality. Tools can then be built around these dimensions to determine the minimum levels required to meet business expectations and to monitor actual levels against them. The same framework also helps with root cause analysis and the eventual mitigation of data quality issues.
The dimensions are often defined in line with the contexts in which the metrics associated with business processes will be measured. They require continuous management review and oversight. In many cases, however, these are also the dimensions that lend themselves most readily to system automation, which makes them the best candidates for defining data quality monitoring rules.
The dimensions and their descriptions are listed below:
Uniqueness – Asserting uniqueness of entities means that no entity is represented more than once and that a unique key identifies each entity in the system. Each real-world entity is captured and represented exactly once within the relevant application architectures, so a new data instance must not be created when a record for that entity already exists. Beyond running duplicate analysis on existing data, this implies providing an identity matching and resolution service at the time of record creation.
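As a rough illustration of duplicate analysis, the sketch below counts how often each key occurs and reports the keys that appear more than once. The record layout and the name `duplicate_keys` are assumptions made for this example, not part of any particular tool.

```python
from collections import Counter


def duplicate_keys(records, key_fields):
    """Return {key_tuple: count} for keys that occur more than once."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical customer records; customer_id is assumed to be the unique key.
customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": "b@example.com"},
    {"customer_id": 2, "email": "b@example.com"},
]

print(duplicate_keys(customers, ["customer_id"]))
```

The same function could be called with a candidate record appended before insertion, which is one simple way to approximate an identity check at record creation time.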
Accuracy – This refers to the extent to which the data correctly represents the real-life objects it is intended to model. Accuracy is usually measured by comparing the data against a source of correct information, such as reference data. Other sources of correct information include a database of record, a corroborating set of data from another table, dynamically computed values, or the result of a manual process.
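One way to picture an accuracy measurement is as a match rate against a trusted source. The sketch below assumes the reference data is available as a simple key-to-value mapping; the field names and the function `accuracy_rate` are hypothetical.

```python
def accuracy_rate(records, reference, key, field):
    """Share of checkable records whose field matches the reference value.

    Records whose key is absent from the reference are skipped, because
    there is nothing to compare them against. Returns None if no record
    can be checked.
    """
    checked = [r for r in records if r[key] in reference]
    if not checked:
        return None
    matches = sum(1 for r in checked if r[field] == reference[r[key]])
    return matches / len(checked)


# Hypothetical data: country codes compared against a reference mapping.
records = [
    {"id": 1, "country": "DE"},
    {"id": 2, "country": "FR"},
    {"id": 3, "country": "US"},
]
reference = {1: "DE", 2: "IT", 3: "US"}

rate = accuracy_rate(records, reference, key="id", field="country")
print(f"accuracy: {rate:.2%}")
```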
Consistency – Consistency does not necessarily imply correctness. It means that two values drawn from different data sets must not conflict with each other. Consistency can be expressed through constraints: sets of rules that define relationships between attribute values, whether within a record, within a message, or across all values of an attribute. There are several contexts in which consistency can be defined:
- Record level i.e. within the same record
- Between different records i.e. cross-record consistency
- Temporal i.e. across different points in time
- Consistency checks must also take reasonableness into account
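The contexts above can be sketched as small rule functions, one per context. Everything here is illustrative: the order and line-item layout, the rule that a cumulative count should never decrease, and the upper bound used for reasonableness are all assumptions.

```python
from datetime import date


def record_level_ok(order):
    # Record level: within one record, shipping cannot precede ordering.
    return order["ship_date"] >= order["order_date"]


def cross_record_ok(order, line_items):
    # Cross-record: the order total must equal the sum of its line items.
    total = sum(
        li["amount_cents"] for li in line_items
        if li["order_id"] == order["order_id"]
    )
    return total == order["total_cents"]


def temporal_ok(snapshots):
    # Temporal: a cumulative count must not decrease between snapshots.
    values = [value for _, value in sorted(snapshots)]
    return all(a <= b for a, b in zip(values, values[1:]))


def reasonable(order, max_total_cents=1_000_000):
    # Reasonableness: the total falls within an assumed plausible range.
    return 0 <= order["total_cents"] <= max_total_cents


order = {
    "order_id": 7,
    "order_date": date(2024, 3, 1),
    "ship_date": date(2024, 3, 3),
    "total_cents": 2500,
}
line_items = [
    {"order_id": 7, "amount_cents": 1000},
    {"order_id": 7, "amount_cents": 1500},
    {"order_id": 8, "amount_cents": 999},
]
snapshots = [(date(2024, 3, 1), 10), (date(2024, 3, 2), 12), (date(2024, 3, 3), 11)]

print(record_level_ok(order), cross_record_ok(order, line_items))
print(temporal_ok(snapshots), reasonable(order))
```

Amounts are kept as integer cents so that the cross-record comparison is exact; comparing floating-point totals for equality would be fragile.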
Completeness – This means that certain attributes must be assigned values in a data set. Completeness rules can be assigned at three levels of constraint: mandatory attributes, which require a value; optional attributes, which may have a value under a certain set of conditions; and inapplicable attributes, which may not have a value. Completeness may also be seen as encompassing the usability and appropriateness of data values.
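The three constraint levels can be expressed as a small rule table. In the sketch below a rule is either a fixed level or a function that derives the level from the record, which is one possible way to model conditions; the field names and the `completeness_violations` helper are hypothetical.

```python
MANDATORY, OPTIONAL, INAPPLICABLE = "mandatory", "optional", "inapplicable"


def completeness_violations(record, rules):
    """Return (field, problem) pairs for a record given per-field rules."""
    violations = []
    for field, rule in rules.items():
        # A callable rule decides the level from the record's other values.
        level = rule(record) if callable(rule) else rule
        has_value = record.get(field) not in (None, "")
        if level == MANDATORY and not has_value:
            violations.append((field, "missing mandatory value"))
        elif level == INAPPLICABLE and has_value:
            violations.append((field, "value present but inapplicable"))
        # OPTIONAL fields are acceptable with or without a value.
    return violations


# Assumed rules: spouse_name is optional for married people, else inapplicable.
rules = {
    "name": MANDATORY,
    "spouse_name": lambda r: OPTIONAL
    if r.get("marital_status") == "married"
    else INAPPLICABLE,
}

record = {"name": "", "marital_status": "single", "spouse_name": "Pat"}
print(completeness_violations(record, rules))
```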
Timeliness – This refers to the time expectation for the accessibility and availability of information. It can be measured as the time between when the information is expected and when it is ready for use. In practice, service levels for the availability of information are defined to indicate how quickly the data must be provided.
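A minimal sketch of a timeliness check follows, assuming the expected and actual availability times are recorded as timestamps and the service level is expressed as a maximum allowed delay. The function names and the example times are illustrative only.

```python
from datetime import datetime, timedelta


def timeliness_delay(expected_at, ready_at):
    """Delay between expected and actual availability; early counts as zero."""
    return max(ready_at - expected_at, timedelta(0))


def meets_service_level(expected_at, ready_at, allowed_delay):
    """True if the data became available within the allowed delay."""
    return timeliness_delay(expected_at, ready_at) <= allowed_delay


# Hypothetical daily feed expected at 06:00 that arrived at 06:45.
expected = datetime(2024, 3, 1, 6, 0)
ready = datetime(2024, 3, 1, 6, 45)

print(timeliness_delay(expected, ready))
print(meets_service_level(expected, ready, timedelta(minutes=30)))
```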
Currency – This measures how up to date the information is with respect to the world being modeled, given that the world changes over time. Besides verifying that the data is up to date, data currency also covers the frequency at which the data is expected to be refreshed. Keeping data current may require manual as well as automated processes.
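Where each record carries a last-refreshed timestamp, currency can be checked against the expected refresh frequency. The sketch below assumes such a timestamp exists under the hypothetical field name `last_refreshed`.

```python
from datetime import datetime, timedelta


def stale_records(records, now, refresh_interval):
    """Return records not refreshed within the expected interval."""
    return [r for r in records if now - r["last_refreshed"] > refresh_interval]


# Hypothetical price records that are expected to be refreshed daily.
prices = [
    {"symbol": "AAA", "last_refreshed": datetime(2024, 3, 3, 9, 0)},
    {"symbol": "BBB", "last_refreshed": datetime(2024, 3, 1, 9, 0)},
]
now = datetime(2024, 3, 3, 18, 0)

for r in stale_records(prices, now, timedelta(days=1)):
    print("stale:", r["symbol"])
```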
Conformance – This refers to whether instances of data are stored, exchanged, or presented in a format that is consistent with the domain of values, i.e. whether they follow the metadata rules assigned to them.
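Conformance rules often reduce to a format pattern or a membership test against a value domain. The sketch below assumes two such metadata rules: dates in `YYYY-MM-DD` form and a small, made-up set of allowed currency codes.

```python
import re

# Assumed metadata rules for this example only.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
CURRENCY_DOMAIN = {"USD", "EUR", "GBP"}


def conforms(record):
    """Return a per-field conformance result for one record."""
    return {
        # Format rule: the string must have the YYYY-MM-DD shape.
        "trade_date": bool(ISO_DATE.fullmatch(record["trade_date"])),
        # Domain rule: the value must be one of the allowed codes.
        "currency": record["currency"] in CURRENCY_DOMAIN,
    }


print(conforms({"trade_date": "2024-01-31", "currency": "usd"}))
```

Note that the pattern checks only the shape of the date string. A value such as `2024-13-45` would pass it, so checking calendar validity would need an additional rule.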
Referential Integrity – Assigning unique identifiers to objects simplifies the management of data. It also introduces a new expectation: whenever an object identifier is used as a foreign key within a data set to refer to some core representation, that core representation must actually exist.
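A referential integrity check can be sketched as a search for orphaned foreign keys, that is, child rows whose identifier has no matching core record. The table layout and the function name `orphaned_references` are assumptions for illustration.

```python
def orphaned_references(child_rows, fk_field, parent_rows, pk_field):
    """Return child rows whose foreign key matches no parent primary key."""
    parent_keys = {p[pk_field] for p in parent_rows}
    return [c for c in child_rows if c[fk_field] not in parent_keys]


# Hypothetical accounts (core representation) and trades referring to them.
accounts = [{"account_id": 100}, {"account_id": 200}]
trades = [
    {"trade_id": 1, "account_id": 100},
    {"trade_id": 2, "account_id": 300},
]

print(orphaned_references(trades, "account_id", accounts, "account_id"))
```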