Ethical AI With Synthetic Data
Recent advances in Big Data have encouraged enterprises to invest in data strategies, including the operationalization of ethical AI principles, policies, and frameworks. This is evident from the size of the investments organizations are willing to make in the data domain, as data is heralded as the new currency of the digital age. For example, Netflix saves $1 billion annually by using Big Data. However, real data is often biased, meaning that it contains unexpected or unwanted correlations among different data features. This is because collecting real data is an uncontrolled process that naturally encodes the biases present in the real world. The problem is that biased data can have serious implications for enterprises.
To illustrate this claim, consider an example. Enterprises use fraud detection solutions that rely heavily on transaction patterns, geo-location, and demographic data to identify unusual behaviour or suspicious activity. This can lead to unintended and unrecognized data biases, such as a particular demographic group being flagged for no apparent reason. How the AI / ML models, their testing, and the data handling process can be made fair and ethical, so that a dataset carries no bias against a particular group of individuals while still protecting against potential fraud, remains an open question.
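One simple way to make this concern measurable is to compare how often each demographic group is flagged. The sketch below is illustrative only: the record layout, the field names "group" and "flagged", and the toy data are assumptions made for the example, not part of any particular fraud detection product.

```python
from collections import defaultdict


def flag_rate_by_group(records, group_key="group", flag_key="flagged"):
    """Return the share of flagged records for each group."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for rec in records:
        totals[rec[group_key]] += 1
        flagged[rec[group_key]] += int(bool(rec[flag_key]))
    return {g: flagged[g] / totals[g] for g in totals}


def disparity_ratio(rates):
    """Ratio of the lowest to the highest flag rate (1.0 means equal rates)."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


# Hypothetical toy data: group A has 2 of 10 records flagged,
# group B has 6 of 10 records flagged.
records = (
    [{"group": "A", "flagged": i < 2} for i in range(10)]
    + [{"group": "B", "flagged": i < 6} for i in range(10)]
)

rates = flag_rate_by_group(records)
ratio = disparity_ratio(rates)
print(rates, ratio)
```

A ratio well below 1.0 does not prove discrimination on its own, but it is a signal that one group is being flagged far more often than another and that the underlying data deserves a closer look.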
Biased data can cause inaccurate AI / ML models to be trained and deployed, which can perpetuate discrimination. Regulations around the world are getting stricter on AI, especially on systems that could harm the fundamental rights of citizens, such as the right to non-discrimination. In April 2021, the European Commission (EC) proposed a brand-new regulatory framework to protect EU citizens from the risks of AI, in part because of its transparency issues. This has widespread implications for businesses globally. Concretely, it affects foreign companies selling AI-based services in the EU, as well as providers of AI systems that are located outside the EU but whose infrastructure, such as cloud services, is hosted in the EU. Non-compliance can result in heavy penalties of up to 6 percent of a company's annual revenue.
Data used to be a data science problem, but it is now becoming a compliance problem as well. With the release of the Ethics Guidelines for Trustworthy AI by the EU and the Fairness, Ethics, Accountability and Transparency (FEAT) principles by the Monetary Authority of Singapore (MAS), AI models, and consequently the data they are trained on, must be fair and ethical. This raises a question: if real data is biased and its acquisition process cannot be controlled, what is the alternative? The answer is AI-powered synthetic data, which gives data owners control over the generation process itself. This means that biases present in the real data can be removed in the synthetic data. How? Let us find out.
If an ML model is trained on a dataset that contains 70 percent males and 30 percent females, it is likely to be biased towards males because of the skewed distribution. This raises AI fairness and transparency issues, which are the subject of major debate in the financial industry and the wider research community. The problem can be rectified by giving enterprises the ability to generate, from a dataset with an imbalanced male-to-female ratio, a dataset with a balanced distribution, i.e., 50 percent males and 50 percent females. This is possible with AI generative models that operate on data types, so the scope of bias correction is much broader. As such, the same correction can be applied across any number of data features, giving organizations the ability to be fair, ethical, and inclusive in their data practices. Extending this example, it can be applied to virtually any feature present in the real data, including gender, underrepresented regions, new users with no credit history, salary levels, and different age bands.
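The 70/30 to 50/50 rebalancing described above can be sketched in a few lines. Note the important assumption: this example uses plain random oversampling as a stand-in, which only repeats existing records, whereas an AI generative model would synthesize new ones. The function name rebalance, the field names, and the toy dataset are all hypothetical.

```python
import random
from collections import Counter


def rebalance(rows, key, seed=0):
    """Oversample smaller groups (with replacement) until every group is
    as large as the largest one.

    This is a simplified stand-in for a generative model, which would
    create new records rather than repeat existing ones.
    """
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Draw the missing records at random from the same group.
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical dataset: 70 male and 30 female records.
rows = [{"gender": "male", "id": i} for i in range(70)]
rows += [{"gender": "female", "id": i} for i in range(70, 100)]

balanced = rebalance(rows, key="gender")
counts = Counter(row["gender"] for row in balanced)
print(counts)
```

The same function works for any categorical feature by changing the key argument, for example a region or an age band, which mirrors the point that the correction is not limited to gender.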
Data bias is now becoming one of the biggest challenges for enterprises and a barrier to mass AI adoption. It can be introduced anywhere within AI / ML model pipelines, from data acquisition to modelling to how model outputs are used for actionable insights and decisions. Even if these issues were somehow magically fixed, the inherent challenge would remain: real datasets are not fair because the real world is not fair. Therefore, the only feasible way to train AI / ML models at scale for the world we want to live in is via AI-powered data synthesis, which creates synthetic datasets that are fair in terms of legally protected qualities and other important dimensions or features. AI fairness is a relatively new topic, but synthetic data is set to become one of the most important tools for avoiding the perpetuation of damaging real-world trends. Indeed, there are early indications that synthetic data can be very useful while maintaining AI / ML metrics that match those organizations would get with real data.