Wrangle Your Data: How to Prep Content to Get the Best Value from Your AI
Google recently unveiled its latest AI-integrated search engine—and the internet didn’t hold back, roasting it for suggesting recipes like glue-infused pizza sauce and recommending rocks as a nutritious snack. The tech giant’s AI bot could scrape the web’s boundless resources and serve up answers, but apparently, it couldn’t tell sincerity from sarcasm or facts from wild fiction. This fiasco is just another reminder of why data governance really matters.
Data governance refers to how companies manage the vast amounts of information that feed their business—including the data that is ingested by an AI algorithm. As data strategist Malcolm Hawker describes in Forbes, AI-based systems consume data differently than other software and analytics tools. While traditional software is designed to optimize structured data in tabular formats, generative-AI tools, in particular large language models, excel at making sense of unstructured data. Unstructured data, essentially information in its raw form, can be scattered across an organization among disparate systems and tools. Hawker explains: “Data consumed by AI-based systems must be accurate, consistent and explainable to mitigate any suboptimal behaviors of AI, yet the scope of data governance processes at most companies does not encompass the very data these AI systems prefer to consume.”
Furthermore, AI systems, unlike traditional software, rely heavily on patterns derived from vast datasets to generate accurate and meaningful outputs. In general, the more relevant, well-curated data AI models are exposed to, the better they can capture complex nuances, reduce biases and improve their predictive accuracy. For organizations, this means that effective data governance should be comprehensive, spanning as much data as is practical and responsible.
Though AI is often portrayed as machine-driven, the information it pulls from still requires thorough human oversight. To maximize the benefits of AI systems across all facets of a business, companies should ensure their data is clean and thoughtfully organized. Ahead, we outline four steps for corralling the huge amounts of information used by AI programs.
Step 1: Conduct a Data Audit
Before implementing AI, conduct a thorough audit of all relevant data. Begin by cataloging all data sources and reviewing each dataset’s provenance, quality and relevance to the intended AI application. Identify which datasets are incomplete, redundant, outdated or unstructured, and confirm that proper permissions and controls are in place. Firm leaders and IT teams may also need to review the license rights and consents that apply to the data.
When conducting a data audit, examining the availability and condition of the data is just the start. Equally important is understanding licensing rights and confidentiality protections. Data licenses often determine the extent to which data can be used, modified or distributed. In the context of preparing data for AI, this means ensuring that all datasets are properly licensed for the intended purpose, particularly if data processing spans multiple jurisdictions with different licensing requirements. Companies should verify that their data sources permit the kinds of analysis AI requires, including any transformation, modeling or sharing, and that the data is not subject to confidentiality obligations or other legal limitations that would further restrict its use. Compliance with these terms not only mitigates legal risk but also helps avoid interruptions during implementation. Repeating the audit regularly helps maintain compliance as legal requirements or the data itself evolve, ensuring a reliable foundation for AI governance.
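To make the audit concrete, here is a minimal Python sketch of a dataset inventory check. It is an illustration only: the `DatasetRecord` fields, the `audit` function, the one-year staleness threshold and the sample datasets are all assumptions made for this example, not a standard or a legal test.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Set


@dataclass
class DatasetRecord:
    """One entry in a hypothetical data inventory."""
    name: str
    owner: Optional[str]          # None means no accountable owner is recorded
    last_updated: date
    licensed_uses: Set[str]       # uses the license or consent is assumed to cover
    contains_confidential: bool


def audit(records: List[DatasetRecord], intended_use: str,
          today: date, max_age_days: int = 365) -> Dict[str, List[str]]:
    """Return dataset name -> list of findings; clean datasets are omitted."""
    findings: Dict[str, List[str]] = {}
    for record in records:
        issues = []
        if record.owner is None:
            issues.append("no owner")
        if (today - record.last_updated).days > max_age_days:
            issues.append("outdated")
        if intended_use not in record.licensed_uses:
            issues.append("license does not cover intended use")
        if record.contains_confidential:
            issues.append("confidentiality review needed")
        if issues:
            findings[record.name] = issues
    return findings


# Hypothetical inventory used only to exercise the sketch.
inventory = [
    DatasetRecord("crm_contacts", "sales-ops", date(2024, 5, 1),
                  {"analytics", "model_training"}, True),
    DatasetRecord("legacy_tickets", None, date(2019, 3, 15),
                  {"analytics"}, False),
]

findings = audit(inventory, intended_use="model_training",
                 today=date(2024, 10, 1))
for name, issues in findings.items():
    print(name, "->", ", ".join(issues))
```

A report like this only surfaces candidates for review. Whether a given license or consent actually covers an AI use case remains a question for the people who own the data and for counsel.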
Step 2: Clean and Organize Your Data for AI Efficiency
Often, companies have decades’ worth of data lurking in their systems, like a digital attic filled with mismatched keepsakes. Not only is the sheer volume of information enough to induce a headache, but the formats are often a chaotic mix of videos, text, images and more, all siloed in different places across the company’s sprawling systems. It’s like trying to assemble a jigsaw puzzle when half the pieces are hidden in the garage, some are in the attic, and a few are inexplicably in the fridge. Cleaning and streamlining this content before it reaches a new AI system can help prevent a catastrophe like Google’s. Removing duplicates, standardizing formats and correcting errors, even ones as innocuous as a typo, can all reduce biased or skewed results. Other steps may include tagging and archiving old data that you’d like to keep out of an AI system’s reach.
Automated tools can help identify missing or outdated data, flagging gaps that need to be addressed. Redundancies should be eliminated by cross-referencing datasets and removing duplicates or overlapping information that could skew AI outputs. A variety of platforms can aid in spiffing up your data, and some might already be at your fingertips. General-purpose tools such as Python and Excel can handle much of the process. Tableau and similar programs can add visualization, while SAS Viya, Oracle Enterprise Data Quality, Integrate.io and many others offer dedicated data quality software.
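As a rough illustration of the cleanup steps described above, the following Python sketch standardizes a few fields, drops duplicate records and sets aside rows with gaps. The field names (`email`, `name`, `updated`), the accepted date formats and the sample rows are assumptions chosen for the example, and it uses only the standard library.

```python
from datetime import datetime

# Date formats this sketch assumes may appear in the source systems.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_date(value):
    """Return an ISO date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(rows):
    """Standardize fields, drop duplicates, and collect rows with gaps."""
    seen = set()
    cleaned, flagged = [], []
    for row in rows:
        record = {
            "email": row.get("email", "").strip().lower(),
            "name": " ".join(row.get("name", "").split()),  # collapse whitespace
            "updated": normalize_date(row.get("updated", "")),
        }
        # Rows with a missing email or unreadable date go to human review.
        if not record["email"] or record["updated"] is None:
            flagged.append(record)
            continue
        key = (record["email"], record["name"], record["updated"])
        if key in seen:  # same record after standardization: a duplicate
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned, flagged


# Hypothetical rows used only to exercise the sketch.
rows = [
    {"email": " Ana@Example.com ", "name": "Ana  Lopez", "updated": "2024-03-05"},
    {"email": "ana@example.com", "name": "Ana Lopez", "updated": "03/05/2024"},
    {"email": "", "name": "No Email", "updated": "2024-01-01"},
    {"email": "bo@example.com", "name": "Bo Chen", "updated": "last spring"},
]

cleaned, flagged = clean(rows)
print(len(cleaned), "clean rows;", len(flagged), "rows need review")
```

Note that the second row only registers as a duplicate after its email and date are standardized, which is why standardizing before deduplicating tends to catch more redundancy. Flagged rows are kept rather than silently dropped so a person can decide what to do with them.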
Though regulatory considerations may seem like they should be several steps down the line, data legislation must be taken into account early in the cleanup process. Many laws and regulations, such as the General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), Biometric Information Privacy Act (BIPA), the EU AI Act and the Health Insurance Portability and Accountability Act of 1996 (HIPAA), impose stringent requirements on data processing, and mishandling inaccurate or out-of-date data can lead to significant fines.
Step 3: Set Clear Data Governance Policies
Establish data governance rules to ensure content is consistently monitored, organized and protected. Assign clear ownership roles and responsibilities for data quality. These roles become especially relevant when collaborating with third-party vendors.
When drafting data policies, Atlan, a data management company, suggests focusing on several key areas such as data quality, privacy, security, lifecycle and ethics. It is also valuable to ensure that central concepts, terms, domains and business needs of the organization are defined, so that all team members “speak the same language.” Not only must companies set these policies, but they should be prepared to enforce them, too.
In addition, California recently passed a number of bills pertaining to data used in AI models, and other states are looking at similar legislation. Specifically, AB 2013 requires AI companies to be more transparent about the data they use to train their models, and it could become a standard that other states mirror in their own jurisdictions. Given where the regulatory landscape is heading, companies developing AI systems should consider whether to adopt similar transparency standards, even if such laws do not currently apply to them.
The legal and regulatory landscape is still evolving, but it is likely to become more restrictive than today’s wild west. Proactive governance means an organization already has guardrails and a compliance infrastructure in place when new rules arrive.
Step 4: Monitor and Maintain Ongoing Compliance
Once data has been audited, cleaned and organized, organizations must finally establish processes for ongoing monitoring and compliance. AI systems continuously learn and adapt, and so too should your data governance.
To ensure your data governance policies remain effective, it is advisable to revisit them on a regular basis—at least annually. This review process should account for any organizational changes, such as mergers, new data sources or shifts in business strategy, which may require adjustments to data management practices. Incorporating new data effectively also involves reassessing governance policies to verify that they align with evolving data quality standards, privacy requirements and legal obligations. Periodic reviews help identify gaps, ensure compliance with the latest regulations, and maintain the high quality and relevance of your data.
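One way to keep such reviews from slipping is a simple schedule check. The Python sketch below flags a policy when more than a year has passed since its last review or when a relevant organizational change has occurred since then. The policy names, dates, the 365-day interval and the `reviews_due` helper are hypothetical, chosen only to illustrate the idea.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=365)  # "at least annually"


def reviews_due(last_reviewed, change_events, today):
    """Map each policy that needs review to the reasons it was flagged."""
    due = {}
    for policy, reviewed_on in last_reviewed.items():
        reasons = []
        if today - reviewed_on >= REVIEW_INTERVAL:
            reasons.append("annual review overdue")
        for affected, event_date, description in change_events:
            # Only changes that happened after the last review matter.
            if affected == policy and event_date > reviewed_on:
                reasons.append(f"change since last review: {description}")
        if reasons:
            due[policy] = reasons
    return due


# Hypothetical policies and events used only to exercise the sketch.
last_reviewed = {
    "privacy": date(2023, 6, 1),
    "retention": date(2024, 4, 1),
    "security": date(2024, 8, 1),
}
change_events = [("retention", date(2024, 7, 15), "new data source")]

due = reviews_due(last_reviewed, change_events, today=date(2024, 10, 1))
for policy, reasons in due.items():
    print(policy, "->", "; ".join(reasons))
```

Here the privacy policy is flagged on age alone, while the retention policy is flagged because a new data source arrived after its last review, even though it was reviewed within the year.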
Assigning dedicated teams to monitor legal developments helps maintain adherence to new rules and prevents potential fines or liabilities. Regularly updated training for employees involved in data governance can further reinforce compliance and create a culture of accountability within the organization.
Conclusion: Well-Prepared Data Is a Strategic Asset
Legal systems are still working out exactly how to handle the challenges that arise from data governance mishaps, but it’s likely that companies will increasingly be held accountable for the quality of their content and how their AI tools interact with it. Earlier this year, Air Canada lost a small claims case in which its AI-powered chatbot had provided false information regarding the company’s bereavement fares. In trying to pull material from the airline’s website, the chatbot contradicted Air Canada’s guidelines. Ultimately, the tribunal found that Air Canada was responsible for claims made by its chatbot. While not exactly “bet the company” litigation (the award was only $812.02), the decision indicates that businesses will be expected to tame their data so that it’s accurately represented by AI tools.
To truly unlock the power of AI, companies need to treat their data like the crown jewel it is. Clean, compliant and well-governed data isn’t just about avoiding embarrassing AI blunders or costly legal headaches—it’s about laying a rock-solid foundation for innovative, game-changing AI that propels business goals forward. By diving into detailed data audits, streamlining and standardizing information, and setting up airtight governance policies, companies can tap into the real value of AI, boosting efficiency and gaining a serious competitive edge. Well-prepared data doesn’t just fuel AI; it supercharges smarter, more informed decisions that drive success across the entire organization.