Originally posted by woohoo
My opinion was that master data with a finite match in the public arena, such as an address or company, would be a good use of AI: train it on the various abbreviations, format variants (internationally that is a doozy), character sets such as Cyrillic or Kana, etc., and it would just blow away most matches.
A brute-force cleansing I personally did on a real UK dataset one customer had got a 60-80% match (60% with rough matching and 80% after writing about 50 remap rules of the kind sketched below). This was a bit of a worry, as they owned all the properties!
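To be concrete about what I mean by remap rules, here is a minimal sketch in Python. The rules and the address are made up for illustration; a real set runs to dozens of rules and handles far more than abbreviations.

```python
# Hypothetical abbreviation remap rules applied before matching.
REMAP_RULES = {
    "rd.": "road", "rd": "road",
    "st.": "street", "st": "street",
    "ave.": "avenue", "ave": "avenue",
}

def normalise_address(raw: str) -> str:
    """Lower-case the address, drop comma noise and apply the remap rules."""
    tokens = raw.lower().replace(",", " ").split()
    return " ".join(REMAP_RULES.get(t, t) for t in tokens)

print(normalise_address("12 High St., CAMBRIDGE"))  # -> "12 high street cambridge"
```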
A similar match with one of the largest company information suppliers and lots of consultants, across a million or so customer records, got to about 70%. So only 300,000 records to fix manually, because they used European address formats or had Cyrillic. Using a well-trained AI with a confidence setting of 98% would probably reduce that to 60,000 records very quickly.
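Something like the sketch below is what I mean by a confidence setting: auto-accept anything the matcher scores at 98% or better and push the rest to a manual queue. It uses the standard library's difflib as a stand-in for a properly trained AI matcher, and the reference lookup is a hypothetical function.

```python
from difflib import SequenceMatcher

def match_confidence(candidate: str, reference: str) -> float:
    """Crude similarity score standing in for a trained matcher's confidence."""
    return SequenceMatcher(None, candidate.lower(), reference.lower()).ratio()

def triage(records, reference_lookup, threshold=0.98):
    """Split records into auto-accepted matches and a manual-review queue."""
    auto_accepted, manual_queue = [], []
    for rec in records:
        ref = reference_lookup(rec)  # hypothetical lookup against the reference data
        if ref and match_confidence(rec, ref) >= threshold:
            auto_accepted.append((rec, ref))
        else:
            manual_queue.append(rec)
    return auto_accepted, manual_queue
```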
But when you get to government data it's quickly in the billions once you add people and transaction data. IoT is the same.
Reference data is frequently a rule set: if OldCategoryA = 1 and OldCategoryB = 7 then the new category = V17. You need to document and explain that so retrospective reports make sense; just doing it with a black box is not on.
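A sketch of how that can stay explainable rather than black-box: keep a human-readable reason alongside each rule so a retrospective report can be traced back. The codes are just the ones from my example above and the reason text is invented.

```python
# Reference-data remap rules with an audit trail for retrospective reporting.
RULES = [
    {
        "when": {"OldCategoryA": 1, "OldCategoryB": 7},
        "then": {"NewCategory": "V17"},
        "reason": "A=1 with B=7 merged into V17 in the category restructure",
    },
]

def remap(record: dict) -> dict:
    """Apply the first matching rule and record why the value changed."""
    for rule in RULES:
        if all(record.get(k) == v for k, v in rule["when"].items()):
            return {**record, **rule["then"], "_audit": rule["reason"]}
    return {**record, "_audit": "no rule matched - left unchanged"}

print(remap({"OldCategoryA": 1, "OldCategoryB": 7}))
```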
I expect tools like these to be produced quickly for address brokers.
Now, data inside systems may not have the variance of obvious stupidity that produces the incorrect addresses. Also, working out what is wrong normally requires an SME to identify it, and after doing the data analysis you normally need to agree and document a fix before changing data. Maybe making the AI an SME-grade analyst first is a good step, so teach it about countries, states, currencies and languages so it can spot outliers easily, like someone in Abu Dhabi having a state called Cambridge.
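A sketch of that kind of SME check, with a deliberately tiny and incomplete country-to-state table just to show the outlier flagging; it only surfaces candidates for review, it does not change anything.

```python
# Illustrative (incomplete) reference table: which states/regions are valid per country.
VALID_STATES = {
    "United Arab Emirates": {"Abu Dhabi", "Dubai", "Sharjah"},
    "United Kingdom": {"England", "Scotland", "Wales", "Northern Ireland"},
}

def flag_outliers(records):
    """Yield records whose state is not valid for their country, for SME review."""
    for rec in records:
        allowed = VALID_STATES.get(rec.get("country"), set())
        if rec.get("state") not in allowed:
            yield rec

rows = [{"city": "Abu Dhabi", "country": "United Arab Emirates", "state": "Cambridge"}]
print(list(flag_outliers(rows)))  # flags the Abu Dhabi record with state "Cambridge"
```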
Once you have done address, company, country, etc., there is loads of unstructured data you can move on to.