Medicare Fraud Detection Becoming Possible Through Machine-Learning Algorithms

Researchers from Florida Atlantic University’s College of Engineering and Computer Science have published a study in Health Information Science and Systems showing how machine learning and advanced analytics could be used to detect Medicare fraud. The approach could help recover some of the estimated $19 billion to $65 billion in Medicare funds lost to fraud each year.

The researchers tested six different machine learners on both balanced and imbalanced data sets built from Medicare Part B data. They found that the RF100 random forest algorithm was the most effective at detecting potentially fraudulent claims, and that imbalanced data sets produced the most accurate results.

The research team used four years' worth of Medicare Part B data, totaling 37 million cases, and examined it for potential patient abuse, neglect, overcharging, and charging for services that were never provided. They used the National Provider Identifier (NPI), a unique identification number issued to healthcare providers by the government, to match fraud labels to the data, checking against provider details, payments and charges, procedure codes, total procedures performed, and medical specialty.
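The label-matching step can be pictured as a lookup keyed on the NPI. The following is only a minimal sketch, not the study's actual pipeline: the record fields, the NPI values, and the `attach_fraud_labels` helper are all hypothetical and invented for illustration.

```python
# Hypothetical claim records; every value below is made up for illustration.
claims = [
    {"npi": "1000000001", "specialty": "Cardiology", "procedure_code": "X1001",
     "total_procedures": 120, "avg_payment": 17.50},
    {"npi": "1000000002", "specialty": "Cardiology", "procedure_code": "X1001",
     "total_procedures": 950, "avg_payment": 42.00},
    {"npi": "1000000003", "specialty": "Dermatology", "procedure_code": "X2002",
     "total_procedures": 60, "avg_payment": 88.25},
]

# Hypothetical set of NPIs that carry a known fraud label.
fraud_npis = {"1000000002"}


def attach_fraud_labels(records, labeled_npis):
    """Return copies of the records with a boolean 'fraud' field,
    set to True when the record's NPI appears in labeled_npis."""
    return [{**record, "fraud": record["npi"] in labeled_npis} for record in records]


labeled = attach_fraud_labels(claims, fraud_npis)
for record in labeled:
    print(record["npi"], record["specialty"], record["fraud"])
```

Because the NPI is unique to each provider, a set lookup like this is enough to join the two sources without comparing any other field.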

The machine learners then compared each provider's data with the statistical profile of that physician’s specialty, singling out unusual behavior and flagging it as potential fraud. The researchers found that the “sweet spot” for detecting fraud cases was a data set in which as few as 10 percent of filings were potentially fraudulent, a much lower share than they expected.
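One common way to build a data set with a chosen share of fraud cases is random undersampling: keep every fraud-labeled record and randomly discard non-fraud records until the target ratio is reached. The sketch below assumes that technique for illustration; the `undersample` function, its parameters, and the toy records are hypothetical and are not taken from the study.

```python
import random


def undersample(records, minority_fraction=0.10, seed=0):
    """Keep all fraud-labeled records and a random subset of the others,
    so that fraud makes up roughly minority_fraction of the result."""
    fraud = [r for r in records if r["fraud"]]
    legit = [r for r in records if not r["fraud"]]
    # Number of non-fraud records needed to hit the target ratio.
    n_legit = round(len(fraud) * (1 - minority_fraction) / minority_fraction)
    n_legit = min(n_legit, len(legit))  # cannot keep more than exist
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = fraud + rng.sample(legit, n_legit)
    rng.shuffle(sample)
    return sample


# Toy data: 5 fraud records among 1,000 (0.5 percent fraud).
records = [{"id": i, "fraud": i < 5} for i in range(1000)]
resampled = undersample(records, minority_fraction=0.10)
fraud_share = sum(r["fraud"] for r in resampled) / len(resampled)
print(len(resampled), fraud_share)
```

Here the 5 fraud records are kept and 45 of the 995 non-fraud records are drawn at random, giving a 50-record set in which fraud accounts for 10 percent.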

“There are so many intricacies determining what is fraud and what is not fraud, such as clerical error,” Richard A. Bauder, a senior author and PhD student at FAU, said in Health Care Analytics News. “Our goal is to enable machine learners to cull through all this data and flag anything suspicious. Then we can alert investigators and auditors, who will only have to focus on 50 cases instead of 500 cases or more.”

Machine learning is a field of artificial intelligence in which statistical techniques allow computer systems to progressively improve their performance on a task without being explicitly programmed. Email filtering is one example of machine learning in everyday use.

An imbalanced data set is one in which the classes are not evenly represented. With such data, overall accuracy can look high even when an algorithm performs poorly on the smaller class. For example, consider a small data set of 100 items, 90 of Type A and 10 of Type B. An algorithm that predicts that all 100 are Type A is 90 percent accurate, which sounds good, but it correctly identifies 0 percent of the Type B items.
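The arithmetic in that example can be checked with a few lines of Python. This is a minimal sketch using the same 90/10 split; the `recall` helper is a hypothetical name introduced here for the per-class hit rate.

```python
# 90 items of Type A and 10 of Type B, as in the example.
actual = ["A"] * 90 + ["B"] * 10
predicted = ["A"] * 100  # an algorithm that always predicts Type A

# Overall accuracy: share of all items predicted correctly.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual, predicted, label):
    """Fraction of the items that truly belong to `label`
    that were also predicted as `label`."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
    total = sum(1 for a in actual if a == label)
    return hits / total


print(accuracy)                        # 0.9
print(recall(actual, predicted, "A"))  # 1.0
print(recall(actual, predicted, "B"))  # 0.0
```

Overall accuracy hides the failure on Type B, which is why a per-class measure is more informative when the rare class, such as fraud, is the one that matters.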