Feeding artificial intelligence with public data is a mistake

In the era of rapid AI development, companies such as OpenAI, Meta, and Google are actively searching for data to train their models, scouring the internet, books, podcasts, and films to find it. But there is a better solution.

Synthetic data instead of analyzing the chaos we have created on the web

Synthetic data is data generated artificially by machine learning algorithms, often from a small amount of original data. Ali Golshan, whose company Gretel enables experimentation and building on synthetic data, says it is more secure and private than public data. Synthetic data also helps avoid the gaps, inconsistencies, and biases that often appear in raw public data.
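To make the idea concrete, here is a minimal sketch of generating synthetic records from a small original sample. This is only an illustration of the principle, not how Gretel or any production system works (those rely on deep generative models); the health-style dataset, column choices, and function names are invented for the example. It fits an independent Gaussian to each numeric column of a tiny real sample and then draws as many synthetic rows as needed, so the synthetic set preserves the sample's rough statistics without containing any real individual's record.

```python
import random
import statistics

def fit_columns(rows):
    """Estimate (mean, stdev) per numeric column from a small real sample."""
    cols = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]

def generate_synthetic(params, n, seed=0):
    """Sample n synthetic rows from the fitted per-column Gaussians."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]

# Hypothetical small "real" sample: (age, systolic blood pressure)
real = [[34, 118], [45, 130], [29, 115], [52, 140], [41, 125]]

params = fit_columns(real)          # learn the column statistics
synthetic = generate_synthetic(params, n=1000)  # expand 5 rows into 1000
```

Real synthetic-data tools additionally model correlations between columns and add formal privacy guarantees, but the payoff is the same: a dataset of arbitrary size, shaped to the task, that can be shared without exposing the original records.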

Moreover, synthetic data allows precise design of data sets tailored to specific AI applications. This makes the models more accurate and reliable.

Using public data is not that easy either

There are many challenges in using public data. First, raw data is often incomplete, which limits its usefulness in specialized applications such as predicting health outcomes. Second, growing regulatory pressure restricts data-collection practices, making it difficult for companies to access fresh, up-to-date information. Public data that lags behind current events is also treated as less valuable.

Society has already figured out what IT companies would most like to do with our data, and the era of moving fast and breaking the rules is coming to an end. Notably, companies typically use only 1-10% of the data they collect; the rest is unused ballast that only increases costs and the risk of data leaks.

Synthetic data can change this situation by enabling data to be shared securely across the organization without the risk of privacy breaches.
