BeInCrypto was included in a dataset to train and improve artificial intelligence (AI) tools such as ChatGPT, according to a recent analysis.
BeInCrypto has been included in a huge dataset for training AI called C4. The Washington Post and the Allen Institute for AI recently studied Google’s C4 dataset to determine what sites were feeding into AI tools.
Many large language models have used C4 (which stands for Colossal Clean Crawled Corpus) as an instructional tool. However, Open AI‘s ChatGPT does not make use of this dataset.
Helping AI Replicate Human Speech
Large language models like C4, and that employed by ChatGPT, “scrape” the internet for content to include in their model. The vastness of the dataset allows AI to mimic human speech.
The Washington Post sorted C4’s websites using data from the web analytics company, Similarweb. Then, they ranked the top 10 million websites by the number of “tokens” they contributed.
Tokens refer to short chunks of text utilized to make sense of unstructured data, usually consisting of a word or a phrase.

The three largest contributors to the dataset were patents.google.com, wikipedia.org, and scribd.com, a subscription-based digital library. And news organizations dominated the top ranks, with the
Go to Source to See Full Article
Author: Josh Adams