TranscribeMe creates structured data sets for customers to use to create or enhance machine learning models.
Before getting to case studies illustrating this work, two terms need to be defined or clarified: “structured data” and “AI.”
I consider AI to be a misnomer. Intelligence is intelligence; excluding all other flora and fauna, it divides into human or machine. So for me, there’s nothing artificial about an intelligent machine. It’s simply not human.
Learning Through Structured Data
Consider how humans learn. A newborn is pretty much helpless, but from birth it packs an enormously powerful and complex brain that from day one is collecting, integrating, and assimilating environmental data, including speech. Without speech, the child is in stealth mode, but the right brain is hyper-engaged in an activity that data scientists would call unsupervised learning.
As the child grows, structured data is introduced in the form of books. Initially, a parent may read to the child and point out elements in the story. For example, while reading “Goodnight Moon,” the parent might say, “Moon,” then point to its picture, tying the word to a visual. That is data annotation!
As children continue to learn, the enormous capacity of the brain to log, store, and collate data comes into play and the children become, for the most part, autonomous learners.
A newborn machine has neither a right brain nor the nearly unlimited data capacity of a human brain to begin learning and storing data. It’s estimated that a human brain can store 2.5 petabytes of information. That would be equivalent to a DVR recording continuously for 300 years!
A newborn machine begins its quest for intelligence at the Goodnight Moon stage where a pairing takes place: an audio recording of the word “moon” with the written word, or an image of the moon with an audio recording of the word.
As is the case with the child learner, this is data annotation.
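The pairing described above can be sketched in a few lines of code. This is a minimal, hypothetical illustration of annotated training pairs; the file names and field names are made up for the example, not an actual annotation schema.

```python
# A minimal sketch of annotated training pairs: each record ties a raw
# input (an audio clip or an image file) to a human-supplied label.
# File names and field names here are illustrative only.
annotated_pairs = [
    {"audio": "clips/moon_001.wav", "label": "moon"},
    {"audio": "clips/moon_002.wav", "label": "moon"},
    {"image": "pics/moon_full.png", "label": "moon"},
]

def labels(pairs):
    """Collect the distinct labels the machine can learn from."""
    return sorted({p["label"] for p in pairs})

print(labels(annotated_pairs))  # prints ['moon'] -- the only concept it knows
```

At this stage the machine’s entire world is the word “moon,” exactly like the child at the Goodnight Moon stage.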
An example of structured data could be, let’s say, a complex set of data defining all North American songbirds at the exclusion of all else. This would produce an intelligent machine that could identify every single songbird on the continent. But it couldn’t tell us a thing about butterflies! And there would be nothing in its database or algorithmic logic to take it from songbird to butterfly.
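The songbird limitation can be made concrete with a short sketch. This is a hypothetical illustration, with made-up labels and scores, of how a trained classifier’s output space is fixed by the labels in its structured training data.

```python
# Hypothetical illustration: a classifier can only ever answer with one
# of the labels it was trained on. Labels and scores are invented.
SONGBIRD_LABELS = ["american robin", "northern cardinal", "house finch"]

def classify(confidence_scores):
    """confidence_scores: one model score per known label, in order."""
    best_index = max(range(len(SONGBIRD_LABELS)),
                     key=lambda i: confidence_scores[i])
    return SONGBIRD_LABELS[best_index]

# Even if the input were a photo of a butterfly, the model can only
# respond with a songbird label:
print(classify([0.2, 0.1, 0.7]))  # prints "house finch"
```

No matter what it is shown, the model’s answer comes from its fixed label set; the butterfly simply does not exist in its universe.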
A new set of structured data must be created and assimilated for every new thing we want our machine to learn. It’s always been this way from the beginning of time, machine learning time, that is.
Here’s a quote from Wikipedia in the article, Expert System: “In the late 1950s… biomedical researchers started creating computer-aided systems for diagnostic applications in medicine and biology. These early diagnostic systems used patients’ symptoms and laboratory test results as inputs to generate a diagnostic outcome.” Even for the first machines, data annotation was required.
From the 1950s until now, all machine learning has required data annotation to create structured datasets that create or enhance machine learning models. There have been many claims of unsupervised learning, but in the cases we’ve seen, those claims have not held up. Machines have become more sophisticated in their data collection, but a machine still needs to be trained for a specific use.
Use Cases for Annotated Data
Every day AI and machine learning technologies are delivering astounding accomplishments that benefit a broad spectrum of fields and people around the world, encompassing areas such as software development, cybersecurity, medicine, engineering, customer service, finance, manufacturing, and more.
But scientists, technologists, and huge industries are not the only ones reaping the benefits of machine learning. Small businesses and individuals alike are beginning to understand that data collection and analysis are now the norm, so it is no wonder that AI and machine learning are among the fastest growing technologies globally.
The data behind these technologies includes audio, images, video, podcasts, and more. Simply put, data is labeled to make it comprehensible to AIs. The key is the accuracy of the data sets, and quantity matters as well: more data means greater variety in verbiage and context.
This is where TranscribeMe comes in. We have been asked to provide annotated data for a variety of use cases. And we have teams that are specially trained to label and process data appropriately for any given project. Here are just a few examples:
We Train ASRs
As technology advances and as more general transcribed audio becomes available on the net, ASR systems can scrape this data and self-train to a degree. We’re currently working with a company that is actively doing this and has produced very good results, but not great results. Consequently, they have come to us to acquire what is considered the gold standard in training data: human-transcribed and annotated audio to text. That human factor is what it takes to make a good ASR a much better ASR.
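One way to picture the gold standard is a record that pairs the machine’s own output with a human-verified transcript, so the model can be trained on its mistakes. The record below is a hypothetical sketch; the field names and file path are illustrative, not TranscribeMe’s actual format.

```python
# Hypothetical "gold standard" training record: an ASR hypothesis paired
# with a human-verified transcript of the same audio. Fields are made up.
gold_record = {
    "audio": "calls/support_0042.wav",
    "asr_hypothesis": "the customer wants a refund on there order",
    "human_transcript": "The customer wants a refund on their order.",
}

def word_error_count(hypothesis, reference):
    """Crude position-by-position mismatch count between ASR output
    and the human reference, ignoring case and a trailing period."""
    hyp = hypothesis.lower().strip(".").split()
    ref = reference.lower().strip(".").split()
    return sum(h != r for h, r in zip(hyp, ref)) + abs(len(hyp) - len(ref))

errors = word_error_count(gold_record["asr_hypothesis"],
                          gold_record["human_transcript"])
print(errors)  # prints 1 -- the "there"/"their" confusion a human catches
```

Errors like the homophone above are exactly what self-training on scraped audio tends to miss, and what human annotation corrects.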
References
Ledley RS, Lusted LB (1959). “Reasoning foundations of medical diagnosis”. Science. 130 (3366): 9–21. Bibcode:1959Sci...130....9L. doi:10.1126/science.130.3366.9. PMID 13668531.
Weiss SM, Kulikowski CA, Amarel S, Safir A (1978). “A model-based method for computer-aided medical decision-making”. Artificial Intelligence. 11 (1–2): 145–172. doi:10.1016/0004-3702(78)90015-2.