The data collection market

Data is the new oil. Several factors are driving growth in the data market, one of them being artificial intelligence, a major consumer of data. By 2032, the data collection market is projected to grow at a Compound Annual Growth Rate (CAGR) of 25%, from $63 billion to $109 billion (1) (2). For example, OpenAI (the company behind ChatGPT) publicly announced that it hired 1,000 external consultants, 600 of them dedicated to collecting and annotating data (3).

Many AI applications (see diagram below) will experience high growth between 2022 and 2032 (4), driven by sectors such as voice recognition, computer vision, automotive, telephony, robotics and any other form of human-machine interaction.

Targeted prospects

All the companies interested in what Ta-da has to offer have one thing in common: they are developing AI and need data to train their algorithms. Since we will launch Ta-da's first version with voice and image capabilities, we want to onboard, on the one hand, companies developing AI for voice and sound recognition and, on the other hand, companies building computer vision AI, which need specific types of images with specific annotations.

Within these companies, we want to reach research and development teams, data collection and annotation teams, and product teams.

We are not targeting specific geographies when looking for customers, as AI companies operate worldwide and their needs are generally similar: high-quality data at a fair price.

Competitive landscape

Several players are already present on the market. Most of them fall into two categories: "full outsourcing companies" and "microtasking companies".

1. Full outsourcing companies

  • They are more professionalized than microtasking companies, with their own sourcing teams, recording studios, higher-bitrate data, and a professional data sourcing offering. The result is usually higher-quality data, but also a much higher price and a longer data collection process, because of all the paperwork and human labour needed to produce the data.

  • Since the whole data collection process is vertically integrated, they are both judge and jury when it comes to delivering data. They validate the data they have produced themselves, and they only validate samples; they seldom validate 100% of the data. A customer has no way to make sure that data validation has been done, except by checking on their own.

Full outsourcing companies have two major weaknesses. The first is cost: their operational costs are high because they have a large amount of human labour to manage "manually", and they have recording equipment to maintain and operate. The second is scalability, as they often use brick-and-mortar recording studios to collect the data. On the other hand, as they are responsible for data collection, they pay more attention to quality, and they can offer "off the shelf" data sets that are ready to go, as long as they have what the customer is looking for.

Ta-da aims to offer even higher quality at a much better price, thanks to a lower cost structure and a decentralized data collection process, by introducing a hybrid model that is discussed in the How it works section.

2. Microtasking companies

  • Some of them have been around for quite some time. They are attractive for their lower prices compared to full outsourcing companies, but the quality falls short of what it should be, mainly due to a poor data validation process. For most (if not all) microtasking companies, the customer has to validate or reject the data from a producer.

  • Their model is usually attractive when you want to scale a project, because you can access large crowds fast, but these crowds today are mostly based in India and lower-income countries. Having localised communities is an issue when AI companies need diversified data. Furthermore, a common problem with crowdsourcing is that there are so-called focus groups, i.e. the same people signing up to different platforms to do the work, so there is no real diversity. To recruit in geographies that are not open to them yet, these crowdsourcing companies hire agencies or push ads, but this process is at the client's expense and is slow to set up.

In our experience, most companies rely on outdated processes and technologies to produce material that is absolutely crucial for AI companies to build cutting-edge products. Validating data to ensure its quality is, by design, hard for them to enforce, so they rely on worker ratings to offer seasoned crowds of workers. We want to drastically improve this aspect.

The good side of microtasking that Ta-da wants to keep is its scalability, since all you need to produce data is a PC or a smartphone and an internet connection. Compared to the full outsourcing method, microtasking's promise is that a data collection job will be fulfilled much faster, since the job can be broken down into separate tasks that are executed by many people in parallel.

Major players such as Amazon Mechanical Turk or Clickworker are well-known microtasking platforms that have been in this market for several years and claim to have millions of workers active every day. None of the platforms we've studied uses blockchain or other modern technologies to augment the microtasking practice, which is a shame for AI companies hoping to access quality data at a fair price, or even to have a good user experience while using these platforms.

Obtainable markets

As previously shown, the data collection market was estimated to be worth $63 billion in 2022, taking into account various verticals (image, video, text, translation, voice, sounds) and regions (worldwide). When we launch the first version of Ta-da at the end of 2023, we will be able to target a market of approximately $950 million, based on our target verticals and regions.

As voice is one of the fastest-growing verticals, and because image annotation is one of the most common needs, these are the two verticals we'll focus on for Ta-da's launch. We can achieve quick wins right from the start. You can learn more about the verticals and regions we're targeting for V1 in the go-to-market section below.

In our product roadmap, we are already preparing functionalities for further versions of Ta-da (you can get a glimpse of them in the go-to-market section), and we are also aiming at more regions as the platform's development goes along. Our obtainable market will continue to grow and can quickly double, as there are functionalities, such as video annotation, that are strongly lacking in data collection and data annotation. These functionalities could be ready a couple of months after launch, in a more elaborate version of Ta-da.

Go-to-market strategy

Our objective is to onboard a community of users and several customers. This is a crucial aspect of Ta-da's growth. As you may know, we have two go-to-market strategies:

  • A B2B-oriented strategy to find customers

  • Curating a community of producers and checkers

Current customers

Our first main target markets:

  • Voice: improving AI for speech understanding and synthetic voices for voice technology companies.

  • Computer vision: improving AI for object detection and/or facial recognition using images and video.

  • Content management: helping e-commerce and retail companies structure and classify their data.

Customer profile: companies that develop their own AI and need data to train it. Our targets will be R&D teams, data collection teams and product teams.

To increase our customer base in these target markets, we adapt our strategies to our customers' profiles, locations and business cultures. That said, they all have one thing in common: they love physical meetings. As the human link with our customers is essential, we take part in trade shows:

  • October 2022: GITEX, Dubai

  • January 2023: Consumer Electronics Show, Las Vegas

  • May 2023: Data Innovation Summit, Stockholm

  • May 2023: GITEX, Marrakesh

  • August 2023: Interspeech, Dublin

  • October 2023: GITEX, Dubai

  • January 2024: Consumer Electronics Show, Las Vegas


Our community, which must be global, will be divided into two categories, each with its own importance:

  • Web3 communities: will be our most active users in the growth phase. Why? Because they understand the incentive mechanism behind Web3 and are already in the habit of using Web3 applications. They are easier to mobilize because they are connected to very specific networks and, because there is money at stake, they will do everything they can to earn it.

  • Web2 communities: will be the majority of our users in the long term. They are the community we are really targeting, as we want Ta-da to have global adoption and fulfill our promise of diverse data. With this in mind, we are building Ta-da to be very friendly to Web2 communities, even though we are a Web3 project.
