A basic requirement for supervised machine learning is reliable ground truth data. Although machine learning techniques are advancing rapidly, they cannot perform well if the ground truth data behind the system is inadequate or faulty. Building clean, accurately labeled datasets is therefore critical, but it is both time-consuming and expensive. The standard practice today is to use a crowdsourcing platform, such as Amazon MTurk or Crowdflower, to process and label the data. This approach requires considerable care and expertise.
This project aims to automate the whole process through an AI-Driven Classifier Building Pipeline, which produces high-quality data labels and better classifiers by relying on machine learning. It works with different data modalities such as text and images.
To create the initial dataset, we use the Semantic Pipeline built at the LSIR lab, which gives access to multiple social media data sources. At first, our platform selects images at random from the image collection and presents them for labeling. As labels arrive, the classifier is trained on the newly labeled images and is used to identify the images that are most worth labeling. In the following rounds, the system no longer draws images at random; instead, it draws as many as possible of the images about which the classifier has low confidence.
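The selection step described above can be sketched as a small uncertainty-sampling routine. This is a minimal illustration, not the platform's actual code: the function names `least_confidence` and `select_for_labeling`, the least-confidence score, and the toy probabilities are all assumptions made for the example.

```python
import random


def least_confidence(probs):
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def select_for_labeling(unlabeled_ids, predict_proba, batch_size, rng=None):
    """Pick the next batch of items to present for labeling.

    predict_proba is None while no classifier exists yet (random draw).
    Afterwards it is a callable mapping an item id to class probabilities
    (a hypothetical interface, assumed for this sketch).
    """
    ids = list(unlabeled_ids)
    if predict_proba is None:
        # Cold start: no classifier yet, so sample uniformly at random.
        rng = rng or random.Random()
        return rng.sample(ids, min(batch_size, len(ids)))
    # Later rounds: take the items the classifier is least sure about.
    ranked = sorted(
        ids, key=lambda i: least_confidence(predict_proba(i)), reverse=True
    )
    return ranked[:batch_size]


# Toy predictions for four unlabeled images (made-up numbers).
toy_probs = {
    "img_a": [0.98, 0.02],
    "img_b": [0.55, 0.45],
    "img_c": [0.70, 0.30],
    "img_d": [0.51, 0.49],
}
batch = select_for_labeling(toy_probs, toy_probs.get, batch_size=2)
print(batch)
```

With the toy numbers, the two images whose top class probability is closest to 0.5 are selected first. Other uncertainty measures, such as entropy or the margin between the top two classes, could be substituted for `least_confidence` without changing the rest of the loop.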
Instead of showing images one by one in the labeling interface, we show multiple images per page in a clustered layout, so that similar images appear near each other and can be labeled as a group rather than individually. By scaling up labeling to 10 to 30 items at a time, we aim to accelerate the labeling process dramatically and to increase the size of the resulting datasets.
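One way to picture the page-building step is a greedy nearest-neighbor grouping over image feature vectors. The sketch below is a stand-in for whatever clustering method the platform actually uses; the function `group_into_pages`, the Euclidean distance, and the two-dimensional toy features are assumptions chosen to keep the example short.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def group_into_pages(features, page_size):
    """Greedily split items into pages of mutually similar items.

    features: dict mapping item id -> feature vector (list of floats).
    Each page starts from the first remaining item (the seed) and is
    filled with the seed's nearest remaining neighbors.
    """
    remaining = list(features)
    pages = []
    while remaining:
        seed = remaining[0]
        # Order the remaining items by distance to the seed; the seed
        # itself has distance 0 and therefore stays first.
        remaining.sort(key=lambda i: euclidean(features[seed], features[i]))
        pages.append(remaining[:page_size])
        remaining = remaining[page_size:]
    return pages


# Toy feature vectors (made-up numbers): two visually distinct groups.
toy_features = {
    "cat1": [0.0, 0.1],
    "dog1": [5.0, 5.1],
    "cat2": [0.1, 0.0],
    "dog2": [5.1, 5.0],
}
pages = group_into_pages(toy_features, page_size=2)
print(pages)
```

In practice the page size would be in the 10 to 30 range mentioned above, and the feature vectors would come from an image embedding rather than hand-written coordinates.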
With our platform, any classifier can be maintained throughout its lifetime, and its accuracy can be improved iteratively and more efficiently.