Laying data pipelines to identify low birth weight babies


At Wadhwani Institute for Artificial Intelligence, we work on finding elegant solutions to difficult problems. One such problem is the identification of low birth weight babies. We detailed the role health workers play in a post last year. In an earlier post, we described the role AI can play and gave an overview of the challenges that lie in front of us.

Today, we want to wade deeper into how AI can take the next step. In this post, we describe how we collect 2D video data, set up the data pipeline and process the data to make it usable, along with the challenges we encountered in the process.

Data is one of the most important building blocks of any AI model, so collecting good-quality data and annotations matters for any data-based modelling. As the saying goes, “garbage in, garbage out”: if the data and labels do not carry information that is relevant and useful to the model, its predictions will be no better than garbage. Hence, it is wise to spend a fair amount of time upfront figuring out the nuances involved in data collection, annotation and further processing. Having said that, in the real world it also helps to simply start collecting data, so that unforeseen issues (the unknown unknowns) show up as early as possible and the process can be improved iteratively.

Our data collection effort is spearheaded by a mix of trained nurses, Auxiliary Nurse Midwives (ANMs) and Accredited Social Health Activists (ASHAs), spanning hospitals, primary health centres and community visits at about 60 locations across India. We collect 2D video data using a generic low-cost smartphone that is accessible to our data collectors. For each baby we also record weight, height, chest circumference, head circumference and arm circumference; these are the target variables we wish to predict during evaluation.

The collected data is processed through a data pipeline, where anonymization, human-in-the-loop validation, visualization, annotation and further processing make the data usable for modelling. The pipeline also reports the status of every step to a dashboard in real time, so that team members can monitor progress and spot any issues. This effectively shortens the feedback cycle.
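As a rough illustration of a pipeline that reports the status of each step, here is a minimal sketch. It is not our production code: the `StatusBoard` class, the step functions and the record fields (`video_id`, `mother_name`, `phone`, `num_frames`) are all hypothetical names chosen for the example, and the dashboard is replaced by an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A step is a (name, function) pair; the function takes and returns a record.
Step = Tuple[str, Callable[[dict], dict]]


@dataclass
class StatusBoard:
    """In-memory stand-in for a dashboard: video_id -> {step name: status}."""
    statuses: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def update(self, video_id: str, step: str, status: str) -> None:
        self.statuses.setdefault(video_id, {})[step] = status


def anonymize(record: dict) -> dict:
    # Hypothetical rule: drop fields that could identify a person.
    return {k: v for k, v in record.items() if k not in {"mother_name", "phone"}}


def validate(record: dict) -> dict:
    # Hypothetical sanity check: a usable video has at least one frame.
    if record.get("num_frames", 0) <= 0:
        raise ValueError("video has no frames")
    return record


def run_pipeline(record: dict, steps: List[Step], board: StatusBoard) -> dict:
    """Run the steps in order, recording the outcome of each on the board."""
    video_id = record["video_id"]
    for name, step_fn in steps:
        try:
            record = step_fn(record)
        except Exception as exc:
            # Record the failure and stop; later steps are not attempted.
            board.update(video_id, name, f"failed: {exc}")
            break
        board.update(video_id, name, "done")
    return record


STEPS: List[Step] = [("anonymize", anonymize), ("validate", validate)]
```

Because every step writes its outcome as soon as it finishes, anyone reading the board sees where a given video stopped and why, which is the property that shortens the feedback cycle.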

The overall system diagram of our data pipeline is shown above. All our data is collected using Fieldy, a mobile app built in-house, through which we record information about the baby and the corresponding 2D videos. The collected data goes through an automated pipeline where sensitive data is anonymized, and the videos are converted into sequences of images and copied to the appropriate directories on our servers. We set up a data visualization portal where team members and our manual data validation team can go through the data and annotations systematically, checking the collected data against a checklist we prepared; their feedback then feeds into the data collectors’ retraining at regular intervals. Keypoints are annotated manually on frames at a regular interval, and label propagation algorithms interpolate these annotations to cover the entire video. All the information is then standardized at our database abstraction layer, so that the data we collect from about 60 sites is normalized and queryable, while also allowing us to monitor the various blocks in the pipeline individually.
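To make the idea of annotating frames at a regular interval and filling in the rest concrete, here is a minimal sketch. It uses plain linear interpolation between the two nearest annotated frames, which is a simple stand-in and not the label propagation algorithm we actually use; the function name and data layout are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # (x, y) pixel coordinates of one keypoint
Keypoints = List[Point]       # all keypoints for one frame, in a fixed order


def interpolate_keypoints(annotated: Dict[int, Keypoints],
                          num_frames: int) -> List[Keypoints]:
    """Return keypoints for every frame 0..num_frames-1.

    `annotated` maps the index of each manually annotated frame to its
    keypoints. Frames in between are linearly interpolated; frames before
    the first or after the last annotation copy the nearest annotation.
    """
    if not annotated:
        raise ValueError("need at least one annotated frame")
    frames = sorted(annotated)
    result: List[Keypoints] = []
    for t in range(num_frames):
        # Nearest annotated frame at or before t, and at or after t.
        prev = max((f for f in frames if f <= t), default=frames[0])
        nxt = min((f for f in frames if f >= t), default=frames[-1])
        if prev == nxt:
            result.append(list(annotated[prev]))
            continue
        w = (t - prev) / (nxt - prev)  # 0 at prev, 1 at nxt
        result.append([
            ((1 - w) * x0 + w * x1, (1 - w) * y0 + w * y1)
            for (x0, y0), (x1, y1) in zip(annotated[prev], annotated[nxt])
        ])
    return result
```

For example, with a single keypoint annotated at frames 0 and 10, frame 5 gets the midpoint of the two positions. Linear interpolation ignores the image content entirely, so it only holds up when the baby moves little between annotated frames.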


For the kind of work we do at Wadhwani AI, more often than not we have to collect our datasets from the field, which adds a layer of complexity compared with crawling the internet for data and processing it. Collecting and curating a custom dataset is hard, especially when the effort is distributed, i.e. when data is collected from multiple locations and states simultaneously by multiple co-ordinators, and when the data collectors have comparatively little digital literacy.

We faced a variety of challenges, some of which we couldn’t anticipate at the outset. We’ve listed some of them below:

  1. Video quality not matching the desired quality
  2. Blurry videos
  3. Cropped videos
  4. Multiple babies within the same video
  5. Lighting and shadows
  6. Annotation quality not matching our desired quality
  7. Differences in the data collected from various sites and standardization
  8. Improving the pipeline without affecting data collection
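Some of these issues, such as blurry videos, lend themselves to an automated first pass before a human looks at the data. The sketch below scores a frame by the variance of its Laplacian, a common sharpness heuristic. It is an illustration only: the function names are hypothetical, the frame is assumed to be a grayscale grid of pixel intensities, and the threshold would have to be tuned on real videos.

```python
from statistics import pvariance
from typing import Sequence


def laplacian_variance(gray: Sequence[Sequence[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    Sharp frames have strong edges, so the Laplacian response varies a lot;
    blurry or flat frames give a response close to constant.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("frame must be at least 3x3 pixels")
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def is_blurry(gray: Sequence[Sequence[float]], threshold: float) -> bool:
    # `threshold` is an assumption; pick it by inspecting known good/bad frames.
    return laplacian_variance(gray) < threshold
```

A frame with a smooth intensity gradient scores zero, while a high-contrast pattern scores far higher. A check like this can only flag candidates for review; it does not replace manual validation.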

Where we are now

At the beginning of data collection, we spent some time training the data collectors: guiding them on what valid and invalid videos look like, and giving instructions to ensure that the baby is not cropped out during filming. We also hold a training session when onboarding new data collectors and annotators, where someone from the team gives a tutorial on the expected criteria and answers their questions.

The most important piece, though, came when we began scaling our data collection process. We made a design choice to build everything off a centralized database, which not only allowed real-time tracking of each step of our data pipeline but was also instrumental in automating it.

We can track the data collected daily, automatically send unannotated data for manual annotation, run basic sanity checks, and follow all of this via the master database in real time, so that team members can track progress and make adjustments if necessary. Recently we built a comprehensive data visualization tool that lets us view the collected videos in the browser and overlay the manual and algorithmic annotations, giving us a qualitative view of both our data and our annotations. We can also query the database to find the weight, gender and location distributions of the data, inspect the annotations, and see which annotator annotated a particular frame, all of which has proven extremely beneficial to our Products and Programs team.
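The kind of distribution query described above might look like the following sketch. The schema is hypothetical (a single `babies` table with `site`, `sex` and `weight_g` columns) and an in-memory SQLite database stands in for the master database, purely to keep the example self-contained.

```python
import sqlite3
from typing import Dict


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE babies ("
        "baby_id TEXT PRIMARY KEY, site TEXT, sex TEXT, weight_g INTEGER)"
    )
    conn.executemany(
        "INSERT INTO babies VALUES (?, ?, ?, ?)",
        [
            ("b1", "site_a", "F", 2400),
            ("b2", "site_a", "M", 3100),
            ("b3", "site_b", "F", 2450),
            ("b4", "site_b", "M", 1900),
        ],
    )
    return conn


def weight_distribution(conn: sqlite3.Connection,
                        bin_grams: int = 500) -> Dict[int, int]:
    """Count babies per weight bin, keyed by the bin's lower edge in grams."""
    rows = conn.execute(
        # Integer division groups weights into bins of `bin_grams`.
        "SELECT (weight_g / ?) * ? AS bin_start, COUNT(*) "
        "FROM babies GROUP BY bin_start ORDER BY bin_start",
        (bin_grams, bin_grams),
    ).fetchall()
    return dict(rows)


def count_by(conn: sqlite3.Connection, column: str) -> Dict[str, int]:
    """Count babies grouped by a whitelisted categorical column."""
    if column not in {"site", "sex"}:
        raise ValueError(f"unsupported column: {column}")
    rows = conn.execute(
        f"SELECT {column}, COUNT(*) FROM babies GROUP BY {column}"
    ).fetchall()
    return dict(rows)
```

Keeping all sites in one normalized table is what makes such queries short: the same two functions answer questions about a single site or about the whole dataset.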

Investing time in the data pipeline has proven to be extremely useful in scaling our data collection efforts. We have now collected about 5k videos, amounting to roughly 3M images (on the order of 600 frames per video), from 4 different states in India. We have manually annotated over 150k images, about 5% of the total, and have used algorithms to propagate annotations across frames within a video.

Our engineering team is using some of these learnings in building a generic data processing and automation pipeline to use across all our projects at WIAI.

  • Wadhwani AI

    We are an independent and nonprofit institute developing multiple AI-based solutions in healthcare and agriculture, to bring about sustainable social impact at scale through the use of artificial intelligence.

ML Engineer


An ML Engineer at Wadhwani AI will be responsible for building robust machine learning solutions to problems of societal importance, usually under the guidance of senior ML scientists and in collaboration with dedicated software engineers. To our partners, a Wadhwani AI solution is generally a decision-making tool that requires some piece of data to engage. It will be your responsibility to ensure that the information provided using that piece of data is sound. This requires not only robust learned models, but also pipelines over which those models can be built, tweaked, tested, and monitored. The following subsections provide details from the perspective of solution design:

Early stage of proof of concept (PoC)

  • Set up and structure code bases that support an interactive ML experimentation process, as well as quick initial deployments
  • Develop and maintain toolsets and processes for ensuring the reproducibility of results
  • Code reviews with other technical team members at various stages of the PoC
  • Develop, extend, or adopt a reliable, Colab-like environment for ML

Late PoC

This is early to mid-stage of AI product development

  • Develop ETL pipelines. These can also be shared and/or owned by data engineers
  • Set up and maintain feature stores, databases, and data catalogs. Ensure data veracity and lineage of on-demand pulls
  • Develop and support model health metrics

Post PoC

Responsibilities during production deployment

  • Develop and support A/B testing. Set up continuous integration and delivery (CI/CD) processes and pipelines for models
  • Develop and support continuous model monitoring
  • Define and publish service-level agreements (SLAs) for model serving. Such agreements include model latency, throughput, and reliability
  • L1/L2/L3 support for model debugging
  • Develop and support model serving environments
  • Model compression and distillation

We realize this list is broad and extensive. While the ideal candidate has some exposure to each of these topics, we also envision great candidates being experts at some subset. If either of those cases happens to be you, please apply.


Master’s degree or above in a STEM field. Several years of experience getting their hands dirty applying their craft.


  • Expert level Python programmer
  • Hands-on experience with Python libraries
    • Popular neural network libraries
    • Popular data science libraries (pandas, NumPy)
  • Knowledge of systems-level programming. Under-the-hood knowledge of C or C++
  • Experience and knowledge of various tools that fit into the model building pipeline. There are several – you should be able to speak to the pluses and minuses of a variety of tools given some challenge within the ML development pipeline
  • Database concepts; SQL
  • Experience with cloud platforms is a plus

ML Scientist


As an ML Scientist at Wadhwani AI, you will be responsible for building robust machine learning solutions to problems of societal importance, usually under the guidance of senior ML scientists. You will participate in translating a problem in the social sector to a well-defined AI problem, in the development and execution of algorithms and solutions to the problem, in the successful and scaled deployment of the AI solution, and in defining appropriate metrics to evaluate the effectiveness of the deployed solution.

In order to apply machine learning for social good, you will need to understand user challenges and their context, curate and transform data, train and validate models, run simulations, and broadly derive insights from data. In doing so, you will work in cross-functional teams spanning ML modeling, engineering, product, and domain experts. You will also interface with social sector organizations as appropriate.  


Associate ML scientists will have a strong academic background in a quantitative field (see below) at the Bachelor’s or Master’s level, with project experience in applied machine learning. They will possess demonstrable skills in coding, data mining and analysis, and building and implementing ML or statistical models. Where needed, they will have to learn and adapt to the requirements imposed by real-life, scaled deployments. 

Candidates should have excellent communication skills and a willingness to adapt to the challenges of doing applied work for social good. 


  • B.Tech./B.E./B.S./M.Tech./M.E./M.S./M.Sc. or equivalent in Computer Science, Electrical Engineering, Statistics, Applied Mathematics, Physics, Economics, or a relevant quantitative field. Work experience beyond the terminal degree will determine the appropriate seniority level.
  • Solid software engineering skills across one or multiple languages including Python, C++, Java.
  • Interest in applying software engineering practices to ML projects.
  • Track record of project work in applied machine learning. Experience in applying AI models to concrete real-world problems is a plus.
  • Strong verbal and written communication skills in English.