New research uses an AI model to learn language from a child’s perspective, through the child’s eyes and ears
A recent study trained an AI model to acquire words and concepts from headcam video recordings of a single child, made from the age of six months until around the child’s second birthday.
The researchers showed that an artificial intelligence (AI) model could learn a significant number of words and concepts from limited slices of a child’s experience. Although only about one percent of the child’s waking time was captured on video, that was enough for genuine language learning, they said.
“Using artificial intelligence models to study real-world language learning problems that children face allows us to address classic debates about what ingredients children need to learn words—whether they need language-specific biases, innate knowledge, or just associative learning to get going,” said Brenden Lake, an assistant professor at NYU’s Center for Data Science and Department of Psychology and the senior author of the study, published in the journal Science.
To develop the model, the researchers first analyzed the child’s learning process, captured on first-person video – via a lightweight, head-mounted camera – recorded weekly from when the child was six months old until 25 months of age.
Using more than 60 hours of video footage, the team found that it contained roughly a quarter of a million word instances – the number of words communicated, many of them repeated – linked to the video frames showing what the child saw when those words were spoken.
The footage also covered a wide range of activities across the child’s development, including mealtimes, reading books and the child playing, the team said.
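As an illustration of what such frame-and-utterance pairs look like in practice, the sketch below shows one way timestamped transcript utterances could be matched to the frames visible while they were spoken. It is a hypothetical example – the field names, timing tolerance and pairing rule are assumptions, not the study’s actual data pipeline.

```python
# Hypothetical sketch: pairing transcribed utterances with the headcam frames
# that were visible while each utterance was spoken. Field names, timestamps
# and the tolerance window are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str       # transcribed speech the child heard
    start_s: float  # utterance start time, in seconds
    end_s: float    # utterance end time, in seconds

def pair_utterances_with_frames(utterances, frame_times_s, tolerance_s=0.5):
    """Return (utterance, frame_index) pairs for frames captured while the
    utterance was being spoken, within a small tolerance window."""
    pairs = []
    for utt in utterances:
        for idx, t in enumerate(frame_times_s):
            if utt.start_s - tolerance_s <= t <= utt.end_s + tolerance_s:
                pairs.append((utt, idx))
    return pairs

# Example: one utterance spoken between 12.0 s and 13.5 s of a recording,
# with frames sampled roughly every 0.6 s
utts = [Utterance("look at the ball", 12.0, 13.5)]
frame_times = [11.8, 12.4, 13.0, 13.6, 14.2]
print(pair_utterances_with_frames(utts, frame_times))  # pairs with frames 0-3
```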
The researchers then trained a multimodal neural network with two separate modules – one that took in individual frames of the video and another that took in the transcribed speech directed at the child.
These modules were combined and trained using an algorithm called contrastive learning, which aims to learn by matching up the two streams of input data, they said.
For example, they explained that when a parent says something within a child’s view, some of the words used are likely to refer to something the child can see, meaning that understanding is instilled by linking visual and linguistic cues.
“This gives the model a clue as to which words should be associated with which objects,” said Wai Keen Vong, a researcher at NYU’s Center for Data Science.
“Combining these cues is what allows contrastive learning to gradually determine which words belong with which visuals and to capture the learning of a child’s first words,” Vong said.
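For readers curious how such an objective works mechanically, the sketch below shows a generic CLIP-style contrastive setup: two small encoders map frames and utterances into a shared embedding space, and co-occurring pairs are pulled together while mismatched pairs in the same batch are pushed apart. The architecture, dimensions and training details here are placeholders, not the published model.

```python
# A minimal sketch of a CLIP-style contrastive objective over paired frames
# and utterances. All sizes and design choices are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    def __init__(self, image_feat_dim=512, vocab_size=1000, embed_dim=128):
        super().__init__()
        # Vision module: projects precomputed frame features into the shared space
        self.vision_proj = nn.Linear(image_feat_dim, embed_dim)
        # Language module: embeds word IDs and mean-pools them per utterance
        self.word_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, image_feats, word_ids):
        img = F.normalize(self.vision_proj(image_feats), dim=-1)          # (B, D)
        txt = F.normalize(self.word_embed(word_ids).mean(dim=1), dim=-1)  # (B, D)
        return img, txt

def contrastive_loss(img, txt, temperature=0.07):
    # Similarity of every frame to every utterance in the batch;
    # the diagonal holds the true (co-occurring) pairs.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 8 frame-feature vectors paired with 8 five-word utterances
model = TwoTowerModel()
img, txt = model(torch.randn(8, 512), torch.randint(0, 1000, (8, 5)))
print(contrastive_loss(img, txt).item())
```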
After training the model, the team tested it by presenting a target word alongside a set of four different image options and asking the model to choose the image corresponding to the target word.
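Scoring such a test reduces to a nearest-neighbour comparison in the model’s shared embedding space. The sketch below illustrates that idea, with random vectors standing in for learned word and image representations; it is not the study’s evaluation code.

```python
# Hypothetical sketch of scoring a four-alternative trial: pick the candidate
# image whose embedding is most similar to the target word's embedding.
# Random vectors stand in for the model's learned representations.
import torch
import torch.nn.functional as F

def pick_image(word_vec, image_vecs):
    """word_vec: (D,) embedding of the target word; image_vecs: (4, D)
    embeddings of the candidate images. Returns the index of the best match."""
    word_vec = F.normalize(word_vec, dim=-1)
    image_vecs = F.normalize(image_vecs, dim=-1)
    sims = image_vecs @ word_vec  # cosine similarity to each candidate
    return int(sims.argmax())

torch.manual_seed(0)
choice = pick_image(torch.randn(128), torch.randn(4, 128))
print(f"picked image {choice} of 4")
```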
The model was able to learn a “significant” number of words and concepts that are present in a child’s everyday experience, the researchers said.
Additionally, the model was able to generalize some of the words it learned to visual instances that differed from those it saw in its training data.
According to the researchers, this reflected a generalization that is also observed in children when they are studied in the laboratory.