
Episode 19: Masked Image Modeling

By Courtney Clark

Show Notes

In November of 2022, ChatGPT was released to the public. Since then, its popularity has increased dramatically, and it now has roughly one hundred million users. This natural language processing tool has been hailed by users for its intelligence and usefulness. If a natural language processing tool could have such an amazing impact on how we learn, what would happen if we applied it to other types of data?

Transcript


Courtney

 

In November of 2022, ChatGPT was released to the public. Since then, its popularity has increased dramatically, and it now has roughly one hundred million users. This natural language processing tool has been hailed by users for its intelligence and usefulness. If a natural language processing tool could have such an amazing impact on how we learn, what would happen if we applied it to other types of data?

 

Eliza

 

[0:21] The current state is that there is more data coming in via a variety of sensors – be it satellites, airborne sensors, traffic cams, any of these things – there’s more data coming in than analysts have time to actually look through. It takes a lot of training to become an imagery analyst, and there are people whose entire expertise is extracting the magic information from these images. And it’s really incredible, right? But there’s just not enough people that are trained to do that. And it’s just gonna keep getting worse. Storage is cheap, sensors are getting better all the time, and people are launching more things into space because that’s getting more accessible too.

 

Courtney

 

[1:00] That was Eliza Mace, a lead computer engineer who works in supervised learning here at MITRE. As she explains, we live in a data-rich world, and the volume of data is only going to increase – particularly when it comes to visual data, be it from a satellite, a traffic camera, an airplane, or even a medical device. All of that data needs to be reviewed and contextualized by an analyst on the other end. But what if we could take the bulk of the legwork out of that process? Masked Image Modeling does just that. By utilizing the same techniques that ChatGPT was trained on, we can make sensors smarter and faster than if we had used supervised learning.

 

Eliza

[1:37] So the idea here is that, all those images that don’t have time to be looked at, can we have a machine take a quick glance at them? Can it throw a flag when “oh, something is different here,” or “something is interesting here”? Or even the images that people are going to look at as part of their workflow – maybe it’s an interesting part of the world, there’s a conflict going on, and their job is to look every day at the imagery that comes in from that area, right? Could we prioritize that imagery for them in some way, because a machine got the first look?

 

Courtney

[2:07] Hi, and welcome to MITRE’s Tech Futures podcast. I’m your host, Courtney Clark, and I’m a cyber business strategist here at MITRE. At MITRE, we offer unique vantage points and objective insights that we share in the public interest. And in this podcast series, we showcase emerging technologies that will affect the government and our nation in the future.

 

Today, we’re going to talk about masked image modeling, or MIM, a technique spearheaded by Liya Wang and Alex Tien that aims to minimize the effort necessary to understand an image by making sensors smarter, more efficient, and more effective. We’re going to hear about the idea behind the research, as well as how it’s already made an impact in this space, particularly when it comes to aerial imaging.

 

Before we begin, I want to say a huge thank you to Dr. Kris Rosfjord, the Tech Futures Innovation Area Leader in MITRE’s independent research and development program that supported this effort. This episode would not have happened without her support. Now, without further ado, I bring you MITRE’s Tech Futures podcast, episode number 19.

 

Liya

[3:08] Computers cannot recognize images or videos by themselves, so we must transform the raw image or video into an embedding vector. This whole process is called representation learning. Usually we have two methods to do it: one is supervised learning, and another is unsupervised learning. Supervised learning needs tons and tons of data labels, which are very expensive to get.

 

So people currently resort to unsupervised learning methods to do it. In unsupervised learning, there’s a sub-branch called Self-Supervised Learning, and Masked Image Modeling belongs to Self-Supervised Learning.
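
To make “transforming an image into an embedding vector” concrete, here is a minimal sketch – ours, not the project’s code – using an off-the-shelf, supervised-pretrained torchvision backbone as the representation model. The file name dog.jpg is an assumption for illustration only.

```python
# Hypothetical sketch of representation learning: turn a raw image into an
# embedding vector using a pretrained backbone (supervised pretraining here,
# just to illustrate the idea Liya describes).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a supervised-pretrained ResNet-50 and drop its classification head,
# keeping only the feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # outputs a 2048-dim embedding, not class scores
backbone.eval()

# Standard ImageNet preprocessing.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("dog.jpg").convert("RGB")  # hypothetical input file
with torch.no_grad():
    embedding = backbone(preprocess(image).unsqueeze(0))
print(embedding.shape)  # torch.Size([1, 2048]) -- the image as a vector
```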

 

Courtney

[3:59] That was Liya Wang, principal investigator for the Masked Image Modeling project. As she points out, computers don’t automatically know what an image represents. Show it a picture of a dog and it’ll register ones and zeros. Because of this, it’s important to teach it what images mean. Masked Image Modeling, inspired by masked language modeling, aims to do just that.

 

Liya

[4:22] In masked language modeling, for your input sentence, you randomly mask out some words and then let the neural network recover the masked words for you. In this way, you can learn good representations for the words and make downstream AI applications much easier and faster. Similarly, people wanted to bring this idea into the computer vision field, so the masked autoencoder was invented for this.

 

Courtney

[4:58] By masking, or hiding, certain words in a sentence and asking the computer to fill in the blanks, the computer becomes better at filling in information and understanding context. This can come in handy when it comes to visual data. Alex Tien goes on to explain.
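
As a toy illustration of that fill-in-the-blanks objective – our sketch, not code from the project – the snippet below builds a masked-language-modeling training pair by hiding a random subset of words. Real systems mask subword tokens and train a transformer to predict them; here we only construct the input and its targets, and the 15% rate is a BERT-style assumption.

```python
# Toy masked language modeling: hide ~15% of the words and record which
# originals a model would have to predict from the surrounding context.
import random

random.seed(1)  # fixed seed so the example output is reproducible
MASK_RATE = 0.15

sentence = "masked image modeling was inspired by masked language modeling".split()
masked, targets = [], {}
for i, word in enumerate(sentence):
    if random.random() < MASK_RATE:
        masked.append("[MASK]")
        targets[i] = word  # training target: recover this word from context
    else:
        masked.append(word)

print(" ".join(masked))  # "[MASK] image modeling was inspired by masked language [MASK]"
print(targets)           # positions where the prediction loss is computed
```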

 

Alex

[5:13] The question probably should be: why do we want to mask the image using patches to train AI models? Patches are just a way to divide an image into small chunks so the computer can read it in more efficiently, using fewer computing resources.

 

And when you divide a big image into small patches, each patch is supposed to contain certain information that could be useful. So the belief is that a few patches together might give us, human or machine, enough information to understand the image.

 

That means you don’t need to read the whole image; with just partial information, you will learn enough to perform the downstream task. So in Masked Image Modeling, by masking a few patches of the image, a model can be trained to learn important features from only a subset of patches and then reproduce the original image. And when that model is trained, it is supposed to have learned all those features embedded in the image data set. Once you have that, it becomes the foundation model for you to apply to the downstream task – for example, in this case, image classification or object detection.
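
To ground Alex’s description, here is a minimal sketch of the patch-masking step, written in the style of a masked autoencoder. It is our illustration, not the project’s code; the 16-pixel patches and 75% mask ratio are common choices from the MAE literature, not figures from this research.

```python
# Sketch of masked image modeling's input pipeline: split an image into
# patches, then randomly hide most of them; the encoder only sees the rest,
# and a decoder would be trained to reproduce the original image.
import torch

def patchify(images, patch=16):
    """Split (B, C, H, W) images into (B, N, C*patch*patch) flat patches."""
    B, C, H, W = images.shape
    p = images.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    return p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

def random_mask(patches, mask_ratio=0.75):
    """Keep a random subset of patches; the rest are 'masked out'."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                     # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]  # lowest-scored patches survive
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx

images = torch.randn(2, 3, 224, 224)   # stand-in batch; real inputs might be aerial images
patches = patchify(images)             # (2, 196, 768): a 14x14 grid of 16x16 patches
visible, keep_idx = random_mask(patches)
print(patches.shape, visible.shape)    # encoder sees only 49 of 196 patches
```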

 

Liya

[6:27] So we applied it to image classification and object detection, and we found that it gives us better performance. For example, for object detection, the masked autoencoder surpassed traditional methods by up to 17%. Secondly, it can also make downstream AI application development easier and faster.
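
Here is a hedged sketch of what “easier and faster downstream development” can look like: once the encoder is pretrained, only a small task head needs training. Everything below is illustrative – the dummy encoder is a stand-in for whatever pretrained Masked Image Modeling encoder the project actually produced.

```python
# Illustrative downstream step: reuse a pretrained encoder as a foundation
# model and train only a small classification head on top of it.
import torch
import torch.nn as nn

# Stand-in for a pretrained MIM encoder (a real one would be the transformer
# trained during the masking stage); it just pools an image to 768 features.
encoder = nn.Sequential(
    nn.AdaptiveAvgPool2d(16),
    nn.Flatten(),
    nn.Linear(3 * 16 * 16, 768),
)

class ClassifierOnPretrained(nn.Module):
    def __init__(self, encoder, embed_dim=768, num_classes=10):
        super().__init__()
        self.encoder = encoder                         # pretrained features
        self.head = nn.Linear(embed_dim, num_classes)  # the only new part

    def forward(self, images):
        return self.head(self.encoder(images))

model = ClassifierOnPretrained(encoder)
for p in model.encoder.parameters():
    p.requires_grad = False  # "linear probing": train just the head at first

logits = model(torch.randn(4, 3, 224, 224))
print(logits.shape)  # torch.Size([4, 10])
```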

 

Courtney

[6:56] As Alex and Liya both illustrate, Masked Image Modeling can make a tremendous impact in helping artificial intelligence teach itself context when it comes to visual data. Not only that, but it can help AI be smarter when data is incomplete. This can be a vital skill, particularly when it comes to aerial imaging.

 

Alex

[7:14] The challenge with aerial images is the bird’s-eye-view perspective, so it’s not like the typical image you take with your camera, because not all the objects you want to detect are in the right orientation. And also the background is really complex – it could be forest, it could be land covered with snow. That makes for challenging conditions for your computer vision model. And the image sizes vary, larger or smaller, and you want a trained model to recognize a bunch of different objects, right? So that’s also part of the challenge of dealing with aerial images. Fortunately, there are some open-source data sets that we can leverage to train the model and even to do benchmarking.

 

Courtney

[7:57] While Liya and Alex’s research study focuses on aerial imaging, the possibilities of this technology are endless – from something as large as a bird’s-eye view to something as microscopic as cells in a medical scan.

 

Liya

[8:08] So Masked Image Modeling can support many applications. For example, we can use it for image classification, object detection, segmentation, video tasks including classification and object tracking, and also audio tasks, reinforcement learning, medical vision tasks, and point clouds. There are tons of these applications. If people are interested, they can find out more in our paper.

 

Courtney

[8:42] Masked Image Modeling requires recognition of many different objects in many different orientations, so that when a plane is obscured by a tree or a bush, for example, the computer is still able to alert an analyst. However, developing that understanding takes a copious amount of input data.

 

While open-source data sets are available, there’s something organizations can do to benefit from this research.

 

Eliza

[9:04] Data is so valuable, and people are starting to see that – people pay for experts to label data. This is a hugely valuable thing, and you can do something meaningful with that data in an automated fashion, without human intervention. If people knew that their data sitting on some hard drive somewhere could be doing something productive if they just gave it the chance, I think more people would jump at that opportunity.

 

Fund small studies and make your data accessible and know that it can have a purpose, even if it’s completely automated.

 

 

Courtney

[9:39] Thanks for tuning into this episode of MITRE’s Tech Futures podcast. I wrote, produced, and edited this show with the help of Dr. Heath Farris and Dr. Kris Rosfjord, Tech Futures Innovation Area Leads; Tom Scholfield, media engineer; and Beverly Wood, strategic communications. Our guests for this episode were Liya Wang, Alex Tien, and Eliza Mace.

 

Copyright 2023, the MITRE Corporation. Approved for public release. Distribution unlimited. Public release case number 23-4234. MITRE: Solving problems for a safer world. 
