ChatGPT Advanced Voice with Vision, presented just a few hours ago

A few days ago, Google introduced Gemini 2.0 Flash, a new multimodal AI model capable of seeing in real time, through our camera, what we are viewing. Just a few hours ago, OpenAI launched its own vision feature. It's worth mentioning that back on September 25, 2023, OpenAI had already announced it was working on vision capabilities, stating that ChatGPT could now see, hear, and speak. But as things go, right after the launch of Gemini 2.0's vision, OpenAI presents its own, already operational and soon to be available on our devices, depending on our ChatGPT subscription plan.

How does it work?

It's very easy. In the mobile app, a video camera icon will appear at the bottom of the screen. When we tap it, ChatGPT takes control of the phone's camera and sees exactly what we are pointing it at. From there, we can ask ChatGPT about what it is seeing at that moment, and it will describe it. Simply fantastic! This opens the door to countless applications, such as improved accessibility for people with vision problems, who will now find that ChatGPT can assist them with their daily tasks, making their day-to-day life much easier. All of this thanks to this new advance in artificial intelligence. It may sound like science fiction, but it's not: it's already a reality.

ChatGPT Vision and Voice Features:

ChatGPT has been updated with the ability to see and speak simultaneously in real time. This means it can now process and analyze live images while also holding a voice conversation in a more natural and contextual manner.

Real-Time Image Analysis:

  • Users can upload images or use the camera to show ChatGPT what they are seeing. The model can describe scenes, identify objects, read text, and even infer contexts or activities from the images we are showing it.
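For developers, similar image-analysis capabilities are exposed through OpenAI's API, where an image is sent alongside a text question as a multi-part chat message. The sketch below only builds that message payload; the model name, the exact request step, and the file `photo.jpg` are assumptions for illustration, not part of the announcement.

```python
import base64


def build_vision_message(image_path: str, question: str) -> list:
    """Build a chat message pairing a text question with a local image,
    using the multi-part content format of OpenAI's vision-capable
    chat API (payload shape per current API docs; details may change)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                # The text part carries the question about the image.
                {"type": "text", "text": question},
                # The image part embeds the picture as a base64 data URL.
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ]
```

The resulting list would then be passed as the `messages` argument of a chat-completion request against a vision-capable model (e.g. via the official `openai` Python SDK), which responds with a textual description of the scene, much like the app does.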

Enhanced Voice Interaction:

  • Voice interaction enables smooth conversations, where ChatGPT not only understands human speech with greater accuracy but also responds vocally, mimicking a natural conversation. It listens and intervenes when appropriate, acting as another participant in the dialogue.

Practical Applications:

  • Education: Assisting in the explanation of visual or auditory concepts, enhancing the learning experience.

  • Accessibility: Providing real-time descriptions of the environment for individuals with visual or auditory impairments, helping them navigate daily tasks with the help of ChatGPT's vision.

  • Entertainment: Potential for interactive games or augmented reality applications where the AI can react to what it sees through the camera, creating a more immersive and responsive experience.

Implications and Future of These New Advances:

This innovation marks a significant step toward integrating AI into everyday life in a more immersive and practical way, moving closer to the vision where machines understand and react to the world as humans do.

However, attention will be needed to address issues of ethics, privacy, and security, as these technologies can process highly personal or sensitive information. Ensuring proper safeguards are in place will be critical as this technology continues to evolve and expand its capabilities.

Availability:

These new ChatGPT features are being gradually implemented and will soon be available to a wider audience, likely starting with premium users.

Here is the official presentation video from OpenAI on their YouTube channel announcing the imminent launch of video in ChatGPT's Advanced Voice Mode, with a complete demonstration of what we can do with these groundbreaking new features.