Emotional speech recognition for everyone.
Pitch Presentation: https://pitch.com/public/00c1a1be-aa56-4639-82a2-68f7aa1a29cb
Problem Background
People with hearing impairments can now caption conversations and calls like everyone else, thanks to voice-to-text technology. But what about emotions? How can we convey emotions in a non-spoken medium?
We know that there is a fundamental analogy between sign language (visual) and spoken language (acoustic/phonological). Just as phonemes are minimal units without meaning that combine to form the words of spoken language, cheremes, in every national sign language, are minimal units without meaning that combine as formational parameters to produce the signs of sign language.
When a person with a hearing impairment communicates by text, the correspondence between emotion and visual sign is lost.
Successful voice-to-text and text-to-voice apps already exist, and there is an extensive literature on AI models that can predict a speaker's emotion by analysing a recorded audio clip of their voice. What we do not have is an application that carries that emotion into text.
The main situation we wanted to understand was: how do deaf people deal with an emergency?
We interviewed three people with hearing impairments and one ASL interpreter, and asked them to walk us through their actions and feelings when facing an emergency.
What we discovered is that there are two scenarios in an emergency. In the first, support is provided freely and immediately by a CODA (Child of Deaf Adults), since deaf people generally have a very supportive network of people helping each other.
In the second, they rely on voice-to-text applications already on the market, such as RogerVoice or Google Live Transcribe.
During the interviews, while texting our questions, we kept running into the same situation: interviewees often asked us to slow down or rephrase a sentence because something was difficult for them to catch, namely the emotion or intention behind the written text.
We narrowed the solution down to adding emotion to text. As one interviewee put it: "It is hard to recognize other people's emotions by text. I wish my voice-to-text app could recognize their emotions so that I could communicate with people better."
The features we would like to have in IHEARYOU are:
A note on the last feature: automatic recognition of emotions from speech and text is a challenging problem. Recent AI and deep-learning solutions restrict audio emotion labels to eight classes: angry, disgusted, fear, happy, sad, neutral, surprise and calm. For these emotions, text can be given a visual correspondence.
Speech Emotion Recognition (SER) is hard because emotions are subjective and annotating audio is challenging. If emotions could be encoded visually in text, people with hearing impairments would have fewer difficulties understanding a voice-to-text conversation.
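To illustrate how the SER feature could work end to end, here is a minimal sketch that summarises an audio clip as averaged MFCC features and classifies it into the eight emotion classes above. It assumes the librosa and scikit-learn packages and a labelled training corpus (such as RAVDESS); the feature choice and the classifier are illustrative placeholders, not IHEARYOU's final model.

```python
# Minimal SER sketch: MFCC summary features + a small neural classifier.
# Assumes librosa, numpy, scikit-learn and a labelled set of audio clips.
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["angry", "disgusted", "fear", "happy", "sad", "neutral", "surprise", "calm"]

def clip_features(path: str) -> np.ndarray:
    """Summarise a clip as the mean of its MFCC frames."""
    audio, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)
    return mfcc.mean(axis=1)

def train_classifier(train_paths, train_labels) -> MLPClassifier:
    """train_paths/train_labels would come from an annotated corpus such as RAVDESS."""
    X = np.stack([clip_features(p) for p in train_paths])
    clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
    clf.fit(X, train_labels)
    return clf

def predict_emotion(clf: MLPClassifier, path: str) -> str:
    """Return one of the eight emotion labels for a new clip."""
    return clf.predict([clip_features(path)])[0]
```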
We created a list of user pain points, below.
User Story #1: As a user, I want to use voice to text with emotion, so that I understand non-visual signals.
Scenario #1: Understand emotion with text
Acceptance Criteria:
- User can see a one-to-one font ⇿ emotion correspondence
- User can see text spacing that corresponds to vocal speed
- User can see text contrast that corresponds to vocal intensity
(Settings: Equaliser and text controls such as font typeface and size; see the sketch below)
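A minimal sketch of how these three criteria could map onto text styling, assuming a web front end that renders the transcript as HTML; the fonts, thresholds and units are assumptions for illustration, not the shipped design.

```python
# Map emotion, speaking speed and vocal intensity onto inline CSS for one utterance.
EMOTION_FONT = {
    "angry": "Impact",
    "happy": "Comic Sans MS",
    "sad": "Georgia",
    "fear": "Courier New",
    "surprise": "Verdana",
    "neutral": "Arial",
}

def style_for(emotion: str, words_per_minute: float, intensity_db: float) -> str:
    """Return inline CSS for one transcribed utterance."""
    font = EMOTION_FONT.get(emotion, "Arial")            # font ⇿ emotion, one to one
    spacing = max(0.0, (180 - words_per_minute) / 100)   # slower speech -> wider letter spacing
    weight = 700 if intensity_db > 70 else 400           # louder speech -> higher contrast (bolder)
    return f"font-family: {font}; letter-spacing: {spacing:.2f}em; font-weight: {weight};"
```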
User Story #2: As a user, I want to switch to another language, so that I can use the app anywhere.
Scenario #1: Translate
Acceptance Criteria:
- User can position the device to listen to the speaker they want to hear
- User can see the voice-to-text transcript in another language
- App recognizes the language and translates the text to voice (see the sketch below)
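A minimal sketch of that translate flow, assuming the langdetect, transformers and pyttsx3 packages; the Helsinki-NLP/opus-mt model family is one possible choice, and a pretrained model must exist on the Hugging Face Hub for the detected language pair.

```python
# Detect the incoming language, translate it into the user's language,
# then speak the result back for the hearing party (text-to-voice).
from langdetect import detect          # language identification
from transformers import pipeline      # pretrained translation model
import pyttsx3                         # offline text-to-speech

def relay_in_my_language(incoming_text: str, my_language: str = "en") -> str:
    source = detect(incoming_text)                      # e.g. "fr"
    if source != my_language:
        # Assumes an opus-mt model exists for this language pair.
        translator = pipeline(
            "translation",
            model=f"Helsinki-NLP/opus-mt-{source}-{my_language}",
        )
        incoming_text = translator(incoming_text)[0]["translation_text"]
    engine = pyttsx3.init()
    engine.say(incoming_text)
    engine.runAndWait()
    return incoming_text
```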
User Story #3: As a user, when I don't understand something, I want it explained to me in simple terms.
Scenario #1: Paraphrase and explain
Acceptance Criteria:
- App paraphrases using real-life terminology
- App explains dictionary terms
Based on our target users' pain points, we decided to start with the text chat box with selectable emotion as the first feature, and then, with increasing technical difficulty, work towards the automatic emotion recognition feature.
The feedback we received was enthusiastic, given the general lack of resources for deaf people. Interviewees appreciated the simplicity of the design and the efficacy of the solution. They would like to test the website or app, and suggested offering it as a plug-in for the existing Live Transcribe app.
We created a user flow chart.
From this, the designer derived the hi-fi mockups.
The chat box with the emotion control has a dropdown menu for selecting an emotion, and the background is colored according to the selected emotion.
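A minimal sketch of that behaviour, assuming the chat box is rendered as HTML; the emotion-to-color palette below is an assumption for illustration, not the mockups' actual colors.

```python
# The selected emotion from the dropdown keys a background color for the chat box.
EMOTION_COLOR = {
    "angry": "#f8d0d0",
    "happy": "#fdf3c8",
    "sad": "#d6e4f0",
    "fear": "#e2d6f0",
    "surprise": "#ffe4cc",
    "calm": "#d9f2e6",
    "neutral": "#f2f2f2",
}

def render_message(text: str, selected_emotion: str) -> str:
    """Wrap a chat message in a div whose background reflects the chosen emotion."""
    color = EMOTION_COLOR.get(selected_emotion, EMOTION_COLOR["neutral"])
    return f'<div class="chat-box" style="background-color: {color};">{text}</div>'
```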
Implementation Details
Technical implementation
Technical challenges
We are no longer working on the project as a team, because we are heading down different professional paths. Nonetheless, the product is interesting and socially useful, and we would have liked to complete it by adding all the desired features, in particular the API for speech emotion recognition.
I have learned how to collaborate with a cross-functional team and to deal with disagreements and difficult situations, how to prioritize features, and how to listen to customers' needs. I am already applying these learnings in my work and receiving good feedback.
I gained more familiarity with the agile project management process while working on a team that included a product manager, a designer and a developer.
I learned a lot about building applications that prioritize accessibility for people with hearing impairments, blindness and more. This also led me to look into the technologies that already exist.
We have learnt how to collaborate and work through stressful situations, and we successfully derived a solution to a complex problem, emotion recognition, in the service of inclusivity.