Vishing with Synthetic Voice
Articles by: Richey May, Dec 05, 2022
Author: Michael Nouguier,
Chief Information Security Officer and Director, Cybersecurity Services
Raise your hand if you haven’t received a phone call offering to extend your car’s warranty, or one claiming the IRS wants you to buy gift cards before you talk to a magistrate. Vishing (voice phishing) calls are on the rise, and the pace is accelerating. Vishing is a type of social engineering attack in which a malicious attacker calls a target to achieve an objective, e.g., to elicit information from the victim, process a fraudulent payment, or induce the victim to perform other potentially malicious activities. The most prevalent vishing attacks today fall into two scenarios:
- Mass calls, blindly targeting the maximum number of people, pretending to be in an authoritative or unassuming position such as the IRS, police, charities, etc. The objective is financial gain from the target.
- Targeted calling campaigns against organizations, intended to help the malicious actor breach the organization’s environment (e.g., by gathering technical information to craft a more successful campaign) or to elicit sensitive information about the company and/or its customers.
Exactly how do gender and voice lead to more successful social engineering campaigns? Stick with us, as the reasoning takes a few steps. Attackers have a plethora of techniques they use to increase the likelihood that an attack succeeds. They often use urgency and emotion to provoke the response they want from the victim, but urgency and emotion only work if the person on the other end of the phone trusts the caller. Studies over the past decade have shown that women tend to be more successful in voice and physical social engineering campaigns. A study by social-engineer.org shows that women are perceived as more trustworthy and honest, and that implicit trust makes them more successful in their social engineering endeavors. Further, a recent study by Bugcrowd shows that about 91% of cyber attackers are men, so a male voice is more likely to raise suspicion during social engineering activities. Advances in synthetic/digital voice technology now allow cyber attackers to sound like any gender or even a specific person, leading to more advanced, dynamic, and successful social engineering campaigns.
This is just one of the many apparent use cases for synthetic voice in social engineering. Richey May has crafted several scenarios and campaigns that can be used to test your organization’s risk from malicious social engineering attacks. Organizations often cycle through the same social engineering campaigns when assessing the posture of their user base, which reduces the effectiveness of most internal testing. Specifically crafted campaigns show the effects of a “real world” attack on an organization and can drive more effective remediation, leading to better resilience in the event of an actual attack.
Richey May is partnering with Respeecher, an Emmy Award-winning voice cloning company (https://www.respeecher.com/), to research how the use of digital/deepfake voice affects the susceptibility of an organization’s employees to vishing attacks. Respeecher can alter a deeper, masculine-sounding voice into a higher-pitched, feminine-sounding voice with almost no “tells” that the voice the target hears has been manipulated. More specifically, Respeecher specializes in speech-to-speech voice conversion, which enables one person to speak and perform in the voice of another specific person. Due to its focus on high fidelity, Respeecher’s synthetic speech is featured in some of the biggest Hollywood movies.
What makes speech-to-speech voice conversion sound exceptionally realistic is the fact that most of the non-linguistic aspects of speech (intonation, emotion, strength of the voice, cadence, etc.) are NOT generated by a computer. Instead, those attributes come from a real human actor (the source actor), whose performance drives the system. The role of the AI algorithm is then to convert the source actor’s timbre and phonetic habits into those of the target speaker. To an unsuspecting listener, the synthetic speech generated with this approach can often be indistinguishable from real speech. In simple words, recordings of the “target voice” (the recreated voice) are used to train the system, which is then applied to the “source speaker” (i.e., the actor reading the lines). To mitigate malicious use of the technology, Respeecher requires permission from the “target voice” owner.
Together with Respeecher, Richey May has designed several scenarios that use synthetic speech for social engineering penetration testing. One such scenario uses the voice of a trusted leader within an organization to complete a malicious objective, such as a CEO calling the CFO to have money transferred to the attacker’s account or to have malicious software installed on the company’s internal computers. Such simulations can be conducted using Respeecher’s real-time (sub-500 ms latency) voice cloning tech. With only about five minutes of existing recordings of a person’s voice, Respeecher can enable the engineer running the tests to sound like a specific person and attempt to elicit sensitive information over a phone call or through a video conferencing app. We believe that conducting vishing test scenarios like this could help identify employees’ susceptibility to such threats and address it through proper personnel training.
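To make "identifying susceptibility" a little more concrete, here is a minimal sketch of how the outcomes of an authorized vishing simulation might be tallied per department. It is not Richey May's or Respeecher's tooling: the `CallResult` record, its `complied` and `reported` fields, the `susceptibility_by_department` helper, and the sample data are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallResult:
    """One simulated vishing call (hypothetical record layout)."""
    department: str
    complied: bool  # the employee performed the requested action
    reported: bool  # the employee reported the call to security


def susceptibility_by_department(results):
    """Return per-department call counts, compliance rates, and report rates."""
    totals = defaultdict(lambda: {"calls": 0, "complied": 0, "reported": 0})
    for r in results:
        t = totals[r.department]
        t["calls"] += 1
        t["complied"] += int(r.complied)
        t["reported"] += int(r.reported)
    return {
        dept: {
            "calls": t["calls"],
            "compliance_rate": t["complied"] / t["calls"],
            "report_rate": t["reported"] / t["calls"],
        }
        for dept, t in totals.items()
    }


# Made-up sample data, for illustration only.
sample_results = [
    CallResult("Finance", complied=True, reported=False),
    CallResult("Finance", complied=True, reported=False),
    CallResult("Finance", complied=False, reported=True),
    CallResult("Finance", complied=False, reported=False),
    CallResult("IT", complied=False, reported=True),
    CallResult("IT", complied=False, reported=True),
]

summary = susceptibility_by_department(sample_results)
for dept, stats in sorted(summary.items()):
    print(dept, stats)
```

A department with a high compliance rate and a low report rate would be a natural candidate for the follow-up training mentioned above; which thresholds matter is a judgment call for each organization.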
Richey May and Respeecher can perform advanced vishing tests that combine Respeecher’s technology with Richey May’s advanced social engineering team, providing real-world, innovative attack simulations that prepare organizations for the future of cyber-attacks. We would love to talk to you more! Shoot us an email at email@example.com.