Attacks as Defenses: Designing Robust Audio CAPTCHAs Using Attacks on Automatic Speech Recognition Systems
Network and Distributed System
Security Symposium (NDSS) - 2023
Audio CAPTCHAs are supposed to
provide a strong defense for online
resources; however, advances in speech-to-text
mechanisms have rendered these
defenses ineffective. Audio CAPTCHAs cannot
simply be abandoned, as they are
specifically named by the W3C as important
enablers of accessibility. Accordingly,
demonstrably more robust audio CAPTCHAs are
important to the future of a secure and
accessible Web. We look to recent
literature on attacks on speech-to-text
systems for inspiration for the
construction of robust, principle-driven
audio defenses. We begin by comparing 20
recent attack papers, classifying and
measuring their suitability to serve as the
basis of new “robust to transcription” but
“easy for humans to understand” CAPTCHAs.
After showing that none of these attacks
alone are sufficient, we propose a new
mechanism that is both comparatively
intelligible (evaluated through a user
study) and hard to automatically transcribe
(i.e., P(transcription) = 4 × 10^-5). We
also demonstrate that our audio samples
have a high probability of being detected
as CAPTCHAs when given to speech-to-text
systems (P(evasion) = 1.77 × 10^-4).
Finally, we show that our method can break
WaveGuard, a mechanism designed to defend
against adversarial audio, with a 99%
success rate.
In so doing, we not only demonstrate a
CAPTCHA that is approximately four orders
of magnitude more difficult to crack, but
that such systems can be designed based on
the insights gained from attack papers
using the differences between the ways that
humans and computers process audio.
Who Are You (I Really Wanna Know)?
Detecting Audio DeepFakes Through Vocal
Tract Reconstruction
USENIX Security Symposium (USENIX
Security) - 2022
First Author
Generative machine learning models have
made convincing voice synthesis a reality.
While such tools can be extremely useful in
applications where people consent to their
voices being cloned (e.g., patients losing
the ability to speak, actors not wanting to
have to redo dialog, etc.), they also allow
for the creation of nonconsensual content
known as deepfakes. This malicious audio is
problematic not only because it can
convincingly be used to impersonate
arbitrary users, but because detecting
deepfakes is challenging and generally
requires knowledge of the specific deepfake
generator. In this paper, we develop a new
mechanism for detecting audio deepfakes
using techniques from the field of
articulatory phonetics. Specifically, we
apply fluid dynamics to estimate the
arrangement of the human vocal tract during
speech generation and show that deepfakes
often model impossible or highly unlikely
anatomical arrangements. When
parameterized to achieve 99.9% precision,
our detection mechanism achieves a recall
of 99.5%, correctly identifying all but
one deepfake sample in our dataset. We then
discuss the limitations of this approach,
and how deepfake models fail to reproduce
all aspects of speech equally. In so doing,
we demonstrate that subtle, but
biologically constrained aspects of how
humans generate speech are not captured by
current models, and can therefore act as a
powerful tool to detect audio deepfakes.
Lux: Enable Ephemeral Authorization for
Display-Limited IoT Devices
ACM/IEEE Conference on Internet of
Things Design and Implementation (IoTDI)
- 2021
First Author
Smart speakers are increasingly appearing
in homes, enterprises, and businesses
including hotels. These systems serve as
hubs for other IoT devices and deliver
content from streaming media services.
However, such an arrangement creates a
number of security concerns. For instance,
providing such devices with long-term
secrets is problematic with regard to
vulnerable devices and fails to capture the
increasingly transient nature of the
relationship between users and the
devices (e.g., in hotel or Airbnb settings,
this device is not owned by the customer
and may only be used for a single day).
Moreover, the limited interfaces available
to such speakers make entering such
credentials in a safe manner difficult. We
address these problems with Lux, a system
to provide ephemeral, fine-grained
authorization to smart speakers which can
be automatically revoked when the user
and hub are no longer in the same location.
We develop protocols using the LED/light
channel available to many smart speaker
devices to help users properly identify the
device with which they are communicating,
and demonstrate through a formally
validated protocol that such authorization
takes only a few seconds in practice.
Through this effort, we demonstrate that
Lux can safely authorize devices to access
user accounts while limiting any long-term
exposure to compromise.
Hear "No Evil", See "Kenansville":
Efficient and Transferable Black-Box Attacks
on Speech Recognition and Voice Identification
IEEE Symposium on Security and Privacy
- 2021
Automatic speech recognition and voice
identification systems are being deployed
in a wide array of applications, from
providing control mechanisms to devices
lacking traditional interfaces, to the
automatic transcription of conversations
and authentication of users. Many of these
applications have significant security and
privacy considerations. We develop attacks
that force mistranscription and
misidentification in state-of-the-art
systems, with minimal impact on human
comprehension. Processing pipelines for
modern systems consist of signal
preprocessing and feature extraction steps,
whose output is fed to a machine-learned
model. Prior work has focused on the
models, using white-box knowledge to tailor
model-specific attacks. We focus on the
pipeline stages before the models, which
(unlike the models) are quite similar
across systems. As such, our attacks are
black-box and transferable, and
demonstrably achieve mistranscription and
misidentification rates as high as 100% by
modifying only a few frames of audio. We
perform a study via Amazon Mechanical Turk
demonstrating that there is no
statistically significant difference
between human perception of regular and
perturbed audio. Our findings suggest that
models may learn aspects of speech that are
generally not perceived by human subjects,
but that are crucial for model accuracy. We
also find that certain English language
phonemes (in particular, vowels) are
significantly more susceptible to our
attack. We show that the attacks are
effective when mounted over cellular
networks, where signals are subject to
degradation due to transcoding, jitter, and
packet loss.
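The abstract does not spell out the
perturbation itself. As a purely
illustrative sketch of a signal-level change
applied before any model sees the audio, the
following removes low-magnitude spectral
components from a single frame. The function
names and the `keep_ratio` parameter are
hypothetical and are not taken from the
paper.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * i / n) for k, c in enumerate(spectrum)).real / n
        for i in range(n)
    ]


def drop_weak_components(frame, keep_ratio=0.1):
    """Zero every DFT bin whose magnitude is below keep_ratio times the peak.

    keep_ratio is an assumed, illustrative threshold.
    """
    spectrum = dft(frame)
    peak = max(abs(c) for c in spectrum)
    filtered = [c if abs(c) >= keep_ratio * peak else 0 for c in spectrum]
    return idft(filtered)
```

Because the change happens at the signal
level, a sketch like this needs no knowledge
of the downstream model, which is the
property the abstract relies on for
black-box transferability.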
Digital Healthcare-Associated
Infection: A Case Study on the Security
of a Major Multi-Campus Hospital
Network and Distributed System
Security Symposium (NDSS) - 2019
Modern hospital systems are complex
environments that rely on high
interconnectivity with the larger Internet.
With this connectivity comes a vast attack
surface. Security researchers have expended
considerable effort to characterize the
risks posed to medical devices (e.g.,
pacemakers and insulin pumps). However,
there has been no systematic,
ecosystem-wide analysis of a modern
hospital system to date, perhaps due to the
challenges of collecting and analyzing
sensitive healthcare data. Hospital traffic
requires special considerations because
healthcare data may contain private
information or may come from
safety-critical devices in charge of
patient care. We describe the process of
obtaining the network data in a safe and
ethical manner in order to help expand
future research in this field. We present
an analysis of network-enabled devices
connected to the hospital used for its
daily operations without posing any harm to
the hospital’s environment. We perform a
Digital Healthcare-Associated Infection
(D-HAI) analysis of the hospital ecosystem,
assessing a major multi-campus healthcare
system over a period of six months. As part
of the D-HAI analysis, we characterize DNS
requests and TLS/SSL communications to
better understand the threats faced within
the hospital environment without disturbing
the operational network. Contrary to past
assumptions, we find that medical devices
have minimal exposure to the external
Internet, but that medical support devices
(e.g., servers, computer terminals)
essential for daily hospital operations are
much more exposed. While much of this
communication appears to be benign, we
discover evidence of insecure and broken
cryptography and misconfigured devices, and
potential botnet activity. Analyzing the
network ecosystem in which they operate
gives us an insight into the weaknesses and
misconfigurations hospitals need to address
to ensure the safety and privacy of
Characterizing the Security of the SMS
Ecosystem with Public Gateways
ACM Transactions on Privacy and
Security (TOPS) Volume 22 - 2018
Recent years have seen the Short
Message Service (SMS) become a critical
component of the security infrastructure,
assisting with tasks including identity
verification and second-factor
authentication. At the same time, this
messaging infrastructure has become
dramatically more open and connected to
public networks than ever before.
However, the implications of this openness,
the security practices of benign services,
and the malicious misuse of this ecosystem
are not well understood. In this article,
we provide a comprehensive longitudinal
study to answer these questions, analyzing
over 900,000 text messages sent to public
online SMS gateways over the course of 28
months. From this data, we uncover the
geographical distribution of spam
messages, study SMS as a transmission
medium of malicious content, and find that
changes in benign and malicious behaviors
in the SMS ecosystem have been minimal
during our collection period. The key
takeaways of this research show many
services sending sensitive security-based
messages through an unencrypted medium,
implementing low entropy solutions for
one-use codes, and behaviors indicating
that public gateways are primarily used for
evading account creation policies that
require verified phone numbers. This
latter finding has significant implications
for combating phone-verified account fraud
and demonstrates that such evasion will
continue to be difficult to detect and
Hello, Is It Me You're Looking For?
Differentiating Between Human and
Electronic Speakers for Voice Interface
ACM Conference on Security and Privacy
in Wireless and Mobile Networks (WiSec)
- 2018
First Author
Voice interfaces are increasingly
becoming integrated into a variety of
Internet of Things (IoT) devices. Such
systems can dramatically simplify
interactions between users and devices with
limited displays. Unfortunately, voice
interfaces also create new opportunities
for exploitation. Specifically, any
sound-emitting device within range of the
system implementing the voice interface
(e.g., a smart television, an
Internet-connected appliance, etc.) can
potentially cause these systems to perform
operations against the desires of their
owners (e.g., unlock doors, make
unauthorized purchases, etc.). We address
this problem by developing a technique to
recognize fundamental differences in audio
created by humans and electronic speakers.
We identify sub-bass over-excitation, or
the presence of significant low frequency
signals that are outside of the range of
human voices but inherent to the design of
modern speakers, as a strong differentiator
between these two sources. After
identifying this phenomenon, we demonstrate
its use in preventing adversarial requests,
replayed audio, and hidden commands with a
100%/1.72% TPR/FPR in quiet environments.
In so doing, we demonstrate that commands
injected via nearby audio devices can be
effectively removed by voice interfaces.
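As a rough sketch of the sub-bass idea
described above, the following measures what
fraction of a recording's spectral energy
falls in a low-frequency band and flags the
recording when that fraction is large. The
band edges (20 to 80 Hz), the threshold, and
the function names are assumptions chosen
for illustration; the abstract does not
state the metric or parameters actually
used.

```python
import cmath
import math


def band_energy_fraction(samples, sample_rate, f_lo, f_hi):
    """Fraction of spectral energy (DC excluded) within [f_lo, f_hi] Hz.

    Uses a naive DFT over the positive-frequency bins, so it is only
    suitable for short clips.
    """
    n = len(samples)
    total = 0.0
    band = 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)
        )
        energy = abs(coeff) ** 2
        total += energy
        freq = k * sample_rate / n
        if f_lo <= freq <= f_hi:
            band += energy
    return band / total if total else 0.0


def looks_like_loudspeaker(samples, sample_rate, sub_bass=(20.0, 80.0), threshold=0.1):
    """Flag audio whose sub-bass energy share exceeds an assumed threshold."""
    return band_energy_fraction(samples, sample_rate, *sub_bass) > threshold
```

A pure 200 Hz tone has essentially no energy
in the assumed band, while the same tone
mixed with a 50 Hz component does, so only
the latter would be flagged by this sketch.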
2MA: Verifying Voice Commands via Two
Microphone Authentication
ACM ASIA Conference on Computer and
Communications Security (ASIACCS) -
First Author
Voice controlled interfaces have vastly
improved the usability of many devices
(e.g., headless IoT systems).
Unfortunately, the lack of authentication
for these interfaces has also introduced
command injection vulnerabilities: whether
via compromised IoT devices, television ads,
or simply malicious nearby neighbors,
causing such devices to perform
unauthenticated sensitive commands is
relatively easy. We address these
weaknesses with Two Microphone
Authentication (2MA), which takes advantage
of the presence of multiple ambient and
personal devices operating in the same
area. We develop an embodiment of 2MA that
combines approximate localization through
Direction of Arrival (DOA) techniques with
Robust Audio Hashes (RSHs). Our results
show that our 2MA system can localize a
source to within a narrow physical cone
(<30°) with zero false positives, eliminate
replay attacks, and prevent the injection of
inaudible/hidden commands. As such, we
dramatically increase the difficulty for an
adversary to carry out such attacks and
demonstrate that 2MA is an effective means
of authenticating and localizing voice
commands.
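To make the direction-of-arrival idea
concrete, here is a minimal sketch of one
common textbook approach: estimate the time
delay between two microphones by
cross-correlation, then convert it to an
angle under a far-field assumption. This is
not claimed to be the 2MA implementation;
the function names, the speed-of-sound
constant, and the two-microphone geometry
are assumptions for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed room-temperature air


def best_lag(a, b, max_lag):
    """Lag in samples that maximizes the cross-correlation of a and b.

    A positive result means signal b is delayed relative to signal a.
    """
    def corr(lag):
        return sum(
            a[i] * b[i + lag] for i in range(len(a)) if 0 <= i + lag < len(b)
        )

    return max(range(-max_lag, max_lag + 1), key=corr)


def arrival_angle_deg(lag, sample_rate, mic_spacing_m):
    """Far-field arrival angle, in degrees from broadside, for a sample lag."""
    delay_s = lag / sample_rate
    ratio = delay_s * SPEED_OF_SOUND / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against noise-induced overshoot
    return math.degrees(math.asin(ratio))
```

A lag of zero maps to a source directly
broadside to the microphone pair, and larger
lags map to larger angles until the delay
reaches the spacing divided by the speed of
sound.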
AuthentiCall: Efficient Identity and
Content Authentication for Phone
USENIX Security Symposium (USENIX
Security) - 2017
Phones are used to confirm some of our most
sensitive transactions. From coordination
between energy providers in the power grid
to corroboration of high-value transfers
with a financial institution, we rely on
telephony to serve as a trustworthy
communications path. However, such trust is
not well placed given the widespread
understanding of telephony’s inability to
provide end-to-end authentication between
callers. In this paper, we address this
problem through the AuthentiCall system.
AuthentiCall not only cryptographically
authenticates both parties on the call, but
also provides strong guarantees of the
integrity of conversations made over
traditional phone networks. We achieve
these ends through the use of formally
verified protocols that bind low-bitrate
data channels to heterogeneous audio
channels. Unlike previous efforts, we
demonstrate that AuthentiCall can be used
to provide strong authentication before
calls are answered, allowing users to
ignore calls claiming a particular Caller
ID that are unable or unwilling to provide
proof of that assertion. Moreover, we
detect 99% of tampered call audio with
negligible false positives and only a
worst-case 1.4 second call establishment
overhead. In so doing, we argue that strong
and efficient end-to-end authentication for
phone networks is approaching a practical
AuthLoop: Practical End-to-End
Cryptographic Authentication for
Telephony over Voice Channels
USENIX Security Symposium (USENIX
Security) - 2016
Telephones remain a trusted platform for
conducting some of our most sensitive
exchanges. From banking to taxes, wide
swathes of industry and government rely on
telephony as a secure fall-back when
attempting to confirm the veracity of a
transaction. In spite of this,
authentication is poorly managed between
these systems, and in the general case it
is impossible to be certain of the identity
(i.e., Caller ID) of the entity at the
other end of a call. We address this
problem with AuthLoop, the first system to
provide cryptographic authentication solely
within the voice channel. We design,
implement and characterize the performance
of an in-band modem for executing a
TLS-inspired authentication protocol, and
demonstrate its abilities to ensure that
the explicit single-sided authentication
procedures pervading the web are also
possible on all phones. We show
experimentally that this protocol can be
executed with minimal computational
overhead and only a few seconds of user
time (≈9 instead of ≈97 seconds for a
naïve implementation of TLS 1.2) over
heterogeneous networks. In so doing, we
demonstrate that strong end-to-end
validation of Caller ID is indeed practical
for all telephony networks.
Detecting SMS Spam in the Age of
Legitimate Bulk Messaging
ACM Conference on Security and Privacy
in Wireless and Mobile Networks (WiSec)
- 2016
Text messaging is used by more people
around the world than any other
communications technology. As such, it
presents a desirable medium for spammers.
While this problem has been studied by
many researchers over the years, the recent
increase in legitimate bulk traffic (e.g.,
account verification, 2FA, etc.) has
dramatically changed the mix of traffic
seen in this space, reducing the
effectiveness of previous spam
classification efforts. This paper
demonstrates the performance degradation of
those detectors when used on a large-scale
corpus of text messages containing both
bulk and spam messages. Against our labeled
dataset of text messages collected over 14
months, the precision and recall of past
classifiers fall to 23.8% and 61.3%
respectively. However, using our
classification techniques and labeled
clusters, precision and recall rise to 100%
and 96.8%. We not only show that our
collected dataset helps to correct many of
the overtraining errors seen in previous
studies, but also present insights into a
number of current SMS spam campaigns.
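For readers less familiar with the two
metrics quoted above, this short sketch
shows how precision and recall follow from
classifier counts. The function name and the
example counts are hypothetical and are not
drawn from the paper's dataset.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a binary spam classifier.

    precision = TP / (TP + FP): share of flagged messages that are spam.
    recall    = TP / (TP + FN): share of spam messages that were flagged.
    """
    flagged = true_pos + false_pos
    actual_spam = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual_spam if actual_spam else 0.0
    return precision, recall


# Illustrative counts only: 90 spam messages caught, 10 legitimate
# messages wrongly flagged, 30 spam messages missed.
precision, recall = precision_recall(true_pos=90, false_pos=10, false_neg=30)
```

Legitimate bulk messages that resemble spam
raise the false-positive count, which lowers
precision without affecting recall.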
Sending Out an SMS: Characterizing the
Security of the SMS Ecosystem with
Public Gateways
IEEE Symposium on Security and Privacy
(IEEE S&P) - 2016
Text messages sent via the Short Message
Service (SMS) have revolutionized
interpersonal communication. Recent years
have also seen this service become a
critical component of the security
infrastructure, assisting with tasks
including identity verification and
second-factor authentication. At the same
time, this messaging infrastructure has
become dramatically more open and connected
to public networks than ever before.
However, the implications of this openness,
the security practices of benign services,
and the malicious misuse of this ecosystem
are not well understood. In this paper, we
provide the first longitudinal study to
answer these questions, analyzing nearly
400,000 text messages sent to public online
SMS gateways over the course of 14 months.
From this data, we are able to identify not
only a range of services sending extremely
sensitive plaintext data and implementing
low entropy solutions for one-use codes,
but also offer insights into the prevalence
of SMS spam and behaviors indicating that
public gateways are primarily used for
evading account creation policies that
require verified phone numbers. This latter
finding has significant implications for
research combatting phone-verified account
fraud and demonstrates that such evasion
will continue to be difficult to detect and