CAPTCHA Turing Test Defeated: Ramifications for Legal Community
While debates on the use of machine learning algorithms in e-discovery continue (for example, so-called “predictive coding”), the recent defeat of CAPTCHA signals a growing maturity in machine learning capabilities with direct applicability to the legal community.
Background on CAPTCHA Use: The Problem
Websites typically include pages where a website visitor can input and submit information (a web form) to the website operator. A “Contact Us” form, where the website visitor types in name, address, and email address information, is a common type of web form. A form to sign up for a newsletter or a special offer is another example.
Unfortunately, web forms are subject to abuse from automated submission ‘bots.’ Automated submission of web forms creates significant problems for a website operator. Not only can automated form submissions overload a database or tax server resources, but the form submission can be used to initiate spam or to “poison” the database with malicious code. Thus at best, automated submissions are an annoyance; in more serious cases, the automated submissions can result in criminal activity or data breaches.
The CAPTCHA Solution
To reduce automated submissions, website owners added automatically generated CAPTCHA images to each form submission page.
How Does a CAPTCHA Work?
A CAPTCHA image is a distorted image usually consisting of text or numbers (other types, such as audio CAPTCHAs, are also available). The distortion arises from added image “noise” (specks, fuzziness, blurring, bent characters), added colors, shading, or combinations of these techniques. In principle, a human can decipher the contents of the distorted image while a computer (an automated submission bot) cannot, which makes the CAPTCHA a form of Turing Test.[FN1] The Turing Test is a classic problem in computer science related to artificial intelligence. Essentially, if a computer’s response is indistinguishable from a human’s response (i.e., a human judge cannot tell the difference), the computer passes the Turing Test. The acronym CAPTCHA stands for “Completely Automated Public Turing Test To Tell Computers and Humans Apart.” While a mouthful, the name captures the underlying assumption that a typical computer cannot pass the test. But, as recent reports indicate, the computer did.
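For readers who want to see the mechanics, the challenge-and-verify cycle behind a text CAPTCHA can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `SECRET_KEY`, `new_challenge`, and `verify` are hypothetical, the step that renders the text as a distorted image is left out, and a production system would also need expiring, single-use tokens.

```python
import hashlib
import hmac
import secrets
import string

# Hypothetical server-side secret; a real site would load this from configuration.
SECRET_KEY = secrets.token_bytes(32)

# Characters that are easy to confuse (0/O, 1/I/L) are left out on purpose.
ALPHABET = "".join(
    c for c in string.ascii_uppercase + string.digits if c not in "0O1IL"
)


def new_challenge(length=6):
    """Return (challenge_text, token).

    The text would be drawn as a distorted image and shown to the visitor;
    only the token travels with the form, so the answer itself is not exposed.
    """
    text = "".join(secrets.choice(ALPHABET) for _ in range(length))
    token = hmac.new(SECRET_KEY, text.encode(), hashlib.sha256).hexdigest()
    return text, token


def verify(user_answer, token):
    """Accept the form only if the typed answer matches the issued challenge."""
    normalized = user_answer.strip().upper()
    expected = hmac.new(SECRET_KEY, normalized.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token)


text, token = new_challenge()
print(verify(text.lower(), token))  # True: a visitor who read the image correctly
print(verify("WRONG1", token))      # False: a blind guess
```

The point of the design is that the website never has to "understand" the image. It only checks whether the visitor typed back the text it generated, on the assumption that only a human could have read it.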
What Does the Defeat of CAPTCHA Potentially Mean for the Legal Profession?
Machine learning is a subcategory of the science of artificial intelligence. At first, the CAPTCHA defeat may seem like a techie issue or a quaint ‘gee whiz’ moment. But step back from the announcement. A Turing Test by definition tests the apparent boundaries between human analysis and computer (machine) analysis. Thus, the defeat of CAPTCHA demonstrates a maturing of machine learning capabilities. The CAPTCHA was useful precisely because it distinguished human input from machine input. No longer. The machine can now also tease out the text from the distorted image by applying machine learning algorithms to the images. (While the computer still exhibits an error rate, I also have a significant error rate with these sometimes exceedingly cryptic CAPTCHA images.)
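To make the idea of a machine reading distorted text less abstract, here is a deliberately tiny sketch in Python. It is not the method used by the researchers in the reports; it is a simple nearest-template match, and the 5x5 letter bitmaps, the function names, and the noise level are all assumptions chosen for illustration. Each letter is "distorted" by flipping a few pixels, and the program still recovers it by finding the stored template with the fewest differing pixels.

```python
import random

# Hypothetical 5x5 bitmaps for three letters ('1' = ink, '0' = background).
TEMPLATES = {
    "A": "01110" "10001" "11111" "10001" "10001",
    "L": "10000" "10000" "10000" "10000" "11111",
    "T": "11111" "00100" "00100" "00100" "00100",
}


def add_noise(bitmap, flips, rng):
    """Flip `flips` distinct pixels to imitate CAPTCHA-style specks."""
    pixels = list(bitmap)
    for i in rng.sample(range(len(pixels)), flips):
        pixels[i] = "1" if pixels[i] == "0" else "0"
    return "".join(pixels)


def hamming(a, b):
    """Count the pixels where two bitmaps disagree."""
    return sum(x != y for x, y in zip(a, b))


def recognize(bitmap):
    """Return the letter whose template is closest to the noisy bitmap."""
    return min(TEMPLATES, key=lambda letter: hamming(TEMPLATES[letter], bitmap))


rng = random.Random(2011)
decoded = "".join(
    recognize(add_noise(TEMPLATES[ch], flips=4, rng=rng)) for ch in "TAL"
)
print(decoded)  # TAL
```

The three templates differ from one another in at least 13 pixels, so four flipped pixels can never move a letter closer to the wrong template. Real CAPTCHAs bend, overlap, and shade their characters, which is why breaking them takes far more capable learning algorithms than this toy.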
And this brings us back to the legal profession. While the implications of advancing artificial intelligence and machine learning pose broader challenges to the legal profession (e.g., robotics, computer agent liability, impersonation, authentication, authenticity, privacy), machine learning also plays an emerging role in e-discovery document analysis (e.g., “predictive coding”). Machine learning algorithms exist in many forms, such as supervised and unsupervised models, and the CAPTCHA defeat indicates that, at least for some formerly difficult tasks, the algorithms have reached a formidable level of capability. Thus, while the application of machine learning to legal document analysis still sounds like science fiction to some, the CAPTCHA defeat at least hints that these capabilities are real.
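A supervised model of the kind that underlies predictive coding can be sketched briefly. The example below is an assumption-laden illustration, not a description of any e-discovery product: it uses a small multinomial naive Bayes classifier (one common textbook technique, swapped in here for simplicity), and the six labeled "documents" are invented. A reviewer labels a seed set as responsive or not, the program counts which words go with which label, and it then scores documents it has not seen.

```python
import math
from collections import Counter

# Hypothetical, hand-labeled seed set; real review sets are far larger.
TRAINING = [
    ("merger pricing agreement with competitor", "responsive"),
    ("confidential pricing discussion before merger", "responsive"),
    ("agreement on pricing terms with competitor", "responsive"),
    ("lunch menu for the office party", "not_responsive"),
    ("office parking schedule and party reminder", "not_responsive"),
    ("reminder about the lunch schedule", "not_responsive"),
]


def train(examples):
    """Count words per label and documents per label."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify("draft pricing agreement for the merger", model))  # responsive
print(classify("party lunch reminder", model))                    # not_responsive
```

An unsupervised model, by contrast, would receive the same documents without labels and group them by similarity on its own. In either case the quality of the result depends heavily on the data the algorithm is given.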
Personal Perspective on Machine Learning and Artificial Intelligence
I have had the pleasure of participating this fall (2011) in two non-credit online classes presented in collaboration with Stanford University (one of the leaders in artificial intelligence and machine learning science). The two classes are Artificial Intelligence and Machine Learning. Even in “introductory” courses on these topics, the capabilities demonstrated are startling, especially when you consider that these classes target undergraduate or entry-level graduate computer science majors.
For a simple application, think of a Google search. While we now take such searches for granted, how does the system take an unknown query, parse through trillions of data points, organize them, and return relevant results in milliseconds? This is possible in large part because of machine learning.
The convergence of massive data storage, ready availability of phenomenally powerful computers (even compared with the capabilities of just five years ago), and creative algorithms makes machine learning a reality for the legal community. We are merely seeing hints of what is to come.
Original Publication Date: 2011-11-09
Lucian Constantin, Researchers Defeat CAPTCHA on Popular Websites: New tool is capable of solving CAPTCHA tests on Wikipedia, eBay, CNN and others, CIO Magazine Online (Nov. 1, 2011), http://www.cio.com/article/692902/Researchers_Defeat_CAPTCHA_on_Popular_Websites?source=CIONLE_nlt_insider_2011-11-02
John Leyden, Report: Popular CAPTCHAs easily defeated, The Register (Nov. 2, 2011), http://www.theregister.co.uk/2011/11/02/popular_captchas_easily_defeated/
Constantine von Hoffman, Researchers’ Ability to Break CAPTCHA Highlights Need to Customize All Security Systems, CIO Magazine, IT Security HACK (Nov. 2, 2011), http://blogs.cio.com/security/16599/researchers-ability-break-captcha-highlights-need-customize-all-security-systems?source=CIONLE_nlt_insider_2011-11-03
Sebastian Thrun and Peter Norvig, Introduction to Artificial Intelligence, [Stanford University Collaboration] (2011), https://www.ai-class.com/overview
Andrew Ng, Introduction to Machine Learning, [Stanford University Collaboration] (2011), http://www.ml-class.org
Luis von Ahn, Manuel Blum and John Langford, Telling Humans and Computers Apart Automatically, Communications of the ACM (Feb. 2004), http://captcha.net/captcha_cacm.pdf (a concise description of how CAPTCHAs work)
Official CAPTCHA Website, http://www.captcha.net/
CAPTCHA, Wikipedia, https://secure.wikimedia.org/wikipedia/en/wiki/CAPTCHA
Turing Test, Wikipedia, https://secure.wikimedia.org/wikipedia/en/wiki/Turing_test
FN1: Many of these principles stem from the groundbreaking work on information theory by Claude Shannon. Essentially, humans can communicate even if parts of the communication are left out or unclear (think of text messages or tweets, which frequently leave out letters but are still readable).