Emerging eDiscovery Tools and the New, Technology-Augmented Lawyer

“Any sufficiently advanced technology is indistinguishable from magic.”—Arthur C. Clarke

Approximately 96% of all documents now originate in electronic format. An estimated 5 exabytes of data is created every two days, roughly as much recorded data as all of humankind created from the dawn of history up to 2003. The volume of electronically stored information (ESI) should not surprise attorneys. Consider the increasing role of emails, text messages, social media, word processing documents, PDFs, database records, “cloud” documents, cell phone records, and geolocation information in administrative, civil, and criminal matters. We increasingly labor under a data deluge.

Predictive Analytics Tools Emerge—Predictive Coding, Technology Assisted Review (TAR), Computer Assisted Document Analysis (CADA), Etc.

Acknowledging the volume of electronic documents, how can an attorney efficiently and effectively analyze materials to respond to a discovery request or to evaluate records produced by opposing counsel? New legal technologies deliver potential solutions. Much like the shift from paper-based legal research to online electronic search, lawyers will need to learn how to use an emerging class of “data analysis technologies”—called predictive coding, Technology Assisted Review (TAR), Computer Assisted Document Analysis (CADA), or legal analytics. No matter which term is used, these new technologies will become indispensable to the new, technology-augmented lawyer.

This article will use the term Computer Assisted Document Analysis (CADA) to refer to these technologies and processes. CADA software uses advanced computer algorithms that allow a computer to learn about specific tasks from an expert and then to apply that learning to other similar situations. Put simply, a computer now has the capacity to learn—with striking accuracy and consistency. As science fiction writer Arthur C. Clarke wrote, “any sufficiently advanced technology is indistinguishable from magic,” and CADA may, at first, seem a little like magic.

Proven Track Record for Predictive Analytics Algorithms

But CADA isn’t magic. CADA software derives from decades of artificial intelligence (AI), machine learning, statistics, algorithms, and natural language processing research. These algorithms have a long history and are proven in many fields including medicine, insurance, engineering, and finance.

In fact, most of you have probably already encountered some form of these algorithms. Most online shopping sites and search engines use forms of these algorithms to recommend similar products or to rank search results by relevance. The new, natural-language-based legal research tools from Westlaw and LexisNexis use these algorithms. Some of you might have seen IBM’s Watson computer play the game Jeopardy!, some might use Apple’s Siri or voice-recognition software, and all probably use spam email filters and anti-virus software, all permutations of the algorithms underlying CADA. Digital X-rays and medical tests used by dentists and doctors likely rely on these algorithms. Even “scanning” paper documents into electronic copies with optical character recognition (OCR) relies on these technologies, as do voice recognition systems. Thus, while new to the legal community, these algorithms are in mainstream use.

Using Predictive Coding Tools for Legal Matters

For the legal community, these algorithms may help an attorney to accurately and consistently sort through documents to reduce the volume of irrelevant materials and to classify the documents. Here is an example of how an attorney might use CADA software. Assume that an attorney has 100,000 documents to review. The attorney loads the electronic documents into the CADA software, and the CADA software pre-processes the documents into a machine-understandable format. After loading, the CADA software randomly selects a subset of the documents (perhaps several hundred) for attorney review.
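For readers curious about what “randomly selects a subset” could look like in practice, here is a minimal Python sketch. It is only an illustration, not how any particular CADA product works: the function name select_seed_sample, the numeric document identifiers, the sample size of 800, and the fixed random seed are all assumptions made for this example.

```python
import random

def select_seed_sample(document_ids, sample_size, seed=None):
    """Return a simple random sample of document IDs, drawn without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(list(document_ids), sample_size)

# Hypothetical collection: 100,000 documents identified by number.
all_documents = range(1, 100_001)

# 800 is an illustrative sample size, not a recommendation.
seed_sample = select_seed_sample(all_documents, sample_size=800, seed=2013)
print(len(seed_sample), "documents selected for attorney review")
```

Recording the seed value means the same sample can be drawn again later, which may help if the selection method ever needs to be explained or repeated.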

The attorney reviews and analyzes each randomly sampled document much as one would review a paper document. During review, the attorney also assigns a legally significant classification to each randomly selected document—for example, relevant/not-relevant, privileged/not-privileged, important/not-important. (You might think of this as the electronic equivalent of sorting the reviewed documents into “piles” based on some criteria.) This attorney-review results in a “seed-set” of expertly sorted documents.

After the seed-set is prepared, the software executes a special “training algorithm” on the seed-set. This is where the computer learns. Put simply, the training algorithm exhaustively analyzes the attorney-classified seed-set and generates a comprehensive mathematical learning-model summarizing the attorney’s analysis. A summary training report tells the attorney how accurate and potentially effective the learning-model is. Once the computer is trained, the computer is ready to make predictions about the remaining documents.
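To make the idea of a “training algorithm” and a “training report” a little more concrete, the following Python sketch trains a very small word-counting classifier (a naive Bayes model) on a handful of made-up documents and then checks it against two documents it has not seen. This is a stand-in chosen for brevity; this article does not say which algorithms any CADA product actually uses. The sample texts, the labels, and the function names tokenize, train, and predict are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Split a document into lowercase words."""
    return text.lower().split()

def train(seed_set):
    """Build a word-count model from (text, label) pairs classified by the attorney."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in seed_set:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest naive Bayes log-score."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:  # skip words never seen during training
                # Add-one smoothing keeps unseen label/word pairs from zeroing the score.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical attorney-classified seed-set.
seed_set = [
    ("merger agreement draft attached for review", "relevant"),
    ("revised merger terms and purchase price", "relevant"),
    ("board approved the merger agreement", "relevant"),
    ("lunch order for friday team meeting", "not-relevant"),
    ("office holiday party friday invitation", "not-relevant"),
    ("parking garage closed for cleaning", "not-relevant"),
]

# Hypothetical documents held back to check the model (a tiny "training report").
held_out = [
    ("please review the merger purchase price", "relevant"),
    ("friday lunch invitation", "not-relevant"),
]

model = train(seed_set)
correct = sum(predict(model, text) == label for text, label in held_out)
accuracy = correct / len(held_out)
print(f"Held-out accuracy: {accuracy:.0%}")
```

A real seed-set would contain hundreds of documents, and a real training report would be far more detailed, but the pattern is the same: learn from the attorney's classifications, then measure how well the learning carries over to documents the model has not seen.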

The ability to make predictions represents the efficiency of CADA. For example, if the attorney started with 100,000 documents, she might look at only a random sample of 800 documents to create the seed-set; the computer then analyzes the remaining 99,200 documents (using the learning-model) and predicts their classifications. The computerized predictions for the remaining 99,200 documents might take only a few minutes, compared with the days or weeks the same review would otherwise take the attorney. When properly trained, the computer also achieves high accuracy and consistency (the computer never has a bad day, gets bored, or gets distracted). Thus, CADA may help make an otherwise very time-consuming review suddenly feasible, efficient, accurate, and consistent. This is why large law firms are already using these technologies and why smaller firms might consider them.
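The arithmetic behind this example is simple enough to check in a few lines of Python. The figures come from the example above; the function name review_split is hypothetical.

```python
def review_split(total_documents, seed_set_size):
    """Return (documents left for the computer, share reviewed by the attorney)."""
    remaining = total_documents - seed_set_size
    share_reviewed = seed_set_size / total_documents
    return remaining, share_reviewed

remaining, share_reviewed = review_split(total_documents=100_000, seed_set_size=800)
print(f"{remaining:,} documents left for the computer to classify")
print(f"The attorney personally reviews {share_reviewed:.1%} of the collection")
```

In other words, the attorney personally reviews less than one percent of the collection, and the computer predicts classifications for the rest.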

This example briefly illustrates some of the basic techniques and procedures, but attorneys should understand that there is a lot more to learn about this topic. Furthermore, while CADA is powerful, it is not necessarily a panacea. First, CADA might not be appropriate for all matters. CADA achieves the highest efficiency with larger datasets (roughly more than 10,000 items). Second, generally speaking, CADA algorithms predictively classify documents. Thus, the utility and efficiency are especially high when 1) culling through large document collections to identify groups of items, such as for production during discovery, or 2) initially reviewing materials received from an opposing party, especially if the items received are just a “data dump.” (CADA might also be a promising tool to help during preliminary case evaluation or when developing a settlement strategy: spending some time with a CADA system might help determine the merits of a matter or the strength of the documents supporting a case.) Third, while high accuracies are achievable, accuracy may vary from matter to matter (due to the types of data involved in each project). Even with these potential limits, CADA can be a compelling addition to the attorney’s toolkit.

Conclusion: A New Legal Community Paradigm

With computers seemingly doing the work, won’t these algorithms replace attorneys? Probably not. The volume of data in cases where CADA applies is already beyond what attorneys can reasonably and competently handle using traditional methods. Also, CADA tools fundamentally require direct attorney input; this is still law practice, albeit in a new form. Thus, at worst, the new CADA technologies will simply change how attorneys perform some legal analysis (much as online legal research tools changed Shepardizing with books). At best, the algorithms may help the profession to become more relevant, more responsive, and more viable. Both sophisticated clients and courts are already advocating for CADA use (not to mention pending changes to the ethics rules that will require knowledge of these types of technologies).

Thus, CADA certainly shows promise for the legal profession. CADA software is viable (and pretty cool) but, as with any tool, has limitations. These technologies will become a law practice efficiency (and possibly survival) issue. Certainly, the demands on the legal profession are changing—due to the nature and volume of materials—warranting adoption of new methods. The adoption foretells the era of the new, technology-augmented lawyer.

Publication & Citation Information:

Article Submission Date: 2013-05-15
Article Publication Date: 2013-06-13
Please Cite as: Shannon Brown, Emerging eDiscovery Tools and the New, Technology-Augmented Lawyer, In Brief, 16 Lancaster Bar Association Newsletter 2 at 15, 19 (Second Quarter 2013), available at http://www.shannonbrownlaw.com/?p=1824.
Note: Headings not in original print publication.