Snowden Archive: The SIDtoday Files

Building Human-Language Technology

SUMMARY

Human Language Technology relies on corpora, which are collections of linguistic data. For the NSA's linguistic tools to analyze SIGINT accurately, their datasets need to include analysis of specific terms from classified materials. As a result, the corpora themselves must be classified, and the analysts who work on them must hold security clearances.

DOCUMENT’S DATE

Sep 07, 2006

PUBLICLY AVAILABLE

May 29, 2019

Page 1 from Building Human-Language Technology
DYNAMIC PAGE -- HIGHEST POSSIBLE CLASSIFICATION IS TOP SECRET // SI / TK // REL TO USA AUS CAN GBR NZL

(U) Building Human-Language Technology

FROM: and Human Language Technology (S23)
Run Date: 09/07/2006

(U) cor·pus \'korpəs\ noun. plural: corpo·ra \'korp(ə)rə\ ... A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting point of linguistic description or as a means of verifying hypotheses about a language. -- Dictionary of Linguistics and Phonetics, 3rd edition, 1991.

(S//SI) Analysts: Imagine a future where the tools you use are tailored to handle the complex SIGINT data that you process every day... a future where tools are developed to deal with the unique challenges you face ... a future where a commercial product from an outside vendor can be carefully evaluated using real operational data so that smart decisions can be made about spending Agency funds to provide you with technology that really works....

(U//FOUO) The Corpora Activity, an effort within the Human Language Technology Program Management Office (HLT PMO), is helping to make that future a reality. But how are corpora related to technology? Language corpora, annotated sets of linguistic data, are not merely related; they are actually crucial to the research, development, and evaluation of any HLT tool. They are the foundation on which the tools are built.

(U//FOUO) The careful preparation of data sets that reflect Agency language challenges is especially important in the creation of tools that make their way to analysts' desks, because the tools need to be trained to deal with the material that inundates those desks. Acquiring, preparing, and disseminating corpora -- for voice, text and image, for any stage of HLT development -- is the primary task of the Corpora Activity.

(U) Outside of NSA, in the commercial and academic worlds, linguistic corpora are created using open-source, unclassified data. Qualified workers annotate, or mark up, the data to reflect some aspect of their content, like an interesting speaker or a particular foreign language. Such open-source data are instrumental in conducting foundational research to develop the mathematical algorithms that underlie HLT tools. These data sets are carefully controlled and marked for variables that the technology will be taught to recognize, such as gender, speaker, language, and other factors, and thus they are ideal for many research applications.

(U//FOUO) For the development and evaluation of HLT tools that Agency language analysts use, HLT researchers and developers need classified corpora in addition to unclassified data sets. The creation of these Agency data sets poses unique challenges that are not found in the open-source world. In the outside world, a company may pay to acquire whatever data it needs, crafted to its exact specifications; within the Agency, we must create corpora with the data available to us -- SIGINT intercept.

SERIES: (U) HLT
1. Human-Language Technology in Your Future
2. For Media Mining, the Future Is Now!
3. For Media Mining, the Future Is Now! (conclusion)
4. 'Knowledge Discovery': Finding the Best Material
5. Human-Language Technology - Everywhere
6. Dealing With a 'Tsunami' of Intercept
7. Building Human-Language Technology
8. Strangers in a Strange Land?
Page 2 from Building Human-Language Technology
(U//FOUO) Keeping and distributing open-source data poses no problems in the outside world so long as licensing restrictions are obeyed, yet corpora based on SIGINT must be stored and shared in a way that carefully obeys policy and security restrictions. When annotating open-source data, a commercial company needs to worry only about the qualifications of the workers who annotate them; since our data are classified, any workers who annotate them must, of course, be highly qualified linguists and must also have the appropriate clearances. These issues and others make the development of classified corpora particularly challenging.

(U//FOUO) But when there is a challenge, there is also an element of excitement. The work of annotating SIGINT language material may be difficult, but it can also be an interesting and intense diversity activity that may teach language analysts more about their own languages. It can be rewarding to analysts who want to help their mission by supporting the development of tools that will be tailored to their specific requirements. Diversity tours and details are available within the Research Directorate (R6), Analytic Automation Technologies (S202B1) and the HLT PMO (S23) itself.

(U) Acquiring and preparing linguistic corpora are essential basic steps for conducting the research and development of tomorrow's HLT products. Laying the groundwork with linguistic data that accurately reflects Agency issues will help bring the analysts' future closer to today.

"(U//FOUO) SIDtoday articles may not be republished or reposted outside NSANet without the consent of S0121 (DL sid_comms)."

DERIVED FROM: NSA/CSSM 1-52, DATED 08 JAN 2007
DECLASSIFY ON: 20320108