SIRDCC Speech Technology WG assessment of current STT technology

UK SECRET
December 2009

Security Service have asked the SIRDCC Speech Working Group to give its technical assessment of the current state of the art in Speech to Text and how it is likely to develop.

Executive summary

The SIRDCC Speech Working Group has evidence that current state of the art STT is capable of providing some business benefit in very specific circumstances. It has still to prove itself in larger-scale applications, but the potential for major benefits in productivity in the future is clear, given sufficient investment in further developing the systems for our target speech.

The Working Group believes that the effective way to achieve these benefits is to continue to fund research and development activities. Where practical this should be supplemented with small-scale pilot [...] to [...] the areas where immediate business benefit can be got, so as to help focus the R&D investment.

The underlying technology used by all existing state-of-the-art systems is similar, and thus each is in principle capable of obtaining similar results in any given application, given sufficient investment in bespoke development and tuning. However the BBN system, currently deployed at GCHQ for the last 5 years and at NSA for longer, has proved itself stable, currently outperforms others on the standard measure of error rate and is therefore recommended for operational pilots in the near term.

The decision as to when and [...] it is [...] to [...] an operational pilot in any agency must depend on business decisions internal to that agency, but it is [...] that we share and collaborate to the fullest extent to minimise costs and maximise benefits.

Context

Security Service and GCHQ have been collaborating on research and development of capability for Speech to Text (STT), also known as Automatic Speech Recognition (ASR), for a number of years under the auspices of the SIRDCC Speech Working Group. The aims are to assess the applicability of the technology to gain business benefit, and to conduct research and development to advance the technology where needed. The other members of the Speech WG have a [...] interest in the outcome as a means of [...] their future investment decisions.

DARPA evaluation programme

The DARPA evaluation programme, with significant steer from NSA, has been the main driving force behind technology improvements in the field. Unfortunately the results of the evaluations are not put in the public domain, making reference difficult. Most of the large corpora of transcribed speech were produced under this programme for evaluation purposes: they are made up of rather artificial conversations between speakers (often college students) who are paid to take part.

Cambridge University and BBN have participated throughout the lifetime of the programme: they have joined forces for the current phase [...]. Both have always been at the forefront. So were Dragon until their collapse and IBM until they pulled out a few years ago. IBM have subsequently re-entered with the stated objective of obtaining better than human performance, and they marginally outperformed the [...] entry in the most recent evaluation. Other research labs and universities have also taken part but have never done as well as the organisations mentioned above. SAIL have never participated.

The systems used in these evaluations are research software, and not written for use by anyone other than the originating labs. A version of the BBN system is the only exception to this, having been in use at NSA for about 10 years. In this period a lot of effort has been put into giving it at least some robustness and usability, and into making it user-trainable. Cambridge University have always taken the view that their software was for running on their own site only, though a modular toolkit, HTK, is publicly available. To the best of our knowledge Security Service's purchase of Attila from IBM is the first instance of it being trained other than at its originating site, though we have reports that DSTO and CIA are also investigating its performance.

NSA programme

NSA have had the BBN speech-to-text system Byblos running at Fort Meade for at least 10 years. (Initially they also had Dragon.) During this period they have invested heavily in producing their own corpora of transcribed Sigint in both American English and an increasing range of other languages. Their application of English is to [...] monitoring. One of [...] hopes is that NSA will give it access to the models being trained on SIGINT data, since NSA have considerable difficulty in releasing the intercept itself. This is one of the motives for adopting Byblos, since models trained by one system cannot be used by another.

GCHQ/Security Service approach

We have pursued our aims in this field in two main ways, evaluating systems as delivered and obtaining training data to seek to [...] them. Our goals have been:
- to evaluate the technology itself and its business applicability;
- to [...] a comparative evaluation of competing systems to decide where best to concentrate our resources.

Systems evaluation

GCHQ has licensed the system from BBN since 2002. This system was chosen partly because it was the [...] system in external trials run by DARPA, but [...] because it was already in use as a research system within NSA, who were also funding much of its development. GCHQ also funded some specific development by BBN in 2006 in order to make it more easily deployable on our systems.

Security Service has investigated the [...] of speech recognition [...] IBM. The initial judgement of IBM, made in 2001, was that their [...] was not yet ready, but their comparative success in trials in 2004 led to renewed interest. Security Service arranged for further trials on UK-accented speech by IBM. In 200[?] Security Service licensed the IBM Attila system and funded IBM to help build and evaluate a speech recogniser specifically for Security Service product.

Security Service (A2K), with funding assistance from GCHQ, has investigated the [...] of speech recognition [...] a European company, SAIL Labs of Vienna. SAIL have licensed their system to Security Service and built a speech recogniser for evaluation.

Bulk transcription

It has been recognised for several years that the main obstacle to effective STT of intercepted speech was the mismatch between the models of speech used in STT systems and the intercept. To address this using current STT technology, tens or hundreds of hours of speech must be carefully transcribed at great cost in order to provide training data. There are deficiencies in current STT systems:
- their models of conversational English speech are biased strongly towards US English;
- the material is gathered openly and is not representative of the speech of the majority of our targets.

GCHQ and Security Service have collaborated to acquire, transcribe and share data sets. [...] of these have been UK English of various regional accents, obtained commercially, but we also have a substantial corpus of regional Arabic. A small amount ([...] hours in total) has been transcribed intercept. Of this, there is one significant UK-regional corpus, NIBAD, which is 50 hours of Northern Irish accented speech.

The very high cost of transcription for STT purposes (of the order of £1500 per hour of speech) makes it vital that we continue to collaborate and share as much as possible.

Status in December 2009

Systems evaluation

The NIBAD corpus has been used to train and evaluate all three systems. The results are reported in a joint GCHQ-Security Service paper [...]. The overall figures on word error rate were: BBN 63%, IBM 82%, SAIL 101%. The figures for word accuracy were: BBN 42%, IBM 32%, SAIL 20%. Note that error rate and accuracy do not necessarily add up to 100%, as the error rates are normalised with respect to the true transcript and there may be additional words incorrectly inserted by the recogniser.

The analysis shows that the BBN recogniser is better than the IBM recogniser at transcribing words by a significant margin, as measured by the number of words in each speech file that it got correct (better in [...] out of [...] files). The analysis also shows that by this measure the IBM recogniser is better than the SAIL recogniser by a significant margin (better in 5[?] out of 59 files).

There is substantial variation in the recognition rates of individual words. See the Appendix for a representative sample of text as transcribed by the BBN Byblos system, and how bespoke training improves the recognition. There is also a table of the best recognised words, other than those which are recognised 100%, which are mostly singletons perhaps well-recognised by accident.

For these experiments Byblos was trained by GCHQ staff with no BBN involvement. The SAIL system was trained by its developers. Attila was trained by Security Service with assistance from an IBM engineer.

Several lessons have been learnt from this evaluation:
- the results for Byblos are comparable with SIGINT experience (though admittedly somewhat worse), confirming that [...] experience is applicable to our data;
- this is the first time to our knowledge that the SAIL system has been objectively evaluated;
- it is the first time Attila has been trained on intercept. However there is a lot of uncertainty over the reasons for its worse performance than Byblos's. One factor, probably, is lack of skill in its use: the IBM engineer who assisted Security Service was new to the field. Another factor is that experience from SIGINT applications has not fed into Attila in the way it has into Byblos. This was the interpretation BBN put on the result when informed of it: their lead developer commented that "I doubt that fundamental technology is somehow irretrievably behind but it's nice to know that the effort that you and we invest in making Byblos run 'somewhat smoothly' on challenging data can pay off in this way."

Since this evaluation was completed, the IBM system has been retuned by IBM and the BBN system retuned by GCHQ (no further work has been done on the SAIL system). The current best performance is word error rate: BBN [...], IBM [...], SAIL 101%; and word accuracy: BBN 45%, IBM 42%, SAIL 20%.

Bulk transcription

The need for additional bulk transcription can be seen from the data presented in the Figure at the end of this report. It shows data points derived from NSA experiments on a variety of languages, as well as data points drawn from NIST evaluations sponsored by [...]. Each point shows the measured word error rate (or character error rate for Korean and Mandarin) for a given number of hours of transcribed training data. All points are got using the Byblos system, and all except those labelled "[...] English" correspond to experiments conducted on transcribed SIGINT data.

There are three lines drawn on the figure. The bottom one, labelled "[...] English", shows the performance of models built on public data, assessed on such data. There is a clear trend of improved performance associated with the use of more training data, but note that the improvement is only logarithmic. The top one, labelled "Unclass. system on IA English", shows the performance of these same models on an Information Assurance application, where the speech to be transcribed is US English. The trend is the same, but there is a significant performance gap - of the order of 20 percentage points. The middle line, labelled "IA English", shows the improvement that can be got by training a bespoke model for the task. There is still a substantial residual gap of around [...] percentage points between the [...] line and the IA English line. The reason for this gap is not known, but it is clear that there has been a substantial improvement of performance of the order of 13 percentage points by using bespoke training.

The remaining points for other languages have much more variation, but overall are compatible with the existence of a similar trend of better performance associated with using more data. We have no information for these other languages on how much worse the performance would have been if public data had instead been used in the system training; these points are all drawn from models built using intercept.
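The logarithmic trend described above can be illustrated with a small curve fit. The following is a minimal sketch, assuming the relationship has the form WER = a + b * ln(hours). The function names are hypothetical and the data points are placeholders chosen only to make the example run; they are not values read from the Figure.

```python
import math


def fit_log_trend(hours, wer):
    """Least-squares fit of wer = a + b * ln(hours); returns (a, b)."""
    if len(hours) != len(wer) or len(hours) < 2:
        raise ValueError("need at least two (hours, wer) pairs")
    x = [math.log(h) for h in hours]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(wer) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("training-data amounts must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, wer))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_wer(a, b, hours):
    """Evaluate the fitted trend at a given amount of training data."""
    return a + b * math.log(hours)


# Placeholder points (hours of transcribed training data, WER in percent).
# These are assumptions for illustration, NOT figures from the report.
example_hours = [10, 50, 100, 500]
example_wer = [70.0, 58.0, 53.0, 41.0]

a, b = fit_log_trend(example_hours, example_wer)

# Under a logarithmic trend every doubling of training data shifts the
# error rate by the same number of percentage points, b * ln(2).
change_per_doubling = b * math.log(2)
```

The practical consequence of such a trend is that each further fixed reduction in error rate requires a multiplicative, not additive, increase in transcribed hours, which is why the cost of transcription matters so much.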

The point for NIHAD English is high in comparison with the broad trend for all the non-IA English languages: one would have expected a word error rate of closer to 50% rather than the 62.5% measured. This may be due to the nature of the data, as it has been recorded with both sides of the conversation merged, which is known to have an adverse effect on the performance of speech processing algorithms. We cannot explain the substantial gap between the performance on IA English and that on all other languages; it may be attributable to an inbuilt bias in current speech recognition systems towards features of US English caused by decades of intensive research driven by US funding using US speech data.

GCHQ operational experience

GCHQ has been making operational use of Byblos since around 2004. The transcripts it produces unaided have not been of sufficient accuracy to have any value, but the technique of language-model biasing has enabled GCHQ to tailor Byblos to specific keywords or strings of interest. (The possibility of sharing techniques of this sort is a further reason to aim for compatibility between agencies.)

The first application was to strings of digits spoken by Caribbean drugs runners. GCHQ was able to detect spoken telephone numbers with high reliability using an out-of-the-box recogniser whose error rate was greater than [...] under the standard metric. Since then several instances of number detection have been deployed. In one recent case the digits are recognised with sufficient accuracy for it to be worth reporting their values to [...] rather than just reporting their detection. GCHQ has one deployed example of keyword detection other than spoken digits, but has had difficulty in persuading [...] to propose suitable search strings.
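The "standard metric" mentioned here is word error rate. As a minimal sketch of how word error rate and word accuracy are conventionally computed from a minimum-edit-distance alignment, the following may help show why the two need not sum to 100%: inserted words count against the error rate but do not reduce accuracy. The function names and the sample sentences are illustrative assumptions, not taken from the evaluation or from any particular scoring tool.

```python
def align_counts(reference, hypothesis):
    """Return (n_ref_words, substitutions, deletions, insertions)
    for one minimum-edit-distance alignment of two word strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    # table[i][j] = (edits, subs, dels, ins) aligning ref[:i] with hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)      # all reference words deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)      # all hypothesis words inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, n)             # correct word
            else:
                diag = (e + 1, s + 1, d, n)     # substitution
            e, s, d, n = table[i - 1][j]
            delete = (e + 1, s, d + 1, n)       # reference word missed
            e, s, d, n = table[i][j - 1]
            insert = (e + 1, s, d, n + 1)       # extra hypothesis word
            table[i][j] = min(diag, delete, insert, key=lambda c: c[0])
    _, subs, dels, ins = table[len(ref)][len(hyp)]
    return len(ref), subs, dels, ins


def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / reference length."""
    n, subs, dels, ins = align_counts(reference, hypothesis)
    if n == 0:
        raise ValueError("reference transcript is empty")
    return (subs + dels + ins) / n


def word_accuracy(reference, hypothesis):
    """Correctly recognised reference words / reference length."""
    n, subs, dels, _ = align_counts(reference, hypothesis)
    if n == 0:
        raise ValueError("reference transcript is empty")
    return (n - subs - dels) / n


# Illustrative sentences (assumptions): the hypothesis contains every
# reference word plus three extra ones, so only insertions occur.
truth = "call me on thursday morning"
output = "well call me on the thursday morning please"
wer = word_error_rate(truth, output)
acc = word_accuracy(truth, output)
```

Because both quantities are normalised by the length of the true transcript, a recogniser that inserts many spurious words can have an error rate above 100% while still getting a useful fraction of the words right. When several alignments tie on total edits, the split between substitutions and deletions, and hence the accuracy figure, can depend on the tie-breaking rule of the scoring tool.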
GCHQ expects to be able to extend the range of deployments over the next couple of years, owing both to the wider range of languages available and to improved accuracy as Sigint corpora get transcribed. The operational benefit in the short term is likely to remain small compared with other technologies such as diarisation, gender and speaker ID.

Conclusion

The current state of technology is that systems are capable of automatic transcription with word error rates of between [...] and [...], given amounts of training data of the order of hundreds of hours. The cost of transcribing this amount of training data is substantial, of the order of [...] for 300-400 hours of material.

The accuracy required of a system in order for it to provide business benefit will depend on the business application, and we do not yet have a good understanding of this. GCHQ have successfully deployed several STT applications to locate the existence of spoken numbers such as telephone numbers in speech. They have also deployed a STT application which locates the existence of specific [...]. In each of these applications, success has been achieved using an extremely poor core STT model (the default unclassified one supplied by BBN), with the [...] enhanced by tailoring the language model. As the [...] of STT systems [...], either by providing more training data or by technical advances in the algorithms used, so the range of applications for which they can provide business benefit will expand.

In the [...] term it is difficult to predict [...] the [...] will evolve. Our judgement is that the recent [...] in [...] driven by large-scale US investment is likely to plateau as the [...] of STT on transcription of [...] or public speech attains levels approaching [...] accuracy. US investment is now moving towards follow-on applications such as machine translation of the recognised speech. There remains a significant gap between the [...] measured on public data and the [...] measured on intercept data, which may limit the potential for transcription of intercept data to accuracies of the order of [...] using current [...]. However, to achieve such levels of accuracy will need substantial investment in bespoke training, and we should not wait for them to be achieved before seeking applications.

It is premature to choose between the IBM and BBN systems in terms of [...] on classified material, as we only have one experiment to guide us. However the fact of the experience of BBN in developing systems for use on SIGINT material makes it the preferred system for operational [...] in the [...] term.

State of the art speech recognisers are not shrink-wrapped products and require substantial training in order to understand [...] to use them and exploit them. There is no standard for STT models, and so models built for one recogniser are not portable to another. STT models are not cheap to build, requiring of the order of a year of CPU time (depending on the amount of data). These mean that there is considerable benefit to be had in agencies agreeing to use a [...] system in the [...] term, which would allow pooling of expertise and sharing of built models.

Chair, SIRDCC Speech Working Group

References

Minutes of SIRDCC Working Group Meeting on Speech [...]

Comparative evaluation of three commercial [...] (revised 2009-12-0[?])

Figure: Error rates from training Byblos on different amounts of data

Appendix: Illustrative text and [...] 100 words

BBN Byblos transcription; correct words are marked in red

As delivered 2007

Truth: great o.k. that that's that's perfect o.k. well listen [talking] to derry give me i'll expect you there i will expect a call maybe some time thursday morning

critical credit beck purple it was miles to go before you on the communal experts will but the coma mission and mourn

Bespoke trained 2009

Truth: great o.k. that that's that's perfect o.k. well listen

right o.k. but that is that's that's perfect o.k. what

Truth: [talking] to derry and [talking] give me i will expect you there i i will expect a call maybe some time thursday morning

on the faricnes should give me A *tft all to go to the hospital call maybe some cunt was a morning

The [...] words (other than 100%) with their frequency counts

94% 23% 59% 56% CRAIC 1? SOMEBODY 18 LAST 25 NO 251 FELT 3 FUCKING 204 WEEK 22 2E 390 FIFTY 15 SCALLY El FRIDAY 25 BELFAST 11 45 FIND 12 3D TWELVE 13 SIX 33 TOLD 35 HOPEFULLY 3 DIFFERENT SEVEN 42 GIVE ?5 NUMBER 5? '3 MUMMY 14 AGAIN 2Q RIGHT 2B4 13E- JOKING 3 NINETY TALKING 18 PHONE 4? LEAST 3 YEAH ALREADY 4 REALLY 25 SAYS 135 3 WEEKEND 12 CHECKED 4 CHANCE HALF 2E MOVING 3 BACK 103 DEAD El DRIVING HUNDRED BE- MUCH 33 CLEAR 5 DUBLIN 4 ELEVEN 2B 3 NIGHTMARE 3 COUPLE 15 EACH 4 MOBILE BLAME 3 3 DRINK 5 EXACTLY 21 BRILLIANT 12 3 3 NEXT 24 CHRISTMAS E- 3 HELLO 100 4 BIG 1? CLEAN E- 3 COMING 19 El 40 DATE 3 QUID E- INUT 19 PARK 4 MONDAY 10 DERRY 3 SEAN 3 1?3 PICTURES 4 SOMEWHERE 1D DRINKING 3 3 DOUBLE ?3 THIRTEEN 4 ANYWAY 23 DRUNK 3 SIXTY '3 REMEMBER '3 BRAND 15 3E- DURING Er 3
