SIRDCC Speech Technology WG assessment of current STT technology
May 5, 2015
UK SECRET
December 2009

Security Service have asked the SIRDCC Speech Working Group to give
its technical assessment of the current state of the art in Speech to Text
and how it is likely to develop.

Executive summary

The SIRDCC Speech Working Group has evidence that current state of
the art STT is capable of providing some business benefit in very specific
circumstances. It has still to prove itself in larger-scale applications, but the potential
for major benefits in productivity in the future is clear, given sufficient investment in
further developing the systems for our target speech.

The Working Group believes that the effective way to achieve these benefits is
to continue to fund research and development activities. Where practical this should
be supplemented with small-scale pilots to identify the areas where
immediate business benefit can be got, so as to help focus the R&D investment.

The underlying technology used by all existing state-of-the-art systems is similar, and
thus each is in principle capable of obtaining similar results in any given application,
given sufficient investment in bespoke development and tuning. However the BBN system,
currently deployed at GCHQ for the last 5 years and at NSA for longer, has proved
itself stable, currently outperforms the others on the standard measure of error rate,
and is therefore recommended for operational pilots in the near term.

The decision as to when and whether to run an operational pilot in
any agency must depend on business decisions internal to that agency, but it is
important that we share and collaborate to the fullest extent to minimise costs and
maximise benefits.

Context

Security Service and GCHQ have been collaborating on research and development
of capability for Speech to Text (STT), also known as Automatic Speech Recognition
(ASR), for a number of years under the auspices of the SIRDCC Speech
Working Group. The aims are to assess the applicability of the technology to gain
business benefit, and to conduct research and development to advance
the technology where needed.

The other members of the Speech WG have an interest in the outcome as a
means of informing their future investment decisions.

DARPA evaluation programme

The DARPA evaluation programme, with significant steer from NSA, has been the
main driving force behind technology improvements in the field. Unfortunately the
results of the evaluations are not put in the public domain, making reference difficult.
Most of the large corpora of transcribed speech were produced under this
programme for evaluation purposes: they are made up of rather artificial
conversations between speakers (often college students) who are paid to take part.

Cambridge University and BBN have participated throughout the lifetime of the
programme: they have joined forces for the current phase. Both have always
been at the forefront. So were Dragon until their collapse and IBM until they pulled
out a few years ago. IBM have subsequently re-entered with the stated objective of
obtaining better than human performance, and they marginally outperformed the
joint entry in the most recent evaluation.

Other research labs and universities have also taken part but have never done as
well as the organisations mentioned above. SAIL have never participated.

The systems used in these evaluations are research software, and not written for use
by anyone other than the originating labs. A version of the BBN system is the only
exception to this, having been in use at NSA for about 10 years. In this period a lot of
effort has been put into giving it at least some robustness and usability, and into
making it user-trainable.

Cambridge University have always taken the view that their software was for running
on their own site only, though a modular toolkit, HTK, is publicly available.

To the best of our knowledge Security Service's purchase of Attila from IBM is the
first instance of it being trained other than at its originating site, though we have
reports that DSTO and CIA are also investigating its performance.

NSA programme

NSA have had the BBN speech-to-text system Byblos running at Fort Meade for at
least 10 years. (Initially they also had Dragon.) During this period they have invested
heavily in producing their own corpora of transcribed Sigint in both American English
and an increasing range of other languages. Their application of English is to
monitoring. One of GCHQ's hopes is that NSA will give it access to the
models being trained on SIGINT data, since NSA have considerable difficulty in
releasing the intercept itself. This is one of the motives for adopting Byblos,
since models trained by one system cannot be used by another.

GCHQ/Security Service approach

We have pursued our aims in this field in two main ways, evaluating systems as
delivered and obtaining training data to seek to improve them. Our goals have been:
to evaluate the technology itself and its business applicability; and to carry out a
comparative evaluation of competing systems to decide where best to concentrate
our resources.

Systems evaluation

GCHQ has licensed the Byblos system from BBN since 2002.
This system was chosen partly because it was a leading system in
external trials run by DARPA, but also because it was already in use as a
research system within NSA, who were also funding much of its development. GCHQ
also funded some specific development by BBN in 2006 in order to make it more
easily deployable on our systems.

Security Service has investigated speech recognition technology from
IBM. The initial judgement of IBM, made in 2001, was that their technology was not
yet ready, but their comparative success in trials in 2004 led to renewed
interest. Security Service arranged for further trials on UK-accented speech
by IBM. In 2000 Security Service licensed the IBM Attila system and funded IBM
to help build and evaluate a speech recogniser specifically for Security Service
product.

Security Service (A2K), with funding assistance from GCHQ, has investigated
speech recognition technology from a European company, SAIL Labs of Vienna.
SAIL have licensed their system to Security Service and built a speech recogniser for
evaluation.

Bulk transcription

It has been recognised for several years that the main obstacle to effective STT of
intercepted speech was the mismatch between the models of speech used in STT
systems and the intercept. To address this using current STT technology, tens or
hundreds of hours of speech must be carefully transcribed at great cost in order to
provide training data. There are two deficiencies in current STT systems. First, their
models of conversational English speech are biased strongly towards US English.
Second, the material is gathered openly and is not representative of the speech of
the majority of our targets.

GCHQ and Security Service have collaborated to acquire, transcribe and share data
sets. Most of these have been UK English of various regional accents, obtained
commercially, but we also have a substantial corpus of regional Arabic. A small
amount (hours in total) has been transcribed intercept. Of this, there is one
significant UK-regional corpus, NIBAD, which is 50 hours of Northern Irish
accented speech.

The very high cost of transcription for STT purposes (of the order of £1500 per hour
of speech) makes it vital that we continue to collaborate and share as much as
possible.

Status in December 2009

Systems evaluation

The NIBAD corpus has been used to train and evaluate all three systems. The
results are reported in a joint GCHQ-Security Service paper.

The overall figures on word error rate were: BBN 63%, IBM 82%, SAIL 101%. The
figures for word accuracy were: BBN 42%, IBM 32%, SAIL 20%. Note that error rate
and accuracy do not necessarily add up to 100%, as the error rates are normalised
with respect to the true transcript and there may be additional words incorrectly
inserted by the recogniser.
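
The way the two measures relate can be checked with a little arithmetic. The following is a minimal sketch, assuming the conventional scoring definitions, which this report does not itself spell out: word error rate is (substitutions + deletions + insertions) divided by the number of words in the true transcript, and word accuracy is correctly recognised words divided by the same number. Under those assumptions the two figures sum to 100% plus the insertion rate. The function names and the breakdown of counts in the last line are illustrative only.

```python
def wer_and_accuracy(hits, subs, dels, ins):
    """Return (word error rate %, word accuracy %) from alignment counts.

    Assumed definitions: both are normalised by the number of words in
    the true transcript, which is hits + subs + dels.
    """
    ref_words = hits + subs + dels
    wer = 100.0 * (subs + dels + ins) / ref_words
    accuracy = 100.0 * hits / ref_words
    return wer, accuracy


def implied_insertion_rate(wer_pct, accuracy_pct):
    """Insertions per 100 true-transcript words implied by the two figures."""
    return wer_pct + accuracy_pct - 100


# Figures quoted above: system -> (word error rate %, word accuracy %).
reported = {"BBN": (63, 42), "IBM": (82, 32), "SAIL": (101, 20)}
for system, (wer, acc) in reported.items():
    print(system, implied_insertion_rate(wer, acc))

# A hypothetical breakdown of counts that would reproduce the BBN figures.
print(wer_and_accuracy(hits=42, subs=50, dels=8, ins=5))
```

On these assumptions an error rate above 100% simply means that substitutions, deletions and insertions together outnumber the words in the true transcript.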

The analysis shows that the BBN recogniser is better than the IBM recogniser at
transcribing words by a significant margin, as measured by the number of words in
each speech file that it got correct (better in BB out of BB files).

The analysis also shows that by this measure the IBM recogniser is better than the
SAIL recogniser by a significant margin (better in 5? out of 59 files).
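
The report does not state which test lies behind "a significant margin". One simple way to judge a per-file comparison of this kind is an exact sign test: if two recognisers were equally good, each would be expected to win about half of the files. The sketch below is an assumption about method, not a description of the analysis actually carried out, and the counts in the usage lines are hypothetical rather than the report's figures.

```python
from math import comb


def sign_test_p_value(wins, n_files):
    """Two-sided exact sign test p-value (ties assumed already discarded).

    Gives the probability of a split at least as lopsided as the one
    observed if each recogniser were equally likely to win any file.
    """
    k = max(wins, n_files - wins)
    tail = sum(comb(n_files, i) for i in range(k, n_files + 1)) / 2 ** n_files
    return min(1.0, 2 * tail)


# Hypothetical counts: one recogniser better on 50 of 59 files.
print(sign_test_p_value(50, 59))

# An even split gives no evidence either way.
print(sign_test_p_value(5, 10))
```

A very small p-value indicates that so one-sided a split would be unlikely to arise by chance between equally good recognisers.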

There is substantial variation in the recognition rates of individual words. See the
Appendix for a representative sample of text as transcribed by the BBN Byblos
system, and how bespoke training improves the recognition. There is also a table of
the best recognised words, other than those which are recognised 100%, which are
mostly singletons perhaps well-recognised by accident.

For these experiments Byblos was trained by GCHQ staff with no BBN involvement.
The SAIL system was trained by its developers. Attila was trained by Security Service
with assistance from an IBM engineer.

Several lessons have been learnt from this evaluation. First, the results for Byblos
are comparable with SIGINT experience (though admittedly somewhat worse),
confirming that this experience is applicable to our data.
Second, this is the first time to our knowledge that the SAIL system has been
objectively evaluated.
Third, it is the first time Attila has been trained on intercept. However there is a lot of
uncertainty over the reasons for its worse performance than Byblos's. One factor,
probably, is lack of skill in its use: the IBM engineer who assisted Security Service
was new to the field. Another factor is that experience from SIGINT applications has
not fed into Attila in the way it has into Byblos. This was the interpretation BBN put on
the result when informed of it: their lead developer commented that

    "I doubt that fundamental technology is somehow irretrievably
    behind but it's nice to know that the effort that you and we
    invest in making Byblos run 'somewhat smoothly' on challenging data
    can pay off in this way."

Since this evaluation was completed, the IBM system has been retuned by IBM and
the BBN system retuned by GCHQ (no further work has been done on the SAIL
system). The current best performance is word error rate: ISBN SQUID,
SAIL 101% and word accuracy: BBN 45%, IBM 42%, SAIL 20%.

Bulk transcription

The need for additional bulk transcription can be seen from the data presented in the
Figure at the end of this report. It shows data points derived from NSA experiments
on a variety of languages, as well as data points drawn from NIST evaluations
sponsored by DARPA. Each point shows the measured word error rate (or character
error rate for Korean and Mandarin) for a given number of hours of transcribed
training data. All points are got using the Byblos system, and all except those labelled
"English" correspond to experiments conducted on transcribed SIGINT data.

There are three lines drawn on the figure. The bottom one, labelled "owe ea English",
shows the performance of models built on public data, assessed on such data. There
is a clear trend of improved performance associated with the use of more training
data, but note that the improvement is only logarithmic.
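
To see what a logarithmic improvement implies in practice: if word error rate falls linearly in the logarithm of the hours of training data, every doubling of the data buys the same fixed reduction, so each further gain costs twice as much transcription as the last. The sketch below assumes such a log-linear trend; the intercept and the drop per doubling are hypothetical values chosen for illustration, not read off the Figure.

```python
import math


def wer_log_model(hours, intercept, drop_per_doubling):
    """WER (%) under an assumed log-linear trend in training hours."""
    return intercept - drop_per_doubling * math.log2(hours)


def hours_needed(target_wer, intercept, drop_per_doubling):
    """Invert the model: hours of transcribed data needed for target_wer."""
    return 2 ** ((intercept - target_wer) / drop_per_doubling)


# Hypothetical trend: 80% WER at 1 hour, 4 points gained per doubling.
INTERCEPT, DROP = 80.0, 4.0

print(wer_log_model(64, INTERCEPT, DROP))   # WER with 64 hours of data
print(hours_needed(56, INTERCEPT, DROP))    # hours needed for 56% WER
print(hours_needed(48, INTERCEPT, DROP))    # hours needed for 48% WER
```

Under these illustrative values, gaining a further 8 percentage points requires four times as much transcribed data.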

The top one, labelled "Unclass. system on IA English", shows the performance of
these same models on an Information Assurance application, where the speech to be
transcribed is US English. The trend is the same, but there is a significant
performance gap, of the order of 20 percentage points.

The middle line, labelled "IA English", shows the improvement that can be got by
training a bespoke model for the task. There is still a substantial residual gap
between the bottom line and the IA English line. The
reason for this gap is not known, but it is clear that there has been a substantial
improvement of performance, of the order of 13 percentage points, by using
bespoke training.

The remaining points for other languages have much more variation, but overall are
compatible with the existence of a similar trend of better performance associated with
using more data. We have no information for these other languages on how much
worse the performance would have been if public data had instead been used in the
system training; these points are all drawn from models built using intercept.

The point for NIBAD English is high in comparison with the broad trend for all the
non-IA English languages: one would have expected a word error rate of closer to
50% rather than the 62.5% measured. This may be due to the nature of the data, as
it has been recorded with both sides of the conversation merged, which is known to
have an adverse effect on the performance of speech processing algorithms.

We cannot explain the substantial gap between the performance on IA English and
that on all other languages; it may be attributable to an inbuilt bias in current speech
recognition systems towards features of US English caused by decades of intensive
research driven by US funding using US speech data.

GCHQ operational experience

GCHQ has been making operational use of Byblos since around 2004. The
transcripts it produces unaided have not been of sufficient accuracy to have any
value, but the technique of language-model biasing has enabled GCHQ to tailor
Byblos to specific keywords or strings of interest. (The possibility of sharing
techniques of this sort is a further reason to aim for compatibility between agencies.)
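
This report does not describe how language-model biasing is implemented in Byblos. As a rough illustration of the general idea only, the sketch below boosts the probability of chosen keywords in a toy unigram language model and renormalises, so that hypotheses containing those words score higher when competing transcriptions are ranked. The vocabulary, probabilities, boost factor and function names are all hypothetical.

```python
import math


def bias_unigram_lm(probs, keywords, boost):
    """Multiply keyword probabilities by `boost`, then renormalise to sum to 1."""
    scaled = {w: p * (boost if w in keywords else 1.0) for w, p in probs.items()}
    total = sum(scaled.values())
    return {w: p / total for w, p in scaled.items()}


def log_score(words, probs):
    """Log-probability of a word sequence under a unigram model."""
    return sum(math.log(probs[w]) for w in words)


# Toy vocabulary with made-up probabilities (hypothetical, not from the report).
lm = {"the": 0.5, "for": 0.3, "four": 0.1, "nine": 0.1}
biased = bias_unigram_lm(lm, keywords={"four", "nine"}, boost=5.0)

# Two acoustically similar hypotheses for the same stretch of speech.
plain_reading = ["for", "nine"]
digit_reading = ["four", "nine"]

# The unbiased model ranks the plain reading higher; the biased one prefers digits.
print(log_score(plain_reading, lm) > log_score(digit_reading, lm))
print(log_score(digit_reading, biased) > log_score(plain_reading, biased))
```

A real recogniser combines language-model scores with acoustic scores over far richer models; the point here is only that shifting probability mass towards the strings of interest can make them detectable even when the overall transcript is poor.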

The first application was to strings of digits spoken by Caribbean drugs runners.
GCHQ was able to detect spoken telephone numbers with high reliability using an
out-of-the-box recogniser with a high error rate under the
standard metric. Since then several instances of number detection have been
deployed. In one recent case the digits are recognised with sufficient accuracy for it
to be worth reporting their values rather than just reporting their
detection.

GCHQ has one deployed example of keyword detection other than spoken digits, but
has had difficulty in getting suitable search strings proposed. GCHQ
expects to be able to extend the range of deployments over the next couple of years,
owing both to the wider range of languages available and to improved accuracy as
Sigint corpora get transcribed. The operational benefit in the short term is likely to
remain small compared with other technologies such as diarisation, gender and
speaker ID.

Conclusion

The current state of technology is that systems are capable of automatic transcription
with word error rates of between sets and il??x?e, given amounts of training data of the
order of hundreds of hours. The cost of transcribing this amount of training data is
substantial, of the order of EDEN for 300-400 hours of material.
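
The scale of this cost can be checked against the per-hour transcription figure quoted earlier in the report. The sketch below is simple arithmetic, assuming a rate of the order of £1500 per hour of speech and a corpus of 300 to 400 hours; the constant and function names are illustrative.

```python
COST_PER_HOUR_GBP = 1500  # assumed rate, of the order quoted earlier in the report


def transcription_cost(hours, cost_per_hour=COST_PER_HOUR_GBP):
    """Total transcription cost in pounds for a corpus of the given size."""
    return hours * cost_per_hour


for hours in (300, 400):
    print(hours, transcription_cost(hours))
```

At that rate a corpus of this size costs roughly £450,000 to £600,000 to transcribe.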

The accuracy required of a system in order for it to provide business benefit will
depend on the business application, and we do not yet have a good understanding of
this. GCHQ have successfully deployed several STT applications to locate the
existence of spoken numbers such as telephone numbers in speech. They have also
deployed a STT application which locates the existence of specific keywords.

In each of these applications, success has been achieved using an extremely poor
core STT model (the default unclassified one supplied by BBN), with the performance
enhanced by tailoring the language model. As the performance of STT systems improves,
either by providing more training data or by technical advances in the
algorithms used, so the range of applications for which they can provide business
benefit will expand.

In the longer term it is difficult to predict how the technology will evolve. Our judgement
is that the recent improvement in performance driven by large-scale US investment is
likely to plateau as the performance of STT on transcription of public
speech attains levels approaching EDDIE: accuracy. US investment is now moving
towards follow-on applications such as machine translation of the recognised speech.

There remains a significant gap between the performance measured on public data
and the performance measured on intercept data, which may limit the potential for
transcription of intercept data to accuracies of the order of sets using current
technology. However, to achieve such levels of accuracy will need substantial
investment in bespoke training, and we should not wait for them to be achieved
before seeking applications.

It is premature to choose between the IBM and BBN systems in terms of
performance on classified material, as we only have one experiment to guide us.
However the fact of the experience of BBN in developing systems for use on
SIGINT material makes it the preferred system for operational pilots in the
near term.

State of the art speech recognisers are not shrink-wrapped products and require
substantial training in order to understand how to use them and exploit them. There is
no standard for STT models, and so models built for one recogniser are not portable
to another. STT models are not cheap to build, requiring of the order of a year of CPU
time (depending on the amount of data). These factors mean that there is
considerable benefit to be had in agencies agreeing to use a common system,
which would allow pooling of expertise and sharing of built models.

Chair, SIRDCC Speech Working Group

References

Minutes of SIRDCC Working Group Meeting on Speech
Comparative evaluation of three commercial [...]
(revised 2009-12-0?)
Figure: Error rates from training Byblos on different amounts of
data
December 2009
Appendix: Illustrative text and 100 words
BBN Byblos transcription (correct words are marked in red)
As delivered 2007
Truth: great o.k. that that's that's perfect o.k. well
listen [talking] to derry give me i'll expect you there i will
expect a call maybe some time thursday morning
critical credit beck purple it was
miles to go before you on the communal experts will but
the coma mission and mourn
Bespoke trained 2009
Truth: great o.k. that that's that's perfect o.k. well
listen
right o.k. but that is that's that's perfect o.k.
what
Truth: [talking] to derry and [talking] give me i will
expect you there i i will expect a call maybe some
time thursday morning
on the faricnes should give me A *tft
all to go to the hospital call maybe some
cunt was a morning
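To make the comparison between a truth line and the recogniser output beneath it concrete, the following is a minimal sketch of a word error rate calculation, assuming that the error rate used in this assessment is the usual word-level measure (substitutions, deletions and insertions divided by the number of truth words). The function name word_error_rate is hypothetical, and the example pair is the first bespoke-trained fragment above with the extraction damage repaired. The result applies to that short fragment only and is not a figure reported by the Working Group.

```python
def word_error_rate(truth, output):
    """Word-level edit distance between truth and output, divided by
    the number of words in the truth transcript."""
    ref = truth.split()
    hyp = output.split()
    if not ref:
        raise ValueError("truth transcript must contain at least one word")
    # prev[j] holds the edit distance between the truth words processed
    # so far and the first j output words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


truth = "great o.k. that that's that's perfect o.k. well listen"
output = "right o.k. but that is that's that's perfect o.k. what"
wer = word_error_rate(truth, output)  # 5 edits over 9 truth words
```

Because insertions count as errors, this measure can exceed 100% when the output is much longer than the truth.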
The words (other than 100%) with their frequency counts
94% 23% 59% 56%
CRAIC 1? SOMEBODY 18 LAST 25 NO 251 FELT 3
FUCKING 204 WEEK 22 2E 390 FIFTY 15
SCALLY El FRIDAY 25 BELFAST 11 45 FIND 12
3D TWELVE 13 SIX 33 TOLD 35 HOPEFULLY 3
DIFFERENT SEVEN 42 GIVE ?5 NUMBER 5? '3
MUMMY 14 AGAIN 2Q RIGHT 2B4 13E- JOKING 3
NINETY TALKING 18 PHONE 4? LEAST 3
YEAH ALREADY 4 REALLY 25 SAYS 135 3
WEEKEND 12 CHECKED 4 CHANCE HALF 2E MOVING 3
BACK 103 DEAD El DRIVING HUNDRED BE- MUCH 33
CLEAR 5 DUBLIN 4 ELEVEN 2B 3 NIGHTMARE 3
COUPLE 15 EACH 4 MOBILE BLAME 3 3
DRINK 5 EXACTLY 21 BRILLIANT 12 3
3 NEXT 24 CHRISTMAS E- 3
HELLO 100 4 BIG 1? CLEAN E- 3
COMING 19 El 40 DATE 3 QUID E-
INUT 19 PARK 4 MONDAY 10 DERRY 3 SEAN 3
1?3 PICTURES 4 SOMEWHERE 1D DRINKING 3 3
DOUBLE ?3 THIRTEEN 4 ANYWAY 23 DRUNK 3 SIXTY '3
REMEMBER '3 BRAND 15 3E- DURING Er 3