GCHQ Analytic Cloud Challenges
Sep. 25, 2015
GCHQ Analytic Cloud Challenges
Innovation Lead for Data, Analytics &
Visualisation Engineering
This information is exempt from Disclosure under the Freedom of Information Act 2000 and may be subject
to exemption under other UK information legislation. Refer disclosure requests to GCHQ on
14/5/2012
UK TOP SECRET STRAP 1
1
Approximately 50 Billion Events Per Day from 250+ x10G bearers
Volume by type
• Target Detection Identifier: Presence 35%
• Web Visits 31%
• Host Referrer 20%
• Other 9%
• Comms 5%
Volume by site
• Bude 67%
• Chelt 33%
• Scale Target for May 2012 is approximately 320 x 10G with a capacity for 100 BEPD
• Typical retention is six months, sometimes reduced to cope with volume increase
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
2
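The scale figures on this slide can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes exactly 250 bearers (the slide says "250+", so the per-bearer figure is an upper bound), an even spread of events across bearers, and a six-month retention of 182 days. All variable names are hypothetical.

```python
# Rough cross-check of the slide's scale figures (all names are illustrative).
CURRENT_BEPD = 50e9           # "approximately 50 Billion Events Per Day"
CURRENT_BEARERS = 250         # assumption: slide says "250+", so this is a lower bound
TARGET_BEARERS = 320          # "approximately 320 x 10G"
TARGET_CAPACITY_BEPD = 100e9  # "a capacity for 100 BEPD"
RETENTION_DAYS = 182          # assumption: "six months" taken as 182 days

# Average events per bearer per day, assuming an even spread across bearers.
per_bearer = CURRENT_BEPD / CURRENT_BEARERS

# Projected daily volume if the target bearers carry the same average load.
projected = per_bearer * TARGET_BEARERS

# Ratio of stated capacity to projected volume.
headroom = TARGET_CAPACITY_BEPD / projected

# Events held at any one time under full six-month retention at today's rate.
retained = CURRENT_BEPD * RETENTION_DAYS

print(f"per bearer: {per_bearer:.3g} events/day")
print(f"projected at {TARGET_BEARERS} bearers: {projected:.3g} events/day")
print(f"capacity headroom: {headroom:.2f}x")
print(f"events retained: {retained:.3g}")
```

Under these assumptions the stated 100 BEPD capacity leaves headroom above the volume implied by 320 bearers at the current average rate.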
UK TOP SECRET
[Figure: TDB Events PG OV-1 - Events High Level Concept Graphic]
Purpose
A high level view of GCHQ event (unselected meta-data) capabilities and uses.
Legible diagram labels: HADOOP Clusters; Extract and filtering; Analytic development, ad-hoc data mining; Query; Interface; VAIL (web UI). Remaining labels are illegible.
Query Focused Dataset
What are QFDs
• A database designed for answering a single (or a few) analytic questions
• Initially adopted from research, engineering tasked to mature and scale ASAP
• The engineering has been incremental based on priority
• Library of components and design patterns developed
• Scale using appropriately sized database instances, federation using service tiers
• Deployment of new instances and updates streamlined
• Corporate support
• New QFDs still being produced by engineering and others using the standard components
• Some now provide interactive access to datasets generated by HADOOP batch analytics
QFD Facts and Figures
• Approximately 100 QFD instances deployed for 16 different QFDs + more for UI etc
• Each with events (or results) appropriate for the questions they answer
• Hardware shared depending on QFD, all storage is shared
• Driven by the need for a flexible platform at large scale with low overall cost
• Red Hat Enterprise Linux, Oracle DB (DB needs are simple)
• Latest generation HP BL465c G7 blades, 24 core, 64GB RAM, EVA storage ~200TB usable
• Most QFD instances have 70TB usable capacity
• Total storage 15PB raw, 11PB usable
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
4
A selection of the major QFDs
Columns: Name; Description; Questions Answered; Physical Bude+Chelt

AUTOASSOC
Description: Bulk unselected TDI-TDI correlations with confidence scores.
Questions Answered:
• What other TDIs belong to your target?
• What technologies your target is using?
Physical Bude+Chelt: 2+1 instances, each 50-70TB storage

Evolved Mutant Broth
Description: Identify when certain TDIs appear in traffic which indicate target usage and their location. Telephony and C2C data provide a converged view.
Questions Answered:
• Where has my target been?
• What kind of communications devices has my target been using?
Physical Bude+Chelt: 10+5 instances, each 70TB storage

Hard Assoc
Description: Provide strongly correlated selectors for both C2C and Telephony traffic taken from TDIs appearing in the same packet.
Questions Answered:
• Are there any alternative C2C or Telephony selectors for my target?
Physical Bude+Chelt: 3+2 instances, each 70TB storage

HRMap
Description: Host-referrer relationships - information about how people get to websites, including links followed and direct accesses.
Questions Answered:
• How do people get to my website of interest and where do they go to next?
• What websites have been visited from a given IP?
Physical Bude+Chelt: 5+3 instances, each 70TB storage

KARMA POLICE
Description: Which TDIs have been seen at approximately the same time, and from the same computer, as visits to websites.
Questions Answered:
• Which websites your target visits, and when/where those visits occurred.
• Who visits suspicious websites, and when/where those visits occurred.
• Which other websites are visited by people who visit a suspicious website.
• Which IP address and web browser were being used by your target when they visited a website.
Physical Bude+Chelt: 11+7 instances, each 70TB storage, 3+1 correlator instances

SOCIAL ANTHROPOID
Description: Converged comms events allowing you to see who your targets have communicated with via phone, over the internet, or using converged channels (e.g. sending emails from a phone or making voice calls over the internet).
Questions Answered:
• What communications your target is engaged in.
• Who has your target been communicating with.
• What communications have occurred using a particular locator (IP address, cell tower, etc).
Physical Bude+Chelt: 6+3 instances, each 70TB storage
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
5
HADOOP Clusters
Facts & Figures
• Three batch analytic clusters (HADOOP clouds)
• Each contains unique, unselected events, reference and results data
• Red Hat Enterprise Linux, Apache HADOOP (Map Reduce)
• Each with ~900 data nodes, plus UI/ingest/egress servers
• Each node 8 core, 64GB RAM, 6x1TB
• Each 6PB raw storage, 1.5PB for events (3x replication and results)
• The three clouds could, in theory, store 24 Trillion events @140 BEPD and 6 months retention period
Usage
• 200 capability developer accounts
• Many more users able to access applications
• Guiding Light > 150 users
• First Contact, Cloudy Cobra > 50
• Jobs during working hours
• Operationally ad-hoc (experimental, search, one time use)
• Development of new capability
• Scheduled jobs run overnight
• Sustained, operational applications
• Increasing expectation for fully supported, automated jobs such as Rumour Mill
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
6
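The storage figures on this slide can be sanity-checked with a short calculation. This is a sketch under stated assumptions, not a reconstruction of how the figures were derived: it reads "1.5PB for events" as the space left per cluster for unique events after replication and results, uses decimal units (1 PB = 10^15 bytes), and treats ~900 as exactly 900 nodes. All names are illustrative.

```python
# Sanity check of the cluster storage figures (illustrative names, decimal units).
NODES_PER_CLUSTER = 900             # "~900 data nodes"
DISKS_PER_NODE = 6                  # "6x1TB"
TB_PER_DISK = 1
CLUSTERS = 3
EVENT_STORAGE_PB_PER_CLUSTER = 1.5  # assumption: net space for unique events
EVENTS_PER_DAY = 140e9              # "@140 BEPD"
STATED_TOTAL_EVENTS = 24e12         # "24 Trillion events"

# Raw disk per cluster in PB; the slide rounds this to 6PB.
raw_pb_per_cluster = NODES_PER_CLUSTER * DISKS_PER_NODE * TB_PER_DISK / 1000

# Net event storage across all three clusters, in bytes.
total_event_bytes = CLUSTERS * EVENT_STORAGE_PB_PER_CLUSTER * 1e15

# Average stored size per event implied by the stated totals.
bytes_per_event = total_event_bytes / STATED_TOTAL_EVENTS

# Retention implied by 24 trillion events arriving at 140 BEPD.
retention_days = STATED_TOTAL_EVENTS / EVENTS_PER_DAY

print(f"raw storage per cluster: {raw_pb_per_cluster:.1f} PB")
print(f"implied bytes per event: {bytes_per_event:.1f}")
print(f"implied retention: {retention_days:.0f} days")
```

On these assumptions the stated totals imply an average of under 200 bytes per stored event and a retention of a little under six months, which is consistent with the slide's "in theory" wording.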
Physical Overview
[Figure: physical layout of a Hadoop cluster and its edge nodes]
Diagram labels: External Data Sources; Ingest Node; Master Node; Data Node; Hadoop Cluster; Application Node (QFD); Innovation Node (QFD); Analyst; Capability Developer
Private Cloud
• Data Nodes
• HADOOP Master Nodes (Job Tracker & Name Node)
Edge Nodes connect to Computer Hall Network
• Ingest Nodes
• Application Nodes (provide user interface – web)
• Innovation Nodes (cap dev login to servers)
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
7
Data Ingest
Ingest Design Goals
• Store all data received, i.e. don’t risk losing important fields by normalising to a single file format.
• Partition by directory structure
• Partition by data type to simplify common queries.
• Partition by ingest date to allow incremental analytics.
• Partition by security compartment.
• Horizontal scaling by adding new hardware.
Data Formats
• TLV – Tag Length Value
• Comma Delimited
• Fixed Position
• Single Line & Multi-Line
• Multiple Variations
• Actor Action
• Many sources and many types
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
8
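Two ideas from the ingest slide lend themselves to a short generic sketch: partitioning stored files by directory structure, and reading a Tag Length Value record without normalising it. Everything specific here is an assumption made for illustration, not a detail from the slides: the directory order (compartment, then data type, then ingest date), the path names, and the TLV layout (1-byte tag, 2-byte big-endian length).

```python
import struct
from datetime import date
from pathlib import PurePosixPath
from typing import Iterator, Tuple


def partition_path(root: str, compartment: str, data_type: str,
                   ingest_date: date) -> PurePosixPath:
    """Build a directory for one partition.

    The ordering compartment/data_type/date is a hypothetical choice; the
    slide only says data is partitioned by these three attributes.
    """
    return PurePosixPath(root) / compartment / data_type / ingest_date.isoformat()


# Hypothetical TLV layout: 1-byte tag, 2-byte big-endian length, then the value.
_HEADER = struct.Struct(">BH")


def iter_tlv(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (tag, value) pairs from a buffer, leaving values as raw bytes."""
    offset = 0
    while offset < len(buf):
        if offset + _HEADER.size > len(buf):
            raise ValueError("truncated TLV header")
        tag, length = _HEADER.unpack_from(buf, offset)
        offset += _HEADER.size
        end = offset + length
        if end > len(buf):
            raise ValueError("truncated TLV value")
        yield tag, buf[offset:end]
        offset = end


# Usage: a two-field record, the second field empty.
record = _HEADER.pack(1, 3) + b"abc" + _HEADER.pack(2, 0)
fields = list(iter_tlv(record))
path = partition_path("/data", "open", "web_visits", date(2012, 3, 26))
print(fields)
print(path)
```

Keeping values as raw bytes mirrors the stated goal of storing everything received; partitioning by ingest date means an incremental job only has to read the directories added since its last run.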
Analytic use of data
Security
• Data access controlled across 2 dimensions:
– Usability: Operational (standard events) or Controlled (special purpose e.g. test)
– Classification: Multiple buckets Open (TSS2), a few compartments, general CIOs Sensitive
• Capability developer access restricted by file/directory permissions
• User facing applications apply fine grained security on per record basis
Silver Library
• Abstraction layer to separate analytic development from storage dependencies
• Pluggable parsers for new formats
• Record parsers for event data held in HDFS
• Applications still need to know field labels (function calls)
• Applications often add additional abstractions
• Input and Output handlers to read and write data based on type
• Utility functions to aid development
• Simple search and filter routines
• Events Product Centre maintained
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
9
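The abstraction layer described above (pluggable parsers, record readers, simple filter routines) follows a common plugin-registry pattern. The sketch below shows that general pattern only; it is not the Silver Library API, which the slides do not describe. The format names, field labels and function names are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]
Parser = Callable[[str], Record]

# Registry mapping a format name to its parser (the "pluggable" part).
_PARSERS: Dict[str, Parser] = {}


def register_parser(fmt: str) -> Callable[[Parser], Parser]:
    """Decorator that plugs a parser into the registry under a format name."""
    def decorator(fn: Parser) -> Parser:
        _PARSERS[fmt] = fn
        return fn
    return decorator


@register_parser("csv")
def parse_csv(line: str) -> Record:
    # Hypothetical comma-delimited layout with three fields.
    ts, src, dst = line.split(",")
    return {"timestamp": ts, "source": src, "destination": dst}


@register_parser("fixed")
def parse_fixed(line: str) -> Record:
    # Hypothetical fixed-position layout: three 10-character columns.
    return {
        "timestamp": line[0:10].strip(),
        "source": line[10:20].strip(),
        "destination": line[20:30].strip(),
    }


def read_records(fmt: str, lines: Iterable[str]) -> Iterator[Record]:
    """Input handler: callers name the format and never see the raw layout."""
    parser = _PARSERS[fmt]
    for line in lines:
        yield parser(line.rstrip("\n"))


def filter_records(records: Iterable[Record], field: str,
                   value: str) -> Iterator[Record]:
    """Simple filter routine; callers still need to know the field labels."""
    return (r for r in records if r.get(field) == value)


# Usage: the same analytic code runs over either storage format.
csv_rows = list(read_records("csv", ["1,a,b\n", "2,c,b\n"]))
fixed_rows = list(read_records("fixed", ["1".ljust(10) + "a".ljust(10) + "b".ljust(10)]))
matches = list(filter_records(csv_rows, "source", "c"))
print(matches)
```

Adding support for a new format means registering one more parser; code written against `read_records` and the field labels does not change.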
Example Analytics
Every Assoc
• User/machine correlations from C2C presence
Every Police
• User/machine website visits
Every Creature
• User/machine search terms
Every eAD
• User/machine electronic attack patterns
Every Cipher
• User/machine cipher events
Grey Fox / Silver Fox
• Country level summary of where identifier observed
First Contact
• 1 & 2 hop contact chains between seeds and targets
Cloudy Cobra
• Glorified grep driven by GUI – find events that contain user search term
Guiding Light
• MI information types/volumes of traffic on bearer
Golden Axe
• Generated list of suspected clone mobile phones (IMEI grey list)
Tribal Carnem
• Uses RADIUS logs to identify & collect activity for IP session
Public Anemone
• Geolocation based on web-based map searches
Epic Fail
• Identifies careless use of TOR networks
Sterling Moth
• IP summarisation tool using C2C presence events
Foghorn
• Find non-targets using targets' machines
And many more…
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
10
Rich Visualisation
• Desktop applications for rich
visualisation and analysis - Eclipse RCP
• LOOKING GLASS client platform, FIRE
ENGINE question-based federated
access to events and reference data
sources
• Pilots
– NGCC Contact chaining with
C2C/converged event data
– QFD Federator
– NG Geo: Converged geo events
analysis
• Next stage to learn from pilots and
build the tool for everyone - ALPHA
CENTAURI
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
11
Rumour Mill – Push Analytics
• Rumour Mill is a dashboard that will:
– Enable analysts to prioritise new work as it arrives from customers by easily finding out "what does GCHQ already know"
– Enable analysts to monitor existing work to spot when something happens that would change their priorities
• First level results: list of questions against identifiers, with simple yes/no answers
• Clicking on a "yes" drills into the detail at the second level
• Many questions are derived from cloud based analytics run each day against the current identifier list
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
12
Sharing & Collaboration
Other SIA and foreign partners
Data (bulk & query) and technology exchange
Two major components:
• Web user interfaces (VAIL) on GCHQ servers but
accessible from the partner site. Interactive query of
QFDs. Allow exposure of GCHQ tradecraft.
• Brokering services. Sustained access for interactive query
of GCHQ data integrated into partner tools.
26/3/2012 Ref: 18171507
UK TOP SECRET STRAP 1
13
Future Options for Event
Processing
Key Challenges
Affordable, continuing scale of our capabilities
• All dimensions, cost, power, cooling, space, storage, bandwidth, processing etc
• As we share with more partners, the access demands increase
• Enable more delivery by collaborating with others
Enabling our analysts to cope with complexity and volume of data
• Need to streamline understood workflows to save analysts time
• Need to push data to analysts, sometimes with low latency
• Enable next level analysis, behavioural, pattern of life, noticing change, predictions
Complexity and pace of analytic development
• Suitably skilled people are hard to find in-house and externally
• Must be able to cope with mixed maturity capabilities
• Workload separation, processing contention, cross site analytics
Agility, resilience and maturity of platforms
• Keep pace with development, respond to community demands
• Understand and match the evolving business expectations of maturity
• Support appropriately, agreed strategy for resilience
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
15
Approach to challenges
Deploy more of the same while maturing
• Suitably refreshed, will answer immediate scaling challenge
• Reduce support burden and streamline new capability development
• Implement improvements to existing architecture
Explore new technology
• Non-relational, distributed databases; QFD consolidation & convergence
• Research initiatives (A)SEM/HAKIM/STREAMING, and commercial offerings
• Collaboration opportunities CLOUDBASE/ACCUMULO
Know the value of our data
• Don’t generate/ingest data we don’t use, filter or de-dupe upstream
• Be smart about retention, be smart about need for interactive availability
• Distil the raw data to generate rich information sets for analysts
Understand usage of capabilities
• Automate the simple workflows with summaries and push analytics
• Optimise capabilities for their use and access load
• Mature, provide resilience and support as appropriate
Stay flexible and increase collaboration
• Enable collaboration and lower the entry bar for capability developers
• Stay vendor neutral, continue to use open standards, open software
• Enhance the benefit for both research & engineering of working together
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
16
Understand usage of capabilities
Use Case Class Mapping
• Developed jointly between GCHQ and NSA to understand
– the benefits of our current capabilities
– where respective strengths and weaknesses exist
• Provides a clear set of drivers for architectural evolution
– Missing capabilities
– Suboptimal use of capabilities
– Opportunities for collaboration and reuse
[Figure: two-by-two grid of target (known/unknown) against query (known/unknown)]
Quadrants: Known Target, Known Query; Known Target, Unknown Query; Unknown Target, Known Query; Unknown Target, Unknown Query
Use case classes shown: Target Tracking; Target-based Development; Target Based Discovery; Behaviour-based Discovery; Anomaly Detection
Annotations: Large proportion of queries for most of the analysts; Small proportion of queries for some of the analysts
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
17
Existing coverage of use cases
Known Target, Known Query
• Streaming, Rumour Mill and QFD capabilities (possibly assisted by cloud analytic)
• Could select data subset based on target, could pre-calculate results
• Large usage and automation suggests need for optimised capability
Known Target, Unknown Query
• Not possible with existing QFD capabilities
• Needs new analytic or more indexes on target selected data
• Large usage and needs suggest a new capability
Unknown Target, Known Query
• Core QFD capability, possibly populated by cloud analytic
• Full unselected data set required but what level of interactive query?
• Is it acceptable that older data be made available non-interactively?
Unknown Target, Unknown Query
• Use batch analytic platform for low level search or new analytic
• Could a heavily indexed store provide a responsive capability?
• Possibly only over a subset of data - recent data only?
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
18
Value assessment
What measures of value?
• Duplicated over time or across bearers
• Value decreases with age
• Value decreases after processing (SPAM, summaries)
• Use by analysts and analytics
What actions to take?
• Do filtering upstream or don’t generate
• Investigate use of streaming capabilities for filtering or sampling
• Discard immediately after first stage processing
• Age off and discard more selectively
• Periodically evaluate data types to understand usage
Need to agree possibilities with operations and experiment
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
19
Technology candidates
QFD consolidation and HAKIM
Existing QFD drawbacks
• Some duplication between QFDs (~10%)
• Need new QFDs to answer new questions
Need a consolidated database with multiple indexes and flexible additions
• HAKIM is a research prototype to do just that
• Unification of data, associated data kept together
• Quick and flexible addition of new data types and indexes
• Scalable and cost effective
Other candidates exist and some HAKIM components could be replaced
• Oracle DB or distributed, non-relational database?
• Convergence with HADOOP stack HBASE/ACCUMULO
Engineering is working with research to develop to the next stage
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
20
Technology candidates
The “skinny” cloud (Laurel)
What is it?
• A batch analytic HADOOP cluster
• Contains all the data but for a short retention period
What would it be used for?
• An ideal place to do incremental analytics
• Answers the cross-site analytic problem
• Could be dedicated to sustained usage
• Provides some resilience by duplicating recent data
What are the challenges?
• Can we transfer and ingest all the data?
Engineering could potentially build this using on-order kit
26/3/2012 Ref: 18156927
UK TOP SECRET STRAP 1
21
Technology candidates
The GCHQ “core” (Hardy)
What is it?
• A HADOOP cluster with Map/Reduce and interactive query/analytics capabilities
• Probably using CLOUDBASE/ACCUMULO and reusing NSA knowledge
What would it be used for?
• A place for data and summaries promoted from the bulk stores
• Known Target, Known Query and some Known Target, Unknown Query
• Optimised for major use case and suitable for data sharing
• Provides some resilience by duplicating important data
What are the challenges?
• GCHQ has limited expertise with CLOUDBASE/ACCUMULO
• The promotion analytics and criteria are not developed
Engineering could potentially build this using on-order kit
UK TOP SECRET
TDB Events PG SV-1 - Skinny cloud, GCHQ Core and QFD consolidation
[Diagram labels: Bude Site; Cheltenham Site; NSA Site; Input Buffer; All Bude Events; All Chelt & Bude Specific Event Types; Bulk Events; Batch Analytic Cluster 1; Batch Analytic Cluster 2; Daily new Events for current selectors; All Events for new selectors; Query Response; Push daily [illegible]]
Purpose
Shows possible key architecture for GCHQ Events.
New components:
Skinny cloud - all the data from both sites but held for less time. A place to do daily, incremental analytics on bulk events. Potentially, only sustained analytics. Also provides a usable copy of the data for a short time period, giving some resilience.
GCHQ Core - Valuable events are selected (NSA & GCHQ seeds) by analytics and copied into the core. The core is not primarily a place for batch analytics, rather an interactive query repository. In addition to the raw events, summaries are held and shipped to NSA for integration into their tools. All the important events are held for the normal retention period, providing resilience.
Consolidated QFD - The next stage for QFDs. Lower storage, more flexible and more easily scaled are all possible benefits.
SV-1 Resource Interaction Specification
Created: [illegible] Modified: 15:09:22 Owner: