Snowden Archive
——
The SIDtoday
Files
Browse the Archive

Ask Raul: Getting the Most from Metadata

SUMMARY

Raul gives advice to Digital Network Intelligence analysts about squeezing the most value out of Microsoft Office document metadata. He shows examples from Word and Excel documents and explains how you can use pieces of metadata to track down all documents by a specific author, or specific company, without even having an email address or other hard selectors.

DOCUMENT’S DATE

Apr 28, 2005

PUBLICLY AVAILABLE

Feb 05, 2018

1/4
Download
Page 1 from Ask Raul: Getting the Most from Metadata
DYNAMIC PAGE -- HIGHEST POSSIBLE CLASSIFICATION IS TOP SECRET // SI / TK // REL TO USA AUS CAN GBR NZL (U) Ask Raul: Getting the Most from Metadata FROM: Raul A DNI Analyst Run Date: 04/28/2005 Dear Raul, (C) I am swamped with metadata and things that look like metadata. Can you think of something important and easy that I should be able to use from all the metadata we generate which would provide a quick and easy payoff? Betsy Dear Betsy, (C) I feel your pain. To get directly to your question, let me recommend the fields found in OLE (Object Linking and Embedding) property sets for Microsoft Office products. What the heck are you talking about?! Let me explain. (U//FOUO) Ages ago (I'm looking at an article from 1994 on OLE 2 and I know I have a book at home from the 80's on OLE), Microsoft began using OLE to make their products, especially documents, spiffy. To save you a bunch of details, the basic idea behind OLE was that each OLE document would take the form of a file system and if you know what you are doing, you can read it. Here is an example: sample.xls:OK :Workbook : Root = Start / Count = 006c / 1 SBD = Start / Count = fffffffe / 0 Size of BBD = 1 BBD = 006b OLE DIRECTORY ID Name Type Start Size Prev Next Dir 0:Root Entry ROOT 0000 0 -1 -1 2 1:Workbook FILE 0000 46112 -1 -1 -1 2:eSummaryInformation FILE 005b 4096 1 3 -1 3:eDocumentSummaryInformation FILE 0063 4096 -1 -1 -1 This is what a plain old Excel spreadsheet looks like, OLE-wise. This structure is such that it makes it fairly straightforward to repair some damaged documents. Don't get me going here! Just thinking about it has me fit to be tied. (S) The Root Entry tells us where everything is. The spreadsheet is actually the Workbook section but of particular interest here are the eSummaryInformation and eDocumentSummaryInformation property sets. Inside these property sets is information you can use to track your target such as the company name, author and last author. Here is an example of a dump of a property set from a document we collected: OS: Win32 (Windows) Version 5.1 Summary Information Property Set SERIES: (U) Ask Raul - Answers to DNI Questions 1. Ask Raul : Fonts and Encoding 2. Ask Raul : Dictionary Equations 3. Ask Raul : HTML Coding and Email 4. Ask Raul : PDF Files 5. Ask Raul: Damaged Data 6. Ask Raul : Getting the Most from Metadata
Page 2 from Ask Raul: Getting the Most from Metadata
Codepage is 1256 Arabic Title: No Subject: Empty. No value. Author: Masoud Keywords: Empty. No value. Comments: Empty. No value. Template: Normal.dot LastAuthor: Masoud RevNumber: 8 AppName: Microsoft Word 9.0 EditTime: 20 Hrs 28 Min Create_DTM: April 11, 2005 1246 hrs LastSave_DTM: April 11, 2005 1453 hrs PageCount: 1 WordCount: 3450 CharCount: 19670 Security: 0 Summary 1 OS: Win32 (Windows) Version 5.1 Document Summary Information Property Set Codepage is 1256 Arabic Company: IRISL LineCount: 163 ParCount: 39 Invalid or Unknown Document Summary Property 17 value is 24156 Invalid or Unknown Document Summary Property 23 value is 592544 Scale: False LinksDirty: False Invalid or Unknown Document Summary Property 19 value is False Invalid or Unknown Document Summary Property 22 value is False DocParts: ... Error. No further processing. Doc_Summary 1 Notice the Author's name: Masoud, Last Author: Masoud and Company: IRISL. If you are still awake, you can probably guess that this info could also be used to quickly link identical documents from different sources together or as a means to validate the integrity of documents as well as a whole host of other things. (C) The two most commonly collected file types you will see these property sets in are Word and Excel. They aren't the only file types, but certainly the most common. (S) Now the nice thing about this is that often, very often, our targets fill in these values, or the machine does it for them, and they never get changed. Are things getting clearer? If Joe Badguy isn't paying attention, and he usually isn't, and you get an OLE document from him, you can then target what he has in his property sets or you can go look in Pinwale for additional documents from him without having to know his email address, domain or anything else. You can also become a DNI wizard and guess at what might be in the property sets and have some fairly good luck. In the example above, IRISL is the publicly known abbreviation for the foreign company in question. We could have guessed at this without having to have the first collection to discover it. You can hunt, not fish. (S) Now here's the nitty-gritty. The property sets use a specific
Page 3 from Ask Raul: Getting the Most from Metadata
format which is at least 15 years old now. Using the example above, the Company name is actually stored as follows in the file: 0x1e 0x00 0x00 0x00 0x06 0x00 0x00 0x00 I R I S L 0x00 Here is what it means: 0x00 0x00 0x00 0x1e String (8-bit), null terminated 0x06 0x00 0x00 0x00 String length + null String IRISL0x00 (The first two values are reversed because they are stored in the file in Little Endian format.) If we wanted, we could slap the above hex string into a dictionary and suddenly we'd be hauling in Word and Excel documents from IRISL without having to know who was sending it or to whom. Additionally, we could take this term and convert it to base64 and task that also. Like this: AAAeAAAABgAAAElSSVNMAC8v LwAAHgAAAAYAAABJUklTTAAvLy8v Ly8AAB4AAAAGAAAASVJJU0wALy8A Although this isn't perfect (again, don't get me going), it will work just fine and keeps you from having to worry about what equipment is or isn't at what site. It's that simple. The Unicode version of the string type is not much different. Plus, lucky us, what we do for the dictionary is easily adapted for Pinwale and vice versa. (S) Now, imagine this: You're sitting here digging through the mess you've made with your traffic and lucky you, you find a Word document with a value set for the company or author in the property set. You then take that string and search through Pinwale and discover that there were more documents lying there waiting for you that you otherwise never would have found. You didn't have the email addresses or other hard selectors but you had that Word document and the property sets. Additionally, you now harvest several new email addresses and other bits of EEI from the sessions you found that help you to find even more goodies. Pretty spiffy! (S) Needless to say, if we harvested this data and used it properly, it would also be a big help for DNI chaining. Let's say Joe Badguy, whom we know about, sends a message with an OLE document in it to Bill Badguy, whom we don't know about and unlucky us, we don't collect this session. Three months later, Bill forwards that file to Tom Badguy, again whom we don't know about but because he is on a link we're watching, and because we had the value in the property set tasked, we haul in the session and suddenly we know about Bill and Tom and that they are associated with Joe. Not bad! Some of us have actually been doing this. It works. (C) There you go. In case you are wondering, corporately we've had the ability to extract the data from property sets for about five years. Also, other data types have similar bits of information that can be exploited. Is that metadata being extracted? It pays to know how your data works. (U//FOUO) Enjoy! DNI is a lot easier than you think. Or at least it can be. Raul
Page 4 from Ask Raul: Getting the Most from Metadata
"(U//FOUO) SIDtoday articles may not be republished or reposted outside NSANet without the consent of S0121 (DL sid_comms)." DYNAMIC PAGE -- HIGHEST POSSIBLE CLASSIFICATION IS TOP SECRET // SI / TK // REL TO USA AUS CAN GBR NZL DERIVED FROM: NSA/CSSM 1-52, DATED 08 JAN 2007 DECLASSIFY ON: 20320108