FileLocator Pro Tips and Tricks for Indexing Large Breach Data Sets
Tomorrow I’ll be giving a talk on breach data including:
- Places where it’s located
- How to make large data sets searchable in a reasonable amount of time
- How some organizations are using breach data to improve their security posture
Whenever I give conference talks I try to remove or reduce any barriers to entry. When I have given talks on memory forensics, I have always used the Windows standalone version of Volatility instead of Linux for my demos so attendees who were not really comfortable with Linux wouldn’t feel like they couldn’t try the techniques.
With that idea in mind, I wanted to find a way to make large breach datasets searchable without the need to maintain huge databases, normalize hundreds (or more) of disparate datasets etc. Similar to a recent blog post I wrote where I used a forensics tool called bulk extractor to help quickly acquire selectors (emails, phone numbers etc) from a large dataset, I decided to use a common forensics technique of indexing for this problem.
Indexing has been used in forensics for years. You basically trade effort and extra storage space now for much quicker search results in the future. Imagine getting a hard drive in to examine on a Friday. You could let the drive process over the weekend and Monday morning quickly view the results and perform your searches. Ironically indexing isn’t nearly as common as it used to be in forensics but the technique works very well for breach data. To understand the tradeoffs and advantages, here’s a real world example.
I had a dataset of breach data that was 126 GB in size. Searching that data for an email address took about 50 minutes. I started up a job to index the data, which took two full days to run and an extra 76 GB in storage space. I felt the results were well worth it since now my searches took 2 minutes instead of 50.
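To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It only uses the figures quoted above and treats the two days of indexing as if it were time I had to spend myself, which overstates the cost since the job runs unattended.

```python
# Figures quoted above for the 126 GB dataset.
DATASET_GB = 126
INDEX_GB = 76
UNINDEXED_SEARCH_MIN = 50
INDEXED_SEARCH_MIN = 2
INDEXING_TIME_MIN = 2 * 24 * 60  # two full days, in minutes

# Minutes saved on every search once the index exists.
saved_per_search = UNINDEXED_SEARCH_MIN - INDEXED_SEARCH_MIN

# Number of searches after which the indexing time has paid for itself.
break_even_searches = INDEXING_TIME_MIN / saved_per_search

# How much faster each search is, and how much extra disk the index needs.
speedup = UNINDEXED_SEARCH_MIN / INDEXED_SEARCH_MIN
storage_overhead = INDEX_GB / DATASET_GB

print(f"Break-even after {break_even_searches:.0f} searches")
print(f"Each search is {speedup:.0f}x faster")
print(f"Index adds {storage_overhead:.0%} extra storage")
```

In other words, the index pays for itself in wall-clock time after about 60 searches, and every search after that is a win.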
Years ago in a forensics class I learned of a free tool called “Agent Ransack” (https://www.mythicsoft.com/) which made searching drives for information easier. As I started searching for options to index this data, I realized that the same company that made Agent Ransack made a professional version called “FileLocator Pro” which has indexing capabilities. I try to stick to free resources whenever I can, and FileLocator Pro has a $60 cost, but it seemed to be the easiest and most affordable method of accomplishing what I was going for without the need to massage a lot of data.
While indexing large amounts of data I figured out that FileLocator Pro has a few... idiosyncrasies that I wanted to bring to the attention of anyone thinking of using it.
Before you index large datasets, I would highly recommend splitting up large files into chunks no bigger than 1 GB in size. This will not only help them index faster, but you’ll also get far fewer errors while indexing. There are a lot of ways to do this, but I used a program called G-Split (https://www.gdgsoft.com/gsplit) which made it really quick and easy. You can pick how big you want the chunks to be, what it should name the files, etc.
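If you would rather script the splitting, something like the following Python sketch would do the job. This is not what G-Split does internally; it is just one way to get chunks under a size limit. The function name, the chunk naming scheme, and the choice to break only at line endings (so no record is cut in half) are my own assumptions.

```python
from pathlib import Path


def split_file(src, out_dir, max_bytes=1_000_000_000):
    """Split src into numbered chunks of at most max_bytes each.

    Chunks only break at line endings so records stay intact. A single
    line longer than max_bytes still gets written whole, into its own chunk.
    Returns the list of chunk paths in order.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunks = []
    out = None
    written = 0
    with src.open("rb") as infile:
        for line in infile:
            # Start a new chunk for the first line, or when this line
            # would push a non-empty chunk past the size limit.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                name = f"{src.stem}_{len(chunks) + 1:04d}{src.suffix}"
                chunk_path = out_dir / name
                out = chunk_path.open("wb")
                chunks.append(chunk_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return chunks
```

Calling split_file("bigdump.txt", "data/bigdump") (hypothetical names) would leave files like bigdump_0001.txt, bigdump_0002.txt and so on in the output directory.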
For organizational purposes, I made two directories, data and indexes. In each of those directories, I make a subdirectory for each dataset. For instance, in data I may have a subdirectory called “collection_1” with all of the data from the collection 1 dump in chunks no bigger than 1 GB. In the indexes directory I will have a directory called “collection_1_index” where I have FileLocator Pro store the index it’s making.
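That layout is easy to create by hand, but if you have a lot of datasets a few lines of Python can set it up for you. The helper name and the root folder are hypothetical; only the data/indexes split and the “_index” suffix come from the layout described above.

```python
from pathlib import Path


def dataset_dirs(root, dataset):
    """Create and return (data_dir, index_dir) for one dataset.

    Produces <root>/data/<dataset> and <root>/indexes/<dataset>_index.
    """
    root = Path(root)
    data_dir = root / "data" / dataset
    index_dir = root / "indexes" / f"{dataset}_index"
    data_dir.mkdir(parents=True, exist_ok=True)
    index_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, index_dir
```

For example, dataset_dirs("E:/breach", "collection_1") would give you a place to drop the 1 GB chunks and a matching empty folder to point the indexer at.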
Thankfully storage space is really cheap, so I bought an 8 TB hard drive to store the data on.
While making the index, I highly suggest making an index for each set of data rather than one massive index. It just makes things quicker overall and much more usable. With some particularly large datasets (100 GB plus), I would even have to split them into a couple of indexes. When you use the command line interface we’ll talk about in the next section, it will make the search process painless even with a large number of indexes.
After making the index:
Once the indexes are made, when you go to search them you’ll notice that the graphical interface feels super sluggish. For some reason, FileLocator Pro starts doing searches while you’re typing in the search term (like Google does). That might be fine for searches of small datasets, but for big ones it can bog you down. I started typing my terms into a text file and just loading that in to search.
What I finally decided to do was use the command line interface to run my searches. I wrote a very basic Python wrapper to take all of the terms from a text file and search all of the indexes listed in another text file. All of the results are placed in an HTML file with a built-in header graphic, and all of the terms are bolded to make them easier to find in the results. I posted the code here: https://github.com/azmatt/DuckHunt
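To give a feel for the reporting half of that wrapper, here is a minimal sketch of how search hits could be turned into an HTML page with the terms bolded. This is not the DuckHunt code and it does not call the FileLocator Pro command line at all; the function names and the shape of the results dictionary are assumptions made for illustration.

```python
import html
import re


def bold_term(line, term):
    """Return line as HTML with every case-insensitive hit on term in <b> tags."""
    if not term:
        return html.escape(line)
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    pieces = []
    pos = 0
    for match in pattern.finditer(line):
        # Escape the text between hits, then wrap the hit itself.
        pieces.append(html.escape(line[pos:match.start()]))
        pieces.append(f"<b>{html.escape(match.group(0))}</b>")
        pos = match.end()
    pieces.append(html.escape(line[pos:]))
    return "".join(pieces)


def build_report(results):
    """Build an HTML page from a dict mapping each term to its matching lines."""
    parts = ["<html><body>"]
    for term, lines in results.items():
        parts.append(f"<h2>{html.escape(term)}</h2>")
        parts.append("<ul>")
        parts.extend(f"<li>{bold_term(line, term)}</li>" for line in lines)
        parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

Escaping each piece separately keeps any stray angle brackets in the breach data from being interpreted as markup when you open the report in a browser.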
This makes it really easy to run big batches of searches unattended and quickly scan through the results. I have two different index lists: one that has all of the indexes in it, and one that only has indexes for datasets that have phone numbers. That way if I’m searching phone numbers, I don’t waste time searching data that won’t contain any.
While searching your indexes:
It’s always a good idea to search for the username of an email address in addition to the full address. For instance, if you’re searching for example034@example.org, you should also search for example034 by itself to find results at Hotmail, etc. Unfortunately, a FileLocator Pro idiosyncrasy forces the issue. When I searched for full email addresses, the results took FAR too long to come back, if they came back at all.
I reached out to their tech support and they told me that it was splitting the email at the ‘@’ into two different searches. Because of this you never want to search for full email addresses, just the unique username (or a unique domain) part. Fortunately this isn’t usually a deal breaker since searching for the username is a best practice anyway.
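If your term list is full of email addresses, a small helper can turn each one into the pieces worth searching. This is only a sketch: the function name is made up, and the list of “common” domains that get skipped is my own assumption, so adjust it to taste.

```python
# Domains too common to be useful as a search term on their own (assumed list).
COMMON_DOMAINS = ("gmail.com", "hotmail.com", "yahoo.com", "outlook.com")


def email_search_terms(email, common_domains=COMMON_DOMAINS):
    """Split an email address into search terms that avoid the '@' problem.

    Always returns the username; the domain is added only when it is not
    one of the common webmail domains.
    """
    username, _, domain = email.strip().partition("@")
    terms = [username] if username else []
    if domain and domain.lower() not in common_domains:
        terms.append(domain)
    return terms
```

So a Gmail address yields just its username, while an address at a small company domain yields both the username and the domain as separate terms.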
Whenever you’re searching data, you should always know what your data looks like and do some test searches. One area where that is particularly true is phone numbers. Most datasets don’t have any special characters between the numbers, but some have country codes and some don’t.
Another special idiosyncrasy is that FileLocator Pro finds partial hits on words, but not on numbers. Searching for “matt0177” will also hit on “matt01773.” This is fine. What sucks is that searching for 9155551212 will hit on that but WILL NOT hit on 19155551212. Because of that, I recommend searching for phone numbers both with and without the country code.
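Generating both forms by hand gets old fast, so here is a rough Python sketch that does it for you. It assumes US-style numbers (country code 1, ten national digits) by default; the function name and parameters are hypothetical, and you would need different values for other countries.

```python
import re


def phone_search_terms(number, country_code="1", national_length=10):
    """Return the number without and with the country code, digits only."""
    # Strip spaces, dashes, parentheses, plus signs and anything else non-numeric.
    digits = re.sub(r"\D", "", number)

    # If the country code is already on the front, peel it off first.
    has_code = (
        len(digits) == national_length + len(country_code)
        and digits.startswith(country_code)
    )
    national = digits[len(country_code):] if has_code else digits

    return [national, country_code + national]


print(phone_search_terms("(915) 555-1212"))
print(phone_search_terms("+1 915-555-1212"))
```

Both calls give the same pair of terms, 9155551212 and 19155551212, which can go straight into the search term file.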
This whole process feels a little rough around the edges at times, but it’s a pretty low effort way to make huge datasets searchable and usable for OSINT efforts.