This article has been published in the Sleuthkit Informer #16
Description
Forensic investigations of hard drives and images benefit greatly from the many tools that are available. Some of these tools are Open Source or Free Software, and some are commercial.
Searching for keywords is probably one of the most frequently performed actions during a forensic investigation. Depending on the tools used and the size of the hard drive or image under investigation, this can take a lot of time.
In order to speed up searches it is possible to make a time-space trade-off and create an index of the forensic image. Such an index takes up a portion of space on the hard drive but this index will greatly speed up searches afterwards.
At first I used a commercial Windows-based tool to index my images and quickly search through them, but I had some very bad experiences with it. A number of times during important investigations the tool crashed or hung while creating an index, or was unable to open an index it had just created, without any identifiable reason. As most forensic investigations are done within a very short time frame, it is very frustrating to be presented with a crash or a failure to open an index file after a full day, night or weekend of indexing (depending on the size of the drive).
Because of these very frustrating experiences, I decided to design and create tools that could be used to perform indexed searches on images. In addition, I chose to make these tools an addition to the existing Open Source forensic tools Autopsy and Sleuthkit.
The collection of tools is called ‘Searchtools’.
The workings of Searchtools
Index parameters
Because it is not viable to create an index containing every string present in an image, a number of parameters determine which strings end up in the index and therefore how large the index files become. The most important of these parameters are:
- Minimum string length (Default: 4)
- Maximum string length (Default: 15)
- Characters indexed (Default: alphanumeric characters)
- Folding parameter (Default: no folding)
The minimum and maximum string length determine the lengths of the strings that will be indexed. Indexing strings shorter than 4 characters will result in a large amount of rubbish, due to the high chance that 3 indexable characters occur in succession in a piece of random or binary data. Indexing strings longer than 15 characters will not result in more useful information.
The characters that are indexed should depend on the needs of the investigator. If only words have to be searched, it is wise to only index alphabetic characters in order to limit the size of the index and thus the time it takes to generate.
The folding parameter specifies whether diacritic characters should map to their non-diacritic counterparts in the index. This allows for easier searching of words that contain diacritic characters. Currently only folding of diacritic ISO 8859-1 characters is supported.
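To make the idea of folding concrete, here is a minimal Python sketch. It is not the Searchtools implementation: the function name `fold_latin1` is hypothetical, and using Unicode decomposition to strip the accents is an assumption made for illustration.

```python
import unicodedata

def fold_latin1(text: str) -> str:
    """Map accented letters to their base letter (hypothetical helper)."""
    folded = []
    for ch in text:
        # NFD splits a precomposed letter into base letter + combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Keep the original character if nothing is left after stripping.
        folded.append(base if base else ch)
    return "".join(folded)

raw = b"caf\xe9 na\xefve"                      # ISO 8859-1 encoded bytes
print(fold_latin1(raw.decode("iso-8859-1")))   # cafe naive
```

Characters without a decomposition (for example 'Ø' or 'ß') pass through unchanged in this sketch; a real folding table may treat them differently.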
Index types
Currently Searchtools is able to create two different types of indexes:
- Raw index
- Raw fragments index
The Raw index type contains all the strings that are located in the raw image. This means that this index type does not take into account any form of structure that might be available or present on the image.
If a string is located in a fragmented file and spans non-consecutive sectors, then it will not be found using the raw index. To find this string, the data on the image is indexed using the original file system structure. This index is called the raw fragment index. To reduce the raw fragment index size and prevent duplicate entries in the indexes, this index contains only the strings that start in one fragment and end in a non-consecutive fragment.
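The following Python sketch illustrates which strings would belong in such an index. It is a simplification under stated assumptions: the fragment list, the function name `spanning_strings` and the returned file-relative offsets are all hypothetical, and the real indexer obtains the fragments from the file system itself.

```python
import re

def spanning_strings(fragments, min_len=4):
    """fragments: list of (image_offset, data) tuples in file order.

    Returns (file_offset, string) for every alphanumeric run that crosses a
    boundary between two fragments that are NOT adjacent in the image.
    """
    logical = b"".join(data for _, data in fragments)
    jumps = []   # file offsets at which a non-consecutive fragment starts
    pos = 0
    for i, (image_offset, data) in enumerate(fragments):
        if i > 0:
            prev_offset, prev_data = fragments[i - 1]
            if prev_offset + len(prev_data) != image_offset:
                jumps.append(pos)
        pos += len(data)
    hits = []
    for run in re.finditer(rb"[A-Za-z0-9]+", logical):
        crosses = any(run.start() < jump < run.end() for jump in jumps)
        if crosses and len(run.group()) >= min_len:
            hits.append((run.start(), run.group().decode("ascii")))
    return hits

# One file stored in two fragments that are far apart in the image.
fragments = [(0, b"xx notifi"), (4096, b"cations yy")]
print(spanning_strings(fragments))   # [(3, 'notifications')]
```

A raw scan of the image would see only 'notifi' and 'cations' at two unrelated places, which is why a string like this can only be found through the file system structure.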
Simplified index example
In order to visualize the data that is contained in an index, a small example is presented. The example creates a simplified raw index of a file containing only the string “This looks like a sentence: look looks looked”. The default parameters are used, thus only strings with a length of 4 to 15 are indexed. A simplified parsing of the file results in the following index information:
Offset  String      Offset  String
0       this        22      ence
5       looks       28      look
6       ooks        33      looks
11      like        34      ooks
18      sentence    39      looked
19      entence     40      ooked
20      ntence      41      oked
21      tence
Note: All locations are zero-based.
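The listing above can be reproduced with a few lines of Python. This is only a sketch of the simplified parsing used in this example, not the indexer's actual code: the function name `simplified_index`, the lower-casing and the way the maximum length truncates entries are assumptions.

```python
import re

def simplified_index(data: str, min_len: int = 4, max_len: int = 15):
    """Return (offset, string) pairs for every indexable suffix of each run."""
    entries = []
    for run in re.finditer(r"[A-Za-z0-9]+", data):
        word = run.group().lower()
        # Every suffix of at least min_len characters gets its own entry,
        # truncated to at most max_len characters.
        for i in range(len(word) - min_len + 1):
            entries.append((run.start() + i, word[i:i + max_len]))
    return entries

for offset, string in simplified_index(
        "This looks like a sentence: look looks looked"):
    print(offset, string)
```

Run on the example sentence, this prints the same fifteen offset/string pairs as the listing, from `0 this` up to `41 oked`.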
Internally, though, the information is represented as a tree, so a more accurate representation is the following simplified drawing:
                e - n - c - e(22)
               /     \
              /       t - e - n - c - e(19)
             /
            /- l - i - k - e(11)
           /    \
          /      o - o - k(28) - s(5,33)
         /                \
        /                  e - d(39)
       /
      /----- n - t - e - n - c - e(20)
  root
      \          k - e - d(41)
       \        /
        \----- o - o - k - s(6,34)
         \              \
          \              e - d(40)
           \
            \--- s - e - n - t - e - n - c - e(18)
             \
              -- t - e - n - c - e(21)
                  \
                   h - i - s(0)
As can be seen, the internal representation uses the letters of the indexed strings as nodes in a tree. At the node of the final letter of an indexed string, the offsets of that string are located.
So if one now wanted to search for the string “look”, only the letters of the string have to be walked in the tree from the root node to see that the string is present at location 28. If all strings starting with “look” are to be found, all nodes beneath that node have to be accounted for too, thus resulting in the locations 5, 28, 33 and 39.
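The walk described above can be sketched in Python as follows. The `Node`, `insert` and `find` names are hypothetical, and the on-disk format used by Searchtools is far more compact; this only mirrors the logic of the example.

```python
class Node:
    def __init__(self):
        self.children = {}   # letter -> Node
        self.offsets = []    # offsets of strings that end at this node

def insert(root, word, offset):
    node = root
    for letter in word:
        node = node.children.setdefault(letter, Node())
    node.offsets.append(offset)

def find(root, word, prefix=False):
    node = root
    for letter in word:
        node = node.children.get(letter)
        if node is None:
            return []
    if not prefix:
        return sorted(node.offsets)
    # Prefix search: collect the offsets of the whole subtree.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.extend(current.offsets)
        stack.extend(current.children.values())
    return sorted(found)

root = Node()
for offset, word in [(0, "this"), (5, "looks"), (6, "ooks"), (11, "like"),
                     (18, "sentence"), (19, "entence"), (20, "ntence"),
                     (21, "tence"), (22, "ence"), (28, "look"), (33, "looks"),
                     (34, "ooks"), (39, "looked"), (40, "ooked"), (41, "oked")]:
    insert(root, word, offset)

print(find(root, "look"))                # [28]
print(find(root, "look", prefix=True))   # [5, 28, 33, 39]
```

The cost of an exact lookup depends only on the length of the search string, not on the size of the image.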
Index directories
In order to facilitate indexed searches a directory is created: the index directory. This directory contains the resulting index for one image. The index itself currently consists of three different file types:
- Index configuration file
- Raw index files
- Raw fragment index files
Exactly one index configuration file is located in an index directory; it contains the general information used for creating the index. The file is used by the different tools in Searchtools and stores the configuration in binary form, so it is not meant to be read by human beings.
The actual index is split into a number of raw index files and a number of raw fragment index files. The reason for not using a single large index file is simple: the current process of generating an index requires a lot of memory. Each file holds the contents of one full memory buffer dumped to disk. Thus, only if the generating computer has an immense amount of memory will a single index file be the result.
The index files contain very compact and optimized tree representations created during the indexing process. As described above, each file contains the contents that could fit in one full memory buffer. The tree in memory is optimized for in-memory use. The tree in file format is optimized for searching with the fewest possible reads and thus disk seeks.
Different Searchtools
Overview
The collection of searchtools consists of:
- indexer: Performs the actual indexing of the image and creates the indexes that can be used for indexed searching.
- searcher: Uses the indexes created by indexer and can perform quick searches.
- print_keywords: Prints a sorted list of the keywords found in an index directory or a specific index file.
- counter: Checks and counts the number of offsets and nodes that are contained within an index file.
- print_config: Prints the information from the config file contained in an index directory in a human readable form.
- print_header: Prints the header of an index file in a human readable form.
Demonstration
This section continues with a short description/demonstration of the most used tools. Not all options will be demonstrated and this will definitely not be a complete manpage for these tools, but this demonstration will give a general idea of the possibilities and capabilities of the Searchtools.
Some commands will be timed by using the standard ‘time’ command integrated in most shells.
The image that we are using is a dd image of a 50 MB Linux ext2 partition that is packed with data, meaning that almost all of the 50 MB is used by the files present on the partition.
# ls -l test.img
-rw-r--r-- 1 paul paul 50M Jul 27 21:10 test.img
First we will create a standard index of the image (using the most important parameters as specified above).
# time indexer -v test.img idx_std
Starting raw indexing.
Done 100.0 percent: 282 kNodes 6447 kOffsets 27M Mem
Saving.
Read 52428800 bytes.
Total nodes 369063.
Total offsets 6447387.
Starting raw fragment indexing.
Done 100.0 percent: 1 kNodes 0 kOffsets 0M Mem
12824/12824 Inodes
Saving.
Total nodes 1380.
Total offsets 437.
real 0m35.398s
user 0m28.750s
sys 0m1.180s
The output of the indexer command shows that, with these index parameters, a total of 6.447.387 raw index entries were indexed, plus a small total of 437 raw fragment index entries. Indexing this small 50 MB image takes around 35 seconds on this 2.4 GHz PC. A rough linear extrapolation gives about 12 minutes per gigabyte of image. Note, though, that whenever memory is filled (250 MB by default), the contents have to be written to disk before indexing can continue.
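The extrapolation is simple arithmetic and can be checked with a short Python sketch. It assumes that indexing time scales linearly with image size, which ignores the extra disk writes mentioned above; the function name `minutes_per_gb` is hypothetical.

```python
def minutes_per_gb(elapsed_s: float, image_mb: float, mb_per_gb: int = 1024) -> float:
    """Linearly extrapolate a measured indexing time to one gigabyte."""
    return elapsed_s * mb_per_gb / image_mb / 60

# 35.398 s of wall-clock time for the 50 MB test image.
print(f"{minutes_per_gb(35.398, 50, 1000):.1f} min per 1000 MB")   # 11.8
print(f"{minutes_per_gb(35.398, 50, 1024):.1f} min per 1024 MB")   # 12.1
```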
The resulting index directory contains the following files:
# ls -l idx_std
516 index.cnf
20437 raw_frag_idx.000
17672517 raw_idx.000
As can be seen, the raw index file is about 17 MB and contains all of the 6.447.387 raw index entries that were found during the previous step.
Now that the index is created, we will search for ‘notifications’, which occurs once in the image.
# time searcher test.img idx_std notifications
Type: Raw
50712898 notifications
real 0m0.003s
user 0m0.010s
sys 0m0.000s
The output of the searcher command shows us that the string ‘notifications’ is located at byte offset 50.712.898 of the image and that the search took only a fraction of a second.
This time the image is searched for the string ‘data’, which occurs 23.270 times in the image. (The -i flag is used for case-insensitive searching, and -p for an output format that is easier to parse.)
# time searcher test.img idx_std data -i -p
raw 253452 database
raw 254271 data
<snipped lots of results>
raw 52357097 DataBase
raw_frag 1913 12274 DATA
real 0m0.132s
user 0m0.070s
sys 0m0.060s
The search and retrieval of these results took much less than one second and returned 28.189 results. Wait a minute! Didn’t I just point out that the string ‘data’ occurs only 23.270 times in this image? By default the searcher returns all strings starting with the search string, so ‘database’ counts as a hit too. By specifying the ‘-w’ flag, only the keywords that exactly match the search string are returned.
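The difference between the two counts can also be seen by post-processing the -p output. The Python sketch below assumes, based only on the sample lines shown above, that each result line starts with `raw` or `raw_frag` and ends with the matched string; the function name `count_matches` is hypothetical.

```python
def count_matches(lines, keyword):
    """Return (exact, prefix) match counts, compared case-insensitively."""
    keyword = keyword.lower()
    exact = prefix = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] not in ("raw", "raw_frag"):
            continue   # skip anything that is not a result line
        found = fields[-1].lower()
        if found.startswith(keyword):
            prefix += 1
            if found == keyword:
                exact += 1
    return exact, prefix

sample = [
    "raw 253452 database",
    "raw 254271 data",
    "raw 52357097 DataBase",
    "raw_frag 1913 12274 DATA",
]
print(count_matches(sample, "data"))   # (2, 4)
```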
Sometimes just searching will not do. In order to find special words you want to be able to look at the number of occurrences for a specific keyword, or all keywords that are present within the image. The print_keywords command prints all the keywords in an index directory or in a specific index file. In order to facilitate scripting it is possible to let print_keywords skip the count that is appended to the end by default.
# time print_keywords -d idx_std
0000 2307
00000 1982
<snipped lots of results>
priorities 37
prioritized 76
prioritizing 2
priority 1824
prioritydata 42
prioritynames 2
<snipped lots of results>
zzzvz 1
zzzz 2
zzzzz 2
real 0m2.761s
user 0m1.650s
sys 0m0.250s
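As an example of such scripting, the Python sketch below picks the rarely occurring keywords out of the print_keywords output. It assumes each line has the form shown above, a keyword followed by its count; the function name `rare_keywords` and the threshold are hypothetical.

```python
def rare_keywords(lines, max_count=2):
    """Return (keyword, count) pairs that occur at most max_count times."""
    rare = []
    for line in lines:
        parts = line.rsplit(None, 1)   # split off the trailing count
        if len(parts) == 2 and parts[1].isdigit():
            keyword, count = parts[0], int(parts[1])
            if count <= max_count:
                rare.append((keyword, count))
    return rare

sample = [
    "priorities 37",
    "prioritized 76",
    "prioritizing 2",
    "priority 1824",
    "prioritynames 2",
    "zzzvz 1",
]
print(rare_keywords(sample))
# [('prioritizing', 2), ('prioritynames', 2), ('zzzvz', 1)]
```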
This concludes the small demonstration of the searchtools.
Conclusion
This article only lightly discusses the internal workings of the Searchtools, but I hope it is able to shed a little light on the subject for people interested in it. If any of you require extra information, don’t hesitate to ask, as I probably want to create the documentation anyway and would then have an incentive to actually do it.
Almost all functionality described in this article is also available in some way from the Autopsy interface. This article only used the command-line versions of the tools in order to visualize the actions done under the hood by the Autopsy interface.