A Study of Malcode-Bearing Documents
Wei-Jen Li, Salvatore Stolfo, Angelos Stavrou, Elli Androulaki,
and Angelos D. Keromytis
Computer Science Department, Columbia University
{weijen,sal,angel,elli,angelos}@cs.columbia.edu
Abstract. By exploiting the object-oriented dynamic composability of
modern document applications and formats, malcode hidden in otherwise
inconspicuous documents can reach third-party applications that may
harbor exploitable vulnerabilities otherwise unreachable by network-level
service attacks. Such attacks can be very selective and difficult to detect
compared to the typical network worm threat, owing to the complexity
of these applications and data formats, as well as the multitude of
document-exchange vectors. As a case study, this paper focuses on
Microsoft Word documents as malcode carriers. We investigate the
possibility of detecting embedded malcode in Word documents using two
techniques: static content analysis using statistical models of typical
document content, and run-time dynamic tests on diverse platforms. The
experiments demonstrate that these approaches can detect not only known
malware but also most zero-day attacks. We identify several problems
with both approaches, representing both challenges in addressing the
problem and opportunities for future research.
Keywords: Intrusion Detection, N-gram, Sandbox Diversity.
1 Introduction
In this paper, we focus on stealthy and targeted attacks where malcode is
delivered to a host in an otherwise normal-appearing document. Modern documents
and the corresponding applications make use of embedded code fragments. This
embedded code is capable of indirectly invoking other applications or libraries
on the host as part of document rendering or editing. For example, a pie chart
displaying the contents of a spreadsheet embedded in a Word document will
cause Excel components to be invoked when the Word document is opened. As
a result, documents offer a convenient means for attackers to penetrate systems
and reach third-party host-based applications that may harbor vulnerabilities
which are not reachable, and thus not directly exploitable, remotely over the
network. Disturbingly, attackers are simply exploiting deliberate features that
are critical to the way modern document-handling applications operate, instead
of some temporary vulnerabilities or bugs.
B. M. Hämmerli and R. Sommer (Eds.): DIMVA 2007, LNCS 4579, pp. 231–250, 2007.
© Springer-Verlag Berlin Heidelberg 2007
Several cases have been reported where malcode has been embedded in
documents (e.g., PDF, Word, Excel, and PowerPoint [1,2,3]), transforming them
into a vehicle for host intrusions. These trojan-infected documents can be
served up by any arbitrary web site or search engine in a passive “drive-by”
fashion, transmitted over email or instant messaging (IM), or even introduced
to a system by other media such as CD-ROMs and USB drives, bypassing all the
network firewalls and intrusion-detection systems. Furthermore, the attacker
can use such documents as a stepping stone to reach other systems that are
unreachable via the regular network. Hence, any machine inside an organization
with the ability to open a document can become the spreading point for the
malcode to reach any host within that organization. Indeed, a recent attack of
this nature, using Wikipedia, was reported in [4]. There is nothing new about
the presence of viruses in streams, embedded as attached documents, nor is the
use of malicious macros, e.g., in Word documents, a new threat [5,6]. However,
simply disabling macros does not solve the problem; other forms of code may be
embedded in Word documents, for which no easy solution is available other than
not using Word altogether.
The underlying problem is that modern document formats are essentially
object containers (e.g., the Object Linking and Embedding (OLE) format for
Word) for any executable object. Hence, one should expect to see any kind of
code embedded in a document. Since malcode is code, one cannot be entirely
certain whether a piece of code detected in a document is legitimate, unless
it is discovered embedded in an object that typically does not contain code.
Simply stated, modern document formats provide a convenient object-container
format and thus constitute a convenient and easy-to-use “code-injection
platform.”
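As a small illustration of the container view, the sketch below checks whether a buffer begins with the eight-byte signature that opens an OLE2 compound file, the container used by legacy Word .doc files. The function names are hypothetical and are not taken from the paper. A match says only that the file is an object container; it says nothing about what the embedded objects contain.

```python
# Signature that opens every OLE2 compound file (legacy .doc, .xls, .ppt).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(data: bytes) -> bool:
    """Return True if the buffer starts with the OLE2 compound-file signature."""
    return data.startswith(OLE2_MAGIC)


def looks_like_ole2_file(path: str) -> bool:
    """Read only the first eight bytes of a file and test them."""
    with open(path, "rb") as handle:
        return looks_like_ole2(handle.read(len(OLE2_MAGIC)))
```

Because the signature identifies only the outer container, deciding whether an embedded object is legitimate still requires inspecting the content itself.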
To better illustrate the complexity of the task of identifying malcode in
documents through a concrete study, we limit our investigation to Microsoft
Word document files; Word documents serve as a “container” for complex object
embeddings that need to be parsed and executed to render the document for
display. In addition to the well-known macro viruses, two further possible
scenarios are introduced below:
Execution strategies of embedded malcode: From the attacker's perspective,
the optimal attack strategy is to architect the injected code as an embedded
object that would be executed automatically upon rendering the document. In
addition to automated techniques such as those exploiting the WMF, PNG, and
JPEG vulnerabilities, an attacker can also use social engineering whereby an
embedded object in a document, appearing as an icon, is opened manually by the
user, launching an attack, including attacks against vulnerable third-party
applications. The left-side screen shot of Fig. 1 is an example of a Word
document with embedded malcode, in this case a copy of the Slammer worm, with
a message enticing a user to click on the icon and launch the malcode.
Dormant malcode in multi-partite attacks: Another stealth tactic is to
embed malcode in documents that executes neither automatically nor by user
intervention when the document is opened, but rather lies dormant in the file
store of the target environment awaiting a future attack that would extract the
hidden malcode. This multi-partite attack strategy could be used to successfully
embed an arbitrarily large and sophisticated collection of malcode components
across multiple documents. The right screen shot in Fig. 1 demonstrates another
simple example of embedding a known malicious code sample, in this case also
Slammer, into an otherwise normal Word document. The document opens
entirely normally, with Slammer sitting idly in memory. Both infected files
open normally in a Windows environment. However, the right one shows
no discernible differences from a normal document, while a different document
could incorporate this Slammer-laden document when it is opened and invoke
the malcode contained therein. Although real-world attacks identical to our
example have not appeared, similar scenarios that combine multiple attacks have
been studied. Bontchev [5] discussed a new macro attack that can be created by
combining two known malicious macros (e.g., a macro virus resides on a
machine, another macro virus reaches it, and “mutates” into a third virus).
Filiol et al. [7] analyzed the complexity of another type of virus, named
k-ary viruses, which combine actions of multiple attacks.
Fig. 1. Left: A screen shot of an embedded executable object to entice the user to
click and launch malcode. Right: Example of malicious code (Slammer) embedded in
a normal document.
Our aim is to study the effectiveness of two techniques that have been applied
in the context of “traditional” network worms: statistical analysis of content to
identify portions of input that deviate from expected normal content as
estimated from a training corpus, and detection of malicious behavior by dynamic
execution on multiple, diverse platforms. The challenge is to find a method to
inspect the binary content of any document file before it is opened to determine
whether it is suspicious and may indeed be infected with malicious code, without
a priori knowledge of the specific code in question or where it may be embedded
in the document.
Initially, we explore the detection capabilities of statistical analysis
techniques. More specifically, we investigate the application of statistical
modeling techniques to characterize the typical content of documents. Our goal
is to determine whether we can detect embedded malcode using statistical
methods on the binary file content. Furthermore, we introduce novel dynamic
run-time tests that attempt to expose the attackers’ actions through
application diversity: we open the files using a set of different
implementations of document processing applications in a sandboxed
environment. To quantify the detection capabilities of statistical analysis,
we perform a series of experiments where statistical analysis is applied to
labeled training documents to characterize both normal and malicious document
content. Our experiments show that statistical analysis techniques outperform
generic COTS Anti-Virus (AV) scanners. To further improve our detection
capability, we designed novel tests that harness application diversity to
expose malicious byte-code. In these tests, documents are opened in a diverse
set of sandboxed and emulated environments, exposing malicious code execution.
We show that in most cases, malicious code depends on operating system or
program characteristics for successful completion of its execution. In the
process of our experimentation, we discovered that attackers use existing
benign documents as vehicles for their attack. Thus, we can further improve
our classification if we use benign documents from the Web to train our
detectors since even small deviations from normality can expose an attack.
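To make the idea of training on labeled documents concrete, the following minimal sketch builds one byte-value (1-gram) frequency model per class and assigns a test buffer to the closer model. The choice of Manhattan distance, the single averaged model per class, and all function names are assumptions made for illustration; they are not the models evaluated in the experiments.

```python
from typing import Iterable, List


def byte_histogram(data: bytes) -> List[float]:
    """Normalized byte-value frequencies (256 entries; all zeros for empty input)."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    total = len(data) or 1  # avoid dividing by zero on empty input
    return [c / total for c in counts]


def train_model(samples: Iterable[bytes]) -> List[float]:
    """Average the histograms of all training samples into one model."""
    histograms = [byte_histogram(s) for s in samples]
    if not histograms:
        raise ValueError("need at least one training sample")
    return [sum(column) / len(histograms) for column in zip(*histograms)]


def manhattan(p: List[float], q: List[float]) -> float:
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def classify(data: bytes, benign_model: List[float],
             malicious_model: List[float]) -> str:
    """Label the buffer by whichever class model is closer to its histogram."""
    histogram = byte_histogram(data)
    if manhattan(histogram, malicious_model) < manhattan(histogram, benign_model):
        return "malicious"
    return "benign"
```

Averaging every sample of a class into one model keeps the sketch short, but it blurs the differences between document subtypes; a set of per-subtype models would preserve more information.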
Our results indicate that both static statistical and dynamic detection
techniques can be employed to detect malicious documents. However, there are some
weaknesses that make each method incomplete if used in isolation. For
statistical analysis, we would like to be able to determine the “intent” and “effect”
of the malicious code. On the other hand, dynamic tests may fail to detect the
presence of stealthy malcode that is designed to hide its actions. Hence, neither
technique alone will solve the problem in its entirety. We posit that a hybrid
approach integrating dynamic and static analysis techniques will likely provide
a suitable solution.
Paper Organization: The next section discusses related work and research
reported in the literature. Section 3 describes the static statistical approach,
including an overview of the byte-value n-gram algorithm, the SPARSEGui
program, and the experimental results. We introduce the dynamic run-time tests
and the use of application diversity in Section 4. Section 5 concludes the paper
by suggesting that collaborative detection methods may provide a
fruitful path forward.
2 Background and Related Work
2.1 Binary Content File Analysis
Probabilistic modeling in the area of content analysis mainly involves n-gram
approaches [8,9,10]; the binary contents of a file are measured and the
distribution of the frequency of 1-grams, as well as of each fixed-size n-gram,
is computed. An early research effort in this area is the Malicious Email
Filter [11], which applies a naive Bayes classifier to the binary content of
email attachments known to be viral. The classifier was trained on both
“normal” executables and known viruses to determine whether emails likely
included malicious attachments that should be filtered.
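The n-gram frequency distribution described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in the cited work: the function name is hypothetical, and the window is assumed to slide one byte at a time.

```python
from collections import Counter
from typing import Dict


def ngram_distribution(data: bytes, n: int = 1) -> Dict[bytes, float]:
    """Relative frequency of each byte n-gram in `data` (sliding window, stride 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    total = len(data) - n + 1  # number of windows of length n
    if total <= 0:
        return {}
    counts = Counter(data[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}
```

With n = 1 this yields the byte-value distribution of a file; larger n captures short byte sequences, at the cost of a feature space that grows as 256 to the power n.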
Others have applied similar techniques, including, for example, Abou-Assaleh
et al. [12,13], to detect worms and viruses. Furthermore, Karim et al. suggest
that malicious programs are frequently related to previous ones [14]. They
define a variation on n-grams called “n-perms.” An n-perm represents every
possible permutation of an n-gram sequence, and n-perms can be used to match
possibly permuted malicious code. McDaniel and Heydari [15] introduce
algorithms for generating “fingerprints” of file types using byte-value
distributions of file content. However, instead of computing a set of centroid
models, they compute a single representative fingerprint for the entire class.
This strategy may be unwise: mixing the statistics of different subtypes and
averaging the statistics of an aggregation of examples may tend to lose
information. A report from AFRL proposes the Detector and Extractor of
Fileprints (DEF) process for data protection and automatic file identification
[16]. By applying the DEF process, they generate visual hashes, called
fileprints, to measure the integrity of a data sequence, to compare the
similarity between data sequences, and to identify the data type of an unknown
file. Goel [17] introduces a signature-matching technique based on Kolmogorov
Complexity metrics for file type identification.
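The n-perm idea can be illustrated by mapping every window to an order-independent key, so that all permutations of the same n symbols are counted together. The sketch below works on raw bytes and canonicalizes each window by sorting it; both choices, and the function name, are assumptions for illustration and may differ from the features used in [14].

```python
from collections import Counter


def nperm_counts(data: bytes, n: int) -> Counter:
    """Count n-perms: n-grams whose byte order is ignored.

    Each window is canonicalized by sorting its bytes, so every
    permutation of the same n bytes maps to one key.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return Counter(
        bytes(sorted(data[i:i + n])) for i in range(len(data) - n + 1)
    )
```

For instance, the windows `abc`, `bca`, and `acb` all collapse to the key `abc`, which is what allows reordered code to match.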
2.2 Steganalysis
There exists a substantial literature on the subject of steganography, the means
of hiding secret messages embedded in otherwise normal-appearing objects or
communication channels. We do not provide an analysis of this area since it is
not exactly germane to the topic of identifying embedded malcode in documents.
However, many of the steganalysis techniques that have been under investigation
to detect steganographic communication over covert channels may bear
resemblance to the techniques we applied during the course of this research study.
For example, Provos’ work on defeating steganalysis [18] highlights the
difficulty of identifying “foreign” material embedded cleverly within media
objects in a way that defeats statistical analysis while maintaining what
otherwise appears to be a completely normal-appearing object, e.g., a sensible
image.
The general class of steganographic embedding of secret messages may be
viewed as a “mimicry” attack, whereby the messages are embedded in such a
fashion as to mimic the statistical characteristics of the objects in which the
messages are embedded. Our task in this project was a more limited view of
the problem, to identify embedded “zero day malcode” inside documents. The
conjecture that drives our analysis is that code segments may be limited to a
specific set of statistical characterizations so that one may be able to differentiate
code from other material in which it is embedded; i.e., code embedded in an
image may appear to have a significantly different statistical distribution to