Polygraph Automatically Generating Signatures for Polymorphic Worms, Hacking and IT E-Book Dump Release

[ Pobierz całość w formacie PDF ]
Polygraph: Automatically Generating Signatures
for Polymorphic Worms
James Newsome
Carnegie Mellon University
jnewsome@ece.cmu.edu
Brad Karp
Intel Research Pittsburgh
brad.n.karp@intel.com
Carnegie Mellon University
bkarp+@cs.cmu.edu
Dawn Song
Carnegie Mellon University
dawnsong@cmu.edu
Abstract
tures,
that correspond to malicious trafﬁc. When such mali-
cious trafﬁc is found, the IDS may raise an alarm; block fu-
ture trafﬁc from the offending source address; or even block
the remainder of the offending ﬂow’s trafﬁc. To date, to
detect and/or block Internet worm ﬂows, IDSes use signa-
tures that match bytes from a worm’s
payload
, using match-
ing techniques including string matching at arbitrary pay-
load offsets [20, 21]; string matching at ﬁxed payload off-
sets [21]; and even matching of regular expressions within
a ﬂow’s payload [20].
It is natural to ask where the signature databases for
IDSes come from. To date, signatures have been generated
manually by security experts who study network traces af-
ter a new worm has been released, typically hours or days
after the fact. Motivated by the slow pace of manual sig-
nature generation, researchers have recently given attention
to
automating
the generation of signatures used by IDSes
to match worm trafﬁc. Systems such as Honeycomb [14],
Autograph [13], and EarlyBird [22] monitor network traf-
ﬁc to identify novel Internet worms, and produce signatures
for them using
pattern-based
analysis,
1
i.e.
, by extracting
common byte patterns across different suspicious ﬂows.
These systems all generate signatures consisting of a
sin-
gle, contiguous substring
of a worm’s payload, of
sufﬁcient
length
to match only the worm, and not innocuous trafﬁc.
The shorter the byte string, the greater the probability it
will appear in some ﬂow’s payload, regardless of whether
the ﬂow is a worm or innocuous. Thus, these signature gen-
eration systems all make the same underlying assumptions:
that there exists a single payload substring that will remain
invariant
across worm connections, and will be sufﬁciently
unique to the worm that it can be used as a signature without
causing false positives.
Regrettably, the above payload invariance assumptions
are naıve, and give rise to a critical weakness in these previ-
It is widely believed that content-signature-based intru-
sion detection systems (IDSes) are easily evaded by
poly-
morphic worms,
which
vary
their payload on every infec-
tion attempt. In this paper, we present Polygraph, a sig-
nature generation system that successfully produces signa-
tures that match polymorphic worms. Polygraph gener-
ates signatures that consist of
multiple disjoint
content sub-
strings. In doing so, Polygraph leverages our insight that
for a real-world exploit to function properly, multiple
in-
variant
substrings must often be present in all variants of
a payload; these substrings typically correspond to proto-
col framing, return addresses, and in some cases, poorly
obfuscated code. We contribute a deﬁnition of the poly-
morphic signature generation problem; propose classes of
signature suited for matching polymorphic worm payloads;
and present algorithms for automatic generation of signa-
tures in these classes. Our evaluation of these algorithms on
a range of polymorphic worms demonstrates that Polygraph
produces signatures for polymorphic worms that exhibit low
false negatives and false positives.
1. Introduction and Motivation
Enabled by ever-more pervasive Internet connectivity, an
increasing variety of exploitable vulnerabilities in software,
and a lack of diversity in the software running on Internet-
attached hosts, Internet worms increasingly threaten the
availability and integrity of Internet-based services.
Toward defending against Internet worms (and other at-
tacks), the research community has proposed and built in-
trusion detection systems (IDSes) [20, 21]. A network ad-
ministrator deploys an IDS at the gateway between his edge
network and the Internet, or on an individual end host. The
IDS searches inbound trafﬁc for known patterns, or
signa-
1
TaintCheck recently proposed a new approach,
semantic-based
auto-
matic signature generation [18]. We discuss this further in Section 8.
ously proposed signature generation systems. A worm au-
thor may craft a worm that substantially changes its payload
on every successive connection, and thus evades matching
by any single substring signature that does not also occur
in innocuous trafﬁc.
Polymorphism
techniques
2
, through
which a program may encode and re-encode itself into suc-
cessive, different byte strings, enable production of chang-
ing worm payloads. It is pure serendipity that worm au-
thors thus far have not chosen to render worms polymor-
phic; virus authors do so routinely [17, 24]. The effort re-
quired to do so is trivial, given that libraries to render code
polymorphic are readily available [3, 10].
It would seem that given the imminent threat of polymor-
phic worms, automated signature generation, and indeed,
even ﬁltering of worms using human-generated signatures,
are doomed to fail as worm quarantine strategies. In this
paper, we argue the contrary: that it is possible to gener-
ate signatures automatically that match the many variants of
polymorphic worms, and that offer low false positives and
low false negatives. This argument is based on a key insight
regarding the fundamental nature of polymorphic worms as
compared with that of polymorphic viruses. Polymorphic
viruses are executables stored locally on a host, invoked by
a user or application. As such, their content may be entirely
arbitrary, so long as when executed, they perform the oper-
ations desired by the author of the virus. That is, a poly-
morphic generator has free reign to obfuscate all bytes of
a virus. In sharp contrast, to execute on a vulnerable host,
a worm must exploit one or more speciﬁc server software
vulnerabilities.
In practice, we ﬁnd that exploits contain
invariant
bytes
that are crucial to successfully exploiting the vulnerable
server. Such invariant bytes can include protocol framing
bytes, which must be present for the vulnerable server to
branch down the code path where a software vulnerabil-
ity exists; and the value used to overwrite a jump target
(such as a return address or function pointer) to redirect
the server’s execution. Individually, each of these invariant
byte strings may cause false positives. Thus, in our work,
we explore automatic generation of signature types that in-
corporate
multiple disjoint byte strings,
that used together,
yield low false positive rates during trafﬁc ﬁltering. These
signature types include conjunctions of byte strings, token
subsequences (substrings that must appear in a speciﬁed or-
der, a special case of regular expression signatures, matched
by Bro and Snort), and Bayes-scored substrings.
Our contributions in this work are as follows:
Problem deﬁnition:
We deﬁne the signature generation
problem for polymorphic worms.
Signature generation algorithms:
We present Polygraph,
a suite of novel algorithms for automatic generation of sig-
natures that match polymorphic worms.
Evaluation on real polymorphic worms:
We use several
real vulnerabilities to create polymorphic worms; run our
signature generation algorithms on workloads consisting of
samples of these worms; evaluate the quality (as measured
in false positives and false negatives) of the signatures pro-
duced by these algorithms; and evaluate the computational
cost of these signature generation algorithms.
We proceed in the remainder of the paper as follows.
In Section 2, we ﬁrst provide evidence of the existence of
invariant payload bytes that cannot be rendered polymor-
phic using examples from real exploits, to motivate several
classes of signature tailored to match disjoint invariant byte
strings. We continue in Section 3 by setting the context in
which Polygraph will be used, and stating our design goals
for Polygraph. Next, in Section 4, we describe Polygraph’s
signature generation algorithms, before evaluating them in
Section 5. We discuss possible attacks against Polygraph in
Section 6; discuss our results in Section 7; review related
work in Section 8; and conclude in Section 9.
2. Polymorphic Worms:
Characteristics and
Signature Classes
To motivate Polygraph, we now consider the anatomy of
polymorphic worms. We refer to a network ﬂow containing
a particular infection attempt as an
instance
or
sample
of a
polymorphic worm. After brieﬂy characterizing the types
of content found in a polymorphic worm, we observe that
samples of the same worm often share some
invariant
con-
tent due to the fact that they exploit the same vulnerability.
We provide examples of real-world software vulnerabilities
that support this observation. Next, we demonstrate that
a single, contiguous byte string signature
3
cannot always
match a polymorphic worm robustly. Motivated by the in-
sufﬁciency of single substring signatures and the inherent
structure in many exploits, we identify a family of signa-
ture types more expressive than single substrings that better
match an exploit’s structure. While these signature types
are more complex than single substring signatures, and thus
computationally costlier to generate and match, they hold
promise for robust matching of polymorphic worms.
2.1. Exploits and Polymorphism
Within a worm sample, we identify three classes of
bytes.
Invariant bytes
are those ﬁxed in value, which if
changed, cause an exploit no longer to function. Such bytes
3
For brevity, we hereafter refer to such signatures as
single substring
signatures.
2
Throughout this paper, we refer to both polymorphism and metamor-
phism as polymorphism, in the interest of brevity.
are useful as portions of signatures.
Wildcard bytes
are
those which may take on any value without affecting the
correct functioning of a worm—neither its exploit nor its
code. Finally,
code bytes
are the polymorphic code executed
by a worm, that are the output of a polymorphic code en-
gine. Typically, the main worm code will be encrypted un-
der a different key in each worm sample. Execution starts at
a small decryption routine, which is obfuscated differently
in each worm sample. The degree of variation in code bytes
from worm sample to worm sample depends on the quality
of the polymorphic obfuscator used—a poor polymorphic
obfuscater may leave long regions of bytes unchanged be-
tween the code instances it outputs, whereas a more aggres-
sive one may leave nearly no multi-byte regions in common
across its outputs. In this work, we do not depend on weak-
nesses of current code obfuscators to be able to generate
quality signatures. Instead, we render worms to be
perfectly
polymorphic
, by ﬁlling in code bytes with values chosen
uniformly at random. We will also show that the current
generation of polymorphic obfuscators actually do produce
invariant byte sequences in their output, which means that
we should be able to generate even higher quality signatures
for worms that use these real-world code obfuscators.
force a jump to some speciﬁc point in library code. Such
exploits typically must include an address from some small
set of narrow ranges in the request. In attacks that redirect
execution to injected code, the overwritten address must
point at or near the beginning of the injected code, mean-
ing that the high-order bytes of the overwritten address are
typically invariant. A previous study of exploits contains a
similar observation [19]. Attacks that redirect execution to
a library also typically select from a small set of candidate
jump targets. For example, CodeRed causes the server to
jump to an address in a common Windows DLL that con-
tains the instruction
call ebx
. For this technique to be
stable, the address used for this purpose must work for a
range of Windows versions. According to the Metasploit
op-code database, there are only six addresses that would
work across Windows 2000 service packs zero and one [4].
2.3. Examples: Invariant Content in Polymorphic
Worms
We manually identiﬁed the invariant content for exploits
of a range of vulnerabilities by analyzing server source code
(when available), and by studying how current exploits for
the vulnerabilities work. We now present six of the vulner-
abilities and exploits that we studied to illustrate the exis-
tence of invariant content in polymorphic worms, even with
an ideal polymorphic engine. We also present our analysis
of the output of one of the polymorphic generators, to show
how close the current generators are to the ideal.
Apache multiple-host-header vulnerability
First, we con-
sider the hypothetical payload of a polymorphic worm
structured like the payload of the Apache-Knacker ex-
ploit [9], shown in Figure 1. This exploit consists of a
GET
request containing multiple
Host
headers. The server con-
catenates the two
Host
ﬁelds into one buffer, leading to an
overﬂow. This exploit contains several invariant protocol
framing strings: “GET”, “HTTP/1.1”, and “Host:” twice.
The second
Host
ﬁeld also contains an invariant value used
to overwrite the return address.
BIND TSIG vulnerability
Next, we consider the Lion
worm [5]. We constructed a polymorphic version of the
Lion worm, shown in Figure 2. The Lion worm payload
is a DNS request, and begins with the usual DNS proto-
col header and record counts, all of which may be varied
considerably across payloads, and are thus wildcard bytes;
only a single bit in the header must be held invariant for
the exploit to function—the bit indicating that the packet
is a request, rather than a response. Next come two ques-
tion entries. The second contains an invariant value used
to overwrite a return address (also encoded in a QNAME).
Finally, to take the vulnerable code path in the server, the
exploit payload must include an Additional record of type
TSIG; this requirement results in three contiguous invariant
2.2. Invariant Content in Polymorphic Exploits
If a vulnerability requires that a successful exploit con-
tain invariant content, that content holds promise for use
in signatures that can match all variants of a polymorphic
worm. But to what extent do real vulnerabilities have this
property? We surveyed over ﬁfteen known software vul-
nerabilities, spanning a diverse set of operating systems and
applications, and found that
nearly all
require invariant con-
tent in any exploit that can succeed. We stress that we do
not claim all vulnerabilities share this property—only that
a signiﬁcant fraction do. We now describe the two chief
sources of invariant content we unearthed: exploit framing
and exploit payload.
Invariant Exploit Framing
A software vulnerability ex-
ists at some particular code site, along a code path exe-
cuted upon receiving a request from the network. In many
cases, the code path to a vulnerability contains branches
whose outcome depends on the content of the received re-
quest; these branches typically correspond to parsing of the
request, in accordance with a speciﬁc protocol. Thus, an
exploit typically includes
invariant framing
(
e.g.,
reserved
keywords or well known binary constants that are part of
a wire protocol) essential to exploiting a vulnerability suc-
cessfully.
Invariant Overwrite Values
Exploits typically alter the
control ﬂow of the victim program by overwriting a jump
target in memory with a value provided in the exploit, ei-
ther to force a jump to injected code in the payload, or to
vulnerable code path. It uses a buffer overrun to overwrite
a return address with a pointer to a
call esp
instruction
contained in a common Windows DLL. There are only a
small number of such values that work across multiple win-
dows versions.
CodeRed
The CodeRed [6] exploit takes advantage of a
buffer overﬂow when converting ASCII to Unicode. The
exploit must be a
GET
request for a
.ida
ﬁle. The
value used to overwrite the return address must appear
later in the URL. CodeRed overwrites the return address
to point to
call esp
. There are only a small number of
such pointers that will work across multiple Windows ver-
sions. Hence, the exploit must contain the invariant proto-
col framing string “GET”, followed by “.ida?”, followed by
a pointer to
call esp
.
AdmWorm
The AdmWorm [7] exploits BIND via a buffer
overrun. Unlike the other exploits described here, there are
no
invariant protocol framing bytes in this exploit. How-
ever, there is still an invariant value used to overwrite a re-
turn address.
Decryption
Routine
Decryption
Key
Encrypted
Payload
NOP
Slide
Return
Address
Random
Headers
Payload
Part 1
Random
Headers
Payload
Part 2
Random
Headers
GET
URL
HTTP/1.1
Host:
Host:
Figure 1. Polymorphed Apache-Knacker ex-
ploit. Unshaded content represents wild-
card bytes; lightly shaded content represents
code bytes; heavily shaded content repre-
sents invariant bytes.
QNAME
[Return
NAME
TYPE
[TSIG]
QTYPE QCLASS
CLASS
[0x00]
Address]
Question 1
[Obfuscated
Shellcode]
DNS
Header
Record
Counts
Additional
Question 2
Record
eb 2d 59 31
d2 b2 20 8b 19 c1 c3 0e
Figure 2. BIND TSIG vulnerability,
as ex-
ploited by the Lion worm.
Shading as for
81 f3 81 68 44 b3 c1 c3 0a c1 c3 19 c1 c3
11 89 19 81 e9 ff ff ff ff 41 41 41 80 ea
02 4a 4a 74 07 eb d8 e8 ce ff ff ff 0b
Apache vulnerability.
bytes near the end of the payload.
Slapper
The Slapper worm [1] exploits a heap buffer over-
run vulnerability in Apache’s mod ssh module. Note that
the attack takes place during the initial handshake, meaning
that it is not encrypted. It is a two-part attack; It ﬁrst uses
the overrun to overwrite a variable containing the session-id
length, causing the server to leak pointer values. This part
must contain the normal protocol framing of a
client-
hello
message, as well as the value used to overwrite the
variable (0x70).
In the second part of the attack, another session is
opened, and the same buffer is overrun. This time, the
leaked data is patched in, allowing the exploit to perform
a longer buffer overrun while still not causing the server to
crash. The heap metadata is overwritten in such a way as to
later cause the GOT entry of
free
to be overwritten with
a pointer to the attacker’s code, placed previously on the
heap. Thus, there is an invariant overwrite value that points
to the attacker’s code, and another that points to the GOT
entry for
free
. An aggressively polymorphic worm may
try to target other GOT entries or function pointers as well.
However, there will still only be a relatively small number
of values that will work.
SQLSlammer
The SQLSlammer [2] exploit must begin
with the invariant framing byte 0x04 in order to trigger the
Figure 3. Output by Clet polymorphic engine
includes invariant substrings. Boxed bytes
are found in at least 20% of Clet’s outputs;
shaded bytes are found in all of Clet’s out-
puts.
Clet polymorphic engine
Figure 3 shows a sample output
by the Clet polymorphic code engine [10].
4
The output con-
sists of encrypted code, which is completely different each
time, and a decryption routine that is obfuscated differently
each time. In order to determine how effective the Clet ob-
fuscation is, we generated 100 Clet outputs for the same
input code, and counted substrings of all lengths in com-
mon among the decryption routines in these 100 outputs.
Strings that were present in
all
100 outputs appear with
shaded backgrounds; those that were present in at least 20
outputs, but fewer than all 100, appear boxed. Clearly, Clet
produces substrings that are entirely invariant across pay-
loads, and other substrings that occur in a substantial frac-
tion of payloads. However, upon examining the Clet source
4
We also evaluated the ADMmutate [3] polymorphic engine. We
present Clet as the more pessimal case, as it produced less invariant content
than the ADMmutate engine.
code, it seems likely that the obfuscation engine could be
improved signiﬁcantly, reducing the number of substrings
in common between Clet outputs.
address, and TSIG identiﬁer, two and three bytes long, re-
spectively, are too short to be speciﬁc to the Lion worm.
As we show in our evaluation in Section 5, we found false
positives when searching for those substrings in DNS traf-
ﬁc traces from a busy DNS server that is a nameserver for
top-level country code domains.
We conclude that single substring signatures cannot
match polymorphic worms with low false positives and low
false negatives.
2.4. Substring Signatures Insufﬁcient
As described previously, the pattern-based signature
generation systems proposed to date [14, 13, 22] gener-
ate single substring signatures, found either in reassembled
ﬂow payloads, or individual packet payloads. These sys-
tems thus make two assumptions about worm trafﬁc:
2.5. Signature Classes for Polymorphic Worms
A single invariant substring exists across payload in-
stances for the same worm; that is, the substring is
sensitive
, in that it will match all worm instances.
5
Motivated by the insufﬁciency of single substring sig-
natures for matching polymorphic worms robustly, we now
propose other signature classes that hold promise for match-
ing the particular invariant exploit framing and payload
structures described in this section. All these signatures are
built from substrings, or
tokens
. The signature classes we
investigate in detail in Section 4 include:
Conjunction signatures
A signature that consists of a set
of tokens, and matches a payload if
all
tokens in the set are
found in it, in any order. This signature type can match the
multiple invariant tokens present in a polymorphic worm’s
payload, and matching multiple tokens is more speciﬁc than
matching one of those tokens alone.
Token-subsequence signatures
A signature that consists
of an
ordered
set of tokens. A ﬂow matches a token-
subsequence signature if and only if the ﬂow contains the
sequence of tokens in the signature with the same order-
ing. Signatures of this type can easily be expressed as
regular expressions, allowing them to be used in current
IDSes [20, 21]. For the same set of tokens, a token sub-
sequence signature will be more speciﬁc than a conjunc-
tion signature, as the former makes an ordering constraint,
while the latter makes none. Framing often exhibits order-
ing;
e.g.
the TSIG record in the Lion worm, which must
come
last
in the payload for the exploit to succeed, and the
return address, which must therefore come before it.
Bayes signatures
A signature that consists of a set of to-
kens, each of which is associated with a
score
, and an over-
all
threshold.
In contrast with the exact matching offered by
conjunction signatures and token-subsequence signatures,
Bayes signatures provide probabilistic matching—given a
ﬂow, we compute the probability that the ﬂow is a worm us-
ing the scores of the tokens present in the ﬂow. If the result-
ing probability is over the threshold, we classify the ﬂow to
be a worm. Construction and matching of Bayes signatures
is less
rigid
than for conjunction or token-subsequence sig-
natures. This provides several advantages. It allows Bayes
signatures to be learned from suspicious ﬂow pools that
contain samples from unrelated worms, and even innocu-
ous network requests.
The invariant substring is sufﬁciently long to be
spe-
ciﬁc
; that is, the substring does not occur in any non-
worm payloads destined for the same IP protocol and
port.
Can a sensitive and speciﬁc single substring signature
be found in the example payloads in the Apache and DNS
exploits described in Section 2.3? Consider the Apache ex-
ploit. The unshaded bytes are wildcards, and cannot be re-
lied upon to provide invariant content; note that even the
NOP slide can contain signiﬁcantly varying bytes across
payloads, as many instruction sequences effectively may
serve as NOPs. If we assume a strong code obfuscator, we
cannot rely on there being an invariant substring longer than
two bytes long in the obfuscated decryption routine, shown
with light shading. The only invariant bytes are the heavily
shaded ones, which are pieces of HTTP protocol framing,
and a return address (or perhaps a two-byte preﬁx of the
return address, if the worm is free to position its code any-
where within a 64K memory region). Clearly, the HTTP
protocol framing substrings individually will not be spe-
ciﬁc, as they can occur in both innocuous and worm HTTP
ﬂows. By itself, even the two-to-four-byte return address
present in the payload is not sufﬁciently speciﬁc to avoid
false positives; consider that a single binary substring of
that length may trivially occur in an HTTP upload request.
As we show in our evaluation in Section 5, we have exper-
imentally veriﬁed exactly this phenomenon; we have found
return address bytes from real worm payloads in innocuous
ﬂows in HTTP request traces taken from the DMZ of Intel
Research Pittsburgh.
The Lion worm presents a similar story: the heavily
shaded invariant bytes, the high-order bytes of the return
5
It is possible that a worm’s content varies only very slightly across in-
stances, and that at least one of a small, constant-cardinality set of substring
signatures matches all worm instances. We view this case as qualitatively
the same as that where worm content is invariant, and focus our attention
herein on worms whose content varies to a much greater extent, such that
a small set of substring signatures does not sufﬁce to match all variants.
(We show that the other signature
[ Pobierz całość w formacie PDF ]

zanotowane.pl

doc.pisz.pl

pdf.pisz.pl

jaczytam.htw.pl

Polygraph Automatically ...

Polygraph Automatically Generating Signatures for Polymorphic Worms, Hacking and IT E-Book Dump Release

Archiwum