Polygraph Automatically Generating Signatures for Polymorphic Worms, Hacking and IT E-Book Dump Release

[ Pobierz całość w formacie PDF ]
Polygraph: Automatically Generating Signatures
for Polymorphic Worms
James Newsome
Carnegie Mellon University
jnewsome@ece.cmu.edu
Brad Karp
Intel Research Pittsburgh
brad.n.karp@intel.com
Carnegie Mellon University
bkarp+@cs.cmu.edu
Dawn Song
Carnegie Mellon University
dawnsong@cmu.edu
Abstract
tures,
that correspond to malicious traffic. When such mali-
cious traffic is found, the IDS may raise an alarm; block fu-
ture traffic from the offending source address; or even block
the remainder of the offending flow’s traffic. To date, to
detect and/or block Internet worm flows, IDSes use signa-
tures that match bytes from a worm’s
payload
, using match-
ing techniques including string matching at arbitrary pay-
load offsets [20, 21]; string matching at fixed payload off-
sets [21]; and even matching of regular expressions within
a flow’s payload [20].
It is natural to ask where the signature databases for
IDSes come from. To date, signatures have been generated
manually by security experts who study network traces af-
ter a new worm has been released, typically hours or days
after the fact. Motivated by the slow pace of manual sig-
nature generation, researchers have recently given attention
to
automating
the generation of signatures used by IDSes
to match worm traffic. Systems such as Honeycomb [14],
Autograph [13], and EarlyBird [22] monitor network traf-
fic to identify novel Internet worms, and produce signatures
for them using
pattern-based
analysis,
1
i.e.
, by extracting
common byte patterns across different suspicious flows.
These systems all generate signatures consisting of a
sin-
gle, contiguous substring
of a worm’s payload, of
sufficient
length
to match only the worm, and not innocuous traffic.
The shorter the byte string, the greater the probability it
will appear in some flow’s payload, regardless of whether
the flow is a worm or innocuous. Thus, these signature gen-
eration systems all make the same underlying assumptions:
that there exists a single payload substring that will remain
invariant
across worm connections, and will be sufficiently
unique to the worm that it can be used as a signature without
causing false positives.
Regrettably, the above payload invariance assumptions
are naıve, and give rise to a critical weakness in these previ-
It is widely believed that content-signature-based intru-
sion detection systems (IDSes) are easily evaded by
poly-
morphic worms,
which
vary
their payload on every infec-
tion attempt. In this paper, we present Polygraph, a sig-
nature generation system that successfully produces signa-
tures that match polymorphic worms. Polygraph gener-
ates signatures that consist of
multiple disjoint
content sub-
strings. In doing so, Polygraph leverages our insight that
for a real-world exploit to function properly, multiple
in-
variant
substrings must often be present in all variants of
a payload; these substrings typically correspond to proto-
col framing, return addresses, and in some cases, poorly
obfuscated code. We contribute a definition of the poly-
morphic signature generation problem; propose classes of
signature suited for matching polymorphic worm payloads;
and present algorithms for automatic generation of signa-
tures in these classes. Our evaluation of these algorithms on
a range of polymorphic worms demonstrates that Polygraph
produces signatures for polymorphic worms that exhibit low
false negatives and false positives.
1. Introduction and Motivation
Enabled by ever-more pervasive Internet connectivity, an
increasing variety of exploitable vulnerabilities in software,
and a lack of diversity in the software running on Internet-
attached hosts, Internet worms increasingly threaten the
availability and integrity of Internet-based services.
Toward defending against Internet worms (and other at-
tacks), the research community has proposed and built in-
trusion detection systems (IDSes) [20, 21]. A network ad-
ministrator deploys an IDS at the gateway between his edge
network and the Internet, or on an individual end host. The
IDS searches inbound traffic for known patterns, or
signa-
1
TaintCheck recently proposed a new approach,
semantic-based
auto-
matic signature generation [18]. We discuss this further in Section 8.
ously proposed signature generation systems. A worm au-
thor may craft a worm that substantially changes its payload
on every successive connection, and thus evades matching
by any single substring signature that does not also occur
in innocuous traffic.
Polymorphism
techniques
2
, through
which a program may encode and re-encode itself into suc-
cessive, different byte strings, enable production of chang-
ing worm payloads. It is pure serendipity that worm au-
thors thus far have not chosen to render worms polymor-
phic; virus authors do so routinely [17, 24]. The effort re-
quired to do so is trivial, given that libraries to render code
polymorphic are readily available [3, 10].
It would seem that given the imminent threat of polymor-
phic worms, automated signature generation, and indeed,
even filtering of worms using human-generated signatures,
are doomed to fail as worm quarantine strategies. In this
paper, we argue the contrary: that it is possible to gener-
ate signatures automatically that match the many variants of
polymorphic worms, and that offer low false positives and
low false negatives. This argument is based on a key insight
regarding the fundamental nature of polymorphic worms as
compared with that of polymorphic viruses. Polymorphic
viruses are executables stored locally on a host, invoked by
a user or application. As such, their content may be entirely
arbitrary, so long as when executed, they perform the oper-
ations desired by the author of the virus. That is, a poly-
morphic generator has free reign to obfuscate all bytes of
a virus. In sharp contrast, to execute on a vulnerable host,
a worm must exploit one or more specific server software
vulnerabilities.
In practice, we find that exploits contain
invariant
bytes
that are crucial to successfully exploiting the vulnerable
server. Such invariant bytes can include protocol framing
bytes, which must be present for the vulnerable server to
branch down the code path where a software vulnerabil-
ity exists; and the value used to overwrite a jump target
(such as a return address or function pointer) to redirect
the server’s execution. Individually, each of these invariant
byte strings may cause false positives. Thus, in our work,
we explore automatic generation of signature types that in-
corporate
multiple disjoint byte strings,
that used together,
yield low false positive rates during traffic filtering. These
signature types include conjunctions of byte strings, token
subsequences (substrings that must appear in a specified or-
der, a special case of regular expression signatures, matched
by Bro and Snort), and Bayes-scored substrings.
Our contributions in this work are as follows:
Problem definition:
We define the signature generation
problem for polymorphic worms.
Signature generation algorithms:
We present Polygraph,
a suite of novel algorithms for automatic generation of sig-
natures that match polymorphic worms.
Evaluation on real polymorphic worms:
We use several
real vulnerabilities to create polymorphic worms; run our
signature generation algorithms on workloads consisting of
samples of these worms; evaluate the quality (as measured
in false positives and false negatives) of the signatures pro-
duced by these algorithms; and evaluate the computational
cost of these signature generation algorithms.
We proceed in the remainder of the paper as follows.
In Section 2, we first provide evidence of the existence of
invariant payload bytes that cannot be rendered polymor-
phic using examples from real exploits, to motivate several
classes of signature tailored to match disjoint invariant byte
strings. We continue in Section 3 by setting the context in
which Polygraph will be used, and stating our design goals
for Polygraph. Next, in Section 4, we describe Polygraph’s
signature generation algorithms, before evaluating them in
Section 5. We discuss possible attacks against Polygraph in
Section 6; discuss our results in Section 7; review related
work in Section 8; and conclude in Section 9.
2. Polymorphic Worms:
Characteristics and
Signature Classes
To motivate Polygraph, we now consider the anatomy of
polymorphic worms. We refer to a network flow containing
a particular infection attempt as an
instance
or
sample
of a
polymorphic worm. After briefly characterizing the types
of content found in a polymorphic worm, we observe that
samples of the same worm often share some
invariant
con-
tent due to the fact that they exploit the same vulnerability.
We provide examples of real-world software vulnerabilities
that support this observation. Next, we demonstrate that
a single, contiguous byte string signature
3
cannot always
match a polymorphic worm robustly. Motivated by the in-
sufficiency of single substring signatures and the inherent
structure in many exploits, we identify a family of signa-
ture types more expressive than single substrings that better
match an exploit’s structure. While these signature types
are more complex than single substring signatures, and thus
computationally costlier to generate and match, they hold
promise for robust matching of polymorphic worms.
2.1. Exploits and Polymorphism
Within a worm sample, we identify three classes of
bytes.
Invariant bytes
are those fixed in value, which if
changed, cause an exploit no longer to function. Such bytes
3
For brevity, we hereafter refer to such signatures as
single substring
signatures.
2
Throughout this paper, we refer to both polymorphism and metamor-
phism as polymorphism, in the interest of brevity.
are useful as portions of signatures.
Wildcard bytes
are
those which may take on any value without affecting the
correct functioning of a worm—neither its exploit nor its
code. Finally,
code bytes
are the polymorphic code executed
by a worm, that are the output of a polymorphic code en-
gine. Typically, the main worm code will be encrypted un-
der a different key in each worm sample. Execution starts at
a small decryption routine, which is obfuscated differently
in each worm sample. The degree of variation in code bytes
from worm sample to worm sample depends on the quality
of the polymorphic obfuscator used—a poor polymorphic
obfuscater may leave long regions of bytes unchanged be-
tween the code instances it outputs, whereas a more aggres-
sive one may leave nearly no multi-byte regions in common
across its outputs. In this work, we do not depend on weak-
nesses of current code obfuscators to be able to generate
quality signatures. Instead, we render worms to be
perfectly
polymorphic
, by filling in code bytes with values chosen
uniformly at random. We will also show that the current
generation of polymorphic obfuscators actually do produce
invariant byte sequences in their output, which means that
we should be able to generate even higher quality signatures
for worms that use these real-world code obfuscators.
force a jump to some specific point in library code. Such
exploits typically must include an address from some small
set of narrow ranges in the request. In attacks that redirect
execution to injected code, the overwritten address must
point at or near the beginning of the injected code, mean-
ing that the high-order bytes of the overwritten address are
typically invariant. A previous study of exploits contains a
similar observation [19]. Attacks that redirect execution to
a library also typically select from a small set of candidate
jump targets. For example, CodeRed causes the server to
jump to an address in a common Windows DLL that con-
tains the instruction
call ebx
. For this technique to be
stable, the address used for this purpose must work for a
range of Windows versions. According to the Metasploit
op-code database, there are only six addresses that would
work across Windows 2000 service packs zero and one [4].
2.3. Examples: Invariant Content in Polymorphic
Worms
We manually identified the invariant content for exploits
of a range of vulnerabilities by analyzing server source code
(when available), and by studying how current exploits for
the vulnerabilities work. We now present six of the vulner-
abilities and exploits that we studied to illustrate the exis-
tence of invariant content in polymorphic worms, even with
an ideal polymorphic engine. We also present our analysis
of the output of one of the polymorphic generators, to show
how close the current generators are to the ideal.
Apache multiple-host-header vulnerability
First, we con-
sider the hypothetical payload of a polymorphic worm
structured like the payload of the Apache-Knacker ex-
ploit [9], shown in Figure 1. This exploit consists of a
GET
request containing multiple
Host
headers. The server con-
catenates the two
Host
fields into one buffer, leading to an
overflow. This exploit contains several invariant protocol
framing strings: “GET”, “HTTP/1.1”, and “Host:” twice.
The second
Host
field also contains an invariant value used
to overwrite the return address.
BIND TSIG vulnerability
Next, we consider the Lion
worm [5]. We constructed a polymorphic version of the
Lion worm, shown in Figure 2. The Lion worm payload
is a DNS request, and begins with the usual DNS proto-
col header and record counts, all of which may be varied
considerably across payloads, and are thus wildcard bytes;
only a single bit in the header must be held invariant for
the exploit to function—the bit indicating that the packet
is a request, rather than a response. Next come two ques-
tion entries. The second contains an invariant value used
to overwrite a return address (also encoded in a QNAME).
Finally, to take the vulnerable code path in the server, the
exploit payload must include an Additional record of type
TSIG; this requirement results in three contiguous invariant
2.2. Invariant Content in Polymorphic Exploits
If a vulnerability requires that a successful exploit con-
tain invariant content, that content holds promise for use
in signatures that can match all variants of a polymorphic
worm. But to what extent do real vulnerabilities have this
property? We surveyed over fifteen known software vul-
nerabilities, spanning a diverse set of operating systems and
applications, and found that
nearly all
require invariant con-
tent in any exploit that can succeed. We stress that we do
not claim all vulnerabilities share this property—only that
a significant fraction do. We now describe the two chief
sources of invariant content we unearthed: exploit framing
and exploit payload.
Invariant Exploit Framing
A software vulnerability ex-
ists at some particular code site, along a code path exe-
cuted upon receiving a request from the network. In many
cases, the code path to a vulnerability contains branches
whose outcome depends on the content of the received re-
quest; these branches typically correspond to parsing of the
request, in accordance with a specific protocol. Thus, an
exploit typically includes
invariant framing
(
e.g.,
reserved
keywords or well known binary constants that are part of
a wire protocol) essential to exploiting a vulnerability suc-
cessfully.
Invariant Overwrite Values
Exploits typically alter the
control flow of the victim program by overwriting a jump
target in memory with a value provided in the exploit, ei-
ther to force a jump to injected code in the payload, or to
vulnerable code path. It uses a buffer overrun to overwrite
a return address with a pointer to a
call esp
instruction
contained in a common Windows DLL. There are only a
small number of such values that work across multiple win-
dows versions.
CodeRed
The CodeRed [6] exploit takes advantage of a
buffer overflow when converting ASCII to Unicode. The
exploit must be a
GET
request for a
.ida
file. The
value used to overwrite the return address must appear
later in the URL. CodeRed overwrites the return address
to point to
call esp
. There are only a small number of
such pointers that will work across multiple Windows ver-
sions. Hence, the exploit must contain the invariant proto-
col framing string “GET”, followed by “.ida?”, followed by
a pointer to
call esp
.
AdmWorm
The AdmWorm [7] exploits BIND via a buffer
overrun. Unlike the other exploits described here, there are
no
invariant protocol framing bytes in this exploit. How-
ever, there is still an invariant value used to overwrite a re-
turn address.
Decryption
Routine
Decryption
Key
Encrypted
Payload
NOP
Slide
Return
Address
Random
Headers
Payload
Part 1
Random
Headers
Payload
Part 2
Random
Headers
GET
URL
HTTP/1.1
Host:
Host:
Figure 1. Polymorphed Apache-Knacker ex-
ploit. Unshaded content represents wild-
card bytes; lightly shaded content represents
code bytes; heavily shaded content repre-
sents invariant bytes.
QNAME
[Return
NAME
TYPE
[TSIG]
QTYPE QCLASS
CLASS
[0x00]
Address]
Question 1
[Obfuscated
Shellcode]
DNS
Header
Record
Counts
Additional
Question 2
Record
eb 2d 59 31
d2 b2 20 8b 19 c1 c3 0e
Figure 2. BIND TSIG vulnerability,
as ex-
ploited by the Lion worm.
Shading as for
81 f3 81 68 44 b3 c1 c3 0a c1 c3 19 c1 c3
11 89 19 81 e9 ff ff ff ff 41 41 41 80 ea
02 4a 4a 74 07 eb d8 e8 ce ff ff ff 0b
Apache vulnerability.
bytes near the end of the payload.
Slapper
The Slapper worm [1] exploits a heap buffer over-
run vulnerability in Apache’s mod ssh module. Note that
the attack takes place during the initial handshake, meaning
that it is not encrypted. It is a two-part attack; It first uses
the overrun to overwrite a variable containing the session-id
length, causing the server to leak pointer values. This part
must contain the normal protocol framing of a
client-
hello
message, as well as the value used to overwrite the
variable (0x70).
In the second part of the attack, another session is
opened, and the same buffer is overrun. This time, the
leaked data is patched in, allowing the exploit to perform
a longer buffer overrun while still not causing the server to
crash. The heap metadata is overwritten in such a way as to
later cause the GOT entry of
free
to be overwritten with
a pointer to the attacker’s code, placed previously on the
heap. Thus, there is an invariant overwrite value that points
to the attacker’s code, and another that points to the GOT
entry for
free
. An aggressively polymorphic worm may
try to target other GOT entries or function pointers as well.
However, there will still only be a relatively small number
of values that will work.
SQLSlammer
The SQLSlammer [2] exploit must begin
with the invariant framing byte 0x04 in order to trigger the
Figure 3. Output by Clet polymorphic engine
includes invariant substrings. Boxed bytes
are found in at least 20% of Clet’s outputs;
shaded bytes are found in all of Clet’s out-
puts.
Clet polymorphic engine
Figure 3 shows a sample output
by the Clet polymorphic code engine [10].
4
The output con-
sists of encrypted code, which is completely different each
time, and a decryption routine that is obfuscated differently
each time. In order to determine how effective the Clet ob-
fuscation is, we generated 100 Clet outputs for the same
input code, and counted substrings of all lengths in com-
mon among the decryption routines in these 100 outputs.
Strings that were present in
all
100 outputs appear with
shaded backgrounds; those that were present in at least 20
outputs, but fewer than all 100, appear boxed. Clearly, Clet
produces substrings that are entirely invariant across pay-
loads, and other substrings that occur in a substantial frac-
tion of payloads. However, upon examining the Clet source
4
We also evaluated the ADMmutate [3] polymorphic engine. We
present Clet as the more pessimal case, as it produced less invariant content
than the ADMmutate engine.
code, it seems likely that the obfuscation engine could be
improved significantly, reducing the number of substrings
in common between Clet outputs.
address, and TSIG identifier, two and three bytes long, re-
spectively, are too short to be specific to the Lion worm.
As we show in our evaluation in Section 5, we found false
positives when searching for those substrings in DNS traf-
fic traces from a busy DNS server that is a nameserver for
top-level country code domains.
We conclude that single substring signatures cannot
match polymorphic worms with low false positives and low
false negatives.
2.4. Substring Signatures Insufficient
As described previously, the pattern-based signature
generation systems proposed to date [14, 13, 22] gener-
ate single substring signatures, found either in reassembled
flow payloads, or individual packet payloads. These sys-
tems thus make two assumptions about worm traffic:
2.5. Signature Classes for Polymorphic Worms
A single invariant substring exists across payload in-
stances for the same worm; that is, the substring is
sensitive
, in that it will match all worm instances.
5
Motivated by the insufficiency of single substring sig-
natures for matching polymorphic worms robustly, we now
propose other signature classes that hold promise for match-
ing the particular invariant exploit framing and payload
structures described in this section. All these signatures are
built from substrings, or
tokens
. The signature classes we
investigate in detail in Section 4 include:
Conjunction signatures
A signature that consists of a set
of tokens, and matches a payload if
all
tokens in the set are
found in it, in any order. This signature type can match the
multiple invariant tokens present in a polymorphic worm’s
payload, and matching multiple tokens is more specific than
matching one of those tokens alone.
Token-subsequence signatures
A signature that consists
of an
ordered
set of tokens. A flow matches a token-
subsequence signature if and only if the flow contains the
sequence of tokens in the signature with the same order-
ing. Signatures of this type can easily be expressed as
regular expressions, allowing them to be used in current
IDSes [20, 21]. For the same set of tokens, a token sub-
sequence signature will be more specific than a conjunc-
tion signature, as the former makes an ordering constraint,
while the latter makes none. Framing often exhibits order-
ing;
e.g.
the TSIG record in the Lion worm, which must
come
last
in the payload for the exploit to succeed, and the
return address, which must therefore come before it.
Bayes signatures
A signature that consists of a set of to-
kens, each of which is associated with a
score
, and an over-
all
threshold.
In contrast with the exact matching offered by
conjunction signatures and token-subsequence signatures,
Bayes signatures provide probabilistic matching—given a
flow, we compute the probability that the flow is a worm us-
ing the scores of the tokens present in the flow. If the result-
ing probability is over the threshold, we classify the flow to
be a worm. Construction and matching of Bayes signatures
is less
rigid
than for conjunction or token-subsequence sig-
natures. This provides several advantages. It allows Bayes
signatures to be learned from suspicious flow pools that
contain samples from unrelated worms, and even innocu-
ous network requests.
The invariant substring is sufficiently long to be
spe-
cific
; that is, the substring does not occur in any non-
worm payloads destined for the same IP protocol and
port.
Can a sensitive and specific single substring signature
be found in the example payloads in the Apache and DNS
exploits described in Section 2.3? Consider the Apache ex-
ploit. The unshaded bytes are wildcards, and cannot be re-
lied upon to provide invariant content; note that even the
NOP slide can contain significantly varying bytes across
payloads, as many instruction sequences effectively may
serve as NOPs. If we assume a strong code obfuscator, we
cannot rely on there being an invariant substring longer than
two bytes long in the obfuscated decryption routine, shown
with light shading. The only invariant bytes are the heavily
shaded ones, which are pieces of HTTP protocol framing,
and a return address (or perhaps a two-byte prefix of the
return address, if the worm is free to position its code any-
where within a 64K memory region). Clearly, the HTTP
protocol framing substrings individually will not be spe-
cific, as they can occur in both innocuous and worm HTTP
flows. By itself, even the two-to-four-byte return address
present in the payload is not sufficiently specific to avoid
false positives; consider that a single binary substring of
that length may trivially occur in an HTTP upload request.
As we show in our evaluation in Section 5, we have exper-
imentally verified exactly this phenomenon; we have found
return address bytes from real worm payloads in innocuous
flows in HTTP request traces taken from the DMZ of Intel
Research Pittsburgh.
The Lion worm presents a similar story: the heavily
shaded invariant bytes, the high-order bytes of the return
5
It is possible that a worm’s content varies only very slightly across in-
stances, and that at least one of a small, constant-cardinality set of substring
signatures matches all worm instances. We view this case as qualitatively
the same as that where worm content is invariant, and focus our attention
herein on worms whose content varies to a much greater extent, such that
a small set of substring signatures does not suffice to match all variants.
(We show that the other signature
[ Pobierz całość w formacie PDF ]

  • zanotowane.pl
  • doc.pisz.pl
  • pdf.pisz.pl
  • jaczytam.htw.pl