Internet Systems Consortium’s SIE & Google Protobufs

>> …and SIE as a security information exchange, which is kind of a more or less free-ish way (we'll get to Paul and Jim later) to get access to a bunch of data, which includes things like passive DNS and spam feeds and things like this. Eric Ziegast is an ISC…
>>ZIEGAST: I'm the program manager for SIE.
>>[INDISTINCT] job.
>>Okay, good. And Paul Vixie is, you know, one of the ISC founder folk and [INDISTINCT].
>>[INDISTINCT].
>>VIXIE: You mentioned.
>>ZIEGAST: Okay, all right. So, SIE, Security Information Exchange. People are familiar with different kinds of exchanges. You have exchanges of money, like NASDAQ, and you have Internet exchanges (Paul actually founded one of them, the Palo Alto Internet Exchange) where providers need to exchange Internet traffic, and they may exchange pieces of copper, or fiber these days, between all of the telco gear. Back in 2007, Paul Vixie and David Dagon had this idea that there are a lot of pools of data out there that a lot of people don't have access to, and we need to find a way to get it all together.
So, one of the things that we're doing: because this is security data, people don't like sharing it, because they can get in trouble for it, either because they're snooping on their customers or because it can cause harm to people just from the fact that someone knows something they shouldn't have. So what we've created is a legal and privacy framework, which is basically a contract and a bunch of privacy directives that say, "All right, everyone is in this together; you can share stuff freely within the infrastructure, but you don't take the raw stuff out. It has to go through lots of processing before you can take stuff out of there." The legal document keeps everyone honest.
Another reason we're here is to centralize the data collection. When you have pools of data that are all over the Internet, it's hard to do any cross-correlation. You don't have any standard way of sharing the data; they're all in different formats. If you can get all the formats the same and get all the data together in one datacenter, or within a framework of datacenters all connected to each other, there's a better chance that you're actually going to be able to do some cross-analysis, which is one of the things we're trying to do. You may have some passive DNS, you may have some NetFlow, you may have some darknet data or whatever, but if that's all you have, you're not going to be able to do much with it. If you can combine some of that, you're going to be able to find out a lot more, and find it out a lot more quickly, than if you don't have access to it at all.
One of the reasons we're doing this is to create a network effect between the security researchers. One model suggests it's like stone soup, if you've ever heard of that parable. We're bringing the pot and the stone, which is our infrastructure, our network, and the tools that we have, and then a lot of people are bringing their carrots and onions and potatoes to make it taste better. The more people add stuff, the better the soup gets, and eventually you can actually do some really effective work with it.
Typically these days you have relationships between the various participants: businesses, ISPs, law enforcement. The people at the top typically have some of the data, or the victims, and the people at the bottom usually have the ability to do something with it, and they all have their own independent non-disclosure agreements, contracts, whatever they need to keep everything private and functional and keep it going. That's a lot of paperwork and a lot of trust that has to be built up between people, and it's inefficient. So one of the things we're doing with SIE is to create efficient sharing within a common legal and privacy framework. People can bring their data into SIE, which acts as a sort of clearinghouse, a place where it's all available and you can share freely within there, and you sign a single agreement that everyone else signs and everyone can work within SIE. We're not going to replace everyone's sharing; we're just helping to enable things which might be inefficient right now.
For the infrastructure, we typically have a bunch of sensor operators out there. It might be something sniffing packets off the wire, it might be something attached to your mail server that the spam flows into, or maybe you have a web crawler and you're going out searching. They basically create packetized bundles, and those all get uploaded via rsync and some scripts to our redundant servers, so that we can broadcast them onto an Ethernet infrastructure within the datacenter. Inside of that broadcast infrastructure are a bunch of researchers who bring their own machines and hopefully some of their own data to compare with. As we build out, we're going to build a node on the East Coast and relay data between the nodes. Additionally, each of the researchers will be able to talk to each other over a private network. As we add other areas, for example going to Europe or Asia or wherever, we may have a relay where people can upload stuff, and at some point it may get promoted so that it can talk to all of the rest on an equal basis. We basically take any of the unique data that's coming from one area and pass it to the others so they can see it. It may not be all of the data; there's a lot of data and a lot of it doesn't need to get shared, but maybe some of the aggregate stuff will get shared between the nodes.
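As a rough illustration of the sensor side of that pipeline, the following sketch shows the kind of upload script a sensor operator might run. The hostnames, paths, and use of rsync over SSH are assumptions for illustration, not ISC's actual scripts.

```python
#!/usr/bin/env python
"""Hypothetical sensor upload script (illustrative, not ISC's code): push
finished capture bundles to redundant collection servers via rsync."""

import glob
import os
import subprocess

SPOOL_DIR = "/var/spool/sensor/outgoing"                          # assumed local bundle directory
UPLOAD_SERVERS = ["upload1.example.net", "upload2.example.net"]   # hypothetical redundant servers

def upload_bundles():
    bundles = sorted(glob.glob(os.path.join(SPOOL_DIR, "*.ncap")))
    if not bundles:
        return
    for server in UPLOAD_SERVERS:
        # rsync only transfers what the server does not already have,
        # so uploading to both redundant servers is cheap.
        subprocess.run(
            ["rsync", "-az", "--partial", *bundles, f"sensor@{server}:incoming/"],
            check=True,
        )
    # Remove bundles only after every server has a copy.
    for path in bundles:
        os.unlink(path)

if __name__ == "__main__":
    upload_bundles()
```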
Here are some of the other types of data that are out there. We started with passive DNS and DNS blacklist data, and we have some darknet and NetFlow, but there are a lot of other types of data which aren't necessarily conducive to plain old packet captures, and we had to invent something so that we can describe these various types of data and share them efficiently on the broadcast network.
The first thing we worked with was passive DNS. Florian Weimer of Germany pioneered the capturing, and we modified some tools so we can do a better job of collecting it on the name servers, to collect more of the data. Typically, passive DNS, at least the way that we collect it, works like this: you have a name server, and your clients all go up through a recursive name server, a caching server, to find out the names that you're looking up. If you're looking up www.google.com, well, the client is not going to talk to a Google name server; it's going to talk to the recursor at their ISP, and that recursor sends requests out to the .com name servers and the google.com name servers, and they'll feed the data back to the recursive server. Once the answer is found, the recursive server returns it to the client. The position where we listen is on the downward arrows, the responses going back to the recursive name server. There is a great benefit to that, in that it helps preserve the privacy of the clients, because it's the recursive name server making the queries at that level, not the clients. So if you have a large population, say a thousand people or a million people, you won't necessarily know who's making the query, but you at least find out the information that's out there. That's part of the goal: to get a better map of what's out there, as far as IP addresses mapping to names and names mapping to name servers, a map that you would not normally see if you didn't have these sensors out there.
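As a rough illustration of what a sensor at that position does, the following sketch sniffs authoritative responses on their way back to a recursive name server and prints the answer records. It uses Scapy only for brevity; the library choice, the recursor address, and the filter are assumptions for illustration, not the modified dnscap/ncap sensors ISC actually deploys.

```python
"""Illustrative passive DNS sensor sketch (not ISC's sensor code): capture
DNS responses arriving at a recursive name server and print the observed
name/type/rdata tuples, which are the raw material of passive DNS."""

from scapy.all import sniff, DNS, DNSRR, IP  # assumes scapy is installed

RECURSOR_IP = "192.0.2.53"  # hypothetical address of the local recursive server

def handle(pkt):
    # Only responses (qr == 1) headed back to the recursor, i.e. the
    # "downward arrows" from the authoritative servers.
    if not (pkt.haslayer(DNS) and pkt[DNS].qr == 1 and pkt[IP].dst == RECURSOR_IP):
        return
    rr = pkt[DNS].an
    while isinstance(rr, DNSRR):
        print(pkt[IP].src, rr.rrname, rr.type, rr.rdata)
        rr = rr.payload if isinstance(rr.payload, DNSRR) else None

# Clients' own queries to the recursor are never observed, which is what
# preserves their privacy; only server-to-server answers are captured.
sniff(filter=f"udp and src port 53 and dst host {RECURSOR_IP}", prn=handle, store=0)
```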
So we asked a bunch of ISPs and universities and friends to donate some data, and we're very appreciative of that. We have been building up and trying to get more data and more data types from different sources. But the main way to do DNS data collection back then was tcpdump or dnscap. There are some inadequacies in those programs; they couldn't capture everything, so we created a new program called ncap and ncaptool and added a bunch of features so that you could replicate the passive DNS data onto the broadcast infrastructure that we're setting up. We added some features for doing plug-ins, which can do filtering on the data as it passes through. You can filter out the things that you don't need to see, so that you can spend more time processing the data that you do want. For the other data types, for example spam or link pairs or malware or whatever, we created NMSG, and we'll get to that later.
Also, to enable some collaboration between the researchers, we set up a VPN between our sites so that researchers at one site can talk to another using unicast: basically your typical access to a web server or a database, or whois or DNS lookups or whatever, but it's all kept private within the framework. Some of the hardware that we need for this has to handle high packet rates; we need a fast switch that won't drop packets. The servers will typically be 64-bit with a lot of RAM and storage. If all you're doing is logging to disk, disk is fine, but if you want to do anything active with the data, you're going to need SSD or a lot more RAM, because there are high packet rates for a lot of what we're dealing with.
Ncaptool: here are some of the things that we did to improve upon pcap or dnscap. With larger DNS packets these days you actually have fragments; what used to fit in 512 bytes doesn't fit anymore, especially if we're doing things like DNSSEC and increasing the number of servers and the amount of data that's coming back with each of these requests. So we need to be able to reassemble the packets, and ncap does that automatically, whereas you might actually miss that data if you were using just pcap. We drop the link-layer info; we don't need to carry around the Ethernet MAC address, we're really just interested in layer three and above. We normalized the network format, so whether we collect on a Sun, on OpenBSD on a PC, or on an HP running a RISC chip, it's all just network byte order. Nanosecond timestamps instead of microsecond, and then we added some user-defined flags so we can actually track which sensor gave us the data.
What's key to SIE is the fact that we have this common infrastructure where everyone can listen to an Ethernet bus of the same data. So, when it comes into one of our nodes, we broadcast it on a local area network, on a VLAN, and everyone who's on that VLAN gets that packet at the same time. We're not sending a separate copy to each researcher; we're actually just broadcasting it, and that makes for a lot of efficiencies. We also need to be able to take data from files or put it out to files, and we can do all sorts of [INDISTINCT] with the packets; we can ship them around in many different ways.
One of the biggest benefits we had with ncap is when we started making modules to do deduplication, which is very necessary, and pattern matching against internal database lookups, like when you want to match what you're seeing off the wire against an internal table of something that you already know; that was really important. Typically when people are doing security data gathering, they'll put everything into a database and it may not scale. At some point they become disk-bound, and we need to be able to keep the information flowing in real time, not trapped in a database which will eventually slow down. And we can't just log the data, because that's really not useful; people need to create real-time tools to be able to analyze it on the fly.
What we ended up doing is we built what is called a loosely coupled multiprocessor, where one machine would start out with the data and broadcast it onto the network, and another machine would do some processing on it. Once that machine does its processing, it broadcasts the result back out onto the network, and then a bunch of other machines that are interested in that can do further processing. So you have a whole bunch of machines all together on the same broadcast network, and they're all doing the processing in real time. I have a diagram describing that more.
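As a rough sketch of that loosely coupled multiprocessor pattern, one stage might look like the following: listen on one broadcast channel, transform each datagram, and rebroadcast the result on another channel for the next machines to consume. The addresses, ports, and the trivial "processing" step are hypothetical, purely to show the shape of a stage.

```python
"""Illustrative 'loosely coupled multiprocessor' stage (not SIE's code):
read datagrams from one UDP broadcast channel, process them, and rebroadcast
the results on another channel for downstream machines to pick up."""

import socket

IN_PORT = 8430    # hypothetical input broadcast channel
OUT_PORT = 8431   # hypothetical output broadcast channel
BCAST = "255.255.255.255"

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
rx.bind(("", IN_PORT))

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)

while True:
    datagram, _ = rx.recvfrom(9000)      # sized for jumbo frames
    result = datagram.upper()            # stand-in for real work: dedup, filtering, matching
    tx.sendto(result, (BCAST, OUT_PORT)) # every machine on the bus sees the result at once
```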
We partition our data, the various data types, onto different VLANs: you've got raw passive DNS on one, deduplicated passive DNS on another, some NetFlow on one, some spam on another, and then you can choose which VLANs or channels you want to subscribe to, to cut down on the overhead of what you have to process.
So this is the typical use of a filter, with DNS coming in on the left. This is the raw passive DNS data that's coming in from the sensors at a very high rate of speed. You have a program which runs on a server completely out of RAM and does a deduplication of what's in there. You may have a hundred people looking up www.google.com, but you don't really care about that; you're really interested in the fact that www.google.com is out there and here's the information that came with it. So deduplication takes it down to a reasonable level where people can actually do the processing on it. And then you can do some additional filtering, like we have something that helps detect fast flux. Fast flux, in one example, would be when your name servers keep changing their IP addresses; that's very useful for helping to detect botnets, because that is one of the behaviors that they use.
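A minimal sketch of that kind of in-RAM deduplication stage, with the roughly four-hour rollover mentioned below, might look like the following. The record key and the window length are assumptions for illustration, not the actual SIE filter.

```python
"""Illustrative passive DNS deduplication filter (not SIE's actual code).
Keeps a table of (name, type, rdata) tuples in RAM and only passes a record
through the first time it is seen within the current window; the table is
cleared periodically so fresh copies of still-active records flow again."""

import time

WINDOW_SECONDS = 4 * 3600   # assumed rollover interval ("about every four hours")

class Deduplicator:
    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.seen = set()
        self.window_start = time.time()

    def accept(self, name, rtype, rdata):
        now = time.time()
        if now - self.window_start > self.window:
            self.seen.clear()            # roll over for fresh data
            self.window_start = now
        key = (name, rtype, rdata)
        if key in self.seen:
            return False                 # duplicate within this window: drop it
        self.seen.add(key)
        return True                      # first sighting: pass it downstream

# Usage: only records for which dedup.accept(...) returns True are
# rebroadcast on the deduplicated channel.
dedup = Deduplicator()
```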
So, here's a graph from last night where we get about 40,000 packets per second, which is somewhere around 80 to 100 megabits of nothing but DNS packets. To do any real analysis with that, unless you've got 20 or 40 servers that can split up and each take a part of that feed, it's really going to be inefficient to process. So here's the deduplication part in the graph; if you just look at the numbers, now you're down to about 5,000 packets per second, which is a little easier to process. We're getting about an eight-to-one benefit out of our deduplication, and it lasts about four hours; it'll roll over for fresh data, so that gives you an idea of how it works. And then once we're just finding fast flux, well, you're down to about two to four packets per second of just the things that are changing, the fast flux. We can actually just watch that stuff scroll by on the screen and use human intelligence to figure stuff out. If you have a bank phishing site and it's changing its name servers, you just see it scroll by on the screen and say, "Hey, there's a domain I'm interested in."
The concept of a loosely coupled multiprocessor is very important. Dave Boggs was doing this at Xerox PARC and DEC back in the '80s, and he co-invented Ethernet. A lot of people were just using it as a way to get packets from A to B, but he started using it as an actual broadcast medium, so you can take one piece of data and broadcast it to multiple servers efficiently. Another thing we're doing is leaning toward real-time analysis. A lot of people in the research field these days build big databases and run queries against them. Well, if you do that and it's taking too long, the bad guys are moving on after a few hours, and if it's going to take you a whole day to figure something out, you're pretty much losing. So if you can find ways to keep the stuff you already know close at hand and compare it with what's on the feed, you can actually give yourself an advantage. Do you want to say anything more about…
>>Some of this isn't new; it's what's old is new again. This kind of thing used to work when the pace of the real world was slow enough, or computers were fast enough, that you could put in database triggers so that everything got dumped to the database, and then when certain lines were crossed you would learn, "Okay, you just tried to put something in the database that caused the following exceptional behavior; you should analyze this." And certainly, in the case of SQL, there's no way you're going to keep up with even 5,000 per second, let alone hundreds of thousands of things per second, with triggers [INDISTINCT], and all the SSD in the world isn't going to help you do that. So I will say that, to me, the money bullet point in this presentation is that the security community has gotten into the habit of storing pcap files or storing things in database tables and then having [INDISTINCT] that go look for things, and we are not keeping up. The bad guys are winning. I'm tired of that, so the idea of teaching people once again how to look at things in real time, and look for correlations in real time, was at the heart of this project, originally.
>>ZIEGAST: There is some benefit to analysis in arrears, in that we can perhaps go back and look for things that would help you change the patterns you're looking for in real time, much like a stock analyst would look at the historical trends of a stock to try to project what's going on in the future. But at some point, you have to have something automated that's pulling that trigger for you to buy or sell, and that's perhaps the kind of analysis we'd be doing for Internet security data.
So, here are some of the things that we've got on SIE right now. Raw passive DNS is what comes in from the sensors, and then we filter it; fast flux is one example. We do some comparisons against things like CBL or SORBS, the Spamhaus CBL. That's interesting because you may know the IP address of something that's bad, but you don't necessarily know all the names it's using. So, using this, you can actually gather, in real time, all the names that are being used by a particular machine, or that people are looking up against it, and do some cross-correlation and preventive work. We gather some DNS queries from some DNS blacklists and from dynamic DNS providers. Some top-level domains; we put isc.org in there. AS112 is a project for all the stuff that shouldn't leak out as DNS lookups, someone looking up 10.in-addr.arpa, for example, or something within there; that's basically misconfigured networks or name servers out there. Some of the root servers which no longer operate still have an IP address that people are querying, so we're gathering a little bit of that too. We don't gather root server data as such; we operate our own root server, but we actually have a firewall between us and it, and a lot of that analysis actually goes out to DNS-OARC.
So, ncap was great for packet data, but there's more that we need to capture. We need to be extensible; that was maybe one mistake we made with ncap: when we set out, we didn't put version numbers in there. We need to be able to create new formats as new data becomes available. It needs to be fast and scalable. If you're looking at a way to describe this stuff, you can imagine, oh, let's just use XML; well, that doesn't work very well. That's part of why we're here at Google today, because you guys had something that was very applicable. It needs to be fast. It needs to keep all these features from ncap for working with our infrastructure. And we also need filtering methods that we can plug in and developers can use, so that we can keep up with the times. We gave a lot of this to Robert, and Robert pretty much ran with it; he created NMSG. So now we're going to switch over to Robert. All right. Do you want to take my yellow cable?
>>EDMONDS: Yeah, let's switch cables.
>>ZIEGAST: I was doing 800×600, but we'll see, whatever you have works.
>>EDMONDS: So, something Eric reminded me of, this is from RFC 1034: "The sheer size of the database and frequency of updates suggest that it must be maintained in a distributed manner… Approaches that attempt to collect a consistent copy of the entire database will become more and more expensive and difficult, and hence should be avoided." That was in 1987. Well, unfortunately, we like doing expensive and difficult things.
>>Oh, he was talking about the HOSTS.TXT stuff there.
>>EDMONDS: Yeah. Well, this applies, in fact, to all of the hostnames. Let's see here. So, we have this NMSG file format. It's just the successor to the NCAP packet capture format, and the idea is that we're not only capturing packet information but also things that are not necessarily best represented as packets or datagrams on a wire. The idea is, we don't know what types of information we're going to store, so we should make it store opaque blobs of information, and perhaps, at run time, we will load a module and be able to learn how to interpret that blob.
So, we have blobs on the order of 10 to 10,000 kilobytes in length. We probably don't want to optimize for the transmission of DVD ISOs over a UDP broadcast network; we're interested in things like DNS and email and HTTP, things of that order of size. We decided to optimize for UDP over jumbo frame Ethernet, in order to minimize the number of socket receives that have to be done for a particular quantum of data. And it turns out that Google has a ready-made encoding format called Protocol Buffers. It's essentially an extensible binary wire format for encoding fields of data of primitive types: integers, floats, binaries. Unfortunately, Protocol Buffers are not self-delimiting and they're not self-describing, so we had to add some additional framing and some additional intelligence in order to be able to use them over our UDP broadcast medium.
For the protocol engineers, here's essentially a description of the protocol: there's a constant-length header part and a variable-length part which can encode one or more payloads. The NCAP format captures one packet and represents one packet when it's rebroadcast, but since we're batching, or buffering, the data, we can pack more than one payload into a jumbo Ethernet frame. The average DNS packet is probably less than 512 bytes, so we're going to fit perhaps 16 of those into a jumbo frame Ethernet packet, or even more, so why not minimize the number of socket calls, system calls, that you have to perform in order to read that data off the network.
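To illustrate that batching idea, here is a minimal sketch of packing several small payloads into one jumbo-frame-sized datagram so the receiver gets many messages per socket read. The simple length-prefixed packing shown here is an assumption for illustration, not the actual NMSG container encoding.

```python
"""Illustrative payload batching (not the real NMSG container format):
pack several small payloads into one jumbo-frame-sized datagram so that a
receiver gets many messages per recvfrom() call."""

import struct

JUMBO_BUDGET = 8832  # assumed usable payload size under a 9000-byte jumbo frame

def pack_batch(payloads, limit=JUMBO_BUDGET):
    """Yield datagrams, each containing as many length-prefixed payloads as fit."""
    batch, used = [], 0
    for p in payloads:
        item = struct.pack("!I", len(p)) + p        # 4-byte big-endian length prefix
        if used + len(item) > limit and batch:
            yield b"".join(batch)
            batch, used = [], 0
        batch.append(item)
        used += len(item)
    if batch:
        yield b"".join(batch)

def unpack_batch(datagram):
    """Split a received datagram back into its individual payloads."""
    offset, out = 0, []
    while offset < len(datagram):
        (length,) = struct.unpack_from("!I", datagram, offset)
        offset += 4
        out.append(datagram[offset:offset + length])
        offset += length
    return out

# A few hundred sub-512-byte DNS payloads collapse into a handful of datagrams,
# so the consumer performs far fewer system calls per message processed.
```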
What if your payload is larger than your jumbo Ethernet frame? We should be able to fragment it; I'm sure you've seen spam emails that are longer than eight kilobytes. We want to avoid having a truncation flag that indicates "this blob has been truncated" and making you deal with that, when we can just fragment a payload and have the receiver reassemble it and pass the reassembled payload to the client application. And moving on to the [INDISTINCT]: there's a four-byte magic value at the beginning of the frame or the beginning of the file buffer. There's a flags octet, with a bit that means fragment and a bit that means compressed. There's a version octet (the current version is 2), and a length field that can represent up to about four gigabytes of payload; we haven't come anywhere near needing that. For fragmentation, there's a bit that says this has been fragmented into multiple frames and the receiver has to reassemble it. We don't want to rely on IP fragmentation, because that limits us to 64 kilobytes, so we do the fragmentation, or segmentation, in the application layer, much like TCP. And there's a bit that says the data is compressed, so we can fit even more DNS data or email data into a given Ethernet frame. If you see both of those bits set, the payload was compressed and then fragmented; doing it in the other order is problematic, because you won't fill the frames as efficiently if you fragment first and then compress.
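Here is a minimal sketch of that compress-then-fragment ordering over an application-layer frame. The header layout (magic, flags, version, sequence counts) only loosely follows the description above and is not the actual NMSG wire format.

```python
"""Illustrative compress-then-fragment framing (not the real NMSG layout).
The payload is compressed first, then split into frame-sized pieces, each
carrying a small header with flag bits for 'fragment' and 'compressed'."""

import struct
import zlib

MAGIC = b"NMSG"          # hypothetical 4-byte magic value
VERSION = 2
FLAG_FRAGMENT = 0x01
FLAG_COMPRESSED = 0x02
MAX_DATA = 8000          # assumed per-frame data budget under a jumbo MTU
HDR = "!BBHHI"           # flags, version, sequence, total fragments, data length

def frame(payload):
    """Compress, then fragment: yields wire-ready frames."""
    data = zlib.compress(payload)
    flags = FLAG_COMPRESSED
    pieces = [data[i:i + MAX_DATA] for i in range(0, len(data), MAX_DATA)] or [b""]
    if len(pieces) > 1:
        flags |= FLAG_FRAGMENT
    for seq, piece in enumerate(pieces):
        yield MAGIC + struct.pack(HDR, flags, VERSION, seq, len(pieces), len(piece)) + piece

def deframe(frames):
    """Reassemble in-order fragments, then decompress: the receiver's inverse."""
    skip = len(MAGIC) + struct.calcsize(HDR)
    return zlib.decompress(b"".join(f[skip:] for f in frames))

# Compressing before fragmenting fills each frame with already-compressed bytes;
# fragmenting first and compressing each piece separately wastes frame space.
```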
So the payload header: this is the variable-length part, and it is now encoded using Google Protocol Buffers. There is a vendor ID and a message type. The message types are per vendor, so if you want to create your own payload message types, we will assign you a vendor ID and you can assign whatever message type values you want. There's a timestamp, 64 bits of seconds plus 32 bits of nanoseconds, so you get a nanosecond-precision timestamp. Then we have a few optional fields for classification: source, operator, and group, so cooperating senders and receivers can agree on how to classify the data. And then there's the payload itself, which is the opaque blob of information. Each (vendor ID, message type) tuple identifies a particular unique type of message. We don't necessarily require that the blobs be encoded with Protocol Buffers, but they frequently are, and we optimize for that particular case.
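As a rough model of those payload header fields, a receiver might represent a decoded payload like this. The field names are taken from the description above, but the class itself is an illustration, not the generated Protocol Buffers code.

```python
"""Illustrative model of a decoded NMSG payload header (not the generated
Protocol Buffers class): the (vendor, msgtype) pair identifies the blob's type."""

from dataclasses import dataclass
from typing import Optional

@dataclass
class Payload:
    vendor: int                      # vendor ID (e.g. assigned by ISC)
    msgtype: int                     # message type, meaningful only within that vendor
    time_sec: int                    # 64-bit seconds
    time_nsec: int                   # 32-bit nanoseconds, for nanosecond precision
    payload: bytes                   # the opaque blob; often itself a protobuf message
    source: Optional[int] = None     # optional classification fields agreed on by
    operator: Optional[int] = None   # cooperating senders and receivers
    group: Optional[int] = None

    def kind(self):
        """The (vendor, msgtype) tuple that tells a module how to decode the blob."""
        return (self.vendor, self.msgtype)
```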
So now we have the libnmsg C client API. This is for client applications that want to process, send, receive, read, and write NMSG data: setup, multiplexing, demultiplexing, all sorts of [INDISTINCT] with NMSG containers and payloads. We include both a simple single-threaded interface and a multi-threaded I/O engine that you can utilize if you want. The multi-threaded code is good in that we can spread the load across multiple CPUs when we decode those packets, those messages. If you happen to have a chunk of code that processes those messages and you make it reentrant, so it can be called multiple times from multiple threads, you get whatever speedup is possible from that. We are currently developing Python and Perl bindings. The Python bindings are stabilizing; we haven't made a stable release of the Python bindings yet. And [INDISTINCT] is working on the Perl bindings, because I do not use Perl.
>>ZIEGAST: [INDISTINCT] age.
>>EDMONDS: Well, Python is probably just a few years younger than Perl. But there's also a message module interface, so that we can extend the message types that the library understands without having to recompile and relink all the readers and writers. Essentially, this is a DSO that exports a particular structure, deals in particular fields, and may optionally provide function pointers that perform specific processing, specific pretty-printing or parsing, for a type.
>>ZIEGAST: So, for example, DNS.
>>EDMONDS: Yes, DNS has a variety of interesting and particular wire formats for its data fields. Let me see. There's the traditional label-encoded name, with length-octet-prefixed labels; there was a security vulnerability recently based on this concept, in SSL certificates that happened to have embedded nulls in the labels, which is valid according to the DNS wire protocol. So there's an ISC DNS message type that provides a specific function to turn a label-encoded DNS name into a human-readable, dot-delimited name, and we don't want to put that type of logic in libnmsg. We want to push that out into a plug-in for that particular message type.
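As a sketch of the kind of helper such a message module would provide, here is a minimal decoder from a wire-format, label-encoded DNS name to dotted text that escapes bytes like embedded nulls. It ignores compression pointers and is purely illustrative, not the ISC module's actual code.

```python
"""Illustrative decoder for an uncompressed, label-encoded DNS name
(not the ISC message module's code). Non-printable bytes such as embedded
nulls are escaped rather than trusted, which is exactly why this logic
belongs in a message module instead of in every client."""

def wire_name_to_text(wire: bytes) -> str:
    labels, offset = [], 0
    while offset < len(wire):
        length = wire[offset]
        offset += 1
        if length == 0:            # root label terminates the name
            break
        if length > 63:
            raise ValueError("compression pointers/extended labels not handled here")
        raw = wire[offset:offset + length]
        offset += length
        # Escape dots, backslashes, and non-printable bytes as \DDD.
        label = "".join(
            chr(b) if 0x21 <= b <= 0x7E and b not in (0x2E, 0x5C) else "\\%03d" % b
            for b in raw
        )
        labels.append(label)
    return ".".join(labels) + "." if labels else "."

# A label with an embedded null decodes to an escaped form such as
# "www\000.example.com." instead of silently truncating at the null byte.
print(wire_name_to_text(b"\x04www\x00\x07example\x03com\x00"))
```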
We try to make the core library as agnostic of the upper layers as possible. Typically, the message module is a really short amount of code that [INDISTINCT] some generated object code from the Protocol Buffers compiler. Any additional complexity we can keep out of the core library. And that's the end of my presentation.
>>ZIEGAST: Cool. This is all available online; you can download it via FTP from the ISC website. And you can actually see some of the definitions we have, where ISC is vendor ID 1. You know, if you wanted to start using this yourself, you can make yourself a vendor ID.
>>EDMONDS: Why don't you–you would ask us for a vendor ID.
>>ZIEGAST: Oh, yeah. Yeah, otherwise everyone just picks their own, you know. Jon Postel isn't moderating that anymore. So yeah, we'll just start with us, and we'll make sure in the source code that everyone plays nicely.
So now that we have NMSG, we actually have some new channels that we can make available to people. Spam: we have a bunch of spamtraps out there, and there's actually another provider that's sending us their spam reports. We basically take all the envelope information and some of the headers, extract the URLs out of it, and then packetize that, and people can actually use it. A search provider has given us some URL link pairs, so that if someone wanted to, they could actually make a map of "here is the normal web," and then if something is not in that, maybe you want to pay some special attention to it. NetFlow is also packetized; we don't do anything with that ourselves, since NetFlow already has its own set of tools, like the SiLK toolkit. People are aware of Conficker. We got ourselves involved a lot with that and helped aggregate and collect all of the data that was going through the sinkholes. So we created some types that we had all the web servers report in, and we also captured the DNS data and some of the P2P data and created channels for them, so we could actually have people see that stuff coming in real time. We are doing some development work for malware.
I'll describe that more on another slide. We're also getting some darknet feeds from ourselves and another ISP, and we expect to be getting some more, but that's just normal packets; it's not necessarily NMSG until we choose to create an NMSG module to describe the stuff. Particularly for malware, you might start with hashes, MD5s; at a layer above that you might have people passing around "I'm interested in this" or "I saw this too"; and above that, you might have something more descriptive, maybe even encapsulating some XML, as IODEF is very popular for that.
>>Is the NetFlow channel just a single blob that has all forms of NetFlow and sFlow, or is it a single version?
>>ZIEGAST: We have some version 5 just from our own routers, but getting NetFlow from our own routers has been difficult; it is not an active channel. But, yeah, I would expect that NetFlow version 9 is pretty much the standard for that, because there's IPv6 and…
>>Right.
>>ZIEGAST: …all that other stuff. Some other people are working with NetFlow and combining it with passive DNS, but we're not doing any of that ourselves right now; there are already tools out there.
So for passive DNS, you can sniff the stuff off the wire; the ".255" basically signifies that. There's a channel number, 202, and here's the broadcast address; we're listening on port 8430, where all of these packets are spilling out. And you see some information like the timestamp; there's the type, ISC ncap; and the identifier of the sensor operator who submitted it, which is kind of randomized and kept separate, but you can tell the data is coming from one source. There's the name server that it came from (our name server, which I commented out, is the sensor operator, so you can see where it came in), and some of the flags that go with it. The first part is the query; the zero means there's no answer, but then it tells you where the name servers were in the name server section, and the additional info section actually hands you the IP addresses. So you can take all of that and put it into a structure if you're using libnmsg. But you can even just do plain old text processing based on that.
>>EDMONDS: Oh, you should use a library.
>>ZIEGAST: It's much more efficient.
>>EDMONDS: Don't force it all through text processing.
>>ZIEGAST: But, you know, some of us old-timers use awk, sed, Perl and such, and don't keep up with that. Some people have done some useful things without being efficient, but, yes, do take the time to do it in C or Python or whatever is out there and save yourself some money on your hardware.
So another one: here's a sinkhole for Conficker. Here we have another type, "4 ISC http", and we get where the request came from. These are infected machines coming back to try to talk to the command and control over the web, and we just happened to take over the domain that they're using for that. So we can do some things: we do some p0f, we look at all the stuff that we can get out of the request, and that helps identify which particular strain of Conficker they were infected with. Then you can create a database, and people can use some of this data for remediation tools, and some people do. Chris Lee did a lot of work; he was at Georgia Tech, and with Shadowserver he put a lot of it together and made it useful for people.
So for spam, we have a bunch of preprocessing scripts that basically take the email message from standard input and then extract the things that people are interested in: the HELO, the MAIL FROM, the RCPT TO, the IP address it came from, the Received headers, and the URLs that are found in the message. No one's really interested in the image blobs yet, but if someone does become interested, we might do something with it.
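Here is a minimal sketch of that kind of preprocessing script: read a message from standard input, pull out a few headers and any URLs, and emit a flat record. The exact fields SIE extracts and how they are packetized are not shown in the talk, so treat the field list as illustrative.

```python
#!/usr/bin/env python
"""Illustrative spamtrap preprocessing script (not SIE's actual scripts):
read one email message from stdin, extract a few interesting headers and the
URLs found in the body, and print a flat record suitable for packetizing."""

import email
import re
import sys

URL_RE = re.compile(rb"https?://[^\s\"'<>]+")

def preprocess(raw: bytes) -> dict:
    msg = email.message_from_bytes(raw)
    urls = set()
    for part in msg.walk():
        if part.get_content_maintype() == "text":          # skip image blobs etc.
            payload = part.get_payload(decode=True) or b""
            urls.update(u.decode(errors="replace") for u in URL_RE.findall(payload))
    return {
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "subject": msg.get("Subject", ""),
        "received": msg.get_all("Received", []),
        "urls": sorted(urls),
    }

if __name__ == "__main__":
    record = preprocess(sys.stdin.buffer.read())
    for key, value in record.items():
        print(key, value)
```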
We have plugins for postfix and qpsmtpd; qpsmtpd is a Perl tool which is actually very fast and efficient on Linux and BSD servers. So we have several spamtraps of that type, where it's basically just taking unused domains that people keep sending spam to anyway; they get seeded or populated, and that's very useful because that really is spam. We also have methods where you can tag a "this is spam" report. So if someone sends to a reporting address, an abuse address, or they have a button they click on in their client, we can actually create packets that say, "Here's a user report." Now, you have to go a little statistical there, because there are false positives; some people may take some marketing and call it spam where it may not necessarily be pure spam.
Something we haven't implemented yet, but that would be interesting, would be, say, as every mail message comes in, take the headers or the envelope info and record what the SpamAssassin score was with it; then if you have everyone reporting into a central source, you have a really good chance of doing real-time reputation. There are some commercial services out there; you might be able to do a public-domain one. All this spam is a great starting point for analysis: you'll typically find that people are using botnets, or in some cases even just buying services from the botnet operators. It will lead you to some other data. Here's an example of some spam that you get off of the spam channel.
Again, there's the timestamp and the sensor; it's a spamtrap. I obfuscated some of this, except that's a real domain. And there's a URL in there, that "ff24490.gif", that points basically to some kind of advertisement. So this is obviously some kind of bad domain that's being used for phishing or, actually, in this case, just spam. So one of the things we do is take some of the passive DNS that we have in there and look up that domain; it points to some IP address, and, lo and behold, here are all the other domains being used by that IP address that we've collected via passive DNS. So once you find one, you can actually add blocking on all of the other domains even before they're used; maybe they don't use them all right away, so you can actually start to be proactive. You can find some other information, like ".j8w.ru" is probably something close to what the real name of that server is. And then you can go poke around and find some stuff.
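A minimal sketch of that kind of pivot, joining a domain seen in a spam URL against an IP-to-domains map built from passive DNS, might look like the following. The data structures are assumptions for illustration, not SIE's actual tooling.

```python
"""Illustrative spam/passive-DNS pivot (not SIE's tooling): given a domain
seen in a spam URL, find the IPs it resolves to in the passive DNS data, then
list every other domain observed pointing at those same IPs."""

from collections import defaultdict

# These maps would be fed continuously from the passive DNS channel; the
# contents are made-up placeholders.
domain_to_ips = defaultdict(set)   # e.g. {"bad.example": {"198.51.100.7"}}
ip_to_domains = defaultdict(set)   # e.g. {"198.51.100.7": {"bad.example", ...}}

def observe(name, ip):
    """Record one passive DNS observation (name -> address)."""
    domain_to_ips[name].add(ip)
    ip_to_domains[ip].add(name)

def cohosted_domains(spam_domain):
    """All other domains seen on the same IPs as the domain from the spam URL."""
    related = set()
    for ip in domain_to_ips.get(spam_domain, ()):
        related |= ip_to_domains[ip]
    related.discard(spam_domain)
    return sorted(related)

# Once one spammed domain is known, every co-hosted domain becomes a
# candidate for proactive blocking, even before it is used.
```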
When you have multiple data types, like passive DNS or spam or other information about the networks, you can do a whole bunch of data combining and find some more interesting info. Jose Nazario and Thorsten Holz, back at Malware '08, wrote a paper where they create a point system for all these different things; when you combine them together and add up the points, they can actually say, "Hey, that's fast flux," based on things like being spread across multiple networks and IP address ranges, or how often it changes, or how many hosts you have in the A records.
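As a toy version of that kind of point system, the following sketch scores a domain from a few features derived from passive DNS. The features, weights, and threshold are invented for illustration and are not the ones from the Nazario and Holz paper.

```python
"""Toy fast-flux scoring sketch (weights and threshold invented, not the
Nazario/Holz values): combine a few passive-DNS-derived features into a
single score and flag the domain if the total crosses a threshold."""

def flux_score(a_records, asns, avg_ttl, ns_changes_per_day):
    score = 0
    if len(a_records) >= 5:          # many addresses in the A record set
        score += 2
    if len(asns) >= 3:               # spread across several networks/ASNs
        score += 3
    if avg_ttl < 300:                # very short TTLs encourage rapid rotation
        score += 2
    if ns_changes_per_day >= 4:      # the name servers themselves keep moving
        score += 3
    return score

def looks_like_fast_flux(domain, features, threshold=6):
    score = flux_score(**features)
    return domain, score, score >= threshold

print(looks_like_fast_flux(
    "example-flux.test",
    {"a_records": [f"198.51.100.{i}" for i in range(8)],
     "asns": {"AS1", "AS2", "AS3", "AS4"},
     "avg_ttl": 120,
     "ns_changes_per_day": 6},
))
```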
Dave Dagon and Wenke Lee basically took passive DNS with a little bit of string matching, like looking for the word "virus" inside the packets coming by, and they were very fast and successful in finding FakeAV sites even before they were blacklisted by other services. Richard Clayton is a professor who has been investigating some of the censorship that's being done in the UK. He takes a list of hosts from the passive DNS, combines it with some active scans that he does himself, and he can pretty much determine which URLs are getting blocked, including, at one point, the Internet Archive, which is a pretty big website to be blocking.
Andrew Fried is a consultant with us. He's going to be talking at Black Hat DC in January about a lot of the stuff he did combining the spam, the spam's BGP info, passive DNS, and analysis of some top-level domains' zone files to basically go after things like Zeus/Avalanche or phishing or whatever. He's very active in the community, just saying, "Hey, here are all these new domains"; they get stuck in the SURBL and they start to get blocked. He used to do this full time for the IRS back when they were having phishing problems, and now he's actually helping not only the IRS but all these other people who are getting hit with a lot of the same methods. Ed Stoner works with CERT, and at FloCon in January he's going to be talking about how he puts passive DNS together with NetFlow to help expand on what botnet knowledge you already have. You can already get some botnet knowledge out of NetFlow, but if you take passive DNS to the next level, you can basically use your IPs to help find more names, which help you find more IPs, which help you find more names, and eventually you get a map of everything.
We're actively looking to start the malware channel. Some other people are creating products for DNS reputation based on the data. We're interested in perhaps offering scanning, because people who are doing scanning for DNS right now are finding that the bad guys figure out who they are, so they get blocked; we might offer some scanning infrastructure for people. Automated abuse or distributed denial of service attack reporting might be a way to standardize and have some real-time report saying, "Hey, I'm getting flooded," and then you tell a whole bunch of other people, which might include your antivirus vendors or ISPs directly, as opposed to picking up the telephone. With the URL search data, you can perhaps find that a whole bunch of people are coming to some place at once, and that might be because Britney Spears did something that day, or it could be that there's a new virus that everyone's downloading at once; if you have everyone looking at a new URL all at once, you can perhaps take a look at that. And BGP updates, that's security data as well; that could be helpful for finding out about people's networks getting stolen out from under them.
And here's how people can help. If there are sharing methods that people are using right now between themselves, and they want to incorporate more people into that, we can help reduce some of the overhead by having it put in one central place and using our broadcast infrastructure to get it to all the people who need it, so the data doesn't have to be resent or recopied between multiple parties. If you don't like working with service agreements and NDAs and stuff like that, we can help simplify things. Something else you can do is bring some servers to SIE and actually take a look at what's out there. I mean, there are a lot of good minds here at Google; I can imagine some of them might be interested, on the security side, in seeing what's out there, seeing what they can figure out, and seeing if they could combine it with some of the data that you guys already have. And you can also install sensors, particularly at ISPs or corporations or security companies. Everyone's got some kind of data that is perhaps worthless junk to them, but someone else could do something effective with it when they combine it with something else. So people can go ahead and send us some more data; we'd appreciate it, and so would the rest of the security community that works with us.
So we're SIE; you can send us email at that address, and it will get to all three of us. We've got a website; there's my phone number. And for nmsg, you can go download it yourself. We have a developer mailing list that we recently set up where people can talk about how they use it. It's generic and it's open source; it's not SIE-specific. You can use the stuff internally and then tell us how you're using it; it might be interesting. Ncap is available from us as well. We thank you guys for making Google Protocol Buffers, and we especially thank all the sensor operators who are out there donating data to make all this useful.
