There have been some spectacular failures when it comes to reading meaning into Internet traffic; think Google Flu Trends. However, Predata, a company that helps people understand global events and market moves by interpreting signals in Internet traffic, has honed human-in-the-loop machine learning to get to the bottom of geopolitical risk and price movement.
Predata uncovers predictive behavior by applying machine learning techniques to online activity. The company has built the most comprehensive predictive analytics platform for geopolitical risk, enabling customers to discover, quantify and act on dynamic shifts in online behavior. The Predata platform provides users with quantitative measurements of digital concern and predictive indicators for different types of risk events for any given country or topic.
Dakota Killpack: Over the past few years, we’ve collected a very large annotated data set about human judgment for how relevant many, many pieces of web content are to various tasks.
Ginette Methot: I’m Ginette,
Curtis Seare: and I’m Curtis,
Ginette: and you are listening to Data Crunch,
Curtis: a podcast about how applied data science, machine learning, and artificial intelligence are changing the world.
Ginette: Data Crunch is produced by the Data Crunch Corporation, an analytics training and consulting company.
Let’s jump into our episode today with the director of Machine Learning at Predata.
Dakota: My name is Dakota Killpack, and I’m the director of machine learning at Predata. Predata is a company that uses machine learning to look at the spectrum of human behavior online and organize it into useful signals about people’s attention, and we use those signals to inform how people make decisions by giving them a view of what people are paying attention to. Because attention is a scarce cognitive resource, people tend to pay attention only to very important things. If they’re about to act in a way that might cause problems for our potential clients, they’ll spend a lot of time online doing research and making preparations, and by unlocking this attention dimension of web traffic, we’re able to give some unique insights to our clients.
Curtis: Can we jump into a concrete use case of what you’re talking about, just to frame it and put some details around how someone might use that service?
Dakota: Absolutely. One example that I find particularly useful for revealing how attention works online is looking at what soybean farmers did in response to tariffs earlier this year. Knowing that they weren’t going to get a very good price on soybeans at that particular moment, a lot of them were looking up online how to store their grain, purchasing very long grain storage bags, purchasing the obscure scientific equipment needed to insert big needles into the bags to pull samples for testing, and buying moisture-testing devices to make sure the soybeans wouldn’t grow mold. All of these web pages tend to get very little traffic, so when we see an increase in traffic to all of them at the same time, we know that a very influential group of individuals, namely farmers, is paying attention to this topic. Using that, we’re able to give early warning to our clients.
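Predata doesn’t publish its detection code, but the idea Dakota describes here, flagging simultaneous traffic spikes across a basket of normally quiet, related pages, can be sketched in a few lines of Python. Everything below (the rolling window, the z-score threshold, the fraction of pages required to spike) is an illustrative assumption, not Predata’s actual method.

```python
import numpy as np

def basket_alert(traffic, window=28, z_threshold=3.0, min_fraction=0.8):
    """Flag days when most pages in a low-traffic basket spike together.

    traffic: 2D array, shape (n_days, n_pages), daily view counts for
             related niche pages (e.g. grain-storage how-tos, moisture testers).
    Returns a boolean array covering the days after the warm-up window.
    """
    n_days = traffic.shape[0]
    alerts = []
    for t in range(window, n_days):
        hist = traffic[t - window:t]          # trailing baseline per page
        mu = hist.mean(axis=0)
        sigma = hist.std(axis=0) + 1e-9       # avoid division by zero
        z = (traffic[t] - mu) / sigma         # today's z-score per page
        # alert only when a large fraction of the basket spikes at once
        alerts.append((z > z_threshold).mean() >= min_fraction)
    return np.array(alerts)
```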
Curtis: Sounds like looking for needles in a haystack of data, right? So how do you determine what is a useful bit of information in the context of what your clients are looking for? Do they have an idea of what they’re looking for and then you go out and search for that, or does your algorithm find anomalies in the data and then characterize those anomalies so that you can report them back? How does it work?
Dakota: It’s a mix of both. Because the Internet is such a rich and complex domain, it’s very dangerous to just look for anomalies at scale. There have been some high-profile failures, most notably the Google Flu Trends experiment, where people have tried to link arbitrary online activity to real-life human behavior. To get around some of those problems, we use human-in-the-loop machine learning. It’s very much a process where human expertise about the world is the foundation of all of our models. We have a team of analysts who know how people might act both in real life and online, and they’re able to constrain the space of things that we look at to make sure we’re not finding spurious patterns. We know that any anomaly that surfaces in the set of things we’re monitoring is more likely than not to be relevant. Once those anomalies are surfaced, we pass them back to a human before finally alerting, which gives us one more final check that we’re not sending out spurious alerts, as well as even more training data for the whole human-in-the-loop system.
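As a rough sketch of the loop Dakota describes: analysts curate the watchlist, the machine surfaces anomalies, an analyst reviews each candidate before any alert goes out, and the review decision is recycled as training data. The helper functions here are hypothetical placeholders, not Predata’s code; only the structure follows the description above.

```python
def human_in_the_loop_cycle(watchlist, detector, analyst_review, send_alert, training_data):
    """One pass of the review loop described above.

    watchlist      -- signals pre-selected by analysts (constrains the search space)
    detector       -- callable returning candidate anomalies for a signal
    analyst_review -- callable returning True/False for a candidate
    send_alert     -- callable that notifies clients
    training_data  -- list collecting (candidate, label) pairs for retraining
    """
    for signal in watchlist:                          # humans define what is scanned
        for candidate in detector(signal):            # machine surfaces anomalies
            is_relevant = analyst_review(candidate)   # human final check
            training_data.append((candidate, is_relevant))  # label feeds back in
            if is_relevant:
                send_alert(candidate)                 # only vetted anomalies go out
```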
Curtis: So this is a setup where you have humans defining the problem space, or the areas to look at in the first place; the algorithm then parses that and comes back with results, and then a human checks those and adds context for the client. Is that correct?
Dakota: Right. And using that process over the past few years, we’ve collected a very large annotated data set of human judgments about how relevant many, many pieces of web content are to various tasks, whether it be in the realm of predicting geopolitical risk, predicting price movement for certain currencies, or anything else that our experts cover. We know exactly what web content is and isn’t relevant, and we’re using that to constantly improve our system.
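A minimal sketch of how analyst relevance judgments like these could train a classifier, using scikit-learn. The feature choice (TF-IDF over page text), the toy annotations, and the task name are assumptions for illustration, not Predata’s pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical analyst annotations: page text paired with a relevance label
# for one task (e.g. "soybean supply risk").
pages = [
    "how to store soybeans in long-term grain bags",
    "grain probe needles for moisture sampling",
    "celebrity gossip roundup for the week",
]
relevant = [1, 1, 0]

# TF-IDF features plus logistic regression: a classical, interpretable model.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(pages, relevant)

# Score a previously unseen page against the learned notion of relevance.
print(model.predict(["portable soybean moisture tester reviews"]))
```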
Curtis: So you’re actually finding, and I think this is true in a lot of cases, although it’s often not apparent to people who are starting out in this field, that looking at the data, and making sure the data you’re bringing in is the right set of data and that it’s clean and actionable, is where you see the biggest lift, not necessarily in cutting-edge algorithms.
Dakota: Right. If you have a good enough set of predictors, then the classical stuff works extremely well. Where the classical stuff does not work is in trying to capture human intuition about what the various factors for something might be, and that’s where we find that having the human and the machine work in conjunction is best. A lot of the tasks we’re trying to solve, geopolitics and finance, or even more broadly the problem of what humans are paying attention to, when, how that changes over time, and what the important parts of that are, are problems that even humans themselves can’t solve particularly well. So we need the best of both human and machine to succeed.
Curtis: Sure. And now, you mentioned geopolitical risk and you mentioned finance. Is there an area that you tend to focus on? Is it pretty even across those two, or maybe there are more use cases as well that you’re finding this data is good at solving? Can you talk a little bit about the use cases?
Dakota: Yeah, anything where attention is an important component, I’d say, is what connects our use cases. In geopolitics you often have many, many actors, large state actors, and each one has its own viewpoint about the world, and they’re viewing each other in very particular ways. So we track what the attention of a large organization toward another one looks like, and what the public’s attention to these organizations looks like. Using that kind of behavioral profiling and following the attention at every step is key to solving these problems, both in geopolitics and in finance, where we might try to track what institutional investors are paying attention to today, how that differs from retail traders, and how that differs from the general public doing research into a company because they saw it in the news today.
Curtis: And again, this may be, you know, too deep into your secret sauce, and if so, we don’t have to go down this road, but I am really curious about how you even start to curate a dataset that tells you those things. Is that something you can comment on?
Dakota: I’d say it starts with the human. A lot of our analyst team worked in the professions that we’re trying to create an online index of, so they’re able to make very deep judgment calls as to whether any piece of web content would have been relevant to them in that past phase of their life.
Curtis: And how long have you been at this now? You said you’ve done at least enough iterations of this that you have a pretty robust set of training data and context for what you’re doing. How long have you been at it?
Dakota: I’ve been with the company a little over three years and it had been
going on experimentally for around a year before that point.
Curtis: Okay, got it. And how long would you say it took you to hit your stride, where you said, “okay, we’ve done this enough that we now have something that is useful and it works”? What I’m trying to get at is how long that curve was to really get to something workable.
Dakota: I’d say it took a few years of building up enough of an internal
culture of how to understand these things since it’s really a unique way of
thinking about the world. Most people don’t have data about everyone’s
attention at their fingertips. Fundamentally, it’s something that our team has
been able to wrap their heads around, but it did take a few tries.
Curtis: So even how you decided to collect the data, the sources you were
looking at, it sounds like that was a long process to really nail that and get
that right.
Dakota: Right. We’ve been constantly adjusting which data sources we think are the most valuable, and the main takeaway is that anything where people spend more time and effort actually interacting with or accessing a piece of web content tends to make it more predictive.
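Dakota’s takeaway, that content requiring more time and effort to interact with carries more predictive weight, could be expressed as an engagement-weighted attention score. The interaction types and weights below are invented for illustration; only the principle of weighting deeper engagement more heavily comes from the interview.

```python
# Hypothetical engagement weights: deeper interaction counts for more.
ENGAGEMENT_WEIGHT = {
    "pageview": 1.0,           # casual visit
    "video_watch": 2.5,        # time spent watching
    "document_download": 4.0,  # deliberate effort to obtain content
}

def attention_score(events):
    """Sum engagement-weighted interactions for one topic on one day.

    events: iterable of (interaction_type, count) pairs.
    Unknown interaction types fall back to a weight of 1.0.
    """
    return sum(ENGAGEMENT_WEIGHT.get(kind, 1.0) * count for kind, count in events)

print(attention_score([("pageview", 120), ("video_watch", 15), ("document_download", 3)]))
```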
Curtis: Yeah, that’s interesting. Maybe give us an example of how one of your clients might use some of these predictions, being able to know something in advance and then take action on it. What’s some of the value that people have been able to extract from these predictions?
Dakota: In the geopolitical realm, we’ve had Predata used in various missions around the world, and they’re finding that it gives them much greater lead time than a lot of the things they’re currently looking at. I can’t really say too much about those national security applications, but a consistent theme is that it helps them prioritize: having a dashboard of the world’s attention to look at tells them where they ought to direct their classified resources.
Curtis: So they actually engage with the dashboard, and I’m assuming they can punch in criteria, things that they want to look for, and that helps them with their overall strategy.
Dakota: Right. They’re avid users of the platform there. They’re building their own signals with our customizable tools, so they’re conducting their own research on the Internet and feeding that into our platform.
Curtis: Oh, that’s interesting. So your users can actually bring in their own data sets and mix them with what you’re doing to enhance it.
Dakota: Right. It’s a fully extensible platform.
Curtis: That’s really cool. You said it took a couple of years to really get this right. What would you say were maybe one or two of the biggest challenges you had in making this thing work, and how did you overcome them?
Dakota: On the human side, it was realizing that attention is actually the most valuable thing for making predictions. Not only did that improve performance, it also made things much more interpretable when they finally got to the client, which made it much easier for them to start incorporating it into their decision-making process.
Curtis: So you’ve found that even just some simple visualizations, or even simple notes that say this is happening, this is where people’s attention is focused, are sufficient to help your clients make decisions and get value from this.
Dakota: Right. And that comes back to my earlier point: using simpler classical models as the final stage makes a lot more sense because it’s much easier to provide interpretability around those models. That’s something that our clients love, the fact that the final models are a glass box rather than a black box.
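The “glass box” point can be illustrated with a simple classical model whose coefficients map directly back to named attention signals. The signal names and data below are invented for the example; the interview does not describe Predata’s actual final-stage model beyond it being classical and interpretable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented attention signals and a binary outcome (e.g. "risk event within 30 days").
signal_names = ["farmer_attention", "regulator_attention", "public_attention"]
X = np.array([[0.2, 0.1, 0.3],
              [0.8, 0.7, 0.4],
              [0.1, 0.2, 0.2],
              [0.9, 0.6, 0.8]])
y = np.array([0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# Each coefficient is directly attributable to a named signal,
# so a client can see which attention stream drove the prediction.
for name, coef in zip(signal_names, model.coef_[0]):
    print(f"{name}: {coef:+.2f}")
```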
Curtis: That’s awesome. I love that, because you don’t often hear about examples where the innovation is getting the right dataset, and that’s really what’s driving things, as opposed to the newest fancy algorithm. That’s really interesting. Did you originally come at this problem with the idea that, “hey, we’re going to use some really powerful algorithms,” and then were surprised when you found, “oh, it’s just the data, and classical algorithms work well enough”? Or did you already have that notion?
Dakota: It was a lot of back and forth, trying out lots of different things from many different academic worlds. Originally we tried various signal processing techniques, everything under the machine learning umbrella, even some techniques from statistical chemistry, but the results we were getting weren’t good until we really took the time to work things through from first principles: what are people doing online, and why does it matter? When we directed all the powerful machine learning at letting us answer that question, we were able to get a data set that we could use for predictive tasks.
Curtis: You also mentioned the financial side of things. I’m curious if you have a story or something that you could share from a use case in the financial world that might be interesting for people to hear about.
Dakota: Sure, so the soybean example is definitely one. Anything in the commodities space tends to be good for us, since we’re able to break it down in terms of supply and demand factors, and for commodities the people responsible tend to be in very particular industries, with web browsing that doesn’t overlap much with the general public: a lot of rarely viewed web pages about industrial techniques.
Curtis: And I’m assuming that’s what differentiates you, would you say? Or maybe there are some other differentiators that you could comment on. What makes Predata better than some other firms that are maybe trying to do the same thing? I think there are a couple of other players in a similar space.
Dakota: Right. Other firms in the alt-data space tend to have a good human behavioral link, but they’re basing it on rudimentary forms of data, or something where you don’t need a great leap of creativity or methodology to really extract value, something where simple machine learning can work. Things like tracking foot traffic to certain stores by looking at mobile phone data, where there’s a simple one-to-one relationship, or using satellites to estimate economic development in an area, or to look at the water level that certain tankers are floating at to get an idea of how much is in them, and other things with that one-to-one relationship. What we do is unlock the power of data for use cases about human behavior in a much more complicated way. I’d say our differentiator is that we’re able to describe the full spectrum of what humans do online, rather than being limited to things with that one-to-one relationship, because we approach structuring the data in this very behavior-first way.
Curtis: I’m curious, just because that’s not a term that a lot of people have probably come across, alternative data or alt data. Can you define that for us, as someone who is in this space?
Dakota: Sure, so that term refers to data that’s not market data, data that often tends to be generated by people going about their day-to-day lives: credit card transaction data, satellite data, mobile phone geolocation data. That’s all used to extract an edge on how a certain company is performing or to get ahead of economic releases.
Attributions
Music
“Loopster” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
http://creativecommons.org/licenses/by/3.0/