Protecting systems with data and algorithms

Humans and dogs alike have benefitted from using machine learning to tackle misinformation.

South Park Commons member Clarence Chio and his co-author David Freeman sat down to talk about the motivation and process for writing their book, Machine Learning & Security.

Q: What are the biggest problems facing security/web abuse today?

A: People rely on web platforms more than ever. We trust online sources with everything from the most trivial choices (picking which restaurant to go to) to the biggest decisions of our lives (whom to marry). Web platforms aren’t fundamentally becoming less safe, yet users have never been more vulnerable. Malicious actors are constantly looking for holes to exploit, and they naturally flock to platforms that give them a bigger audience and maximize their returns. That’s why the most popular web services have to stay vigilant and stay ahead of attackers and abusers who constantly try to bypass their defenses.

Q: How do you think machine learning can be used to tackle the problems of fake news/misinformation?

A: Machine learning excels at finding patterns in large datasets. In general, the more data you feed a system, the better it performs. Fake news and misinformation exhibit patterns that are not vastly dissimilar from spam and phishing. These are problems that the security community has been tackling for years, and machine learning has exhibited some very promising results in the detection of spam and phishing.

That said, misinformation is a very broad category that includes everything from fake news published on a pseudo-news publication to synthetically generated video clips of someone performing an act that they never actually performed. In many places, we see people treat machine learning as a “miracle solution” that somehow solves problems magically. We think that it’s important to always be the biggest skeptics of machine learning while advocating for it. Statistical techniques like machine learning come with a lot of baggage and usability issues. One might choose static rule-sets over machine learning for detecting spam, even if the rules perform less well, because it’s a lot easier to explain the results of a classification with rules than with an opaque black-box system.
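To make the explainability point concrete, here is a minimal sketch of a static rule-set spam check that reports which rules fired. It is an illustration only, not something taken from the book: the rule names, the patterns, and the `threshold` parameter are all hypothetical choices.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Rule:
    """A named, human-readable test applied to a message."""
    name: str
    matches: Callable[[str], bool]


def _mostly_upper(text: str) -> bool:
    # True when more than 60% of the letters are upper-case.
    letters = [c for c in text if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return upper > 0.6 * max(1, len(letters))


# Hypothetical rules, chosen only for illustration.
RULES: List[Rule] = [
    Rule(
        "contains a URL-shortener link",
        lambda t: re.search(r"\b(bit\.ly|tinyurl\.com)/", t, re.I) is not None,
    ),
    Rule(
        "urgent call to action",
        lambda t: re.search(r"\b(act now|urgent|verify your account)\b", t, re.I)
        is not None,
    ),
    Rule("mostly upper-case text", _mostly_upper),
]


def classify(text: str, threshold: int = 2) -> Tuple[bool, List[str]]:
    """Flag a message as spam when at least `threshold` rules fire.

    Returns the decision together with the names of the rules that
    fired, so every verdict comes with its own explanation.
    """
    reasons = [rule.name for rule in RULES if rule.matches(text)]
    return len(reasons) >= threshold, reasons


if __name__ == "__main__":
    is_spam, reasons = classify("URGENT: verify your account at bit.ly/x1")
    print(is_spam, reasons)
```

Every decision here traces back to a short list of named rules, which is the property the answer above describes. A learned classifier would likely catch more spam, but explaining any single verdict would take extra tooling.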

Q: What do you think of books as a medium of technical knowledge sharing today?

A: It’s amazing that there are so many ways to share technical knowledge today. Anyone can get a high-quality education in machine learning by taking one of the dozens of free MOOCs, watching university lectures on YouTube, or reading some of the countless blog posts that enthusiasts have put up online. The number of people who regularly read books is not growing massively, and because knowledge is so accessible in these other forms, any one of them can reach a far wider audience than a book could.

Nevertheless, we think that technical books provide something important that can’t be replaced by short blog posts, lectures, or videos. A book allows the author to develop a complex set of ideas through long-form exposition, offering the opportunity to dive deep into topics in a way that would be difficult in other media. To us, physical books organized by chapters will always be the most useful reference medium, especially when trying to dive into a foreign or unfamiliar subject area.

Q: What are some of your favorite technical books, websites, videos, and classes?

Freeman: Jure Leskovec’s Stanford course on Mining Massive Data Sets is what introduced me to the field. Jure is an enthralling lecturer and makes the most complex concepts seem simple. My go-to reference for the mathematical aspects of machine learning is The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman. It’s super rigorous but still manages to be very readable. For keeping up with research, my favorite conferences are USENIX Security and WWW.

Chio: One of the first technical books that I read was Programming Collective Intelligence, by Toby Segaran. It gave me my very first peek into the world of practical artificial intelligence and how it can change our world. Ten years after reading that book, it’s surreal to be releasing our own with the same publisher that introduced me to the field. On a different note, I’m a big fan of Siraj Raval’s YouTube channel. He publishes a roughly 10-minute video every week, offering both novices and experts a super entertaining and educational view into AI.

Q: What inspired you to write a book about this topic?

A: An acquisitions editor approached me (Clarence) in late 2016 and invited me to submit a proposal for a book about AI and security. I had gotten some exposure in the field due to some talks and workshops I had been doing at security conferences. I thought of who I would trust to write a technical book on this topic, and David (who was Head of Anti-Abuse efforts at LinkedIn at that time) immediately came to mind. I approached David with the idea of co-authoring this book with me. After several meetings, long conversations, and deliberation, we decided to do it.

Both of us had seen massively growing interest in the field of AI and security, especially among the information security and web abuse fighting communities. Yet there was almost no published material on this, and conversations on these topics were always clouded by a lot of hype and disinformation. Our goal in writing this book was to ground the discussion by tackling long-standing, practical security problems through the lens of a skeptical data scientist, evaluating how machine learning can disrupt the status quo in security.

David Freeman, former Head of Anti-Abuse at LinkedIn, met co-author Clarence Chio at the “Data Mining for Cyber Security” meetup in 2015.

Q: What was the writing process like for a book like this?

A: The editorial process was long and involved. O’Reilly is very meticulous about the quality of the content they put out, and there was a team of people and resources available to us that helped in everything from content coverage to designing the cover, to the most minute formatting quirks that we requested.

It took us 12 months to get from the proposal to the final manuscript. Dave and I met face-to-face in a Mountain View cafe every two weeks to discuss general directions and what we wanted to cover in different sections, but we mostly collaborated online in Google Docs. As with most other O’Reilly publications, external technical reviewers (experts in the field from academia and industry) were engaged at the halfway point and again after the first complete draft was submitted.

Q: What’s next? Would you write a book again?

Freeman: I personally want to get more involved in exploring how to use deep learning for abuse and security. Most of the techniques in fighting abuse so far have been pretty standard. Deep learning has shown so much promise in text and image processing, but it’s not clear whether similar deep learning methods can be effectively applied to areas like behavioral analysis, which is why I’m looking to dedicate time to conduct research in this area.

Chio: I always wanted to build a company, and I’ve dedicated the next few years of my life to achieving that. Whether or not that company is in any way related to security and/or machine learning remains to be seen!

We keep getting asked if we’ll write another book — this has been such an educational experience for us, but is ultimately a hugely time-consuming endeavor. We joke that we’d consider doing it again after we forget how draining the process was!


You can purchase Machine Learning & Security on Amazon and sign up for the South Park Commons newsletter here.