Category: General Thoughts

Why don’t Data Scientists use Splunk?

I am currently attending the Splunk .conf in Orlando, and a director at Accenture asked me this question, which I thought merited a blog post:  Why don’t data scientists use or like Splunk?  The inner child in me was thinking, “Splunk isn’t good at data science”, but the more seasoned professional in me actually articulated a more logical and coherent answer, which I thought I’d share whilst waiting for a talk to start.  Here goes:

I cannot pretend to speak for any community of “data scientists” but it is true that I know a decent number of data scientists, some very accomplished and some beginners, and not a one would claim to use Splunk as one of their preferred tools.  Indeed, when the topic of available tools comes up among most of my colleagues and the word Splunk is mentioned, it elicits groans and eye rolls.  So let’s look at why that is the case:

Book Review: Technically Wrong

I recently completed Technically Wrong by Sara Wachter-Boettcher.  Let me start by saying that I’m glad that Ms. Wachter-Boettcher wrote this book. The tech industry has a lot of issues which need to be brought out into the open and it is definitely a positive development that people such as Ms. Wachter-Boettcher are bringing these issues to the forefront.  It is only recently that people have begun discussing the continuous erosion of privacy, misogyny in the tech industry, lack of diversity and many other issues. Whilst I would not deny any of these issues, I felt Wachter-Boettcher’s analysis was somewhat lacking and didn’t really get at the realities of working in the tech industry.  Wachter-Boettcher cites numerous examples of tech gone wrong, such as a smart scale telling a two-year-old that he needs to lose weight, Facebook denying a Native American person an account because it felt that their name was not legitimate, and the abhorrent use of proprietary, black box algorithms to make parole recommendations.

Again, it is definitely a positive development that Wachter-Boettcher and others are writing about these issues, but the alternatives and solutions she proposes seem a bit simplistic.   While she doesn’t state this directly, much of the book seems to suggest that all of technology’s woes are caused by the lack of diversity in the tech industry, specifically that “white guys” from elite universities are running everything.  I don’t have an electronic copy of the book, but about halfway through, I wanted to count the number of times the phrase “white guys” appears in it.  Sometimes this phrase includes Asians, sometimes not.

Book Review: Automating Inequality

I recently read Automating Inequality by Virginia Eubanks and would like to share some thoughts.  This review is the first of several book reviews I’ve been working on about books relating to the problems which are emerging from technology. I’ll keep this brief…

The Good:

I am glad that the conversation about social problems caused by technology is expanding.  Books like Automating Inequality are good contributors to that discussion.  In this book, Eubanks highlights a few situations where technology has negatively affected people’s lives, primarily poor people.  This technology also serves to limit poor people’s lives and opportunities, creating what she refers to as a digital poorhouse.

Machine learning can be a powerful tool for developing predictive analytics.  One abuse which I found particularly troubling is cited on pg. 137: a risk model which calculates a risk score for unborn children.

Vaithianathan’s team developed a predictive model using 132 variables–including length of time on public benefits, past involvement with the child welfare system, mother’s age, whether or not the child was born to a single parent, mental health, and correctional history–to rate the maltreatment risk of children in MSD’s historical data.  They found that their algorithm could predict with “fair, approaching good” accuracy whether these children would have a “substantiated finding of maltreatment” by the time they turn five.

What I Found Lacking:

What I found missing from Automating Inequality were alternative proposals.   It is easy to criticize a technical solution, but these systems are often deployed against complex problems and finding a solution often requires a lot of vigilance, persistence and iteration.  Eubanks discusses the issue of welfare abuse, and seems to downplay the fact that welfare fraud is in fact a major issue in this country.  With some basic research on Google you can unfortunately find countless cases of individuals convicted of welfare fraud.  Clearly, welfare programs should make efforts to reduce fraud and make sure that their resources are going to people who truly need the assistance.

What Eubanks seemed to miss was what went wrong in the implementations that she highlighted.   In two cases, Eubanks highlighted several systems designed to improve the efficiency and efficacy of welfare programs.  From the book, it sounded as if the designers of these programs implemented various technical systems to automate the intake process for benefits.  What didn’t happen, and what Eubanks didn’t discuss in the book, was continuous improvement.  The government agencies that implemented these programs took the approach that one would take when building a bridge or tunnel: get it done and once it’s done, move on to the next project.  This doesn’t work for information systems because they are never done.  Once you start using them, there will always be faults and opportunities to improve.  If an organization can rapidly iterate and improve the solution over time, they will end up with an effective solution.

Eubanks ends the book with a proposed code of ethics for data scientists and other technologists.  I wrote my own code of ethics for data scientists, and it is always interesting to me what others write on the subject.   I particularly liked these points from Eubanks’ Code of Ethics:

  • I will not collect data for data’s sake, nor keep it just because I can
  • When informed consent and design convenience come into conflict, informed consent will always prevail.  (If only it were so… )

Overall, I found the book to be quite thought provoking, but I did disagree with some of the conclusions.

Why more women don’t code: A heartbreaking story with a good ending

I’ve been reading a lot lately about the ills of the tech industry, with a few book reviews in my queue to finish, and I posted a question on LinkedIn about what inspired people to get into tech.  My motivation was to see if there was a difference between men and women.  My hypothesis is that there are societal and cultural factors which discourage girls and women from studying tech (math, computer science, engineering, etc.) and hence there aren’t enough qualified women to fill the tech jobs, and ultimately we end up with the current state of affairs where men outnumber women 3 or 4 to 1 in most tech companies.

Anyway, I received the following private response from a former student to whom I shall refer as S.  S was a student in one of my recent classes and a delight to work with.  Her story is absolutely heartbreaking and needs to be heard.  I lightly edited it, only to remove some details which would identify her.

You can share my story but I am not ready to have my name on it. I am going to be looking for a job soon and not everyone will appreciate it. They will see me as slow and too old. I was 54 before I gave myself permission to study tech and coding languages. Growing up, girls were not encouraged to study math. I was teased about my abilities in math, because I could not recite the times tables. I grew up believing that I couldn’t do math. At home, my brother received an early TI calculator. It was supposed to be shared between us but that didn’t happen. Besides being annoying, it was clear that electronics were not for girls.

I began my university studies in psychology which seemed the only science that did not require math. I was bored and I tried to learn math on my own. Actual classes involved grades and that was disastrous. I somehow passed all the mathematics prerequisites and ended up in graduate school for chemistry. During my quantum mechanics class, I struggled. I went for help from the professor. He realized that I couldn’t do the times tables verbally and completely humiliated me.

At my job, I learned about agile, innovation and human centered design. I loved these ideas as they provided a framework and a fresh vocabulary to talk about science and problem solving instead of just math. I excelled at facilitating these techniques. Many of the prototypes we needed to wireframe involved a website or an application. I became curious about data and technology, but I would never let myself work in this area. The risk of humiliation was too great. My supervisor already realized that I stumbled verbally with numbers. I did not want to be in a position to lose my job while trying out new skills.

About the same time, I had a routine hearing evaluation and I was diagnosed with great hearing but a serious auditory processing disorder. The audiologist predicted that I probably had terrible problems with spoken arithmetic and verbal math. I was thunderstruck. How could he know that? I had been punished as a child for exactly this issue. I internalized it as part of my self image. Although I was a great reader, I was unable to recite the times tables and do my arithmetic. I couldn’t explain why, maybe I really was a bad kid. After digesting the audiologist’s report, I allowed myself to become more interested in data and technology.

I fight paralyzing “imposter syndrome” every time I sit in front of my computer. I began to take free classes on-line and go to meet-ups and learn, even when I couldn’t talk about it well. I joined groups for women who code.  I continue to learn and I just signed up for an intensive software engineering boot camp. I currently volunteer as a teaching assistant for introductory Python at two different community women’s coding groups. I continue to attend meet-ups.  I am not yet where I want to be but I am finally allowed to move ahead. Data is going to change our world and I don’t want to miss out.

I took the #DeleteFacebook Challenge

In the last few weeks, Facebook has been in the news a lot for its aggressive data gathering.  What has surprised me is not that Facebook is in the news, but that it didn’t happen much sooner.  Facebook is possibly the most invasive data gathering, privacy invading platform the world has ever seen, despite the fact that it is cloaked behind a veil of childish logos and thumbs up buttons.  Additionally, Facebook has engaged in some truly abhorrent practices, such as gathering text messaging and phone metadata from Android users, conducting secret psychological tests on over 700,000 users in 2012, and running ad programs that track users’ web activity off of Facebook, to say nothing of how Facebook was and most likely is being used to propagate fake news.

As someone who has worked in various regulated industries (banking, government) it appalls me how companies like Facebook abuse their users’ privacy.  My biggest issue is that Facebook disguises its data gathering efforts under a slick veneer of innocence which disguises their true intent.  Much like tobacco adverts of yore, Facebook and its “family” are targeted primarily towards younger people who don’t understand what they are giving up in exchange for the privilege of sharing their photos with their friends.

An extremely egregious example of this occurs on election days in the US.  Facebook will ask users a question: “Did you vote today?” and give you a little sticker on your profile if you answer that you did.  Now why do you think they would do that?  To encourage people to vote?  Hardly, though that may be a side benefit.  No, the real reason they do this is to gather information about people’s voting history, which Facebook then uses in their targeted political campaigns.  Don’t believe me?  You can read about it here: https://politics.fb.com.

The problem here is that Facebook doesn’t ask their users for consent in a way that a typical user will understand.  I am not trying to mock Facebook users, but most people who don’t work in data analytics don’t really understand the implications of mass data gathering.  The image above is how Facebook Messenger asks for permission to gain access to your contacts, SMS and phone call logs. (Courtesy of ArsTechnica)  Nowhere in this image does it say anything about collecting SMS, phone logs or anything for that matter.  It looks cute and most people wouldn’t think twice about clicking on ok.

Silicon Valley’s Culture Needs to Change

The biggest issue I have with some of what Facebook has been caught doing is that enough of the company felt it was acceptable for them to do it.   That’s the bigger issue here.  Most likely, some manager at Facebook decided, why don’t we gather all our Android users’ text data and mine it!  And nobody said a bloody thing. No leaks to the news media, no disgruntled employees writing blog posts about it, nothing….  Which ultimately means that everyone involved felt it was totally acceptable to take their users’ SMS and phone logs.   This practice only ended when Android disabled the functionality, so it wasn’t as if Facebook execs had some crisis of conscience.

But, I’m a realist.  Facebook’s revenue is generated by selling targeted advertising and the way it targets its ads is by gathering data about its audience.  Whilst Mr. Zuckerberg can write pithy non-apologies about it, nothing will change because this is how Facebook makes money.  The only way this changes, is for people like you to get off of Facebook (and Instagram, and WhatsApp) in significant numbers and for advertisers to stop spending money on Facebook ads. As long as there is a market for this data, the sad reality is that there will be more and more companies trying to invade your privacy and sell it to the highest bidder.

Educate Yourself About How Companies Monetize Your Data

You need to understand how companies are using your data and make a conscious choice about whether that company provides enough value to justify that loss in privacy.  Frankly, this is why I prefer using companies whose primary revenue stream is not derived from data monetization.  This is why I choose to use iPhones instead of Android, iMessage instead of WhatsApp, socializing with real friends instead of Facebook.  You can generally tell this is the case by whether you have to pay for a service.  Generally speaking, companies which charge for their services are not looking to invade your privacy to the same degree as companies that offer their services “for free”.  As the saying goes: “If you aren’t paying for it, YOU are the product.”

A New Threat: Stalkerware

What would you do if you attended a political event or protest and the next day, you receive targeted adverts for that political cause?  Would that be cause for concern?  After all, you don’t post about your political views, how did the advertisers know?  You didn’t sign any rosters or register, so how did they know you were there?

I recently became aware of a new category of computer-evil: stalkerware.  I thought I was being clever and would have the privilege of coining a new term, but a few other people have already coined the term.  However, I would like to propose a slightly different definition.  In an article originally appearing on Motherboard, stalkerware is defined as:

Stalkerware is defined as invasive applications running on computers and smartphones that basically send every bit of information about you to another person. This covers the gamut from programs that can be purchased online to give third parties access to basically everything on your computer from photos, text messages and emails to individual keystrokes, to apps that activate your Mac’s webcam without your knowledge.

I’m not really seeing the difference between this definition and “traditional” spyware, but stalkerware as I define it is:

Software that automatically reports your location on a regular basis without your knowledge or consent.

The stalkerware that Motherboard writes about are dedicated programs or apps that someone deliberately installs on a target’s mobile device in order to track their activity for whatever reason.  Stalkerware as I define it is a little different, in that it is not targeted at one individual.  These are applications that are installed on mobile devices that track your every move–literally stalking you–most likely without your knowledge.

Thoughts on Teaching Data Science

A big interest of mine is how to impart what little I know of the tools and techniques of data science to others.  When I was at Booz Allen, I taught numerous classes both for internal staff and for various clients.  I’ve also taught for Metis, O’Reilly Publishing and, for the last three years, at BlackHat, so I do have some experience in the matter.   I’ve looked at MANY data science programs to see if what they are teaching lines up with what I’m teaching and I’d like to share some things which I’ve noticed which will hopefully help you build a better data science program.  My goal here is to share my mistakes and experiences over the years and hopefully if you are building a data science training program, you can learn from what I learned the hard way.  I make no claims to be the perfect data science instructor, and I’ve made plenty of mistakes along the way.

While I’m at it, I’ll put in a plug for an upcoming data science class which I am teaching with Jay Jacobs of BitSight Security at the O’Reilly Security Conference in NYC, October 29-30th.

Really, data science instruction is an optimization problem: as an instructor, your goal is to minimize confusion whilst maximizing understanding.  To do this, you must remove from the students’ path as many obstacles that create dissonance as possible.  This may seem silly, but I have observed that if you have small errata in your code, or your code doesn’t work on their machine, even due to something they did, it significantly detracts from their learning experience and their opinion of you as an instructor.  Therefore, removing all these obstacles to understanding is vital to your success as an instructor.

The Difference between Software Development and Data Science

I am fortunate enough to get regular messages from recruiters on LinkedIn asking to speak with me about software development jobs.  Here’s the thing… I’m not a software developer, I do data science and data analytics.  For the last seven years, my job title has included the words “data” and “scientist” in the title.  I have never held a position with the words “Software” and “Developer” in the title.  I have taught and am currently teaching classes with titles such as “Data Science for Security Professionals” and “Applied Data Science for Security”.   All of this is on my LinkedIn profile, yet despite this, the messages continue.

On some level, it makes sense.  If you look at my resume, you’d see that I have a degree in computer science, experience with various coding languages, and projects on GitHub.  Hell, I’m a committer for Apache Drill…

So what’s the difference between a data scientist and software developer?

Academics and Data Science

I received the following comment on an article, Let’s Stop Using the Term Fake Data Scientist, and thought it merited a response.  Usually the comments I receive are constructive even if they disagree with what I wrote, but this particular comment demonstrated an arrogance which I believe is a huge problem in the data science world.

You can of course read the original article here, but the basic point was that data science is an interdisciplinary field–consisting of a mixture of computer science, applied mathematics, and subject matter expertise, with a smattering of data visualization and communication skills.   I believe that it is inappropriate to label someone as a fake simply because their skillset is proportioned differently than many math-centric data scientists.  I’m also a believer in Dr. Carol Dweck’s thesis on having a growth-oriented mindset (as stated in her book Mindset) and that people who might be working in data science but whose skills need development in a certain area should be given instruction and assistance rather than derogatory labels.

The End of Privacy As We Know It

In the news on Friday I saw a series of articles about a recent change in communication rules, rejected by the Senate, that would have prohibited ISPs from selling your browsing histories.  I understand why ISPs would want to monetize this data; after all, it would be extremely valuable to online advertisers seeking to serve ads more accurately.  But I think it should give us pause to ask the question: is this in fact ethical?

While there really is no 1 to 1 comparison, the closest thing(s) would be either the telephone company selling your call records, or the post office (or other courier services such as UPS) aggregating and selling the information on the outside of your mail.  I would strongly suspect that most people, if asked, would certainly not want their communication records sold to the highest bidder and yet that is precisely what Congress is allowing.

What Does This Mean for Privacy?

If ISPs are allowed to sell your browsing histories, I don’t believe that it is overstating things to say that this represents the end of privacy on the internet.  We didn’t have much privacy on the internet these days anyway, but if the ISPs are allowed to sell browsing records, it’s pretty much over.

With that said, it is difficult to discern exactly what is going to be allowed under the new rule change, but if I’m reading the news articles correctly it will allow ISPs to sell records of metadata of your web browsing.  To a competent analyst, this data would be a virtual gold mine for targeted advertising and all sorts of other services, none of which are really beneficial to the individual.   As I’ve shown in my Strata talks about IoT data, (here and here) if you gather enough seemingly innocuous data about an individual, it is entirely possible to put together a very accurate picture of their life.  From my own experience, if you were to look at my browsing history for a few months, you could very easily determine things like when my bills are due, what companies I do business with, when I go to work/bed, what chat services I use, things I may be interested in buying, what places I’m interested in visiting, etc.  The bottom line is that I consider my web browsing to be personal.  I don’t want to share that with anyone, not because I have something to hide, but rather because I want the choice.  I see no benefit whatsoever to the consumer in this rule change.
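To make that concrete, here is a minimal sketch in Python of the kind of inference I mean.  Everything in it is made up for illustration: the records, the `.example` domains and the function names `active_hours` and `recurring_visits` are my own assumptions, not real data and not any ISP’s actual method.  It looks only at timestamps and domain names, which is exactly the sort of metadata at issue.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Hypothetical metadata records: (timestamp, domain). No page content at all.
records = [
    ("2017-01-03 07:05", "news.example"),
    ("2017-01-03 22:40", "chat.example"),
    ("2017-01-15 20:10", "powerco.example"),
    ("2017-02-04 07:12", "news.example"),
    ("2017-02-15 19:55", "powerco.example"),
    ("2017-02-20 22:15", "chat.example"),
    ("2017-03-15 20:30", "powerco.example"),
]


def active_hours(records):
    """Return the set of hours of the day in which any traffic was seen."""
    return {datetime.strptime(ts, FMT).hour for ts, _ in records}


def recurring_visits(records, min_months=3):
    """Return sorted (domain, day_of_month) pairs that were seen on the
    same day of the month in at least `min_months` distinct months."""
    months_seen = {}
    for ts, domain in records:
        t = datetime.strptime(ts, FMT)
        months_seen.setdefault((domain, t.day), set()).add((t.year, t.month))
    return sorted(
        key for key, months in months_seen.items() if len(months) >= min_months
    )


# Hours with traffic hint at when someone is awake and online.
print("Hours with traffic:", sorted(active_hours(records)))
# A domain visited on the same day every month looks a lot like a bill.
print("Possible bill dates:", recurring_visits(records))
```

Even with seven made-up records, the hours with traffic hint at a daily routine, and the one domain visited on the same day each month looks a lot like a bill.  A real browsing history contains vastly more records, so the picture only gets sharper.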

What can you do to protect your privacy?

Unfortunately, there really aren’t a lot of options.  From a technical perspective, there are a few ways to preserve your privacy, none of them great.  It is not possible to keep the ISPs from getting your data, but you can make that data useless with TOR and VPNs.

  • Virtual Private Network (VPN):  VPNs have been traditionally used by corporations to allow remote access into private networks using the public internet.  VPNs create a secure tunnel between your computer and a proxy server, and your web traffic then passes through that server–which can be anywhere in the world.  For those of you who don’t work for large corporations, there are free and paid VPNs that you can use to access the web; however, I would avoid any free VPN service as they are likely making money by, you guessed it, collecting web traffic and analyzing it.   VPNs may seem like an ideal countermeasure, but there are issues with VPNs as well.  For starters, you are adding bottlenecks and complexity and hence losing speed.  Secondly, many sites–particularly sites that have geographically based licensing such as Netflix–block traffic from VPNs.   VPNs don’t make you anonymous but they can make your data much more difficult to collect.
  • TOR:  TOR stands for The Onion Router (https://en.wikipedia.org/wiki/Tor_(anonymity_network)) and it is similar to a VPN, but instead of using one proxy server, TOR uses a series of encrypted relays, which makes traffic much more difficult to trace.  TOR has been used in many countries to successfully evade internet censorship.  TOR has the added benefit of allowing anonymous browsing; however, it does introduce additional complexity into your browsing.  There also is a speed penalty for using TOR and you will find that you will not be able to access certain services using TOR.

Depending on how protective of your privacy you are, this may or may not matter, but it is important to understand that when using these technologies, guaranteeing your privacy depends on properly configuring them.  One small misconfiguration can expose your personal data.

I should also mention here that the so-called privacy modes that most browsers include do absolutely nothing to protect your privacy over the network.  Privacy mode erases your browsing history and cookies on your local machine, but you are still vulnerable to snooping over the network.

What else can I do?

This rule change represents a complete failure of government to do the thing it is really supposed to do–protecting the rights of its citizens.  It’s sad that the whole world was up in arms in response to Snowden’s revelations, and yet the silence is deafening in response to unlimited, widespread corporate surveillance.  Indeed, you have to read the hacker blogs (and my site) to find any kind of discussion of this issue.  This story got virtually zero coverage in the news media.

What is a real shame is that this appears to have become a partisan issue in that the vote in the Senate was a strict party-line vote.  It is entirely possible that the new Congress voted to repeal these rules simply because they were put in place by the previous administration.

At this point, the government is not looking out for its citizens’ interests in this regard and therefore it is upon individual citizens to take action to preserve our privacy.  In addition to the technical measures listed above here are some suggestions for what you can do:

  1. Contact your Congressional Representative(s) and Senator(s):  The Congressional switchboard number is 202-224-3121.  Always be courteous, professional and polite when speaking with Congressional Staff.   Be sure to convey why you are calling.  While it is unlikely that you will speak directly to your Senator or Congressman, their Staff have enormous influence and you should be respectful to them.  Make it clear that you do not welcome corporate surveillance.
  2. Educate Others:  I suspect that the reason this received so little attention is that the average person doesn’t really understand security, privacy and the consequences of this kind of data collection.  Therefore, it is incumbent upon those of us who work in data analytics and security to explain the implications of these policies in an understandable manner to non-technical people.

I would strongly urge everyone to do what they can to protest this rule change.  If we do nothing, we might wake up one day and find that our online privacy has ceased to exist.

 
