The Dataist Posts

Visualize Anything with Superset and Drill

Happy New Year everyone! I’ve been taking a bit of a blog break after completing Learning Apache Drill, teaching a few classes, and some personal travel, but I’m back now and have a lot planned for 2019! One of my long-standing projects is to get Apache Drill to work with various open source visualization and data flow tools. At the Strata conference in San Jose in 2016, I attended Maxime Beauchemin’s talk (slides available here), where he presented the tool then known as Caravel, and I was impressed, really really impressed. I knew that my mission after the conference would be to get this tool to work with Drill. A little over two years later, I can finally declare victory. Caravel went through a lot of evolution. It is now an Apache Incubating project and the name has changed to Apache (Incubating) Superset.

Back to BlackHat…For the 5th Time!!

Happy belated New Year everyone! I’ve been taking a bit of a blog break as I’ve been quite busy between work, personal travel, and working on my startup GTK Cyber. But I’m back now and have some exciting news! My team and I have been accepted to teach our Applied Data Science course once again at BlackHat in Las Vegas! This year we’ve made a major change to our course: it’s now a full four days instead of two!

So You Want to Write a Book…

Well, we did it.  I finally finished the book that I had been working on with my co-author for the last two years.  I thought I’d write a short post on my experiences writing a technical book and getting it published.  I know many people think about writing books, and I’d like to share my experiences so that others might learn from lessons that I learned the hard way.  Overall, it was an absolutely amazing experience and I have a feeling that the adventure is only beginning….

Why don’t Data Scientists use Splunk?

I am currently attending the Splunk .conf in Orlando, and a director at Accenture asked me this question, which I thought merited a blog post:  Why don’t data scientists use or like Splunk?  The inner child in me was thinking, “Splunk isn’t good at data science”, but the more seasoned professional in me actually articulated a more logical and coherent answer, which I thought I’d share whilst waiting for a talk to start.  Here goes:

I cannot pretend to speak for any community of “data scientists” but it is true that I know a decent number of data scientists, some very accomplished and some beginners, and not a one would claim to use Splunk as one of their preferred tools.  Indeed, when the topic of available tools comes up among most of my colleagues and the word Splunk is mentioned, it elicits groans and eye rolls.  So let’s look at why that is the case:

Can you use Machine Learning to detect Fake News?

Someone recently asked me for assistance with a university project whereby they were asked to predict whether a given article was fake news or not.  They had a target accuracy of 70%.  Since the topic of fake news has been in the news a lot, it made me think about how I would approach this problem and whether it is even possible to use machine learning to identify fake news.  At first glance, this problem might seem comparable to spam detection; however, the problem is actually much more complicated.  In an article on The Verge, Dean Pomerleau of Carnegie Mellon University states:

“We actually started out with a more ambitious goal of creating a system that could answer the question ‘Is this fake news, yes or no?’ We quickly realized machine learning just wasn’t up to the task.” 

Drilling Security Data

Last Friday, the Apache Drill project released Drill version 1.14, which has a few significant features (plus a few that are really cool!) that will enable you to use Drill for analyzing security data.  Drill 1.14 introduced:

  • A logRegex reader, which enables Drill to read anything you can describe with a regex
  • An image metadata reader, which enables you to query images
  • A suite of GIS functionality
  • A collection of phonetic and string distance functions, which can be used for approximate string matching

This suite of functionality really expands what is possible with Drill and makes analysis of many different types of data possible.  This brief tutorial will walk you through how to configure Apache Drill to query log files, or really any file that can be matched with a regex.
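
To give a feel for the idea behind a regex-based reader, here is a minimal Python sketch.  To be clear, this is not Drill's implementation or its configuration syntax; it only illustrates the general technique of using a regex's capture groups as columns.  The log format, the pattern, the column names, and the parse_log_lines helper are all hypothetical and made up for illustration.

```python
import re

# Hypothetical log format (an assumption for this sketch):
#   <date> <time> <pid> <level> <message>
LOG_PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+(\d+)\s+(\w+)\s+(.+)"
)

# One column name per capture group, in order (names are illustrative).
COLUMNS = ["event_date", "event_time", "pid", "level", "message"]


def parse_log_lines(lines):
    """Turn matching lines into dict rows; collect non-matching lines separately."""
    rows, unmatched = [], []
    for line in lines:
        match = LOG_PATTERN.fullmatch(line.strip())
        if match:
            # Pair each captured group with its column name.
            rows.append(dict(zip(COLUMNS, match.groups())))
        else:
            unmatched.append(line)
    return rows, unmatched


sample = [
    "2018-08-03 12:01:07 4242 ERROR Failed login for user admin",
    "2018-08-03 12:01:09 4242 INFO Session opened for user bob",
    "this line does not follow the format",
]

rows, unmatched = parse_log_lines(sample)

# Once lines are rows, filtering looks a lot like a WHERE clause.
errors = [row["message"] for row in rows if row["level"] == "ERROR"]
```

Once every line is a row with named fields, questions like "show me all the ERROR messages" become simple filters, which is roughly the appeal of querying log files with SQL.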

Book Review: Technically Wrong

I recently completed Technically Wrong by Sara Wachter-Boettcher.  Let me start by saying that I’m glad that Ms. Wachter-Boettcher wrote this book. The tech industry has a lot of issues which need to be brought out into the open, and it is definitely a positive development that people such as Ms. Wachter-Boettcher are bringing these issues to the forefront.  It really is only recently that people have been discussing the continuous erosion of privacy, misogyny in the tech industry, lack of diversity and many other issues. Whilst I would not deny any of these issues, I felt Wachter-Boettcher’s analysis was somewhat lacking and didn’t really get at the realities of working in the tech industry.  Wachter-Boettcher cites numerous examples of tech gone wrong, such as a smart scale telling a two-year-old that he needs to lose weight, Facebook denying a Native American person an account because it felt that their name was not legitimate, and the abhorrent use of proprietary, black box algorithms to make parole recommendations.

Again, it is definitely a positive development that Wachter-Boettcher and others are writing about these issues, but the alternatives and solutions she proposes seem a bit simplistic.   While she doesn’t state this directly, much of the book seems to suggest that all of technology’s woes are caused by the lack of diversity in the tech industry.  Specifically, that “white guys” from elite universities are running everything.  I don’t have an electronic copy of the book, but about halfway through, I wanted to count the number of times the phrase “white guys” appears in the book.  Sometimes this phrase includes Asians, sometimes not.

Apple’s Newly Declared War on Data Collection (and Facebook?)

In the last week, beneath all the Trump and Kim Jong Un reporting, were several stories stating that Apple has in effect declared war on data collectors.  Make no mistake, what Apple is doing is making it significantly harder for companies big and small to collect your personal data.  The significance of this cannot be overstated: the revenue of many companies, like Google and Facebook, is based on selling targeted advertising, and if gathering this data becomes significantly more difficult, it could affect their bottom lines.

The First Volley:  No More Comments and Share Buttons

Last week, I was listening to the keynotes at the WWDC, and overall was pretty unimpressed as exec after exec droned on about new animojis or some other feature that I really didn’t care about.  And then Craig Federighi launched the first volley: Safari is going to block Facebook and other social media like and share buttons as well as shared comment sections.  Facebook, Twitter and other sites use these buttons to track your activity when you are visiting other sites.  While it isn’t that big of a deal that this is happening on macOS, it is VERY significant that Apple is instituting this change on iOS as well.  When I heard this, I was pretty shocked, but that was only the first volley.  There were more to come.

Adventures and Misadventures in Data Science Interviews

I’ve been waiting for some time to publish this, but I wanted to write about my experiences interviewing for data science jobs. Here’s my story: I worked at Booz Allen for nearly seven years, but I felt it was time for a change. I very much like Booz Allen as a company and if anyone is interested in working there, please don’t hesitate to contact me.  But I felt I was ready for different challenges and started looking for work elsewhere.

Now that I have started a new position, I thought I’d share some observations about what I learned from interviewing at numerous companies. I wasn’t tracking how many companies I interviewed with, but it was a lot. I have a lot of government experience and got a number of offers from government contracting firms. However, I came to the conclusion that in terms of career progression, joining another government contracting firm was not what I was looking for.

So here’s what I learned…

My Ideal Workspace

More and more research is showing that the open office design actually reduces productivity (here) and (here).  I recently shared a post on LinkedIn about how GitHub “de-broed” their workspace, and I thought I’d also share my thoughts on what I like, and don’t like, in a work space.  Above is a picture of my home office with some labels.  Not specifically labeled is that there is plenty of natural light.  One of the most depressing places I ever worked was a windowless cube farm where the developers liked to leave the lights off.  I was going out of my mind!!

  1. A Door:  My ideal workspace has a door so that when privacy is needed, I can close the door and when it is not, I can open it.
  2. A clock:  I know computers have clocks, but having a big visible clock is really helpful for making sure things run on time.
  3. A comfortable chair, with foot rest:  If I’m doing tech work for a long time, I don’t want to be sitting on something that will cause trips to the chiropractor.
  4. Big Monitors:  I’m a big fan of multiple, large monitors, as they really increase productivity.
  5. Music:  I like to listen to music, especially when coding.  When I’m working in more public spaces, I have headphones…
  6. Stress Relief:  I play trombone and when things get stressful, one can always reduce some stress by playing some Die Walküre …. LOUDLY.
  7. Lots of Geek Books:  Nothing sets the stage for coding like being surrounded by O’Reilly geek books.
  8. Family Photos or other Personal Items:  I do my best work in a space that feels like my own, so I think it is important that people can keep some of their personal items in their space.   Hence… I’m not a fan of hoteling or workspaces that set people up to work on large tables.

What do you like in a work space?
