Awesome AWS Now on GitHub!

A curated list of awesome Amazon Web Services (AWS) libraries, open source repos, guides, blogs, and other resources [...]

October 25th, 2015 | Categories: Cloud, Data, GitHub

SAWS, A Supercharged AWS CLI, Now on GitHub!

Interactive command line interface that aims to supercharge the AWS CLI with features focusing on improving ease-of-use and increasing productivity. Under the hood, SAWS is powered by the AWS CLI and supports the same commands and command structure [...]

September 20th, 2015 | Categories: Cloud, Data, GitHub

Dev Setup Now on GitHub!

Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based defaults for Mac OS X [...]

August 9th, 2015 | Categories: Cloud, Data, GitHub, Mobile

Data Science Python Notebooks Now on GitHub!

Continually updated Data Science Python Notebooks: Spark, Hadoop MapReduce, HDFS, AWS, Kaggle, scikit-learn, matplotlib, pandas, NumPy, SciPy, and various command lines [...]

June 6th, 2015 | Categories: Cloud, Data, GitHub

Python Hadoop MapReduce: Analyzing AWS S3 Bucket Logs with mrjob

mrjob lets you write MapReduce jobs in Python 2.5+ and run them on several platforms. You can write multi-step MapReduce jobs in pure Python, test them on your local machine, run them on a Hadoop cluster, or run them in the cloud using Amazon Elastic MapReduce (EMR) [...]
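
To give a feel for the map and reduce steps such a job is built from, here is a minimal plain-Python sketch that counts requests per operation. It does not use mrjob's API, and both the three-field log format and the sample lines are hypothetical simplifications for illustration; real S3 access logs carry many more fields.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit (operation, 1) for each log line.

    Assumes a hypothetical simplified format:
    "<remote_ip> <operation> <http_status>"
    """
    _ip, operation, _status = line.split()
    yield operation, 1


def reducer(key, values):
    """Reduce step: sum the counts emitted for one key."""
    yield key, sum(values)


def run_local(lines):
    """Run map, group-by-key, and reduce in memory, like a local test run."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


# Illustrative sample lines, not real log data.
sample_lines = [
    "192.0.2.1 REST.GET.OBJECT 200",
    "192.0.2.2 REST.PUT.OBJECT 200",
    "192.0.2.1 REST.GET.OBJECT 404",
]

counts = run_local(sample_lines)
print(counts)
```

A framework like mrjob handles the grouping and distribution shown in `run_local`, so the job itself only has to supply the map and reduce logic.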

May 17th, 2015 | Categories: Cloud, Data

Tableau 9 Features: Impressions from Beta

With the final beta in the hands of testers, I thought I would give a quick overview of my favorite features in Tableau 9 [...]

March 24th, 2015 | Categories: Data

Setting Up Splunk for AWS

I recently hooked up Splunk with AWS to search, monitor, and analyze log files. Splunk indexes data on read and allows for super-fast searching and visualization. I like to think of Splunk as Google Search for log files, with visualization built in [...]

February 1st, 2015 | Categories: Cloud, Data

A Brief Introduction to R Unit Testing with test_that

Testing is a vital part of software development. I've always been a fan of Test-Driven Development (TDD) and use nose for my Python data analysis projects. I've recently hooked up test_that to my R Snippets Repo [...]

January 18th, 2015 | Categories: Data

Website Redesign and Jekyll Mirror

I had some free time over the break between Christmas and New Year's and finished overhauling my personal website donnemartin.com. I've also started to build up its mirror site donnemartin.github.io, powered by Jekyll. I love that I can use my existing developer tools to generate content (Sublime Text, Terminal, GitHub). [...]

January 1st, 2015 | Categories: Cloud, Data, Mobile

Speeding Up Hadoop MapReduce Jobs with S3DistCp

When optimizing Hadoop MapReduce jobs on AWS Elastic MapReduce (EMR), you often tweak the EC2 instance type and the number of instances to obtain the optimal number of mappers. More data = more splits = more mappers. EC2 instances vary in the number of mappers they can run in parallel: for example, an m1.xlarge can run 6-8 mappers in parallel, whereas an m1.small can run at most 2. Input file size can also have a significant impact on job length because of mapper setup time [...]
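
The "more data = more splits = more mappers" arithmetic can be sketched in a few lines of Python. This is a simplified model, not how EMR schedules work internally: it assumes each input file is split independently, a 64 MB split size (an assumed value; check your cluster's configuration), and that every mapper takes about the same time. The function names are made up for illustration.

```python
import math


def estimate_mappers(file_sizes_mb, split_size_mb=64):
    """Estimate map tasks, assuming each file is split on its own.

    split_size_mb=64 is an assumed value, not read from any cluster.
    """
    return sum(math.ceil(size / split_size_mb) for size in file_sizes_mb)


def estimate_waves(num_mappers, instances, mappers_per_instance):
    """Estimate how many rounds of mappers the cluster must run."""
    parallel_slots = instances * mappers_per_instance
    return math.ceil(num_mappers / parallel_slots)


# The same 1,024 MB of input, stored two different ways.
many_small_files = [1] * 1024   # 1,024 files of 1 MB each
few_large_files = [128] * 8     # 8 files of 128 MB each

small_mappers = estimate_mappers(many_small_files)
large_mappers = estimate_mappers(few_large_files)

# Four m1.small instances at 2 parallel mappers each (the figure quoted above).
small_waves = estimate_waves(small_mappers, instances=4, mappers_per_instance=2)
large_waves = estimate_waves(large_mappers, instances=4, mappers_per_instance=2)

print(small_mappers, small_waves)  # 1024 mappers, 128 waves
print(large_mappers, large_waves)  # 16 mappers, 2 waves
```

Under these assumptions the small-file layout pays the mapper setup cost 1,024 times instead of 16, which is why input file size matters so much for job length.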

December 20th, 2014 | Categories: Cloud, Data