Awesome AWS Now on GitHub!

A curated list of awesome Amazon Web Services (AWS) libraries, open source repos, guides, blogs, and other resources [...]

October 25th, 2015 | Categories: Cloud, Data, GitHub

SAWS, A Supercharged AWS CLI, Now on GitHub!

Interactive command line interface that aims to supercharge the AWS CLI with features focusing on improving ease-of-use and increasing productivity. Under the hood, SAWS is powered by the AWS CLI and supports the same commands and command structure [...]

September 20th, 2015 | Categories: Cloud, Data, GitHub

Dev Setup Now on GitHub!

Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based defaults for Mac OS X [...]

August 9th, 2015 | Categories: Cloud, Data, GitHub, Mobile

Data Science Python Notebooks Now on GitHub!

Continually updated Data Science Python Notebooks: Spark, Hadoop MapReduce, HDFS, AWS, Kaggle, scikit-learn, matplotlib, pandas, NumPy, SciPy, and various command lines [...]

June 6th, 2015 | Categories: Cloud, Data, GitHub

Python Hadoop MapReduce: Analyzing AWS S3 Bucket Logs with mrjob

mrjob lets you write MapReduce jobs in Python 2.5+ and run them on several platforms. You can: write multi-step MapReduce jobs in pure Python, test on your local machine, run on a Hadoop cluster, run in the cloud using Amazon Elastic MapReduce (EMR) [...]
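To make the map and reduce steps concrete, here is a minimal sketch in plain Python. It does not use mrjob's API; it only imitates the mapper, shuffle, and reducer flow that such a job follows. The log line layout (client IP, operation, object key) and the names `mapper`, `reducer`, `run_job`, and `SAMPLE_LINES` are hypothetical and chosen for illustration, not taken from the real S3 log format.

```python
from collections import defaultdict

# Hypothetical, simplified log lines: "<client-ip> <operation> <object-key>".
# Real S3 bucket logs carry many more fields; this layout is an assumption.
SAMPLE_LINES = [
    "10.0.0.1 GET photos/a.jpg",
    "10.0.0.2 PUT photos/b.jpg",
    "10.0.0.1 GET photos/b.jpg",
    "10.0.0.3 DELETE photos/a.jpg",
    "10.0.0.2 GET photos/a.jpg",
]


def mapper(line):
    """Emit (operation, 1) for each well-formed log line."""
    fields = line.split()
    if len(fields) >= 2:
        yield fields[1], 1


def reducer(key, values):
    """Sum the counts collected for one operation."""
    yield key, sum(values)


def run_job(lines):
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):
        for out_key, total in reducer(key, grouped[key]):
            results[out_key] = total
    return results


counts = run_job(SAMPLE_LINES)
print(counts)
```

A framework such as mrjob would take over the grouping step and distribute the mapper and reducer calls across machines; the single-process `run_job` here stands in for that only so the flow can be read end to end.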

May 17th, 2015 | Categories: Cloud, Data

Setting Up Splunk for AWS

I recently hooked up Splunk with AWS to search, monitor, and analyze log files. Splunk indexes data on read and allows for super-fast searching and visualization. I like to think of Splunk as Google Search for log files with visualization built-in [...]

February 1st, 2015 | Categories: Cloud, Data

Website Redesign and Jekyll Mirror

I had some free time over the Christmas to New Year's break and finished overhauling my personal website, donnemartin.com. I've also started building its mirror site, donnemartin.github.io, powered by Jekyll. I love that I can use my existing developer tools (Sublime Text, Terminal, GitHub) to generate content. [...]

January 1st, 2015 | Categories: Cloud, Data, Mobile

Speeding Up Hadoop MapReduce Jobs with S3DistCp

When optimizing Hadoop MapReduce jobs on Amazon Elastic MapReduce (EMR), you often tweak the EC2 instance type and number of instances to obtain the optimal number of mappers. More data = more splits = more mappers. EC2 instances vary in the number of mappers they can support in parallel: for example, an m1.xlarge can run 6-8 mappers in parallel, whereas an m1.small can run only up to 2. Input file size can also have a significant impact on job length because of mapper setup time [...]
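The "more data = more splits = more mappers" arithmetic can be sketched roughly as follows. The 64 MB split size, the 10 GB input, and the 10-instance cluster are assumptions made for illustration; the per-instance mapper counts (2 for the small instance, 8 as the upper end for the extra-large one) come from the excerpt above. The function names are hypothetical, and the sketch ignores the case of many small files, where each file gets at least one mapper of its own.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def estimate_mappers(total_input_bytes, split_size_bytes):
    """One mapper per input split, assuming large, splittable input."""
    return math.ceil(total_input_bytes / split_size_bytes)


def estimate_waves(num_mappers, num_instances, mappers_per_instance):
    """Number of rounds needed when only a fixed number of mappers run at once."""
    slots = num_instances * mappers_per_instance
    return math.ceil(num_mappers / slots)


# Assumed workload: 10 GB of input with a 64 MB split size.
mappers = estimate_mappers(10 * GB, 64 * MB)

# Assumed cluster: 10 instances of each type.
waves_small = estimate_waves(mappers, num_instances=10, mappers_per_instance=2)
waves_xl = estimate_waves(mappers, num_instances=10, mappers_per_instance=8)

print(mappers, waves_small, waves_xl)
```

Under these assumptions the job needs 160 mappers, which take 8 waves on the small instances and 2 waves on the extra-large ones. Each wave pays the mapper setup time again, which is why input sizing matters as much as instance choice.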

December 20th, 2014 | Categories: Cloud, Data

S3cmd: Frequently Used Commands

Before I discovered S3cmd, I had been using the S3 console to do basic operations and boto to do more of the heavy lifting. However, sometimes I just want to hack away at a command line to do my work. I’ve found S3cmd to be a great command line tool for interacting with S3 on AWS. S3cmd is written in Python, is open source, and is free even for commercial use. It offers more advanced features than those found in the AWS CLI [...]

December 4th, 2014 | Categories: Cloud

My Reading List

I’ve started populating my Reading List! Updates will trickle in over the coming weeks. I’m an avid reader, consuming 1-2 books each month in addition to tech articles, tutorials, and university course materials [...]

November 29th, 2014 | Categories: Cloud, Data, Mobile