overview
a list of study materials i believe to be “sources of truth” on their respective topics. it is not enough for an analytics professional to ‘generate insights’; the professional must also be able to (1) describe and communicate insights and recommendations and (2) have intelligent discussions about how the model(s) should be designed and supported in production. this requires expertise (or at least familiarity) in the following subjects:
- general scientific methods (how to set up, run, qualify and ensure experiment reproducibility)
- maths, statistics, probability theory, etc.
- scripting & package knowledge (Python, R and respective packages)
- visualization (packages, reporting tools, etc.)
- computer science / programming basics (version control, testing frameworks, design standards, etc.)
- data engineering (how are you ingesting, transforming and updating your data feed?)
- cloud architecture (where / how are your models going to be supported?)
given the pace at which technology changes, a portion of one’s time should be spent, to borrow a reinforcement learning reference, “exploring” rather than exploiting the currently mastered skill set; certain tools may solve a problem more efficiently with minimal time investment (e.g. shiny versus tableau or power bi mastery)
current topics of interest
- ml - the goldilocks’ fit (avoiding under / overfitting)
- ml - improving model testing and cross validation procedures in production
- ml - selecting & optimizing parameters & hyperparameters
- ml - objectives and loss-functions
- ml - proper utilization of label versus one hot encoding
- ml - missing data management (done the right way)
- py - improving management and use of scopes, classes, instances, methods
- py - data classes
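as a small illustration of the label versus one hot encoding topic above, here is a minimal sketch in plain python; the `colors` feature and its values are made up for the example, and real projects would more likely lean on a library encoder than hand-rolled code like this:

```python
# toy categorical feature; the values are invented for illustration
colors = ["red", "green", "blue", "green"]

# label encoding: map each category to an integer
# (categories are sorted so the mapping is deterministic)
categories = sorted(set(colors))
label_map = {cat: idx for idx, cat in enumerate(categories)}
label_encoded = [label_map[c] for c in colors]

# one hot encoding: one binary column per category, with no implied order
one_hot_encoded = [
    [1 if c == cat else 0 for cat in categories]
    for c in colors
]

print(label_encoded)    # one integer per row
print(one_hot_encoded)  # one binary vector per row
```

the label version implies an ordering (blue < green < red) that some models may treat as meaningful, while the one hot version avoids that at the cost of one extra column per category; which trade-off is acceptable depends on the model and the number of categories.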
resource links
- awesome production machine learning - curated list of open source libs to deploy, monitor, version and scale your machine learning
- data scientist roadmap - really well done repo by MrMimic that outlines fundamentals in various areas
- statistics fundamentals - modules ranging from point estimation to bayesian inference
- python best practices, green tea press - how to think like a (python) computer scientist
- hitchhiker’s guide to python, o’reilly - environment setup standards and generally how to write great python code
- hundred page machine learning book - Andriy is nice enough to offer this book on a ‘read first, buy later’ principle; i ended up purchasing the paperback copy… because why wouldn’t you want to support people doing good for the community?
- hundred page machine learning book - data & code samples - the code to go with the hundred page ml book’s examples
- ml prod deployment methods - overview of some deployment methods with varying environments & scenarios
- ml evaluation methods p1 - overview of basic evaluation metrics and methods for common model application scenarios
- ml evaluation methods p2 - overview of basic evaluation metrics and methods for common model application scenarios
- kaggle ensembling guide - methods of improving the accuracy of various ml tasks by joining models
- application scenarios - i have linked to this in previous posts; however, it is useful to understand where to apply the tools in your toolbox, and firmai provides the best list i have found to date
- managing missing data…the right way - many packages recommend you ‘fill in’ missing values with the mean, median, mode, etc., when in fact this is rarely the right path; a good response to this problem is summarized in this stack exchange answer
- google’s guide to ml ops
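to make the missing data item in the list above a bit more concrete, here is a minimal sketch of one commonly cited problem with filling in the mean: the mean stays the same but the spread shrinks. the `values` list is hypothetical, and this is not taken from the linked stack exchange answer, only an illustration using the standard library:

```python
import statistics

# hypothetical observations; None marks a missing value
values = [2.0, 4.0, None, 8.0, None, 10.0]

# keep only what was actually observed
observed = [v for v in values if v is not None]
mean_observed = statistics.mean(observed)

# naive approach: replace every missing value with the observed mean
imputed = [mean_observed if v is None else v for v in values]

# sample variance before and after the fill
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(f"mean: {mean_observed}, {statistics.mean(imputed)}")
print(f"variance: {var_observed:.2f} (observed) vs {var_imputed:.2f} (imputed)")
```

because every filled value sits exactly on the mean, the imputed column looks less variable than the data supports, which can make downstream estimates appear more certain than they should.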
research papers
good reads (and how to find more of them)