The Spark Developer will report directly to the Director of Data Science and will be responsible for enhancing and maintaining Shaw’s feature engineering suite. The role focuses on extending the existing framework to leverage novel large-scale datasets and distributed technologies, supporting statistical model development that helps optimize business operations.
The ideal candidate is a problem-solver who thrives in a complex, diverse, and rapidly expanding data infrastructure. You’ll know how to fully exploit the potential of Spark and have broad familiarity with its APIs. You will clean, transform, and analyze vast amounts of raw data from various systems using Spark to provide ready-to-use data to our Data Scientists. You’ll need to be comfortable managing multiple deliverables and communicating results to management, as your work directly informs data science decision-making and the planning of strategic initiatives.
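To make that day-to-day concrete, here is a minimal PySpark sketch of the clean/transform/serve loop described above. Everything specific in it is hypothetical: the S3 paths, the raw events dataset, and the account_id, event_ts, and bytes_used columns are placeholders, not actual Shaw systems.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-prep").getOrCreate()

# Hypothetical raw event data; the path and schema are illustrative only.
raw = spark.read.parquet("s3://example-bucket/events_raw/")

# Clean: drop malformed rows and derive a date column from the timestamp.
clean = (
    raw.dropna(subset=["account_id", "event_ts"])
       .withColumn("event_date", F.to_date("event_ts"))
)

# Transform: aggregate per-account daily features for the data science team.
features = (
    clean.groupBy("account_id", "event_date")
         .agg(
             F.count("*").alias("event_count"),
             F.sum("bytes_used").alias("total_bytes"),
         )
)

# Serve: persist a ready-to-use dataset, partitioned for efficient reads.
features.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/features/daily_usage/"
)
```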
Qualifications:
- Bachelor's degree in a science or technology discipline.
- 3-5 years of technical experience working with large data sets.
- 3-5 years of experience working in distributed compute/storage environments, including the development and maintenance of streaming applications (Apache Kafka preferred).
- 3-5 years of experience with Python and PySpark, with a focus on large-scale data warehousing, data integration, and/or application development.
- 3-5 years of experience with Apache Spark, with an emphasis on query tuning and performance optimization (see the tuning sketch after this list).
- Demonstrated use of the Spark APIs (Spark 2.x/3.x), including RDDs, DataFrames/Spark SQL, MLlib, GraphX, and Streaming (illustrated in the streaming sketch below).
- Experience working with distributed file systems (HDFS, S3, etc.).
- AWS cloud services experience is an asset (EMR, EC2, etc.).
- Deep understanding of distributed systems (e.g., CAP theorem, partitioning, replication, consistency, and consensus).
- Strong knowledge and hands-on experience authoring and auditing advanced SQL and shell scripts.
- A commitment to the code review process, modern development practices (TDD, CI), and the authoring and curation of documentation.
- Background in software engineering with strong skills in parallel data processing, data flows, REST APIs, etc.
- Experience in relational database logical modeling (Oracle, Postgres, Snowflake, Teradata) and integrating these sources into distributed Spark-based workflows (see the JDBC sketch below).
- Excellent analytical and problem-solving skills.
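On the query-tuning point above, here is a minimal sketch of the kind of optimization work involved: hinting a broadcast join to avoid a shuffle and inspecting the physical plan. The table names, paths, and the judgment that the dimension table is small enough to broadcast are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Hypothetical tables: a large fact table and a small dimension table.
usage = spark.read.parquet("s3://example-bucket/usage/")   # large
plans = spark.read.parquet("s3://example-bucket/plans/")   # small

# Broadcasting the small side replaces a shuffle join with a map-side join.
joined = usage.join(F.broadcast(plans), on="plan_id", how="left")

# Inspect the physical plan to confirm a BroadcastHashJoin was chosen.
joined.explain()

# Repartition on the downstream grouping key to reduce later shuffles.
joined = joined.repartition("account_id")
```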
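For the streaming items, here is a sketch of a Structured Streaming job reading from Kafka. The broker address, topic, and sink paths are placeholders, and the job assumes the spark-sql-kafka connector package is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Hypothetical Kafka source; broker and topic names are placeholders.
stream = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "events")
         .load()
)

# Kafka delivers key/value as binary; cast the value to a string for parsing.
events = stream.select(F.col("value").cast("string").alias("payload"))

# Append incoming records to a parquet sink, with checkpointing for recovery.
query = (
    events.writeStream.format("parquet")
          .option("path", "s3://example-bucket/stream_out/")
          .option("checkpointLocation", "s3://example-bucket/checkpoints/")
          .start()
)
```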
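And on the relational-integration item, a sketch of pulling a modeled table from Postgres into a Spark workflow over JDBC. The host, credentials, table, and partition bounds are illustrative, and the Postgres JDBC driver is assumed to be available to the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

# Hypothetical Postgres source; host, credentials, and table are placeholders.
accounts = (
    spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://db-host:5432/warehouse")
         .option("dbtable", "public.accounts")
         .option("user", "spark_reader")
         .option("password", "********")
         # Parallelize the read across executors on a numeric key.
         .option("partitionColumn", "account_id")
         .option("lowerBound", "1")
         .option("upperBound", "1000000")
         .option("numPartitions", "8")
         .load()
)

# The table now participates in the distributed workflow like any DataFrame.
accounts.createOrReplaceTempView("accounts")
```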