Interested in a mission-driven job ensuring perpetual open access to information for a global audience? Enjoy helping scale the use of services and products critical to hundreds of national and international non-profits, libraries, universities, cultural heritage institutions, and mission-aligned organizations? If so, the Internet Archive is seeking a Software Engineer for its Archiving & Data Services team.
The Internet Archive (IA) is a non-profit digital library, a top-200 website at archive.org, and an archive of over 80 PB of digital information running in multiple self-owned data centers. The Internet Archive also partners with organizations worldwide to advance the shared goal of “Universal Access to All Knowledge.” The Archiving & Data Services group provides a suite of paid and free products focused on the archiving, management, analysis, and accessibility of digital information. Its web archiving, digital preservation, and web and data services are used by over 800 organizations around the world.
The Archiving & Data Services team is seeking a Software Engineer to advance our crawling capabilities and extend our suite of services. The ideal candidate has experience operating a critical workflow, attention to detail, and broad experience with web protocols and technologies. A key function of the role will be to contribute to the team that manages our large-scale web crawling and data processing. In addition, the role will contribute to a variety of projects, including feature development for our web crawlers, API and microservices development, managing clustered deployments of harvesting and processing tools, assisting with ongoing operational improvements, and opportunities for other creative software projects within the department.
Our services enable users to archive web-published content at scale and to completeness, process terabytes and petabytes of archived data, facilitate the discovery and use of these archived collections, and curate and manage a diverse set of born-digital material. The Web Crawl Software Engineer will have the unique opportunity to build things that enable non-profit cultural heritage organizations around the world to build collections for future research, scholarship, and memory.
Essential Job Duties:
Contribute to, and often lead, complex web crawling projects.
Feature development, maintenance, and configuration for tool stacks related to web crawling, data-processing pipelines, digital preservation, access systems, open APIs, and databases.
Deliver on commitments with deadlines and project timelines, working in a collaborative, distributed team of junior and senior engineers.
Measure production system performance and benchmark new tools.
Participate in, and help propagate, engineering team procedures and ceremonies including testing, code review, documentation, retrospectives, team RFCs, et cetera.
Work closely with product, support, and other non-technical teams to translate requirements and features into technical designs. Strong communication and collaboration skills are a must.
Qualifications and Skills:
Significant experience with web protocols and technologies.
Experience with client service delivery is a plus.
Experience with web browser automation, UI test automation, web crawling/scraping is a plus.
Experience with development environments, system monitoring/administration tools, open source practices, version control, and code review.
Experience with Postgres, Elasticsearch, or Hadoop/HDFS is a plus.
Ansible, GitLab, GitHub, Sentry, Grafana, and JIRA are other tools we use.
Our independently operated data centers run Ubuntu Linux VMs and our department runs everything from the VM up, so deep Linux experience is a plus.
This is a remote-first position working in a distributed team. Candidates will need to have some time overlap with a primarily North America (and mostly Pacific Time) based distributed team for collaborative work and meetings. The role reports to the Senior Engineering Manager, Archiving & Data Services.
Benefits & Perks:
The Internet Archive is a remote-first workplace and provides a comprehensive benefits package including PTO, paid holidays, and medical benefits. Depending on where you live, we also provide these additional benefits: dental, vision, health savings accounts, flex spending accounts, commuter benefits, short-term disability, long-term disability, and retirement programs.
At the Internet Archive, we believe we do our best work when our employees bring together diverse ideas. Members of all groups underrepresented in the tech industry and library world are strongly encouraged to apply. We are proud to be an equal opportunity workplace and are committed to equal employment opportunity regardless of race, color, religion, national origin, age, sex, marital status, ancestry, physical or mental disability, genetic information, veteran status, gender identity or expression, sexual orientation, or any other characteristic protected by applicable federal, state, or local law.