Site Reliability Engineer
Who are we?
FalconX is one of the fastest-growing startups in FinTech. We are redefining prime brokerage from the ground up.
We are backed by some of the best investors in the world, including Accel, American Express, B Capital, Coinbase, Fidelity, Lightspeed Venture Partners, Fenbushi Capital, and Tiger Global Management, plus more yet to be publicly disclosed.
We deliver institutional digital asset traders best-in-class trading, credit, custody and structured products. We trade, lend and secure tens of billions of dollars monthly, are highly profitable, and growing fast, so we need your help!
We are data-driven. Whether it's a growth or product decision, we believe data can always help us make more precise and informed choices.
We move fast. Speed of execution is essential for any startup, but we believe this is even more pertinent in our 24/7 industry.
We prioritize learning. Outcomes are mission-critical, but we also believe that learning in success and in failure will drive our continued success. Our industry is emergent, so there’s no shortage of experiments to get involved with as we continue growing and learning together.
FalconX has offices in San Mateo, Chicago, New York, Bangalore, Malta, and Singapore.
Who is on the team?
We are entrepreneurs. Many in our company have been founders or have aspirations to eventually start their own company. We take these ambitions and experiences to bring a solutions-oriented mindset to the problems we encounter day-to-day.
We are experienced. We have been fortunate to have learned from mentors and peers at institutions such as Google, LinkedIn, JUMP Trading, Citadel, PEAK6 Investments, Goldman Sachs, Harvard Business School, Carnegie Mellon, IIT + more.
- Be part of an SRE team dedicated to an internal platform.
- Work closely with middleware teams and trading teams to improve system reliability, scalability, and security.
- Engage in and improve the infrastructure quality supporting the platform.
- Build and manage systems, infrastructure and applications through automation.
- Provide operational support to internal teams working on the platform.
- Work on improvements that increase efficiency, reduce latency, and speed up system deployments.
- Practice sustainable incident response and blameless postmortems.
- Together with your engineering team, you will share an on-call rotation and be an escalation contact for service incidents.
- BS with 5 years or MS with 3 years of working experience as a Site Reliability Engineer (SRE) or DevOps Engineer
- Experience with programming in at least one of the following languages: C, C++, Java, Python, or Go
- Experience working with trading systems at a high-frequency trading shop, investment bank, or crypto company
- Deep knowledge of Linux internals.
- Strong skills around observability, debugging and performance tuning, willing to dive into understanding, debugging, and improving any layer of the stack.
- Strong experience in managing infrastructure with providers like AWS.
- Strong experience with cloud-native technologies such as Kubernetes and Docker.
- Basic knowledge of relational databases (RDBMS) such as MySQL.
- Understanding of networking protocols such as TCP/IP.
System administrator
- OS setup and configuration, module installation, dependency and version tracking
- Help find modules, resolve missing dependencies or conflicts, set up new users, configure ulimit and core dumps
- Dev environment setup, help with customization (vim, git, terminal, etc.)
- OS upgrades as needed, understanding of OS stability and bug management
- Kernel / user services setup
- Shell scripts and automated jobs, help with shell scripting
- Work with DevOps on production environment setup and support
- Resolve system issues (dmesg, etc.)
HFT system administrator and network engineer
- Network setup, interfaces and services configuration
- Security and permissions understanding, DMZ understanding
- Switch and router configuration
- Linux routing tables and services, multiple IPs per interface, virtual network setup
- Experience with tcpdump, collecting traffic statistics, and understanding network latency and bottlenecks
- HFT OS configuration: tickless cores, core isolation, IRQ binding, system process affinity, NUMA PCI configuration, NUMA memory configuration, huge pages setup, Intel CAT setup.
- HFT OS configuration validation, including the ability to read system events (thread context switching, IRQ rate per core).
- Module and kernel customization and build.
- Experience with HFT network cards (Solarflare or other), 10G and 100G Ethernet, traditional markets colo, and Corvil; understanding of Ethernet fragmentation, packet gaps, and frame formats
- Kernel bypass setup (ef_vi, OpenOnload), experience with FPGA network cards