How to understand large codebases for open-source contributions?

This blog is for beginners who are planning to get involved in open-source projects.
I am describing how I do it, and it may help you as well.

This is what my typical workflow looks like →

  • Read the readme file. I cannot emphasize this enough. It tells you about the project's features and gives you a high-level understanding of what the project actually does. You will also find many useful links there: the contributing.md file, which explains how to set up the project locally, plus community information such as Slack links, weekly/monthly meetings, guidelines, and much more. Do not skip this step. Once you start liking the project, you can join its community on Slack or whichever platform it uses.

  • Check out the contributing.md file. The majority of projects have it, and it usually contains a local setup guide, code formatting rules, the expected PR and commit message formats, the issue format, and how to communicate. The information in this file is usually enough to get the project running locally, which is where your actual development process begins.

  • See if there is a Makefile in the root directory. The readme and contributing.md files often show how to use it, but sometimes they do not. If a Makefile is present, look through it to see which commands the project defines. These can include linting, tests, builds, etc.

    Many projects do not use a Makefile and list their commands directly in the docs, so keep this in mind. Dockerfiles are also a good resource if they are present.

  • Once all the steps above are complete, look at which directories are present and try to understand what they do from their names and from the names of the subdirectories and code files inside. Projects built on the same stack tend to share a similar code structure. For example, if it’s a CLI project built in Go, it most probably uses the Cobra package and follows Cobra’s conventions. If it’s a Java project, it probably uses Maven as a build tool, with each directory acting as a specific module, and so on.
    If it’s a Helm chart, you will see the same Helm-prescribed structure in every project. The same idea holds for most projects and tech stacks.

  • Once you have an idea of the directories and the code inside them, start looking at the actual code. I personally start by reading random code files and trying to make sense of them. I also find the entry point of the project (main.java/main.go, etc.) and look at the code of every function called from it.
    Whatever your IDE is, whether it’s VS Code, a JetBrains IDE, or anything else, it has shortcuts to find the usages of a function and to jump to its source. Use those shortcuts to move through the code quickly and get an idea of what’s going on. The same shortcuts let you inspect functions from imported packages. Keep in mind that you do not need to know the entire codebase. Even maintainers often don’t have a complete picture of the code. Therefore, adopt a learning-by-doing approach.

  • By now, you will have a high-level understanding of the code structure and of where each component of the project lives. You can then browse the open issues, understand what each one is about, and judge whether resolving it is within your scope. You can also read others' pull requests and learn from them.
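
The early steps above (checking for the readme, contributing.md, a Makefile, a Dockerfile, and stack-specific files) can be sketched as a small script. This is only a rough sketch in Python: the `survey` function, the `MARKERS` table, and the regular expression for Makefile targets are illustrative names and simplifications, not part of any project's tooling, and the stack hints (for example, `go.mod` for Go or `Chart.yaml` for Helm) are rules of thumb rather than guarantees.

```python
import re
from pathlib import Path

# Marker files and what they usually signal. The hints are rules of thumb
# (assumptions for this sketch), not guarantees about any given project.
MARKERS = {
    "README.md": "project overview",
    "CONTRIBUTING.md": "contribution and local setup guide",
    "Makefile": "project commands (lint, test, build)",
    "Dockerfile": "container build steps",
    "go.mod": "Go module",
    "pom.xml": "Java project built with Maven",
    "Chart.yaml": "Helm chart",
}

# A simplified view of Makefile syntax: a rule starts at column 0 with a
# target name followed by a colon. ":=" and "::=" are variable assignments,
# so a "=" right after the colon is rejected. Special targets such as
# ".PHONY" and pattern rules such as "%.o" are skipped on purpose.
TARGET_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9_.-]*)\s*:(?!:?=)")


def survey(repo_root):
    """Return (found_markers, make_targets) for the top level of repo_root."""
    root = Path(repo_root)
    # Compare names case-insensitively: projects use README.md, readme.md, etc.
    present = {p.name.lower(): p for p in root.iterdir() if p.is_file()}
    found = {
        name: hint for name, hint in MARKERS.items() if name.lower() in present
    }

    targets = []
    makefile = present.get("makefile")
    if makefile is not None:
        text = makefile.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            match = TARGET_RE.match(line)
            if match and match.group(1) not in targets:
                targets.append(match.group(1))
    return found, targets
```

Calling `survey(".")` from the root of a cloned repository returns which of these files exist and which Makefile targets are defined. It does not replace reading the files; it only tells you where to look first.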

    You can use Whimsical or any similar website to draw flow diagrams that help you understand the code better. Pen and paper work just as well.
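
Before drawing a flow diagram, it helps to know where execution starts. Here is a minimal sketch, assuming a Go or Java project: it searches for `func main()` and `public static void main(`, skipping directories such as `vendor` that usually hold third-party or generated code. The function name `find_entry_points`, the skip list, and the two patterns are assumptions for illustration; your IDE's search can do the same job.

```python
import re
from pathlib import Path

# Patterns that usually mark where execution starts. They cover only the Go
# and Java cases; other languages would need their own entries.
ENTRY_PATTERNS = {
    ".go": re.compile(r"^func\s+main\s*\(\s*\)", re.MULTILINE),
    ".java": re.compile(r"public\s+static\s+void\s+main\s*\("),
}

# Directories that rarely contain the project's own source code
# (an assumption; adjust the set for the project you are reading).
SKIP_DIRS = {".git", "vendor", "node_modules", "target", "build"}


def find_entry_points(repo_root):
    """Return sorted (relative_path, line_number) pairs for likely entry points."""
    root = Path(repo_root)
    hits = []
    for path in root.rglob("*"):
        pattern = ENTRY_PATTERNS.get(path.suffix)
        if pattern is None or not path.is_file():
            continue
        rel = path.relative_to(root)
        # Ignore files that sit under any skipped directory.
        if SKIP_DIRS.intersection(rel.parts[:-1]):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in pattern.finditer(text):
            # Line numbers are 1-based: count the newlines before the match.
            line_number = text.count("\n", 0, match.start()) + 1
            hits.append((rel.as_posix(), line_number))
    return sorted(hits)
```

Each result is a file and line to open first. From there, the functions called inside `main` give you the first boxes of the diagram.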

    These are the common steps I follow for every project. If it seems overwhelming, remember that it ultimately comes down to hard work and consistency.

    That's it from my side. Please leave a comment if you have any questions, and I will answer them. Subscribe to my newsletter!

    Connect with me on LinkedIn | GitHub.

    Thanks for reading :)