How to Extract Gene Family Protein Sequences from Arabidopsis TAIR10 Genome Using Excel and TBtools
Introduction
Genome-wide gene family analysis is a foundational step in plant bioinformatics and functional genomics. One of the most common preliminary requirements for such studies is the extraction of protein sequences belonging to a specific gene family from a reference genome. Arabidopsis thaliana (TAIR10) serves as a gold-standard model plant genome and is widely used for method development and comparative genomics.
This article provides a practical, beginner-friendly, code-free workflow for extracting gene family protein sequences from the Arabidopsis TAIR10 genome using Microsoft Excel and TBtools. The approach suits MSc and PhD students, plant molecular biologists, and other researchers who want accurate results without complex scripting.
Why Extract Gene Family Protein Sequences?
Extracting gene family protein sequences is essential for:
- Phylogenetic analysis
- Conserved motif identification (MEME)
- Gene structure analysis
- Functional annotation (GO and KEGG)
- Comparative genomics and evolutionary studies
Before performing any of these downstream analyses, researchers must generate a clean FASTA file containing only the protein sequences of the target gene family.
Required Data and Tools
Data
- Arabidopsis TAIR10 protein FASTA file (.faa)
- List of gene family IDs (e.g., MYB, NAC, WRKY genes)
Software
- Microsoft Excel (for ID preparation)
- TBtools (for sequence extraction)
TBtools is a widely used graphical bioinformatics toolkit that allows sequence manipulation without command-line expertise.
Step 1: Download Arabidopsis TAIR10 Protein Sequences
- Visit the TAIR database or Ensembl Plants
- Download the TAIR10 protein FASTA file
- Ensure the FASTA headers contain standard Arabidopsis gene IDs (e.g., AT1G01010.1)
This file will serve as the reference protein dataset.
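For readers who want an optional scripted check of the header format, here is a minimal Python sketch. It assumes that a "standard" ID is an Arabidopsis locus code with an optional transcript suffix (e.g., AT1G01010 or AT1G01010.1) and that the ID is the first whitespace-delimited token of each header. The function names and the example records are made up for illustration.

```python
import re

# Assumed pattern for Arabidopsis locus IDs with an optional transcript
# suffix, e.g. AT1G01010 or AT1G01010.1 (chromosomes 1-5, C and M).
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}(\.\d+)?$", re.IGNORECASE)


def header_ids(fasta_lines):
    """Yield the first whitespace-delimited token of each FASTA header."""
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def nonstandard_ids(fasta_lines):
    """Return header IDs that do not match the assumed ID pattern."""
    return [seq_id for seq_id in header_ids(fasta_lines)
            if not AGI_PATTERN.match(seq_id)]


# Made-up records for illustration only (not real sequences).
example = [
    ">AT1G01010.1 made-up description",
    "MEDQVGFG",
    ">seq_0001 made-up header",
    "MAAS",
]
print(nonstandard_ids(example))
```

To check a real download, pass an open file handle instead of the example list. An empty result suggests that every header starts with an ID in the expected format.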
Step 2: Prepare Gene Family IDs Using Excel
2.1 Collect Gene Family IDs
Gene family IDs may be obtained from:
- HMMER search results
- Published literature
- PlantTFDB or similar databases
Example gene ID format: AT1G01010 (gene level) or AT1G01010.1 (transcript level, as in the FASTA headers).
2.2 Format IDs in Excel
- Paste all gene IDs into a single column
- Remove duplicates
- Ensure there are no extra spaces or hidden characters
- Save the file as a plain text (.txt) file, one gene ID per line
This step is critical for accurate sequence extraction.
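The same clean-up can be double-checked outside Excel. The sketch below is an optional illustration, not part of the Excel workflow itself. It assumes that "hidden characters" means a byte-order mark, zero-width spaces, and non-breaking spaces, and that converting IDs to upper case is acceptable. The function name `clean_ids` is hypothetical.

```python
def clean_ids(raw_lines):
    """Strip whitespace and hidden characters, drop blank lines and
    duplicates, and keep the IDs in their original order."""
    hidden = "\ufeff\u200b\u00a0"  # BOM, zero-width space, non-breaking space
    seen = set()
    cleaned = []
    for line in raw_lines:
        for ch in hidden:
            line = line.replace(ch, "")
        # Upper-casing is an assumption made for this sketch.
        gene_id = line.strip().upper()
        if gene_id and gene_id not in seen:
            seen.add(gene_id)
            cleaned.append(gene_id)
    return cleaned


# Illustrative input containing a BOM, a trailing space, a case variant,
# a blank line, and a non-breaking space before a Windows line ending.
raw = ["\ufeffAT1G01010.1 ", "at1g01010.1", "", "AT1G01020.1\u00a0\r\n"]
print(clean_ids(raw))  # ['AT1G01010.1', 'AT1G01020.1']
```

Writing the returned list to a text file with one ID per line gives the same kind of input file that this step describes.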
Step 3: Open TBtools and Load Protein FASTA
- Launch TBtools
- Navigate to Sequence Tools
- Select Fasta Extractor or Extract Sequences by ID
- Load the TAIR10 protein FASTA file
Step 4: Extract Gene Family Protein Sequences
- Upload the prepared gene ID text file
- Choose the option to match IDs using FASTA headers
- Specify output format as FASTA
- Run the extraction
TBtools will automatically scan the TAIR10 protein file and extract only the sequences corresponding to your gene family.
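To make the idea of ID-based extraction concrete, here is a small Python sketch of the same kind of operation. It does not show how TBtools works internally. It assumes exact matching between each requested ID and the first whitespace-delimited token of a FASTA header, and all function names and example records are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def first_token(header):
    """Return the ID part of a header (text before the first whitespace)."""
    parts = header.split()
    return parts[0] if parts else ""


def extract_by_id(records, wanted_ids):
    """Keep only records whose header ID is in wanted_ids (exact match)."""
    wanted = set(wanted_ids)
    return [(h, s) for h, s in records if first_token(h) in wanted]


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text with wrapped lines."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
    return "".join(line + "\n" for line in out)


# Made-up records for illustration only (not real sequences).
fasta_lines = [
    ">AT1G01010.1 made-up description", "MEDQ", "VGFG",
    ">AT1G01020.1", "MAAS",
    ">AT1G01030.1", "MDLS",
]
extracted = extract_by_id(read_fasta(fasta_lines),
                          ["AT1G01010.1", "AT1G01030.1"])
print(to_fasta(extracted), end="")
```

Because matching is exact in this sketch, a gene-level ID such as AT1G01010 would not match the header AT1G01010.1. This is why the ID format prepared in Step 2 matters.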
Step 5: Verify the Output FASTA File
After extraction:
- Check the number of sequences
- Confirm gene IDs match your input list
- Open the FASTA file to ensure proper formatting
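The first two checks can also be sketched in a few lines of Python for readers who prefer an automated comparison. This is an optional illustration that assumes the ID is the first whitespace-delimited token of each header. The function name `verify_extraction` and the example data are hypothetical.

```python
def verify_extraction(requested_ids, output_fasta_lines):
    """Compare the IDs found in an extracted FASTA with the requested IDs."""
    found = [line[1:].split()[0] for line in output_fasta_lines
             if line.startswith(">") and line[1:].split()]
    requested = set(requested_ids)
    return {
        "n_sequences": len(found),
        "missing": sorted(requested - set(found)),
        "unexpected": sorted(set(found) - requested),
        "duplicated": sorted({i for i in found if found.count(i) > 1}),
    }


# Made-up example: three IDs were requested, two sequences were extracted.
requested = ["AT1G01010.1", "AT1G01020.1", "AT1G01030.1"]
output_lines = [">AT1G01010.1 made-up description", "MEDQ",
                ">AT1G01030.1", "MDLS"]
report = verify_extraction(requested, output_lines)
print(report)
```

A non-empty `missing` list points to IDs worth checking against the troubleshooting table in this article.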
The resulting FASTA file is now ready for:
- Phylogenetic tree construction
- Motif and domain analysis
- GO and KEGG functional annotation
Common Problems and Solutions
| Problem | Solution |
|---|---|
| Missing sequences | Check gene ID format consistency |
| Zero output | Ensure IDs match FASTA headers |
| Duplicate sequences | Remove redundant IDs in Excel |
| Partial extraction | Use gene-level IDs instead of transcript IDs |
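Several rows in this table come down to the difference between gene-level IDs (AT1G01010) and transcript-level IDs (AT1G01010.1). The sketch below shows one way to compare IDs at the gene level. It assumes the transcript suffix is a dot followed by digits, and the function names are hypothetical. It is not a description of TBtools' own matching options.

```python
import re

# Assumed transcript suffix: a dot followed by one or more digits.
_SUFFIX = re.compile(r"\.\d+$")


def gene_level(seq_id):
    """Drop a trailing transcript suffix, e.g. AT1G01010.1 -> AT1G01010."""
    return _SUFFIX.sub("", seq_id.strip())


def match_by_gene(header_ids, gene_ids):
    """Return every header ID whose gene-level form was requested."""
    wanted = {gene_level(g) for g in gene_ids}
    return [h for h in header_ids if gene_level(h) in wanted]


# Illustrative header IDs; one gene has two transcript entries.
headers = ["AT1G01010.1", "AT1G01020.1", "AT1G01020.2", "AT1G01030.1"]
print(match_by_gene(headers, ["AT1G01020"]))  # ['AT1G01020.1', 'AT1G01020.2']
```

Matching at the gene level returns every isoform of a requested gene. If only one protein per gene is wanted, an additional selection step is needed.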
Applications in Genome-Wide Studies
This Excel + TBtools workflow is applicable to:
- Genome-wide transcription factor studies
- Comparative gene family analysis across species
- Stress-responsive gene identification
- Functional genomics and systems biology
Because Arabidopsis is often used as a reference, this method can also be adapted to non-model plant genomes.
Conclusion
Extracting gene family protein sequences from the Arabidopsis TAIR10 genome using Excel and TBtools is a simple, reliable, and reproducible approach. It eliminates the need for programming while maintaining high accuracy, making it ideal for teaching, training, and research applications.
The Excel and TBtools workflow streamlines the extraction of gene family protein sequences, which makes it especially useful for novice researchers. A clean, verified FASTA file also prepares the ground for further analyses such as phylogenetic tree construction and functional annotation, so this extraction is an essential early step in plant bioinformatics.

