How to find a sequence from a txt file?

Gabriela on 15 Mar 2024
Commented: Voss on 15 Mar 2024
I'm trying to identify a specific primer sequence from three different text files. I already have this code:
clear;
clc;
% Open the file
fileID = fopen('cDNA1-1.txt', 'r');
% Read the DNA sequence from the file
dna_sequence = strfind(fileID, '%s');
%
DNAsequence=string(dna_sequence)
% Define the primer sequence
primer_sequence = 'TACG';
% Find the location of the primer sequence in the DNA sequence
primer_location = strfind(DNAsequence, primer_sequence);
% Display the location of the primer sequence
if isempty(primer_location)
disp('Primer sequence not found in the DNA sequence.');
else
disp(['Primer sequence found at position(s): ', num2str(primer_location)]);
end
However, for some reason my dna_sequence variable is empty, and I keep getting that the sequence is not found in any of the text files. I know that that's wrong, so I need help.
I will include the three txt files along with my code.
Thank you!

Accepted Answer

Voss on 15 Mar 2024
Edited: Voss on 15 Mar 2024
The use of strfind here is not correct:
% Read the DNA sequence from the file
dna_sequence = strfind(fileID, '%s');
It looks like you meant to use fscanf:
% Read the DNA sequence from the file
dna_sequence = fscanf(fileID, '%s');
With that change, the code appears to work (but don't forget to fclose the file once you've read it). Note that the sequence 'TACG' is found in files 2-1 and 3-1 but not in 1-1.
% Open the file
fileID = fopen('cDNA2-1.txt', 'r');
% Read the DNA sequence from the file
dna_sequence = fscanf(fileID, '%s');
% Close the file
fclose(fileID);
%
DNAsequence=string(dna_sequence)
DNAsequence = "GACGCGGCGCAGGCGGCGGGAGTGCGAGCTGGGCCCGTGTTTCGGCCGCCGCCATGGCCGCGGTGGACCTGGAGAAGCTGCGGGCGTCGGGCGCGGGCAAGGCCATCGGCGTCCTGACCAGCGGCGGCGACGCGCAAGGTCCCCTGACAAGCCCACCAGGCCCCCTGCTGAGATGGCTGTGACCCTGGGCTGACCCGCCCAGTGGCACATTGACTCCGCCTGGAGCTGGGGAGACCAGAGAGGCCCTGTGGTTGGACGGTGGCCTGGGTGCGCTGCTCCTGCCCTCTCCTTGCCCTGCCTCAGCTGCTGCCTGCCAGAGGCGTGGCACCTCACCTCACACCTGCTCCCTGCTGCTGAGCCCCACGCCAAGCTGGAGAGCGGATGAGAAGCATGTGTAACCAGGGTAGAGGTCGAGAGTCCTCTCGTGGGGGTCTCCATGTTCAAGGGAGCTGCCGAGGCTTGAGCAGGAGCCCCCAGCAGGAAACTGGCTTTGCCAAGGCCCCCGCTGGGACAGACTGTTTCTTTCACTGCAGTCCTGGGAGCCGAGGGCAAGGGGACAGGAAAGAGGAAGTGACCTCAGAGCCTGGTGGCACCAGCATCATGTCCAGGCTGGGGGGCATGAACGCTGCTGTCCGGGCTGTGACGCGCATGGGCATTTATGTGGGTGCCAAAGTCTTCCTCATCTACGAGGGCTATGAGGGCCTCGTGGAGGGAGGTGAGAACATCAAGCAGGCCAACTGGCTGAGCGTCTCCAACATCATCCAGCTGGGCGGCACTATCATTGGCAGCGCTCGCTGCAAGGCCTTTACCACCAGGGAGGGGCGCCGGGCAGCGGCCTACAACCTGGTCCAGCACGGCATCACCAACCTGTGCGTCATCGGCGGGGATGGCAGCCTCACAGGTGCCAACATCTTCCGCAGCGAGTGGGGCAGCCTGCTGGAGGAGCTGGTGGCGGAAGGTAAGATCTCAGAGACTACAGCCCGGACCTACTCGCACCTGAACATCGCGGGCCTAGTGGGCTCCATCGATAACGACTTCTGCGGCACCGACATGACCATCGGCACGGACTCGGCCCTCCACCGCATCATGGAGGTCATCGATGCCATCACCACCACTGCCCAGAGCCACCAGAGGACCTTCGTGCTGGAAGTGATGGGCCGGCACTGCGGGTACCTGGCGCTGGTATCTGCACTGGCCTCAGGGGCCGACTGGCTGTTCATCCCCGAGGCTCCACCCGAGGACGGCTGGGAGAACTTCATGTGTGAGAGGCTGGGTGAGACTCGGAGCCGTGGGTCCCGACTGAACATCATCATCATCGCTGAGGGTGCCATTGACCGCAACGGGAAGCCCATCTCGTCCAGCTACGTGAAGGACCTGGTGGTTCAGAGGCTGGGCTTCGACACCCGTGTAACTGTGCTGGGCCACGTGCAGCGGGGAGGGACGCCCTCTGCCTTCGACCGGATCCTGAGCAGCAAGATGGGCATGGAGGCGGTGATGGCGCTGCTGGAAGCCACGCCTGACACGCCGGCCTGCGTGGTCACCCTCTCGGGGAACCAGTCAGTGCGGCTGCCCCTCATGGAGTGCGTGCAGATGACCAAGGAAGTGCAGAAAGCCATGGATGACAAGAGGTTTGACGAGGCCACCCAGCTCCGTGGTGGGAGCTTCGAGAACAACTGGAACATTTACAAGCTCCTCGCCCACCAGAAGCCCCCCAAGGAGAAGTCTAACTTCTCCCTGGCCATCCTGAATGTGGGGGCCCCGGCGGCTGGCATGAATGCGGCCGTGCGCTCGGCGGTGCGGACCGGCATCTCCCATGGACACACAGTATACGTGGTGCACGATGGCTTCGAAGGCCTAGCCAAGGGTCAGGTGCAAGAAGTAGGCTGGCACGACGTGGCCGGCTGGTTGGGGCGTGGTGGCTCCATGCTGGGGACCAAGAGGACCCTGCCCAAGGGCCAGCTGGAGTCCATTGTGGAGAACATCCG
CATCTATGGTATTCACGCCCTGCTGGTGGTCGGTGGGTTTGAGGCCTATGAAGGGGTGCTGCAGCTGGTGGAGGCTCGCGGGCGCTACGAGGAGCTCTGCATCGTCATGTGTGTCATCCCAGCCACCATCAGCAACAACGTCCCTGGCACCGACTTCAGCCTGGGCTCCGACACTGCTGTAAATGCCGCCATGGAGAGCTGTGACCGCATCAAACAGTCTGCCTCGGGGACCAAGCGCCGTGTGTTCATCGTGGAGACCATGGGGGGTTACTGTGGCTACCTGGCCACCGTGACTGGCATTGCTGTGGGGGCCGACGCCGCCTACGTCTTCGAGGACCCTTTCAACATCCACGACTTAAAGGTCAACGTGGAGCACATGACGGAGAAGATGAAGACAGACATTCAGAGGGGCCTGGTGCTGCGGAACGAGAAGTGCCATGACTACTACACCACGGAGTTCCTGTACAACCTGTACTCATCAGAGGGCAAGGGCGTCTTCGACTGCAGGACCAATGTCCTGGGCCACCTGCAGCAGGGTGGCGCTCCAACCCCCTTTGACCGGAACTATGGGACCAAGCTGGGGGTGAAGGCCATGCTGTGGTTGTCGGAGAAGCTGCGCGAGGTTTACCGCAAGGGACGGGTGTTCGCCAATGCCCCAGACTCGGCCTGCGTGATCGGCCTGAAGAAGAAGGCGGTGGCCTTCAGCCCCGTCACTGAGCTCAAGAAAGACACTGATTTCGAGCACCGCATGCCACGGGAGCAGTGGTGGCTGAGCCTGCGGCTCATGCTGAAGATGCTGGCACAATACCGCATCAGTATGGCCGCCTACGTGTCAGGGGAGCTGGAGCACGTGACCCGCCGCACCCTGAGCATGGACAAGGGCTTCTGAGGCCAGCCATGCCCACGCCCCTCCCCAGCCCCCACCCATGCCAGCGCAGCGCCAGGGCTCAGATGGGGCCTGGGCTGTTGTGTCTGGAGCCTGCAGGCAGGTGGGGGCTGCGTCCCTGCTCAGCCCATCCCCTGCCTCTATCCCTGGCCACCTGCCAGGCCTCCCTCGGGCTGGTGTCTTGAGACCAGCCTGCCAGGCCCTCCAGCAGGAGGACAGAGTGCCCTGGGGCATCCACCTTCCTGCCCAGGGGACGTGGCGCTGTCGGTGTTTGGAGGCTGCTGCCCCCTGGCTTTGGCGCCCCATGGGCCCTCAGCGTCTCCCCATGCTGGGCTCACTACATGGGCCAGCCCTTGCTCTACCTGGCCGGTAGGCTGCTGGCGCCTAGGTTGTGTTGAGAGGGGGATGCCCCTGGCCCTGCCTCACTGTGACCTGCTCCTGCCCACGTGCAGCACCTGTCACCTTTTCTAGAAATAAAATCACCCTGACTGTGGGGTGCATCGGTCTCCGGAGA"
% Define the primer sequence
primer_sequence = 'TACG';
% Find the location of the primer sequence in the DNA sequence
primer_location = strfind(DNAsequence, primer_sequence)
primer_location = 1×6
685 1363 1828 2071 2308 2812
% Display the location of the primer sequence
if isempty(primer_location)
disp('Primer sequence not found in the DNA sequence.');
else
disp(['Primer sequence found at position(s): ', num2str(primer_location)]);
end
Primer sequence found at position(s): 685 1363 1828 2071 2308 2812
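Since the question involves three files, the same fscanf approach can be wrapped in a loop. The following is only a minimal sketch under some assumptions: the demo file written to tempdir and the variable names (files, primer, locations) are made up for illustration so the snippet runs on its own, and the name 'cDNA3-1.txt' in the comment is inferred from the "3-1" mentioned above rather than confirmed.

```matlab
% Hypothetical demo file so the sketch runs on its own; replace with your real files.
demoFile = fullfile(tempdir, 'demo_cDNA.txt');
fid = fopen(demoFile, 'w');
fprintf(fid, 'GATTACGA\nCCTACGTT\n');
fclose(fid);

files = {demoFile};   % e.g. {'cDNA1-1.txt', 'cDNA2-1.txt', 'cDNA3-1.txt'} (third name assumed)
primer = 'TACG';
locations = cell(size(files));

for k = 1:numel(files)
    fid = fopen(files{k}, 'r');
    if fid == -1
        % fopen returns -1 when the file cannot be opened
        warning('Could not open %s', files{k});
        continue
    end
    seq = fscanf(fid, '%s');   % %s skips whitespace, so the lines are joined together
    fclose(fid);               % close the file as soon as it has been read

    locations{k} = strfind(seq, primer);
    if isempty(locations{k})
        fprintf('%s: primer not found\n', files{k});
    else
        fprintf('%s: primer at position(s) %s\n', files{k}, num2str(locations{k}));
    end
end
```

Checking the return value of fopen is optional, but it makes a wrong file name show up as a warning instead of an error from fscanf.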

More Answers (1)

John D'Errico on 15 Mar 2024
Edited: John D'Errico on 15 Mar 2024
strfind does NOT read a string from a file! You did this:
% Open the file
fileID = fopen('cDNA1-1.txt', 'r');
% Read the DNA sequence from the file
dna_sequence = strfind(fileID, '%s');
WRONG. You opened the file, but then never read anything from the file. Essentially, you got ahead of yourself.
fileID = fopen('cDNA1-1.txt', 'r');
I'll use fread, which brings the file contents in as a column of ASCII codes, so char will convert them to characters. I'll also transpose the result to make it a row vector. (There are many ways we could do this. I'm just grabbing one that works.)
D = char(fread(fileID))';
fclose(fileID); % close the file once its contents are in memory
But note that the file contains carriage returns and line feed characters, so I'll strip them out. Keep only the DNA part.
D = D(ismember(D,'ACGT'))
D = 'TCACTGACCCCACTCCTGAGCATGAACTCTCCTCCCCTCCACTCTGCTGTCAGGTTTTGTCTCCATTGGCCAAGAACCTCTTCCACCGGGCCATTTCTGAGAGTGGCGTGGCCCTCACTTCTGTTCTGGTGAAGAAAGGTGATGTCAAGCCCTTGGCTGAGCAAATTGCTATCACTGCTGGGTGCAAAACCACCACCTCTGCTGTCATGGTTCACTGCCTGCGACAGAAGACGGAAGAGGAGCTCTTGGAGACGACATTGAAAATGAAATTCTTATCTCTGGACTTACAGGGAGACCCCAGAGAGAGTCAACCCCTTCTGGGCACTGTGATTGATGGGATGCTGCTGCTGAAAACACCTGAAGAGCTTCAAGCTGAAAGGAATTTCCACACTGTCCCCTACATGGTCGGAATTAACAAGCAGGAGTTTGGCTGGTTGATTCCAATGCAGTTGATGAGCTATCCACTCTCCGAAGGGCAACTGGACCAGAAGACAGCCATGTCACTCCTGTGGAAGTCCTATCCCCTTGTTTGCATTGCTAAGGAACTGATTCCAGAAGCCACTGAGAAATACTTAGGAGGAACAGACGACACTGTCAAAAAGAAAGACCTGTTCCTGGACTTGATAGCAGATGTGATGTTTGGTGTCCCATCTGTGATTGTGGCCCGGAACCACAGAGATGCTGGAGCACCCACCTACATGTATGAGTTTCAGTACCGTCCAAGCTTCTCATCAGACATGAAACCCAAGACGGTGATAGGAGACCACGGGGATGAGCTCTTCTCCGTCTTTGGGGCCCCATTTTTAAAAGAGGGTGCCTCAGAAGAGGAGATCAGACTTAGCAAGATGGTGATGAAATTCTGGGCCAACTTTGCTCGCAATGGAAACCCCAATGGGGAAGGGCTGCCCCACTGGCCAGAGTACAACCAGAAGGAAGGGTATCTGCAGATTGGTGCCAACACCCAGGCGGCCCAGAAGCTGAAGGACAAAGAAGTAGCTTTCTGGACCAACCTCTTTGCCAAGAAGGCAGTGGAGAAGCCACCCCAGACAGAACACATAGAGCTGTGAATGAAGATCCAGCCGGCCTTGGGAGCCTGGAGGAGCAAAGACTGGGGTCTTTTGCGAAAGGGATTGCAGGTTCAGAAGGCATCTTACCATGGCTGGGGAATTGTCTGGTGGTGGGGGGCAGGGGACAGAGGCCATGAAGGAGCAAGTTTTGTATTTGTGACCTCAGCTTTGGGAATAAAGGATCTTTTGAAGGCCAAA'
strfind(D,'TCAG')
ans = 1×7
50 710 732 819 832 1139 1230
And those are the locations of the substring 'TCAG' in that file. To look for your primer instead, pass 'TACG' to strfind in the same way.
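To see why stripping the carriage returns and line feeds matters, here is a small sketch using a made-up two-line string (not taken from the attached files) in which the primer happens to straddle a line break. The variable names are illustrative only.

```matlab
% Made-up text: the primer 'TACG' is split across a CR/LF line break.
raw = sprintf('GGTA\r\nCGTT');
primer = 'TACG';

% Searching the raw text misses the match because CR and LF sit in the middle of it.
hitsRaw = strfind(raw, primer);

% Keep only the characters A, C, G and T, as in the answer above.
clean = raw(ismember(raw, 'ACGT'));

% Now the primer is contiguous and strfind can locate it.
hitsClean = strfind(clean, primer);

disp(isempty(hitsRaw));   % true: no match in the raw text
disp(hitsClean);          % start index of the match in the cleaned text
```

Reading with fscanf and '%s' has a similar effect, because that format skips whitespace while reading. Note that the ismember filter as written also drops lowercase letters and any other symbols, which may or may not be what you want.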
