The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.
For example, "ACGAATTCCG" is a DNA sequence.
When studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Example 1:
**Input:** s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
**Output:** ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:
**Input:** s = "AAAAAAAAAAAAA"
**Output:** ["AAAAAAAAAA"]
Constraints:
1 <= s.length <= 105
s[i] is either 'A', 'C', 'G', or 'T'.
ac
class Solution {
public List<String> findRepeatedDnaSequences(String s) {
List<String> res = new ArrayList<String>();
Set<Integer> repeat = new HashSet<Integer>();
Set<Integer> repeatMore = new HashSet<Integer>();
// dictionary for A C G T
Map<Character, Integer> map = new HashMap<Character, Integer>();
map.put('A', 0);
map.put('C', 1);
map.put('G', 2);
map.put('T', 3);
int val = 0;
for (int i = 0; i < s.length(); i++) {
val <<= 2; // move 2 digits left, make room for new letter
val |= map.get(s.charAt(i));
val &= 0xfffff; // window technique: any bit larger than 20 is set to 0
if (i < 9) continue;
if (!repeat.add(val) && repeatMore.add(val)) {
res.add(s.substring(i-9, i+1));
}
}
return res;
}
}
/**
1. 10-letter-long means every 10 letter in the sequence
2. [important] storing 10 letters is too space consuming to implement, so make it smaller, "compress" it to a 4-bit int
3. store 4-bit int in one hashset, when repeat, add to result. when repeat more than twice, ignore it.
4. window technique: val &= 0xfffff; any bit larger than 20 is set to 0
5. set.add(val) can be used to replace `if (set.contains(val)) do sth else set.add()` because it will return boolean result.
A 00
C 01
G 10
T 11
**/