Thursday, October 1, 2015

Is deduplication for you?
Re: @brianbushnell on seqanswers: Deduplication of reads is possible with Dedupe, but it requires a lot of memory (~1 kB per read), so it is only practical with HiSeq data if you have high-memory nodes. If you have a reference, deduplication based on mapping will use much less memory, though it will also be much slower. And whether or not deduplication is a good idea depends on the experiment; i.e., probably not for quantitative RNA-seq, and also probably not for unamplified libraries in general.
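That ~1 kB-per-read figure makes the RAM requirement easy to estimate up front. A quick sketch (the 200 M read count below is a made-up example, not a number from this post):

```shell
# Back-of-envelope Dedupe RAM estimate from the ~1 kB/read figure quoted above.
# The read count is hypothetical.
reads=200000000                        # e.g. a 200 M read HiSeq dataset
bytes=$(( reads * 1024 ))              # ~1 kB per read
echo "$(( bytes / 1024 / 1024 / 1024 )) GB"   # prints "190 GB"
```

So a full HiSeq run can easily exceed a standard node's memory, which is why the quote suggests high-memory nodes or mapping-based deduplication instead.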

Suggested order:
1. Initial quality control (BGI did this)
2. Quality- and/or length-based read discarding; trimming/discarding of N-containing reads (BGI did this)
3. Adapter removal (BGI did this)
4. phiX/contaminant removal
5. [optional] Error correction (will use)
6. [optional] Deduplication, depending on experiment (will use)
7. Merging of PE reads (will use)
8. Quality trimming + phiX/contaminant removal (will use)
9. Contamination check
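The BBTools suite used throughout this post has a dedicated wrapper for most of these steps. A minimal sketch of steps 4, 6, and 7 (bbduk.sh, dedupe.sh, and bbmerge.sh are real BBTools scripts, but the filenames and flag values here are my own placeholders, not commands from this post):

```shell
# Hypothetical sketch of pipeline steps 4, 6, and 7 with BBTools.
# Filenames and flag choices are illustrative only.
if command -v bbduk.sh >/dev/null 2>&1; then
    # 4. phiX/contaminant removal (a phiX reference ships in BBTools' resources/)
    bbduk.sh in1=r1.fq in2=r2.fq out1=r1_fil.fq out2=r2_fil.fq \
        ref=phix174_ill.ref.fa.gz k=31 hdist=1
    # 6. optional deduplication; comma-separated inputs are read as a pair
    dedupe.sh in=r1_fil.fq,r2_fil.fq out=deduped.fq
    # 7. merge the (interleaved) pairs
    bbmerge.sh in=deduped.fq out=merged.fq outu=unmerged.fq
    status=ran
else
    echo "BBTools not on PATH; skipping"
    status=skipped
fi
```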

More advice:
As for the order of quality trimming and merging, they could be done either way. However, BBMerge can internally quality-trim reads that fail to merge on the first attempt and then retry, so for that particular program, quality trimming after merging tends to work better.

Note that after merging you will have two sets of reads (merged and unmerged) that must be processed independently, since not all pairs will successfully merge.
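With BBMerge that split falls out naturally if you capture the unmerged mates in their own files. A sketch (filenames are placeholders; `out=` takes merged reads, `outu1=`/`outu2=` the pairs that failed to merge):

```shell
# Keep merged and unmerged reads in separate streams for independent processing.
if command -v bbmerge.sh >/dev/null 2>&1; then
    bbmerge.sh in1=r1_clean.fq in2=r2_clean.fq \
        out=merged.fq outu1=unmerged_1.fq outu2=unmerged_2.fq
    status=ran
else
    echo "bbmerge.sh not on PATH; skipping"
    status=skipped
fi
# Downstream, treat merged.fq as single-end data and unmerged_*.fq as a pair.
```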

Also useful reading about merging your paired-end reads before assembly:

Okay, time to try...

Step 1: Quality trimming

#quality control trimming of paired-end reads
#DR086_E.bombifrons_MOE quality control and plots
-Xmx1g in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq \
    out1=DR086_Erbom_MOE_1_clean.fq out2=DR086_Erbom_MOE_2_clean.fq \
    minlen=25 qtrim=rl trimq=10 \
    qhist=qc_plots/DR086_Erbom_MOE_qhist.txt aqhist=qc_plots/DR086_Erbom_MOE_aqhist.txt \
    lhist=qc_plots/DR086_Erbom_MOE_readLength.txt bqhist=qc_plots/DR086_Erbom_MOE_bqhist.txt \
    overwrite=true

About 8-9% of each read was trimmed.

This level of sequence duplication is normal for mammalian libraries at ~20 million reads.
Sequence bias at the beginning of reads is normal and is due to nonrandom hybridization of the random primers.

#I put this all in a shell script
sh >log.txt

I tried to do error correction but these files are HUGE!! Need to remove duplicates first.
sh in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq out1=DR086_Erbom_MOE_1_corrected.fq out2=DR086_Erbom_MOE_2_corrected.fq
/usr/local/bin/bbmap// line 111: [: unlimited: integer expression expected
java -ea -Xmx-410m -Xms-410m -cp /usr/local/bin/bbmap/current/ jgi.KmerNormalize bits=16 ecc=t passes=1 keepall dr=f prefilter in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq out1=DR086_Erbom_MOE_1_corrected.fq out2=DR086_Erbom_MOE_2_corrected.fq
Invalid maximum heap size: -Xmx-410m
Could not create the Java virtual machine.
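The negative `-Xmx-410m` traces back to the wrapper's memory autodetection: the "line 111: [: unlimited" complaint suggests the script parses `ulimit -v`, and when that returns the string "unlimited" the arithmetic goes wrong and produces a negative heap size. (That the wrapper reads ulimit is my inference from the error text, not something stated in the post.) The failure mode is easy to reproduce:

```shell
# Minimal reproduction of "[: unlimited: integer expression expected":
# a non-numeric string breaks `[ -ge ]`, which expects an integer.
mem="unlimited"                      # what `ulimit -v` can return
if [ "$mem" -ge 0 ] 2>/dev/null; then
    msg="autodetected ${mem} kB"
else
    msg="autodetection failed; pass -Xmx explicitly"
fi
echo "$msg"    # prints "autodetection failed; pass -Xmx explicitly"
```

Passing an explicit `-Xmx` (as tried below) is the right workaround for the heap-size error itself.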

#even when I increased the memory, no cigar
sh in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq out1=DR086_Erbom_MOE_1_corrected.fq out2=DR086_Erbom_MOE_2_corrected.fq -Xmx20g
Exception in thread "Thread-60" java.lang.NoClassDefFoundError: java/util/concurrent/ThreadLocalRandom
        at jgi.KmerNormalize$
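`java.util.concurrent.ThreadLocalRandom` only exists from Java 7 onward, so this NoClassDefFoundError is the signature of an older JRE being first on the PATH rather than a BBTools bug. A quick way to sanity-check the version banner that `java -version` prints (the banner below is an example of an old JVM, not output from this machine):

```shell
# Extract the minor version from a Java 6-style banner and flag it as too old.
ver='java version "1.6.0_65"'        # hypothetical old-JVM banner
minor=$(printf '%s\n' "$ver" | sed 's/.*"1\.\([0-9]*\)\..*/\1/')
if [ "$minor" -lt 7 ]; then
    echo "Java 1.$minor lacks ThreadLocalRandom; need Java 7+"
fi
```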

Step 2: Deduplication
#dedupe each cleaned read file
in=NBS1170_Desrot_MOE_1_clean.fq out=NBS1170_Desrot_MOE_1_clean_dedupe.fq am ac fo renameclusters=f storename=f pto cc qin=33 csf=stats.txt -Xmx20g overwrite=t
in=NBS1170_Desrot_MOE_2_clean.fq out=NBS1170_Desrot_MOE_2_clean_dedupe.fq am ac fo renameclusters=f storename=f pto cc qin=33 csf=NBS1170_Desrot_MOE_2_clean_dedupe_stats.txt -Xmx20g overwrite=t

Step 3: Merge Paired End Reads
#merge PE
in1=NBS1170_Desrot_MOE_1_clean.fq.gz in2=NBS1170_Desrot_MOE_2_clean.fq.gz out=NBS1170_Desrot_MOE_merged.fq -Xmx20g

Step 3.5: Could also remove the duplicates after merging the paired ends
#remove duplicates of the two lanes
in=NBS1170_Desrot_MOE_merged.fq out=NBS1170_Desrot_MOE_merged_dedupe.fq am ac renameclusters=f storename=f pto cc qin=33 csf=dedupe_logs/NBS1170_Desrot_MOE_merged_dedupe_stats.txt dot=dedupe_logs/ -Xmx20g overwrite=t
