Does building a Salmon index require a decoy file when quantifying the abundance of predicted metagenomic genes?
Background
1
The Salmon user manual says:
Finally, we recommend using selective alignment with a decoy-aware transcriptome, to mitigate potential spurious mapping of reads that actually arise from some unannotated genomic locus that is sequence-similar to an annotated transcriptome.
2
A reply from Rob, the developer of Salmon, in a Biostars thread:
The decoy sequences are regions of the target genome that are sequence similar to annotated transcripts. These are the regions of the genome most likely to cause mismapping (e.g. transcribed pseudogenes, etc.).
There are 3 ways to run salmon :
(a) with just the annotated transcriptome being indexed
(b) with the annotated transcriptome and a small set of decoys computed using MASHMAP to search transcripts against the genome and
(c) with the annotated transcriptome and using the entire genome as decoy sequence.
The (a) method requires the fewest resources,
(b) requires a good deal of resources to run the MASHMAP step, but the resulting index is similar to that of (a) and it avoids the most obvious cases of misalignment.
(c) results in the largest index, but it’s the most effective at avoiding potentially spurious mappings.
Salmon can be used without decoy sequences (and sometimes, this is necessary — e.g. in a de novo assembly, there will likely be no possibility for decoys).
It can also be run without decoys in reference organisms.
It is simply the case that decoys help avoid certain cases of misalignment that can’t be adjudicated with the transcriptome alone, and therefore can lead to somewhat more robust estimates of abundance in the presence of the expression of unannotated sequences.
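For contrast with the metagenome case discussed next, option (c) from the reply could be sketched roughly as follows for a reference organism. This is only an illustration: the function names `decoy_names` and `build_decoy_index`, the file layout, and the default thread count are assumptions made here, not part of the original post. The `-d` (decoys) option of `salmon index` takes a text file listing the decoy sequence names, and the decoy sequences are placed after the transcripts in the combined FASTA.

```shell
# Sketch of option (c): use the whole genome as decoy sequence.
# Function names and file layout are hypothetical; inputs are plain (uncompressed) FASTA.

# Print the sequence names in a FASTA file, one per line
# (the header text up to the first space, without the leading ">").
decoy_names() {
    grep '^>' "$1" | cut -d ' ' -f 1 | sed 's/^>//'
}

# Build a decoy-aware index.
# usage: build_decoy_index transcripts.fa genome.fa outdir [threads]
build_decoy_index() {
    idx_transcripts=$1
    idx_genome=$2
    idx_outdir=$3
    idx_threads=${4:-4}

    mkdir -p "$idx_outdir"

    # 1. List the genome sequence names; these are the decoys.
    decoy_names "$idx_genome" > "$idx_outdir/decoys.txt"

    # 2. Concatenate transcripts and genome, transcripts first.
    cat "$idx_transcripts" "$idx_genome" > "$idx_outdir/gentrome.fa"

    # 3. Index the combined FASTA and tell salmon which names are decoys.
    salmon index -p "$idx_threads" -k 31 \
        -t "$idx_outdir/gentrome.fa" \
        -d "$idx_outdir/decoys.txt" \
        -i "$idx_outdir/index"
}
```

A call might look like `build_decoy_index transcripts.fa genome.fa ./Index/decoy_aware 30`. Step 1 needs a genome FASTA, which is exactly the input that is missing when the genes come from prediction on a metagenome.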
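Conclusion
Decoy sequences are regions of the genome that are sequence-similar to the annotated transcripts.
When building the index, the decoy file helps avoid spurious mappings in the later alignment step, such as reads from transcribed pseudogenes being mapped to the annotated transcripts they resemble.
Of the three approaches the author lists, (a) builds the index from the annotated transcripts alone. He also notes that Salmon can be run without decoy sequences when necessary, for example with a de novo assembly.
By the same logic, when the genes/transcripts in a metagenomic analysis are obtained by gene prediction, there is no reference genome from which regions similar to those transcripts could be taken.
Therefore, no decoy file is needed when building the Salmon index for metagenomic gene abundance quantification.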
Scripts
Build the index
salmon index -p 30 -k 31 -t Gene.fa -i ./Index/Gene
Quantification
salmon quant -l IU --validateMappings --meta -p 10 -i ./Index/Gene -1 Sample.R1.clean.rmhost.fq.gz -2 Sample.R2.clean.rmhost.fq.gz -o ./Result/Sample
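The quantification step writes a tab-separated `quant.sf` file into the output directory, with the columns Name, Length, EffectiveLength, TPM and NumReads. As a minimal sketch of how the per-gene abundance could be pulled out of it, the helper below prints the gene name and TPM; the function name `gene_tpm` and the output file name in the usage note are assumptions for illustration only.

```shell
# Print "Name<TAB>TPM" for every gene in a Salmon quant.sf file.
# Assumes the standard column order: Name, Length, EffectiveLength, TPM, NumReads.
# gene_tpm is a hypothetical helper name, not part of Salmon.
gene_tpm() {
    # NR > 1 skips the header line; $1 is Name, $4 is TPM.
    awk -F '\t' 'NR > 1 { print $1 "\t" $4 }' "$1"
}
```

For the sample above this could be run as `gene_tpm ./Result/Sample/quant.sf > Sample.tpm.tsv`.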