High-throughput sequencing of nucleotide barcodes (bar-seq) provides a powerful tool to assay and track dynamics of large numbers of lineages, genotypes or perturbations in complex cell pools. Barcoded cells are grown under selective conditions, barcodes are amplified using common primers, and relative barcode frequencies are quantified by sequencing barcode amplicons. This bar-seq approach was first used with the Saccharomyces cerevisiae deletion collection, which was designed such that each individual deletion strain is marked with a unique barcode (Giaever et al., 2002; Gibney et al., 2013; Gresham et al., 2011; Smith et al., 2009; Winzeler et al., 1999), and subsequently in a number of other barcoded bacteria and yeast collections (Han et al., 2010; Hobbs et al., 2010; Noble et al., 2010; Schwarzmuller et al., 2014). Analogously, a growing number of studies in mammalian cells sequence pseudo-barcodes: short nucleotide sequences, such as shRNAs or sgRNAs, that serve as both the cell-specific perturbation and the unique cell identifier for short-read sequencing (Bassik et al., 2009; Schlabach et al., 2008; Silva et al., 2008; Sims et al., 2011; Wang et al., 2014; Wong et al., 2015). In addition to the above approaches where the sequence of the barcode or pseudo-barcode is known a priori, more recent studies employ barcodes with random sequences to serve as neutral cell markers to study the dynamics of development, evolution or cancer progression (Bhang et al., 2015; Blundell and Levy, 2014; Levy et al., 2015; Lu et al., 2011; Nguyen et al., 2015), or as markers of engineered constructs to study genetic or protein–protein interactions (Jaffe et al., 2017; Schlecht et al., 2017).

Errors in bar-seq analysis can have profound consequences on the biological interpretation of these assays. For example, erroneous barcodes can inflate measures of the adaptive mutation rate (Levy et al., 2015), and undercounting of low-frequency barcodes can result in meaningful data loss in interaction screens (Jaffe et al., 2017; Schlecht et al., 2017). However, computational pipelines for bar-seq have not been well developed. For barcodes of known sequence, the primary concern is mapping reads that may contain PCR or sequencing errors to the known barcodes. One naïve strategy would be to ignore reads that do not exactly match any putative barcode. However, given that some barcodes in the pool may be more prone to PCR or sequencing errors (Goren et al., 2010; Gundry and Vijg, 2012; Meyerhans et al., 1990; Schmitt et al., 2012), this strategy could introduce counting biases. Alternatively, the best match for each read can be compared to the set of putative barcodes by calculating the Hamming (Hamming, 1950) or Levenshtein distance (Levenshtein, 1966). However, this strategy is computationally expensive for large barcode libraries. Additionally, a priori errors in the set of known barcodes have been found to be common (Smith et al., 2009), meaning that unexpected barcodes that are present in the pool may be missed. Therefore, an unbiased barcode detection strategy that does not depend on prior information is needed.

For random barcode libraries, an additional computational problem is discovering the true barcodes in the pool. That is, reads that identify a true barcode must be differentiated from reads that contain PCR or sequencing errors. Because sequences representing an exact match to a true barcode are likely to be sequenced at much higher frequencies than those with errors, one approach would be to ignore reads below a predefined frequency threshold and treat all other reads as true barcodes.
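To make the two baseline strategies discussed in this section concrete, the following is a minimal Python sketch, not the implementation of any published pipeline. The function names (`hamming`, `assign_to_known`, `threshold_barcodes`) and the default cutoffs (`max_dist=1`, `min_count=3`) are illustrative assumptions and do not come from the cited studies. The first function pair maps reads to a known barcode set by Hamming distance, and the last applies a fixed frequency threshold to a pool of reads from a random barcode library.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def assign_to_known(reads, barcodes, max_dist=1):
    """Count reads per known barcode using nearest-barcode matching.

    A read is counted only if its closest barcode is within max_dist
    mismatches (assumed cutoff) and that closest barcode is unique.
    Every read is compared to every barcode, so the cost grows as
    O(len(reads) * len(barcodes)).
    """
    counts = Counter()
    for read in reads:
        # Hamming distance is only defined for equal lengths; skip the rest.
        dists = [(hamming(read, bc), bc) for bc in barcodes if len(bc) == len(read)]
        if not dists:
            continue
        best = min(d for d, _ in dists)
        hits = [bc for d, bc in dists if d == best]
        if best <= max_dist and len(hits) == 1:
            counts[hits[0]] += 1
    return counts


def threshold_barcodes(reads, min_count=3):
    """Treat every distinct read seen at least min_count times (assumed
    cutoff) as a true barcode; all rarer sequences are discarded."""
    freq = Counter(reads)
    return {seq: n for seq, n in freq.items() if n >= min_count}


if __name__ == "__main__":
    known = ["ACGTAC", "TTGCAA"]  # toy barcode set, for illustration only
    reads = ["ACGTAC", "ACGTAC", "ACGTAA", "TTGCAA", "GGGGGG"]
    print(assign_to_known(reads, known))
    print(threshold_barcodes(reads, min_count=2))
```

In this toy example, the read `ACGTAA` differs from `ACGTAC` at one position and is assigned to it, while `GGGGGG` is far from both barcodes and is dropped. The all-against-all comparison in `assign_to_known` illustrates why distance-based matching becomes expensive for large libraries, and `threshold_barcodes` shows how a fixed cutoff discards any true barcode that happens to be rare.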