Genomic and Chemical Diversity of Commercially Available High-CBD Industrial Hemp Accessions

High consumer demand for cannabidiol (CBD) has made high-CBD hemp (Cannabis sativa) an extremely high-value crop. However, the industry has developed faster than the research, resulting in the sale of many hemp accessions with inconsistent performance and chemical profiles. These inconsistencies cause significant economic and legal problems for growers interested in producing high-CBD hemp. To determine the genetic and phenotypic consistency of available high-CBD hemp varieties, we obtained seed or clones from 22 different named accessions meant for commercial production. Genotypes (∼48,000 SNPs) and chemical profiles (% CBD and THC by dry weight) were determined for up to 8 plants per accession. Many accessions, including several with the same name, showed little consistency either genetically or chemically. Most seed-grown accessions also deviated significantly from their purported levels of CBD and THC based on the supplied certificates of analysis. Several also showed evidence of an active tetrahydrocannabinolic acid (THCa) synthase gene, leading to unacceptably high levels of THC in female flowers. We conclude that the current market for high-CBD hemp varieties is highly unreliable, making many purchases risky for growers. We suggest options for addressing these issues, such as using unique names and developing seed and plant certification programs to ensure the availability of high-quality, verified planting materials.


INTRODUCTION
Hemp (Cannabis sativa) is a dioecious annual plant that is thought to have been domesticated around 6,000 years ago in China, with some evidence of use as far back as 12,000 years ago (Li, 1974; Fleming and Clarke, 1998; Merlin, 2003). Different hemp varieties have been used for fiber, seeds, medicine, and recreation for thousands of years (Russo, 2007). Hemp has also recently been used to produce biofuels (Li et al., 2010), plastics (Wretfors et al., 2009; Khattab and Dahman, 2019), and building composites (Sassoni et al., 2014; Hussain et al., 2019). Similar to other crops, different hemp varieties serve specific uses. However, until recently the United States had banned all C. sativa varieties from commercial production due to some of them being used as recreational drugs (marijuana) (Alliance, 2014). The psychoactive properties of marijuana are due to high amounts of a specific secondary metabolite, tetrahydrocannabinol (THC). Recognizing that not all hemp is created equal, the 2018 United States Farm Bill allowed growers to cultivate "industrial" hemp (defined as varieties with <0.3% THC by dry weight) throughout the United States (S. 2667 (115th): Hemp farming act of 2018, 2018). This has led to a surge of interest in hemp production, especially for varieties with high production of other, nonpsychoactive metabolites (cannabinoids). Currently, the largest interest is in varieties bred to produce cannabidiol (CBD), a nonpsychoactive cannabinoid used as a medicine and health food supplement. Both CBD and THC derive from the same precursor, cannabigerolic acid (CBGa) (Sirikantaramas et al., 2004; Taura et al., 2007). They are most concentrated in the trichomes of female flowers, as are the over 100 other known cannabinoids present at much lower concentrations (Turner et al., 1981; Elsohly and Slade, 2005).
The most well-studied application of CBD is to control seizures, which is the basis of the FDA-approved drug Epidiolex. CBD is also marketed as a nutritional supplement to help with anxiety, pain, depression, and sleep; most of these claims are anecdotal, although there are some studies supporting them (Perucca, 2017; Glass and Gilleece, 2019; Hurd et al., 2019). Regardless of efficacy, the market value of CBD products is currently estimated at over $4.7 billion per year (US CBD market forecast announcement, 2020).

Hemp Genetics
Cannabis sativa has a diploid genome (2n = 20) with an estimated size of 818 Mb for female plants and 843 Mb for males (Sakamoto et al., 1998). Since pollination lowers cannabinoid yield by ∼50% (Meier and Mediavilla, 1998), growers interested in these compounds often use "feminized" seed or clones of female plants.
(Feminized seed is produced from two genetically female plants, one of which has been chemically treated to produce male flowers; Ram et al., 1972; Lubell and Brand, 2018.) Until recently, hemp's legal status prevented most research on it. Lack of research made it essentially an orphan crop, with few genomic resources and almost no public germplasm collections. Despite these restrictions, there have been significant advances in understanding hemp genetics for several key traits, including cannabinoid production (Taura et al., 1995; de Meijer et al., 2003, 2009; Weiblen et al., 2015), sex expression (Faux et al., 2016), fiber quality (van den Broeck et al., 2008), and population structure and diversity (Sawler et al., 2015; Lynch et al., 2016; Dufresnes et al., 2017). Many more genomics resources have become available for hemp over the past decade, including multiple genome sequences (van Bakel et al., 2011; Laverty et al., 2019), transcriptome analyses (Liu et al., 2016; Braich et al., 2019; Huang et al., 2019; McGarvey et al., 2020), and proteome analyses (Jenkins and Orsburn, 2020; Conneely et al., 2021). These resources are rapidly bringing hemp into the genomics era and ending its status as an orphan crop.
Due to a long history of breeding for different purposes, drug-type C. sativa plants form genetically distinct clusters from fiber and grain types (van Bakel et al., 2011; Sawler et al., 2015; Lynch et al., 2016; Vergara et al., 2021). Prior studies usually focused on marijuana, but sometimes included high-CBD/low-THC varieties as well. Perhaps not surprisingly, high-CBD varieties are closely related to marijuana varieties (Grassa et al., 2021), since both types have been bred for production of specific secondary metabolites (cannabinoids, terpenoids, etc.).

Issues With Commercial Hemp Cultivation
Although interest in commercial hemp cultivation has exploded since the 2018 Farm Bill, many issues of naming and quality control plague the field. Because all varieties of hemp were outlawed for several decades, breeding and naming of varieties has been largely clandestine and ad hoc, with names frequently recycled to reflect the most successful or desirable cultivars. Thus there is no guarantee that the variety "Cherry Wine" received from one supplier is the same as, or even related to, a variety of the same name from a different supplier. For example, both Lynch et al. (2016) and  found that the traditional classifications of "indica" and "sativa" for drug-type Cannabis did not reflect their genetic relationships, and that high-CBD/low-THC varieties generally cluster separately from drug-type marijuana plants. The high-CBD reference genome, CBDRx, is an exception, sharing 89% of its genome with marijuana varieties (Grassa et al., 2021). Marijuana-type Cannabis is also known to have significant naming inconsistencies, where multiple plants with the same name are actually genetically distinct (Sawler et al., 2015). This inconsistency includes not just background genetics but also how many copies of the cannabinoid biosynthesis genes are present (Vergara et al., 2019), arguably the most important trait for these varieties.
In addition to names, standards for CBD and THC production are lacking. This is particularly important because any plants with >0.3% THC are classified as marijuana and must be destroyed, causing significant loss of revenue. To increase grower confidence, some companies provide certificates of analysis (COAs) that attest to how much of each compound a variety will produce.
Although Cannabis genetics has been developing quickly, the major focus is usually on marijuana-type C. sativa, with CBD-type hemp often much less represented (e.g., Lynch et al., 2016; Vergara et al., 2021). To date, there have been no studies that focus on the consistency of high-CBD hemp from the point of view of a commercial grower. We aimed to fill this gap by specifically investigating whether the naming issues found in drug-type C. sativa also occur in CBD-type varieties available for large-scale commercial production. To this end, we developed genetic and chemical diversity data sets on twenty-two commercially available hemp accessions. We identified both the genetic relationships among the accessions and the genetic consistency within each accession. We also tested the production of total CBD and THC for each line and compared these to industry and legal standards and the provided certificates of analysis. These comparisons are not meant to evaluate specific sources or accessions per se, but rather to demonstrate the overall state of the market and give an idea of how reliable (or not) it is for interested growers.

MATERIALS AND METHODS

Plant Material
Twenty-two commercial hemp accessions were purchased or donated from various sources (Table 1). Since accessions frequently pass among groups, it is impossible to say if these companies are the original sources, or if they have done any selections to differentiate them from the original source. This collection focused on cannabidiol (CBD) production, but accessions for fiber and seed were also included. Twenty of the accessions were distributed as seeds (some feminized, others not) and two of them were clonally propagated. For our experiments, all clonal plants were propagated from a single original plant to ensure that each replicate was an exact genetic copy.
Seeds were soaked in water for 12 h to initiate germination. Due to low germination rates, 15 seeds were planted per half-gallon pot and thinned to 1 plant per pot after 2 weeks. Clones were made by cutting a seven-inch section of stem from the mother plant, trimming off all leaves and growing points except the topmost one, and dipping it in cloneX (Growth Technology) rooting solution before planting into half-gallon pots. All plants were grown in a commercial potting medium (Sun Gro Metro Mix 830). Plants were fertilized twice a week using a diluted 20-20-20 fertilizer (1,000 ppm) and a diluted micronutrient mixture (Jackpot Micronutrient Mixture; 500 ppm). To maintain the plants in a controlled vegetative state, they were kept under an 18-h light/6-h dark cycle. All plants were grown in greenhouses at the University of Georgia (Athens, GA, United States).

Genomic Data
Ten leaf punches were taken from each plant and sent to LGC Genomics for DNA extraction and genotyping-by-sequencing (GBS) (Elshire et al., 2011) with restriction enzyme MslI. GBS was chosen over shotgun sequencing because it provides greater depth at each site, allowing us to accurately call heterozygous genotypes. Paired-end 150 bp reads were generated using Illumina NextSeq V500/550. Libraries were demultiplexed using the Illumina bcl2fastq software (version 2.17.14). Reads were aligned to the CBDRx reference genome (NCBI GCF_900626175.1; Grassa et al., 2021) with BWA mem (Li, 2013), and SNPs were called with BCFtools (Li, 2011), requiring a minimum base quality of 20 and outputting only SNPs (not indels). All bioinformatic scripts (including exact parameters used) are available at https://github.com/wallacelab/paper-johnson-hemp-gs, and adaptor- and restriction-fragment-verified sequencing data is available at NCBI under BioProject PRJNA707556. Raw SNPs were then filtered in a series of steps. Misalignments and low-coverage sites were filtered out by removing all sites with an average genotyping depth of <15 reads per individual, and paralogs were removed by filtering out sites with >125 reads per individual. These cutoffs were based on initial data exploration that showed average genotyping depth to be ∼100 reads per individual (Supplementary Figure 1). We then removed sites present in <80% of samples, sites with minor allele frequencies <2.5% (since most of these are sequencing errors), and sites with >10% heterozygosity (since these are often paralogs being misaligned to the same location). This resulted in 48,029 SNPs in the final dataset. Cladograms were generated using the neighbor-joining method in TASSEL v5.2.40 (Bradbury et al., 2007). For the neighbor-net analysis, TASSEL was also used to generate a genetic distance matrix, which was then processed with RSplitsTree (Bickel and Zakharko, 2016) and SplitsTree4 (Huson and Bryant, 2006).
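The site-level filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline (their exact scripts are in the linked repository): the function name `passes_filters`, the 0/1/2 genotype encoding, and the use of `None` for missing calls are hypothetical, and the filters are applied together rather than in sequential passes.

```python
def passes_filters(genotypes, depths, min_depth=15, max_depth=125,
                   min_call_rate=0.80, min_maf=0.025, max_het=0.10):
    """Return True if one SNP site passes all filters.

    genotypes: per-sample alternate-allele counts (0, 1, 2) or None if missing
               (hypothetical encoding chosen for this sketch)
    depths:    per-sample read depth at this site
    """
    n_samples = len(genotypes)

    # Depth filter: drop low-coverage sites and likely paralogs.
    mean_depth = sum(depths) / n_samples
    if mean_depth < min_depth or mean_depth > max_depth:
        return False

    # Presence filter: the site must be genotyped in at least 80% of samples.
    called = [g for g in genotypes if g is not None]
    if not called or len(called) / n_samples < min_call_rate:
        return False

    # Minor allele frequency filter: rare alleles are mostly sequencing errors.
    alt_freq = sum(called) / (2 * len(called))
    if min(alt_freq, 1 - alt_freq) < min_maf:
        return False

    # Heterozygosity filter: excess heterozygotes suggest misaligned paralogs.
    het_rate = sum(1 for g in called if g == 1) / len(called)
    return het_rate <= max_het
```

A site is kept only if every check passes, so a caller would apply `passes_filters` to each site in turn and retain those returning `True`.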
To compare against prior work, public whole-genome sequencing data from 55 C. sativa varieties (Lynch et al., 2016) was downloaded from NCBI and SNPs were called using the same pipeline as above, except that no depth filters were applied due to the much lower sequencing depth of this data (∼5×). We merged the two datasets by keeping only the sites present in both datasets after quality filtering, resulting in 8867 SNPs. Neighbor-net network analysis was performed as above.
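As a rough illustration of the merge step, the sketch below keeps only sites found in both datasets and pools their samples. The data layout (a dict keyed by chromosome and position) and the function name `merge_on_shared_sites` are assumptions made for this example; a real merge would also need to confirm that reference and alternate alleles agree at each site and that sample names do not collide.

```python
def merge_on_shared_sites(dataset_a, dataset_b):
    """Keep only sites present in both datasets and combine their samples.

    Each dataset maps a site key (chromosome, position) to a dict of
    {sample_name: genotype}. Sample names are assumed to be unique
    across the two datasets (hypothetical layout for this sketch).
    """
    shared = sorted(set(dataset_a) & set(dataset_b))
    merged = {}
    for site in shared:
        combined = dict(dataset_a[site])   # copy so the inputs are not modified
        combined.update(dataset_b[site])
        merged[site] = combined
    return merged
```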

Cannabinoid Analysis
Fifty-two days after sowing, eight replicates of each accession were placed into a flower room with a 12-h light/12-h dark cycle to initiate flowering. Plants were laid out in a randomized complete block design. Any plants that showed male flowers were removed from the room to eliminate pollination. (All such plants were from non-feminized seed lots; Supplementary Table 1.) The remaining female plants were kept in the flowering room for 12 weeks, at which point the panicles in the top six inches of each plant were harvested, trimmed of excess leaf material, placed in a paper bag, and dried at 35 °C for 2 weeks.
Cannabinoid levels were assayed with a Shimadzu LC2030C high-performance liquid chromatography (HPLC) machine according to the manufacturer's recommended protocol. In brief, 200 mg of dried flower material was weighed out for each sample and the exact weight recorded. Samples were placed in 20 ml of methanol and agitated for 3 min to extract the cannabinoids. One ml of the supernatant was passed through a 0.22 µm filter, and 50 µl of filtrate was diluted into 950 µl of methanol, resulting in a 400× dilution of the original samples. HPLC was carried out using a NexLeaf CBX column (2.7 µm, 4.6 × 150 mm; part number: 220-91525-70), NexLeaf CBX guard column (part number: 220-91525-72), eleven-cannabinoid standard mix (part number: 220-91239-21), and high sensitivity method solvents A (0.085% phosphoric acid in water) and B (0.085% phosphoric acid in acetonitrile) (part number: 220-91394-81). The flow rate was 1.5 ml/min with a gradient starting from 30% solvent A/70% solvent B and ramping to 5% solvent A/95% solvent B over 8 min. Injection volume was 5 µl, and a guard column temperature of 35 °C was maintained by an internal oven. Standard curves were generated for each target cannabinoid with minimum correlation coefficients (R²) of 0.999 over the six concentration levels (0.5, 1, 5, 10, 50, and 100 ppm). The original sample weights were used to determine the precise cannabinoid concentration in the original sample.
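The back-calculation from an HPLC reading to percent dry weight can be sketched as follows. This is a minimal example under stated assumptions rather than the manufacturer's protocol: it assumes the instrument reports ppm as µg of cannabinoid per ml of injected solution, that the 20 ml extract volume is unchanged by the plant material, and that extraction is complete. The function name and parameters are hypothetical.

```python
def percent_dry_weight(measured_ppm, sample_mg, extract_ml=20.0,
                       aliquot_ul=50.0, diluent_ul=950.0):
    """Convert an HPLC reading (ppm, assumed ug/ml) to % cannabinoid by dry weight."""
    # 50 ul of filtrate into 950 ul of methanol is a 20-fold dilution.
    dilution_factor = (aliquot_ul + diluent_ul) / aliquot_ul
    # Concentration in the undiluted methanol extract (ug/ml).
    extract_ug_per_ml = measured_ppm * dilution_factor
    # Total cannabinoid extracted from the sample (ug): 20 ml x 20-fold = 400x.
    total_ug = extract_ug_per_ml * extract_ml
    # Express as a percentage of the recorded sample weight (mg -> ug).
    return 100.0 * total_ug / (sample_mg * 1000.0)
```

Under these assumptions, a reading of 50 ppm from a 200 mg sample corresponds to 10% by dry weight, and the recorded exact weight of each sample replaces the nominal 200 mg.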
Genetic and chemical variation within accessions was compared by calculating the average genetic distance within each accession [calculated in TASSEL v5.2.58 (Bradbury et al., 2007)] and comparing it to the variance of measured CBD or THC for plants of that accession, using either raw or log-transformed data.
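To make this comparison concrete, the sketch below computes the two per-accession summaries: mean pairwise genetic distance and the variance of a cannabinoid measurement. It is an illustration only. The distance used here (mean absolute difference in alternate-allele count, divided by 2) is a simple stand-in and is not claimed to match the distance TASSEL computes; the function names and the 0/1/2 genotype encoding with `None` for missing calls are hypothetical.

```python
from itertools import combinations
from statistics import mean, variance

def pairwise_distance(g1, g2):
    """Mean allele-sharing distance between two plants (0 = identical, 1 = opposite homozygotes)."""
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        raise ValueError("no sites genotyped in both plants")
    return mean([abs(a - b) / 2 for a, b in shared])

def accession_summary(genotypes, cannabinoid_pct):
    """Return (mean pairwise genetic distance, sample variance of cannabinoid %).

    genotypes:       one list of 0/1/2/None calls per plant (at least two plants)
    cannabinoid_pct: measured CBD or THC (% dry weight) for the same plants
    """
    distances = [pairwise_distance(a, b) for a, b in combinations(genotypes, 2)]
    return mean(distances), variance(cannabinoid_pct)
```

Plotting one summary against the other across accessions would then show whether genetically variable accessions are also chemically variable.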

RESULTS
We planted 8 replicates of the 22 accessions, resulting in 176 total pots. Of these, 3 had no seeds germinate despite having 15 seeds originally planted, and 26 developed male flowers and so were removed from the experiment. (The males were from 11 different accessions, none of which claimed to be feminized seed; Supplementary Table 1.)
All 148 remaining female plants were genotyped with genotyping-by-sequencing (GBS), resulting in ∼48,000 markers after filtering (see Methods). Genetic clustering showed some expected patterns, such as the fiber accessions clustering together and clones clustering tightly to each other (Figure 1A). Minor differences among clones are presumably due to a low level of sequencing errors that made it through our filters.
Most of the seed-grown CBD accessions showed little consistency. For example, two accessions named "Baox" ("Baox" and "BaoxSP07," from 2 different suppliers) showed no real relationship to each other. Most accessions are split across the tree and involve at least two separate clusters. Some of these seem to be the result of a single outlier (e.g., Baox and Kau'XXX), but others include multiple individuals in each cluster (Abacus Early Bird; Abacus Early Bird 2; KauXX; Otto II), and still others are scattered across the entire tree (AbacusxBB1, Berry Blossom, Wife) (Figures 1B-D; Supplementary Figures 2, 3).
To compare these data to prior results, we downloaded public whole-genome sequencing data from 55 Cannabis sativa accessions (Lynch et al., 2016) and called SNPs using the same pipeline. This resulted in 8867 quality-filtered SNPs shared between the datasets. Based on these SNPs, we find that CBD-, fiber-, and seed-type accessions cluster with their respective types, regardless of which dataset they originate from. Meanwhile, all the published marijuana-type plants cluster separately (Supplementary Figure 4). These results match prior published data, where CBD-type hemp plants are usually genetically distinct from marijuana (Sawler et al., 2015; Lynch et al., 2016; Grassa et al., 2021).
Of the 148 female plants, eleven had flowers that did not properly develop, leaving 137 flower samples to test for cannabinoid content. THC levels ranged from undetectable up to 11.08% THC by dry weight. Eighty-nine plants produced flowers with more than the legal limit of 0.3% THC, including 15 plants with >1% THC and one with >10% THC (Figure 2A). CBD levels ranged from undetectable (mostly fiber varieties) up to 16.7% dry weight (Figure 2B). Twelve plants produced THC without any CBD, including three with >1% THC by dry weight, though all of these were fiber varieties.
Similar to the genetic relationships, most accessions showed little consistency for cannabinoid production. The clones ("BaoxSP 07" and "LifterSP 01") were generally tightly clustered for both THC and CBD (Figure 2), but seed-grown accessions showed significant variability for both phenotypes. The most concerning ones were several accessions (AbacusxBB1, Berry Blossom, and Cherry original) which were sold as high-CBD lines but had multiple plants with no detectable CBD production, representing a wasted investment for growers. Meanwhile, one plant of "AbacusxBB1" contained >10% THC by dry weight, meaning it is not just legally but functionally a drug-type marijuana plant. Although this chemical variability reflects the genetic variability in accessions, there was not a clear relationship between the two, meaning that the more genetically variable accessions were not also more chemically variable (Supplementary Figure 5).
The ratio of CBD to THC varied from ∼0 (for the plants that produced no CBD) up to ∼28:1 (Figure 2C). Some plants produced no detectable THC (and thus have no ratio), but these plants also had very low levels of CBD (Supplementary Table 2). Independent of genetic data, the CBD:THC ratio can indicate which genes are present in the plant, with ratios of ∼20:1 indicating no active THCa synthase and ratios of ∼2:1 indicating at least one active THCa synthase gene (Toth et al., 2020). Based on their chemical profiles, twenty-eight plants had ratios that indicate the potential presence of a copy of the THCa synthase gene (Figure 2C).
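The ratio-based inference described above can be sketched as a small classifier. This is an illustration under stated assumptions, not the rule used in the study or by Toth et al. (2020): it assigns each plant to whichever reference ratio (∼2:1 or ∼20:1) is closer on a log scale, which is a choice made here for simplicity, and the function name and output labels are hypothetical.

```python
import math

def suggested_thca_synthase_status(cbd_pct, thc_pct):
    """Suggest THCa synthase status from CBD and THC content (% dry weight)."""
    if thc_pct <= 0:
        # No detectable THC: the ratio is undefined.
        return "no ratio (THC undetectable)"
    if cbd_pct <= 0:
        # THC without any CBD: ratio of ~0.
        return "active THCa synthase suggested"
    log_ratio = math.log(cbd_pct / thc_pct)
    # Assumption: pick the nearer of the two reference ratios on a log scale.
    if abs(log_ratio - math.log(2)) < abs(log_ratio - math.log(20)):
        return "active THCa synthase suggested"
    return "no active THCa synthase suggested"
```

Chemical ratios alone are only indirect evidence, so any plant flagged this way would still need a genetic test to confirm which synthase genes are present.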
Twelve accessions came with a certificate of analysis (COA), which suppliers use to show what level of cannabinoid production should be expected from the plant. These certificates are important for growers to know that their crop will remain under the 0.3% legal limit for THC, along with estimating the return on investment for CBD. However, most accessions had less CBD than their COA showed, and almost all of them had more THC (Figure 2), both of which could potentially cause issues for commercial growers.

DISCUSSION
The current federal regulations have created a fine line between legal hemp and illegal marijuana. Inconsistencies in plant genetics can greatly complicate the already complex process of legally growing high-CBD hemp. Although planting clones ensures the highest consistency, many farmers choose to plant seeds because of their much lower cost. This lower cost comes with risk due to inconsistent plant genetics and seed feminization that can make it difficult to produce hemp profitably (Meier and Mediavilla, 1998).

Seed Feminization
Many farmers have received seeds that were improperly feminized or not feminized at all, resulting in lost revenue and lawsuits (Associated Press, 2019). All feminized product used in this experiment produced only female flowers, although the numbers were too small to draw definite conclusions. (Farmers consider a seed lot to be well feminized if fewer than 1 in 4,000 plants produce male flowers; personal observation.)

Genetic Relationships
Most accessions tested showed little genetic consistency (Figure 1 and Supplementary Figures 2, 3), which likely explains their phenotypic inconsistency (Figure 2). As expected, clonal accessions were the major exception, though some seed accessions (like Chardonnay) showed good within-accession consistency. This indicates that at least some accessions from some suppliers are reliable, although without extensive testing it is impossible to say which. Conversely, plant accessions with the same name but from different suppliers did not show any genetic clustering (Figures 1A,B and Supplementary Figure 2), meaning that they are no more related than any two random accessions. Growers should be careful when purchasing materials; the safest approach is probably to assume that each supplier is selling completely different seed regardless of what they name it. In this way, the naming issues of high-CBD hemp appear to parallel those of marijuana-type Cannabis (Sawler et al., 2015; Vergara et al., 2019).

CBD Accessions Without CBD
The fact that many CBD accessions contained plants that produced no detectable CBD is concerning (Figure 2A). The most likely reason for this result is that the seed lots contained mixed varieties of plants, some of which lacked the CBDa synthase gene. Copy-number variation for CBDa synthesis genes is extensive in Cannabis (Vergara et al., 2019), and although this is concerning for growers, it has a simple fix: producers should periodically screen their materials via chemical and/or genetic tests to confirm that all of the plants in seed production contain active CBDa synthase. The results of such screening can then be included on the Certificate of Analysis.

THC Production
The most concerning results from this experiment were the number of plants that produced excessively high levels of THC. The CBDa synthase gene naturally produces low levels of THC (Zirpel et al., 2018), so any plant producing CBD will have some amount of THC. However, plants with as much or more THC production than CBD almost certainly have an active THCa synthase gene (Toth et al., 2020). Eight CBD accessions had at least one plant that showed chemical evidence of an active THCa synthase gene (Figure 2), even though all were supposed to be low-THC varieties. The apparent presence of an active THCa synthase gene in CBD-production lines is very concerning, and the rate was surprisingly high (28 of 121 seed-grown plants, including one plant with > 10% THC). More extensive testing would be needed to see if any of the other CBD accessions also contain plants with active THCa synthase.
All four fiber accessions also showed evidence of active THCa synthase. Since fiber varieties are not grown in such a way as to produce large amounts of cannabinoids, their containing THCa synthase is not a problem for fiber growers per se. It is, however, of potential concern insofar as any contamination of fiber varieties into CBD accessions (via seed swaps, pollen contamination, etc.) could introduce an active THCa synthase gene into supposedly THC-free varieties. Producers should invest in regularly screening materials for THCa synthase genes in the same way we recommend they test for active CBDa synthase (above) so as to keep their accessions pure. In the meantime, growers may want to invest some time and resources into testing small batches of seeds from different suppliers to identify which ones are the most stable and trustworthy (not to mention high-performing).

Limits to CBD Production
As previously mentioned, CBDa synthase naturally produces low levels of THC, which explains why almost all the plants tested showed some level of THC (Supplementary Table 2). Even with THC much below CBD production, the plants which produced the highest levels of CBD all exceeded the federal level of 0.3% THC at full maturity. This implies that, with current varieties, there may be a limit to how much CBD a plant can produce while staying below the legal limit of THC. In the long run this limit might be improved by using natural or induced variation in the CBDa synthase gene to select for more specific enzyme variants. For now, however, frequent testing of plants as the flowers mature can help farmers determine when their plants are getting too close to that limit and adjust their harvest times accordingly.

Certificates of Analysis
One concerning pattern we noticed was that several COAs were printed so that they show misleadingly low levels of THC. Specifically, they highlighted the low levels of Δ9-THC (the actual psychoactive form) while de-emphasizing THCa (the acid precursor that is decarboxylated into Δ9-THC by heat). United States federal testing guidelines require including both (Agricultural Marketing Service, 2019). Using the official formula of [total THC] = [Δ9-THC] + 0.877 × [THCa] (Agricultural Marketing Service, 2021), only 7 of the 14 accessions with COAs actually claimed total THC levels below the 0.3% limit. This is a separate issue from how much CBD/THC the plants actually produce (Figure 2), since the COA functions as a decision-making tool for the grower before planting even begins. Some of these COAs may have been issued before the interim final rule that established these guidelines (October 31, 2019) (Agricultural Marketing Service, 2019); if so, one would hope that the companies have updated them with the new guidelines. Nonetheless, growers should pay close attention when ordering materials and ensure that the product information is reported accurately so that they can make the most informed decisions about their product.
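The total-THC formula above is simple enough to apply directly when reading a COA. The short sketch below shows how a certificate that looks compliant on Δ9-THC alone can exceed the limit once THCa is included; the function names and the example values are hypothetical and are not taken from any COA in this study.

```python
THCA_TO_THC = 0.877      # conversion factor from the total-THC formula
LEGAL_LIMIT_PCT = 0.3    # legal THC limit, % by dry weight

def total_thc(delta9_thc_pct, thca_pct):
    """Total THC (% dry weight) = delta-9-THC + 0.877 * THCa."""
    return delta9_thc_pct + THCA_TO_THC * thca_pct

def exceeds_limit(delta9_thc_pct, thca_pct):
    """True if total THC is above the 0.3% limit."""
    return total_thc(delta9_thc_pct, thca_pct) > LEGAL_LIMIT_PCT

# Hypothetical COA values: 0.05% delta-9-THC looks low on its own, but with
# 0.40% THCa the total is 0.05 + 0.877 * 0.40 = 0.4008%, which is over the limit.
example_total = total_thc(0.05, 0.40)
```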

CONCLUSION
The high-CBD hemp industry is experiencing many growing pains associated with its rapid development in the last 6 years. There are issues with genetic stability, economic viability, and governmental regulations. Despite these issues, the market continues to grow year after year, and interest in this crop continues to expand. With the support that this crop receives from consumers and the support it is beginning to receive from a wide range of researchers, there is a great opportunity for hemp to play an increasingly important role in a wide range of industries. Given the variability we found both among and within accessions, some sort of standardization is needed so that producers can be confident in the material they receive. A good first step would be for suppliers to start using unique names for each of their accessions. Not only will this help clarify the market, it will also allow each company to capitalize on branding their own unique varieties. A more complex but badly needed step is an industry seed-certification process to allow growers to purchase with confidence. Some states are already moving forward with their own seed certifications (HEMP, 2019; Georgia Crop Improvement Association, 2020), but until rigorous, independent verification is implemented across the industry, growers face the prospect of getting a bad lot any time they purchase from a new supplier. Ultimately, these and other changes will need to occur to make the market for high-CBD hemp robust, reliable, and sustainable over the long-term.

DATA AVAILABILITY STATEMENT
Raw sequencing data is available through the NCBI Sequence Read Archive (BioProject PRJNA707556). Bioinformatic scripts and key intermediate files are available on GitHub at https://github.com/wallacelab/paper-johnson-hemp-gs.