PAZAR - XML format
Step-by-Step Documentation

Step 2: Capturing the regulatory sequence and/or TF basic information

Once the project element has been defined (see Step 1), you are ready to enter sequence and transcription factor information. These will be entered within the 'data' element, which is a child element within the 'pazar' element.

2.0 - Initialization
The 'data' element stores all the annotations separately. They will be linked together later in the 'analysis' element (see Step 3).
First, the 'data' element has to be initialized:

  <data>

Then, different types of annotations can be inserted:
  1. Regulatory Sequence for a Specific Gene
  2. Regulatory Sequence without any gene information
  3. Transcription Factor
  4. Artificial sequence/Sequence not attached to genomic coordinates
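
As a rough orientation, the four annotation types might sit inside the 'data' element as outlined below. This is only a sketch based on the nesting described in sections 2.1 to 2.4: the comments stand in for real elements, any attributes on 'pazar' are omitted, and the closing </data> tag and the positions of the Step 1 and Step 3 elements are assumptions.

    <pazar>
      <!-- project element from Step 1 (assumed to come first) -->
      <data>
        <!-- 2.1: gene_source containing tsr containing reg_seq -->
        <!-- 2.2: marker containing reg_seq -->
        <!-- 2.3: gene_source containing transcript containing tf, followed by funct_tf -->
        <!-- 2.4: construct -->
      </data>
      <!-- analysis element from Step 3 (assumed to come last) -->
    </pazar>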

2.1 - Annotating a Regulatory Sequence for a Specific Gene
The 'reg_seq' element is embedded within 'tsr' and 'gene_source' elements. The 'gene_source' element gives the gene accession number. The 'tsr' element describes the transcription start region, reflecting the observation that transcription does not always start at exactly the same nucleotide (a unique start site can still be described by giving fuzzy_start and fuzzy_end the same value).
Thus, if a gene has two alternative promoters, each one can be described with its own 'tsr' element within the 'gene_source' element, and different regulatory sequences can be associated with each 'tsr'.

    <gene_source db_accn="ENSG00000133256" description="PDE6B" pazar_id="gs_0001">
      <db_source db_name="EnsEMBL" assembly="37_35j"/>
      <tsr fuzzy_end="609373" fuzzy_start="609373" pazar_id="tsr_0001">
        <reg_seq pazar_id="rs_0001" quality="tested" sequence="ATTTGTAGGAGTGAGTCAGCTGACCCGC">
          <coordinate begin="609283" end="609310" length="28" strand="+">
            <location band="p16.3" chr="4" species="Homo sapiens">
              <db_source db_name="EnsEMBL" assembly="NCBI 35"/>
            </location>
          </coordinate>
        </reg_seq>
      </tsr>
    </gene_source>

Replace the attribute values with your own information.
The pazar IDs are internal IDs that will not be stored. They can be anything as long as they are unique throughout the file.
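
To illustrate the alternative-promoter case, the sketch below gives the same gene two 'tsr' elements: one with a unique start site (identical fuzzy_start and fuzzy_end) and one with a start region. Only the first 'tsr' repeats the example above. The second 'tsr', its 'reg_seq', and all of their coordinates and sequence are made-up placeholders, not real annotations.

    <gene_source db_accn="ENSG00000133256" description="PDE6B" pazar_id="gs_0001">
      <db_source db_name="EnsEMBL" assembly="37_35j"/>
      <!-- promoter 1: unique start site, fuzzy_start equals fuzzy_end -->
      <tsr fuzzy_end="609373" fuzzy_start="609373" pazar_id="tsr_0001">
        <!-- reg_seq rs_0001 as in the example above -->
      </tsr>
      <!-- promoter 2: hypothetical start region spanning several nucleotides -->
      <tsr fuzzy_end="610140" fuzzy_start="610100" pazar_id="tsr_0002">
        <!-- hypothetical sequence and coordinates, for illustration only -->
        <reg_seq pazar_id="rs_0002" quality="tested" sequence="ACGTACGTAC">
          <coordinate begin="610000" end="610009" length="10" strand="+">
            <location band="p16.3" chr="4" species="Homo sapiens">
              <db_source db_name="EnsEMBL" assembly="NCBI 35"/>
            </location>
          </coordinate>
        </reg_seq>
      </tsr>
    </gene_source>

Each 'tsr' and 'reg_seq' carries its own pazar ID, so the IDs stay unique throughout the file.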


2.2 - Annotating a Regulatory Sequence without any gene information
The 'reg_seq' element can also be embedded in a 'marker' element if the gene regulated by the sequence is not yet known. The marker can be a gene, but it is then used only to locate the sequence, not to imply that the sequence plays any role in regulating that gene.

    <marker db_accn="ENSG00000133256" description="PDE6B" pazar_id="ma_0001">
      <db_source db_name="EnsEMBL" assembly="37_35j"/>
      <reg_seq pazar_id="rs_0001" quality="tested" sequence="ATTTGTAGGAGTGAGTCAGCTGACCCGC">
        <coordinate begin="609283" end="609310" length="28" strand="+">
          <location band="p16.3" chr="4" species="Homo sapiens">
            <db_source db_name="EnsEMBL" assembly="37_35j"/>
          </location>
        </coordinate>
      </reg_seq>
    </marker>

Replace the attribute values with your own information.
The pazar IDs are internal IDs that will not be stored. They can be anything as long as they are unique throughout the file.

2.3 - Annotating a Transcription Factor
A transcription factor is described in two steps.
First, at the gene level: the 'tf' element is embedded in both 'transcript' and 'gene_source' elements. Multiple 'transcript' elements can be used to describe multiple isoforms of a gene.
Then, at the protein level: the 'funct_tf' element captures the functional protein information with as many 'tf_unit' elements as there are proteins in the complex (one for a monomer, two for a dimer, and so on). The tf_id attribute of each 'tf_unit' refers to the pazar_id of a 'tf' element.

    <gene_source db_accn="ENSG00000129535" description="NRL" pazar_id="gs_0002">
      <db_source db_name="EnsEMBL" assembly="37_35j"/>
      <transcript db_accn="ENST00000250471" pazar_id="tr_0002">
        <db_source db_name="EnsEMBL" assembly="37_35j"/>
        <tf class="bZIP" family="MAF" pazar_id="tf_0001"/>
      </transcript>
    </gene_source>
    <funct_tf funct_tf_name="NRL" pazar_id="fu_0001">
      <tf_unit pazar_id="tu_0001" tf_id="tf_0001"/>
    </funct_tf>

Replace the attribute values with your own information.
The pazar IDs are internal IDs that will not be stored. They can be anything as long as they are unique throughout the file.
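
For a complex, the same pattern repeats with one 'tf_unit' per protein. The sketch below assumes that a second 'tf' element with pazar_id tf_0002 has been defined in its own 'gene_source' and 'transcript' elements elsewhere in the file. That ID, the funct_tf_name, and the other pazar IDs are hypothetical.

    <funct_tf funct_tf_name="example dimer" pazar_id="fu_0002">
      <!-- first subunit: the NRL 'tf' element defined above -->
      <tf_unit pazar_id="tu_0002" tf_id="tf_0001"/>
      <!-- second subunit: hypothetical 'tf' element defined elsewhere in the file -->
      <tf_unit pazar_id="tu_0003" tf_id="tf_0002"/>
    </funct_tf>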


2.4 - Annotating an Artificial Sequence
The 'construct' element can be used to describe any sequence without specific genomic coordinates (e.g. a synthesized oligonucleotide representing a consensus binding site).

    <construct construct_name="FN-13A" description="random oligo" sequence="gggtgagtcagcg" pazar_id="co_0001"/>
Replace the attribute values with your own information.
The pazar IDs are internal IDs that will not be stored. They can be anything as long as they are unique throughout the file.