Difference between revisions of "CTSC Ellen Grant, CHB"

From NAMIC Wiki

Revision as of 10:52, 30 September 2009

Back to CTSC Imaging Informatics Initiative


Mission

Use-Case Goals

We will approach this use-case in three distinct steps, including Basic Data Management, Query Formulation and Processing Support.

  • Step 1: Data Management
    • Step 1a.: Describe and upload retrospective datasets (roughly 1 terabyte) onto the CHB XNAT instance and confirm appropriate organization and naming scheme via web GUI.
    • Step 1b.: Describe and upload new acquisitions as part of data management process.
  • Step 2: Query Formulation
    • making specific queries using XNAT web services,
    • data download conforming to specific naming convention and directory structure, using XNAT web services
    • ensure all queries required to support processing workflow are working.
  • Step 3: Data Processing
    • Implement & execute the script-driven tractography workflow using web services,
    • describe and upload results.
    • ensure results are appropriately structured and named in the repository, and queryable via web GUI and web services.


Participants

  • sites involved: MGH NMR center, MGH Radiology, CHB Radiology
  • number of users: ~10
  • PI: Ellen Grant
  • staff: Rudolph Pienaar
  • clinicians
  • IT staff

Outcome Metrics

Step 1: Data Management

  • Visual confirmation (via web GUI) that all data is present, organized and named appropriately
  • other?

Step 2: Query Formulation

  • Successful tests that responses to XNAT queries for all MRIDs given a protocol name match results returned from currently-used search on the local filesystem.
  • Query/Response should be efficient
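
The comparison test described for Step 2 can be sketched as a set difference between the two result lists. This is only an illustration: the function names are hypothetical, and how each list is actually obtained (XNAT query versus filesystem search) is left out.

```python
def compare_mrid_sets(xnat_mrids, local_mrids):
    """Report MRIDs found by only one of the two searches."""
    xnat, local = set(xnat_mrids), set(local_mrids)
    return {
        # Present in the local filesystem search but not returned by XNAT.
        "missing_from_xnat": sorted(local - xnat),
        # Returned by XNAT but not found by the local search.
        "unexpected_in_xnat": sorted(xnat - local),
    }


def results_match(xnat_mrids, local_mrids):
    """True when both searches returned exactly the same set of MRIDs."""
    diff = compare_mrid_sets(xnat_mrids, local_mrids)
    return not diff["missing_from_xnat"] and not diff["unexpected_in_xnat"]
```

Order and duplicates are ignored on purpose, since the metric only asks whether the same MRIDs come back for a given protocol name.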

Step 3: Data Processing

  • Pipeline executes correctly
  • Pipeline execution not substantially longer than when all data is housed locally
  • other?

Overall

  • Local disk space saved?
  • Data management more efficient?
  • Data management errors reduced?
  • Barriers to sharing data lowered?
  • Processing time reduced?
  • User experience improved?

Fundamental Requirements

  • System must be accessible 24/7
  • System must be redundant (no data loss)
  • Need a better client than current web GUI provides:
    • faster
    • PACS-like interface.
    • image viewer should open in SAME window (not pop up a new one)
    • number of clicks to get to image view should be as few as possible.

Outstanding Questions

Plans for improving web GUI?

Data

Retrospective data consists of ~1787 studies, ~1TB total. Data consists of

  • MR data, DICOM format
  • Demographics from DICOM headers
  • Subsequent processing generates ".trk" files
  • ascii text files ".txt"
  • files that contain protocol information

Workflows

Current Data Management Process

Raw DICOM images are produced at the radiology PACS at MGH and are manually pushed to the PACS hosted on KAOS, which resides at the MGH NMR center. A set of Perl scripts then renames and reorganizes the images into a file structure in which all images for a study are saved in a directory named for that study. DICOM images are currently viewed with Osiris on Macs.

CTSCInformatics GrantPienaarCurrentDataManagement.png

Target Data Management Process (Step 1)

Step 1: Develop an Image Management System for BDC (IMS4BDC) with which at least the following can be done:

  • Move images from MGH (KAOS) to a BDC machine at Children's
  • Step 1a: Import legacy data into IMS4BDC from existing file structure and CDs
  • Step 1b: Write scripts to execute upload of newly acquired data.

CTSCInformatics GrantPienaarDataManagementStep1.png

Target Query Formulation (Step 2)

Step 2. Develop Query capabilities using scripted client calls to XNAT web services, such as:

Show all subjectIDs scanned with protocol_name = ProtocolName
Show all diffusion studies where patient age is < 6
  • Scripting capabilities: Scripts need to query and download data into appropriate directory structure, and support appropriate naming scheme to be compatible with existing processing workflow.
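
The two example queries can be sketched as filters over session records. This is not the XNAT web-services call itself: `SessionRecord` and its field names are hypothetical stand-ins for whatever a query returns, and the unit of `age` is left to the caller because the notes do not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionRecord:
    # Hypothetical flattened record; all field names are assumptions.
    mrid: str
    protocol_name: str
    scan_type: str
    age: float  # unit is the caller's choice and must match the threshold


def subjects_with_protocol(records, protocol_name):
    """Show all subject IDs scanned with the given protocol name."""
    return sorted({r.mrid for r in records if r.protocol_name == protocol_name})


def studies_younger_than(records, scan_type, max_age):
    """Show all studies of one scan type where patient age is below max_age."""
    return [r for r in records if r.scan_type == scan_type and r.age < max_age]
```

For instance, `studies_younger_than(records, "diffusion", 6)` corresponds to the second query, with "diffusion" standing in for however the scan type is labeled in the data.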

CTSCInformatics GrantPienaarDataManagementStep2.png

Target Processing Workflow (Step 3)

Step 3:

CTSCInformatics GrantPienaarDataManagementStep3.png

  • Execute query/download scripts
  • Run processing locally, on cluster, etc.
  • Describe & upload processing results
  • (eventually want to) Share images with clinical physicians
  • (eventually want to) Export post-processed data back to clinical PACS

Fitting Data to XNAT Data Model

Test data from Rudolph

I think we have this mapping from project to XNE data model:

  • MRID = SubjectID (1687 subjects?)
  • each SubjectID may have single experiment, but multiple MRSessions within that experiment
  • each "storage" directory for a particular MRID (in dcm_mrid.log) = MRSessionID
  • each scan listed (in toc.txt) = ScanID in the MRsession
  • important metadata contained in dicom headers, in dcm_mrid_age.log, dcm_mrid_age_days.log, and in the toc.txt file in each session directory.

This gives us a unique way to

  • have unique subject IDs
  • have unique MRSessionIDs for each subject,
  • have unique scanIDs within each MRSession
  • search for subject (by ID, age, or dicom header info) or
  • search for image data by age (or dicom header info)
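
A minimal sketch of this mapping, assuming the local data has already been read into `(mrid, session_dir, scan_id)` rows (the row format and the function name are hypothetical). It groups rows into subject, then MRSession, then scans, and rejects the two cases that would break uniqueness.

```python
def build_hierarchy(rows):
    """Group (mrid, session_dir, scan_id) rows into subject -> session -> scans."""
    tree = {}
    owner = {}  # session_dir -> mrid; a storage directory belongs to one subject
    for mrid, session_dir, scan_id in rows:
        if owner.setdefault(session_dir, mrid) != mrid:
            raise ValueError(
                f"session {session_dir!r} listed under {owner[session_dir]!r} and {mrid!r}"
            )
        scans = tree.setdefault(mrid, {}).setdefault(session_dir, [])
        if scan_id in scans:
            raise ValueError(f"duplicate scan {scan_id!r} in session {session_dir!r}")
        scans.append(scan_id)
    return tree
```

The same MRID may appear with several session directories, which is the expected case for a patient scanned more than once.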

As regards anonymization

Rudolph doesn't specifically need XNAT to do the anonymization. He wants XNAT to contain all relevant data and, where necessary, to export or transmit DICOM data anonymized.

Rudolph has his own MATLAB script that can do batch anonymization -- but if possible the XNAT package should probably provide a means for that.

Draft approach to uploading data for Rudolph

  • Create a Project on the webGUI
  • Write a webservices-client script that will batch:
    • create subject (tested)
    • create experiment for subject (tested)
    • create mrsession for subject (tested)
    • for each scan in mrsession
      • anonymize (not sure)
      • do dicom markup (not sure)
      • add other metadata from toc.txt and *.log files (not sure)
      • upload scan data into db (tested)
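
The order of operations in this draft can be sketched as a dry-run plan that lists the calls without making them. The operation names and the two input dictionaries are hypothetical, and the steps marked "not sure" above are deliberately left out.

```python
def plan_upload(sessions_by_mrid, scans_by_session):
    """Yield (operation, *args) tuples in the order the batch would issue them.

    sessions_by_mrid: {mrid: [session_dir, ...]}
    scans_by_session: {session_dir: [scan_id, ...]}
    """
    for mrid, sessions in sessions_by_mrid.items():
        # One subject and one experiment per MRID, as in the draft list.
        yield ("create_subject", mrid)
        yield ("create_experiment", mrid)
        for session in sessions:
            yield ("create_mrsession", mrid, session)
            for scan in scans_by_session.get(session, []):
                yield ("upload_scan", mrid, session, scan)
```

Printing `list(plan_upload(...))` before a real run gives a quick check that every subject is created before its sessions and every session before its scans.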


Current data (subset) organization

RudolphTestData.png

The top level directory contains

  • a dcm_MRID.log file that contains a mapping between MRID's (PatientIDs?) and unique MRSessionNames
  • a dcm_MRID_age.log file that maps MRID's to ages in months and years
  • a dcm_MRID_age_days.log file that maps MRID's to ages in days
  • subdirectories named for MRSessions.
  • each subdirectory contains a toc.txt file that includes patient and session information and a list of scans and scan types.

See examples below:

dcm_MRID.log

dcm_MRID_age.log

dcm_MRID_age_days.log

toc.txt

Questions sent to Rudolph about test data:

First, in the top-level dir, there are three log files: dcm_MRID.log, dcm_MRID_age.log, and dcm_MRID_ageDays.log

1. do these files contain the MRIDs for *all* subjects in the entire retrospective study?

--Yes, at least current to the timestamp of the log file.

2. Some MRIDs appear to be purely numerical, and some alphanumerical. (3_S_658300). Is that correct?

Yes again. Unfortunately there seems to be no standard technique for spec'ing 
the MRID number. This number, however, is the key most often used by clinicians, 
and thus is a primary key for the database. Problems abound, of course -- the MRID is 
linked to a single patient, but is not necessarily guaranteed to be unique. A patient 
keeps the same MRID, so multiple scans result in multiple instances. The combination of 
MRID+<storageDirectory> is unique (but also redundant, since the <storageDirectory> is 
unique by definition). So essentially the log files are lookup tables for MRIDs and 
actual storage locations in the filesystem.

3. Two age files, one contains age in months or years (1687 entries) -- the other contains age in days (1687 entries):

  • does this mean there are 1687 MRID's in total?
  • what does age (days) = -1 mean?
These are just convenience files. Often times a typical 'query' would be: 
"findallMRID WHERE age IN <someAgeConstraint> AND protocol IN <someProtocolConstraint>."
The dcm_MRID_age.log maps the MRID to a storage location, and provides the 
age as tagged in the DICOM header. Of course, mixing different age formats 
(like 012M and 004W etc) isn't batch processing friendly. So, the 
dcm_MRID_ageDays.log converts all these age specifiers to days, and sorts 
the table on that field.
The '-1' means that some error occurred in the day calculation. 
Most likely, the associated age value wasn't present. 
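
The conversion described here can be sketched as follows. DICOM age strings are three digits followed by D, W, M, or Y; the day counts per month and per year below are assumptions, since the notes do not say which factors the original script uses, and the function names are hypothetical.

```python
import re

# Assumed conversion factors; the original script may use different ones.
DAYS_PER_UNIT = {"D": 1, "W": 7, "M": 30, "Y": 365}
_AGE_RE = re.compile(r"(\d{3})([DWMY])")


def age_to_days(age):
    """Convert a DICOM age string such as '012M' to days, or -1 on error."""
    match = _AGE_RE.fullmatch((age or "").strip().upper())
    if match is None:
        return -1  # missing or malformed age value
    count, unit = match.groups()
    return int(count) * DAYS_PER_UNIT[unit]


def sort_by_age_days(entries):
    """Turn (mrid, storage_dir, age_string) rows into rows sorted by age in days."""
    return sorted(
        ((mrid, storage, age_to_days(age)) for mrid, storage, age in entries),
        key=lambda row: row[2],
    )
```

Rows whose age could not be parsed sort first because of the -1 value, which makes them easy to spot at the top of the table.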

4. And the two dataset directories you shared:

Avanto-26039-20080130-134825-078000/

GENESIS_SIGNA-000000000000234-20041122-211850/

each directory contains data and a toc.txt file that includes:

  • the "PatientID": is this equivalent to MRID?
  • and some other info including age, scan date, etc.
  • the filenames and scan types of a *set* of scans:
    • collected in one MRsession on that scan date?
    • or in the entire retrospective project?
  • and is all the data for the set of scans listed contained in this directory?
true: PatientID == MRID 
The filenames and scantypes correspond to one session on that scan date. 
Other scan dates for that MRID will be in different directories. 
Essentially, the data is packed according to 
<scannerSpec>-<scanprefix>-<scandate>-<scantime>-<trailingID>.
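
A minimal sketch of splitting a directory name into these fields. It assumes the date is YYYYMMDD and the time is HHMMSS (inferred from the two example directories), that the trailing ID is optional because one example lacks it, and that only the scanner spec may itself contain hyphens. The function name is hypothetical.

```python
import re
from datetime import datetime

_DIR_RE = re.compile(
    r"(?P<scanner>.+?)-(?P<prefix>[^-]+)-(?P<date>\d{8})-(?P<time>\d{6})"
    r"(?:-(?P<trailing>[^-]+))?"
)


def parse_session_dir(name):
    """Split a session directory name into its dash-separated fields."""
    match = _DIR_RE.fullmatch(name.rstrip("/"))
    if match is None:
        raise ValueError(f"unrecognized session directory name: {name!r}")
    fields = match.groupdict()  # 'trailing' is None when absent
    # Assumed formats: date as YYYYMMDD, time as HHMMSS.
    fields["acquired"] = datetime.strptime(
        fields["date"] + fields["time"], "%Y%m%d%H%M%S"
    )
    return fields
```

Applied to `Avanto-26039-20080130-134825-078000/`, this yields scanner `Avanto`, prefix `26039`, and trailing ID `078000`; the GENESIS_SIGNA directory parses with no trailing ID.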

Upload tests:

Remote script to upload -- parsing Rudolph's data and making webservices calls to create subjects:

Subject creation while running: each subject takes ~1 second

Local data organization to XNAT data model mapping

implementing script to batch upload data from local to remote using webservices:

#----------------------------------------------------------------------------------------
# Local and Remote data organization notes
#----------------------------------------------------------------------------------------
# Batch upload of retrospective study maps local data organization
# to XNAT data model in the following way:
# xnat Project = PienaarGrant
# xnat Project contains list of unique xnat SubjectIDs (= local MRIDs)
# Each xnat SubjectID contains a list of xnat "Experiments"
# Each xnat Experiment is an "MRSession" with unique ID (and label = local dirname from dcm_mrid.txt)
# Each xnat MRSession labeled with local dirname contains set of xnat "Scans" (listed in local toc.txt)
# Each xnat Scan contains the image data.
# Project, Subjects, Experiments, and Scans can all have searchable metadata
# Projects, Experiments can have associated files (dcm_mrid.txt, toc.txt, error logs, etc.)


#----------------------------------------------------------------------------------------
# A schematic of the local organization looks like this:
#----------------------------------------------------------------------------------------
#
# Project root dir
#     |
# { MRSession1,  MRSession2, ..., MRSessionN} + {projectfiles = dcm_MRID.log, dcm_MRID_age.log, dcm_MRID_ageDays.log, error.log}
#     |...
# { scanfiles=scan1.dcm, scan2.dcm,...,scanM.dcm } + {sessionfiles = toc.txt, toc.err, log/*, log_V/*}
#


#----------------------------------------------------------------------------------------
# A schematic of the xnat organization looks like this:
#----------------------------------------------------------------------------------------
#
# Project + {projectfiles.zip}
#     |
# { SubjectID1, SubjectID2, ... SubjectIDN }
#                                    ...|
#            List of "Experiments" { MRSession, MRSession, ..., MRSession}
#                                                                ...|              
#                                                       {scan1...scanN} + {sessionfiles.zip}    
#
# An XNAT "Experiment" is an XNAT "imagingSession"
# An ImagingSession may be a MRSession, PETSession or CTSession (on central.xnat.org)
# -we are calling each local directory for a given MRID a new XNAT Experiment (MRSession).
# Each XNAT "imagingSession" contains a collection of XNAT "scans".
# we are calling each scan in a given local directory an XNAT "scan" in the MRSession.
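
The hierarchy above maps naturally onto nested resource paths. The sketch below builds such a path; the `projects/subjects/experiments/scans` layout and the default prefix are assumptions to check against the REST documentation of the XNAT version in use, and `xnat_path` is a hypothetical helper, not part of the upload scripts.

```python
from urllib.parse import quote


def xnat_path(project, subject=None, session=None, scan=None, prefix="/data"):
    """Build a nested resource path for a project, subject, MRSession or scan."""
    levels = [
        ("projects", project),
        ("subjects", subject),
        ("experiments", session),  # each MRSession is an XNAT experiment
        ("scans", scan),
    ]
    segments = [prefix.rstrip("/")]
    gap = False
    for collection, ident in levels:
        if ident is None:
            gap = True
            continue
        if gap:
            # A deeper level was given without its parent level.
            raise ValueError(f"{collection} given without its parent level")
        segments += [collection, quote(str(ident), safe="")]
    return "/".join(segments)
```

Each identifier is percent-encoded so that an MRID or directory name containing unusual characters cannot change the path structure.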

Experiment to try with Rudolph

  • Upload script is ready to be tested.
  • Query and Download script still being developed

Resources needed to run the scripts

Scripts (ready to test for bulk upload)

Instructions & Notes

  • Create a test project using xnat web gui
  • Using the "Manage" tab:
    • Choose to make the project "private"
    • Choose to place data directly into the database (not prearchive)
    • Choose to skip the quarantine
  • Download the XnatRESTClient
  • Download and install Tcl
  • Download the scripts and unzip
  • Edit the XNATglobals.tcl file to customize
  • run ./PGBulkUp.tcl -d rootDataDirectory

Results of this script's upload are on central.xnat.org under the test project "PienaarGrantRetrospectiveTest". This is a private project because it contains protected medical information.

Rudolph:

  • once you make an account on central, send me your login and I'll add you to the list of users who can access the data.
  • Then, click thru and let me know if the way data is organized appropriately matches your data.
  • I've only uploaded the data for mrid=000000, and only one mrsession was in the test set -- so only one MRSession has scan data in it:
    • GENESIS_SIGNA-000000000000234-20041122-211850
  • if this doesn't work, let's meet and figure out what changes should be made!

STATUS

  • 9/29/09 - Scripts for bulk uploading made available for testing.
    • Right now these scripts will upload entire retrospective study
    • They do not parse DICOM headers and add header information to XNAT's "fileData"
    • They do not anonymize

This seems like a reasonable first step, since Rudolph has his own scripts to anonymize but did not necessarily want his data anonymized in the repository.

continuing work

  • Wendy is working on a pydicom script to optionally anonymize from this script
  • and also working on a pydicom script to parse headers and add "fileData" on upload
  • also working on extending the XNATquery.tcl tool to perform flexible query/parse.