[REQUEST] dataloader accept information of 'selected fields' #236

Open
Tracked by #297
thfai2000 opened this issue Mar 1, 2020 · 6 comments

thfai2000 commented Mar 1, 2020

What problem am I trying to solve?

DataLoader currently solves the problem of a collection of data leading to multiple database accesses. But batchLoadFn may query unnecessary data from the database in order to suit all kinds of resolvers.

Here is my example:
There is a GraphQL type called Car which represents my database table cars.

sql database

create table cars(
 id INT,
 brand varchar(255)
)

create table car_trade_records(
 id INT,
 carId INT,
 owner varchar(255),
 logDatetime timestamp
 )

create table car_accident_records(
 id INT,
 carId INT,
 location varchar(255),
 logDatetime timestamp
)

car.graphql

type Car {
  id: Int,
  brand: String,
  lastOwner: String,
  lastAccidentLocation: String
}

type Query {
   cars(ids: [Int]): [Car]
}

resolvers.js

let resolvers = {
  Car :{
    lastOwner: async (car, args, context, info) => {
      let carRecord = await carLoader.load(car.id)
      return carRecord.lastOwner
    },
    lastAccidentLocation: async (car, args, context, info) => {
      let carRecord = await carLoader.load(car.id)
      return carRecord.lastAccidentLocation
    }
  },
  Query: {
    cars: async (car, args, context, info) => {
      return await carLoader.loadMany(args.ids)
    }
  }
}

the dataloader

// native sql approach
const carLoader = new DataLoader(async keys => {
  // note: a real batchLoadFn must return results in the same order as `keys`
  return mysql.query(`SELECT *,
    (SELECT owner FROM car_trade_records trade
      WHERE trade.carId = cars.id ORDER BY logDatetime DESC LIMIT 1) AS lastOwner,
    (SELECT location FROM car_accident_records acc
      WHERE acc.carId = cars.id ORDER BY logDatetime DESC LIMIT 1) AS lastAccidentLocation
    FROM cars WHERE id IN (?)`, [keys])
})

// ORM approach (sails.js example). An ORM can usually fetch a related table's data easily.
const carLoader = new DataLoader(async keys => {
  return (await Car.find({ where: { id: keys } })
      .populate('carOwners', { sort: 'logDatetime DESC', limit: 1 })
      .populate('carAccidents', { sort: 'logDatetime DESC', limit: 1 })
    ).map(c => {
      c.lastOwner = c.carOwners[0] && c.carOwners[0].owner
      c.lastAccidentLocation = c.carAccidents[0] && c.carAccidents[0].location
      return c
    })
})

If the GraphQL query doesn't select both the lastOwner and lastAccidentLocation fields at the same time, there is unnecessary database cost, because the SQL query makes unneeded joins/subselects.
If car_trade_records and car_accident_records are huge tables, or require complex data manipulation, that unnecessary cost becomes significant.

Query examples:

# Example 1: No unnecessary cost

cars(ids: [1,2,3,4]) {
   id,
   brand,
   lastOwner,
   lastAccidentLocation
}

# Example 2: selecting unnecessary data from the table 'car_accident_record'
cars(ids: [1,2,3,4]) {
   id,
   brand,
   lastOwner
}

# Example 3: selecting unnecessary data from the table 'car_trade_record'
cars(ids: [1,2,3,4]) {
   id,
   brand,
   lastAccidentLocation
}

# Example 4: selecting unnecessary data from both two tables
cars(ids: [1,2,3,4]) {
   id,
   brand
}

Someone might think we could join the other tables on demand (in the resolvers) as shown below. But that causes additional database queries: if 4 car records are selected, 8 (4 × 2) more SQL queries are made.

// using sails.js ORM to get data from database
let resolvers = {
  Car: {
    lastOwner: async (car, args, context, info) => {
      let records = await CarTradeRecord.find({
        where: { carId: car.id },
        sort: [{ logDatetime: 'desc' }],
        limit: 1
      })
      return records[0] && records[0].owner
    },
    lastAccidentLocation: async (car, args, context, info) => {
      let records = await CarAccidentRecord.find({
        where: { carId: car.id },
        sort: [{ logDatetime: 'desc' }],
        limit: 1
      })
      return records[0] && records[0].location
    }
  },
  Query: {
    cars: async (root, args, context, info) => {
      return carLoader.loadMany(args.ids)
    }
  }
}

const carLoader = new DataLoader(async keys => {
  return Car.find({ where: { id: keys } })
})

The new feature I'd like to have

If DataLoader can accept more information about which attributes will be selected, it can reduce unnecessary cost when querying data.

Proposed changes of DataLoader functions:

load(key, fields)

  • key: A key value to load.
  • fields: (optional) An array of fields to load. Default value = []

loadMany(keys, fields)

  • keys: An array of key values to load.
  • fields: (optional) An array of fields to load. Default value = []

new DataLoader(batchLoadFn [, options])

  • batchLoadFn: function(keys, fields)
    • keys: an array of keys
    • fields: an array of the fields that are being queried.

For load() and loadMany(), a cached value is returned only if the key matches and the requested fields are a subset of the fields already cached for that key.
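The proposed cache rule could be sketched as a small subset check; the helper name here is hypothetical, not part of any existing API:

```javascript
// Hypothetical subset check for the proposed caching rule:
// a cached entry can serve a request if it was loaded with at least
// the fields that the new request asks for.
function cacheSatisfies(cachedFields, requestedFields) {
  return requestedFields.every(f => cachedFields.includes(f));
}

console.log(cacheSatisfies(["field1", "field2", "field3"], ["field2"])); // true
console.log(cacheSatisfies(["field1", "field2"], ["field4"])); // false
```

A request with no fields (the proposed default of `[]`) is satisfied by any cached entry for that key.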

Example 1

// assume that cached value is never expired. 

const loader = new DataLoader((keys, fields) => {
  console.log('keys:', keys)
  console.log('fields:', fields)
  return batchLoadImpl(keys, fields)
})

;(async () => {
  console.log('No cache')
  let a1 = loader.load(1, ['field1', 'field2'])
  let a2 = loader.load(2, ['field2', 'field3'])
  await Promise.all([a1, a2])

  console.log('All hit cache')
  let b1 = loader.load(1, ['field2'])
  let b2 = loader.load(1, ['field3'])
  let b3 = loader.load(2, ['field1'])
  let b4 = loader.load(2, ['field2'])
  let b5 = loader.load(2)
  // a value can be served from cache if the requested fields are a subset of the cached fields
  await Promise.all([b1, b2, b3, b4, b5])

  console.log('Some hit cache')
  let c1 = loader.load(1, ['field1', 'field4'])
  let c2 = loader.load(2, ['field1', 'field2', 'field3'])
  // the value with key 2 can be served from cache
  await Promise.all([c1, c2])

  console.log('Different cache even with same key')
  let d1 = loader.load(2, ['field1'])
  let d2 = loader.load(2, ['field1', 'field4'])
  console.log('d1 is from cache?', (await d1) === (await c2))
  console.log('d2 is from cache?', (await d2) === (await c2))
})()

Expected console outputs:

No cache
keys: [1, 2]
fields: ['field1', 'field2', 'field3']
All hit cache
Some hit cache
keys: [1]        
fields: ['field1', 'field4']
Different cache even with same key
keys: [2]
fields: ['field1', 'field4']
d1 is from cache? true
d2 is from cache? false

Example 2: demonstrating the real case

// native sql approach
const carLoader = new DataLoader(async (keys, fields) => {
  let subqueries = []

  if (fields.includes('lastOwner')) {
    subqueries.push(`(SELECT owner FROM car_trade_records trade
      WHERE trade.carId = cars.id ORDER BY logDatetime DESC LIMIT 1) AS lastOwner`)
  }

  if (fields.includes('lastAccidentLocation')) {
    subqueries.push(`(SELECT location FROM car_accident_records acc
      WHERE acc.carId = cars.id ORDER BY logDatetime DESC LIMIT 1) AS lastAccidentLocation`)
  }

  // avoid a trailing comma when no subqueries are needed
  let select = subqueries.length ? `*, ${subqueries.join(',')}` : '*'
  return mysql.query(`SELECT ${select} FROM cars WHERE id IN (?)`, [keys])
})

let resolvers = {
  Car: {
    lastOwner: async (car, args, context, info) => {
      // info.fieldName === 'lastOwner'
      let carRecord = await carLoader.load(car.id, [info.fieldName])
      return carRecord.lastOwner
    },
    lastAccidentLocation: async (car, args, context, info) => {
      // info.fieldName === 'lastAccidentLocation'
      let carRecord = await carLoader.load(car.id, [info.fieldName])
      return carRecord.lastAccidentLocation
    }
  },
  Query: {
    cars: async (root, args, context, info) => {
      // determine which fields of the Car type are selected in the GraphQL query
      let involvedFields = info.fieldNodes[0].selectionSet.selections.map(node => node.name.value)
      return carLoader.loadMany(args.ids, involvedFields)
    }
  }
}

If this is implemented, then no matter which fields of type Car are selected, only a single SQL query is made.

@muddlebee

@thfai2000 @leebyron I would like to contribute to this. Shall I go ahead?


thfai2000 commented Mar 6, 2020

@anweshknayak I'm glad that you looked into my problem. Thanks.
I'm fine with you contributing to this, but it would be good if @leebyron could comment before we formally start. I'm afraid my requested feature may not be useful.

Actually I've started brainstorming. Let me share my thoughts. One way is to extend DataLoader, though it may not be efficient performance-wise. I don't know.

Below is my draft (written in a rush). I've never run or tested it. XD

class DataWithFieldsLoader extends DataLoader {

  // My terminology:
  // fieldSet = an array of fields, e.g. ['field1', 'field2', 'field3']
  // info = an object {key, fields} where fields is a fieldSet,
  // e.g. {key: 'key1', fields: ['field1', 'field2']}
  // info contains information about which fields are being selected from the database

  constructor(batchLoadFn, options = {}) {
    let originalCacheKeyFn = options.cacheKeyFn || (key => key)
    let actualBatchLoadFn = (infoArray) => {
      // spread infoArray and de-duplicate fields,
      // producing an array of unique field names
      let { keys, fields } = destructInfoArray(infoArray)
      return batchLoadFn(keys, fields)
    }
    let actualCacheKeyFn = info => `${originalCacheKeyFn(info.key)}#${info.fields.join(',')}`

    super(actualBatchLoadFn, {
      ...options,
      cacheKeyFn: actualCacheKeyFn
    })
    // we need a map that records which fields a cached value contains
    this.fieldMap = new Map()
    this.originalCacheKeyFn = originalCacheKeyFn
  }

  async load(key, fields) {
    let [value] = await this.loadMany([key], fields)
    return value
  }

  async loadMany(keys, fields) {
    let infoArray = constructInfoArray(keys, fields)

    infoArray = infoArray.map(info => {

      let fieldSetArray = this.fieldMap.get(this.originalCacheKeyFn(info.key))
      // e.g. fieldSetArray = [ ['field1', 'field3'], ['field1', 'field2'], ['field3', 'field4'] ]
      // it is the history of field retrievals for the object with that key.
      // We have to find the first fieldSet that
      // fulfils the requirement stated in the info object
      let foundFieldSet = null
      if (fieldSetArray) {
        foundFieldSet = fieldSetArray.find(set => includesAllFields(set, info.fields))
      }
      return {
        ...info,
        fields: foundFieldSet || info.fields
      }
    })

    let values = await super.loadMany(infoArray)
    // record the latest fieldSet into fieldMap
    // assumption: values.length === infoArray.length
    values.forEach((value, idx) => {
      let cacheKey = this.originalCacheKeyFn(infoArray[idx].key)
      let fieldSetArray = this.fieldMap.get(cacheKey)
      if (!fieldSetArray) {
        fieldSetArray = []
        this.fieldMap.set(cacheKey, fieldSetArray)
      }
      // TODO: don't push the fieldSet if the same one already exists
      fieldSetArray.push(infoArray[idx].fields)
    })
    return values
  }
}

function constructInfoArray(keys, fields) {
  // pair each key with the requested fields
  // returns [{key, fields}, {key, fields}, ...]
  return keys.map(key => ({ key, fields }))
}

function destructInfoArray(infoArray) {
  // collect the keys and the union of all requested fields
  // returns {keys, fields}
  let keys = infoArray.map(info => info.key)
  let fields = [...new Set(infoArray.flatMap(info => info.fields))]
  return { keys, fields }
}

function includesAllFields(targetFieldSet, searchFields) {
  // true if every field in searchFields is present in targetFieldSet
  return searchFields.every(f => targetFieldSet.includes(f))
}

@thfai2000

DataLoader is popular and widely used by many developers, so a big change to this de facto standard needs careful consideration. I think it may be a good idea to build a POC first and test it on some projects (e.g. my private web system) to measure the actual benefit. If the benefit is small but the change is big, it isn't worth it.

I created a repository for this. Please feel free to edit or use.
https://github.com/thfai2000/ext-dataloader

@sheepsteak

I don't think there is anything stopping you from doing this with the current API. You don't just have to pass IDs to the DataLoader; it can be anything you want. So you could for example have this:

carLoader.loadMany([
  { id: 1, fields: ["field1", "field2"] },
  { id: 2, fields: ["field1"] },
])

In your DataLoader you can map through the keys and merge any for the same IDs before retrieving. DataLoader takes a cacheKeyFn and the docs for that say:

Produces cache key for a given load key. Useful when objects are keys and two objects should be considered equivalent.

So I think you have all the parts necessary to do this now. It just requires a bit more work than a normal DataLoader that just does carLoader.load(1). 😄
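Following this suggestion, a minimal sketch of a cacheKeyFn for object keys of the shape `{ id, fields }` might look like the following (the function name and key shape are illustrative, not a DataLoader API):

```javascript
// Hypothetical cacheKeyFn for load keys of the shape { id, fields }.
// Two load keys map to the same cache entry when they have the same id
// and the same set of fields (order-insensitive, duplicates ignored).
function cacheKeyFn({ id, fields = [] }) {
  const sortedFields = [...new Set(fields)].sort().join(",");
  return `${id}#${sortedFields}`;
}

console.log(cacheKeyFn({ id: 1, fields: ["field2", "field1"] })); // "1#field1,field2"
console.log(cacheKeyFn({ id: 1, fields: ["field1", "field2"] })); // "1#field1,field2"
```

Passed as `new DataLoader(batchFn, { cacheKeyFn })`, this makes DataLoader treat equivalent `{ id, fields }` objects as the same key; the batch function still has to merge field sets for duplicate ids itself.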


a7ul commented Apr 5, 2021

For those looking for a solution:
dataLoader.ts

import DataLoader from "dataloader";
import { groupBy } from "lodash";
/**
 * -----------------
 * createFieldLoader
 * -----------------
 */
type KeyType = string | number;


type KeyAndFields<K, F> = { key: K; fields?: F[] };

function createFullHash(k: string, fieldHash: string) {
  return `${k}:${fieldHash}`;
}
function addHashes<K, F>(keyAndFields: KeyAndFields<K, F>) {
  const fieldSet = new Set(keyAndFields.fields);
  const fieldHash = "f:" + Array.from(fieldSet).sort().join(":");
  const fullHash = createFullHash(`${keyAndFields.key}`, fieldHash);
  return { ...keyAndFields, fieldHash, fullHash };
}

/**
 * A DataLoader that batches loads that share the same set of fields.
 * Requires sortIdKey so returned values can be matched back to the requested keys.
 */
export function createFieldLoader<
  V,
  F extends keyof V,
  K extends KeyType = string
>(batchLoadFn: (keys: K[], fields?: F[]) => Promise<V[]>, sortIdKey: keyof V) {
  type LoadFnKey = { key: K; fields?: F[] };
  return new DataLoader(
    async (loadKeys: LoadFnKey[]) => {
      const loadKeysWithHash = loadKeys.map(k => addHashes(k));
      const store = new Map<string, V>();

      //1. Form batches to load at once based on fieldHash. all ids with same field hash should be loaded at the same time.
      const fieldHashGroups = groupBy(loadKeysWithHash, "fieldHash");
      await Promise.all(
        Object.entries(fieldHashGroups).map(async ([fieldHash, keysToLoad]) => {
          //2. In each batch, load values based on the id and fields. Make sure the id field is present here.
          const ids = keysToLoad.map(k => k.key);
          let fields = keysToLoad[0]?.fields;
          if (fields?.length) {
            fields = Array.from(new Set([...fields, sortIdKey as F])); // Make sure we add the id field here
          }
          const values = await batchLoadFn(ids, fields);
          //3. Create fullHash again for each returned value and store it in the store.
          values.forEach(v => {
            const fullHash = createFullHash(`${v[sortIdKey]}`, fieldHash);
            store.set(fullHash, v);
          });
        })
      );
      //4. Get values based on the fullHash in the same order as it was requested.
      return loadKeysWithHash.map(lk => store.get(lk.fullHash));
    },
    { cacheKeyFn: kf => addHashes(kf).fullHash }
  );
}

Test cases:

dataLoader.test.ts

import { createFieldLoader } from "./dataLoader";

type Model = {
  id: string;
  title: string;
  eid: string;
  description: string;
};

const A: Model = { id: "a", title: "A", eid: "1", description: "A-Model" };
const B: Model = { id: "b", title: "B", eid: "2", description: "B-Model" };
const C: Model = { id: "c", title: "C", eid: "3", description: "C-Model" };
const D: Model = { id: "d", title: "D", eid: "4", description: "D-Model" };

class MockDataModel {
  static items = [D, B, A, C];
  static async getByIds(ids: string[]) {
    return MockDataModel.items.filter(item => ids.includes(item.id));
  }
  static async getByIdsWithFields(
    ids: Model["id"][],
    fields?: (keyof Model)[]
  ) {
    const items = await MockDataModel.getByIds(ids);
    if (!fields?.length) {
      return items;
    }
    return items.map(item => {
      const newItem: Partial<Model> = {};
      fields.forEach(f => (newItem[f] = item[f]));
      return newItem;
    });
  }
  static async getByExtenalIds(eids: string[]) {
    return MockDataModel.items.filter(item => eids.includes(item.eid));
  }
  static async getByExtenalIdsWithFields(
    eids: string[],
    fields?: (keyof Model)[]
  ) {
    const items = await MockDataModel.getByExtenalIds(eids);
    if (!fields?.length) {
      return items;
    }
    return items.map(item => {
      const newItem: Partial<Model> = {};
      fields.forEach(f => (newItem[f] = item[f]));
      return newItem;
    });
  }
}

describe("createFieldLoader", () => {
  const fieldLoadFn = MockDataModel.getByIdsWithFields;

  it("should sort the items in the order asked for", async () => {
    const loader = createFieldLoader(fieldLoadFn, "id");

    const [a, b, c] = await Promise.all([
      loader.load({ key: A.id }),
      loader.load({ key: B.id }),
      loader.load({ key: C.id })
    ]);

    expect(a).toEqual(A);
    expect(b).toEqual(B);
    expect(c).toEqual(C);
  });

  it("should only load duplicate items once", async () => {
    const spyLoadFn = jest.fn(fieldLoadFn);
    const loader = createFieldLoader(spyLoadFn, "id");

    await Promise.all([
      loader.load({ key: A.id }),
      loader.load({ key: B.id }),
      loader.load({ key: B.id }),
      loader.load({ key: B.id }),
      loader.load({ key: B.id })
    ]);

    expect(spyLoadFn).toBeCalledWith(["a", "b"], undefined);
  });

  it("should return undefined for the items not existing", async () => {
    const loader = createFieldLoader(fieldLoadFn, "id");

    const [a, y, b, x] = await Promise.all([
      loader.load({ key: A.id }),
      loader.load({ key: "y" }),
      loader.load({ key: B.id }),
      loader.load({ key: "x" })
    ]);

    expect(a).toEqual(A);
    expect(b).toEqual(B);
    expect(x).toBeUndefined();
    expect(y).toBeUndefined();
  });

  it("should only load with only the fields requested", async () => {
    const loader = createFieldLoader(fieldLoadFn, "id");

    const [a1, a2, a3, a4, b] = await Promise.all([
      loader.load({ key: A.id, fields: ["id", "eid"] }),
      loader.load({ key: A.id, fields: ["id", "eid"] }),
      loader.load({ key: A.id, fields: ["description"] }),
      loader.load({ key: A.id }),
      loader.load({ key: B.id })
    ]);

    // should load everything for B since we didn't request any specific fields for it.
    expect(b).toEqual(B);

    expect(a1).toEqual({ id: A.id, eid: A.eid });
    expect(a2).toEqual({ id: A.id, eid: A.eid });

    expect(a3).toEqual({ id: A.id, description: A.description });

    // should load everything for a4 since we didn't request any specific fields for it.
    expect(a4).toEqual(A);
  });

  it("should work for other ids too in the data model", async () => {
    const loader = createFieldLoader(
      MockDataModel.getByExtenalIdsWithFields,
      "eid"
    );

    const [a1, a2, a3, a4, b] = await Promise.all([
      loader.load({ key: A.eid, fields: ["id", "eid"] }),
      loader.load({ key: A.eid, fields: ["id", "eid"] }),
      loader.load({ key: A.eid, fields: ["id", "description"] }),
      loader.load({ key: A.eid }),
      loader.load({ key: B.eid })
    ]);

    // should load everything for B since we didn't request any specific fields for it.
    expect(b).toEqual(B);

    expect(a1).toEqual({ id: A.id, eid: A.eid });
    expect(a2).toEqual({ id: A.id, eid: A.eid });

    expect(a3).toEqual({ id: A.id, description: A.description, eid: A.eid });

    // should load everything for a4 since we didn't request any specific fields for it.
    expect(a4).toEqual(A);
  });
});

@nicoabie

nicoabie commented May 7, 2023

I don't think there is anything stopping you from doing this with the current API. You don't just have to pass IDs to the DataLoader; it can be anything you want. So you could for example have this:

carLoader.loadMany([
  { id: 1, fields: ["field1", "field2"] },
  { id: 2, fields: ["field1"] },
])

In your DataLoader you can map through the keys and merge any for the same IDs before retrieving. DataLoader takes a cacheKeyFn and the docs for that say:

Produces cache key for a given load key. Useful when objects are keys and two objects should be considered equivalent.

So I think you have all the parts necessary to do this now. It just requires a bit more work than a normal DataLoader that just does carLoader.load(1). 😄

I believe that would dedupe requests: if I ask for id 1 with field1 and id 1 with field2, the DataLoader would query id 1 with both field1 and field2, making one request instead of two. But then you need another step to hand back only the corresponding fields to each call.
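That extra step can be sketched with two small helpers (hypothetical names, assuming load keys of the shape `{ id, fields }`): merge the requested field sets per id before loading, then project each loaded record back down to the fields each caller asked for:

```javascript
// Hypothetical helpers for the merge-then-project step.

// Merge all requested field sets per id: [{id, fields}] -> Map<id, Set<field>>
function mergeFieldsById(loadKeys) {
  const merged = new Map();
  for (const { id, fields = [] } of loadKeys) {
    if (!merged.has(id)) merged.set(id, new Set());
    for (const f of fields) merged.get(id).add(f);
  }
  return merged;
}

// Project a loaded record down to only the fields one caller asked for.
function project(record, fields) {
  const out = {};
  for (const f of fields) out[f] = record[f];
  return out;
}

const merged = mergeFieldsById([
  { id: 1, fields: ["field1"] },
  { id: 1, fields: ["field2"] },
]);
console.log([...merged.get(1)]); // ["field1", "field2"]
console.log(project({ field1: "a", field2: "b", field3: "c" }, ["field1"])); // { field1: "a" }
```

The batch function would query each id with its merged field set, and each resolved promise would receive only the projection it originally requested.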
